Introduction to SQL for Analytics
2. Joins, CTEs, and Aggregations
5. Building Views in Azure PostgreSQL
Now that you’ve designed your architecture on paper, it’s time to build it. In Azure PostgreSQL, we use Views to turn our messy raw data into the structured Star Schema we discussed in the previous chapter.
A View is a virtual table. It doesn't store data itself; instead, it stores a SELECT statement. When you query a view, PostgreSQL runs that underlying query in the background.
Basic Syntax:
CREATE VIEW schema_name.view_name AS
SELECT column1, column2
FROM table_name;
Now let’s look at how to best create views!
In the Medallion architecture, the Staging layer is where we fix the issues with the raw data. Raw source systems often have terrible naming conventions or incorrect data types.
Let’s create stg_taxi_trips to clean up the raw input:
CREATE VIEW staging.stg_taxi_trips AS
SELECT
-- 1. Renaming for clarity
trip_distance AS trip_distance_km,
-- 2. Casting data types (String to Timestamp/Numeric)
CAST(tpep_pickup_datetime AS TIMESTAMP) AS pickup_at,
fare_amount::NUMERIC AS fare_amount,
-- 3. Basic cleaning
COALESCE(passenger_count, 0) AS passenger_count
FROM raw.taxi_trips;
vw_dim_zones)Our raw zone data might be a simple list, but in our Marts layer, we want it to be a clean lookup table for our reports.
CREATE VIEW marts.vw_dim_zones AS
SELECT
locationid AS location_id,
zone AS zone_name,
borough,
-- Adding a business flag
CASE WHEN borough = 'Unknown' THEN 'N/A' ELSE borough END AS borough_group
FROM staging.stg_zones;
vw_fact_trips)The Fact table selects the quantitative measures we need for analysis. Its grain is one row = one individual taxi trip.
CREATE VIEW marts.vw_fact_trips AS
SELECT
-- Keys for joining to dimensions
pickup_location_id,
dropoff_location_id,
pickup_at,
-- Measures (Quantitative data)
trip_distance,
fare_amount,
tip_amount,
fare_amount + tip_amount AS total_revenue
FROM staging.stg_taxi_trips
WHERE fare_amount > 0; -- Enforcing business rules at the Fact level
By using this view-based approach:
total_revenue should also include taxes, you update one line of code in the vw_fact_trips view. Instantly, every dashboard connected to it is updated.You’ve built the engine, but before you hand the keys to a Business Analyst, you need to make sure the wheels won't fall off. In data engineering, "it runs without errors" does not mean "the data is right."
Since views are "live," any weirdness in your raw data will flow straight through to your charts. You need to run these three "Sanity Checks" manually before declaring a view finished.
The Row Count Match: Your stg_taxi_trips should usually have the exact same number of rows as your raw table (unless you explicitly filtered some out). If these numbers don't match your expectations, you have a leak!
SELECT count() FROM raw.nyc_taxi_trips; SELECT count() FROM staging.stg_taxi_trips;The Null Trap: Check if your renaming or casting accidentally turned data into NULL.
SELECT count(*) FROM marts.vw_fact_trips WHERE pickup_location_id IS NULL;
Join Validation: This is the big one. Ensure every trip in your Fact view finds a match in your Dimension view. If a trip has a location_id that doesn't exist in your zones table, that trip might vanish from your final report!
🚀 Fun Fact: In 1999, NASA lost a $125 million Mars orbiter because one engineering team used metric units and another used imperial. The proof that Data validation isn't just about "neatness" 😃
Six months from now, you won't remember why you filtered out trips with a fare_amount of zero. Documentation is a gift to your future self.
In-Code Comments: Use -- for simple notes.
PostgreSQL Comments: You can actually attach descriptions directly to the database objects so they show up in tools like Azure Data Studio.
COMMENT ON VIEW marts.vw_fact_trips IS 'The primary fact table for taxi revenue. One row per completed trip.';
The Data Dictionary: This is a simple document (or a markdown file in your repo) that explains:
vw_fact_tripstotal_revenue (Sum of fare + tips).The whole reason we built this Star Schema with Views is to make hard questions feel easy. Let’s see it in action.
The Question: "Which borough generated the most tip revenue?"
SELECT
dim.borough,
SUM(fact.tip_amount) AS total_tips,
COUNT(*) AS trip_count
FROM marts.vw_fact_trips AS fact
JOIN marts.vw_dim_zones AS dim
ON fact.pickup_location_id = dim.location_id
GROUP BY 1
ORDER BY total_tips DESC;
Notice how clean that query is? You didn't have to worry about cleaning dates or filtering out bad data, the Views did that work for you. You just focused on the business logic.
💡 Manhattan usually dominates NYC taxi data, but if you look at "average tip percentage" instead of "total tips" you might find smaller boroughs or specific airport runs actually have the most generous passengers!
raw.nyc_taxi_trips table, do you need to manually refresh the marts.vw_fact_trips view to see the change?marts.vw_fact_trips view starts taking 30 seconds to load because it's doing too many calculations, what PostgreSQL feature should you investigate to speed it up?COMMENT ON VIEW better than just a comment in your .sql file?hour_of_day column?The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.