Building Views in Azure PostgreSQL

Now that you’ve designed your architecture on paper, it’s time to build it. In Azure PostgreSQL, we use Views to turn our messy raw data into the structured Star Schema we discussed in the previous chapter.

A View is a virtual table. It doesn't store data itself; instead, it stores a SELECT statement. When you query a view, PostgreSQL runs that underlying query in the background.

Basic Syntax:

CREATE VIEW schema_name.view_name AS
SELECT column1, column2
FROM table_name;

When to use a View vs. a Table?

Use a Table when you are loading data from an external source (Raw layer) or when a query is so computationally heavy that it needs to be "frozen" in time for performance.
Use a View for logic. If you are renaming columns, casting data types, or joining tables, a view is your best friend because it ensures you aren't duplicating data and wasting Azure storage costs.

Now let’s look at how to best create views!

Step 1: Building the Staging View

In the Medallion architecture, the Staging layer is where we fix the issues with the raw data. Raw source systems often have terrible naming conventions or incorrect data types.

Let’s create stg_taxi_trips to clean up the raw input:

CREATE VIEW staging.stg_taxi_trips AS
SELECT 
    -- 1. Renaming for clarity 
    trip_distance AS trip_distance_km,
    
    -- 2. Casting data types (String to Timestamp/Numeric)
    CAST(tpep_pickup_datetime AS TIMESTAMP) AS pickup_at,
    fare_amount::NUMERIC AS fare_amount,
    
    -- 3. Basic cleaning
    COALESCE(passenger_count, 0) AS passenger_count
FROM raw.taxi_trips;

Step 2: Building the Dimension View (`vw_dim_zones`)

Our raw zone data might be a simple list, but in our Marts layer, we want it to be a clean lookup table for our reports.

CREATE VIEW marts.vw_dim_zones AS
SELECT 
    locationid AS location_id,
    zone AS zone_name,
    borough,
    -- Adding a business flag
    CASE WHEN borough = 'Unknown' THEN 'N/A' ELSE borough END AS borough_group
FROM staging.stg_zones;

Step 3: Building the Fact View (`vw_fact_trips`)

The Fact table selects the quantitative measures we need for analysis. Its grain is one row = one individual taxi trip.

CREATE VIEW marts.vw_fact_trips AS
SELECT 
    -- Keys for joining to dimensions
    pickup_location_id,
    dropoff_location_id,
    pickup_at,
    
    -- Measures (Quantitative data)
    trip_distance,
    fare_amount,
    tip_amount,
    fare_amount + tip_amount AS total_revenue
FROM staging.stg_taxi_trips
WHERE fare_amount > 0; -- Enforcing business rules at the Fact level

Why this matters for Azure PostgreSQL

By using this view-based approach:

Storage Efficiency: You aren't paying for extra gigabytes of storage for your "Staging" and "Marts" layers.
Agility: If the business decides that total_revenue should also include taxes, you update one line of code in the vw_fact_trips view. Instantly, every dashboard connected to it is updated.

Testing your views

You’ve built the engine, but before you hand the keys to a Business Analyst, you need to make sure the wheels won't fall off. In data engineering, "it runs without errors" does not mean "the data is right."

1. Stress-Testing Your Views

Since views are "live," any weirdness in your raw data will flow straight through to your charts. You need to run these three "Sanity Checks" manually before declaring a view finished.

The Row Count Match: Your stg_taxi_trips should usually have the exact same number of rows as your raw table (unless you explicitly filtered some out). If these numbers don't match your expectations, you have a leak!
- SELECT count() FROM raw.nyc_taxi_trips; SELECT count() FROM staging.stg_taxi_trips;
The Null Trap: Check if your renaming or casting accidentally turned data into NULL.

SELECT count(*) FROM marts.vw_fact_trips WHERE pickup_location_id IS NULL;
Join Validation: This is the big one. Ensure every trip in your Fact view finds a match in your Dimension view. If a trip has a location_id that doesn't exist in your zones table, that trip might vanish from your final report!

🚀 Fun Fact: In 1999, NASA lost a $125 million Mars orbiter because one engineering team used metric units and another used imperial. The proof that Data validation isn't just about "neatness" 😃

2. Documentation: Leave a Trail of Breadcrumbs

Six months from now, you won't remember why you filtered out trips with a fare_amount of zero. Documentation is a gift to your future self.

In-Code Comments: Use -- for simple notes.

PostgreSQL Comments: You can actually attach descriptions directly to the database objects so they show up in tools like Azure Data Studio.

COMMENT ON VIEW marts.vw_fact_trips IS 'The primary fact table for taxi revenue. One row per completed trip.';

The Data Dictionary: This is a simple document (or a markdown file in your repo) that explains:

Table Name: vw_fact_trips
Grain: One row per taxi trip.
Columns: total_revenue (Sum of fare + tips).

3. Putting it All Together: Answering Business Questions

The whole reason we built this Star Schema with Views is to make hard questions feel easy. Let’s see it in action.

The Question: "Which borough generated the most tip revenue?"

SELECT 
    dim.borough,
    SUM(fact.tip_amount) AS total_tips,
    COUNT(*) AS trip_count
FROM marts.vw_fact_trips AS fact
JOIN marts.vw_dim_zones AS dim 
    ON fact.pickup_location_id = dim.location_id
GROUP BY 1
ORDER BY total_tips DESC;

Notice how clean that query is? You didn't have to worry about cleaning dates or filtering out bad data, the Views did that work for you. You just focused on the business logic.

💡 Manhattan usually dominates NYC taxi data, but if you look at "average tip percentage" instead of "total tips" you might find smaller boroughs or specific airport runs actually have the most generous passengers!

🧠 Knowledge Check

If you update a row in your raw.nyc_taxi_trips table, do you need to manually refresh the marts.vw_fact_trips view to see the change?
If your marts.vw_fact_trips view starts taking 30 seconds to load because it's doing too many calculations, what PostgreSQL feature should you investigate to speed it up?
How can views help you hide sensitive financial data from a junior analyst who only needs to see trip distances?
You run a join between your Fact view and Dimension view, and the total row count decreases. What happened?
Why is a COMMENT ON VIEW better than just a comment in your .sql file?
If you wanted to find the "Busiest Pickup Hour," which view would you need to edit to add an hour_of_day column?
A manager asks, "Why can't I just query the Raw table?" Give them one reason based on Consistency and one based on Performance.

📖 Extra Reading

PostgreSQL Aggregate Functions: A list of all the math you can do (SUM, AVG, MIN, MAX) once your views are set up.
**The five key steps for building a robust data warehouse:** common mistakes to avoid!
Data Quality vs. Data Integrity: Why having "clean" data isn't the same as having "accurate" data.