Week 9 - SQL for Analytics

Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. Data Modeling Concepts

5. Building Views in Azure PostgreSQL

6. Gotchas & Pitfalls

7. Practice

8. Assignment

Lesson Plan

Building Views in Azure PostgreSQL

Now that you’ve designed your architecture on paper, it’s time to build it. In Azure PostgreSQL, we use Views to turn our messy raw data into the structured Star Schema we discussed in the previous chapter.

A View is a virtual table. It doesn't store data itself; instead, it stores a SELECT statement. When you query a view, PostgreSQL runs that underlying query in the background.

Basic Syntax:

CREATE VIEW schema_name.view_name AS
SELECT column1, column2
FROM table_name;

When to use a View vs. a Table?


Now let’s look at how to best create views!

Step 1: Building the Staging View

In the Medallion architecture, the Staging layer is where we fix the issues with the raw data. Raw source systems often have terrible naming conventions or incorrect data types.

Let’s create stg_taxi_trips to clean up the raw input:

CREATE VIEW staging.stg_taxi_trips AS
SELECT 
    -- 1. Renaming for clarity 
    trip_distance AS trip_distance_km,
    
    -- 2. Casting data types (String to Timestamp/Numeric)
    CAST(tpep_pickup_datetime AS TIMESTAMP) AS pickup_at,
    fare_amount::NUMERIC AS fare_amount,
    
    -- 3. Basic cleaning
    COALESCE(passenger_count, 0) AS passenger_count
FROM raw.taxi_trips;

Step 2: Building the Dimension View (vw_dim_zones)

Our raw zone data might be a simple list, but in our Marts layer, we want it to be a clean lookup table for our reports.

CREATE VIEW marts.vw_dim_zones AS
SELECT 
    locationid AS location_id,
    zone AS zone_name,
    borough,
    -- Adding a business flag
    CASE WHEN borough = 'Unknown' THEN 'N/A' ELSE borough END AS borough_group
FROM staging.stg_zones;

Step 3: Building the Fact View (vw_fact_trips)

The Fact table selects the quantitative measures we need for analysis. Its grain is one row = one individual taxi trip.

CREATE VIEW marts.vw_fact_trips AS
SELECT 
    -- Keys for joining to dimensions
    pickup_location_id,
    dropoff_location_id,
    pickup_at,
    
    -- Measures (Quantitative data)
    trip_distance,
    fare_amount,
    tip_amount,
    fare_amount + tip_amount AS total_revenue
FROM staging.stg_taxi_trips
WHERE fare_amount > 0; -- Enforcing business rules at the Fact level

Why this matters for Azure PostgreSQL

By using this view-based approach:

  1. Storage Efficiency: You aren't paying for extra gigabytes of storage for your "Staging" and "Marts" layers.
  2. Agility: If the business decides that total_revenue should also include taxes, you update one line of code in the vw_fact_trips view. Instantly, every dashboard connected to it is updated.

Testing your views

You’ve built the engine, but before you hand the keys to a Business Analyst, you need to make sure the wheels won't fall off. In data engineering, "it runs without errors" does not mean "the data is right."

1. Stress-Testing Your Views

Since views are "live," any weirdness in your raw data will flow straight through to your charts. You need to run these three "Sanity Checks" manually before declaring a view finished.

🚀 Fun Fact: In 1999, NASA lost a $125 million Mars orbiter because one engineering team used metric units and another used imperial. The proof that Data validation isn't just about "neatness" 😃


2. Documentation: Leave a Trail of Breadcrumbs

Six months from now, you won't remember why you filtered out trips with a fare_amount of zero. Documentation is a gift to your future self.

In-Code Comments: Use -- for simple notes.

PostgreSQL Comments: You can actually attach descriptions directly to the database objects so they show up in tools like Azure Data Studio.

COMMENT ON VIEW marts.vw_fact_trips IS 'The primary fact table for taxi revenue. One row per completed trip.';

The Data Dictionary: This is a simple document (or a markdown file in your repo) that explains:


3. Putting it All Together: Answering Business Questions

The whole reason we built this Star Schema with Views is to make hard questions feel easy. Let’s see it in action.

The Question: "Which borough generated the most tip revenue?"

SELECT 
    dim.borough,
    SUM(fact.tip_amount) AS total_tips,
    COUNT(*) AS trip_count
FROM marts.vw_fact_trips AS fact
JOIN marts.vw_dim_zones AS dim 
    ON fact.pickup_location_id = dim.location_id
GROUP BY 1
ORDER BY total_tips DESC;

Notice how clean that query is? You didn't have to worry about cleaning dates or filtering out bad data, the Views did that work for you. You just focused on the business logic.

💡 Manhattan usually dominates NYC taxi data, but if you look at "average tip percentage" instead of "total tips" you might find smaller boroughs or specific airport runs actually have the most generous passengers!


🧠 Knowledge Check

  1. If you update a row in your raw.nyc_taxi_trips table, do you need to manually refresh the marts.vw_fact_trips view to see the change?
  2. If your marts.vw_fact_trips view starts taking 30 seconds to load because it's doing too many calculations, what PostgreSQL feature should you investigate to speed it up?
  3. How can views help you hide sensitive financial data from a junior analyst who only needs to see trip distances?
  4. You run a join between your Fact view and Dimension view, and the total row count decreases. What happened?
  5. Why is a COMMENT ON VIEW better than just a comment in your .sql file?
  6. If you wanted to find the "Busiest Pickup Hour," which view would you need to edit to add an hour_of_day column?
  7. A manager asks, "Why can't I just query the Raw table?" Give them one reason based on Consistency and one based on Performance.

📖 Extra Reading

Suggested Topics


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.