Week 9 - SQL for Analytics

Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. OLAP vs OLTP and Modern Warehouses

4. Data Modeling Concepts

5. Building Views in Azure PostgreSQL

6. Gotchas & Pitfalls

7. Practice

8. Assignment

Lesson Plan

Joins, CTEs, and Aggregations

Welcome to the "heavy lifting" part of the week. In Week 6, you learned how to use Python to shove data into PostgreSQL using INSERT and how to peek at it using SELECT *.

But in a real data pipeline, the raw data is usually a mess. It's fragmented across different tables and full of codes that don't make sense to humans. Today, we move beyond just "reading" data and start "transforming" it using the three pillars of analytical SQL: Joins, Aggregations and CTEs.

Joins: Connecting the Dots

In Week 6, you created a single weather_readings table. In our NYC Taxi dataset, life is more complex. We have Trips (the events) and Zones (the labels).

If you look at the trips table, you’ll see pickuplocationid (Pickup) and dropofflocationid (Dropoff). These are just numbers, like 43 or 132. To turn "43" into "Central Park," we need a JOIN.

A JOIN allows us to combine multiple tables into a single one, combining the elements of each table together based on common identifiers. These common identifiers should be IDs that allow us to tell which row of Table A corresponds to each row of Table B.

We have different types of JOIN depending on what we want to keep from our two original tables:

The type of JOIN determines what we want to keep from each dataset ragarding the rows that did not have a match in A and B. In our dataset, this would be the output:

Now let’s look at how we can translate this to SQL syntax!

Let’s say we are asked to get the latest 10 trips with pickup location in Manhattan, including pickup time, trip distance and fare amount. This is how we can quickly get this info:

SELECT 
    z.borough,
    z.zone AS pickup_zone,
    t.pickup_datetime,
    t.trip_distance,
    t.fare_amount
FROM raw_trips t
INNER JOIN raw_zones z 
    ON t.pickup_location_id = z.location_id
WHERE z.borough = 'Manhattan' -- This command filters the pickup location to Manhattan
ORDER BY pickup_datetime DESC -- As we want to retrieve the latest trips, we sort by date descending
LIMIT 10;

💡 Do you start seeing the benefit of joining multiple tables instead of storing everything in one? If zone names were repeated in the trips table, we’d introduce a lot of redundancy. Let’s keep in mind that strings are the heaviest data types to store and process, hence using short numerical IDs is much more efficient.

Aggregations: The "Squash"

Now let’s see how we turn 57,000 rows (or millions of rows in real work life!) into just a few rows of pure insight. Analysts don't want to see every single taxi ride; they often want to see the Average Fare or Total Trips per day.

To do this, we use Aggregate Functions (SUM, AVG, COUNT, MIN, MAX) combined with a GROUP BY clause. Anything that follows the GROUP BY will be what defines how detailed our aggregation will be.

<aside> ⚠️ The Golden Rule of Aggregation: Every column in your SELECT statement must either be wrapped in an aggregate function (like SUM()) OR be listed in the GROUP BY clause. If you forget this, Postgres will throw an error that essentially says: "I don't know what to do with the extra columns!"

</aside>

Let’s look at how we can add multiple aggregations in a single query:

SELECT 
    z.borough,
    COUNT(*) AS total_trips,
    AVG(t.fare_amount) AS avg_fare,
    SUM(t.trip_distance) AS total_distance
FROM raw_trips t
INNER JOIN raw_zones z 
    ON t.pickup_location_id = z.location_id
GROUP BY z.borough
ORDER BY total_trips DESC;

You can see how in a single query we got the total number of trips per borough, their average fare and the total distance!

WHERE vs. HAVING: The Filter Order

As you start aggregating data, you'll eventually want to filter the results (e.g., "Show me only the boroughs with more than 1,000 trips").

In SQL, the order in which the database executes your command matters.

Think of it like this: If you are organizing a party, WHERE is the guest list (who gets in), and HAVING is the rule for which tables stay (only tables with more than 4 people).

SELECT 
    z.borough,
    COUNT(*) AS trip_count
FROM raw_trips t
JOIN raw_zones z 
	ON t.pickup_location_id = z.location_id
WHERE t.trip_distance > 2.0   -- 1. Only look at trips above 2km
GROUP BY z.borough
HAVING COUNT(*) > 1000        -- 2. Only show boroughs that have > 1000 of those trips
ORDER BY trip_count DESC;

Both commands filter, but one filters at the row-by-row level, the other filters the aggregated level!

🤓 Curious Geek Did you know that the SELECT clause is one of the last things the database actually processes? Even though you write it at the top, the database first looks at FROM, then JOIN, then WHERE, then GROUP BY. This is why you can't use an alias you created in the SELECT (like fare * 1.2 AS total) inside your WHERE clause: the WHERE happens before the SELECT even exists!

CTEs: Writing Clean Code

In Python, you wouldn't write a 100-line script inside a single function. You’d break it into steps. In SQL, we do this with CTEs (Common Table Expressions) using the WITH keyword.

CTEs are "temporary result sets" that exist only during the execution of that query. They make your SQL readable, debuggable, and much more professional.

A common pattern:

  1. CTE 1: Clean the data (filter out nulls, fix types).
  2. CTE 2: Join the tables.
  3. Final Select: Perform the aggregation.
WITH cleaned_trips AS (
    SELECT * FROM raw_trips 
    WHERE fare_amount > 0 AND trip_distance > 0
),
trips_with_boroughs AS (
    SELECT 
        ct.*,
        z.borough
    FROM cleaned_trips ct
    JOIN raw_zones z ON ct.pickup_location_id = z.location_id
)
SELECT 
    borough,
    COUNT(*) as trip_count
FROM trips_with_boroughs
GROUP BY 1  -- Shortcut for "first column in SELECT"
ORDER BY 2 DESC;

💡 Pro-Tip: If you see a query with nested subqueries (queries inside queries inside queries), refactor it into CTEs. Your future self (and your teammates) will thank you.

Window Functions: The "Calculated Column"

We’re getting advanced now ! Unlike GROUP BY, which "squashes" your rows, Window Functions allow you to perform calculations across a set of rows while still keeping all the individual rows visible.

A good mental model to keep in mind is:

You can always recognize a window function by the OVER() clause. They’re especially useful for:

Example: Ranking Zones by Revenue

SELECT 
    borough, 
    zone, 
    ROUND(total_revenue) as total_revenue,
    RANK() OVER (PARTITION BY borough ORDER BY total_revenue DESC) as revenue_rank
FROM (
    SELECT 
	    z.borough, 
	    z.zone, 
	    SUM(t.fare_amount) as total_revenue
    FROM raw_trips t 
    JOIN raw_zones z 
	    ON t.pickup_location_id = z.location_id
    GROUP BY 1, 2
) subquery;

What’s happening here:

Breaking down the window:

This is a snapshot of the results you would get! The ranking will restart for each borough considering we added borough in the partition statement.

image.png

In this case, what do you expected to get if you run the same window function without the PARTITION BY borough piece?

🧠 Curious box: “Can I filter on a window function?”

You might want to write something like this to retrieve only the rows that matched a specific ranking:

SELECT*,
       ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC)AS rn
FROM my_table
WHERE rn=1;-- ❌ this won't work in PostgreSQL

However, this will fail in PostGRE SQL because WHERE is evaluated before window functions are computed. The correct PostgreSQL approach is to use a subquery or CTE like this:

SELECT *
FROM (
		SELECT *, ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS rn
		FROM my_table
) t
WHERE rn=1; -- This filters on the borough with the highest revenue! 

In other tools like BigQuery / Snowflake, the same result can be used without a CTE (using QUALIFY)

SELECT *, ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC)AS rn
FROM my_table
QUALIFY rn=1;

UNION and UNION ALL: Stacking Data

Sometimes you don't want to join tables side-by-side (Columns); you want to stack them on top of each other (Rows).

Suppose you have a table for green_taxi_trips and another for yellow_taxi_trips. To get a master taxi trip list, you use UNION. For this, you can use the following commands in PostGRE:

The syntax looks like this:

SELECT 'green' as type, pickup_datetime 
FROM green_trips
UNION ALL
SELECT 'yellow' as type, pickup_datetime 
FROM yellow_trips;

Curious Concept: The Cartesian Product (The "Explosion")

Ever written a query that never finishes, or suddenly returns billions of rows? You likely created a Cartesian Join (or Cross Join).

This happens when you join two tables but forget the ON clause, or use a join condition that isn't unique. Postgres will try to match every row of Table A with every row of Table B. If both tables have 1,000 rows, you just created a 1,000,000 row result set. In our taxi data, a mistake like this could crash your session!


🧠 Knowledge Check

  1. What is the difference between an INNER JOIN and a LEFT JOIN? What happens if a trip has a location_id that doesn't exist in the zones table?
  2. What would you change in our first query if instead of the pickup location zone you wanted to display the dropoff zone? What if you want to show both?
  3. You want to find the most expensive trip for each vendor_id. Which aggregate function do you use, and which column goes into the GROUP BY?
  4. Why are CTEs preferred over subqueries for complex analytical logic?
  5. Reference Week 6: Why is using SUM() in SQL more efficient than pulling all rows into a Python list and using sum(my_list)?