Joins, CTEs, and Aggregations

Welcome to the "heavy lifting" part of the week. In Week 6, you learned how to use Python to shove data into PostgreSQL using INSERT and how to peek at it using SELECT *.

But in a real data pipeline, the raw data is usually a mess. It's fragmented across different tables and full of codes that don't make sense to humans. Today, we move beyond just "reading" data and start "transforming" it using the three pillars of analytical SQL: Joins, Aggregations and CTEs.

Joins: Connecting the Dots

In Week 6, you created a single weather_readings table. In our NYC Taxi dataset, life is more complex. We have Trips (the events) and Zones (the labels).

If you look at the trips table, you’ll see pickuplocationid (Pickup) and dropofflocationid (Dropoff). These are just numbers, like 43 or 132. To turn "43" into "Central Park," we need a JOIN.

A JOIN allows us to combine multiple tables into a single one, combining the elements of each table together based on common identifiers. These common identifiers should be IDs that allow us to tell which row of Table A corresponds to each row of Table B.

We have different types of JOIN depending on what we want to keep from our two original tables:

The type of JOIN determines what we want to keep from each dataset ragarding the rows that did not have a match in A and B. In our dataset, this would be the output:

INNER JOIN: keeps only trips where the pickup/dropoff ID exists in the zones table (i.e. only matched locations).
LEFT JOIN: keeps all trips, even if a pickup/dropoff ID has no match in zones (missing names become NULL).
RIGHT JOIN: keeps all zone records, even if no trips reference them (unused zones still appear).
FULL JOIN: keeps everything from both trips and zones, showing matches where possible and NULLs where not.

Now let’s look at how we can translate this to SQL syntax!

Let’s say we are asked to get the latest 10 trips with pickup location in Manhattan, including pickup time, trip distance and fare amount. This is how we can quickly get this info:

SELECT 
    z.borough,
    z.zone AS pickup_zone,
    t.pickup_datetime,
    t.trip_distance,
    t.fare_amount
FROM raw_trips t
INNER JOIN raw_zones z 
    ON t.pickup_location_id = z.location_id
WHERE z.borough = 'Manhattan' -- This command filters the pickup location to Manhattan
ORDER BY pickup_datetime DESC -- As we want to retrieve the latest trips, we sort by date descending
LIMIT 10;

The Alias: Notice trips t and zones z. We use these short nicknames so we don't have to type the full table name every time.
The Bridge: ON t.pickup_location_id = z.location_id tells Postgres exactly how these two worlds are connected.

💡 Do you start seeing the benefit of joining multiple tables instead of storing everything in one? If zone names were repeated in the trips table, we’d introduce a lot of redundancy. Let’s keep in mind that strings are the heaviest data types to store and process, hence using short numerical IDs is much more efficient.

Aggregations: The "Squash"

Now let’s see how we turn 57,000 rows (or millions of rows in real work life!) into just a few rows of pure insight. Analysts don't want to see every single taxi ride; they often want to see the Average Fare or Total Trips per day.

To do this, we use Aggregate Functions (SUM, AVG, COUNT, MIN, MAX) combined with a GROUP BY clause. Anything that follows the GROUP BY will be what defines how detailed our aggregation will be.

AVG(fare_amount) grouped by pickup zone ? —> We will get one row per pickup zone
AVG(fare_amount) grouped by pickup zone AND dropoff zone ? —> We will get one row for each pickup<>dropoff zone combinations.. the number of rows will be equal to how many unique combinations we have in the tables!

<aside> ⚠️ The Golden Rule of Aggregation: Every column in your SELECT statement must either be wrapped in an aggregate function (like SUM()) OR be listed in the GROUP BY clause. If you forget this, Postgres will throw an error that essentially says: "I don't know what to do with the extra columns!"

</aside>

Let’s look at how we can add multiple aggregations in a single query:

SELECT 
    z.borough,
    COUNT(*) AS total_trips,
    AVG(t.fare_amount) AS avg_fare,
    SUM(t.trip_distance) AS total_distance
FROM raw_trips t
INNER JOIN raw_zones z 
    ON t.pickup_location_id = z.location_id
GROUP BY z.borough
ORDER BY total_trips DESC;

You can see how in a single query we got the total number of trips per borough, their average fare and the total distance!

WHERE vs. HAVING: The Filter Order

As you start aggregating data, you'll eventually want to filter the results (e.g., "Show me only the boroughs with more than 1,000 trips").

In SQL, the order in which the database executes your command matters.

WHERE filters rows before the math happens.
HAVING filters groups after the math happens.

Think of it like this: If you are organizing a party, WHERE is the guest list (who gets in), and HAVING is the rule for which tables stay (only tables with more than 4 people).

SELECT 
    z.borough,
    COUNT(*) AS trip_count
FROM raw_trips t
JOIN raw_zones z 
	ON t.pickup_location_id = z.location_id
WHERE t.trip_distance > 2.0   -- 1. Only look at trips above 2km
GROUP BY z.borough
HAVING COUNT(*) > 1000        -- 2. Only show boroughs that have > 1000 of those trips
ORDER BY trip_count DESC;

Both commands filter, but one filters at the row-by-row level, the other filters the aggregated level!

🤓 Curious Geek Did you know that the SELECT clause is one of the last things the database actually processes? Even though you write it at the top, the database first looks at FROM, then JOIN, then WHERE, then GROUP BY. This is why you can't use an alias you created in the SELECT (like fare * 1.2 AS total) inside your WHERE clause: the WHERE happens before the SELECT even exists!

CTEs: Writing Clean Code

In Python, you wouldn't write a 100-line script inside a single function. You’d break it into steps. In SQL, we do this with CTEs (Common Table Expressions) using the WITH keyword.

CTEs are "temporary result sets" that exist only during the execution of that query. They make your SQL readable, debuggable, and much more professional.

A common pattern:

CTE 1: Clean the data (filter out nulls, fix types).
CTE 2: Join the tables.
Final Select: Perform the aggregation.

WITH cleaned_trips AS (
    SELECT * FROM raw_trips 
    WHERE fare_amount > 0 AND trip_distance > 0
),
trips_with_boroughs AS (
    SELECT 
        ct.*,
        z.borough
    FROM cleaned_trips ct
    JOIN raw_zones z ON ct.pickup_location_id = z.location_id
)
SELECT 
    borough,
    COUNT(*) as trip_count
FROM trips_with_boroughs
GROUP BY 1  -- Shortcut for "first column in SELECT"
ORDER BY 2 DESC;

💡 Pro-Tip: If you see a query with nested subqueries (queries inside queries inside queries), refactor it into CTEs. Your future self (and your teammates) will thank you.

Window Functions: The "Calculated Column"

We’re getting advanced now ! Unlike GROUP BY, which "squashes" your rows, Window Functions allow you to perform calculations across a set of rows while still keeping all the individual rows visible.

A good mental model to keep in mind is:

*GROUP BY = shrink the table*
Window functions = add smart columns to the table

You can always recognize a window function by the OVER() clause. They’re especially useful for:

Rankings: “Who are the top 3 drivers in each borough?”
Running totals: “What is our cumulative revenue day by day?”
Comparisons: “How does this row compare to the average?”
Deduplication: “Keep only the most recent record per user”

Example: Ranking Zones by Revenue

SELECT 
    borough, 
    zone, 
    ROUND(total_revenue) as total_revenue,
    RANK() OVER (PARTITION BY borough ORDER BY total_revenue DESC) as revenue_rank
FROM (
    SELECT 
	    z.borough, 
	    z.zone, 
	    SUM(t.fare_amount) as total_revenue
    FROM raw_trips t 
    JOIN raw_zones z 
	    ON t.pickup_location_id = z.location_id
    GROUP BY 1, 2
) subquery;

What’s happening here:

First, we compute total_revenue per (borough, zone) using GROUP BY
Then, we apply a window function on top of that result

Breaking down the window:

PARTITION BY borough → Resets the ranking for each borough (like doing separate rankings per group)
ORDER BY total_revenue DESC → Highest revenue gets rank 1
RANK() → Assigns a rank, allowing ties (if two zones have the same revenue, they get the same rank)

This is a snapshot of the results you would get! The ranking will restart for each borough considering we added borough in the partition statement.

In this case, what do you expected to get if you run the same window function without the PARTITION BY borough piece?

🧠 Curious box: “Can I filter on a window function?”

You might want to write something like this to retrieve only the rows that matched a specific ranking:

SELECT*,
       ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC)AS rn
FROM my_table
WHERE rn=1;-- ❌ this won't work in PostgreSQL

However, this will fail in PostGRE SQL because WHERE is evaluated before window functions are computed. The correct PostgreSQL approach is to use a subquery or CTE like this:

SELECT *
FROM (
		SELECT *, ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS rn
		FROM my_table
) t
WHERE rn=1; -- This filters on the borough with the highest revenue!

In other tools like BigQuery / Snowflake, the same result can be used without a CTE (using QUALIFY)

SELECT *, ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC)AS rn
FROM my_table
QUALIFY rn=1;

UNION and UNION ALL: Stacking Data

Sometimes you don't want to join tables side-by-side (Columns); you want to stack them on top of each other (Rows).

Suppose you have a table for green_taxi_trips and another for yellow_taxi_trips. To get a master taxi trip list, you use UNION. For this, you can use the following commands in PostGRE:

UNION ALL: Just stacks them. It’s fast and keeps every single row. (This is the most common use).
UNION: Stacks them and then runs a "Distinct" check to remove duplicates. It’s much slower because the database has to compare every row.

The syntax looks like this:

SELECT 'green' as type, pickup_datetime 
FROM green_trips
UNION ALL
SELECT 'yellow' as type, pickup_datetime 
FROM yellow_trips;

Curious Concept: The Cartesian Product (The "Explosion")

Ever written a query that never finishes, or suddenly returns billions of rows? You likely created a Cartesian Join (or Cross Join).

This happens when you join two tables but forget the ON clause, or use a join condition that isn't unique. Postgres will try to match every row of Table A with every row of Table B. If both tables have 1,000 rows, you just created a 1,000,000 row result set. In our taxi data, a mistake like this could crash your session!

🧠 Knowledge Check

What is the difference between an INNER JOIN and a LEFT JOIN? What happens if a trip has a location_id that doesn't exist in the zones table?
What would you change in our first query if instead of the pickup location zone you wanted to display the dropoff zone? What if you want to show both?
You want to find the most expensive trip for each vendor_id. Which aggregate function do you use, and which column goes into the GROUP BY?
Why are CTEs preferred over subqueries for complex analytical logic?
Reference Week 6: Why is using SUM() in SQL more efficient than pulling all rows into a Python list and using sum(my_list)?