Joins, CTEs, and Aggregations

Welcome to the "heavy lifting" part of the week. In Week 6, you learned how to use Python to shove data into PostgreSQL using INSERT and how to peek at it using SELECT *.

But in a real data pipeline, the raw data is usually a mess. It's fragmented across different tables and full of codes that don't make sense to humans. Today, we move beyond just "reading" data and start "transforming" it using the three pillars of analytical SQL: Joins, Aggregations and CTEs.

1. Joins: Connecting the Dots

In Week 6, you created a single weather_readings table. In our NYC Taxi dataset, life is more complex. We have Trips (the events) and Zones (the labels).

If you look at the trips table, you’ll see pickuplocationid (Pickup) and dropofflocationid (Dropoff). These are just numbers, like 43 or 132. To turn "43" into "Central Park," we need a JOIN.

A JOIN allows us to combine multiple tables into a single one, combining the elements of each table together based on common identifiers. These common identifiers should be IDs that allow us to tell which row of Table A corresponds to each row of Table B.

We have different types of JOIN depending on what we want to keep from our two original tables:

The type of JOIN determines what we want to keep from each dataset ragarding the rows that did not have a match in A and B. In our dataset, this would be the output:

INNER JOIN: keeps only trips where the pickup/dropoff ID exists in the zones table (i.e. only matched locations).
LEFT JOIN: keeps all trips, even if a pickup/dropoff ID has no match in zones (missing names become NULL).
RIGHT JOIN: keeps all zone records, even if no trips reference them (unused zones still appear).
FULL JOIN: keeps everything from both trips and zones, showing matches where possible and NULLs where not.

Now let’s look at how we can translate this to SQL syntax!

Let’s say we are asked to get the latest 10 trips with pickup location in Manhattan, including pickup time, trip distance and fare amount. This is how we can quickly get this info:

SELECT 
    z.borough,
    z.zone AS pickup_zone,
    t.pickup_datetime,
    t.trip_distance,
    t.fare_amount
FROM raw_trips t
INNER JOIN raw_zones z 
    ON t.pickup_location_id = z.location_id
WHERE z.borough = 'Manhattan' -- This command filters the pickup location to Manhattan
ORDER BY pickup_datetime DESC -- As we want to retrieve the latest trips, we sort by date descending
LIMIT 10;

The Alias: Notice trips t and zones z. We use these short nicknames so we don't have to type the full table name every time.
The Bridge: ON t.PULocationID = z.LocationID tells Postgres exactly how these two worlds are connected.

💡

Do you start seeing the benefit of joining multiple tables instead of storing everything in one? If zone names were repeated in the trips table, we’d introduce a lot of redundancy.

Let’s keep in mind that strings are the heaviest data types to process, hence working with numerical short IDs brings great performance improvements!

2. Aggregations: The "Squash"

Now let’s see how we turn 57,000 rows (or millions of rows in real work life!) into just a few rows of pure insight. Analysts don't want to see every single taxi ride; they often want to see the Average Fare or Total Trips per day.

To do this, we use Aggregate Functions (SUM, AVG, COUNT, MIN, MAX) combined with a GROUP BY clause. Anything that follows. the GROUP BY will be what defines how detailed our aggregation will be.

AVG(fare_amount) grouped by pickup zone ? —> We will get one row per pickup zone
AVG(fare_amount) grouped by pickup zone AND dropoff zone ? —> We will get one row for each pickup<>dropoff zone combinations.. the number of rows will be equal to how many unique combinations we have in the tables!

<aside> ⚠️ The Golden Rule of Aggregation: Every column in your SELECT statement must either be wrapped in an aggregate function (like SUM()) OR be listed in the GROUP BY clause. If you forget this, Postgres will throw an error that essentially says: "I don't know what to do with the extra columns!"

</aside>

Let’s look at how we can add multiple aggregations in a single query:

SELECT 
    z.borough,
    COUNT(*) AS total_trips,
    AVG(t.fare_amount) AS avg_fare,
    SUM(t.trip_distance) AS total_distance
FROM raw_trips t
INNER JOIN raw_zones z 
    ON t.pickup_location_id = z.location_id
GROUP BY z.borough
ORDER BY total_trips DESC;

You can see how in a single query we got the total number of trips per borough, their average fare and the total distance!

3. CTEs: Writing Clean Code

In Python, you wouldn't write a 100-line script inside a single function. You’d break it into steps. In SQL, we do this with CTEs (Common Table Expressions) using the WITH keyword.

CTEs are "temporary result sets" that exist only during the execution of that query. They make your SQL readable, debuggable, and—honestly—much more professional.

The Pattern:

CTE 1: Clean the data (filter out nulls, fix types).
CTE 2: Join the tables.
Final Select: Perform the aggregation.

WITH cleaned_trips AS (
    SELECT * FROM trips 
    WHERE fare_amount > 0 AND trip_distance > 0
),
trips_with_boroughs AS (
    SELECT 
        ct.*,
        z.borough
    FROM cleaned_trips ct
    JOIN zones z ON ct.PULocationID = z.LocationID
)
SELECT 
    borough,
    COUNT(*) as trip_count
FROM trips_with_boroughs
GROUP BY 1  -- Shortcut for "first column in SELECT"
ORDER BY 2 DESC;

<aside>

💡 Pro-Tip: If you see a query with nested subqueries (queries inside queries inside queries), refactor it into CTEs. Your future self (and your teammates) will thank you.

</aside>

Curious Concept: The Cartesian Product (The "Explosion")

Ever written a query that never finishes, or suddenly returns billions of rows? You likely created a Cartesian Join (or Cross Join).

This happens when you join two tables but forget the ON clause, or use a join condition that isn't unique. Postgres will try to match every row of Table A with every row of Table B. If both tables have 1,000 rows, you just created a 1,000,000 row result set. In our taxi data, a mistake like this could crash your session!

🧠 Knowledge Check

What is the difference between an INNER JOIN and a LEFT JOIN? (Hint: What happens if a trip has a LocationID that doesn't exist in the zones table?)
What would you change in our first query if instead of the pickup location zone you wanted to display the dropoff zone? What if you want to show both?
You want to find the most expensive trip for each vendor_id. Which aggregate function do you use, and which column goes into the GROUP BY?
Why are CTEs preferred over subqueries for complex analytical logic?
Reference Week 6: Why is using SUM() in SQL more efficient than pulling all rows into a Python list and using sum(my_list)?

📖 Extra Reading

Modern SQL: Window Functions: Once you master GROUP BY, Window Functions are the next "level up" for analytics.
PostgreSQL Joins Visualized: A classic visual guide to understanding how data merges.
CTE vs Subquery Performance: When does it matter for the database engine?