Introduction to SQL for Analytics
2. Joins, CTEs, and Aggregations
4. OLAP vs OLTP and Modern Warehouses
5. Building Views in Azure PostgreSQL
Welcome to the "heavy lifting" part of the week. In Week 6, you learned how to use Python to shove data into PostgreSQL using INSERT and how to peek at it using SELECT *.
But in a real data pipeline, the raw data is usually a mess. It's fragmented across different tables and full of codes that don't make sense to humans. Today, we move beyond just "reading" data and start "transforming" it using the three pillars of analytical SQL: Joins, Aggregations and CTEs.
In Week 6, you created a single weather_readings table. In our NYC Taxi dataset, life is more complex. We have Trips (the events) and Zones (the labels).
If you look at the trips table, you’ll see pickup_location_id (pickup) and dropoff_location_id (dropoff). These are just numbers, like 43 or 132. To turn "43" into "Central Park," we need a JOIN.
A JOIN combines two tables into a single result set, matching rows based on common identifiers. These identifiers should be keys that tell us which row of Table A corresponds to which row of Table B.
We have different types of JOIN depending on what we want to keep from our two original tables:

INNER JOIN keeps only the rows that have a match in both tables. LEFT JOIN keeps every row from the left table and fills in NULLs where the right table has no match. RIGHT JOIN does the mirror image, and FULL OUTER JOIN keeps all rows from both sides.

The type of JOIN therefore determines what happens to the rows that did not have a match between A and B.
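To make the difference concrete, here is a minimal, runnable sketch using Python's built-in sqlite3 module as an in-memory stand-in for PostgreSQL (the tiny tables and values are invented for illustration):

```python
import sqlite3

# Illustrative only: miniature stand-ins for our trips/zones tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zones (location_id INTEGER, zone TEXT);
    INSERT INTO zones VALUES (43, 'Central Park'), (132, 'JFK Airport');
    CREATE TABLE trips (trip_id INTEGER, pickup_location_id INTEGER);
    -- Trip 3 points at location 999, which has no matching zone row.
    INSERT INTO trips VALUES (1, 43), (2, 132), (3, 999);
""")

# INNER JOIN: only rows with a match in BOTH tables survive.
inner_rows = conn.execute("""
    SELECT t.trip_id, z.zone
    FROM trips t
    INNER JOIN zones z ON t.pickup_location_id = z.location_id
""").fetchall()

# LEFT JOIN: every trip survives; unmatched zones come back as NULL (None).
left_rows = conn.execute("""
    SELECT t.trip_id, z.zone
    FROM trips t
    LEFT JOIN zones z ON t.pickup_location_id = z.location_id
""").fetchall()

print(inner_rows)  # trip 3 is dropped
print(left_rows)   # trip 3 is kept, with zone = None
```

Run it once with INNER JOIN and once with LEFT JOIN on your own data and watch the row count change: that difference is the whole story of join types.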
Now let’s look at how we can translate this to SQL syntax!
Let’s say we are asked to get the latest 10 trips with pickup location in Manhattan, including pickup time, trip distance and fare amount. This is how we can quickly get this info:
SELECT
    z.borough,
    z.zone AS pickup_zone,
    t.pickup_datetime,
    t.trip_distance,
    t.fare_amount
FROM raw_trips t
INNER JOIN raw_zones z
    ON t.pickup_location_id = z.location_id
WHERE z.borough = 'Manhattan'  -- This filters the pickup location to Manhattan
ORDER BY pickup_datetime DESC  -- We want the latest trips, so we sort by date descending
LIMIT 10;
The condition t.pickup_location_id = z.location_id tells Postgres exactly how these two worlds are connected.

💡 Do you start to see the benefit of joining multiple tables instead of storing everything in one? If zone names were repeated in the trips table, we’d introduce a lot of redundancy. Keep in mind that strings are more expensive to store and compare than short numerical IDs, so joining on IDs is much more efficient.
Now let’s see how we turn 57,000 rows (or millions of rows in real work life!) into just a few rows of pure insight. Analysts don't want to see every single taxi ride; they often want to see the Average Fare or Total Trips per day.
To do this, we use Aggregate Functions (SUM, AVG, COUNT, MIN, MAX) combined with a GROUP BY clause. The columns listed in the GROUP BY define how detailed (how granular) our aggregation will be.
AVG(fare_amount) grouped by pickup zone? → We get one row per pickup zone.
AVG(fare_amount) grouped by pickup zone AND dropoff zone? → We get one row for each pickup/dropoff zone combination; the number of rows equals the number of unique combinations in the table!
<aside>
⚠️ The Golden Rule of Aggregation: Every column in your SELECT statement must either be wrapped in an aggregate function (like SUM()) OR be listed in the GROUP BY clause. If you forget this, Postgres will throw an error that essentially says: "I don't know what to do with the extra columns!"
</aside>
Let’s look at how we can add multiple aggregations in a single query:
SELECT
    z.borough,
    COUNT(*) AS total_trips,
    AVG(t.fare_amount) AS avg_fare,
    SUM(t.trip_distance) AS total_distance
FROM raw_trips t
INNER JOIN raw_zones z
    ON t.pickup_location_id = z.location_id
GROUP BY z.borough
ORDER BY total_trips DESC;
You can see how in a single query we got the total number of trips per borough, their average fare and the total distance!
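If you want to see this mechanic in miniature, here is a self-contained sketch with sqlite3 standing in for PostgreSQL (the three trips are made up for illustration):

```python
import sqlite3

# Hypothetical mini-dataset to show COUNT/AVG/SUM with GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (borough TEXT, fare_amount REAL, trip_distance REAL);
    INSERT INTO trips VALUES
        ('Manhattan', 10.0, 2.0),
        ('Manhattan', 20.0, 4.0),
        ('Queens',    30.0, 10.0);
""")

# Three input rows collapse into one summary row per borough.
rows = conn.execute("""
    SELECT borough,
           COUNT(*)           AS total_trips,
           AVG(fare_amount)   AS avg_fare,
           SUM(trip_distance) AS total_distance
    FROM trips
    GROUP BY borough
    ORDER BY total_trips DESC
""").fetchall()

print(rows)  # [('Manhattan', 2, 15.0, 6.0), ('Queens', 1, 30.0, 10.0)]
```

Notice the shape: every column is either in the GROUP BY (borough) or wrapped in an aggregate, exactly as the Golden Rule requires.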
As you start aggregating data, you'll eventually want to filter the results (e.g., "Show me only the boroughs with more than 1,000 trips").
In SQL, the order in which the database executes your command matters.
WHERE filters rows before the math happens.
HAVING filters groups after the math happens.

Think of it like this: if you are organizing a party, WHERE is the guest list (who gets in), and HAVING is the rule for which tables stay (only tables with more than 4 people).
SELECT
    z.borough,
    COUNT(*) AS trip_count
FROM raw_trips t
JOIN raw_zones z
    ON t.pickup_location_id = z.location_id
WHERE t.trip_distance > 2.0  -- 1. Only look at trips longer than 2 miles
GROUP BY z.borough
HAVING COUNT(*) > 1000       -- 2. Only show boroughs that have > 1000 of those trips
ORDER BY trip_count DESC;
Both commands filter, but one filters at the row-by-row level, while the other filters at the aggregated level!
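Here is a tiny runnable illustration of that ordering, again using sqlite3 as a stand-in (the five trips are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (borough TEXT, trip_distance REAL);
    INSERT INTO trips VALUES
        ('Manhattan', 1.0), ('Manhattan', 3.0), ('Manhattan', 5.0),
        ('Queens', 3.0),
        ('Bronx', 1.5);
""")

rows = conn.execute("""
    SELECT borough, COUNT(*) AS trip_count
    FROM trips
    WHERE trip_distance > 2.0  -- row-level filter, applied BEFORE grouping
    GROUP BY borough
    HAVING COUNT(*) > 1        -- group-level filter, applied AFTER grouping
""").fetchall()

print(rows)  # [('Manhattan', 2)]
```

WHERE first discards the short trips (so Manhattan is down to 2 rows and Queens to 1), then HAVING discards the Queens group because its post-filter count is not above 1.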
🤓 Curious Geek: Did you know that the SELECT clause is one of the last things the database actually processes? Even though you write it at the top, the database first looks at FROM, then JOIN, then WHERE, then GROUP BY. This is why you can't use an alias you created in the SELECT (like fare * 1.2 AS total) inside your WHERE clause: the WHERE happens before the SELECT even exists!
In Python, you wouldn't write a 100-line script inside a single function. You’d break it into steps. In SQL, we do this with CTEs (Common Table Expressions) using the WITH keyword.
CTEs are "temporary result sets" that exist only during the execution of that query. They make your SQL readable, debuggable, and much more professional.
A common pattern:
WITH cleaned_trips AS (
    SELECT * FROM raw_trips
    WHERE fare_amount > 0 AND trip_distance > 0
),
trips_with_boroughs AS (
    SELECT
        ct.*,
        z.borough
    FROM cleaned_trips ct
    JOIN raw_zones z ON ct.pickup_location_id = z.location_id
)
SELECT
    borough,
    COUNT(*) AS trip_count
FROM trips_with_boroughs
GROUP BY 1 -- Shortcut for "first column in SELECT"
ORDER BY 2 DESC;
💡 Pro-Tip: If you see a query with nested subqueries (queries inside queries inside queries), refactor it into CTEs. Your future self (and your teammates) will thank you.
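To see the CTE pattern end to end, here is a compact sketch with sqlite3 standing in for PostgreSQL (tables and values invented; sqlite3 supports WITH just like Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_trips (pickup_location_id INTEGER, fare_amount REAL, trip_distance REAL);
    -- One negative fare and one zero distance: the CTE should filter both out.
    INSERT INTO raw_trips VALUES
        (1, 10.0, 2.0), (1, 7.0, 1.5), (1, -5.0, 1.0),
        (2, 8.0, 0.0), (2, 12.0, 3.0);
    CREATE TABLE raw_zones (location_id INTEGER, borough TEXT);
    INSERT INTO raw_zones VALUES (1, 'Manhattan'), (2, 'Queens');
""")

rows = conn.execute("""
    WITH cleaned_trips AS (          -- step 1: drop bad rows
        SELECT * FROM raw_trips
        WHERE fare_amount > 0 AND trip_distance > 0
    )
    SELECT z.borough, COUNT(*) AS trip_count
    FROM cleaned_trips ct            -- step 2: aggregate the clean rows
    JOIN raw_zones z ON ct.pickup_location_id = z.location_id
    GROUP BY z.borough
    ORDER BY trip_count DESC
""").fetchall()

print(rows)  # [('Manhattan', 2), ('Queens', 1)]
```

The bad rows (negative fare, zero distance) never reach the aggregation: each CTE is a named, testable step, just like a well-factored Python function.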
We’re getting advanced now! Unlike GROUP BY, which "squashes" your rows, Window Functions allow you to perform calculations across a set of rows while still keeping all the individual rows visible.

A good mental model to keep in mind is:

GROUP BY = shrink the table. Window function = keep the table, add a column.

You can always recognize a window function by the OVER() clause. They’re especially useful for rankings, running totals, and comparing each row against its group.
Example: Ranking Zones by Revenue
SELECT
    borough,
    zone,
    ROUND(total_revenue) AS total_revenue,
    RANK() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS revenue_rank
FROM (
    SELECT
        z.borough,
        z.zone,
        SUM(t.fare_amount) AS total_revenue
    FROM raw_trips t
    JOIN raw_zones z
        ON t.pickup_location_id = z.location_id
    GROUP BY 1, 2
) subquery;
What’s happening here: the inner subquery computes total_revenue per (borough, zone) using GROUP BY, and the outer query ranks those results.

Breaking down the window:

PARTITION BY borough → resets the ranking for each borough (like doing separate rankings per group)
ORDER BY total_revenue DESC → highest revenue gets rank 1
RANK() → assigns a rank, allowing ties (if two zones have the same revenue, they get the same rank)

The ranking restarts for each borough because we added borough to the PARTITION BY clause.

In this case, what do you expect to get if you run the same window function without the PARTITION BY borough piece?
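You can check your answer by experimenting with a toy table. Here is a sketch using sqlite3 (window functions need SQLite 3.25+, which recent Python versions bundle; the revenues are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zone_revenue (borough TEXT, zone TEXT, total_revenue REAL);
    INSERT INTO zone_revenue VALUES
        ('Manhattan', 'Midtown',  500.0),
        ('Manhattan', 'Harlem',   200.0),
        ('Queens',    'Astoria',  300.0),
        ('Queens',    'Flushing', 100.0);
""")

rows = conn.execute("""
    SELECT borough, zone,
           RANK() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS revenue_rank
    FROM zone_revenue
""").fetchall()

# With the partition, each borough gets its own rank 1.
# Try deleting "PARTITION BY borough" and re-running to check your answer:
# you get a single ranking across all four rows instead.
print(rows)
```

All four rows are still there after ranking: that is the "keep the table, add a column" behavior in action.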
You might want to write something like this to retrieve only the rows that matched a specific ranking:
SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS rn
FROM my_table
WHERE rn = 1; -- ❌ this won't work in PostgreSQL
However, this will fail in PostgreSQL because WHERE is evaluated before window functions are computed. The correct PostgreSQL approach is to use a subquery or CTE like this:
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS rn
    FROM my_table
) t
WHERE rn = 1; -- Keeps only the top-revenue row within each borough
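Here is that "top row per group" pattern working end to end, with sqlite3 as a stand-in and a made-up my_table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (borough TEXT, zone TEXT, total_revenue REAL);
    INSERT INTO my_table VALUES
        ('Manhattan', 'Midtown', 500.0),
        ('Manhattan', 'Harlem',  200.0),
        ('Queens',    'Astoria', 300.0);
""")

# Compute rn in an inner query, then filter on it in the outer query.
rows = conn.execute("""
    SELECT borough, zone
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS rn
        FROM my_table
    ) t
    WHERE rn = 1  -- the outer WHERE runs AFTER the window has been computed
""").fetchall()

print(sorted(rows))  # [('Manhattan', 'Midtown'), ('Queens', 'Astoria')]
```

The subquery wrapper is what makes the filter legal: by the time the outer WHERE runs, rn is just an ordinary column.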
In other tools like BigQuery and Snowflake, the same result can be achieved without a subquery by using QUALIFY:

SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY borough ORDER BY total_revenue DESC) AS rn
FROM my_table
QUALIFY rn = 1;
Sometimes you don't want to join tables side-by-side (Columns); you want to stack them on top of each other (Rows).
Suppose you have a table for green_taxi_trips and another for yellow_taxi_trips. To get a master taxi trip list, you use UNION. For this, PostgreSQL gives you two commands:
UNION ALL: Just stacks them. It’s fast and keeps every single row. (This is the most common use.)
UNION: Stacks them and then runs a "distinct" check to remove duplicates. It’s much slower because the database has to compare every row.

The syntax looks like this:
SELECT 'green' as type, pickup_datetime
FROM green_trips
UNION ALL
SELECT 'yellow' as type, pickup_datetime
FROM yellow_trips;
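A quick runnable version, with sqlite3 standing in and invented timestamps, shows how the 'type' literal lets you tell the stacked rows apart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE green_trips (pickup_datetime TEXT);
    INSERT INTO green_trips VALUES ('2024-01-01 08:00'), ('2024-01-01 09:00');
    CREATE TABLE yellow_trips (pickup_datetime TEXT);
    INSERT INTO yellow_trips VALUES ('2024-01-01 08:30');
""")

# UNION ALL stacks rows: 2 green + 1 yellow = 3 rows, nothing deduplicated.
rows = conn.execute("""
    SELECT 'green' AS type, pickup_datetime FROM green_trips
    UNION ALL
    SELECT 'yellow' AS type, pickup_datetime FROM yellow_trips
""").fetchall()

print(rows)  # 3 rows in total
```

Note that both SELECTs must produce the same number of columns with compatible types, otherwise the stack has nowhere to line up.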
Ever written a query that never finishes, or suddenly returns billions of rows? You likely created a Cartesian Join (or Cross Join).
This happens when you join two tables but forget the ON clause, or use a join condition that isn't unique. Postgres will try to match every row of Table A with every row of Table B. If both tables have 1,000 rows, you just created a 1,000,000 row result set. In our taxi data, a mistake like this could crash your session!
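You can watch the explosion safely at small scale. This sketch uses sqlite3 and two deliberately tiny 100-row tables (invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (y INTEGER);
""")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(100)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(100)])

# CROSS JOIN (or a JOIN with a forgotten ON clause) matches EVERY row
# of a with EVERY row of b: 100 x 100 = 10,000 rows.
n = conn.execute("SELECT COUNT(*) FROM a CROSS JOIN b").fetchone()[0]
print(n)  # 10000
```

Now imagine each table holding a million rows: the same mistake produces a trillion-row result, which is why a missing ON clause can take down a session.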
💪 Quick checks before you go:
What happens to a trip whose pickup location_id doesn't exist in the zones table? (Hint: it depends on your join type!)
Suppose you want the number of trips per vendor_id. Which aggregate function do you use, and which column goes into the GROUP BY?
How does SQL's SUM(fare_amount) compare to Python's sum(my_list)?