Introduction to SQL for Analytics
4. OLAP vs OLTP and Modern Warehouses
6. Building Views in Azure PostgreSQL
Welcome to the "heavy lifting" part of the week. In Week 6, you learned how to use Python to shove data into PostgreSQL using INSERT and how to peek at it using SELECT *.
But in a real data pipeline, the raw data is usually a mess. It's fragmented across different tables and full of codes that don't make sense to humans. Today, we move beyond just "reading" data and start "transforming" it using the three pillars of analytical SQL: Joins, Aggregations and CTEs.
In Week 6, you created a single weather_readings table. In our NYC Taxi dataset, life is more complex. We have Trips (the events) and Zones (the labels).
If you look at the trips table, you’ll see pickuplocationid (Pickup) and dropofflocationid (Dropoff). These are just numbers, like 43 or 132. To turn "43" into "Central Park," we need a JOIN.
A JOIN allows us to combine multiple tables into a single one, combining the elements of each table together based on common identifiers. These common identifiers should be IDs that allow us to tell which row of Table A corresponds to each row of Table B.
We have different types of JOIN depending on what we want to keep from our two original tables:

The type of JOIN determines what we want to keep from each dataset ragarding the rows that did not have a match in A and B. In our dataset, this would be the output:
Now let’s look at how we can translate this to SQL syntax!
Let’s say we are asked to get the latest 10 trips with pickup location in Manhattan, including pickup time, trip distance and fare amount. This is how we can quickly get this info:
SELECT
z.borough,
z.zone AS pickup_zone,
t.pickup_datetime,
t.trip_distance,
t.fare_amount
FROM raw_trips t
INNER JOIN raw_zones z
ON t.pickup_location_id = z.location_id
WHERE z.borough = 'Manhattan' -- This command filters the pickup location to Manhattan
ORDER BY pickup_datetime DESC -- As we want to retrieve the latest trips, we sort by date descending
LIMIT 10;
💡
Do you start seeing the benefit of joining multiple tables instead of storing everything in one? If zone names were repeated in the trips table, we’d introduce a lot of redundancy.
Let’s keep in mind that strings are the heaviest data types to process, hence working with numerical short IDs brings great performance improvements!
Now let’s see how we turn 57,000 rows (or millions of rows in real work life!) into just a few rows of pure insight. Analysts don't want to see every single taxi ride; they often want to see the Average Fare or Total Trips per day.
To do this, we use Aggregate Functions (SUM, AVG, COUNT, MIN, MAX) combined with a GROUP BY clause. Anything that follows. the GROUP BY will be what defines how detailed our aggregation will be.
<aside> ⚠️ The Golden Rule of Aggregation: Every column in your SELECT statement must either be wrapped in an aggregate function (like SUM()) OR be listed in the GROUP BY clause. If you forget this, Postgres will throw an error that essentially says: "I don't know what to do with the extra columns!"
</aside>
Let’s look at how we can add multiple aggregations in a single query:
SELECT
z.borough,
COUNT(*) AS total_trips,
AVG(t.fare_amount) AS avg_fare,
SUM(t.trip_distance) AS total_distance
FROM raw_trips t
INNER JOIN raw_zones z
ON t.pickup_location_id = z.location_id
GROUP BY z.borough
ORDER BY total_trips DESC;
You can see how in a single query we got the total number of trips per borough, their average fare and the total distance!
In Python, you wouldn't write a 100-line script inside a single function. You’d break it into steps. In SQL, we do this with CTEs (Common Table Expressions) using the WITH keyword.
CTEs are "temporary result sets" that exist only during the execution of that query. They make your SQL readable, debuggable, and—honestly—much more professional.
The Pattern:
WITH cleaned_trips AS (
SELECT * FROM trips
WHERE fare_amount > 0 AND trip_distance > 0
),
trips_with_boroughs AS (
SELECT
ct.*,
z.borough
FROM cleaned_trips ct
JOIN zones z ON ct.PULocationID = z.LocationID
)
SELECT
borough,
COUNT(*) as trip_count
FROM trips_with_boroughs
GROUP BY 1 -- Shortcut for "first column in SELECT"
ORDER BY 2 DESC;
<aside>
💡 Pro-Tip: If you see a query with nested subqueries (queries inside queries inside queries), refactor it into CTEs. Your future self (and your teammates) will thank you.
</aside>
Ever written a query that never finishes, or suddenly returns billions of rows? You likely created a Cartesian Join (or Cross Join).
This happens when you join two tables but forget the ON clause, or use a join condition that isn't unique. Postgres will try to match every row of Table A with every row of Table B. If both tables have 1,000 rows, you just created a 1,000,000 row result set. In our taxi data, a mistake like this could crash your session!
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.