Introduction to SQL for Analytics
2. Joins, CTEs, and Aggregations
5. Building Views in Azure PostgreSQL
You are officially at the finish line of the "Architect" phase! You’ve learned how to use join tables together, design a Star Schema, build layers in a Medallion architecture, and code them into PostgreSQL Views.
But before you start the practical lab, we need to talk about some common "Data Pitfalls." These are the silent mistakes that turn a brilliant architect into a frustrated troubleshooter. As we’ve seen with the ‘wrong join’ example in the previous chapter, most data bugs don't "break" your code, but they make production data inaccurate.
Let’s summarise common issues in a clear way!
This is the scariest pitfall in SQL. If you forget a JOIN condition (the ON clause), or if you join on the wrong key, PostgreSQL will pair every single row of the first table with every single row of the second.
ON conditions!COUNT(*) vs. COUNT(column)This is a classic exam (and job interview) question because it trips everyone up.
COUNT(*): Counts every single row, even if it’s full of NULL values.
COUNT(tip_amount): Only counts rows where the tip is not NULL.
If you have 100 trips but 10 of them have a NULL tip, these two functions will give you different answers.
💡 Fun Fact: In SQL,
NULLis not zero; it’s "unknown." That means any operation involvingNULLusually returnsNULL:
NULL + 5 → NULLNULL * 10 → NULLNULL = NULL → UNKNOWN(not true!)This is why
NULLcan quietly break your calculations if you’re not careful.
How you handle NULLs matters just as much as understanding them. Here are some guidlines:
1. Decide what NULL actually means in your data
Before replacing it, ask: is it truly “missing”, or does it mean something specific?
tip_amount = NULL → no tip recorded? or genuinely missing data?end_time = NULL → trip still ongoing? or bad ingestion?Don’t default everything to 0, sometimes that changes the meaning of your data.
2. Use COALESCE intentionally (not blindly)
When you do want to replace NULLs, use: COALESCE(tip_amount, 0)
This is especially useful for:
SUM(COALESCE(tip_amount, 0))But be careful! Replacing NULL with 0 changes “unknown” into “zero”, those are not always the same thing.
3. Be explicit in aggregations
Remember: AVG(column), SUM(column), COUNT(column) ignores NULLs
So always ask yourself:
“Am I okay excluding missing values here?”
If not, you may need: SUM(COALESCE(column, 0))
4. Always use IS NULL, never = NULL
column IS NULL -- correct
column = NULL -- doesn't work
This is one of the most common bugs.
5. Watch out for joins: NULLs in join keys can silently drop rows:
JOIN table_b ON table_a.key = table_b.key
If key is NULL in either table → no match → row disappears.
If that’s not what you want, you need to handle it explicitly.
Both your trips table and your zones table probably have a column called id or name. If you write SELECT id FROM ..., PostgreSQL will throw its hands up and say "Column reference is ambiguous."
fact.location_id, dim.location_id). It’s not just for the computer; it’s so the next human reading your code knows where the data came from.In many modern data warehouses, the database doesn't strictly force every location_id in the trips table to exist in the zones table. We call this "Soft Integrity."
INNER JOIN.LEFT JOIN where the dimension side is NULL) to find these "orphaned" trips.Now you’ll probably be asking: What to do when you find them?
There’s no single correct answer, it depends on your use case. You might:
For example:
COALESCE(borough,'Unknown')
The important part is not which option you choose, but that you choose it for a reason.
🤖 Using AI to help. You can also use AI as a quick sanity checker here. Try asking:
“What kinds of referential integrity issues could exist between these two table schemas?”
It can suggest checks you might forget, like:
- mismatched data types (
INTvsSTRING)- leading zeros or formatting issues
- partial keys or composite keys
SELECT *It is tempting to write SELECT * to see everything. In analytical queries, this is a bad habit:
In this course, we are using January 2024 (~57k rows) of Green Taxi data. The full NYC Taxi dataset contains billions of rows across several years. If you load that directly into Azure PostgreSQL without constraints, a few things happen:
SELECT * suddenly means scanning a massive table.A good rule of thumb here is:
If you are still learning the model, you should never be “big data constrained.”
In the previous chapters, we learned:
You now have the full theory. You know how to build a clean, professional data mart, and more importantly, you know how to avoid the common traps that catch junior engineers.