Week 9 - SQL for Analytics

1. Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. Data Modeling Concepts

5. Building Views in Azure PostgreSQL

6. Gotchas & Pitfalls

7. Practice

8. Assignment

Lesson Plan

Gotchas & Pitfalls

You are officially at the finish line of the "Architect" phase! You’ve learned how to join tables together, design a Star Schema, build layers in a Medallion architecture, and code them into PostgreSQL Views.

But before you start the practical lab, we need to talk about some common "Data Pitfalls." These are the silent mistakes that turn a brilliant architect into a frustrated troubleshooter. As we saw with the ‘wrong join’ example in the previous chapter, most data bugs don’t "break" your code; they quietly make production data inaccurate.

Let’s summarise common issues in a clear way!


1. The Cartesian Explosion

This is the scariest pitfall in SQL. If you forget a JOIN condition (the ON clause), or if you join on the wrong key, PostgreSQL will pair every single row of the first table with every single row of the second.
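As a sketch of both versions, using the course’s `trips` and `zones` tables (`trip_id`, `zone_name`, and `pickup_location_id` are illustrative column names):

```sql
-- DANGEROUS: no ON clause, so every trip is paired with every zone.
-- With ~57k trips and ~265 zones, that is roughly 15 million rows.
SELECT t.trip_id, z.zone_name
FROM trips t
CROSS JOIN zones z;

-- CORRECT: the ON clause keeps only matching trip/zone pairs.
SELECT t.trip_id, z.zone_name
FROM trips t
JOIN zones z
  ON t.pickup_location_id = z.location_id;
```

If a query suddenly returns millions of rows from a 57k-row table, a missing or wrong join condition is the first thing to check.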

2. The NULL Nuance: COUNT(*) vs. COUNT(column)

This is a classic exam (and job interview) question because it trips everyone up.

💡 Fun Fact: In SQL, NULL is not zero; it’s "unknown." That means any operation involving NULL usually returns NULL:
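A few one-line examples you can run directly in PostgreSQL:

```sql
-- Every expression below returns NULL, not a "sensible" default:
SELECT
    5 + NULL         AS addition,       -- NULL, not 5
    'tip: ' || NULL  AS concatenation,  -- NULL, not 'tip: '
    NULL = NULL      AS comparison;     -- NULL, not TRUE
```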

This is why NULL can quietly break your calculations if you’re not careful.
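Here is the COUNT(*) vs. COUNT(column) difference in one query (a sketch, assuming the course’s `trips` table where `tip_amount` is sometimes NULL):

```sql
SELECT
    COUNT(*)          AS all_trips,     -- counts every row, NULLs included
    COUNT(tip_amount) AS tipped_trips   -- counts only rows where tip_amount IS NOT NULL
FROM trips;
```

If the two numbers differ, you have NULLs in that column, and every aggregate over it will quietly skip those rows.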

How you handle NULLs matters just as much as understanding them. Here are some guidelines:

1. Decide what NULL actually means in your data

Before replacing it, ask: is it truly “missing”, or does it mean something specific?

Don’t default everything to 0; sometimes that changes the meaning of your data.

2. Use COALESCE intentionally (not blindly)

When you do want to replace NULLs, use: COALESCE(tip_amount, 0)

This is especially useful for numeric columns you plan to SUM or AVG, where a missing value should genuinely count as zero.

But be careful! Replacing NULL with 0 turns “unknown” into “zero”; those are not always the same thing.
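One place where this bites is averages. A sketch against the `trips` table:

```sql
-- AVG(tip_amount) divides by the number of non-NULL tips;
-- AVG(COALESCE(tip_amount, 0)) divides by ALL rows.
-- If untipped trips are stored as NULL, the first number is inflated.
SELECT
    AVG(tip_amount)               AS avg_of_recorded_tips,
    AVG(COALESCE(tip_amount, 0))  AS avg_treating_missing_as_zero
FROM trips;
```

Neither result is "wrong"; they answer different questions, so pick the one that matches your business question.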

3. Be explicit in aggregations

Remember: AVG(column), SUM(column), and COUNT(column) all ignore NULLs.

So always ask yourself:

“Am I okay excluding missing values here?”

If not, you may need: SUM(COALESCE(column, 0))

4. Always use IS NULL, never = NULL

column IS NULL      -- correct
column = NULL       -- wrong: evaluates to NULL, matches nothing

This is one of the most common bugs.
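To see why, compare the two forms (a sketch against the `trips` table):

```sql
-- Wrong: tip_amount = NULL evaluates to NULL, which is never TRUE,
-- so this returns 0 even if half the tips are missing.
SELECT COUNT(*) FROM trips WHERE tip_amount = NULL;

-- Correct: IS NULL is the dedicated test for missing values.
SELECT COUNT(*) FROM trips WHERE tip_amount IS NULL;
```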

5. Watch out for joins: NULLs in join keys can silently drop rows:

JOIN table_b ON table_a.key = table_b.key

If key is NULL in either table → no match → row disappears.

If that’s not what you want, you need to handle it explicitly.
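A sketch of handling it explicitly (`pickup_location_id` is an illustrative column name):

```sql
-- INNER JOIN: trips with a NULL pickup_location_id simply vanish.
SELECT t.*, z.borough
FROM trips t
JOIN zones z ON t.pickup_location_id = z.location_id;

-- LEFT JOIN: those trips survive, with NULL zone columns
-- you can inspect or backfill with COALESCE.
SELECT t.*, COALESCE(z.borough, 'Unknown') AS borough
FROM trips t
LEFT JOIN zones z ON t.pickup_location_id = z.location_id;
```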

3. Ambiguous Column Names

Both your trips table and your zones table probably have a column called id or name. If you write SELECT id FROM ..., PostgreSQL will throw its hands up and say "Column reference is ambiguous."
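The fix is to qualify every column with a table alias, e.g. (column names are illustrative):

```sql
SELECT
    t.id   AS trip_id,    -- explicitly the trips id
    z.name AS zone_name   -- explicitly the zones name
FROM trips t
JOIN zones z ON t.pickup_location_id = z.location_id;
```

Aliasing every column also makes the query self-documenting once it lives inside a view.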

4. Referential Integrity

In many modern data warehouses, the database doesn't strictly force every location_id in the trips table to exist in the zones table. We call this "Soft Integrity."
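A standard way to surface these orphan rows is an anti-join: LEFT JOIN the dimension and keep only the rows with no match (a sketch; `pickup_location_id` is an illustrative column name):

```sql
-- Trips whose location does not exist in the zones dimension:
SELECT t.pickup_location_id, COUNT(*) AS orphan_trips
FROM trips t
LEFT JOIN zones z ON t.pickup_location_id = z.location_id
WHERE z.location_id IS NULL
GROUP BY t.pickup_location_id
ORDER BY orphan_trips DESC;
```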

Now you’ll probably be asking: what should you do when you find these mismatches?

There’s no single correct answer; it depends on your use case. You might drop the orphan rows, map them to a placeholder value such as “Unknown”, or fix the source data upstream.

For example:

COALESCE(borough,'Unknown')

The important part is not which option you choose, but that you choose it for a reason.

🤖 Using AI to help. You can also use AI as a quick sanity checker here. Try asking:

“What kinds of referential integrity issues could exist between these two table schemas?”

It can suggest checks you might forget, like orphaned foreign keys, NULL join keys, or duplicate keys in the dimension table.

5. The Danger of SELECT *

It is tempting to write SELECT * to see everything. In analytical queries, this is a bad habit: it scans and transfers columns you don’t need, it hides which columns your logic actually depends on, and a view defined with SELECT * freezes its column list at creation time, which can surprise you when the underlying table changes.

6. Scale Management

In this course, we are using January 2024 (~57k rows) of Green Taxi data. The full NYC Taxi dataset contains billions of rows across several years. If you load that directly into Azure PostgreSQL without constraints, a few things happen:

  1. You burn through storage faster than you expect. Postgres in a cloud setup isn’t designed to casually hold billions of raw rows for exploration; it becomes expensive and unnecessary very quickly.
  2. Everything slows down. Queries that should take seconds start taking minutes, and simple debugging becomes frustrating because every SELECT * suddenly means scanning a massive table.
  3. You lose focus. It becomes harder and harder to keep track of grain, joins, and so on. You end up waiting for queries to finish or optimizing storage problems you don’t need yet.

A good rule of thumb here is:

If you are still learning the model, you should never be “big data constrained.”
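In practice, that means bounding your working set early. A sketch, with an illustrative view name and column names:

```sql
-- Keep a one-month slice as your working layer while you iterate:
CREATE VIEW bronze_trips_jan2024 AS
SELECT trip_id, pickup_datetime, pickup_location_id, tip_amount
FROM trips
WHERE pickup_datetime >= DATE '2024-01-01'
  AND pickup_datetime <  DATE '2024-02-01';
```

Once the model is right at this scale, scaling up is an engineering task, not a modeling one.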


🔗 Connecting the Dots

In the previous chapters, we learned:

  1. Queries (How to talk to data).
  2. Validation (How to check if the data is "clean").
  3. Modeling (How to organize data into Facts and Dimensions).
  4. Views (How to save that logic in Azure).

You now have the full theory. You know how to build a clean, professional data mart, and more importantly, you know how to avoid the common traps that catch junior engineers.