Week 9 - SQL for Analytics

1. Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. Data Modeling Concepts

5. Building Views in Azure PostgreSQL

6. Gotchas & Pitfalls

7. Practice

8. Assignment

Lesson Plan

Gotchas & Pitfalls

You are officially at the finish line of the "Architect" phase! You’ve learned how to join tables together, design a Star Schema, build layers in a Medallion architecture, and code them into PostgreSQL Views.

But before you start the practical lab, we need to talk about some common "Data Pitfalls." These are the silent mistakes that turn a brilliant architect into a frustrated troubleshooter. As we saw with the ‘wrong join’ example in the previous chapter, most data bugs don’t "break" your code; they quietly make production data inaccurate.

Let’s summarise common issues in a clear way!


1. The Cartesian Explosion

This is the scariest pitfall in SQL. If you forget a JOIN condition (the ON clause), or if you join on the wrong key, PostgreSQL will pair every single row of the first table with every single row of the second.
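As a sketch of both versions, using the course’s `trips` and `zones` tables (`trip_id`, `zone_name`, and `pickup_location_id` are illustrative column names):

```sql
-- DANGEROUS: no ON clause, so every trip is paired with every zone.
-- With ~57k trips and ~265 zones, that is roughly 15 million rows.
SELECT t.trip_id, z.zone_name
FROM trips t
CROSS JOIN zones z;

-- CORRECT: the ON clause keeps only matching trip/zone pairs.
SELECT t.trip_id, z.zone_name
FROM trips t
JOIN zones z
  ON t.pickup_location_id = z.location_id;
```

If a query suddenly returns millions of rows from a 57k-row table, a missing or wrong join condition is the first thing to check.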

2. The NULL Nuance: COUNT(*) vs. COUNT(column)

This is a classic exam (and job interview) question because it trips everyone up.

💡 Fun Fact: In SQL, NULL is not zero; it’s "unknown." That means any operation involving NULL usually returns NULL:
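A few one-line examples you can run directly in PostgreSQL:

```sql
-- Every expression below returns NULL, not a "sensible" default:
SELECT
    5 + NULL         AS addition,       -- NULL, not 5
    'tip: ' || NULL  AS concatenation,  -- NULL, not 'tip: '
    NULL = NULL      AS comparison;     -- NULL, not TRUE
```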

This is why NULL can quietly break your calculations if you’re not careful.
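Here is the COUNT(*) vs. COUNT(column) difference in one query (a sketch, assuming the course’s `trips` table where `tip_amount` is sometimes NULL):

```sql
SELECT
    COUNT(*)          AS all_trips,     -- counts every row, NULLs included
    COUNT(tip_amount) AS tipped_trips   -- counts only rows where tip_amount IS NOT NULL
FROM trips;
```

If the two numbers differ, you have NULLs in that column, and every aggregate over it will quietly skip those rows.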

How you handle NULLs matters just as much as understanding them. Here are some guidelines:

1. Decide what NULL actually means in your data

Before replacing it, ask: is it truly “missing”, or does it mean something specific?

Don’t default everything to 0; sometimes that changes the meaning of your data.

2. Use COALESCE intentionally (not blindly)

When you do want to replace NULLs, use: COALESCE(tip_amount, 0)

This is especially useful for numeric columns you plan to SUM or AVG, where a missing value should genuinely count as zero.

But be careful! Replacing NULL with 0 turns “unknown” into “zero”; those are not always the same thing.
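One place where this bites is averages. A sketch against the `trips` table:

```sql
-- AVG(tip_amount) divides by the number of non-NULL tips;
-- AVG(COALESCE(tip_amount, 0)) divides by ALL rows.
-- If untipped trips are stored as NULL, the first number is inflated.
SELECT
    AVG(tip_amount)               AS avg_of_recorded_tips,
    AVG(COALESCE(tip_amount, 0))  AS avg_treating_missing_as_zero
FROM trips;
```

Neither result is "wrong"; they answer different questions, so pick the one that matches your business question.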

3. Be explicit in aggregations

Remember: AVG(column), SUM(column), and COUNT(column) all ignore NULLs.

So always ask yourself:

“Am I okay excluding missing values here?”

If not, you may need: SUM(COALESCE(column, 0))

4. Always use IS NULL, never = NULL

column IS NULL      -- correct
column = NULL       -- wrong: evaluates to NULL, matches nothing

This is one of the most common bugs.
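To see why, compare the two forms (a sketch against the `trips` table):

```sql
-- Wrong: tip_amount = NULL evaluates to NULL, which is never TRUE,
-- so this returns 0 even if half the tips are missing.
SELECT COUNT(*) FROM trips WHERE tip_amount = NULL;

-- Correct: IS NULL is the dedicated test for missing values.
SELECT COUNT(*) FROM trips WHERE tip_amount IS NULL;
```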

5. Watch out for joins: NULLs in join keys can silently drop rows:

JOIN table_b ON table_a.key = table_b.key

If key is NULL in either table → no match → row disappears.

If that’s not what you want, you need to handle it explicitly.
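A sketch of handling it explicitly (`pickup_location_id` is an illustrative column name):

```sql
-- INNER JOIN: trips with a NULL pickup_location_id simply vanish.
SELECT t.*, z.borough
FROM trips t
JOIN zones z ON t.pickup_location_id = z.location_id;

-- LEFT JOIN: those trips survive, with NULL zone columns
-- you can inspect or backfill with COALESCE.
SELECT t.*, COALESCE(z.borough, 'Unknown') AS borough
FROM trips t
LEFT JOIN zones z ON t.pickup_location_id = z.location_id;
```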

3. Ambiguous Column Names

Both your trips table and your zones table probably have a column called id or name. If you write SELECT id FROM ..., PostgreSQL will throw its hands up and say "Column reference is ambiguous."
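The fix is to qualify every column with a table alias, e.g. (column names are illustrative):

```sql
SELECT
    t.id   AS trip_id,    -- explicitly the trips id
    z.name AS zone_name   -- explicitly the zones name
FROM trips t
JOIN zones z ON t.pickup_location_id = z.location_id;
```

Aliasing every column also makes the query self-documenting once it lives inside a view.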

4. Referential Integrity

In many modern data warehouses, the database doesn't strictly force every location_id in the trips table to exist in the zones table. We call this "Soft Integrity."
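A standard way to surface these orphan rows is an anti-join: LEFT JOIN the dimension and keep only the rows with no match (a sketch; `pickup_location_id` is an illustrative column name):

```sql
-- Trips whose location does not exist in the zones dimension:
SELECT t.pickup_location_id, COUNT(*) AS orphan_trips
FROM trips t
LEFT JOIN zones z ON t.pickup_location_id = z.location_id
WHERE z.location_id IS NULL
GROUP BY t.pickup_location_id
ORDER BY orphan_trips DESC;
```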

Now you’ll probably be asking: what should you do when you find these mismatches?

There’s no single correct answer; it depends on your use case. You might drop the orphan rows, map them to a placeholder value such as “Unknown”, or fix the source data upstream.

For example:

COALESCE(borough,'Unknown')

The important part is not which option you choose, but that you choose it for a reason.

🤖 Using AI to help. You can also use AI as a quick sanity checker here. Try asking:

“What kinds of referential integrity issues could exist between these two table schemas?”

It can suggest checks you might forget, like orphaned foreign keys, NULL join keys, or duplicate keys in the dimension table.

5. The Danger of SELECT *

It is tempting to write SELECT * to see everything. In analytical queries, this is a bad habit: it scans and transfers columns you don’t need, it hides which columns your logic actually depends on, and a view defined with SELECT * freezes its column list at creation time, which can surprise you when the underlying table changes.

6. Scale Management

In this course, we are using January 2024 (~57k rows) of Green Taxi data. The full NYC Taxi dataset contains billions of rows across several years. If you load that directly into Azure PostgreSQL without constraints, a few things happen:

  1. You burn through storage faster than you expect. Postgres in a cloud setup isn’t designed to casually hold billions of raw rows for exploration; it becomes expensive and unnecessary very quickly.
  2. Everything slows down. Queries that should take seconds start taking minutes, and simple debugging becomes frustrating because every SELECT * suddenly means scanning a massive table.
  3. You lose focus. It becomes harder and harder to keep track of grain, joins, and so on. You end up waiting for queries to finish or optimizing storage problems you don’t need yet.

A good rule of thumb here is:

If you are still learning the model, you should never be “big data constrained.”
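In practice, that means bounding your working set early. A sketch, with an illustrative view name and column names:

```sql
-- Keep a one-month slice as your working layer while you iterate:
CREATE VIEW bronze_trips_jan2024 AS
SELECT trip_id, pickup_datetime, pickup_location_id, tip_amount
FROM trips
WHERE pickup_datetime >= DATE '2024-01-01'
  AND pickup_datetime <  DATE '2024-02-01';
```

Once the model is right at this scale, scaling up is an engineering task, not a modeling one.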


🔗 Connecting the Dots

In the previous chapters, we learned:

  1. Queries (How to talk to data).
  2. Validation (How to check if the data is "clean").
  3. Modeling (How to organize data into Facts and Dimensions).
  4. Views (How to save that logic in Azure).

You now have the full theory. You know how to build a clean, professional data mart, and more importantly, you know how to avoid the common traps that catch junior engineers.