Introduction to SQL for Analytics
In the previous chapter, 9.2: Joins, CTEs, and Aggregations, you learned how to transform raw data into insights. But there is a massive assumption in that chapter: that the data is actually correct.
In production data engineering, this is a dangerous assumption. Pipelines break, APIs change their format, and CSVs often arrive with "garbage" rows. If you build a dashboard on top of bad data, you aren't providing insights; you're providing "hallucinations." This chapter is about using SQL to build a "firewall" around your data warehouse.
You’ve already learned how to use Python to handle errors (Week 3) and how to validate schemas with Pydantic (Week 4). So why do it again in SQL?
Because SQL validation is the final truth. Data can look perfect in your Python variables, but things can still go wrong during the COPY or INSERT process into Azure PostgreSQL. Validating at the SQL layer checks the state of the database. It’s your last chance to catch an issue before an analyst uses your data to make a business decision.
Now, let’s look at some of the things we need to watch out for!
The most common data quality issue is duplication. Maybe your pipeline ran twice by accident, or the source system sent a "retry" record.
To find them, we use the "Duplicate Pattern": GROUP BY the unique keys and check for a count greater than 1.
SELECT
    vendor_id,
    pickup_datetime,
    pickup_location_id,
    COUNT(*) AS record_count
FROM raw_trips
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1;
If this query returns rows, your "unique" records aren't actually unique. In a clean dataset, this query should return zero results.
A "Completeness" check tells you how much of your data is missing. If the passenger_count is NULL for 50% of your rows, any "Average Passengers" metric you calculate will be wrong.
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN pickup_location_id IS NULL THEN 1 ELSE 0 END) AS null_pickups,
    (SUM(CASE WHEN pickup_location_id IS NULL THEN 1 ELSE 0 END)::FLOAT / COUNT(*)) * 100 AS null_percentage
FROM raw_trips;
💡 Pro-Tip: In Postgres, ::FLOAT is a quick way to cast an integer to a float so that integer division doesn't silently truncate your percentage to zero!
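A quick illustration of the difference:

SELECT
    50 / 100 AS integer_division,       -- 0: integer division truncates
    50::FLOAT / 100 AS float_division;  -- 0.5: cast first, then divide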
A fundamental rule of Data Engineering: what goes in must come out. If your Python script read 57,431 rows from the NYC Parquet file, but your Postgres table only has 57,400, you have a "silent failure": rows were lost without any error being raised.
At minimum, you should always validate that row counts match between stages of your pipeline:
Compare SELECT COUNT(*) FROM raw_trips in Postgres with len(df) in Python.
💡 Better practice: store expected row counts during ingestion and validate them automatically.
-- Example validation table
SELECT
    source_file,
    expected_rows,
    loaded_rows,
    expected_rows - loaded_rows AS difference
FROM ingestion_audit;
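If you only want to see the loads that went wrong, a small follow-up sketch against the same ingestion_audit table:

-- Surface only the loads where counts don't match
SELECT source_file, expected_rows, loaded_rows
FROM ingestion_audit
WHERE expected_rows <> loaded_rows;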
Matching counts is necessary, but unfortunately it is not always sufficient: the right number of rows can still hold the wrong values. For stronger validation, consider comparing simple aggregates between source and destination, such as the sum of a numeric column or the minimum and maximum timestamps, as sketched below. Still, row count checks are your first alarm bell, and something every pipeline should have.
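A minimal sketch of such an aggregate comparison, using columns that appear elsewhere in this chapter (fare_amount, pickup_datetime):

SELECT
    COUNT(*) AS row_count,
    SUM(fare_amount) AS total_fares,
    MIN(pickup_datetime) AS earliest_pickup,
    MAX(pickup_datetime) AS latest_pickup
FROM raw_trips;

If the same figures computed on the source side (e.g. in your Python script) don't match, something was lost or corrupted in transit.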
Since we are joining trips to zones, what happens if a trip has a pickup_location_id of 999, but that ID doesn't exist in the zones table?
This is an Orphaned Key. An INNER JOIN will silently drop that trip from your results, making your revenue look lower than it actually is.
With a simple query like this, you can quickly check if you have this issue in your data:
SELECT DISTINCT t.pickup_location_id
FROM raw_trips t
LEFT JOIN raw_zones z ON t.pickup_location_id = z.location_id
WHERE z.location_id IS NULL;
Any result here means you have trips happening in "Ghost Zones" that aren't in your lookup table.
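To gauge how much this skews your numbers, a sketch that counts the orphaned trips and the revenue attached to them (fare_amount is the same column used in the range checks below):

SELECT
    COUNT(*) AS orphaned_trips,
    SUM(t.fare_amount) AS revenue_at_risk
FROM raw_trips t
LEFT JOIN raw_zones z ON t.pickup_location_id = z.location_id
WHERE z.location_id IS NULL;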
Data often contains "impossible" values. Using our NYC dataset, a "Data Quality" specialist would look for negative fares, absurdly long trips, and pickups that happen in the future:
SELECT *
FROM raw_trips
WHERE fare_amount < 0
OR trip_distance > 500
OR pickup_datetime > NOW();
Sometimes the problem isn’t what the value is, but how it’s stored or whether it follows expected rules. These issues won’t always break your queries, but they will quietly corrupt your analysis.
Common examples:
- Numbers stored as text ("12.5" instead of 12.5)
- Inconsistent date formats ("2024/01/01" vs "01-01-2024")
The query below is an example of how you can catch type issues, for example when columns look numeric but behave like text (common in raw ingestions).
SELECT *
FROM raw_trips
WHERE fare_amount::TEXT !~ '^[0-9]+(\.[0-9]+)?$';
This flags values that don’t look like valid numbers.
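The same idea works for date formats. A sketch, assuming a hypothetical staging table where dates arrived as text:

SELECT *
FROM raw_trips_text                          -- hypothetical table with text columns
WHERE pickup_date !~ '^\d{4}-\d{2}-\d{2}$';  -- flag anything not in YYYY-MM-DD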
At this point, you've seen the most common data quality checks: duplicates, missing values, mismatched row counts, orphaned keys, impossible values, and type or format issues.
None of these checks are complex, but together they answer a critical question: Can I trust this dataset? We need to keep in mind that data quality is not a one-time task; these checks should be repeated every time new data lands, not run once and forgotten.
Instead of running these checks manually in DBeaver, a production-grade pipeline standardizes and automates them.
To avoid rewriting the same queries every time, you can define a small library of validation queries, each with a clear name and purpose:
- duplicate_check_trips
- null_check_pickup_location
- range_check_trip_distance

Each query returns a simple, interpretable result (e.g. a count, percentage, or boolean flag). The key is consistency: every check should be easy to run, compare, and log.
A common pattern is to store the output of these checks in a dedicated table, for example data_quality_logs, with columns such as the check name, the run timestamp, the measured value, and a pass/fail flag.
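A minimal sketch of what such a table might look like (column names are illustrative, not prescribed):

CREATE TABLE data_quality_logs (
    check_name   TEXT NOT NULL,                       -- e.g. 'duplicate_check_trips'
    run_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),  -- when the check ran
    result_value NUMERIC,                             -- count, percentage, etc.
    passed       BOOLEAN                              -- did the check meet its threshold?
);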
Each time your pipeline runs, it executes the validation queries and appends the results to this table.
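For example, the duplicate check from earlier could log its result like this (a sketch, assuming the table above):

INSERT INTO data_quality_logs (check_name, result_value, passed)
SELECT
    'duplicate_check_trips',
    COUNT(*),      -- number of duplicated key combinations
    COUNT(*) = 0   -- the check passes only if there are none
FROM (
    SELECT vendor_id, pickup_datetime, pickup_location_id
    FROM raw_trips
    GROUP BY 1, 2, 3
    HAVING COUNT(*) > 1
) AS dupes;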
This gives you a running history of data quality: you can see exactly when a check started failing and track trends over time.
From here, it’s straightforward to build a dashboard showing the overall health of your data pipeline.
These checks shouldn't be optional; they should be part of the pipeline itself.
A typical flow looks like:
Load raw data → run the validation queries → log the results to data_quality_logs → continue (or stop and alert on failure).
🤓 Curious Geek: The "Write-Audit-Publish" (WAP) Pattern
Advanced teams don't write data directly into the main table. They write to a Staging table, run these validation queries, and only if the validations pass do they move the data into the final table. If the data fails validation, the pipeline stops and alerts the engineer. It's like CI/CD, but for data!
This is what this setup looks like:

Write (load into a staging table) → Audit (run the validation queries) → Publish (move the data into the final table only if the audit passes).
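In plain SQL, a minimal sketch of the three steps (table names are illustrative; in practice an orchestrator gates the publish step on the audit result):

-- Write: the pipeline has already loaded new data into staging_trips.

-- Audit: count rows that fail basic sanity checks.
SELECT COUNT(*) AS failed_rows
FROM staging_trips
WHERE fare_amount < 0 OR pickup_datetime > NOW();

-- Publish: run only if failed_rows = 0.
INSERT INTO trips
SELECT * FROM staging_trips;

TRUNCATE staging_trips;  -- clear staging for the next run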