Introduction to SQL for Analytics
In the previous chapter, 9.2: Joins, CTEs, and Aggregations, you learned how to transform raw data into insights. But there is a massive assumption in that chapter: that the data is actually correct.
In production data engineering, this is a dangerous assumption. Pipelines break, APIs change their format, and CSVs often arrive with "garbage" rows. If you build a dashboard on top of bad data, you aren't providing insights; you're providing "hallucinations." This chapter is about using SQL to build a "firewall" around your data warehouse.
You’ve already learned how to use Python to handle errors (Week 3) and how to validate schemas with Pydantic (Week 4). So why do it again in SQL?
Because SQL validation is the final truth. Data can look perfect in your Python variables, but things can still go wrong during the COPY or INSERT process into Azure PostgreSQL. Validating at the SQL layer checks the state of the database. It’s your last chance to catch an issue before an analyst uses your data to make a business decision.
Now, let’s look at some of the things we need to watch out for!
The most common data quality issue is duplication. Maybe your pipeline ran twice by accident, or the source system sent a "retry" record.
To find them, we use the "Duplicate Pattern": GROUP BY the unique keys and check for a count greater than 1.
SELECT
    vendor_id,
    pickup_datetime,
    pickup_location_id,
    COUNT(*) AS record_count
FROM raw_trips
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1;
If this query returns rows, your "unique" records aren't actually unique. In a clean dataset, this query should return zero results.
A "Completeness" check tells you how much of your data is missing. If the passenger_count is NULL for 50% of your rows, any "Average Passengers" metric you calculate will be wrong.
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN pickup_location_id IS NULL THEN 1 ELSE 0 END) AS null_pickups,
    (SUM(CASE WHEN pickup_location_id IS NULL THEN 1 ELSE 0 END)::FLOAT / COUNT(*)) * 100 AS null_percentage
FROM raw_trips;
💡 Pro-Tip: In Postgres, ::FLOAT is a quick way to cast an integer to a float so that integer division doesn't silently truncate your percentage to zero!
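A quick illustration of the difference:

SELECT
    50 / 100 AS integer_division,       -- 0: integer division truncates
    50::FLOAT / 100 AS float_division;  -- 0.5: cast first, then divide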
A fundamental rule of Data Engineering: what goes in must come out. If your Python script read 57,431 rows from the NYC Parquet file, but your Postgres table only has 57,400, you have a "silent failure": rows were lost without any error being raised.
At minimum, you should always validate that row counts match between stages of your pipeline:
Compare SELECT COUNT(*) FROM raw_trips in Postgres with len(df) in Python.
💡 Better practice: store expected row counts during ingestion and validate them automatically.
-- Example validation table
SELECT
    source_file,
    expected_rows,
    loaded_rows,
    expected_rows - loaded_rows AS difference
FROM ingestion_audit;
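If you only want to see the loads that went wrong, a small follow-up sketch against the same ingestion_audit table:

-- Surface only the loads where counts don't match
SELECT source_file, expected_rows, loaded_rows
FROM ingestion_audit
WHERE expected_rows <> loaded_rows;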
Matching counts is necessary, but unfortunately it is not always sufficient: the right number of rows can still hold the wrong values. For stronger validation, consider comparing simple aggregates between source and destination, such as the sum of a numeric column or the minimum and maximum timestamps, as sketched below. Still, row count checks are your first alarm bell, and something every pipeline should have.
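A minimal sketch of such an aggregate comparison, using columns that appear elsewhere in this chapter (fare_amount, pickup_datetime):

SELECT
    COUNT(*) AS row_count,
    SUM(fare_amount) AS total_fares,
    MIN(pickup_datetime) AS earliest_pickup,
    MAX(pickup_datetime) AS latest_pickup
FROM raw_trips;

If the same figures computed on the source side (e.g. in your Python script) don't match, something was lost or corrupted in transit.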
Since we are joining trips to zones, what happens if a trip has a pickup_location_id of 999, but that ID doesn't exist in the zones table?
This is an Orphaned Key. An INNER JOIN will silently drop that trip from your results, making your revenue look lower than it actually is.
With a simple query like this, you can quickly check if you have this issue in your data:
SELECT DISTINCT t.pickup_location_id
FROM raw_trips t
LEFT JOIN raw_zones z ON t.pickup_location_id = z.location_id
WHERE z.location_id IS NULL;
Any result here means you have trips happening in "Ghost Zones" that aren't in your lookup table.
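To gauge how much this skews your numbers, a sketch that counts the orphaned trips and the revenue attached to them (fare_amount is the same column used in the range checks below):

SELECT
    COUNT(*) AS orphaned_trips,
    SUM(t.fare_amount) AS revenue_at_risk
FROM raw_trips t
LEFT JOIN raw_zones z ON t.pickup_location_id = z.location_id
WHERE z.location_id IS NULL;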
Data often contains "impossible" values. Using our NYC dataset, a "Data Quality" specialist would look for negative fares, absurdly long trips, and pickups that happen in the future:
SELECT *
FROM raw_trips
WHERE fare_amount < 0
OR trip_distance > 500
OR pickup_datetime > NOW();
Sometimes the problem isn’t what the value is, but how it’s stored or whether it follows expected rules. These issues won’t always break your queries, but they will quietly corrupt your analysis.
Common examples:
- Numbers stored as text ("12.5" instead of 12.5)
- Inconsistent date formats ("2024/01/01" vs "01-01-2024")
The query below is an example of how you can catch type issues, for example when columns look numeric but behave like text (common in raw ingestions).
SELECT *
FROM raw_trips
WHERE fare_amount::TEXT !~ '^[0-9]+(\.[0-9]+)?$';
This flags values that don’t look like valid numbers.
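The same idea works for date formats. A sketch, assuming a hypothetical staging table where dates arrived as text:

SELECT *
FROM raw_trips_text                          -- hypothetical table with text columns
WHERE pickup_date !~ '^\d{4}-\d{2}-\d{2}$';  -- flag anything not in YYYY-MM-DD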
At this point, you've seen the most common data quality checks: duplicates, missing values, mismatched row counts, orphaned keys, impossible values, and type or format issues.
None of these checks are complex, but together they answer a critical question: Can I trust this dataset? We need to keep in mind that data quality is not a one-time task; these checks should be repeated every time new data lands, not run once and forgotten.
Instead of running these checks manually in DBeaver, a production-grade pipeline standardizes and automates them.
To avoid rewriting the same queries every time, you can define a small library of validation queries, each with a clear name and purpose:
- duplicate_check_trips
- null_check_pickup_location
- range_check_trip_distance

Each query returns a simple, interpretable result (e.g. a count, percentage, or boolean flag). The key is consistency: every check should be easy to run, compare, and log.
A common pattern is to store the output of these checks in a dedicated table, for example data_quality_logs, with columns such as the check name, the run timestamp, the measured value, and a pass/fail flag.
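A minimal sketch of what such a table might look like (column names are illustrative, not prescribed):

CREATE TABLE data_quality_logs (
    check_name   TEXT NOT NULL,                       -- e.g. 'duplicate_check_trips'
    run_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),  -- when the check ran
    result_value NUMERIC,                             -- count, percentage, etc.
    passed       BOOLEAN                              -- did the check meet its threshold?
);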
Each time your pipeline runs, it executes the validation queries and appends the results to this table.
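For example, the duplicate check from earlier could log its result like this (a sketch, assuming the table above):

INSERT INTO data_quality_logs (check_name, result_value, passed)
SELECT
    'duplicate_check_trips',
    COUNT(*),      -- number of duplicated key combinations
    COUNT(*) = 0   -- the check passes only if there are none
FROM (
    SELECT vendor_id, pickup_datetime, pickup_location_id
    FROM raw_trips
    GROUP BY 1, 2, 3
    HAVING COUNT(*) > 1
) AS dupes;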
This gives you a running history of data quality: you can see exactly when a check started failing and track trends over time.
From here, it’s straightforward to build a dashboard showing the overall health of your data pipeline.
These checks shouldn't be optional; they should be part of the pipeline itself.
A typical flow looks like:
Load raw data → run the validation queries → log the results to data_quality_logs → continue (or stop and alert on failure).
🤓 Curious Geek: The "Write-Audit-Publish" (WAP) Pattern
Advanced teams don't write data directly into the main table. They write to a Staging table, run these validation queries, and only if the validations pass do they move the data into the final table. If the data fails validation, the pipeline stops and alerts the engineer. It's like CI/CD, but for data!
This is what this setup looks like:

Write (load into a staging table) → Audit (run the validation queries) → Publish (move the data into the final table only if the audit passes).
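In plain SQL, a minimal sketch of the three steps (table names are illustrative; in practice an orchestrator gates the publish step on the audit result):

-- Write: the pipeline has already loaded new data into staging_trips.

-- Audit: count rows that fail basic sanity checks.
SELECT COUNT(*) AS failed_rows
FROM staging_trips
WHERE fare_amount < 0 OR pickup_datetime > NOW();

-- Publish: run only if failed_rows = 0.
INSERT INTO trips
SELECT * FROM staging_trips;

TRUNCATE staging_trips;  -- clear staging for the next run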