Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Gotchas & Pitfalls

Practice

Assignment: Build a Validated Ingestion Pipeline

Career relevance: Week 3

Week 3 Glossary

Going Further: Optional Deep Dives

Week 3 Kickoff Slides

History: APIs and Data Transfer

Reading Multiple File Formats

APIs are not your only data source. In many pipelines, data arrives as files: a partner sends a CSV, a sensor logs to JSON, an analytics team exports Parquet. Each format has strengths and quirks.

This chapter teaches you how to read CSV, JSON, and Parquet files in Python, understand when to use each format, and normalize them into a consistent structure your pipeline can process.

By the end of this chapter, you should be able to:

Why Multiple Formats?

<aside> πŸ–ΌοΈ Visual: CSV vs JSON vs Parquet

</aside>

In a perfect world, everyone would agree on one format. In reality:

Your pipeline needs to handle whatever it receives. The goal is to convert all formats into the same internal structure (a list of dictionaries) so the rest of your pipeline does not care where the data came from.

CSV: The Universal (and Fragile) Format

CSV (Comma-Separated Values) is the most common file format in data engineering. It is simple, but that simplicity hides problems.

Building on Week 1's File Operations, where you used csv.DictReader to turn rows into dictionaries and learned that every CSV value is a string, this chapter covers the messy real-world cases you encounter during ingestion.

<aside> ⚠️ Reminder: All CSV values are strings. "18.5" is not a float until you convert it. This is why validation (in Data Validation with Pydantic) matters.

</aside>

Handling Different Delimiters

Not all "CSV" files use commas. Some use semicolons (;), tabs (\t), or pipes (|).

Python's csv module reads delimited text files into row dicts.

import csv

def read_csv_file(path: str, delimiter: str = ",") -> list[dict]:
    """Read a CSV file with a configurable delimiter."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return list(reader)

# European CSVs often use semicolons
data = read_csv_file("data_eu.csv", delimiter=";")

<aside> πŸ’‘ If a CSV file looks garbled, check the delimiter. Open the file in a text editor and look at what separates the values.

</aside>

Delimiters are not the only thing that can go wrong with CSV files:

<aside> ⚠️ Watch out for encoding issues. If your CSV contains special characters (like Danish letters: Γ¦, ΓΈ, Γ₯) and you see garbled text, try encoding="latin-1" instead of "utf-8".

</aside>

JSON: Structured but Nested

JSON (JavaScript Object Notation) is the standard format for API responses and web data. Unlike CSV, JSON preserves types (numbers, booleans, nulls) and supports nested structures. Building on Week 1's json.load() and json.dump(), the challenge now is handling unexpected structures from external sources.

<aside> πŸ“˜ Core Program Refresher: Python's json.load() / json.dumps() are equivalent to JSON.parse() / JSON.stringify() in JavaScript. Same concept, different syntax.

</aside>

Handling Unexpected JSON Structure

JSON files from external sources are not always a simple list. They might wrap data in an object with metadata. A defensive reader handles both cases:

import json

def read_json_file(path: str) -> list[dict]:
    """Read a JSON file and return a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    # JSON files might be a list or a dict with a list inside
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and "results" in data:
        return data["results"]

    raise ValueError(f"Unexpected JSON structure in {path}")

Flattening Nested JSON

JSON often has nested structures that you need to flatten into simple key-value pairs.

# Nested JSON from an API
nested = {
    "station": "Copenhagen",
    "location": {
        "latitude": 55.67,
        "longitude": 12.56,
    },
    "readings": {
        "temperature": 18.5,
        "humidity": 72,
    },
}

# Flattened for your pipeline
flat = {
    "station": nested["station"],
    "latitude": nested["location"]["latitude"],
    "longitude": nested["location"]["longitude"],
    "temperature": nested["readings"]["temperature"],
    "humidity": nested["readings"]["humidity"],
}

A cleaner approach is to write a dedicated flatten function:

def flatten_weather_record(record: dict) -> dict:
    """Flatten a nested weather record into a flat dictionary."""
    return {
        "station": record["station"],
        "latitude": record["location"]["latitude"],
        "longitude": record["location"]["longitude"],
        "temperature": record["readings"]["temperature"],
        "humidity": record["readings"]["humidity"],
    }

<aside> ⌨️ Hands on: Write a function flatten_records(records: list[dict]) -> list[dict] that takes a list of nested weather records and returns a list of flat dictionaries. Use the flatten_weather_record function above as a helper.

</aside>

<aside> πŸš€ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=file_formats&exercise=w3_file_formats__flatten_records&lang=python

</aside>

Parquet: The Analytics Powerhouse

Parquet is a columnar storage file format designed for efficient storage and fast analytics. It stores data by column rather than by row, which makes it much faster for queries that only need a few columns from a large dataset.

Feature CSV JSON Parquet
Human-readable Yes Yes No (binary)
Types preserved No (all strings) Partially (no dates) Yes (full schema)
File size Large Large Small (compressed)
Read speed Slow Medium Fast
Nested data No Yes Yes
Best for Small data, sharing APIs, config files Analytics, large datasets

Reading Parquet

Parquet requires the pyarrow or fastparquet library. Install with uv pip install pyarrow. The simplest way to read is with pandas via pandas.read_parquet():

import pandas as pd

def read_parquet_file(path: str) -> list[dict]:
    """Read a Parquet file and return a list of dictionaries."""
    df = pd.read_parquet(path)
    return df.to_dict(orient="records")

This converts the Parquet data into the same list-of-dicts format as CSV and JSON. Your pipeline does not need to know the original format.

<aside> ⚠️ Out of scope: Fully understanding Parquet internals (row groups, column chunks, compression codecs) is beyond this week. For now, you only need to know how to read Parquet files into your pipeline. The Curious Geek section at the end of this chapter explains what makes the format special if you want to dig deeper.

</aside>

The code above uses pandas, but only as a Parquet reader:

<aside> πŸ’‘ You do not need to fully learn pandas for this chapter. It is used here as a convenient Parquet reader. The to_dict(orient="records") method converts a DataFrame into a list of dictionaries, the same list-of-dicts format from earlier chapters. You will meet pandas in depth later in the track.

</aside>

Parquet is the default storage format for large-scale data engineering.

<aside> πŸ’‘ In the wild: Hugging Face Datasets stores every published dataset in Parquet format. When you call load_dataset("squad"), it downloads Parquet files and reads them with PyArrow. The same columnar format that speeds up queries on your laptop powers one of the world's largest ML dataset repositories.

</aside>

Normalizing to a Common Format

In Week 2 (Chapters 4 and 6), you wrote transformation functions like normalize_names() that cleaned individual fields. The same idea applies here, but at a higher level: instead of cleaning one field, you map entire records from different sources into a single standard shape.

def normalize_weather_record(raw: dict) -> dict:
    """Convert any raw weather record into a standard format."""
    return {
        "station": str(raw.get("station", raw.get("station_name", "unknown"))),
        "timestamp": str(raw.get("timestamp", raw.get("time", raw.get("date", "")))),
        "temperature_c": raw.get("temperature_c", raw.get("temperature", raw.get("temp"))),
        "humidity_pct": raw.get("humidity_pct", raw.get("humidity", raw.get("rh"))),
    }

This function handles the fact that different sources use different field names. After normalization, every record has the same keys, and the rest of your pipeline only needs to understand one format.

# Read from different sources
csv_data = read_csv_file("stations.csv")
json_data = read_json_file("api_export.json")
parquet_data = read_parquet_file("archive.parquet")

# Normalize to the same format
all_records = []
for record in csv_data + json_data + parquet_data:
    all_records.append(normalize_weather_record(record))

# Now all records have the same shape - ready for validation

Try the normalization pattern yourself before moving on.

Knowledge Check

<aside> πŸš€ Try it in the widget: Interactive Quiz: Reading Multiple File Formats

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_3_ch4_reading_file_formats_quiz&embed=1

Prefer a video walkthrough? Here is a beginner-friendly tour of the csv module:

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=q5uM4VKywbA

Once the readers feel routine, lean on an LLM for the messy-file diagnosis:

<aside> πŸ’‘ Using AI to help: When you encounter a messy file, paste a few rows (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask it to detect the delimiter, encoding, or nesting structure. It can also help you write the normalization function to map different field names to your standard schema.

</aside>

Extra reading

<aside> πŸ“š For declarative ingestion frameworks (dlt, Airbyte), full courses, and community resources, see the optional Going Further page.

</aside>

One last aside before the next chapter:

<aside> πŸ€“ Curious Geek: The CSV "standard" is not really a standard

</aside>