Week 3 - Ingesting and Validating Data
Introduction to Data Ingestion
Assignment: Build a Validated Ingestion Pipeline
Going Further: Optional Deep Dives
History: APIs and Data Transfer
APIs are not your only data source. In many pipelines, data arrives as files: a partner sends a CSV, a sensor logs to JSON, an analytics team exports Parquet. Each format has strengths and quirks.
This chapter teaches you how to read CSV, JSON, and Parquet files in Python, understand when to use each format, and normalize them into a consistent structure your pipeline can process.
By the end of this chapter, you should be able to:
pandas.read_parquet() and convert it to the same list-of-dicts shape as the other readers.<aside> πΌοΈ Visual: CSV vs JSON vs Parquet
</aside>
In a perfect world, everyone would agree on one format. In reality:
Your pipeline needs to handle whatever it receives. The goal is to convert all formats into the same internal structure (a list of dictionaries) so the rest of your pipeline does not care where the data came from.
CSV (Comma-Separated Values) is the most common file format in data engineering. It is simple, but that simplicity hides problems.
Building on Week 1's File Operations, where you used csv.DictReader to turn rows into dictionaries and learned that every CSV value is a string, this chapter covers the messy real-world cases you encounter during ingestion.
<aside>
β οΈ Reminder: All CSV values are strings. "18.5" is not a float until you convert it. This is why validation (in Data Validation with Pydantic) matters.
</aside>
Not all "CSV" files use commas. Some use semicolons (;), tabs (\t), or pipes (|).
Python's csv module reads delimited text files into row dicts.
import csv
def read_csv_file(path: str, delimiter: str = ",") -> list[dict]:
"""Read a CSV file with a configurable delimiter."""
with open(path, newline="", encoding="utf-8") as f:
reader = csv.DictReader(f, delimiter=delimiter)
return list(reader)
# European CSVs often use semicolons
data = read_csv_file("data_eu.csv", delimiter=";")
<aside> π‘ If a CSV file looks garbled, check the delimiter. Open the file in a text editor and look at what separates the values.
</aside>
Delimiters are not the only thing that can go wrong with CSV files:
<aside>
β οΈ Watch out for encoding issues. If your CSV contains special characters (like Danish letters: Γ¦, ΓΈ, Γ₯) and you see garbled text, try encoding="latin-1" instead of "utf-8".
</aside>
JSON (JavaScript Object Notation) is the standard format for API responses and web data. Unlike CSV, JSON preserves types (numbers, booleans, nulls) and supports nested structures. Building on Week 1's json.load() and json.dump(), the challenge now is handling unexpected structures from external sources.
<aside>
π Core Program Refresher: Python's json.load() / json.dumps() are equivalent to JSON.parse() / JSON.stringify() in JavaScript. Same concept, different syntax.
</aside>
JSON files from external sources are not always a simple list. They might wrap data in an object with metadata. A defensive reader handles both cases:
import json
def read_json_file(path: str) -> list[dict]:
"""Read a JSON file and return a list of dictionaries."""
with open(path, encoding="utf-8") as f:
data = json.load(f)
# JSON files might be a list or a dict with a list inside
if isinstance(data, list):
return data
if isinstance(data, dict) and "results" in data:
return data["results"]
raise ValueError(f"Unexpected JSON structure in {path}")
JSON often has nested structures that you need to flatten into simple key-value pairs.
# Nested JSON from an API
nested = {
"station": "Copenhagen",
"location": {
"latitude": 55.67,
"longitude": 12.56,
},
"readings": {
"temperature": 18.5,
"humidity": 72,
},
}
# Flattened for your pipeline
flat = {
"station": nested["station"],
"latitude": nested["location"]["latitude"],
"longitude": nested["location"]["longitude"],
"temperature": nested["readings"]["temperature"],
"humidity": nested["readings"]["humidity"],
}
A cleaner approach is to write a dedicated flatten function:
def flatten_weather_record(record: dict) -> dict:
"""Flatten a nested weather record into a flat dictionary."""
return {
"station": record["station"],
"latitude": record["location"]["latitude"],
"longitude": record["location"]["longitude"],
"temperature": record["readings"]["temperature"],
"humidity": record["readings"]["humidity"],
}
<aside>
β¨οΈ Hands on: Write a function flatten_records(records: list[dict]) -> list[dict] that takes a list of nested weather records and returns a list of flat dictionaries. Use the flatten_weather_record function above as a helper.
</aside>
<aside> π Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=file_formats&exercise=w3_file_formats__flatten_records&lang=python
</aside>
Parquet is a columnar storage file format designed for efficient storage and fast analytics. It stores data by column rather than by row, which makes it much faster for queries that only need a few columns from a large dataset.
| Feature | CSV | JSON | Parquet |
|---|---|---|---|
| Human-readable | Yes | Yes | No (binary) |
| Types preserved | No (all strings) | Partially (no dates) | Yes (full schema) |
| File size | Large | Large | Small (compressed) |
| Read speed | Slow | Medium | Fast |
| Nested data | No | Yes | Yes |
| Best for | Small data, sharing | APIs, config files | Analytics, large datasets |
Parquet requires the pyarrow or fastparquet library. Install with uv pip install pyarrow. The simplest way to read is with pandas via pandas.read_parquet():
import pandas as pd
def read_parquet_file(path: str) -> list[dict]:
"""Read a Parquet file and return a list of dictionaries."""
df = pd.read_parquet(path)
return df.to_dict(orient="records")
This converts the Parquet data into the same list-of-dicts format as CSV and JSON. Your pipeline does not need to know the original format.
<aside> β οΈ Out of scope: Fully understanding Parquet internals (row groups, column chunks, compression codecs) is beyond this week. For now, you only need to know how to read Parquet files into your pipeline. The Curious Geek section at the end of this chapter explains what makes the format special if you want to dig deeper.
</aside>
The code above uses pandas, but only as a Parquet reader:
<aside>
π‘ You do not need to fully learn pandas for this chapter. It is used here as a convenient Parquet reader. The to_dict(orient="records") method converts a DataFrame into a list of dictionaries, the same list-of-dicts format from earlier chapters. You will meet pandas in depth later in the track.
</aside>
Parquet is the default storage format for large-scale data engineering.
<aside>
π‘ In the wild: Hugging Face Datasets stores every published dataset in Parquet format. When you call load_dataset("squad"), it downloads Parquet files and reads them with PyArrow. The same columnar format that speeds up queries on your laptop powers one of the world's largest ML dataset repositories.
</aside>
In Week 2 (Chapters 4 and 6), you wrote transformation functions like normalize_names() that cleaned individual fields. The same idea applies here, but at a higher level: instead of cleaning one field, you map entire records from different sources into a single standard shape.
def normalize_weather_record(raw: dict) -> dict:
"""Convert any raw weather record into a standard format."""
return {
"station": str(raw.get("station", raw.get("station_name", "unknown"))),
"timestamp": str(raw.get("timestamp", raw.get("time", raw.get("date", "")))),
"temperature_c": raw.get("temperature_c", raw.get("temperature", raw.get("temp"))),
"humidity_pct": raw.get("humidity_pct", raw.get("humidity", raw.get("rh"))),
}
This function handles the fact that different sources use different field names. After normalization, every record has the same keys, and the rest of your pipeline only needs to understand one format.
# Read from different sources
csv_data = read_csv_file("stations.csv")
json_data = read_json_file("api_export.json")
parquet_data = read_parquet_file("archive.parquet")
# Normalize to the same format
all_records = []
for record in csv_data + json_data + parquet_data:
all_records.append(normalize_weather_record(record))
# Now all records have the same shape - ready for validation
Try the normalization pattern yourself before moving on.
, vs ;), and how do you spot each issue in a misbehaving CSV?<aside> π Try it in the widget: Interactive Quiz: Reading Multiple File Formats
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_3_ch4_reading_file_formats_quiz&embed=1
Prefer a video walkthrough? Here is a beginner-friendly tour of the csv module:
<aside> π¬ Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=q5uM4VKywbA
Once the readers feel routine, lean on an LLM for the messy-file diagnosis:
<aside> π‘ Using AI to help: When you encounter a messy file, paste a few rows (β οΈ Ensure no PII or sensitive company data is included!) into an LLM and ask it to detect the delimiter, encoding, or nesting structure. It can also help you write the normalization function to map different field names to your standard schema.
</aside>
<aside> π For declarative ingestion frameworks (dlt, Airbyte), full courses, and community resources, see the optional Going Further page.
</aside>
One last aside before the next chapter:
<aside> π€ Curious Geek: The CSV "standard" is not really a standard
</aside>