Week 3 - Ingesting and Validating Data
Introduction to Data Ingestion
APIs are not your only data source. In many pipelines, data arrives as files: a partner sends a CSV, a sensor logs to JSON, an analytics team exports Parquet. Each format has strengths and quirks.
This chapter teaches you how to read CSV, JSON, and Parquet files in Python, understand when to use each format, and normalize them into a consistent structure your pipeline can process.
<aside> 🖼️ Visual: CSV vs JSON vs Parquet
</aside>
In a perfect world, everyone would agree on one format. In reality, your pipeline needs to handle whatever it receives. The goal is to convert all formats into the same internal structure (a list of dictionaries) so the rest of your pipeline does not care where the data came from.
CSV (Comma-Separated Values) is the most common file format in data engineering. It is simple, but that simplicity hides problems.
Building on Week 1 (Chapter 9: File Operations), where you used csv.DictReader to turn rows into dictionaries and learned that every CSV value is a string, this chapter covers the messy real-world cases you encounter during ingestion.
<aside>
⚠️ Reminder: All CSV values are strings. "18.5" is not a float until you convert it. This is why validation (Chapter 5) matters.
</aside>
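To make the reminder concrete, here is a minimal sketch (the station data is invented) showing that csv.DictReader hands back strings until you convert them:

```python
import csv
import io

# A small in-memory CSV to illustrate: every value comes back as a string
raw = "station,temperature\nCopenhagen,18.5\n"
rows = list(csv.DictReader(io.StringIO(raw)))

assert rows[0]["temperature"] == "18.5"  # a string, not a float

# Convert explicitly before doing arithmetic
temperature = float(rows[0]["temperature"])
assert temperature == 18.5
```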
Not all "CSV" files use commas. Some use semicolons (;), tabs (\t), or pipes (|).
```python
import csv

def read_csv_file(path: str, delimiter: str = ",") -> list[dict]:
    """Read a CSV file with a configurable delimiter."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return list(reader)

# European CSVs often use semicolons
data = read_csv_file("data_eu.csv", delimiter=";")
```
<aside> 💡 If a CSV file looks garbled, check the delimiter. Open the file in a text editor and look at what separates the values.
</aside>
Delimiters are not the only thing that can go wrong with CSV files; encoding is another common culprit.
<aside>
⚠️ Watch out for encoding issues. If your CSV contains special characters (like Danish letters: æ, ø, å) and you see garbled text, try encoding="latin-1" instead of "utf-8".
</aside>
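One defensive pattern is to try UTF-8 first and fall back to Latin-1. The helper below is a sketch (read_csv_with_fallback is a made-up name, not a standard function); note that Latin-1 accepts every possible byte, so the fallback always decodes something — check that the output actually looks right.

```python
import csv

def read_csv_with_fallback(path: str, delimiter: str = ",") -> list[dict]:
    """Try UTF-8 first; fall back to Latin-1 if the file is not valid UTF-8."""
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, newline="", encoding=encoding) as f:
                return list(csv.DictReader(f, delimiter=delimiter))
        except UnicodeDecodeError:
            continue  # try the next encoding
    raise ValueError(f"Could not decode {path}")
```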
JSON (JavaScript Object Notation) is the standard format for API responses and web data. Unlike CSV, JSON preserves types (numbers, booleans, nulls) and supports nested structures. Building on Week 1's json.load() and json.dump(), the challenge now is handling unexpected structures from external sources.
<aside>
🔁 Core Program Refresher: Python's json.loads() / json.dumps() are equivalent to JSON.parse() / JSON.stringify() in JavaScript. Same concept, different syntax.
</aside>
JSON files from external sources are not always a simple list. They might wrap data in an object with metadata. A defensive reader handles both cases:
```python
import json

def read_json_file(path: str) -> list[dict]:
    """Read a JSON file and return a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # JSON files might be a list or a dict with a list inside
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and "results" in data:
        return data["results"]
    raise ValueError(f"Unexpected JSON structure in {path}")
```
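To see the branching in isolation, here is a sketch that applies the same checks to already-parsed JSON (to_records is a hypothetical helper; the sample payloads are invented):

```python
import json

def to_records(data) -> list[dict]:
    """Same branching as a defensive JSON reader, applied to parsed data."""
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and "results" in data:
        return data["results"]
    raise ValueError("Unexpected JSON structure")

# A bare list and a wrapped object both normalize to the same shape
assert to_records(json.loads('[{"id": 1}]')) == [{"id": 1}]
assert to_records(json.loads('{"results": [{"id": 2}], "count": 1}')) == [{"id": 2}]
```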
JSON often has nested structures that you need to flatten into simple key-value pairs.
```python
# Nested JSON from an API
nested = {
    "station": "Copenhagen",
    "location": {
        "latitude": 55.67,
        "longitude": 12.56,
    },
    "readings": {
        "temperature": 18.5,
        "humidity": 72,
    },
}

# Flattened for your pipeline
flat = {
    "station": nested["station"],
    "latitude": nested["location"]["latitude"],
    "longitude": nested["location"]["longitude"],
    "temperature": nested["readings"]["temperature"],
    "humidity": nested["readings"]["humidity"],
}
```
A cleaner approach is to write a dedicated flatten function:
```python
def flatten_weather_record(record: dict) -> dict:
    """Flatten a nested weather record into a flat dictionary."""
    return {
        "station": record["station"],
        "latitude": record["location"]["latitude"],
        "longitude": record["location"]["longitude"],
        "temperature": record["readings"]["temperature"],
        "humidity": record["readings"]["humidity"],
    }
```
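If the nesting is deeper or not known in advance, a generic recursive helper can be sketched (flatten_dict is a hypothetical name; it joins key paths with underscores, so its output keys differ slightly from flatten_weather_record above):

```python
def flatten_dict(record: dict, parent_key: str = "") -> dict:
    """Recursively flatten nested dicts, joining key paths with underscores."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}_{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, new_key))
        else:
            flat[new_key] = value
    return flat

nested = {"station": "Copenhagen", "readings": {"temperature": 18.5}}
flatten_dict(nested)
# {"station": "Copenhagen", "readings_temperature": 18.5}
```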
<aside>
⌨️ Hands on: Write a function flatten_records(records: list[dict]) -> list[dict] that takes a list of nested weather records and returns a list of flat dictionaries. Use the flatten_weather_record function above as a helper.
</aside>
<aside> 🔗 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=file_formats&exercise=w3_file_formats__flatten_records&lang=python
</aside>
Parquet is a columnar file format designed for efficient storage and fast analytics. It stores data by column rather than by row, which makes it much faster for queries that only need a few columns from a large dataset.
| Feature | CSV | JSON | Parquet |
|---|---|---|---|
| Human-readable | Yes | Yes | No (binary) |
| Types preserved | No (all strings) | Partially (no dates) | Yes (full schema) |
| File size | Large | Large | Small (compressed) |
| Read speed | Slow | Medium | Fast |
| Nested data | No | Yes | Yes |
| Best for | Small data, sharing | APIs, config files | Analytics, large datasets |
Reading Parquet in Python requires the pyarrow or fastparquet library. The simplest approach is to go through pandas:
```python
import pandas as pd

def read_parquet_file(path: str) -> list[dict]:
    """Read a Parquet file and return a list of dictionaries."""
    df = pd.read_parquet(path)
    return df.to_dict(orient="records")
```
This converts the Parquet data into the same list-of-dicts format as CSV and JSON. Your pipeline does not need to know the original format.
<aside> ⚠️ Out of scope: Fully understanding Parquet internals (row groups, column chunks, compression codecs) is beyond this week. For now, you only need to know how to read Parquet files into your pipeline. The Curious Geek section at the end of this chapter explains what makes the format special if you want to dig deeper.
</aside>
The code above uses pandas, but only as a Parquet reader:
<aside>
💡 You do not need to fully learn pandas for this chapter. It is used here as a convenient Parquet reader. The to_dict(orient="records") method converts a DataFrame into a list of dictionaries, the same list-of-dicts format from earlier chapters. You will learn about pandas in depth in Week 4.
</aside>
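A quick sketch of what that conversion produces, using a hand-built DataFrame in place of a real Parquet file (the station data is invented):

```python
import pandas as pd

# A tiny DataFrame standing in for data loaded from a Parquet file
df = pd.DataFrame({"station": ["Copenhagen", "Aarhus"], "temperature": [18.5, 17.0]})

# to_dict(orient="records") yields one dict per row
records = df.to_dict(orient="records")
assert records == [
    {"station": "Copenhagen", "temperature": 18.5},
    {"station": "Aarhus", "temperature": 17.0},
]
```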
In Week 2 (Chapters 4 and 6), you wrote transformation functions like normalize_names() that cleaned individual fields. The same idea applies here, but at a higher level: instead of cleaning one field, you map entire records from different sources into a single standard shape.
```python
def normalize_weather_record(raw: dict) -> dict:
    """Convert any raw weather record into a standard format."""
    return {
        "station": str(raw.get("station", raw.get("station_name", "unknown"))),
        "timestamp": str(raw.get("timestamp", raw.get("time", raw.get("date", "")))),
        "temperature_c": raw.get("temperature_c", raw.get("temperature", raw.get("temp"))),
        "humidity_pct": raw.get("humidity_pct", raw.get("humidity", raw.get("rh"))),
    }
```
This function handles the fact that different sources use different field names. After normalization, every record has the same keys, and the rest of your pipeline only needs to understand one format.
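To see the mapping in action, here is a sketch with two invented records, one shaped like a CSV row (all strings) and one like a JSON API response (the function is repeated so the snippet runs on its own):

```python
def normalize_weather_record(raw: dict) -> dict:
    """Map varying source field names onto one standard shape."""
    return {
        "station": str(raw.get("station", raw.get("station_name", "unknown"))),
        "timestamp": str(raw.get("timestamp", raw.get("time", raw.get("date", "")))),
        "temperature_c": raw.get("temperature_c", raw.get("temperature", raw.get("temp"))),
        "humidity_pct": raw.get("humidity_pct", raw.get("humidity", raw.get("rh"))),
    }

csv_row = {"station_name": "Aarhus", "date": "2024-06-01", "temp": "17.0", "rh": "68"}
api_row = {"station": "Copenhagen", "time": "2024-06-01T12:00", "temperature": 18.5, "humidity": 72}

# Different source field names, identical output keys
assert normalize_weather_record(csv_row) == {
    "station": "Aarhus",
    "timestamp": "2024-06-01",
    "temperature_c": "17.0",
    "humidity_pct": "68",
}
assert set(normalize_weather_record(api_row)) == {"station", "timestamp", "temperature_c", "humidity_pct"}
```

Note that the CSV-shaped record still carries string values after normalization; converting them to numbers is the job of validation in Chapter 5.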
```python
# Read from different sources
csv_data = read_csv_file("stations.csv")
json_data = read_json_file("api_export.json")
parquet_data = read_parquet_file("archive.parquet")

# Normalize to the same format
all_records = []
for record in csv_data + json_data + parquet_data:
    all_records.append(normalize_weather_record(record))

# Now all records have the same shape - ready for validation
```
<aside>
💡 In the wild: dlt (data load tool) is a Python library for building data pipelines. You import dlt, declare your sources and destinations in plain Python, and dlt handles schema inference, normalization, and loading. The normalization pattern you learned in this chapter, mapping different source formats into a single standard shape, is exactly what dlt does under the hood.
</aside>
<aside> 💡 Using AI to help: When you encounter a messy file, paste a few rows (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask it to detect the delimiter, encoding, or nesting structure. It can also help you write the normalization function to map different field names to your standard schema.
</aside>
<aside> 🖼️ Visual: Row Storage vs Column Storage
</aside>
The visual above shows why Parquet is fast. Here is what happens under the hood:
<aside> 🤓 Curious Geek: Parquet Under the Hood
Columnar storage: CSV stores data row by row. Parquet stores data column by column, so a query like SELECT AVG(age) only reads the age column instead of every row. For tables with 50+ columns, the I/O difference is massive. See Apache Parquet Format for the full specification.
Built-in compression: Same-type columns compress far better than mixed rows. A column of repeating city names compresses more efficiently than interleaved CSV fields. Typical Parquet files are 5-10x smaller than the equivalent CSV. The Parquet encoding documentation explains the encoding schemes (dictionary, run-length, delta) that make this possible.
Schema embedded in the file: Unlike CSV, a Parquet file stores its full schema: column names, types, and nullability. The Thrift schema definition shows exactly what metadata is stored in the file footer.
Limitations: Parquet is not human-readable, appending single rows is inefficient (it is designed for batch writes), and it requires a library like PyArrow to read in Python. For small datasets or quick data sharing, CSV is still simpler.
Where you will see it: Data lakes (S3, GCS), Spark jobs, BigQuery exports, and any analytics pipeline dealing with millions of rows. The Databricks guide to Parquet gives a good overview of how it fits into the modern data stack.
</aside>
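The row-versus-column idea can be illustrated in plain Python with two layouts of the same invented dataset:

```python
# Row-oriented: one dict per record (like CSV/JSON in this chapter)
rows = [
    {"name": "Anna", "city": "Copenhagen", "age": 34},
    {"name": "Bo", "city": "Copenhagen", "age": 28},
    {"name": "Carla", "city": "Aarhus", "age": 41},
]

# Column-oriented: one list per field (the idea behind Parquet)
columns = {
    "name": ["Anna", "Bo", "Carla"],
    "city": ["Copenhagen", "Copenhagen", "Aarhus"],
    "age": [34, 28, 41],
}

# Averaging ages touches every dict in the row layout,
# but only a single list in the column layout
avg_rows = sum(r["age"] for r in rows) / len(rows)
avg_cols = sum(columns["age"]) / len(columns["age"])
assert avg_rows == avg_cols
```

On disk this difference is what lets a columnar reader skip the name and city data entirely when a query only needs age.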
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.