Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

✅ Data Validation with Pydantic

In Week 2 (Chapter 5), you used dataclasses with __post_init__ to validate data. It worked, but you had to write all the validation logic yourself: type checking, range checking, error messages.

Pydantic is a library that does all of that for you, and more. It automatically converts types, enforces constraints, and gives you detailed error messages when data is invalid. If dataclasses are a bicycle, Pydantic is a car.

This chapter teaches you how to define Pydantic models, use automatic type coercion, add constraints, write custom validators, and validate batches of records.

From Dataclass to Pydantic

Here is the upgrade path. In Week 2 (Chapter 5), you wrote:

from dataclasses import dataclass

@dataclass
class WeatherReading:
    station: str
    temperature_c: float
    humidity_pct: int

    def __post_init__(self):
        if not isinstance(self.temperature_c, (int, float)):
            raise TypeError("temperature must be a number")
        if self.humidity_pct < 0 or self.humidity_pct > 100:
            raise ValueError("humidity must be 0-100")

In Week 3, the same thing with Pydantic:

from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)

<aside> 🎬 Animation: Pydantic Type Coercion - Strings Become Real Types

</aside>

That is it. Three field declarations replace a dozen lines of manual validation. Pydantic handles the type checking, the range checking, and the error messages for you.

Installing Pydantic

pip install pydantic

Or with uv (as used in this project):

uv add pydantic

Type Coercion: Pydantic Fixes Your Data

The killer feature of Pydantic is type coercion. When data arrives from a CSV, everything is a string (as you saw in Week 1's File Operations and this week's Chapter 4). Pydantic converts it for you.

from pydantic import BaseModel

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int

# CSV data comes in as strings - Pydantic converts automatically
reading = WeatherReading(
    station="Copenhagen",
    temperature_c="18.5",   # String -> float: 18.5
    humidity_pct="72",      # String -> int: 72
)

print(reading.temperature_c)  # 18.5 (a real float, not a string)
print(type(reading.temperature_c))  # <class 'float'>

Compare this to raw CSV handling:

# BAD: manual conversion, crashes on bad data ❌
temp = float(row["temperature_c"])  # ValueError if the value is "N/A"

# GOOD: Pydantic handles conversion and gives clear errors ✅
reading = WeatherReading(**row)  # Converts types or raises ValidationError

Field Constraints

Pydantic's Field function lets you add constraints without writing if statements:

from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)             # Not empty
    temperature_c: float = Field(ge=-90, le=60)    # Earth's temperature range
    humidity_pct: int = Field(ge=0, le=100)        # Percentage
    wind_speed_kmh: float = Field(ge=0)            # Not negative
| Constraint | Meaning | Example |
| --- | --- | --- |
| ge=0 | Greater or equal | Non-negative values |
| le=100 | Less or equal | Percentages |
| gt=0 | Strictly greater | Must be positive |
| min_length=1 | Minimum string length | Non-empty strings |
| max_length=50 | Maximum string length | Bounded input |
| pattern=r"^\d{4}-\d{2}-\d{2}$" | Regex pattern | Date format |
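The pattern constraint deserves a quick demonstration, since regexes fail silently in many hand-rolled validators. A minimal sketch (the DailySummary model and its field names are illustrative, not from the curriculum):

```python
from pydantic import BaseModel, Field, ValidationError

class DailySummary(BaseModel):
    # Regex constraint: date must look like YYYY-MM-DD
    date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    max_temp_c: float = Field(ge=-90, le=60)

ok = DailySummary(date="2025-01-15", max_temp_c=18.5)

try:
    DailySummary(date="15/01/2025", max_temp_c=18.5)  # wrong date format
except ValidationError as e:
    print(e.error_count())  # 1
```

The regex only anchors the shape of the string; it does not check that the date actually exists (e.g. "2025-99-99" would pass). For real calendar validation, use a date type instead of a pattern.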

Validation Errors: What Went Wrong

When validation fails, Pydantic does not merely say "invalid data." It tells you exactly which field failed and why.

from pydantic import ValidationError

try:
    reading = WeatherReading(
        station="",
        temperature_c="not_a_number",
        humidity_pct=150,
    )
except ValidationError as e:
    print(e)

Output:

3 validation errors for WeatherReading
station
  String should have at least 1 character [type=string_too_short]
temperature_c
  Input should be a valid number [type=float_parsing]
humidity_pct
  Input should be less than or equal to 100 [type=less_than_equal]

Three fields, three clear error messages. This is gold for debugging ingestion pipelines.
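Printing the error is fine for humans; pipelines usually want structure. ValidationError.errors() returns a list of dictionaries, one per failing field, which you can inspect programmatically. A small sketch using the constrained model from above:

```python
from pydantic import BaseModel, Field, ValidationError

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    humidity_pct: int = Field(ge=0, le=100)

try:
    WeatherReading(station="", humidity_pct=150)
except ValidationError as e:
    for err in e.errors():
        # Each dict carries the field location, a machine-readable type, and a message
        print(err["loc"], err["type"], err["msg"])
```

This is the same e.errors() call the batch-validation pattern below relies on.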

<aside> 💡 Start strict. It is easier to loosen validation later than to find bad data that slipped through months ago.

</aside>

Pydantic's coercion is powerful, but it has boundaries.

<aside> ⚠️ Pydantic coercion has limits. It can convert "18.5" to 18.5, but it cannot convert "hot" to a float. When coercion fails, you get a ValidationError - which is exactly what you want.

</aside>
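You can see both sides of that boundary in a few lines (the single-field Reading model here is just for illustration):

```python
from pydantic import BaseModel, ValidationError

class Reading(BaseModel):
    temperature_c: float

ok = Reading(temperature_c="18.5")   # "18.5" is coerced to the float 18.5
print(ok.temperature_c)              # 18.5

try:
    Reading(temperature_c="hot")     # no sensible float here
except ValidationError as e:
    print(e.errors()[0]["type"])     # float_parsing
```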

Custom Validators

Sometimes built-in constraints are not enough. You can write custom validators for business logic.

The code below uses decorators (@field_validator, @classmethod). A decorator is a Python feature written as @something above a function. It modifies or extends that function's behavior. @classmethod means the method receives the class itself (as cls) rather than an instance. You do not need to understand decorators deeply right now: just know that @field_validator("timestamp") tells Pydantic to run this method whenever the timestamp field is validated:

from pydantic import BaseModel, Field, field_validator

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)
    timestamp: str

    @field_validator("timestamp")
    @classmethod
    def timestamp_must_be_iso_format(cls, v: str) -> str:
        """Check that timestamp looks like an ISO date."""
        if "T" not in v and " " not in v:
            raise ValueError(f"timestamp must contain date and time, got: {v}")
        return v

    @field_validator("station")
    @classmethod
    def station_must_be_stripped(cls, v: str) -> str:
        """Strip whitespace from station names."""
        return v.strip()

Notice that validators can both check data (raise ValueError if invalid) and transform data (strip whitespace and return the cleaned value).
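Field validators see one field at a time. When a rule involves several fields, Pydantic V2 also provides @model_validator(mode="after"), which runs once all fields have passed. A hedged sketch (the dew-point rule and the dew_point_c field are invented for this example, not part of the chapter's model):

```python
from pydantic import BaseModel, Field, ValidationError, model_validator

class WeatherReading(BaseModel):
    temperature_c: float = Field(ge=-90, le=60)
    dew_point_c: float

    @model_validator(mode="after")
    def dew_point_cannot_exceed_temperature(self) -> "WeatherReading":
        # Runs after field validation, so it can compare fields to each other
        if self.dew_point_c > self.temperature_c:
            raise ValueError("dew_point_c cannot exceed temperature_c")
        return self

ok = WeatherReading(temperature_c=18.5, dew_point_c=12.0)   # passes

try:
    WeatherReading(temperature_c=18.5, dew_point_c=25.0)    # fails cross-field check
except ValidationError as e:
    print(e.error_count())  # 1
```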

<aside> ⌨️ Hands on: Add a @field_validator to the WeatherReading model that converts the station name to title case (e.g., "COPENHAGEN" becomes "Copenhagen"). Hint: use .title() on the string and return the result.

</aside>

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=pydantic_validation&exercise=w3_pydantic_validation__weather_model&lang=python

</aside>

Batch Validation: Processing Many Records

In a real pipeline, you validate hundreds or thousands of records. Use the error accumulation pattern from this week's Chapter 3 (which builds on the approach you saw in Week 2's PipelineRunner):

from pydantic import ValidationError

def validate_readings(raw_records: list[dict]) -> tuple[list[WeatherReading], list[dict]]:
    """Validate a list of raw records, returning valid records and errors."""
    valid = []
    errors = []

    for i, record in enumerate(raw_records):
        try:
            reading = WeatherReading(**record)
            valid.append(reading)
        except ValidationError as e:
            errors.append({
                "index": i,
                "record": record,
                "errors": e.errors(),
            })

    return valid, errors

Usage:

raw_data = [
    {"station": "Copenhagen", "temperature_c": "18.5", "humidity_pct": "72", "timestamp": "2025-01-15T10:00"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150", "timestamp": "bad"},
    {"station": "Aarhus", "temperature_c": "15.2", "humidity_pct": "65", "timestamp": "2025-01-15T11:00"},
]

valid, errors = validate_readings(raw_data)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")
# Valid: 2, Errors: 1

The errors list contains enough detail to debug every failed record.

Converting Pydantic Models to Dictionaries

When you need to send validated data to a database or file, convert back to a dictionary:

# Single record
reading_dict = reading.model_dump()

# List of records
rows = [r.model_dump() for r in valid_records]

This is the Pydantic equivalent of dataclasses.asdict() from Week 2 (Chapter 5).

<aside> 💡 In the wild: FastAPI, one of the most popular Python web frameworks, uses Pydantic models for automatic request validation. When you define an API endpoint, FastAPI validates incoming JSON against your Pydantic model, the same pattern you are using here for data ingestion. As a data engineer, you will mostly consume APIs rather than build them. But understanding how validation works on both sides helps you write better ingestion code, and some DE roles do involve building internal APIs to serve data to other teams.

</aside>

Pydantic has evolved significantly between major versions:

<aside> 🤓 Curious Geek: Pydantic V1 vs V2

Pydantic V2 (released 2023) was rewritten in Rust and is 5-50x faster than V1. If you see older tutorials using @validator (instead of @field_validator) or .dict() (instead of .model_dump()), that is V1 syntax. Always use V2 for new projects.

</aside>

🧠 Knowledge Check

  1. What does Pydantic do automatically when you pass a string "18.5" to a float field?
  2. How does Field(ge=0, le=100) replace manual validation in __post_init__?
  3. Why should you use batch validation (accumulating errors) instead of validating one record at a time?

<aside> 💡 Using AI to help: Paste a sample API response or CSV row (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask it to generate a Pydantic model with the right types and constraints. It is a fast way to scaffold your validation layer.

</aside>

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.