Week 3 - Ingesting and Validating Data

Data Validation with Pydantic

In Week 2's Dataclasses chapter, you used dataclasses with __post_init__ to validate data. It worked, but you had to write all the validation logic yourself: type checking, range checking, error messages.

Pydantic is a library that does all of that for you, and more. It automatically converts types, enforces constraints, and gives you detailed error messages when data is invalid. If dataclasses are a bicycle, Pydantic is a car.

This chapter teaches you how to define Pydantic models, use automatic type coercion, add constraints, write custom validators, and validate batches of records.

By the end of this chapter, you should be able to:

From Dataclass to Pydantic

Here is the upgrade path. In Week 2's Dataclasses chapter, you wrote:

from dataclasses import dataclass

@dataclass
class WeatherReading:
    station: str
    temperature_c: float
    humidity_pct: int

    def __post_init__(self):
        if not isinstance(self.temperature_c, (int, float)):
            raise TypeError("temperature must be a number")
        if self.humidity_pct < 0 or self.humidity_pct > 100:
            raise ValueError("humidity must be 0-100")

In Week 3, you write the same thing with Pydantic (BaseModel + Field):

from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)

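That is it. Three field declarations replace the hand-written __post_init__ checks. Pydantic handles the type checking, range checking, and error messages for you.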
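As a quick check that the shorter model still enforces the humidity rule, the sketch below builds one valid reading and then tries an out-of-range value. It assumes Pydantic V2 is installed; the sample values are made up for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)


# A valid reading is created like any other object.
ok = WeatherReading(station="Copenhagen", temperature_c=18.5, humidity_pct=72)
print(ok)

# An out-of-range humidity is rejected without any hand-written check.
try:
    WeatherReading(station="Copenhagen", temperature_c=18.5, humidity_pct=140)
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc.error_count(), "validation error")
```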

Installing Pydantic

pip install pydantic

Or with uv (as used in this project):

uv add pydantic

Type Coercion: Pydantic Fixes Your Data

The killer feature of Pydantic is type coercion. When data arrives from a CSV, everything is a string (as you saw in Week 1's File Operations and this week's Reading Multiple File Formats). Pydantic converts it for you.

from pydantic import BaseModel

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int

# CSV data comes in as strings - Pydantic converts automatically
reading = WeatherReading(
    station="Copenhagen",
    temperature_c="18.5",   # String -> float: 18.5
    humidity_pct="72",      # String -> int: 72
)

print(reading.temperature_c)  # 18.5 (a real float, not a string)
print(type(reading.temperature_c))  # <class 'float'>

Compare this to raw CSV handling:

# BAD: manual conversion, crashes on bad data ❌
temp = float(row["temperature_c"])  # ValueError if the value is "N/A"

# GOOD: Pydantic handles conversion and gives clear errors ✅
reading = WeatherReading(**row)  # Converts types or raises ValidationError
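To make the comparison concrete, here is a minimal sketch that feeds csv.DictReader rows straight into the model. The CSV text is hypothetical sample data held in a string; in a real pipeline it would come from a file. Rows that cannot be converted are set aside rather than crashing the loop.

```python
import csv
import io

from pydantic import BaseModel, ValidationError


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int


# Hypothetical CSV content, used here instead of a real file.
csv_text = """station,temperature_c,humidity_pct
Copenhagen,18.5,72
Aarhus,N/A,65
"""

readings = []
failed_rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    try:
        readings.append(WeatherReading(**row))  # every value in row is a str
    except ValidationError:
        failed_rows.append(row)

print(readings)     # one parsed reading with float and int fields
print(failed_rows)  # the "N/A" row, kept for inspection
```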

Field Constraints

Pydantic's Field function lets you add constraints without writing if statements:

from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)             # Not empty
    temperature_c: float = Field(ge=-90, le=60)    # Earth's temperature range
    humidity_pct: int = Field(ge=0, le=100)        # Percentage
    wind_speed_kmh: float = Field(ge=0)            # Not negative

| Constraint | Meaning | Example |
| --- | --- | --- |
| ge=0 | Greater or equal | Non-negative values |
| le=100 | Less or equal | Percentages |
| gt=0 | Strictly greater | Must be positive |
| min_length=1 | Minimum string length | Non-empty strings |
| max_length=50 | Maximum string length | Bounded input |
| pattern=r"^\d{4}-\d{2}-\d{2}$" | Regex pattern | Date format |
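The constraints in the table that the model above does not use (gt, max_length, pattern) work the same way. The sketch below uses a hypothetical StationDay model, invented only for this illustration, and collects the names of the fields that fail.

```python
from pydantic import BaseModel, Field, ValidationError


class StationDay(BaseModel):  # hypothetical model, for illustration only
    station: str = Field(min_length=1, max_length=50)
    day: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    sample_count: int = Field(gt=0)


ok = StationDay(station="Copenhagen", day="2025-01-15", sample_count=24)

try:
    StationDay(station="Copenhagen", day="15/01/2025", sample_count=0)
    failed_fields = []
except ValidationError as exc:
    # Each error records which field it belongs to in its "loc" tuple.
    failed_fields = [err["loc"][0] for err in exc.errors()]

print(failed_fields)
```

Note that the pattern only checks the shape of the string, so a value such as "2025-99-99" would still pass.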

Validation Errors: What Went Wrong

When validation fails, Pydantic raises a ValidationError that does not merely say "invalid data": it tells you exactly which field failed and why.

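from pydantic import BaseModel, Field, ValidationError

# Model for this example: the constraints from above, without wind_speed_kmh
class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)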

try:
    reading = WeatherReading(
        station="",
        temperature_c="not_a_number",
        humidity_pct=150,
    )
except ValidationError as e:
    print(e)

Output (abridged; the full message also shows each input value and a link to the Pydantic error docs):

3 validation errors for WeatherReading
station
  String should have at least 1 character [type=string_too_short]
temperature_c
  Input should be a valid number [type=float_parsing]
humidity_pct
  Input should be less than or equal to 100 [type=less_than_equal]

Three fields, three clear error messages. This is gold for debugging ingestion pipelines.
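Printing the exception is handy for people, but a pipeline usually wants the same information as data. A minimal sketch, assuming Pydantic V2: ValidationError.errors() returns one dictionary per failure, with keys such as "loc", "type", and "msg".

```python
from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)


try:
    WeatherReading(station="", temperature_c="not_a_number", humidity_pct=150)
    problems = []
except ValidationError as exc:
    # Keep just the field name and the machine-readable error type.
    problems = [(err["loc"][0], err["type"]) for err in exc.errors()]

for field, error_type in problems:
    print(f"{field}: {error_type}")
```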

<aside> 💡 Start strict. It is easier to loosen validation later than to find bad data that slipped through months ago.

</aside>

Pydantic's coercion is powerful, but it has boundaries.

<aside> ⚠️ Pydantic coercion has limits. It can convert "18.5" to 18.5, but it cannot convert "hot" to a float. When coercion fails, you get a ValidationError - which is exactly what you want.

</aside>
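To see where those limits sit for an int field, the sketch below tries a few inputs against a hypothetical one-field model (Sample and try_value are names invented for this illustration). It assumes Pydantic V2 in its default, non-strict mode.

```python
from pydantic import BaseModel, ValidationError


class Sample(BaseModel):  # hypothetical model, for illustration only
    humidity_pct: int


def try_value(value):
    """Return the coerced int, or None if Pydantic rejects the value."""
    try:
        return Sample(humidity_pct=value).humidity_pct
    except ValidationError:
        return None


print(try_value("72"))    # numeric string: coerced to int
print(try_value(72.0))    # float with no fractional part: coerced to int
print(try_value(72.5))    # float with a fractional part: rejected
print(try_value("high"))  # non-numeric string: rejected
```

The useful point is that Pydantic refuses conversions that would silently lose information, such as dropping the .5 from 72.5.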

Custom Validators

Sometimes built-in constraints are not enough. You can write custom validators for business logic.

The code below uses decorators (@field_validator, @classmethod). A decorator is a Python feature written as @something above a function. It modifies or extends that function's behavior. @classmethod means the method receives the class itself (as cls) rather than an instance. You do not need to understand decorators deeply right now: just know that @field_validator (Pydantic docs) tells Pydantic to run this method whenever the named field is validated:

from pydantic import BaseModel, Field, field_validator

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)
    timestamp: str

    @field_validator("timestamp")
    @classmethod
    def timestamp_must_be_iso_format(cls, v: str) -> str:
        """Check that timestamp looks like an ISO date."""
        if "T" not in v and " " not in v:
            raise ValueError(f"timestamp must contain date and time, got: {v}")
        return v

    @field_validator("station")
    @classmethod
    def station_must_be_stripped(cls, v: str) -> str:
        """Strip whitespace from station names."""
        return v.strip()

Notice that validators can both check data (raise ValueError if invalid) and transform data (strip whitespace and return the cleaned value).
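Here is a short sketch of both behaviours in action. It uses a trimmed copy of the model (only station and timestamp) to keep the example small, and the sample values are made up.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    timestamp: str

    @field_validator("timestamp")
    @classmethod
    def timestamp_must_be_iso_format(cls, v: str) -> str:
        if "T" not in v and " " not in v:
            raise ValueError(f"timestamp must contain date and time, got: {v}")
        return v

    @field_validator("station")
    @classmethod
    def station_must_be_stripped(cls, v: str) -> str:
        return v.strip()


# Transform: surrounding whitespace is removed by the station validator.
reading = WeatherReading(station="  Copenhagen  ", timestamp="2025-01-15T10:00")
print(repr(reading.station))

# Check: a date without a time part is rejected by the timestamp validator.
try:
    WeatherReading(station="Aarhus", timestamp="2025-01-15")
    messages = []
except ValidationError as exc:
    messages = [err["msg"] for err in exc.errors()]

print(messages)
```

The ValueError raised inside the validator is wrapped by Pydantic, so callers still only need to catch ValidationError.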

<aside> ⌨️ Hands on: Add a @field_validator to the WeatherReading model that converts the station name to title case (e.g., "COPENHAGEN" becomes "Copenhagen"). Hint: use .title() on the string and return the result.

</aside>

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=pydantic_validation&exercise=w3_pydantic_validation__weather_model&lang=python

</aside>

Batch Validation: Processing Many Records

In a real pipeline, you validate hundreds or thousands of records. Use the error accumulation pattern from this week's Production Error Handling (which builds on the approach you saw in Week 2's PipelineRunner):

from pydantic import ValidationError

def validate_readings(raw_records: list[dict]) -> tuple[list[WeatherReading], list[dict]]:
    """Validate a list of raw records, returning valid records and errors."""
    valid = []
    errors = []

    for i, record in enumerate(raw_records):
        try:
            reading = WeatherReading(**record)
            valid.append(reading)
        except ValidationError as e:
            errors.append({
                "index": i,
                "record": record,
                "errors": e.errors(),
            })

    return valid, errors

Usage:

raw_data = [
    {"station": "Copenhagen", "temperature_c": "18.5", "humidity_pct": "72", "timestamp": "2025-01-15T10:00"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150", "timestamp": "bad"},
    {"station": "Aarhus", "temperature_c": "15.2", "humidity_pct": "65", "timestamp": "2025-01-15T11:00"},
]

valid, errors = validate_readings(raw_data)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")
# Valid: 2, Errors: 1

The errors list contains enough detail to debug every failed record.
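One way to use that detail is to count which fields fail most often across a batch. The sketch below is self-contained: it repeats the accumulation loop with a trimmed model (no timestamp) and hypothetical sample records, then tallies failures per field with collections.Counter.

```python
from collections import Counter

from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)


# Hypothetical sample records; the second and third contain bad values.
raw_data = [
    {"station": "Copenhagen", "temperature_c": "18.5", "humidity_pct": "72"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150"},
    {"station": "Aarhus", "temperature_c": "99", "humidity_pct": "65"},
]

errors = []
for i, record in enumerate(raw_data):
    try:
        WeatherReading(**record)
    except ValidationError as e:
        errors.append({"index": i, "record": record, "errors": e.errors()})

# Count how often each field failed across the whole batch.
failures_by_field = Counter(
    err["loc"][0] for entry in errors for err in entry["errors"]
)
print(failures_by_field)
```

A field that dominates the count often points at a problem in the source system rather than at individual bad records.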

Converting Pydantic Models to Dictionaries

When you need to send validated data to a database or file, convert back to a dictionary:

# Single record
reading_dict = reading.model_dump()

# List of records (valid is the list returned by validate_readings)
rows = [r.model_dump() for r in valid]

This is the Pydantic equivalent of dataclasses.asdict() from Week 2's Dataclasses chapter.
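A minimal runnable sketch of the conversion, assuming Pydantic V2. It also shows model_dump_json(), which returns a JSON string instead of a dictionary; the sample values are made up.

```python
import json

from pydantic import BaseModel, Field


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)


# String inputs, as they would arrive from a CSV row.
reading = WeatherReading(station="Copenhagen", temperature_c="18.5", humidity_pct="72")

as_dict = reading.model_dump()       # plain dict holding the converted types
as_json = reading.model_dump_json()  # JSON string, e.g. for writing to a file

print(as_dict)
print(json.loads(as_json) == as_dict)
```

The dictionary holds the converted values (a float and an int), not the original strings, so it is ready to pass on to a database driver.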

<aside> 🤓 Curious Geek: Pydantic powers FastAPI

FastAPI, one of the most popular Python web frameworks, uses Pydantic models for automatic request validation. When you define an API endpoint, FastAPI validates incoming JSON against your Pydantic model, the same pattern you are using here for ingestion. As a data engineer you will mostly consume APIs, but knowing both sides helps you write better ingestion code, and some DE roles do involve building internal APIs to serve data to other teams.

</aside>

Pydantic has evolved significantly between major versions:

<aside> 🤓 Curious Geek: Pydantic V1 vs V2

Pydantic V2 (released in 2023) moved its core validation logic to Rust (the pydantic-core package), and the Pydantic team reports it as roughly 5-50x faster than V1. If you see older tutorials using @validator (instead of @field_validator) or .dict() (instead of .model_dump()), that is V1 syntax. Always use V2 for new projects.

</aside>

Pydantic is now standard across modern Python data and ML codebases:

<aside> 💡 In the wild: LangChain accepts Pydantic models as the schema for structured output and tool calls. The same BaseModel pattern you just wrote can validate the JSON that comes back from the model before application code touches it. That is the same validate-before-processing discipline this chapter teaches.

</aside>

If you want to see each Pydantic mechanic from this chapter running in isolation, three short demo scripts each focus on one behaviour:

<aside> 📦 Pydantic demos (≤40 lines each, one mechanic per file):

Save the gist, run python pydantic_<file>.py, and read the labelled output. The three concepts should land in about five minutes.

</aside>

Time to put your own model to work on a small batch.

Knowledge Check

<aside> 🚀 Try it in the widget: Interactive Quiz: Data Validation with Pydantic

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_3_ch5_pydantic_validation_quiz&embed=1

Prefer a video walkthrough? Here is a beginner-friendly Pydantic V2 tutorial:

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

</aside>

https://www.youtube.com/watch?v=502XOB0u8OY

Once the syntax feels familiar, an LLM can scaffold models for new data shapes: