Week 3 - Ingesting and Validating Data
In Week 2's Dataclasses chapter, you used dataclasses with __post_init__ to validate data. It worked, but you had to write all the validation logic yourself: type checking, range checking, error messages.
Pydantic is a library that does all of that for you, and more. It automatically converts types, enforces constraints, and gives you detailed error messages when data is invalid. If dataclasses are a bicycle, Pydantic is a car.
This chapter teaches you how to define Pydantic models, use automatic type coercion, add constraints, write custom validators, and validate batches of records.
By the end of this chapter, you should be able to:
- Define a BaseModel with typed fields and Field(...) constraints.
- Write a @field_validator that both checks and transforms a value (such as stripping or title-casing a string).

Here is the upgrade path. In Week 2's Dataclasses chapter, you wrote:
from dataclasses import dataclass

@dataclass
class WeatherReading:
    station: str
    temperature_c: float
    humidity_pct: int

    def __post_init__(self):
        if not isinstance(self.temperature_c, (int, float)):
            raise TypeError("temperature must be a number")
        if self.humidity_pct < 0 or self.humidity_pct > 100:
            raise ValueError("humidity must be 0-100")
In Week 3, the same thing with Pydantic (BaseModel + Field):
from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)
That is it. Three lines replace twelve. Pydantic handles:
- If you pass temperature_c="18.5", it converts it to 18.5 (a float)
- Field(ge=0, le=100) means "greater than or equal to 0, less than or equal to 100"

Install Pydantic with pip:
pip install pydantic
Or with uv (as used in this project):
uv add pydantic
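To confirm the install worked, and that you have V2 (the syntax used throughout this chapter), a quick check along these lines should be enough. It assumes only that the package exposes its version string as `pydantic.VERSION`:

```python
import pydantic

# pydantic.VERSION is a string such as "2.7.1"; the first component is the major version.
major_version = int(pydantic.VERSION.split(".")[0])
print(f"Pydantic {pydantic.VERSION} is installed")

if major_version < 2:
    print("This chapter uses V2 syntax (field_validator, model_dump); consider upgrading.")
```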
The killer feature of Pydantic is type coercion. When data arrives from a CSV, everything is a string (as you saw in Week 1's File Operations and this week's Reading Multiple File Formats). Pydantic converts it for you.
from pydantic import BaseModel

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int

# CSV data comes in as strings - Pydantic converts automatically
reading = WeatherReading(
    station="Copenhagen",
    temperature_c="18.5",  # String -> float: 18.5
    humidity_pct="72",     # String -> int: 72
)
print(reading.temperature_c)        # 18.5 (a real float, not a string)
print(type(reading.temperature_c))  # <class 'float'>
Compare this to raw CSV handling:
# BAD: manual conversion, crashes on bad data ❌
temp = float(row["temperature_c"]) # ValueError if the value is "N/A"
# GOOD: Pydantic handles conversion and gives clear errors ✅
reading = WeatherReading(**row) # Converts types or raises ValidationError
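To make the `WeatherReading(**row)` line concrete, here is a minimal sketch that reads rows with the standard library's `csv.DictReader` and hands each one to the model. The CSV text is made up for illustration and is held in memory with `io.StringIO`; in a real pipeline you would open a file instead.

```python
import csv
import io

from pydantic import BaseModel


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int


# Hypothetical CSV content, kept in memory so the example is self-contained.
csv_text = "station,temperature_c,humidity_pct\nCopenhagen,18.5,72\nAarhus,15.2,65\n"

# csv.DictReader yields one dict per row, and every value is a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])

# Unpacking each row into the model coerces the strings to the annotated types.
readings = [WeatherReading(**row) for row in rows]
for reading in readings:
    print(reading.station, reading.temperature_c, reading.humidity_pct)
```

Because the column names in the header match the field names on the model, no manual mapping is needed.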
Pydantic's Field function lets you add constraints without writing if statements:
from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)            # Not empty
    temperature_c: float = Field(ge=-90, le=60)   # Earth's temperature range
    humidity_pct: int = Field(ge=0, le=100)       # Percentage
    wind_speed_kmh: float = Field(ge=0)           # Not negative
| Constraint | Meaning | Example |
|---|---|---|
| ge=0 | Greater than or equal | Non-negative values |
| le=100 | Less than or equal | Percentages |
| gt=0 | Strictly greater | Must be positive |
| min_length=1 | Minimum string length | Non-empty strings |
| max_length=50 | Maximum string length | Bounded input |
| pattern=r"^\d{4}-\d{2}-\d{2}$" | Regex pattern | Date format |
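The constraints in the table that the weather model does not use (`gt`, `max_length`, `pattern`) work the same way. The sketch below uses a hypothetical `StationReport` model, invented purely for illustration, to show them together:

```python
from pydantic import BaseModel, Field, ValidationError


class StationReport(BaseModel):  # hypothetical model, not part of the course project
    station: str = Field(min_length=1, max_length=50)
    report_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    sensor_count: int = Field(gt=0)


ok = StationReport(station="Copenhagen", report_date="2025-01-15", sensor_count=3)
print(ok)

failed_fields = []
try:
    # Wrong date format and a sensor count that is not strictly positive.
    StationReport(station="Copenhagen", report_date="15/01/2025", sensor_count=0)
except ValidationError as e:
    failed_fields = [err["loc"][0] for err in e.errors()]
    print(failed_fields)
```

Note that `pattern` only checks the shape of the string: a value like "2025-99-99" would still pass this regex.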
When validation fails, Pydantic raises a ValidationError that does not merely say "invalid data": it tells you exactly which field failed and why.
from pydantic import ValidationError

try:
    reading = WeatherReading(
        station="",
        temperature_c="not_a_number",
        humidity_pct=150,
        wind_speed_kmh=12.0,  # Valid, so only the three fields above fail
    )
except ValidationError as e:
    print(e)
Output (abbreviated):
3 validation errors for WeatherReading
station
  String should have at least 1 character [type=string_too_short]
temperature_c
  Input should be a valid number [type=float_parsing]
humidity_pct
  Input should be less than or equal to 100 [type=less_than_equal]
Three fields, three clear error messages. This is gold for debugging ingestion pipelines.
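If you want the same information in a form your code can work with, `e.errors()` returns a list of dictionaries. The sketch below, which uses a trimmed-down version of the model, pulls out the field location, message, and error type for each failure:

```python
from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)


summary = []
try:
    WeatherReading(station="", temperature_c="not_a_number", humidity_pct=150)
except ValidationError as e:
    for err in e.errors():
        # "loc" is a tuple describing where the error occurred, e.g. ("station",).
        field = ".".join(str(part) for part in err["loc"])
        summary.append((field, err["type"]))
        print(f"{field}: {err['msg']} ({err['type']})")
```

The `type` value is a stable identifier, so it is a better key for counting or filtering errors than the human-readable message.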
<aside> 💡 Start strict. It is easier to loosen validation later than to find bad data that slipped through months ago.
</aside>
Pydantic's coercion is powerful, but it has boundaries.
<aside>
⚠️ Pydantic coercion has limits. It can convert "18.5" to 18.5, but it cannot convert "hot" to a float. When coercion fails, you get a ValidationError - which is exactly what you want.
</aside>
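A small experiment makes those boundaries visible. This sketch compares the default (lax) behaviour with a field declared as `Field(strict=True)`, which turns coercion off for that field. The `Lax` and `Strict` model names and the `try_build` helper are made up for the demonstration:

```python
from pydantic import BaseModel, Field, ValidationError


class Lax(BaseModel):
    humidity_pct: int


class Strict(BaseModel):
    humidity_pct: int = Field(strict=True)  # no coercion for this field


def try_build(model, value):
    """Return the validated value, or the Pydantic error type if validation fails."""
    try:
        return model(humidity_pct=value).humidity_pct
    except ValidationError as e:
        return e.errors()[0]["type"]


results = {
    "lax '72'": try_build(Lax, "72"),        # numeric string: coerced
    "lax 72.0": try_build(Lax, 72.0),        # float with no fractional part: coerced
    "lax '72.5'": try_build(Lax, "72.5"),    # would lose data: rejected
    "lax 'hot'": try_build(Lax, "hot"),      # not a number at all: rejected
    "strict '72'": try_build(Strict, "72"),  # strict mode refuses strings
    "strict 72": try_build(Strict, 72),      # a real int is accepted
}
for label, outcome in results.items():
    print(f"{label:12} -> {outcome!r}")
```

Integers in the output are successfully validated values; strings are the error type Pydantic reported.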
Sometimes built-in constraints are not enough. You can write custom validators for business logic.
The code below uses decorators (@field_validator, @classmethod). A decorator is a Python feature written as @something above a function. It modifies or extends that function's behavior. @classmethod means the method receives the class itself (as cls) rather than an instance. You do not need to understand decorators deeply right now: just know that @field_validator (Pydantic docs) tells Pydantic to run this method whenever the named field is validated:
from pydantic import BaseModel, Field, field_validator

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)
    timestamp: str

    @field_validator("timestamp")
    @classmethod
    def timestamp_must_be_iso_format(cls, v: str) -> str:
        """Check that timestamp looks like an ISO date."""
        if "T" not in v and " " not in v:
            raise ValueError(f"timestamp must contain date and time, got: {v}")
        return v

    @field_validator("station")
    @classmethod
    def station_must_be_stripped(cls, v: str) -> str:
        """Strip whitespace from station names."""
        return v.strip()
Notice that validators can both check data (raise ValueError if invalid) and transform data (strip whitespace and return the cleaned value).
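Here is a short sketch of both behaviours in action, using a cut-down model with only the two validated fields so it runs on its own. It also shows one subtlety: by default a field validator runs after the built-in Field checks.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    timestamp: str

    @field_validator("station")
    @classmethod
    def station_must_be_stripped(cls, v: str) -> str:
        return v.strip()

    @field_validator("timestamp")
    @classmethod
    def timestamp_must_be_iso_format(cls, v: str) -> str:
        if "T" not in v and " " not in v:
            raise ValueError(f"timestamp must contain date and time, got: {v}")
        return v


# Transform: surrounding whitespace is removed before the value is stored.
cleaned = WeatherReading(station="  Copenhagen  ", timestamp="2025-01-15T10:00")
print(repr(cleaned.station))

# Check: a date without a time part is rejected with the validator's own message.
messages = []
try:
    WeatherReading(station="Aarhus", timestamp="2025-01-15")
except ValidationError as e:
    messages = [err["msg"] for err in e.errors()]
    print(messages)

# Ordering: min_length=1 is checked first, so "   " passes and is then stripped to "".
blank = WeatherReading(station="   ", timestamp="2025-01-15T10:00")
print(repr(blank.station))
```

If whitespace-only station names should be rejected, one option is to raise ValueError inside the validator when the stripped value is empty.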
<aside>
⌨️ Hands on: Add a @field_validator to the WeatherReading model that converts the station name to title case (e.g., "COPENHAGEN" becomes "Copenhagen"). Hint: use .title() on the string and return the result.
</aside>
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=pydantic_validation&exercise=w3_pydantic_validation__weather_model&lang=python
</aside>
In a real pipeline, you validate hundreds or thousands of records. Use the error accumulation pattern from this week's Production Error Handling (which builds on the approach you saw in Week 2's PipelineRunner):
from pydantic import ValidationError

def validate_readings(raw_records: list[dict]) -> tuple[list[WeatherReading], list[dict]]:
    """Validate a list of raw records, returning valid records and errors."""
    valid = []
    errors = []
    for i, record in enumerate(raw_records):
        try:
            reading = WeatherReading(**record)
            valid.append(reading)
        except ValidationError as e:
            errors.append({
                "index": i,
                "record": record,
                "errors": e.errors(),
            })
    return valid, errors
Usage:
raw_data = [
    {"station": "Copenhagen", "temperature_c": "18.5", "humidity_pct": "72", "timestamp": "2025-01-15T10:00"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150", "timestamp": "bad"},
    {"station": "Aarhus", "temperature_c": "15.2", "humidity_pct": "65", "timestamp": "2025-01-15T11:00"},
]
valid, errors = validate_readings(raw_data)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")
# Valid: 2, Errors: 1
The errors list contains enough detail to debug every failed record.
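With many rejected records, a per-field tally is often the fastest way to spot a systematic problem such as a broken sensor or a changed column format. The following is one possible sketch. It rebuilds an errors list in the same shape as the one returned by validate_readings, using a shortened model and made-up data so it runs standalone:

```python
from collections import Counter

from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)


# Made-up sample records; the third has an out-of-range temperature.
raw_data = [
    {"station": "Copenhagen", "temperature_c": "18.5", "humidity_pct": "72"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150"},
    {"station": "Aarhus", "temperature_c": "99", "humidity_pct": "65"},
]

# Same structure as the errors list built by validate_readings.
errors = []
for i, record in enumerate(raw_data):
    try:
        WeatherReading(**record)
    except ValidationError as e:
        errors.append({"index": i, "record": record, "errors": e.errors()})

# Each Pydantic error dict has a "loc" tuple whose first item is the field name.
failures_by_field = Counter(
    err["loc"][0] for item in errors for err in item["errors"]
)
for field, count in failures_by_field.most_common():
    print(f"{field}: {count} failure(s)")
```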
When you need to send validated data to a database or file, convert back to a dictionary:
# Single record
reading_dict = reading.model_dump()
# List of records (valid is the list returned by validate_readings)
rows = [r.model_dump() for r in valid]
This is the Pydantic equivalent of dataclasses.asdict() from Week 2's Dataclasses chapter.
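As an illustration of the "send to a file" step, this sketch feeds the dictionaries from model_dump() into the standard library's csv.DictWriter. It writes to an in-memory buffer so it is self-contained; the `clean.csv` file name mentioned in the comment is hypothetical.

```python
import csv
import io

from pydantic import BaseModel


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int


valid = [
    WeatherReading(station="Copenhagen", temperature_c="18.5", humidity_pct="72"),
    WeatherReading(station="Aarhus", temperature_c="15.2", humidity_pct="65"),
]

# Each model becomes a plain dict with properly typed values.
rows = [r.model_dump() for r in valid]

# In-memory buffer for the demo; to write a real file, use
# open("clean.csv", "w", newline="") instead (hypothetical file name).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())

# model_dump_json() serializes a single record straight to a JSON string.
print(valid[0].model_dump_json())
```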
<aside> 🤓 Curious Geek: Pydantic powers FastAPI
FastAPI, one of the most popular Python web frameworks, uses Pydantic models for automatic request validation. When you define an API endpoint, FastAPI validates incoming JSON against your Pydantic model, the same pattern you are using here for ingestion. As a data engineer you will mostly consume APIs, but knowing both sides helps you write better ingestion code, and some DE roles do involve building internal APIs to serve data to other teams.
</aside>
Pydantic has evolved significantly between major versions:
<aside> 🤓 Curious Geek: Pydantic V1 vs V2
Pydantic V2 (released 2023) was rewritten in Rust and is 5-50x faster than V1. If you see older tutorials using @validator (instead of @field_validator) or .dict() (instead of .model_dump()), that is V1 syntax. Always use V2 for new projects.
</aside>
Pydantic is now standard across modern Python data and ML codebases:
<aside>
💡 In the wild: LangChain uses Pydantic models to define the structured output schema for every LLM tool call. The same BaseModel pattern you just wrote validates every JSON response that comes back from the model before application code touches it, the same validate-before-processing discipline this chapter teaches.
</aside>
If you want to see each Pydantic mechanic from this chapter running in isolation, three short demo scripts each focus on one behaviour:
<aside> 📦 Pydantic demos (≤40 lines each, one mechanic per file):
- "18.5" → 18.5 ✓, "hot" → ✗, strict mode rejection.
- Field(le=100) for range, @field_validator for transforms (@classmethod line included).
- e.errors() vs str(e).

Save the gist, run python pydantic_<file>.py, read the labelled output, and the three concepts land in five minutes.
</aside>
Time to put your own model to work on a small batch.
- What happens when you pass "18.5" to a float field?
- How does Field(ge=0, le=100) replace manual validation in __post_init__?
- When would you write a @field_validator instead of relying on Field(...) constraints?

<aside> 🚀 Try it in the widget: Interactive Quiz: Data Validation with Pydantic
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_3_ch5_pydantic_validation_quiz&embed=1
Prefer a video walkthrough? Here is a beginner-friendly Pydantic V2 tutorial:
<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=502XOB0u8OY
Once the syntax feels familiar, an LLM can scaffold models for new data shapes: