Week 3 - Ingesting and Validating Data
Introduction to Data Ingestion
Assignment: Build a Validated Ingestion Pipeline
In Week 2 (Chapter 5), you used dataclasses with __post_init__ to validate data. It worked, but you had to write all the validation logic yourself: type checking, range checking, error messages.
Pydantic is a library that does all of that for you, and more. It automatically converts types, enforces constraints, and gives you detailed error messages when data is invalid. If dataclasses are a bicycle, Pydantic is a car.
This chapter teaches you how to define Pydantic models, use automatic type coercion, add constraints, write custom validators, and validate batches of records.
Here is the upgrade path. In Week 2 (Chapter 5), you wrote:
```python
from dataclasses import dataclass

@dataclass
class WeatherReading:
    station: str
    temperature_c: float
    humidity_pct: int

    def __post_init__(self):
        if not isinstance(self.temperature_c, (int, float)):
            raise TypeError("temperature must be a number")
        if self.humidity_pct < 0 or self.humidity_pct > 100:
            raise ValueError("humidity must be 0-100")
```
In Week 3, the same thing with Pydantic:
```python
from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)
```
<aside> 🎬 Animation: Pydantic Type Coercion - Strings Become Real Types
</aside>
That is it. Three lines replace twelve. Pydantic handles:

- Type coercion: if you pass `temperature_c="18.5"`, it converts it to `18.5` (a float)
- Constraints: `Field(ge=0, le=100)` means "greater than or equal to 0, less than or equal to 100"

Install Pydantic with pip:

```shell
pip install pydantic
```

Or with uv (as used in this project):

```shell
uv add pydantic
```
The killer feature of Pydantic is type coercion. When data arrives from a CSV, everything is a string (as you saw in Week 1's File Operations and this week's Chapter 4). Pydantic converts it for you.
```python
from pydantic import BaseModel

class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int

# CSV data comes in as strings - Pydantic converts automatically
reading = WeatherReading(
    station="Copenhagen",
    temperature_c="18.5",  # String -> float: 18.5
    humidity_pct="72",     # String -> int: 72
)

print(reading.temperature_c)        # 18.5 (a real float, not a string)
print(type(reading.temperature_c))  # <class 'float'>
```
Compare this to raw CSV handling:
```python
# BAD: manual conversion, crashes on bad data ❌
temp = float(row["temperature_c"])  # ValueError if the value is "N/A"

# GOOD: Pydantic handles conversion and gives clear errors ✅
reading = WeatherReading(**row)  # Converts types or raises ValidationError
```
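To make the comparison concrete, here is a self-contained sketch that feeds `csv.DictReader` rows straight into the model. The CSV content and station names are made up for illustration; a real pipeline would read from a file instead of an in-memory string:

```python
import csv
import io

from pydantic import BaseModel, ValidationError


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int


# In a real pipeline this would come from a file on disk;
# an in-memory file keeps the sketch self-contained.
csv_text = "station,temperature_c,humidity_pct\nCopenhagen,18.5,72\nAarhus,N/A,65\n"

readings = []
for row in csv.DictReader(io.StringIO(csv_text)):
    try:
        readings.append(WeatherReading(**row))  # strings coerced to float/int
    except ValidationError as e:
        print(f"Skipped {row['station']}: {e.errors()[0]['type']}")

print(len(readings))  # 1
```

The bad `"N/A"` row is skipped with a clear machine-readable error type, while the good row arrives fully typed.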
Pydantic's Field function lets you add constraints without writing if statements:
```python
from pydantic import BaseModel, Field

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)                # Not empty
    temperature_c: float = Field(ge=-90, le=60)       # Earth's temperature range
    humidity_pct: int = Field(ge=0, le=100)           # Percentage
    wind_speed_kmh: float = Field(ge=0)               # Not negative
```
| Constraint | Meaning | Example |
|---|---|---|
| `ge=0` | Greater or equal | Non-negative values |
| `le=100` | Less or equal | Percentages |
| `gt=0` | Strictly greater | Must be positive |
| `min_length=1` | Minimum string length | Non-empty strings |
| `max_length=50` | Maximum string length | Bounded input |
| `pattern=r"^\d{4}-\d{2}-\d{2}$"` | Regex pattern | Date format |
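To see the `pattern` and `gt` constraints from the table in action, here is a small sketch using a hypothetical `Report` model (not part of the weather pipeline):

```python
from pydantic import BaseModel, Field, ValidationError


class Report(BaseModel):
    # Regex: exactly YYYY-MM-DD
    date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    score: float = Field(gt=0)  # strictly positive


Report(date="2025-01-15", score=0.5)  # passes

try:
    Report(date="15/01/2025", score=0.5)  # wrong date format
except ValidationError as e:
    print(e.errors()[0]["type"])  # string_pattern_mismatch
```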
When validation fails, Pydantic does not merely say "invalid data." It tells you exactly which field failed and why.
```python
from pydantic import ValidationError

try:
    reading = WeatherReading(
        station="",
        temperature_c="not_a_number",
        humidity_pct=150,
    )
except ValidationError as e:
    print(e)
```
Output:

```
3 validation errors for WeatherReading
station
  String should have at least 1 character [type=string_too_short]
temperature_c
  Input should be a valid number [type=float_parsing]
humidity_pct
  Input should be less than or equal to 100 [type=less_than_equal]
```
Three fields, three clear error messages. This is gold for debugging ingestion pipelines.
<aside> 💡 Start strict. It is easier to loosen validation later than to find bad data that slipped through months ago.
</aside>
Pydantic's coercion is powerful, but it has boundaries.
<aside>
⚠️ Pydantic coercion has limits. It can convert "18.5" to 18.5, but it cannot convert "hot" to a float. When coercion fails, you get a ValidationError - which is exactly what you want.
</aside>
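A minimal sketch of that boundary in action, reusing the plain `WeatherReading` model (station names are illustrative):

```python
from pydantic import BaseModel, ValidationError


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int


# Numeric-looking strings are coerced...
ok = WeatherReading(station="Odense", temperature_c="18.5", humidity_pct="72")
print(ok.temperature_c)  # 18.5

# ...but non-numeric strings fail loudly
try:
    WeatherReading(station="Odense", temperature_c="hot", humidity_pct="72")
except ValidationError as e:
    print(e.errors()[0]["type"])  # float_parsing
```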
Sometimes built-in constraints are not enough. You can write custom validators for business logic.
The code below uses decorators (@field_validator, @classmethod). A decorator is a Python feature written as @something above a function. It modifies or extends that function's behavior. @classmethod means the method receives the class itself (as cls) rather than an instance. You do not need to understand decorators deeply right now: just know that @field_validator("timestamp") tells Pydantic to run this method whenever the timestamp field is validated:
```python
from pydantic import BaseModel, Field, field_validator

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)
    timestamp: str

    @field_validator("timestamp")
    @classmethod
    def timestamp_must_be_iso_format(cls, v: str) -> str:
        """Check that timestamp looks like an ISO date."""
        if "T" not in v and " " not in v:
            raise ValueError(f"timestamp must contain date and time, got: {v}")
        return v

    @field_validator("station")
    @classmethod
    def station_must_be_stripped(cls, v: str) -> str:
        """Strip whitespace from station names."""
        return v.strip()
```
Notice that validators can both check data (raise ValueError if invalid) and transform data (strip whitespace and return the cleaned value).
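To see both behaviors side by side, here is a trimmed-down sketch that keeps only the two validated fields (the input values are made up):

```python
from pydantic import BaseModel, ValidationError, field_validator


class WeatherReading(BaseModel):
    station: str
    timestamp: str

    @field_validator("timestamp")
    @classmethod
    def timestamp_must_be_iso_format(cls, v: str) -> str:
        if "T" not in v and " " not in v:
            raise ValueError(f"timestamp must contain date and time, got: {v}")
        return v

    @field_validator("station")
    @classmethod
    def station_must_be_stripped(cls, v: str) -> str:
        return v.strip()


# Transform: whitespace is stripped before the value is stored
reading = WeatherReading(station="  Copenhagen  ", timestamp="2025-01-15T10:00")
print(repr(reading.station))  # 'Copenhagen'

# Check: a timestamp without date and time is rejected
try:
    WeatherReading(station="Aarhus", timestamp="someday")
except ValidationError as e:
    print(e.errors()[0]["type"])  # value_error
```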
<aside>
⌨️ Hands on: Add a @field_validator to the WeatherReading model that converts the station name to title case (e.g., "COPENHAGEN" becomes "Copenhagen"). Hint: use .title() on the string and return the result.
</aside>
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=pydantic_validation&exercise=w3_pydantic_validation__weather_model&lang=python
</aside>
In a real pipeline, you validate hundreds or thousands of records. Use the error accumulation pattern from this week's Chapter 3 (which builds on the approach you saw in Week 2's PipelineRunner):
```python
from pydantic import ValidationError

def validate_readings(raw_records: list[dict]) -> tuple[list[WeatherReading], list[dict]]:
    """Validate a list of raw records, returning valid records and errors."""
    valid = []
    errors = []
    for i, record in enumerate(raw_records):
        try:
            reading = WeatherReading(**record)
            valid.append(reading)
        except ValidationError as e:
            errors.append({
                "index": i,
                "record": record,
                "errors": e.errors(),
            })
    return valid, errors
```
Usage:
```python
raw_data = [
    {"station": "Copenhagen", "temperature_c": "18.5", "humidity_pct": "72", "timestamp": "2025-01-15T10:00"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150", "timestamp": "bad"},
    {"station": "Aarhus", "temperature_c": "15.2", "humidity_pct": "65", "timestamp": "2025-01-15T11:00"},
]

valid, errors = validate_readings(raw_data)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")
# Valid: 2, Errors: 1
```
The errors list contains enough detail to debug every failed record.
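Each element of an entry's `"errors"` list (the output of `e.errors()`) is a dict with `loc`, `type`, and `msg` keys. A minimal sketch of formatting them as log lines, using a cut-down model:

```python
from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    humidity_pct: int = Field(ge=0, le=100)


try:
    WeatherReading(station="", humidity_pct=150)
except ValidationError as e:
    for err in e.errors():
        # Each error dict names the failing field (loc), a machine-readable
        # type, and a human-readable message (msg)
        print(f"field={err['loc'][0]} type={err['type']} msg={err['msg']}")
```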
When you need to send validated data to a database or file, convert back to a dictionary:
```python
# Single record
reading_dict = reading.model_dump()

# List of records
rows = [r.model_dump() for r in valid_records]
```
This is the Pydantic equivalent of dataclasses.asdict() from Week 2 (Chapter 5).
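A short self-contained sketch of the round trip (the station values are illustrative, and `model_dump_json` stands in for whatever sink you actually write to):

```python
from pydantic import BaseModel


class WeatherReading(BaseModel):
    station: str
    temperature_c: float


valid_records = [
    WeatherReading(station="Copenhagen", temperature_c="18.5"),
    WeatherReading(station="Aarhus", temperature_c="15.2"),
]

# Back to plain dicts, ready for csv.DictWriter or a database driver
rows = [r.model_dump() for r in valid_records]
print(rows[0])  # {'station': 'Copenhagen', 'temperature_c': 18.5}

# model_dump_json serializes a single model straight to a JSON string
print(valid_records[0].model_dump_json())
```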
<aside> 💡 In the wild: FastAPI, one of the most popular Python web frameworks, uses Pydantic models for automatic request validation. When you define an API endpoint, FastAPI validates incoming JSON against your Pydantic model, the same pattern you are using here for data ingestion. As a data engineer, you will mostly consume APIs rather than build them. But understanding how validation works on both sides helps you write better ingestion code, and some DE roles do involve building internal APIs to serve data to other teams.
</aside>
Pydantic has evolved significantly between major versions:
<aside> 🤓 Curious Geek: Pydantic V1 vs V2
Pydantic V2 (released 2023) was rewritten in Rust and is 5-50x faster than V1. If you see older tutorials using @validator (instead of @field_validator) or .dict() (instead of .model_dump()), that is V1 syntax. Always use V2 for new projects.
</aside>
"18.5" to a float field?Field(ge=0, le=100) replace manual validation in __post_init__?<aside> 💡 Using AI to help: Paste a sample API response or CSV row (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask it to generate a Pydantic model with the right types and constraints. It is a fast way to scaffold your validation layer.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.