Week 3 - Ingesting and Validating Data
Introduction to Data Ingestion
Assignment: Build a Validated Ingestion Pipeline
Welcome to Week 3! Students have read the material and built clean pipelines in Week 2. Today the focus shifts from "processing local data" to "ingesting external data safely." The goal is hands-on practice: fetch from an API, validate with Pydantic, and store in SQLite - all in class.
By the end of this lesson, students should be able to:
| Time | Activity | Duration |
|---|---|---|
| 0:00 | Welcome & Kahoot Quiz | 15 min |
| 0:15 | Live Demo: The Broken Ingestion Script | 15 min |
| 0:30 | Workshop 1: API Fetching + Retry Logic | 25 min |
| 0:55 | Break | 10 min |
| 1:05 | Workshop 2: Pydantic Validation + SQLite Storage | 30 min |
| 1:35 | Assignment Launch: Connecting the Dots | 15 min |
| 1:50 | Q&A & Wrap Up | 10 min |
| 2:00 | End | - |
Total: 2 hours
Goal: Check understanding of the Week 3 reading material before diving in.
- [...] "18.5" (a string) to a float field in Pydantic?

Start the class with this script on screen. Ask: "What happens when this script runs at 3 AM and the API is down?"
```python
import requests
import json

data = requests.get(
    "https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m"
).json()

readings = []
for i in range(len(data["hourly"]["time"])):
    readings.append({
        "time": data["hourly"]["time"][i],
        "temp": data["hourly"]["temperature_2m"][i],
    })

with open("weather.json", "w") as f:
    json.dump(readings, f)

print(f"Saved {len(readings)} readings")
```
- [...] `ConnectionError`. Ask: "How do we survive this?" (Answer: retry with backoff)
- [...] `null`?" It gets saved to the file silently. (Answer: validation)
- [...] `timeout=10`)

Count the problems together. This becomes the motivation for every chapter.
Goal: Students build a resilient API fetcher with retry logic. This covers Chapters 2 and 3 (APIs and Error Handling).
Task: Fetch weather data from Open-Meteo and transform it into a list of dictionaries.
Instructions for students:
1. Use `requests.get()` with `timeout=10` and `params={}` (not a hardcoded URL)
2. Call `response.raise_for_status()` to catch HTTP errors
3. Build each dictionary with the keys `station`, `timestamp`, `temperature_c`, `humidity_pct`

Success criteria: Running the script prints 3 weather readings for Copenhagen.
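For instructors who want a reference solution on hand, the fetch-and-transform step might look roughly like the sketch below. The function names `fetch_forecast` and `to_readings`, the `limit` parameter, and the `relative_humidity_2m` hourly variable are assumptions made for illustration and are not taken from the course material; check the variable name against the Open-Meteo documentation before class.

```python
from itertools import islice
from typing import Any

API_URL = "https://api.open-meteo.com/v1/forecast"


def fetch_forecast(latitude: float, longitude: float) -> dict[str, Any]:
    """Fetch an hourly forecast; raises on network or HTTP errors."""
    # Third-party dependency, imported here so to_readings() can be
    # exercised offline without requests installed.
    import requests

    params = {
        "latitude": latitude,
        "longitude": longitude,
        # Assumed variable names; verify against the Open-Meteo docs.
        "hourly": "temperature_2m,relative_humidity_2m",
    }
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()  # turn 4xx/5xx responses into HTTPError
    return response.json()


def to_readings(payload: dict[str, Any], station: str, limit: int = 3) -> list[dict[str, Any]]:
    """Turn the column-oriented 'hourly' payload into a list of row dicts."""
    hourly = payload["hourly"]
    rows = zip(
        hourly["time"],
        hourly["temperature_2m"],
        hourly["relative_humidity_2m"],
    )
    return [
        {
            "station": station,
            "timestamp": timestamp,
            "temperature_c": temperature,
            "humidity_pct": humidity,
        }
        for timestamp, temperature, humidity in islice(rows, limit)
    ]
```

With network access, `for reading in to_readings(fetch_forecast(55.67, 12.56), "Copenhagen"): print(reading)` should print three readings, which matches the success criteria. Keeping the transform separate from the fetch lets students test it against a hand-written payload.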
Task: Wrap the fetch in a retry loop with exponential backoff.
Instructions for students:
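1. Add a `for attempt in range(max_retries)` loop around the request
2. Catch `ConnectionError` and `Timeout` - these are retryable
3. Wait `2 ** attempt` seconds between retries

Key moment: Have students test with https://httpstat.us/500 to see the retry behavior in action. A 500 response surfaces as an `HTTPError` from `raise_for_status()`, not as a `ConnectionError`, so the loop must also treat 5xx responses as retryable for this test. They should see the exponential delays in real time.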
Discussion: "Why wait 1, 2, 4 seconds instead of 1, 1, 1?" (Because if the server is overloaded, constant retries make it worse)
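The retry loop described above could be sketched as follows. This is one possible shape, not the course's reference solution: the name `with_retry` and the injectable `sleep` parameter are assumptions added so the delays can be observed without actually waiting. The default `retryable` tuple uses Python's built-in exceptions; with `requests`, pass `(requests.exceptions.ConnectionError, requests.exceptions.Timeout)` instead, since those classes do not inherit from the built-ins.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    fetch: Callable[[], T],
    max_retries: int = 3,
    retryable: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to max_retries times, waiting 1, 2, 4, ... seconds between tries."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except retryable as exc:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller see the last error
            delay = 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc!r}); retrying in {delay}s")
            sleep(delay)
    # Only reachable when the loop body never ran.
    raise ValueError("max_retries must be at least 1")
```

Any exception outside `retryable` propagates immediately, which is the behavior students should expect for errors that a retry cannot fix, such as a 404.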
Goal: Students validate data with Pydantic and store it in SQLite. This covers Chapters 5 and 6.
Task: Create a Pydantic model for weather readings.
Instructions for students:
1. Define a `WeatherReading` model with `station` (str), `temperature_c` (float), `humidity_pct` (int, 0-100), `timestamp` (str)
2. Add a `Field(ge=0, le=100)` constraint on `humidity_pct`
3. Add a `@field_validator` for `station` that strips whitespace
4. Try `WeatherReading(station="", temperature_c="abc", humidity_pct=150, timestamp="bad")` and observe the error message

Key moment: Show the detailed `ValidationError` output. Compare it to a raw dataclass, which accepts the same bad values without complaint and only fails later with a cryptic `ValueError` when the value is actually used. Pydantic tells you exactly what is wrong with each field.
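A minimal sketch of the model the steps above describe, assuming Pydantic v2 (where `field_validator` is available). The helper `explain_errors` is a hypothetical addition for printing one line per failing field and is not part of the workshop instructions.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)  # required, must be within 0-100
    timestamp: str

    @field_validator("station")
    @classmethod
    def strip_station(cls, value: str) -> str:
        # Runs after the built-in str check, so value is already a string.
        return value.strip()


def explain_errors(data: dict) -> list[str]:
    """Return one 'field: message' line per validation problem (empty if valid)."""
    try:
        WeatherReading(**data)
    except ValidationError as exc:
        return [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
    return []
```

Under this model, the bad input from the workshop reports problems for `temperature_c` and `humidity_pct` only: an empty `station` and a `timestamp` of `"bad"` are both valid strings. That gap is worth pointing out to students as a reason to add stricter constraints. The string `"18.5"` is coerced to the float `18.5` in Pydantic's default (lax) mode.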
Task: Validate a list of records, separating valid from invalid.
Instructions for students:
1. Write a `validate_batch` function that loops through records
2. Catch `ValidationError` for each record
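One way the batch step might look, again assuming Pydantic v2. The model is repeated in reduced form so the block runs on its own, and the shape of the `invalid` entries (index, original record, error list) is an assumption chosen for illustration; the course may expect a different return format.

```python
from typing import Any

from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)
    timestamp: str


def validate_batch(
    records: list[dict[str, Any]],
) -> tuple[list[WeatherReading], list[dict[str, Any]]]:
    """Split raw records into validated models and rejected records with reasons."""
    valid: list[WeatherReading] = []
    invalid: list[dict[str, Any]] = []
    for index, record in enumerate(records):
        try:
            valid.append(WeatherReading.model_validate(record))
        except ValidationError as exc:
            # Keep the original record and its errors so nothing is dropped silently.
            invalid.append({"index": index, "record": record, "errors": exc.errors()})
    return valid, invalid
```

Catching the error per record, rather than around the whole loop, means one bad row does not discard the rest of the batch.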