Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

🗓️ Lesson Plan

Theme: From Local to External

Welcome to Week 3! Students have read the material and built clean pipelines in Week 2. Today the focus shifts from "processing local data" to "ingesting external data safely." The goal is hands-on practice: fetch from an API, validate with Pydantic, and store in SQLite - all in class.

Goals

By the end of this lesson, students should be able to:

  1. Fetch data from an external API with a timeout and retry logic
  2. Validate incoming records with Pydantic models
  3. Store validated records in SQLite

Schedule

| Time | Activity | Duration |
|------|----------|----------|
| 0:00 | Welcome & Kahoot Quiz | 15 min |
| 0:15 | Live Demo: The Broken Ingestion Script | 15 min |
| 0:30 | Workshop 1: API Fetching + Retry Logic | 25 min |
| 0:55 | Break | 10 min |
| 1:05 | Workshop 2: Pydantic Validation + SQLite Storage | 30 min |
| 1:35 | Assignment Launch: Connecting the Dots | 15 min |
| 1:50 | Q&A & Wrap Up | 10 min |
| 2:00 | End | - |

Total: 2 hours


Kahoot Quiz (15 min)

Goal: Check understanding of the Week 3 reading material before diving in.

Topics to include

  1. Ingestion: What is the difference between local and external data?
  2. Error types: Is a 500 Server Error transient or permanent?
  3. Retry: What does exponential backoff mean? (Wait 1s, 2s, 4s, 8s...)
  4. HTTP: What does status code 429 mean? (Too Many Requests)
  5. CSV gotcha: What type is every value in a CSV file? (String)
  6. Pydantic: What happens when you pass "18.5" (a string) to a float field in Pydantic?
  7. SQL safety: Why should you never use f-strings in SQL queries?
  8. Idempotency: What does an upsert do when the record already exists?
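For quiz question 6, it helps to have the answer ready to show live: Pydantic (v2 assumed here) coerces a numeric string like "18.5" into a real float rather than rejecting it. A minimal sketch:

```python
# Quiz question 6, demonstrated: Pydantic coerces numeric strings to floats.
from pydantic import BaseModel


class Reading(BaseModel):
    temperature_c: float


r = Reading(temperature_c="18.5")  # string in...
print(type(r.temperature_c), r.temperature_c)  # ...float out
```

A non-numeric string such as "abc" would instead raise a ValidationError, which is exactly the contrast Workshop 2 builds on.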

Live Demo: The Broken Ingestion Script (15 min)

Start the class with this script on screen. Ask: "What happens when this script runs at 3 AM and the API is down?"

import requests
import json

data = requests.get(
    "https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m"
).json()

readings = []
for i in range(len(data["hourly"]["time"])):
    readings.append({
        "time": data["hourly"]["time"][i],
        "temp": data["hourly"]["temperature_2m"][i],
    })

with open("weather.json", "w") as f:
    json.dump(readings, f)

print(f"Saved {len(readings)} readings")

Teaching Points (do these live)

  1. Disconnect from WiFi (or use a bad URL) and run it. It crashes with ConnectionError. Ask: "How do we survive this?" (Answer: retry with backoff)
  2. Ask: "What if one temperature value is null?" It gets saved to the file silently. (Answer: validation)
  3. Ask: "What if you run this twice?" The file gets overwritten. All history is lost. (Answer: database with upserts)
  4. Ask: "Where is the timeout?" There is none. If the API hangs, the script hangs forever. (Answer: timeout=10)
  5. Ask: "What if a partner sends the same data as a CSV with different field names?" The script can only read this one API. (Answer: normalization)

Count the problems together. This becomes the motivation for every chapter.
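If time allows, show the smallest possible first fix on screen before moving to the workshops. This sketch addresses only points 1 and 4 (no hang, no crash); retries, validation, and storage come later. The function name `fetch_raw` is ours, not from the original script:

```python
# A minimal first fix (sketch): add a timeout and catch network errors
# so the 3 AM run fails cleanly instead of hanging or crashing.
import requests

URL = "https://api.open-meteo.com/v1/forecast"
PARAMS = {"latitude": 55.67, "longitude": 12.56, "hourly": "temperature_2m"}


def fetch_raw(url=URL, params=PARAMS):
    try:
        response = requests.get(url, params=params, timeout=10)  # never hang forever
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.json()
    except requests.exceptions.RequestException as exc:
        print(f"Fetch failed: {exc}")
        return None  # caller decides what an empty run means
```

Returning None here is deliberately crude; Workshop 1 replaces it with proper retry logic.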


Workshop 1: API Fetching + Retry Logic (25 min)

Goal: Students build a resilient API fetcher with retry logic. This covers Chapters 2 and 3 (APIs and Error Handling).

Part A: Basic API Fetch (10 min)

Task: Fetch weather data from Open-Meteo and transform it into a list of dictionaries.

Instructions for students:

  1. Use requests.get() with timeout=10 and params={} (not hardcoded URL)
  2. Call response.raise_for_status() to catch HTTP errors
  3. Transform the response into a list of dicts with keys: station, timestamp, temperature_c, humidity_pct
  4. Print the first 3 records

Success criteria: Running the script prints 3 weather readings for Copenhagen.
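One possible reference solution for Part A, kept on hand in case students get stuck. Splitting the transform out of the fetch makes it testable without a network. The Open-Meteo field names `time` and `temperature_2m` appear in the demo script; `relative_humidity_2m` is assumed here:

```python
# Part A sketch: fetch with timeout + params, then normalize into row dicts.
import requests


def to_records(hourly, station="copenhagen"):
    """Flatten Open-Meteo's column-oriented 'hourly' block into row dicts."""
    return [
        {
            "station": station,
            "timestamp": t,
            "temperature_c": temp,
            "humidity_pct": hum,
        }
        for t, temp, hum in zip(
            hourly["time"], hourly["temperature_2m"], hourly["relative_humidity_2m"]
        )
    ]


def fetch_readings():
    response = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={  # instruction 1: params dict, not a hardcoded URL
            "latitude": 55.67,
            "longitude": 12.56,
            "hourly": "temperature_2m,relative_humidity_2m",
        },
        timeout=10,  # instruction 1: never hang forever
    )
    response.raise_for_status()  # instruction 2: surface HTTP errors
    return to_records(response.json()["hourly"])  # instruction 3


if __name__ == "__main__":
    for record in fetch_readings()[:3]:  # instruction 4: first 3 records
        print(record)
```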

Part B: Add Retry Logic (15 min)

Task: Wrap the fetch in a retry loop with exponential backoff.

Instructions for students:

  1. Add a for attempt in range(max_retries) loop around the request
  2. Catch ConnectionError and Timeout - these are retryable
  3. Wait 2 ** attempt seconds between retries
  4. After the final attempt, return an empty list (do not crash the pipeline)
  5. Test by temporarily using a bad URL, then switching back

Key moment: Have students test with https://httpstat.us/500 to see the retry behavior in action. They should see the exponential delays in real time.

Discussion: "Why wait 1, 2, 4 seconds instead of 1, 1, 1?" (Because if the server is overloaded, constant retries make it worse)
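The retry loop from the instructions above can be sketched as a small wrapper. The name `fetch_with_retry` and the callable-argument design are ours; any equivalent structure students produce is fine:

```python
# Part B sketch: exponential backoff around an arbitrary fetch function.
import time

import requests


def fetch_with_retry(fetch, max_retries=3):
    """Call `fetch` (a zero-argument callable) with exponential backoff.

    ConnectionError and Timeout are transient, so they are retried; any
    other exception is a bug and still propagates. After the final failed
    attempt we return [] so the rest of the pipeline keeps running.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as exc:
            if attempt < max_retries - 1:
                wait = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
                time.sleep(wait)
    return []  # instruction 4: do not crash the pipeline
```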


Workshop 2: Pydantic Validation + SQLite Storage (30 min)

Goal: Students validate data with Pydantic and store it in SQLite. This covers Chapters 5 and 6.

Part A: Pydantic Model (10 min)

Task: Create a Pydantic model for weather readings.

Instructions for students:

  1. Create a WeatherReading model with station (str), temperature_c (float), humidity_pct (int, 0-100), timestamp (str)
  2. Add Field(ge=0, le=100) constraint on humidity_pct
  3. Add a @field_validator for station that strips whitespace
  4. Try creating a WeatherReading(station="", temperature_c="abc", humidity_pct=150, timestamp="bad") and observe the error message

Key moment: Show the detailed ValidationError output. Compare it to a raw dataclass, which performs no runtime type checking at all, or to the single cryptic ValueError you get from manual float() conversion. Pydantic tells you exactly what is wrong with each field.
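A reference sketch of the model, using the Pydantic v2 API. Note one addition beyond instruction 3: the validator also rejects an empty station, so that the station="" example in instruction 4 actually fails:

```python
# Part A sketch: a WeatherReading model with a constraint and a validator.
from pydantic import BaseModel, Field, ValidationError, field_validator


class WeatherReading(BaseModel):
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)  # instruction 2: percentage bounds
    timestamp: str

    @field_validator("station")
    @classmethod
    def clean_station(cls, value: str) -> str:
        value = value.strip()  # instruction 3: strip whitespace
        if not value:  # added check so station="" fails as in instruction 4
            raise ValueError("station must not be empty")
        return value


# Instruction 4: three of the four fields fail, and the error names each one.
# (timestamp="bad" passes, because it is typed as a plain str.)
try:
    WeatherReading(station="", temperature_c="abc", humidity_pct=150, timestamp="bad")
except ValidationError as exc:
    print(exc)
```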

Part B: Batch Validation (5 min)

Task: Validate a list of records, separating valid from invalid.

Instructions for students:

  1. Write a validate_batch function that loops through records
  2. Try/except ValidationError for each record
  3. Collect valid records in one list, error details in another
  4. Test with a mix of good and bad records
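The steps above can be sketched as a small function. The error-report shape (index plus `exc.errors()`) is one reasonable choice, not a required format; the inline WeatherReading is a minimal stand-in for the Part A model:

```python
# Part B sketch: split a batch into validated models and error reports.
from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):  # minimal stand-in for the Part A model
    station: str
    temperature_c: float
    humidity_pct: int = Field(ge=0, le=100)
    timestamp: str


def validate_batch(records, model=WeatherReading):
    """Return (valid model instances, error reports) for a list of raw dicts."""
    valid, errors = [], []
    for i, record in enumerate(records):
        try:
            valid.append(model(**record))  # instruction 2: per-record try/except
        except ValidationError as exc:
            errors.append({"index": i, "errors": exc.errors()})  # instruction 3
    return valid, errors
```

Keeping the error details (rather than just counting failures) pays off later, when the pipeline needs to report which upstream records were bad.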