Week 3 - Ingesting and Validating Data

🛠️ Practice

These exercises reinforce the core skills from this week. Each one is short and focused - complete them before starting the assignment. They build on each other, so work through them in order.


Exercise 1: Write a Retry Function

This function crashes on the first failure. Your job: make it resilient.

# BAD: no retry, crashes immediately ❌
import requests

def fetch_data(url: str) -> dict:
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

Your task:

  1. Add a max_retries parameter (default 3)
  2. Wrap the request in a retry loop with exponential backoff (2 ** attempt seconds)
  3. Only retry on ConnectionError and Timeout exceptions
  4. After the final attempt, raise the original exception
  5. Print a message before each retry: "Attempt {n} failed, retrying in {wait}s..."

Success criteria: If the request keeps raising ConnectionError or Timeout (for example, pointing fetch_data at a host that is down), the function retries 3 times with increasing delays, then raises the exception. Note that fetch_data("https://httpstat.us/500") fails immediately with an HTTPError, because a 500 response is not a connection error or timeout and should not be retried.
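One possible shape for the retry loop, sketched without assuming a network connection: the `get` parameter and `retry_on` tuple are additions for testability (they are not part of the exercise spec), and in your own version you would call `requests.get` directly and catch `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout`.

```python
import time


def fetch_data(url, get, max_retries=3, retry_on=(ConnectionError, TimeoutError)):
    """Call get(url) with retries and exponential backoff.

    `get` is any callable returning an object with .raise_for_status()
    and .json() -- e.g. requests.get. With requests, pass
    retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout).
    """
    for attempt in range(max_retries):
        try:
            response = get(url)
            response.raise_for_status()
            return response.json()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # final attempt: re-raise the original exception
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed, retrying in {wait}s...")
            time.sleep(wait)
```

Injecting `get` lets you exercise the retry path with a fake that fails on demand, without depending on an external service.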


Exercise 2: Paginated API Fetch

Write a function that fetches all results from a paginated API endpoint.

Your task:

  1. Create a function fetch_all_pages(base_url: str) -> list[dict]
  2. The API returns {"results": [...], "page": 1, "total_pages": 5}
  3. Loop through pages until page >= total_pages
  4. Collect all results into a single list
  5. Add a 0.5-second delay between requests to be polite

Use this test URL to exercise the single-page path (Open-Meteo does not paginate, so the response has no page/total_pages keys): https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m&forecast_days=1

Success criteria: The function returns a flat list of all records across all pages. For a single-page API, it returns the results from that one page.
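A sketch of the paging loop, again with an injected `get` callable so it can run without a live API. The `?page=` query parameter is an assumption (real APIs vary, and a base URL that already contains `?` would need `&` instead); with requests you might pass `get=lambda url: requests.get(url).json()`.

```python
import time


def fetch_all_pages(base_url, get, delay=0.0):
    """Collect `results` from every page of an API shaped like
    {"results": [...], "page": n, "total_pages": m}.

    `get` takes a URL and returns the parsed JSON for that page.
    The exercise asks for delay=0.5 between requests.
    """
    all_results = []
    page = 1
    while True:
        data = get(f"{base_url}?page={page}")  # assumes a ?page= query param
        all_results.extend(data["results"])
        if data["page"] >= data["total_pages"]:
            break
        page += 1
        time.sleep(delay)  # be polite between requests
    return all_results
```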


<aside> 💡 Exercises 1-2 cover the "fetch" side of ingestion. Exercises 3-5 cover the "validate and store" side. Exercise 6 combines everything.

</aside>

Exercise 3: Read and Normalize File Formats

You receive weather data in three formats. Normalize them all to the same shape.

Your task:

  1. Create a file data/stations.csv with columns: station_name, temp, humidity, date
  2. Create a file data/readings.json containing a list of objects with keys: station, temperature_c, humidity_pct, timestamp
  3. Write a read_csv_file(path) function that reads the CSV
  4. Write a read_json_file(path) function that reads the JSON
  5. Write a normalize_record(record: dict) -> dict function that outputs
   {"station": str, "temperature_c": float, "humidity_pct": int, "timestamp": str}
   regardless of the input field names

Success criteria: Both normalize_record(csv_row) and normalize_record(json_row) return dictionaries with the same four keys.
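One way to meet the success criteria is an alias table mapping every known input field name onto its canonical output name. The `FIELD_ALIASES` dict below is an assumption based on the column names in steps 1 and 2, not part of the exercise spec:

```python
# Map each known input field name onto the canonical output name.
# Covers both the CSV columns (station_name, temp, humidity, date)
# and the JSON keys (station, temperature_c, humidity_pct, timestamp).
FIELD_ALIASES = {
    "station": "station", "station_name": "station",
    "temperature_c": "temperature_c", "temp": "temperature_c",
    "humidity_pct": "humidity_pct", "humidity": "humidity_pct",
    "timestamp": "timestamp", "date": "timestamp",
}


def normalize_record(record):
    """Rename fields to the canonical names and coerce the value types."""
    out = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key)
        if canonical is not None:  # silently drop unknown fields
            out[canonical] = value
    out["station"] = str(out["station"])
    out["temperature_c"] = float(out["temperature_c"])  # CSV values arrive as strings
    out["humidity_pct"] = int(out["humidity_pct"])
    out["timestamp"] = str(out["timestamp"])
    return out
```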


Exercise 4: Pydantic Validation

Create a Pydantic model and validate a batch of records.

Your task:

  1. Create a WeatherReading model with fields station (str), temperature_c (float), humidity_pct (int, constrained to 0-100), and timestamp (str)
  2. Add a @field_validator for station that strips whitespace and converts to title case
  3. Write a function validate_batch(records: list[dict]) -> tuple[list[WeatherReading], list[dict]] that returns valid records and error details
  4. Test with this data:
test_data = [
    {"station": "copenhagen", "temperature_c": "18.5", "humidity_pct": "72", "timestamp": "2025-01-15T10:00"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150", "timestamp": "bad"},
    {"station": "  AARHUS  ", "temperature_c": "15.2", "humidity_pct": "65", "timestamp": "2025-01-15T11:00"},
]

Success criteria: validate_batch(test_data) returns 2 valid records (Copenhagen and Aarhus, both title-cased) and 1 error. The error detail includes which fields failed.
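A sketch of what this can look like with Pydantic v2 (the exact field constraints are an assumption inferred from the test data, e.g. that humidity_pct must be between 0 and 100):

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class WeatherReading(BaseModel):
    station: str
    temperature_c: float            # "18.5" is coerced to 18.5 in lax mode
    humidity_pct: int = Field(ge=0, le=100)  # "150" coerces to 150, then fails le=100
    timestamp: str

    @field_validator("station")
    @classmethod
    def clean_station(cls, v: str) -> str:
        v = v.strip()
        if not v:
            raise ValueError("station must not be blank")
        return v.title()


def validate_batch(records):
    """Split records into validated models and per-record error details."""
    valid, errors = [], []
    for record in records:
        try:
            valid.append(WeatherReading.model_validate(record))
        except ValidationError as exc:
            # exc.errors() lists each failing field under its "loc" key
            errors.append({"record": record, "errors": exc.errors()})
    return valid, errors
```

Collecting `exc.errors()` rather than just the exception message is what lets the error detail report which fields failed.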


Exercise 5: SQLite Writer

Store validated weather data in a SQLite database.

Your task:

  1. Create a function create_table(db_path: str) that creates a weather_readings table with columns: station, timestamp, temperature_c, humidity_pct, plus a UNIQUE(station, timestamp) constraint
  2. Create a function upsert_readings(db_path: str, readings: list[dict]) that inserts records using ON CONFLICT ... DO UPDATE SET
  3. Create a function query_by_station(db_path: str, station: str) -> list[dict] that returns all readings for a station
  4. Use parameterized queries (? placeholders) for all SQL operations

Success criteria: Insert 3 records, then re-insert the same records with updated temperatures. Query the database and verify only 3 rows exist (not 6), with the updated temperatures.
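The upsert is the part that usually trips people up, so here is one possible sketch using the stdlib sqlite3 module. The `excluded.` prefix in the `DO UPDATE SET` clause refers to the row that failed to insert (this syntax needs SQLite 3.24+):

```python
import sqlite3


def create_table(db_path):
    with sqlite3.connect(db_path) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS weather_readings (
                station TEXT NOT NULL,
                timestamp TEXT NOT NULL,
                temperature_c REAL,
                humidity_pct INTEGER,
                UNIQUE(station, timestamp)
            )
        """)


def upsert_readings(db_path, readings):
    with sqlite3.connect(db_path) as conn:
        conn.executemany(
            """
            INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(station, timestamp) DO UPDATE SET
                temperature_c = excluded.temperature_c,
                humidity_pct = excluded.humidity_pct
            """,
            [(r["station"], r["timestamp"], r["temperature_c"], r["humidity_pct"])
             for r in readings],
        )


def query_by_station(db_path, station):
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row  # rows become dict-convertible
        rows = conn.execute(
            "SELECT * FROM weather_readings WHERE station = ?", (station,)
        ).fetchall()
        return [dict(row) for row in rows]
```

Note that `with sqlite3.connect(...)` commits the transaction on success but does not close the connection; for a long-running pipeline you would open one connection and reuse it.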


<aside> ⚠️ If Exercise 5 feels hard, make sure your Pydantic model from Exercise 4 works first. The database only stores data that passes validation.

</aside>

Exercise 6: Mini Pipeline

Combine all the pieces into a small end-to-end pipeline.

Your task:

  1. Fetch weather data from the Open-Meteo API for Copenhagen (latitude 55.67, longitude 12.56)
  2. Normalize the API response into a list of dictionaries with keys: station, timestamp, temperature_c, humidity_pct
  3. Validate each record with your Pydantic WeatherReading model
  4. Store valid records in a SQLite database using your upsert_readings function
  5. Print a summary: total fetched, valid count, error count, rows in database
# Your pipeline should look something like this:
raw_records = fetch_weather(latitude=55.67, longitude=12.56, days=1)
valid, errors = validate_batch(raw_records)
upsert_readings("weather.db", [r.model_dump() for r in valid])
print(f"Fetched: {len(raw_records)}, Valid: {len(valid)}, Errors: {len(errors)}")

Success criteria: Running the script twice produces the same number of rows in the database (upserts, not duplicates). The summary shows 0 errors for clean API data.
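The trickiest step is usually step 2, because Open-Meteo returns hourly values as parallel lists rather than one object per reading. A hedged sketch of that reshaping (the helper name `normalize_open_meteo` is hypothetical, and it assumes you requested hourly=temperature_2m,relative_humidity_2m so that both lists are present):

```python
def normalize_open_meteo(payload, station="Copenhagen"):
    """Flatten an Open-Meteo hourly payload into one record per timestamp.

    Assumes payload["hourly"] holds parallel lists: "time",
    "temperature_2m", and "relative_humidity_2m".
    """
    hourly = payload["hourly"]
    return [
        {
            "station": station,
            "timestamp": ts,
            "temperature_c": temp,
            "humidity_pct": humidity,
        }
        for ts, temp, humidity in zip(
            hourly["time"],
            hourly["temperature_2m"],
            hourly["relative_humidity_2m"],
        )
    ]
```

The station name is passed in explicitly because the API response identifies the location only by coordinates, not by a name your model can validate.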

<aside> 💡 Using AI to help: If you get stuck on an exercise, paste the error message and your code (⚠️ Ensure no PII or sensitive company data is included!) into an LLM. Ask it to explain what went wrong and suggest a fix. This is especially useful for debugging SQL syntax and Pydantic validation errors.

</aside>


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.