Week 3 - Ingesting and Validating Data
Introduction to Data Ingestion
Assignment: Build a Validated Ingestion Pipeline
Content
These exercises reinforce the core skills from this week. Each one is short and focused - complete them before starting the assignment. They build on each other, so work through them in order.
This function crashes on the first failure. Your job: make it resilient.
```python
# BAD: no retry, crashes immediately ❌
import requests

def fetch_data(url: str) -> dict:
    response = requests.get(url)
    response.raise_for_status()
    return response.json()
```
Your task:
- Add a `max_retries` parameter (default 3)
- Wait with exponential backoff between attempts (`2 ** attempt` seconds)
- Retry only on `ConnectionError` and `Timeout` exceptions
- Print `"Attempt {n} failed, retrying in {wait}s..."` before each retry

Success criteria: If you call `fetch_data("https://httpstat.us/500")`, it retries 3 times with increasing delays, then raises an exception.
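A minimal sketch of the retry loop, following the bullet points above (the exact error handling and message wording are up to you):

```python
import time

import requests


def fetch_data(url: str, max_retries: int = 3) -> dict:
    """Fetch JSON with exponential backoff on transient network errors."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller handle the failure
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed, retrying in {wait}s...")
            time.sleep(wait)
```

Note that the last attempt re-raises instead of sleeping, so a caller sees the original exception once all retries are exhausted.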
Write a function that fetches all results from a paginated API endpoint.
Your task:
- Write a `fetch_all_pages(base_url: str) -> list[dict]` function
- Assume each response has the shape `{"results": [...], "page": 1, "total_pages": 5}`
- Stop fetching when `page >= total_pages`
- Use this test URL to practice: https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m&forecast_days=1
Success criteria: The function returns a flat list of all records across all pages. For a single-page API, it returns the results from that one page.
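One possible shape for the pagination loop. This sketch assumes the API accepts a `page` query parameter, which is not stated in the task, so adapt it to whatever your endpoint actually uses:

```python
import requests


def fetch_all_pages(base_url: str) -> list[dict]:
    """Collect results from every page of a paginated endpoint.

    Assumes each response body looks like
    {"results": [...], "page": n, "total_pages": m}.
    """
    records: list[dict] = []
    page = 1
    while True:
        response = requests.get(base_url, params={"page": page}, timeout=10)
        response.raise_for_status()
        body = response.json()
        records.extend(body["results"])
        if body["page"] >= body["total_pages"]:
            break  # last page reached
        page += 1
    return records
```

Because the stop condition compares `page` against `total_pages`, a single-page API exits the loop after one request and returns just that page's results.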
<aside> 💡 Exercises 1-2 cover the "fetch" side of ingestion. Exercises 3-5 cover the "validate and store" side. Exercise 6 combines everything.
</aside>
You receive weather data in three formats. Normalize them all to the same shape.
Your task:
- Create `data/stations.csv` with columns: `station_name`, `temp`, `humidity`, `date`
- Create `data/readings.json` containing a list of objects with keys: `station`, `temperature_c`, `humidity_pct`, `timestamp`
- Write a `read_csv_file(path)` function that reads the CSV
- Write a `read_json_file(path)` function that reads the JSON
- Write a `normalize_record(record: dict) -> dict` function that outputs `{"station": str, "temperature_c": float, "humidity_pct": int, "timestamp": str}` regardless of the input field names
Success criteria: Both normalize_record(csv_row) and normalize_record(json_row) return dictionaries with the same four keys.
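One way to sketch `normalize_record`: map the CSV column names from the task onto the JSON key names, then coerce the types. The alias table below is an assumption based on the column lists above:

```python
def normalize_record(record: dict) -> dict:
    """Map CSV-style or JSON-style field names onto one canonical shape."""
    # CSV column name -> canonical key; JSON keys pass through unchanged.
    aliases = {
        "station_name": "station",
        "temp": "temperature_c",
        "humidity": "humidity_pct",
        "date": "timestamp",
    }
    canonical = {aliases.get(key, key): value for key, value in record.items()}
    return {
        "station": str(canonical["station"]),
        "temperature_c": float(canonical["temperature_c"]),
        "humidity_pct": int(canonical["humidity_pct"]),
        "timestamp": str(canonical["timestamp"]),
    }
```

The type coercion matters because CSV readers hand you strings: `float("18.5")` and `int("72")` give the numeric values the later exercises expect.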
Create a Pydantic model and validate a batch of records.
Your task:
- Define a `WeatherReading` model with:
    - `station: str`, minimum length 1
    - `temperature_c: float`, between -90 and 60
    - `humidity_pct: int`, between 0 and 100
    - `timestamp: str`
- Add a `@field_validator` for `station` that strips whitespace and converts to title case
- Write `validate_batch(records: list[dict]) -> tuple[list[WeatherReading], list[dict]]` that returns the valid records and the error details
- Test with:

```python
test_data = [
    {"station": "copenhagen", "temperature_c": "18.5", "humidity_pct": "72", "timestamp": "2025-01-15T10:00"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150", "timestamp": "bad"},
    {"station": " AARHUS ", "temperature_c": "15.2", "humidity_pct": "65", "timestamp": "2025-01-15T11:00"},
]
```
Success criteria: validate_batch(test_data) returns 2 valid records (Copenhagen and Aarhus, both title-cased) and 1 error. The error detail includes which fields failed.
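A sketch of the model and batch validator, assuming Pydantic v2 (where `field_validator` and `Field` constraints like `ge`/`le` live):

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)
    timestamp: str

    @field_validator("station")
    @classmethod
    def clean_station(cls, value: str) -> str:
        # Strip surrounding whitespace and title-case the name.
        return value.strip().title()


def validate_batch(records: list[dict]) -> tuple[list[WeatherReading], list[dict]]:
    """Split a batch into validated models and per-record error details."""
    valid: list[WeatherReading] = []
    errors: list[dict] = []
    for record in records:
        try:
            valid.append(WeatherReading(**record))
        except ValidationError as exc:
            # exc.errors() lists each failed field and the reason.
            errors.append({"record": record, "errors": exc.errors()})
    return valid, errors
```

Collecting `exc.errors()` alongside the original record is what lets you report *which* fields failed, as the success criteria ask.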
Store validated weather data in a SQLite database.
Your task:
- Write `create_table(db_path: str)` that creates a `weather_readings` table with columns `station`, `timestamp`, `temperature_c`, `humidity_pct`, plus a `UNIQUE(station, timestamp)` constraint
- Write `upsert_readings(db_path: str, readings: list[dict])` that inserts records using `ON CONFLICT ... DO UPDATE SET`
- Write `query_by_station(db_path: str, station: str) -> list[dict]` that returns all readings for a station
- Use parameterized queries (`?` placeholders) for all SQL operations

Success criteria: Insert 3 records, then re-insert the same records with updated temperatures. Query the database and verify only 3 rows exist (not 6), with the updated temperatures.
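A minimal sketch of the three functions with `sqlite3` from the standard library. The column types are assumptions; the key pieces are the `UNIQUE` constraint, the `excluded.` references in the upsert, and the `?` placeholders:

```python
import sqlite3


def create_table(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS weather_readings (
                   station TEXT NOT NULL,
                   timestamp TEXT NOT NULL,
                   temperature_c REAL,
                   humidity_pct INTEGER,
                   UNIQUE (station, timestamp)
               )"""
        )


def upsert_readings(db_path: str, readings: list[dict]) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.executemany(
            """INSERT INTO weather_readings
                   (station, timestamp, temperature_c, humidity_pct)
               VALUES (?, ?, ?, ?)
               ON CONFLICT (station, timestamp)
               DO UPDATE SET temperature_c = excluded.temperature_c,
                             humidity_pct = excluded.humidity_pct""",
            [
                (r["station"], r["timestamp"], r["temperature_c"], r["humidity_pct"])
                for r in readings
            ],
        )


def query_by_station(db_path: str, station: str) -> list[dict]:
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row  # rows behave like dicts
        rows = conn.execute(
            "SELECT * FROM weather_readings WHERE station = ?", (station,)
        ).fetchall()
        return [dict(row) for row in rows]
```

`excluded.temperature_c` refers to the value from the row that *would* have been inserted, which is exactly what you want to overwrite on a conflict.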
<aside> ⚠️ If Exercise 5 feels hard, make sure your Pydantic model from Exercise 4 works first. The database only stores data that passes validation.
</aside>
Combine all the pieces into a small end-to-end pipeline.
Your task:
- Fetch weather data and shape each record with the fields `station`, `timestamp`, `temperature_c`, `humidity_pct`
- Validate every record with your `WeatherReading` model
- Store the valid records with your `upsert_readings` function

```python
# Your pipeline should look something like this:
raw_records = fetch_weather(latitude=55.67, longitude=12.56, days=1)
valid, errors = validate_batch(raw_records)
upsert_readings("weather.db", [r.model_dump() for r in valid])
print(f"Fetched: {len(raw_records)}, Valid: {len(valid)}, Errors: {len(errors)}")
```
Success criteria: Running the script twice produces the same number of rows in the database (upserts, not duplicates). The summary shows 0 errors for clean API data.
<aside> 💡 Using AI to help: If you get stuck on an exercise, paste the error message and your code (⚠️ Ensure no PII or sensitive company data is included!) into an LLM. Ask it to explain what went wrong and suggest a fix. This is especially useful for debugging SQL syntax and Pydantic validation errors.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.