Week 3 - Ingesting and Validating Data
These exercises reinforce the core skills from this week. Each one is short and focused - complete them before starting the assignment. They build on each other, so work through them in order.
All Week 3 exercises live under data-track/week-3/ in HYF's Learning-Resources repo. One Codespace covers all six exercises.
<aside> 💻 Open in GitHub Codespaces
</aside>
The repo's .devcontainer/data-track/ boots Python 3.11 + ruff + Pylance for every exercise. From the Codespace's Explorer, navigate into data-track/week-3/exercise_N/.
Prefer your own VS Code? Clone locally instead:
git clone https://github.com/HackYourFuture/Learning-Resources.git
cd Learning-Resources/data-track/week-3
code .
Each exercise folder ships its own requirements.txt (when needed) and a per-exercise README with detailed instructions.
Each exercise_N/solutions/ folder holds the answer in-place. The starter file is filled with the answer code, the original # TODO comments are preserved, and # WHY ...: notes sit under each non-obvious choice.
Read the WHY notes, not the code. The point is the reasoning, not the syntax.
The solution sits next to your starter under solutions/ rather than on a separate branch. The folder name and the deliberate "open this folder to see the answer" click are the whole barrier, and they are enough. Time-box yourself: 10-30 minutes of honest attempt before you open solutions/. The struggle is where the learning happens.
You can diff your attempt against the reference once you have tried:
diff exercise_1/exercise.py exercise_1/solutions/exercise.py
This function crashes on the first failure. Your job: make it resilient.
# BAD: no retry, crashes immediately ❌
import requests

def fetch_data(url: str) -> dict:
    response = requests.get(url)
    response.raise_for_status()
    return response.json()
Your task:
- Add a max_retries parameter (default 3)
- Wait with exponential backoff between attempts (2 ** attempt seconds)
- Handle ConnectionError and Timeout exceptions
- Log each failed attempt: "Attempt {n} failed, retrying in {wait}s..."
Success criteria: If you call fetch_data("https://httpstat.us/500"), it retries 3 times with increasing delays, then raises an exception.
<aside>
📦 Files: exercise_1/: use the Codespace you opened at the top of this page.
</aside>
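If the backoff pattern is new to you, here is a minimal sketch of it on its own, kept separate from requests so it runs offline. The name call_with_backoff and its action, retry_on, and sleep parameters are hypothetical and not part of the exercise starter. By default it catches Python's built-in ConnectionError and TimeoutError; in your own solution you would list the requests exceptions instead. Note that a 500 response surfaces as requests.exceptions.HTTPError from raise_for_status(), so that exception needs to be retried too for the httpstat.us check to behave as described.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    action: Callable[[], T],
    max_retries: int = 3,
    # Assumption: built-in exceptions stand in for the requests ones here.
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on a listed exception wait 2 ** attempt seconds, then retry."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_retries:
                raise  # out of attempts: let the caller see the last error
            wait = 2 ** attempt
            print(f"Attempt {attempt} failed, retrying in {wait}s...")
            sleep(wait)
    raise AssertionError("unreachable")
```

Passing sleep in as a parameter is a design choice for testing: you can swap in a fake that records the delays instead of actually waiting.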
Write a function that fetches all results from a paginated API endpoint.
Your task:
- Write fetch_all_pages(base_url: str) -> list[dict]
- Assume each response has the shape {"results": [...], "page": 1, "total_pages": 5}
- Stop when page >= total_pages
Use this test URL to practice: https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m&forecast_days=1
Success criteria: The function returns a flat list of all records across all pages. For a single-page API, it returns the results from that one page.
<aside>
📦 Files: exercise_2/: includes a local 3-page stub so you can verify offline before pointing at a real API.
</aside>
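The page loop itself can be sketched without any HTTP at all. In the sketch below, collect_pages, get_page, and FAKE_PAGES are hypothetical names, and the in-memory dictionary stands in for a paginated API that uses the response shape from the task. It is one possible way to structure the loop, not the reference solution.

```python
from typing import Callable

# Hypothetical stand-in for a 3-page API, keyed by page number.
FAKE_PAGES = {
    1: {"results": [{"id": 1}, {"id": 2}], "page": 1, "total_pages": 3},
    2: {"results": [{"id": 3}, {"id": 4}], "page": 2, "total_pages": 3},
    3: {"results": [{"id": 5}], "page": 3, "total_pages": 3},
}


def collect_pages(get_page: Callable[[int], dict]) -> list[dict]:
    """Request page 1, 2, 3, ... and return all results as one flat list."""
    records: list[dict] = []
    page = 1
    while True:
        body = get_page(page)
        records.extend(body.get("results", []))
        # A response without paging fields is treated as a single page.
        if body.get("page", page) >= body.get("total_pages", 1):
            break
        page += 1
    return records


all_records = collect_pages(FAKE_PAGES.__getitem__)
print(len(all_records))
```

In a real version, get_page would make the HTTP request for the given page number and return the parsed JSON body.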
<aside> 💡 Exercises 1-2 cover the "fetch" side of ingestion. Exercises 3-5 cover the "validate and store" side. Exercise 6 combines everything.
</aside>
You receive weather data in three formats. Normalize them all to the same shape.
Your task:
- data/stations.csv with columns: station_name, temp, humidity, date
- data/readings.json containing a list of objects with keys: station, temperature_c, humidity_pct, timestamp
- A read_csv_file(path) function that reads the CSV
- A read_json_file(path) function that reads the JSON
- A normalize_record(record: dict) -> dict function that outputs {"station": str, "temperature_c": float, "humidity_pct": int, "timestamp": str} regardless of the input field names
Success criteria: Both normalize_record(csv_row) and normalize_record(json_row) return dictionaries with the same four keys.
<aside>
📦 Files: exercise_3/: ships sample data/stations.csv and data/readings.json with deliberately different field names.
</aside>
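One common way to normalize differing field names is an alias table plus explicit type casts. The sketch below assumes the CSV values arrive as strings (as they do from csv.DictReader) and that humidity is always a whole number. FIELD_ALIASES and to_common_shape are hypothetical names, and the sample rows are made up for illustration.

```python
# Maps the CSV column names onto the target keys; JSON keys already match.
FIELD_ALIASES = {
    "station_name": "station",
    "temp": "temperature_c",
    "humidity": "humidity_pct",
    "date": "timestamp",
}


def to_common_shape(record: dict) -> dict:
    """Rename known aliases, then cast each value to the target type."""
    renamed = {FIELD_ALIASES.get(key, key): value for key, value in record.items()}
    return {
        "station": str(renamed["station"]),
        "temperature_c": float(renamed["temperature_c"]),
        "humidity_pct": int(renamed["humidity_pct"]),
        "timestamp": str(renamed["timestamp"]),
    }


# Made-up sample rows, one per input format.
csv_row = {"station_name": "Copenhagen", "temp": "18.5", "humidity": "72", "date": "2025-01-15T10:00"}
json_row = {"station": "Copenhagen", "temperature_c": 18.5, "humidity_pct": 72, "timestamp": "2025-01-15T10:00"}
print(to_common_shape(csv_row) == to_common_shape(json_row))
```

Keeping the aliases in one dictionary means a new source format only needs new entries, not new branching logic.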
Create a Pydantic model and validate a batch of records.
Your task:
- A WeatherReading model with:
  - station: str, minimum length 1
  - temperature_c: float, between -90 and 60
  - humidity_pct: int, between 0 and 100
  - timestamp: str
- A @field_validator for station that strips whitespace and converts to title case
- A validate_batch(records: list[dict]) -> tuple[list[WeatherReading], list[dict]] function that returns valid records and error details
test_data = [
    {"station": "copenhagen", "temperature_c": "18.5", "humidity_pct": "72", "timestamp": "2025-01-15T10:00"},
    {"station": "", "temperature_c": "abc", "humidity_pct": "150", "timestamp": "bad"},
    {"station": " AARHUS ", "temperature_c": "15.2", "humidity_pct": "65", "timestamp": "2025-01-15T11:00"},
]
Success criteria: validate_batch(test_data) returns 2 valid records (Copenhagen and Aarhus, both title-cased) and 1 error. The error detail includes which fields failed.
<aside>
📦 Files: exercise_4/: use the Codespace you opened at the top of this page.
</aside>
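To see how Pydantic reports which fields failed, here is a small sketch on a deliberately different model, assuming Pydantic v2. Sensor, tidy_name, and split_valid_invalid are hypothetical names chosen so this does not double as the exercise answer. The part worth studying is the "loc" entry in each item of ValidationError.errors().

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class Sensor(BaseModel):  # hypothetical model, not the exercise's WeatherReading
    name: str = Field(min_length=1)
    level: int = Field(ge=0, le=100)

    @field_validator("name")
    @classmethod
    def tidy_name(cls, value: str) -> str:
        # Runs after the min_length check, on the already-validated string.
        return value.strip().title()


def split_valid_invalid(rows: list[dict]) -> tuple[list[Sensor], list[dict]]:
    """Validate each row; keep the models that pass and describe the ones that fail."""
    valid: list[Sensor] = []
    errors: list[dict] = []
    for index, row in enumerate(rows):
        try:
            valid.append(Sensor.model_validate(row))
        except ValidationError as exc:
            # Each error carries a "loc" tuple naming the field that failed.
            failed = [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
            errors.append({"index": index, "fields": failed, "record": row})
    return valid, errors


rows = [{"name": " pump a ", "level": "42"}, {"name": "", "level": "150"}]
valid, errors = split_valid_invalid(rows)
print(valid, errors)
```

Catching ValidationError per row, rather than around the whole loop, is what lets one bad record fail without discarding the rest of the batch.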
Store validated weather data in a SQLite database.
Your task:
- Write create_table(db_path: str) that creates a weather_readings table with columns: station, timestamp, temperature_c, humidity_pct, plus a UNIQUE(station, timestamp) constraint
- Write upsert_readings(db_path: str, readings: list[dict]) that inserts records using ON CONFLICT ... DO UPDATE SET
- Write query_by_station(db_path: str, station: str) -> list[dict] that returns all readings for a station
- Use parameterized queries (? placeholders) for all SQL operations
Success criteria: Insert 3 records, then re-insert the same records with updated temperatures. Query the database and verify only 3 rows exist (not 6), with the updated temperatures.
<aside>
📦 Files: exercise_5/: SQLite ships with Python, no extra install. The generated weather.db is gitignored.
</aside>
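The upsert mechanics are easier to see on a tiny throwaway table. The sketch below uses an in-memory database and a hypothetical prices table, so it runs without touching weather.db. The names make_table, upsert, and UPSERT_SQL are illustrative only. It assumes a SQLite version with upsert support (3.24 or newer), which the Python 3.11 builds in the Codespace should include.

```python
import sqlite3

# "excluded" refers to the row that the INSERT tried to add.
UPSERT_SQL = """
INSERT INTO prices (item, day, price) VALUES (?, ?, ?)
ON CONFLICT(item, day) DO UPDATE SET price = excluded.price
"""


def make_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "item TEXT NOT NULL, day TEXT NOT NULL, price REAL, "
        "UNIQUE(item, day))"
    )


def upsert(conn: sqlite3.Connection, rows: list[tuple[str, str, float]]) -> None:
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany(UPSERT_SQL, rows)


conn = sqlite3.connect(":memory:")
make_table(conn)
upsert(conn, [("tea", "2025-01-15", 3.0), ("coffee", "2025-01-15", 4.0)])
upsert(conn, [("tea", "2025-01-15", 3.5)])  # same key: updates, does not duplicate
print(conn.execute("SELECT item, price FROM prices ORDER BY item").fetchall())
```

The UNIQUE constraint is what makes the conflict detectable; without it, the second insert would simply add a duplicate row.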
<aside> ⚠️ If Exercise 5 feels hard, make sure your Pydantic model from Exercise 4 works first. The database only stores data that passes validation.
</aside>
Combine all the pieces into a small end-to-end pipeline.
Your task:
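- Fetch hourly weather from the Open-Meteo API and convert it into records with the keys station, timestamp, temperature_c, humidity_pct
- Validate the records with your WeatherReading model
- Store the valid records with your upsert_readings function
# Your pipeline should look something like this: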
raw_records = fetch_weather(latitude=55.67, longitude=12.56, days=1)
valid, errors = validate_batch(raw_records)
upsert_readings("weather.db", [r.model_dump() for r in valid])
print(f"Fetched: {len(raw_records)}, Valid: {len(valid)}, Errors: {len(errors)}")
Success criteria: Running the script twice produces the same number of rows in the database (upserts, not duplicates). The summary shows 0 errors for clean API data.
<aside>
📦 Files: exercise_6/: capstone, uses requests + pydantic + sqlite3. The Open-Meteo call needs network (no API key).
</aside>
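The step that tends to trip people up is turning the API response into one record per hour. Open-Meteo returns hourly data as parallel lists, and the sketch below zips them together. It assumes the response has an "hourly" object with "time", "temperature_2m", and "relative_humidity_2m" lists (the last one only appears if you request it in the hourly= parameter). hourly_to_records and the sample payload are hypothetical, so check the shape against a real response before relying on it.

```python
def hourly_to_records(payload: dict, station: str) -> list[dict]:
    """Turn parallel hourly lists into one dict per timestamp."""
    hourly = payload["hourly"]
    return [
        {
            "station": station,
            "timestamp": timestamp,
            "temperature_c": temperature,
            "humidity_pct": humidity,
        }
        for timestamp, temperature, humidity in zip(
            hourly["time"],
            hourly["temperature_2m"],
            hourly["relative_humidity_2m"],
        )
    ]


# Made-up payload in the assumed shape, so the sketch runs without network.
sample_payload = {
    "hourly": {
        "time": ["2025-01-15T00:00", "2025-01-15T01:00"],
        "temperature_2m": [2.1, 1.8],
        "relative_humidity_2m": [88, 90],
    }
}
records = hourly_to_records(sample_payload, station="Copenhagen")
print(records[0])
```

The API does not return a station name for a coordinate pair, so the station value here is one you supply yourself.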
Once Exercise 6 runs cleanly, you have built a full ingestion pipeline.
<aside> 💡 Using AI to help: If you get stuck on an exercise, paste the error message and your code (⚠️ Ensure no PII or sensitive company data is included!) into an LLM. Ask it to explain what went wrong and suggest a fix. This is especially useful for debugging SQL syntax and Pydantic validation errors.
</aside>