Week 3 - Ingesting and Validating Data
Introduction to Data Ingestion
Assignment: Build a Validated Ingestion Pipeline
You recently joined DataBridge Inc., a weather data aggregator. The company collects weather readings from multiple sources: APIs, CSV exports from weather stations, and JSON feeds from partner networks.
The previous engineer built a script that "mostly works." It fetches data from one API, writes it directly to a file, skips validation entirely, and crashes if anything goes wrong. Your manager has asked you to replace it with a proper ingestion pipeline that handles multiple sources, validates all data, and stores it in a database.
This is what you are replacing. Do not use this code directly. Study it, identify the problems, then build something better.
```python
import requests
import json

# 1. Hardcoded URL, no error handling
data = requests.get(
    "https://api.open-meteo.com/v1/forecast"
    "?latitude=55.67&longitude=12.56&hourly=temperature_2m"
).json()

# 2. No validation - trusts the API completely
readings = []
for i in range(len(data["hourly"]["time"])):
    readings.append({
        "time": data["hourly"]["time"][i],
        "temp": data["hourly"]["temperature_2m"][i],
    })

# 3. No database, dumps to a file
with open("weather.json", "w") as f:
    json.dump(readings, f)

print(f"Saved {len(readings)} readings")
```
Problems:

- Hardcoded URL, no error handling: a single network failure crashes the script.
- No validation: the script trusts the API completely. What if a value comes back as null or "N/A"?
- No database: readings are dumped straight to a flat file.

Your first source is that same API. Fetch hourly weather data from the Open-Meteo API. This API is free and requires no API key.

Use these parameters:

- latitude: 55.67 (Copenhagen)
- longitude: 12.56
- hourly: temperature_2m,relative_humidity_2m
- forecast_days: 7

Your second source is a CSV file. Create a file data/weather_stations.csv with at least 10 rows of weather station data. Include intentional problems:
station,timestamp,temperature_c,humidity_pct
Copenhagen,2025-01-15T10:00,18.5,72
,2025-01-15T11:00,20.1,65
Aarhus,2025-01-15T12:00,N/A,58
Odense,2025-01-15T13:00,15.2,150
Copenhagen,2025-01-15T10:00,19.0,70
Aalborg,,14.8,62
Roskilde,2025-01-15T15:00,-95.0,45
Esbjerg,2025-01-15T16:00,16.3,
Herning,2025-01-15T17:00,17.1,55
Kolding,2025-01-15T18:00,14.9,68
Problems to handle: empty station names, N/A values, humidity over 100, duplicate records, missing timestamps, temperatures outside valid range, missing humidity values.
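One lenient way to read such a file is to convert the numeric fields where possible and keep unconvertible values (like "N/A" or an empty string) as-is, so that validation, not ingestion, decides what to reject. A minimal sketch (the function name `read_station_csv` is illustrative, not required):

```python
import csv

def read_station_csv(path):
    """Read station rows, converting numeric fields where possible.

    Values that fail conversion (e.g. "N/A" or "") are kept as-is
    so the validation step can flag them later.
    """
    rows = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for field, cast in (("temperature_c", float), ("humidity_pct", int)):
                try:
                    row[field] = cast(row[field])
                except (ValueError, TypeError):
                    pass  # leave the raw value for validation to catch
            rows.append(row)
    return rows
```

Note that this function never raises on bad data; it only normalizes what it can.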
Your pipeline must demonstrate the Week 3 concepts. Each task maps to a chapter you studied.
Ingest from the API (ingest_api.py)

- Write a fetch_with_retry function that retries on transient errors with exponential backoff.
- Give it a max_retries parameter (default 3) and a timeout for each request.
- Normalize each record to the fields station, timestamp, temperature_c, humidity_pct.
- Set station to "Open-Meteo Copenhagen" for API records.

Ingest from files (ingest_files.py)

- Read the CSV with csv.DictReader (as you learned in Week 1, Chapter 9: File Operations).
- Normalize each row to station, timestamp, temperature_c, humidity_pct.
- Convert temperature_c to float and humidity_pct to int where possible.
- If a value cannot be converted (e.g. "N/A"), keep it as-is and let validation catch it.

Validate (models.py, validate.py)

Define a WeatherReading Pydantic model with:

- station: str, minimum length 1
- timestamp: str, non-empty
- temperature_c: float, between -90 and 60
- humidity_pct: int, between 0 and 100
- a @field_validator for station that strips whitespace and converts to title case

Store (database.py)

- Create weather.db with a weather_readings table.
- Use ON CONFLICT ... DO UPDATE SET (upsert) to handle duplicate (station, timestamp) pairs.
- Use parameterized queries (? placeholders) for all SQL operations.
- Store every incoming record in a raw_weather table before validation.

Orchestrate (pipeline.py)

Write a pipeline.py that orchestrates all the steps:

- fetch from the API and read the CSV
- store all raw records in the raw_weather table
- validate, then upsert valid records into the weather_readings table
- write invalid records to output/error_report.json
- print a summary like:

=== Pipeline Summary ===
API records fetched: 168
CSV records read: 10
Total raw records: 178
Valid records: 170
Invalid records: 8
Records in database: 170
Error report: output/error_report.json
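The fetch_with_retry function with exponential backoff described above could be sketched like this (the doubling sleep and the commented example call are one reasonable interpretation of the requirements, not the only valid one):

```python
import time
import requests

def fetch_with_retry(url, params=None, max_retries=3, timeout=10):
    """GET a URL, retrying transient errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller see the error
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...

# Example call with the assignment's parameters:
# fetch_with_retry(
#     "https://api.open-meteo.com/v1/forecast",
#     params={"latitude": 55.67, "longitude": 12.56,
#             "hourly": "temperature_2m,relative_humidity_2m",
#             "forecast_days": 7},
# )
```

Backing off exponentially (rather than retrying immediately) gives a struggling server room to recover and is the standard pattern for transient network errors.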
<aside> 📘 Core Program Refresher: This debugging task builds directly on what you practiced in Week 3's Using LLMs for Efficient Learning.
</aside>
Create an AI_DEBUG.md and document your AI-assisted debugging process.

Your project structure should look like this:
```
week3-assignment/
├── data/
│   └── weather_stations.csv
├── output/
│   └── error_report.json    (generated by pipeline)
├── models.py                (Pydantic WeatherReading model)
├── ingest_api.py            (API fetching with retry)
├── ingest_files.py          (CSV/JSON file reading)
├── validate.py              (batch validation function)
├── database.py              (SQLite create, upsert, query)
├── pipeline.py              (orchestrator)
├── weather.db               (generated by pipeline)
├── .env.example
├── .gitignore               (remember to exclude .env and weather.db!)
├── AI_DEBUG.md
└── requirements.txt
```
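The models.py entry above holds the WeatherReading model. A minimal sketch, assuming Pydantic v2 (which is where @field_validator lives):

```python
from pydantic import BaseModel, Field, field_validator

class WeatherReading(BaseModel):
    station: str = Field(min_length=1)
    timestamp: str = Field(min_length=1)
    temperature_c: float = Field(ge=-90, le=60)
    humidity_pct: int = Field(ge=0, le=100)

    @field_validator("station")
    @classmethod
    def normalize_station(cls, value: str) -> str:
        # Strip whitespace and convert to title case, per the spec
        return value.strip().title()
```

With this model, `WeatherReading(**row)` either returns a clean record or raises a ValidationError you can collect into the error report.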
<aside> ⚠️ Test with the CSV first. The CSV file is where most validation errors hide. Get the CSV → validate → database flow working before adding the API source.
</aside>
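The database half of that flow (the upsert with ? placeholders required above) could be sketched as follows. The function name upsert_reading and the exact column list are illustrative; the sketch assumes the table has a UNIQUE(station, timestamp) constraint, which the ON CONFLICT clause needs:

```python
import sqlite3

def upsert_reading(conn, reading):
    """Insert one validated reading, updating on duplicate (station, timestamp)."""
    conn.execute(
        """
        INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (station, timestamp) DO UPDATE SET
            temperature_c = excluded.temperature_c,
            humidity_pct = excluded.humidity_pct
        """,
        (reading["station"], reading["timestamp"],
         reading["temperature_c"], reading["humidity_pct"]),
    )
```

The ? placeholders let sqlite3 handle quoting, and `excluded` refers to the row that failed to insert, so a re-run of the pipeline updates rather than duplicates.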
<aside> 💡 This task is optional but gives you real-world API practice using your own cloud account.
</aside>
Use the Azure CLI to get an access token, then call the Azure Resource Manager REST API to list your resource groups:
```shell
# Get a token (run this in your terminal, not Python)
az login
az account get-access-token --query accessToken -o tsv
```
```python
import requests

token = "YOUR_TOKEN_HERE"  # paste from the command above
subscription_id = "YOUR_SUBSCRIPTION_ID"  # from the Azure Portal

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    "/resourcegroups?api-version=2024-03-01"
)
headers = {"Authorization": f"Bearer {token}"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
for rg in response.json()["value"]:
    print(f"Resource Group: {rg['name']}, Location: {rg['location']}")
```
This is the same pattern as the Open-Meteo API: requests.get() with headers, status check, JSON parsing. The only difference is authentication. Save the output as output/azure_resource_groups.json.
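Writing that file could be as simple as the sketch below; save_resource_groups is an illustrative helper, and its `data` argument stands in for response.json() from the request above:

```python
import json
import os

def save_resource_groups(data, path="output/azure_resource_groups.json"):
    """Write the ARM API response to disk, creating output/ if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
```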
<aside> ⚠️ Never commit your access token. Tokens expire after about an hour, but treat them like passwords.
</aside>
Tips:

- Create the data/weather_stations.csv file early. Design the messy data intentionally so you know what errors to expect.
- Add weather.db and output/ to your .gitignore. Generated files do not belong in version control.

<aside> 💡 Using AI to help: Before writing code, paste the Assignment Requirements (⚠️ Ensure no PII or sensitive company data is included!) together with the starter script into an LLM and ask it to help you break down the problem. Ask it how the different modules (API, File Reading, Validating, Database) should interact and flow into each other.
</aside>