Week 3 - Ingesting and Validating Data
Assignment: Build a Validated Ingestion Pipeline
You recently joined DataBridge Inc., a weather data aggregator. The company collects weather readings from multiple sources - APIs, CSV exports from weather stations, and JSON feeds from partner networks.
The previous engineer built a script that "mostly works." It fetches data from one API, writes it directly to a file, skips validation entirely, and crashes if anything goes wrong. Your manager has asked you to replace it with a proper ingestion pipeline that handles multiple sources, validates all data, and stores it in a database.
This is what you are replacing. Do not use this code directly. Study it, identify the problems, then build something better.
import requests
import json

# 1. Hardcoded URL, no error handling
data = requests.get("https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m").json()

# 2. No validation - trusts the API completely
readings = []
for i in range(len(data["hourly"]["time"])):
    readings.append({
        "time": data["hourly"]["time"][i],
        "temp": data["hourly"]["temperature_2m"][i],
    })

# 3. No database, dumps to a file
with open("weather.json", "w") as f:
    json.dump(readings, f)

print(f"Saved {len(readings)} readings")
Problems:
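- Hardcoded URL and no error handling: any failed request crashes the script
- No validation: the API response is trusted completely (what if a temperature is null or "N/A"?)
- No database: the readings are dumped straight to a file
Fetch hourly weather data from the Open-Meteo API. This API is free and requires no API key.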
Use these parameters:
- latitude: 55.67 (Copenhagen)
- longitude: 12.56
- hourly: temperature_2m,relative_humidity_2m
- forecast_days: 7
Create a file data/weather_stations.csv with at least 10 rows of weather station data. Include intentional problems:
station,timestamp,temperature_c,humidity_pct
Copenhagen,2025-01-15T10:00,18.5,72
,2025-01-15T11:00,20.1,65
Aarhus,2025-01-15T12:00,N/A,58
Odense,2025-01-15T13:00,15.2,150
Copenhagen,2025-01-15T10:00,19.0,70
Aalborg,,14.8,62
Roskilde,2025-01-15T15:00,-95.0,45
Esbjerg,2025-01-15T16:00,16.3,
Herning,2025-01-15T17:00,17.1,55
Kolding,2025-01-15T18:00,14.9,68
Problems to handle: empty station names, N/A values, humidity over 100, duplicate records, missing timestamps, temperatures outside valid range, missing humidity values.
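One way to load rows like these without crashing on values such as N/A is to attempt each conversion and fall back to the original string. The following is a minimal sketch, not a reference solution: the helper names convert_or_keep and read_rows are hypothetical, and the three sample rows are copied from the CSV above.

```python
import csv
import io

# Three rows copied from the sample CSV: one clean, one with N/A, one with a missing value.
SAMPLE = """station,timestamp,temperature_c,humidity_pct
Copenhagen,2025-01-15T10:00,18.5,72
Aarhus,2025-01-15T12:00,N/A,58
Esbjerg,2025-01-15T16:00,16.3,
"""


def convert_or_keep(value, caster):
    """Return caster(value), or the original value if the conversion fails."""
    try:
        return caster(value)
    except (TypeError, ValueError):
        return value


def read_rows(handle):
    """Read CSV rows as dicts, converting the numeric columns where possible."""
    rows = []
    for row in csv.DictReader(handle):
        row["temperature_c"] = convert_or_keep(row["temperature_c"], float)
        row["humidity_pct"] = convert_or_keep(row["humidity_pct"], int)
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

For a file on disk, the same function can be passed a handle opened with open("data/weather_stations.csv", newline="", encoding="utf-8"). Values that could not be converted stay as strings, so a later validation step can report them instead of the reader crashing.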
Your pipeline must demonstrate the Week 3 concepts. Each task maps to a chapter you studied.
- Write a fetch_with_retry function that retries on transient errors with exponential backoff
- Give it a max_retries parameter (default 3) and a timeout for each request
- Convert the API response into records with the fields station, timestamp, temperature_c, humidity_pct
- Set station to "Open-Meteo Copenhagen" for API records

- Read the CSV with csv.DictReader (as you learned in Week 1, Chapter 9: File Operations)
- Produce records with the same fields: station, timestamp, temperature_c, humidity_pct
- Convert temperature_c to float and humidity_pct to int where possible
- If a value cannot be converted (for example "N/A"), keep it as-is and let validation catch it

- Define a WeatherReading Pydantic model with:
  - station: str, minimum length 1
  - timestamp: str, non-empty
  - temperature_c: float, between -90 and 60
  - humidity_pct: int, between 0 and 100
- Add a @field_validator for station that strips whitespace and converts to title case

- Create a SQLite database weather.db with a weather_readings table
- Use ON CONFLICT ... DO UPDATE SET (upsert) to handle duplicate (station, timestamp) pairs
- Use parameterized queries (? placeholders) for all SQL operations
- Store raw records in a raw_weather table before validation

- Write a pipeline.py that orchestrates all the steps:
  - Store the raw records in the raw_weather table
  - Upsert the valid records into the weather_readings table
  - Write the invalid records to output/error_report.json

=== Pipeline Summary ===
API records fetched: 168
CSV records read: 10
Total raw records: 178
Valid records: 170
Invalid records: 8
Records in database: 170
Error report: output/error_report.json
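The retry requirement in the task list can be sketched independently of any HTTP library. This is a minimal illustration under stated assumptions, not the expected fetch_with_retry: TransientError is a hypothetical stand-in for timeouts, connection errors, and 5xx responses, call_with_retry and base_delay are made-up names, and max_retries is read here as "retries after the first attempt". The sleep function is injectable so the demo records the delays instead of waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a timeout, connection error, or 5xx response."""


def call_with_retry(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on TransientError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults


# Demo: an operation that fails twice, then succeeds.
calls = {"count": 0}
delays = []  # collects the delays that would have been slept


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return {"status": "ok"}


result = call_with_retry(flaky, max_retries=3, sleep=delays.append)
print(result, delays)
```

In a real fetch function the operation would wrap a requests.get call with a timeout argument, and only errors that are worth retrying (for example requests.exceptions.Timeout or a 5xx status) would be translated into the retryable case. A 4xx response usually signals a bad request and should fail immediately.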
Every REST API call in this assignment goes to a free, public endpoint. Real data engineering work talks to authenticated Azure services instead. In this task you use the Azure CLI to get a token, call the Azure Resource Manager (ARM) API with it, and cross-check the result in the portal.
CLI steps (run in your terminal):
az login
az account show --output table
az group list --output table
You will see the class resource group rg-hyf-students-readonly in the output. Note its name and region.
<aside>
💡 Azure shows only the resource groups your account has access to, so the list will be short. That is expected: you are looking for rg-hyf-students-readonly.
</aside>
Portal step:
Open the Azure Portal, navigate to Resource groups, find rg-hyf-students-readonly, and record its region and subscription ID: you will use them in the next step.
Python step:
import requests

token = "YOUR_TOKEN_HERE"  # az account get-access-token --query accessToken -o tsv
subscription_id = "YOUR_SUBSCRIPTION_ID"  # from the portal above

url = f"https://management.azure.com/subscriptions/{subscription_id}/resourcegroups?api-version=2024-03-01"
headers = {"Authorization": f"Bearer {token}"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

# Print each resource group with its region
for rg in response.json()["value"]:
    print(f"{rg['name']}: {rg['location']}")
Save the full JSON response as output/azure_resource_groups.json.
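Saving the response can be sketched as follows. The helper name save_json is hypothetical, and the example payload is a placeholder that only mirrors the top-level "value" list the loop above iterates over; it is not real Azure output. The demo writes to a temporary folder so it does not touch your project.

```python
import json
import tempfile
from pathlib import Path


def save_json(payload, path):
    """Write payload to path as indented JSON, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target


# Placeholder payload (not real API output).
example = {"value": [{"name": "rg-hyf-students-readonly", "location": "<region>"}]}

with tempfile.TemporaryDirectory() as tmp:
    written = save_json(example, Path(tmp) / "output" / "azure_resource_groups.json")
    restored = json.loads(written.read_text(encoding="utf-8"))

print(restored == example)
```

In the Python step this would be called as save_json(response.json(), "output/azure_resource_groups.json") after raise_for_status() has passed.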
Then write 3 sentences in output/azure_compare.md covering:
- api-version in the URL: why Azure pins the API version and what would happen if you left it off.
<aside> ⚠️ Never commit your access token. Tokens expire after about an hour, but treat them like passwords.
</aside>
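One way to keep the token out of your source files is to read it from an environment variable at runtime. This is a sketch under assumptions: the variable name AZURE_ACCESS_TOKEN and the helpers get_token and auth_headers are made up for illustration, and the demo passes a fake environment instead of a real token.

```python
import os


def get_token(env=os.environ, name="AZURE_ACCESS_TOKEN"):
    """Return the token stored under name, or raise with a hint if it is missing."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set. Export it first, for example: "
            f"export {name}=$(az account get-access-token --query accessToken -o tsv)"
        )
    return token


def auth_headers(token):
    """Build the Authorization header used by the ARM request."""
    return {"Authorization": f"Bearer {token}"}


# Demo with a fake environment so no real token appears in the code.
headers = auth_headers(get_token(env={"AZURE_ACCESS_TOKEN": "example-token"}))
print(headers)
```

Calling get_token() with no arguments reads from the real environment. If you keep the value in a .env file instead, that file belongs in .gitignore, with only a placeholder entry in .env.example.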
<aside> 📘 Core Program Refresher: This debugging task builds directly on what you practiced in Week 3's Using LLMs for Efficient Learning.
</aside>
AI_DEBUG.md and document:
Your project structure should look like this:
week3-assignment/
├── data/
│   └── weather_stations.csv
├── output/
│   ├── error_report.json           (generated by pipeline)
│   ├── azure_resource_groups.json  (Task 7)
│   └── azure_compare.md            (Task 7)
├── models.py                       (Pydantic WeatherReading model)
├── ingest_api.py                   (API fetching with retry)
├── ingest_files.py                 (CSV/JSON file reading)
├── validate.py                     (batch validation function)
├── database.py                     (SQLite create, upsert, query)
├── pipeline.py                     (orchestrator)
├── weather.db                      (generated by pipeline)
├── .env.example
├── .gitignore                      (remember to exclude .env and weather.db!)
├── AI_DEBUG.md
└── requirements.txt
<aside> ⚠️ Test with the CSV first. The CSV file is where most validation errors hide. Get the CSV → validate → database flow working before adding the API source.
</aside>
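A quick way to check the database end of that flow is an in-memory SQLite smoke test. The sketch below is one possible table layout, not a required schema: the column types and the composite primary key on (station, timestamp) are assumptions. It inserts two rows from the sample CSV that share a key plus one distinct row, and shows that the upsert leaves a single row per key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the smoke test
conn.execute(
    """
    CREATE TABLE weather_readings (
        station TEXT NOT NULL,
        timestamp TEXT NOT NULL,
        temperature_c REAL NOT NULL,
        humidity_pct INTEGER NOT NULL,
        PRIMARY KEY (station, timestamp)
    )
    """
)

# Parameterized upsert: a repeated (station, timestamp) overwrites the earlier values.
UPSERT = """
    INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (station, timestamp) DO UPDATE SET
        temperature_c = excluded.temperature_c,
        humidity_pct = excluded.humidity_pct
"""

rows = [
    ("Copenhagen", "2025-01-15T10:00", 18.5, 72),
    ("Copenhagen", "2025-01-15T10:00", 19.0, 70),  # same key as the row above
    ("Herning", "2025-01-15T17:00", 17.1, 55),
]
with conn:  # commits on success, rolls back on error
    conn.executemany(UPSERT, rows)

count = conn.execute("SELECT COUNT(*) FROM weather_readings").fetchone()[0]
copenhagen = conn.execute(
    "SELECT temperature_c, humidity_pct FROM weather_readings WHERE station = ?",
    ("Copenhagen",),
).fetchone()
print(count, copenhagen)
```

Three rows go in and two remain, with the later Copenhagen values winning. This is why the number of records in the database can be lower than the number of valid records when the input contains duplicates.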
Your teacher will review each task end-to-end. The dimensions below are what they look at; the exact weighting and reference answers are intentionally kept out of the student view so that an LLM cannot reverse-engineer a passing submission.