🎒 Assignment: Build a Validated Ingestion Pipeline

The Scenario

You recently joined DataBridge Inc., a weather data aggregator. The company collects weather readings from multiple sources - APIs, CSV exports from weather stations, and JSON feeds from partner networks.

The previous engineer built a script that "mostly works." It fetches data from one API, writes it directly to a file, skips validation entirely, and crashes if anything goes wrong. Your manager has asked you to replace it with a proper ingestion pipeline that handles multiple sources, validates all data, and stores it in a database.

The Messy Starter Script

This is what you are replacing. Do not use this code directly. Study it, identify the problems, then build something better.

import requests
import json

# 1. Hardcoded URL, no error handling
data = requests.get("https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m").json()

# 2. No validation - trusts the API completely
readings = []
for i in range(len(data["hourly"]["time"])):
    readings.append({
        "time": data["hourly"]["time"][i],
        "temp": data["hourly"]["temperature_2m"][i],
    })

# 3. No database, dumps to a file
with open("weather.json", "w") as f:
    json.dump(readings, f)

print(f"Saved {len(readings)} readings")

Problems:

Hardcoded URL and parameters
No retry logic for network failures
No validation (what if temperature is null or "N/A"?)
No database - every run overwrites weather.json, so history is lost
No error reporting
Only one data source

The Input Data

Source 1: Open-Meteo API

Fetch hourly weather data from the Open-Meteo API. This API is free and requires no API key.

Use these parameters:

latitude: 55.67 (Copenhagen)
longitude: 12.56
hourly: temperature_2m,relative_humidity_2m
forecast_days: 7

Source 2: CSV File

Create a file data/weather_stations.csv with at least 10 rows of weather station data. Include intentional problems:

station,timestamp,temperature_c,humidity_pct
Copenhagen,2025-01-15T10:00,18.5,72
,2025-01-15T11:00,20.1,65
Aarhus,2025-01-15T12:00,N/A,58
Odense,2025-01-15T13:00,15.2,150
Copenhagen,2025-01-15T10:00,19.0,70
Aalborg,,14.8,62
Roskilde,2025-01-15T15:00,-95.0,45
Esbjerg,2025-01-15T16:00,16.3,
Herning,2025-01-15T17:00,17.1,55
Kolding,2025-01-15T18:00,14.9,68

Problems to handle: empty station names, N/A values, humidity over 100, duplicate records, missing timestamps, temperatures outside valid range, missing humidity values.

Requirements

Your pipeline must demonstrate the Week 3 concepts. Each task maps to a chapter you studied.

Task 1: Error Handling (Chapter 3)

Write a fetch_with_retry function that retries on transient errors with exponential backoff
Classify errors: retry on connection errors and 5xx status codes, fail immediately on 4xx
Set a max_retries parameter (default 3) and a timeout for each request
Log each retry attempt with the error details
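
The retry rules above can be sketched without any network access. This is a minimal illustration, not a reference solution: TransientError, classify_status, and call_with_retry are hypothetical names, and the sleep parameter exists only so the delays can be observed. A real fetch_with_retry would wrap an HTTP call with a timeout and turn connection errors and 5xx responses into the retryable case.

```python
import logging
import time
from typing import Callable, TypeVar

logger = logging.getLogger("ingest")
T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (connection errors, 5xx)."""


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to 'retry' (5xx), 'fail' (4xx), or 'ok' (anything else)."""
    if 500 <= status_code <= 599:
        return "retry"
    if 400 <= status_code <= 499:
        return "fail"
    return "ok"


def call_with_retry(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on TransientError with exponential backoff."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            logger.warning(
                "Attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay
            )
            sleep(delay)
    raise AssertionError("loop always returns or raises")
```

Any other exception type passes straight through, which is how a 4xx response would fail immediately in this design.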

Task 2: API Ingestion (Chapter 2)

Fetch weather data from the Open-Meteo API using your retry function
Transform the API response into a list of flat dictionaries with keys: station, timestamp, temperature_c, humidity_pct
Set station to "Open-Meteo Copenhagen" for API records
Handle the case where the API returns no data gracefully (return an empty list, do not crash)
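
The transformation step can be illustrated as follows. This sketch assumes the response keeps the hourly object of parallel arrays that the starter script reads, with one array named after each requested variable; flatten_hourly is a hypothetical helper name.

```python
from typing import Any

STATION_NAME = "Open-Meteo Copenhagen"


def flatten_hourly(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn parallel hourly arrays into one flat dict per timestamp."""
    hourly = payload.get("hourly") or {}
    times = hourly.get("time") or []
    temps = hourly.get("temperature_2m") or []
    humidity = hourly.get("relative_humidity_2m") or []
    # zip stops at the shortest array, so a missing column yields no records
    return [
        {
            "station": STATION_NAME,
            "timestamp": timestamp,
            "temperature_c": temp,
            "humidity_pct": hum,
        }
        for timestamp, temp, hum in zip(times, temps, humidity)
    ]
```

An empty or unexpected payload produces an empty list instead of a KeyError, and individual null values pass through unchanged so that validation can reject them later.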

Task 3: File Reading (Chapter 4)

Read the CSV file using csv.DictReader (as you learned in Week 1, Chapter 9: File Operations)
Normalize the CSV records to the same dictionary format as the API records: station, timestamp, temperature_c, humidity_pct
Handle type conversion: CSV values are strings, convert temperature_c to float and humidity_pct to int where possible
If a value cannot be converted (like "N/A"), keep it as-is and let validation catch it
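
One way to picture the "convert where possible, otherwise keep as-is" rule is the sketch below. The helper names convert_or_keep and read_station_rows are hypothetical, and the sample reads from an in-memory string so it runs without the CSV file.

```python
import csv
import io
from typing import Any, Callable


def convert_or_keep(value: Any, caster: Callable[[Any], Any]) -> Any:
    """Return caster(value), or the original value if conversion fails."""
    try:
        return caster(value)
    except (ValueError, TypeError):
        return value


def read_station_rows(handle) -> list[dict[str, Any]]:
    """Read CSV rows and normalize them to the shared record format."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "station": row["station"],
            "timestamp": row["timestamp"],
            "temperature_c": convert_or_keep(row["temperature_c"], float),
            "humidity_pct": convert_or_keep(row["humidity_pct"], int),
        })
    return rows


sample = io.StringIO(
    "station,timestamp,temperature_c,humidity_pct\n"
    "Copenhagen,2025-01-15T10:00,18.5,72\n"
    "Aarhus,2025-01-15T12:00,N/A,58\n"
    "Esbjerg,2025-01-15T16:00,16.3,\n"
)
rows = read_station_rows(sample)
```

For the real file, the handle would come from open("data/weather_stations.csv", newline="", encoding="utf-8"). The unconverted "N/A" and empty strings stay in the records on purpose, so the validation step is the single place where bad values are reported.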

Task 4: Pydantic Validation (Chapter 5)

Create a WeatherReading Pydantic model with:
station: str, minimum length 1
timestamp: str, non-empty
temperature_c: float, between -90 and 60
humidity_pct: int, between 0 and 100
Add a @field_validator for station that strips whitespace and converts to title case
Validate all records (from both sources) and separate into valid and invalid lists
Generate an error report: for each invalid record, capture the index, source, raw data, and Pydantic error details
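
The shape of the valid/invalid split and the error report can be sketched independently of the model. Everything here is an assumption made for illustration: split_valid_invalid is a hypothetical helper, each record is assumed to carry a "source" key, and check_humidity is a plain stand-in validator, not the Pydantic model the task asks for. In the real pipeline the validate argument would be the model's own validation call, and the error entry would hold Pydantic's structured error details instead of a string.

```python
from typing import Any, Callable


def split_valid_invalid(
    records: list[dict[str, Any]],
    validate: Callable[[dict[str, Any]], Any],
) -> tuple[list[Any], list[dict[str, Any]]]:
    """Validate every record; return (valid results, error report entries)."""
    valid, errors = [], []
    for index, record in enumerate(records):
        # Keep the source label out of the data handed to the validator
        raw = {key: value for key, value in record.items() if key != "source"}
        try:
            valid.append(validate(raw))
        except ValueError as exc:
            errors.append({
                "index": index,
                "source": record.get("source", "unknown"),
                "raw": raw,
                "errors": str(exc),
            })
    return valid, errors


def check_humidity(record: dict[str, Any]) -> dict[str, Any]:
    """Stand-in validator used only for this illustration."""
    humidity = record["humidity_pct"]
    if not isinstance(humidity, int) or not 0 <= humidity <= 100:
        raise ValueError("humidity_pct must be an integer between 0 and 100")
    return record
```

One bad record never stops the batch: it is recorded with its position and origin, and the loop moves on to the next record.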

Task 5: Database Storage (Chapter 6)

Create a SQLite database weather.db with a weather_readings table
Use ON CONFLICT ... DO UPDATE SET (upsert) to handle duplicate (station, timestamp) pairs; this requires a UNIQUE constraint or primary key on those two columns
Use parameterized queries (? placeholders) for all SQL operations
Store the raw ingested data in a separate raw_weather table before validation
After inserting, query the database and print the total row count
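
The upsert behaviour can be seen in a small in-memory sketch. The column types and the helper names upsert_readings and count_readings are assumptions for illustration; the raw_weather table is left out to keep the example short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS weather_readings (
    station       TEXT NOT NULL,
    timestamp     TEXT NOT NULL,
    temperature_c REAL NOT NULL,
    humidity_pct  INTEGER NOT NULL,
    UNIQUE (station, timestamp)
)
"""

UPSERT = """
INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
VALUES (?, ?, ?, ?)
ON CONFLICT (station, timestamp) DO UPDATE SET
    temperature_c = excluded.temperature_c,
    humidity_pct  = excluded.humidity_pct
"""


def upsert_readings(conn: sqlite3.Connection, readings: list[dict]) -> None:
    """Insert readings, overwriting rows that share (station, timestamp)."""
    params = [
        (r["station"], r["timestamp"], r["temperature_c"], r["humidity_pct"])
        for r in readings
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, params)


def count_readings(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM weather_readings").fetchone()[0]


conn = sqlite3.connect(":memory:")  # the assignment uses weather.db instead
conn.execute(SCHEMA)
upsert_readings(conn, [
    {"station": "Copenhagen", "timestamp": "2025-01-15T10:00", "temperature_c": 18.5, "humidity_pct": 72},
    {"station": "Copenhagen", "timestamp": "2025-01-15T10:00", "temperature_c": 19.0, "humidity_pct": 70},
    {"station": "Herning", "timestamp": "2025-01-15T17:00", "temperature_c": 17.1, "humidity_pct": 55},
])
total = count_readings(conn)
```

The two Copenhagen rows collapse into one, with the later values winning, so the row count in the database can be smaller than the number of valid records.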

Task 6: Pipeline Orchestration

Create a pipeline.py that orchestrates all the steps:

Fetch from API (with retry)
Read from CSV
Store raw data in raw_weather table
Validate all records with Pydantic
Upsert valid records into weather_readings table
Save error report as output/error_report.json
Print a summary of the pipeline run

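The summary should include the following lines. The counts shown are illustrative: yours depend on the rows in your CSV, and the database count can be lower than the valid count because the upsert merges duplicate (station, timestamp) pairs.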

  === Pipeline Summary ===
  API records fetched: 168
  CSV records read: 10
  Total raw records: 178
  Valid records: 170
  Invalid records: 8
  Records in database: 170
  Error report: output/error_report.json

Task 7: Azure CLI and Portal

Every REST API call so far in this assignment goes to a free, public endpoint. Real data engineering work often talks to authenticated services such as Azure instead. In this task you use the Azure CLI to get a token, call the Azure Resource Manager (ARM) API with it, and cross-check the result in the portal.

CLI steps (run in your terminal):

az login
az account show --output table
az group list --output table

You will see the class resource group rg-hyf-students-readonly in the output. Note its name and region.

<aside> 💡 Azure shows only the resource groups your account has access to, so the list will be short. That is expected: you are looking for rg-hyf-students-readonly.

</aside>

Portal step:

Open the Azure Portal, navigate to Resource groups, find rg-hyf-students-readonly, and record its region and subscription ID: you will use them in the next step.

Python step:

import requests

token = "YOUR_TOKEN_HERE"  # az account get-access-token --query accessToken -o tsv
subscription_id = "YOUR_SUBSCRIPTION_ID"  # from the portal above

url = f"https://management.azure.com/subscriptions/{subscription_id}/resourcegroups?api-version=2024-03-01"
headers = {"Authorization": f"Bearer {token}"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

for rg in response.json()["value"]:
    print(f"{rg['name']}: {rg['location']}")

Save the full JSON response as output/azure_resource_groups.json.
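
Saving a response can be as small as the sketch below. save_json is a hypothetical helper name; it creates the output directory if it does not exist yet, and the same approach would work for output/error_report.json.

```python
import json
from pathlib import Path
from typing import Any


def save_json(payload: Any, path) -> Path:
    """Write payload as indented JSON, creating parent folders as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    return target
```

With the Python step above, the call would look like save_json(response.json(), "output/azure_resource_groups.json").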

Then write 3 sentences in output/azure_compare.md covering:

Auth: how the Azure call authenticates compared to Open-Meteo (Bearer token vs nothing).
Schema verbosity: how nested the Azure response is compared to Open-Meteo's flat columnar arrays.
api-version in the URL: why Azure pins the API version and what would happen if you left it off.

<aside> ⚠️ Never commit your access token. Tokens expire after about an hour, but treat them like passwords.

</aside>
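
One way to keep the token out of your source files is to read it from an environment variable. This is a sketch under assumptions: the variable name AZURE_ACCESS_TOKEN and the helper load_token are hypothetical choices, not something the Azure CLI sets for you.

```python
import os


def load_token(env_var: str = "AZURE_ACCESS_TOKEN") -> str:
    """Read the access token from the environment; fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} first, for example with the output of "
            "`az account get-access-token --query accessToken -o tsv`."
        )
    return token
```

The script then calls load_token() instead of assigning a pasted string, so nothing secret ends up in a commit.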

Task 8: AI-Assisted Debugging

<aside> 📘 Core Program Refresher: This debugging task builds directly on what you practiced in Week 3's Using LLMs for Efficient Learning.

</aside>

While building your pipeline, you will encounter at least one bug. (If not, introduce one intentionally.)
Ask an LLM (ChatGPT, Claude, etc.) to help you debug it.
Create AI_DEBUG.md and document:

The Error: What went wrong? (Paste the traceback)
The Prompt: What did you ask the AI?
The Solution: What did the AI suggest? Did it work?
Reflection: Did you understand why it was broken?

Deliverables

Your project structure should look like this:

week3-assignment/
├── data/
│   └── weather_stations.csv
├── output/
│   ├── error_report.json          (generated by pipeline)
│   ├── azure_resource_groups.json (Task 7)
│   └── azure_compare.md           (Task 7)
├── models.py                      (Pydantic WeatherReading model)
├── ingest_api.py                  (API fetching with retry)
├── ingest_files.py                (CSV/JSON file reading)
├── validate.py                    (batch validation function)
├── database.py                    (SQLite create, upsert, query)
├── pipeline.py                    (orchestrator)
├── weather.db                     (generated by pipeline)
├── .env.example
├── .gitignore                     (remember to exclude .env and weather.db!)
├── AI_DEBUG.md
└── requirements.txt

<aside> ⚠️ Test with the CSV first. The CSV file is where most validation errors hide. Get the CSV → validate → database flow working before adding the API source.

</aside>

How you will be evaluated

Your teacher will review each task end-to-end. The dimensions below are what they look at; the exact weighting and reference answers are intentionally kept out of the student view so that an LLM cannot reverse-engineer a passing submission.