Week 3 - Ingesting and Validating Data

🎒 Assignment: Build a Validated Ingestion Pipeline

The Scenario

You recently joined DataBridge Inc., a weather data aggregator. The company collects weather readings from multiple sources: APIs, CSV exports from weather stations, and JSON feeds from partner networks.

The previous engineer built a script that "mostly works." It fetches data from one API, writes it directly to a file, skips validation entirely, and crashes if anything goes wrong. Your manager has asked you to replace it with a proper ingestion pipeline that handles multiple sources, validates all data, and stores it in a database.

The Messy Starter Script

This is what you are replacing. Do not use this code directly. Study it, identify the problems, then build something better.

import requests
import json

# 1. Hardcoded URL, no error handling
data = requests.get("https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m").json()

# 2. No validation - trusts the API completely
readings = []
for i in range(len(data["hourly"]["time"])):
    readings.append({
        "time": data["hourly"]["time"][i],
        "temp": data["hourly"]["temperature_2m"][i],
    })

# 3. No database, dumps to a file
with open("weather.json", "w") as f:
    json.dump(readings, f)

print(f"Saved {len(readings)} readings")

Problems:

The Input Data

Source 1: Open-Meteo API

Fetch hourly weather data from the Open-Meteo API. This API is free and requires no API key.

Use these parameters:

Source 2: CSV File

Create a file data/weather_stations.csv with at least 10 rows of weather station data. Include intentional problems:

station,timestamp,temperature_c,humidity_pct
Copenhagen,2025-01-15T10:00,18.5,72
,2025-01-15T11:00,20.1,65
Aarhus,2025-01-15T12:00,N/A,58
Odense,2025-01-15T13:00,15.2,150
Copenhagen,2025-01-15T10:00,19.0,70
Aalborg,,14.8,62
Roskilde,2025-01-15T15:00,-95.0,45
Esbjerg,2025-01-15T16:00,16.3,
Herning,2025-01-15T17:00,17.1,55
Kolding,2025-01-15T18:00,14.9,68

Problems to handle: empty station names, N/A values, humidity over 100, duplicate records, missing timestamps, temperatures outside valid range, missing humidity values.
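One way to make those problems visible before validation is to normalise the raw CSV values first. The sketch below is a starting point under stated assumptions, not a reference solution: the helper names (`read_rows`, `find_duplicate_keys`), the set of markers treated as missing, and the choice of (station, timestamp) as the duplicate key are all illustrative. It uses an in-memory sample so it runs without the file, and it leaves range checks (humidity over 100, implausible temperatures) to your validation step.

```python
import csv
import io

# Markers treated as "missing" in this sketch (an assumption, extend as needed).
MISSING_MARKERS = {"", "N/A"}

# A few rows from the assignment CSV, inlined so the example is self-contained.
SAMPLE_CSV = """station,timestamp,temperature_c,humidity_pct
Copenhagen,2025-01-15T10:00,18.5,72
,2025-01-15T11:00,20.1,65
Aarhus,2025-01-15T12:00,N/A,58
Copenhagen,2025-01-15T10:00,19.0,70
Esbjerg,2025-01-15T16:00,16.3,
"""


def clean(value):
    """Return None for missing markers, otherwise the stripped string."""
    if value is None or value.strip() in MISSING_MARKERS:
        return None
    return value.strip()


def read_rows(text_stream):
    """Yield one dict per CSV row with missing markers replaced by None."""
    for row in csv.DictReader(text_stream):
        yield {key: clean(value) for key, value in row.items()}


def find_duplicate_keys(rows):
    """Return the (station, timestamp) keys that occur more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        key = (row["station"], row["timestamp"])
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


rows = list(read_rows(io.StringIO(SAMPLE_CSV)))
duplicates = find_duplicate_keys(rows)
```

To read the real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open("data/weather_stations.csv", newline="", encoding="utf-8")`.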

Requirements

Your pipeline must demonstrate the Week 3 concepts. Each task maps to a chapter you studied.

Task 1: Error Handling (Chapter 3)

Task 2: API Ingestion (Chapter 2)

Task 3: File Reading (Chapter 4)

Task 4: Pydantic Validation (Chapter 5)

Task 5: Database Storage (Chapter 6)

Task 6: Pipeline Orchestration

  1. Fetch from API (with retry)
  2. Read from CSV
  3. Store raw data in raw_weather table
  4. Validate all records with Pydantic
  5. Upsert valid records into weather_readings table
  6. Save error report as output/error_report.json
  7. Print a summary of the pipeline run, for example:
  === Pipeline Summary ===
  API records fetched: 168
  CSV records read: 10
  Total raw records: 178
  Valid records: 170
  Invalid records: 8
  Records in database: 170
  Error report: output/error_report.json
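Step 5 can be sketched with SQLite's `INSERT ... ON CONFLICT DO UPDATE` clause. This is a minimal illustration, not the required schema: the column names are borrowed from the CSV header, and the composite primary key on (station, timestamp) is an assumption about what makes a reading unique. It uses an in-memory database so it can run anywhere; the assignment itself writes to weather.db.

```python
import sqlite3

# In-memory database for the sketch; the pipeline would connect to "weather.db".
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE weather_readings (
        station       TEXT NOT NULL,
        timestamp     TEXT NOT NULL,
        temperature_c REAL NOT NULL,
        humidity_pct  REAL,
        PRIMARY KEY (station, timestamp)  -- assumed uniqueness rule
    )
    """
)

# On a key collision, overwrite the measurements with the incoming values.
UPSERT_SQL = """
INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
VALUES (:station, :timestamp, :temperature_c, :humidity_pct)
ON CONFLICT (station, timestamp) DO UPDATE SET
    temperature_c = excluded.temperature_c,
    humidity_pct  = excluded.humidity_pct
"""

# Hypothetical already-validated records; the first two share a key.
valid_records = [
    {"station": "Copenhagen", "timestamp": "2025-01-15T10:00", "temperature_c": 18.5, "humidity_pct": 72},
    {"station": "Copenhagen", "timestamp": "2025-01-15T10:00", "temperature_c": 19.0, "humidity_pct": 70},
    {"station": "Herning", "timestamp": "2025-01-15T17:00", "temperature_c": 17.1, "humidity_pct": 55},
]

# "with conn" commits on success and rolls back if any statement fails.
with conn:
    conn.executemany(UPSERT_SQL, valid_records)

row_count = conn.execute("SELECT COUNT(*) FROM weather_readings").fetchone()[0]
copenhagen_temp = conn.execute(
    "SELECT temperature_c FROM weather_readings WHERE station = ?", ("Copenhagen",)
).fetchone()[0]
```

With this key, the two Copenhagen rows collapse into one and the later value wins, so the table ends up with two rows. Under that assumption, the "Records in database" count in your summary can be lower than the "Valid records" count whenever duplicates pass validation.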

Task 7: Azure CLI and Portal

Every REST API call so far in this assignment goes to a free, public endpoint. Real data engineering work often talks to authenticated services such as Azure instead. In this task you use the Azure CLI to get a token, call the Azure Resource Manager (ARM) API with it, and cross-check the result in the portal.

CLI steps (run in your terminal):

az login
az account show --output table
az group list --output table

You will see the class resource group rg-hyf-students-readonly in the output. Note its name and region.

<aside> 💡 Azure shows only the resource groups your account has access to, so the list will be short. That is expected: you are looking for rg-hyf-students-readonly.

</aside>

Portal step:

Open the Azure Portal, navigate to Resource groups, find rg-hyf-students-readonly, and record its region and subscription ID. You will need the subscription ID in the Python step.

Python step:

import requests

# Paste the output of: az account get-access-token --query accessToken -o tsv
token = "YOUR_TOKEN_HERE"
subscription_id = "YOUR_SUBSCRIPTION_ID"  # from the portal step above

# List the resource groups in the subscription through the ARM REST API.
url = f"https://management.azure.com/subscriptions/{subscription_id}/resourcegroups?api-version=2024-03-01"
headers = {"Authorization": f"Bearer {token}"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail loudly on 401/403 instead of parsing an error body

for rg in response.json()["value"]:
    print(f"{rg['name']}: {rg['location']}")

Save the full JSON response as output/azure_resource_groups.json.
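A small helper can handle the save step and create the output folder if it does not exist yet. The sketch below is one possible shape: `save_json` is a hypothetical name, and the payload is a hand-written stand-in that only mimics the "value" list the loop above reads, not a real Azure response. It writes into a temporary directory so the example leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def save_json(payload, path):
    """Write payload as indented UTF-8 JSON, creating parent folders as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return target


# Stand-in payload for illustration only; in the task this is response.json().
payload = {"value": [{"name": "rg-hyf-students-readonly", "location": "example-region"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_json(payload, Path(tmp_dir) / "output" / "azure_resource_groups.json")
    # Read the file back to confirm nothing was lost on the way to disk.
    round_trip = json.loads(saved_path.read_text(encoding="utf-8"))
```

In the task itself you would call `save_json(response.json(), "output/azure_resource_groups.json")`.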

Then write three sentences in output/azure_compare.md, one for each of these topics:

  1. Auth: how the Azure call authenticates compared to Open-Meteo (Bearer token vs nothing).
  2. Schema verbosity: how nested the Azure response is compared to Open-Meteo's flat columnar arrays.
  3. api-version in the URL: why Azure pins the API version and what would happen if you left it off.

<aside> ⚠️ Never commit your access token. Tokens expire after about an hour, but treat them like passwords.

</aside>
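One way to keep the token out of your source files is to read it from an environment variable at run time. This is a sketch of that idea, not a required approach: the variable name `AZURE_ACCESS_TOKEN` and the helper `build_auth_headers` are assumptions made for illustration.

```python
import os


def build_auth_headers(env_var="AZURE_ACCESS_TOKEN"):
    """Build the Authorization header from an environment variable.

    Raises RuntimeError when the variable is unset or blank, so a missing
    token fails early instead of producing a confusing 401 later.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Export a fresh token before running the script."
        )
    return {"Authorization": f"Bearer {token}"}
```

Before running the script, you could set the variable in your shell from the output of `az account get-access-token --query accessToken -o tsv`, then pass `headers=build_auth_headers()` to `requests.get`. The token then lives only in your shell session and never in a committed file.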

Task 8: AI-Assisted Debugging

<aside> 📘 Core Program Refresher: This debugging task builds directly on what you practiced in Week 3's Using LLMs for Efficient Learning.

</aside>

  1. While building your pipeline, you will encounter at least one bug. (If not, introduce one intentionally.)
  2. Ask an LLM (ChatGPT, Claude, etc.) to help you debug it.
  3. Create AI_DEBUG.md and document:

Deliverables

Your project structure should look like this:

week3-assignment/
├── data/
│   └── weather_stations.csv
├── output/
│   ├── error_report.json          (generated by pipeline)
│   ├── azure_resource_groups.json (Task 7)
│   └── azure_compare.md           (Task 7)
├── models.py                      (Pydantic WeatherReading model)
├── ingest_api.py                  (API fetching with retry)
├── ingest_files.py                (CSV/JSON file reading)
├── validate.py                    (batch validation function)
├── database.py                    (SQLite create, upsert, query)
├── pipeline.py                    (orchestrator)
├── weather.db                     (generated by pipeline)
├── .env.example
├── .gitignore                     (remember to exclude .env and weather.db!)
├── AI_DEBUG.md
└── requirements.txt

<aside> ⚠️ Test with the CSV first. The CSV file is where most validation errors hide. Get the CSV → validate → database flow working before adding the API source.

</aside>

How you will be evaluated

Your teacher will review each task end-to-end. The dimensions below are what they look at; the exact weighting and reference answers are intentionally kept out of the student view so that an LLM cannot reverse-engineer a passing submission.