Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

🎒 Assignment: Build a Validated Ingestion Pipeline

The Scenario

You recently joined DataBridge Inc., a weather data aggregator. The company collects weather readings from multiple sources - APIs, CSV exports from weather stations, and JSON feeds from partner networks.

The previous engineer built a script that "mostly works." It fetches data from one API, writes it directly to a file, skips validation entirely, and crashes if anything goes wrong. Your manager has asked you to replace it with a proper ingestion pipeline that handles multiple sources, validates all data, and stores it in a database.

The Messy Starter Script

This is what you are replacing. Do not use this code directly. Study it, identify the problems, then build something better.

import requests
import json

# 1. Hardcoded URL, no error handling
data = requests.get("https://api.open-meteo.com/v1/forecast?latitude=55.67&longitude=12.56&hourly=temperature_2m").json()

# 2. No validation - trusts the API completely
readings = []
for i in range(len(data["hourly"]["time"])):
    readings.append({
        "time": data["hourly"]["time"][i],
        "temp": data["hourly"]["temperature_2m"][i],
    })

# 3. No database, dumps to a file
with open("weather.json", "w") as f:
    json.dump(readings, f)

print(f"Saved {len(readings)} readings")

Problems:

  1. Hardcoded URL and no error handling - one network failure crashes the script
  2. No validation - the API response is trusted completely
  3. No database - everything is dumped to a flat file

The Input Data

Source 1: Open-Meteo API

Fetch hourly weather data from the Open-Meteo API. This API is free and requires no API key.

Use these parameters:

Source 2: CSV File

Create a file data/weather_stations.csv with at least 10 rows of weather station data. Include intentional problems:

station,timestamp,temperature_c,humidity_pct
Copenhagen,2025-01-15T10:00,18.5,72
,2025-01-15T11:00,20.1,65
Aarhus,2025-01-15T12:00,N/A,58
Odense,2025-01-15T13:00,15.2,150
Copenhagen,2025-01-15T10:00,19.0,70
Aalborg,,14.8,62
Roskilde,2025-01-15T15:00,-95.0,45
Esbjerg,2025-01-15T16:00,16.3,
Herning,2025-01-15T17:00,17.1,55
Kolding,2025-01-15T18:00,14.9,68

Problems to handle: empty station names, N/A values, humidity over 100, duplicate records, missing timestamps, temperatures outside valid range, missing humidity values.
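These checks map naturally onto a Pydantic model. A minimal sketch, assuming the field names match the CSV header; the temperature bounds of -90 to 60 °C are an assumption you may tighten:

```python
from datetime import datetime

from pydantic import BaseModel, Field, field_validator


class WeatherReading(BaseModel):
    station: str = Field(min_length=1)          # rejects empty station names
    timestamp: datetime                          # rejects missing/unparseable timestamps
    temperature_c: float = Field(ge=-90, le=60)  # rejects out-of-range readings like -95.0
    humidity_pct: float = Field(ge=0, le=100)    # rejects humidity over 100 or missing values

    @field_validator("temperature_c", mode="before")
    @classmethod
    def reject_na(cls, value):
        """Turn the CSV's 'N/A' sentinel into a validation error instead of a crash."""
        if isinstance(value, str) and value.strip().upper() == "N/A":
            raise ValueError("temperature is N/A")
        return value
```

Duplicate records are not something a per-row model can see; handle those at the database layer (for example with a primary key on station plus timestamp).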

Requirements

Your pipeline must demonstrate the Week 3 concepts. Each task maps to a chapter you studied.

Task 1: Error Handling (Chapter 3)

Task 2: API Ingestion (Chapter 2)

Task 3: File Reading (Chapter 4)

Task 4: Pydantic Validation (Chapter 5)

Task 5: Database Storage (Chapter 6)

Task 6: Pipeline Orchestration

  1. Fetch from API (with retry)
  2. Read from CSV
  3. Store raw data in raw_weather table
  4. Validate all records with Pydantic
  5. Upsert valid records into weather_readings table
  6. Save error report as output/error_report.json
  7. Print a summary of the pipeline run
  === Pipeline Summary ===
  API records fetched: 168
  CSV records read: 10
  Total raw records: 178
  Valid records: 170
  Invalid records: 8
  Records in database: 170
  Error report: output/error_report.json
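Step 5, the upsert, can be sketched with SQLite's ON CONFLICT clause. The table layout here mirrors the CSV columns and is an assumption, not a prescribed schema:

```python
import sqlite3


def upsert_readings(db_path, readings):
    """Insert validated readings; a duplicate (station, timestamp) pair updates the row instead of failing."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS weather_readings (
            station TEXT NOT NULL,
            timestamp TEXT NOT NULL,
            temperature_c REAL,
            humidity_pct REAL,
            PRIMARY KEY (station, timestamp)
        )
    """)
    conn.executemany("""
        INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
        VALUES (:station, :timestamp, :temperature_c, :humidity_pct)
        ON CONFLICT (station, timestamp) DO UPDATE SET
            temperature_c = excluded.temperature_c,
            humidity_pct = excluded.humidity_pct
    """, readings)
    conn.commit()
    conn.close()
```

The composite primary key is what makes the duplicate Copenhagen row in the sample CSV collapse into a single record.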

Task 7: AI-Assisted Debugging

<aside> 📘 Core Program Refresher: This debugging task builds directly on what you practiced in Week 3's Using LLMs for Efficient Learning.

</aside>

  1. While building your pipeline, you will encounter at least one bug. (If not, introduce one intentionally.)
  2. Ask an LLM (ChatGPT, Claude, etc.) to help you debug it.
  3. Create AI_DEBUG.md and document:

Deliverables

Your project structure should look like this:

week3-assignment/
├── data/
│   └── weather_stations.csv
├── output/
│   └── error_report.json          (generated by pipeline)
├── models.py                      (Pydantic WeatherReading model)
├── ingest_api.py                  (API fetching with retry)
├── ingest_files.py                (CSV/JSON file reading)
├── validate.py                    (batch validation function)
├── database.py                    (SQLite create, upsert, query)
├── pipeline.py                    (orchestrator)
├── weather.db                     (generated by pipeline)
├── .env.example
├── .gitignore                     (remember to exclude .env and weather.db!)
├── AI_DEBUG.md
└── requirements.txt

<aside> ⚠️ Test with the CSV first. The CSV file is where most validation errors hide. Get the CSV → validate → database flow working before adding the API source.

</aside>
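Reading the CSV for that first pass can be very small; a sketch, assuming the file path from the layout above (the helper name is illustrative):

```python
import csv


def read_station_csv(path):
    """Yield raw row dicts straight from the file; cleaning belongs to the validation step, not here."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row


# Example: rows = list(read_station_csv("data/weather_stations.csv"))
```

Keeping the reader dumb means every bad value still reaches Pydantic, so your error report reflects the real source data.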

Rubric


Bonus: Query the Azure ARM API

<aside> 💡 This task is optional but gives you real-world API practice using your own cloud account.

</aside>

Use the Azure CLI to get an access token, then call the Azure Resource Manager REST API to list your resource groups:

# Get a token (run this in your terminal, not Python)
az login
az account get-access-token --query accessToken -o tsv

import requests

token = "YOUR_TOKEN_HERE"  # paste from the command above
subscription_id = "YOUR_SUBSCRIPTION_ID"  # from the Azure Portal

url = f"https://management.azure.com/subscriptions/{subscription_id}/resourcegroups?api-version=2024-03-01"
headers = {"Authorization": f"Bearer {token}"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

for rg in response.json()["value"]:
    print(f"Resource Group: {rg['name']}, Location: {rg['location']}")

This is the same pattern as the Open-Meteo API: requests.get() with headers, status check, JSON parsing. The only difference is authentication. Save the output as output/azure_resource_groups.json.

<aside> ⚠️ Never commit your access token. Tokens expire after about an hour, but treat them like passwords.

</aside>

Tips

<aside> 💡 Using AI to help: Before writing code, paste the Assignment Requirements (⚠️ Ensure no PII or sensitive company data is included!) together with the starter script into an LLM and ask it to help you break down the problem. Ask it how the different modules (API, File Reading, Validating, Database) should interact and flow into each other.

</aside>

Submission