Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

From Pipeline to the Real World

In Week 2, you controlled the whole pipeline, including the data itself and its final destination. In real-world data engineering, you often have little control over the data or how it reaches you. APIs go down, CSV files arrive with surprise columns, and JSON payloads report "temperature" as a number in one record and as the string "N/A" in the next.

This week, you learn how to reliably get data from external sources into your pipeline: pulling from APIs, reading different file formats, validating what you receive, and storing clean results in a database.

By the end of this chapter, you should be able to:

<aside> 📘 Core Program Refresher:

This week builds directly on concepts from the Core program. If you need a quick reminder, check out these earlier lessons:

This week, you combine those skills in Python: pulling data from APIs with requests, validating it with Pydantic, and storing it in SQLite.

</aside>

What is Data Ingestion?

Data ingestion is the process of pulling raw data from external sources into your system. It is the first step in any data pipeline: before you can transform, analyze, or visualize data, you need to get it.

Think of it like a kitchen. Before you can cook (transform), you need to buy ingredients (ingest). Sometimes the store is out of stock (API down). Sometimes they give you the wrong item (schema mismatch). Sometimes the packaging is damaged (corrupted data). A good chef checks the ingredients before cooking.

Local Data vs External Data

In Week 2, all your data was local. This week, the rules change.

| | Local Data (Week 2) | External Data (Week 3) |
| --- | --- | --- |
| Source | Files on your machine | APIs, remote databases, cloud storage |
| Control | You decide the format | Someone else decides the format |
| Availability | Always there | Server might be down |
| Schema | You wrote it | It might change without warning |
| Quality | You cleaned it | Could be anything |
| Speed | Instant | Network latency, rate limits |

This is why ingestion is hard. You are writing code that depends on systems you do not control.

The Ingestion Flow

Every ingestion flow follows the same three-step pattern:

```mermaid
flowchart LR
    S["Source<br/>API · CSV · JSON"] --> V["Validate<br/>Pydantic · types · constraints"] --> D["Store<br/>SQLite · database · file"]
```

  1. Extract: Pull raw data from the source (API call, file read, database query)
  2. Validate: Check that the data matches your expected schema (types, ranges, required fields)
  3. Store: Write the validated data to a reliable destination (database, file)

If validation fails, you do not store the bad data. You log it, report it, and move on. This is the key difference from Week 2: one bad record or one failed request must not crash the pipeline. You need to handle failures gracefully.
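The extract, validate, store pattern can be sketched with the standard library alone. This is a minimal illustration, not the pipeline you build later in the week: the `extract`, `validate`, `store`, and `run` helpers, the `readings` table, the sample records, and the temperature range are all assumptions made up for this example, and the hand-written checks stand in for the Pydantic models covered in a later chapter.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingest")


def extract():
    """Stand-in for an API call or file read: returns raw, untrusted records."""
    return [
        {"station": "CPH", "temperature": 4.5},
        {"station": "AAR", "temperature": "N/A"},  # wrong type
        {"temperature": 3.0},                      # missing required field
        {"station": "ODE", "temperature": "7.2"},  # numeric string, can be coerced
    ]


def validate(record):
    """Return a clean (station, temperature) tuple or raise ValueError."""
    station = record.get("station")
    if not isinstance(station, str) or not station:
        raise ValueError("missing station")
    try:
        temperature = float(record.get("temperature"))
    except (TypeError, ValueError):
        raise ValueError(f"bad temperature: {record.get('temperature')!r}") from None
    # Illustrative plausibility range in degrees Celsius (an assumption).
    if not -90.0 <= temperature <= 60.0:
        raise ValueError(f"temperature out of range: {temperature}")
    return station, temperature


def store(conn, rows):
    """Write only validated rows, using parameterized inserts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(station TEXT NOT NULL, temperature REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


def run(conn):
    """Validate every record; collect failures instead of crashing."""
    valid, errors = [], []
    for record in extract():
        try:
            valid.append(validate(record))
        except ValueError as exc:
            logger.warning("rejected %r: %s", record, exc)
            errors.append((record, str(exc)))
    store(conn, valid)
    return valid, errors


conn = sqlite3.connect(":memory:")
valid, errors = run(conn)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")
```

The loop catches the failure per record, so two bad readings are logged and collected while the two good ones still reach the database.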

Running Example: Weather Data Pipeline

Throughout this week, you will build a weather data ingestion pipeline. The idea is simple:

  1. Fetch weather data from the Open-Meteo API (free, no API key needed)
  2. Read additional weather data from CSV and JSON files
  3. Validate all readings with Pydantic
  4. Store clean data in a SQLite database

This is a realistic scenario. Weather data companies do exactly this: they collect data from many sources, validate it, and store it for analysis.

Here is a preview of what you will build:

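```python
# Preview only: you build each of these functions during the week,
# so this snippet does not run until they exist.
readings = fetch_weather_api(city="Copenhagen", days=7)
file_readings = read_weather_csv("data/stations.csv")

# Combine both sources into one list before validating
all_readings = readings + file_readings

# Split the input into valid readings and a list of errors
valid, errors = validate_readings(all_readings)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")

# Persist only the validated readings
save_to_database(valid, db_path="weather.db")
```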

Each chapter teaches one piece of this puzzle.
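To make the first piece more concrete, here is a rough sketch of what talking to Open-Meteo involves. It assumes the forecast endpoint with the `hourly=temperature_2m` variable, and it uses only the standard library so it runs without installing anything (the course itself uses `requests`). The helper names `build_url`, `fetch`, and `to_readings` are hypothetical, the coordinates are an approximation of Copenhagen, and the sample payload is hand-written to mimic the shape of a response rather than copied from a real one.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.open-meteo.com/v1/forecast"


def build_url(latitude, longitude, days=7):
    """Build the request URL; Open-Meteo takes coordinates, not city names."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m",
        "forecast_days": days,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch(url, timeout=10):
    """Perform the real HTTP call (defined here, but not called in this sketch)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def to_readings(payload):
    """Flatten the parallel 'time' and 'temperature_2m' arrays into records."""
    hourly = payload["hourly"]
    return [
        {"time": time, "temperature": temperature}
        for time, temperature in zip(hourly["time"], hourly["temperature_2m"])
    ]


# Hand-written sample with made-up values; None stands in for a missing value.
sample = {
    "latitude": 55.68,
    "longitude": 12.57,
    "hourly": {
        "time": ["2024-01-01T00:00", "2024-01-01T01:00"],
        "temperature_2m": [2.1, None],
    },
}

url = build_url(55.68, 12.57)
readings = to_readings(sample)
print(url)
print(readings)
```

Keeping the URL building and the parsing separate from the network call means both can be checked offline against a saved sample before you ever hit the API.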

<aside> 💡 Before building a pipeline for any external source, always check the API documentation (like the Open-Meteo API) or file spec (like the data dictionary for the CSV file you are reading) first. Five minutes of reading can save hours of debugging.

</aside>

Azure: Your Cloud Is Also an API

In Week 2, you explored the Azure Portal in a browser. But the portal is just a graphical wrapper around a REST API. Every button you click sends an HTTP request to the Azure Resource Manager API.

This week you learn API ingestion with requests. The Azure ARM API (management.azure.com) is a real API that you already have credentials for. Later in the track you will store pipeline outputs in Azure Blob Storage and deploy your pipelines to Azure Container Apps. Understanding that Azure is "just an API" makes all of that less intimidating.

<aside> 馃 Curious Geek: Infrastructure as an API

Every major cloud provider (Azure, AWS, GCP) exposes its entire infrastructure as a REST API. When you click "Create Storage Account" in the portal, it sends a PUT request to management.azure.com. Tools like Terraform and Pulumi automate cloud deployments by calling these same APIs.

</aside>
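As a rough illustration of the "portal button is a PUT request" idea, the sketch below builds (but never sends) a request shaped like the one for creating a storage account. Everything specific here is a placeholder or an assumption: the subscription ID, resource group, account name, region, SKU, `api-version` value, and the bearer token are illustrative, so check the Azure REST API reference before using any of them.

```python
import json
from urllib.request import Request

# Placeholder values (assumptions for illustration only).
subscription_id = "00000000-0000-0000-0000-000000000000"
resource_group = "my-resource-group"
account_name = "mystorageaccount"
api_version = "2023-01-01"  # assumed version; confirm in the REST reference

# ARM addresses every resource by a hierarchical path under management.azure.com.
url = (
    "https://management.azure.com"
    f"/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    "/providers/Microsoft.Storage"
    f"/storageAccounts/{account_name}"
    f"?api-version={api_version}"
)

# The JSON body describes the desired resource.
body = {
    "location": "westeurope",
    "sku": {"name": "Standard_LRS"},
    "kind": "StorageV2",
}

# Build the request object only; nothing is sent over the network here.
request = Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    method="PUT",
    headers={
        "Authorization": "Bearer <access-token>",  # placeholder, not a real token
        "Content-Type": "application/json",
    },
)

print(request.get_method(), request.full_url)
```

The same ingredients appear in every API you ingest from this week: a URL, an HTTP method, headers for authentication, and a JSON body or query parameters.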

Your Journey This Week

| Chapter | What You Learn | Pipeline Piece |
| --- | --- | --- |
| 2. Ingesting APIs | HTTP requests, pagination, rate limits | Fetching weather data |
| 3. Error Handling | Retry logic, error accumulation | Surviving failures |
| 4. File Formats | CSV, JSON, Parquet readers | Reading local files |
| 5. Pydantic | Schema validation, type coercion | Checking every reading |
| 6. Databases | SQLite, parameterized queries, upserts | Storing clean data |

By Friday, you will have a complete ingestion system that handles messy real-world data.

<aside> 馃 Curious Geek: ELT vs ETL

Classic ETL (Extract, Transform, Load) transforms data before storing it. Modern ELT (Extract, Load, Transform) stores raw data first and transforms it later. ELT is popular because storage is cheap and you can always re-transform. This week, you store raw data in a "raw" table (ELT-style) before any cleanup.

</aside>
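The ELT idea can be sketched in a few lines of SQLite: load the payloads untouched into a raw table, then derive a clean table from it in a second step. The table names `raw_readings` and `clean_readings`, the columns, and the sample records are assumptions for this example, not the schema used later in the week.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_readings (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE clean_readings (station TEXT NOT NULL, temperature REAL NOT NULL)"
)

# Made-up records standing in for what an external source returned.
incoming = [
    {"station": "CPH", "temperature": 4.5},
    {"station": "AAR", "temperature": "N/A"},
]

# Load: store every payload exactly as received, including the bad one.
conn.executemany(
    "INSERT INTO raw_readings (payload) VALUES (?)",
    [(json.dumps(record),) for record in incoming],
)

# Transform: derive clean rows from the raw table, skipping what cannot be parsed.
rows = conn.execute("SELECT payload FROM raw_readings ORDER BY id").fetchall()
for (payload,) in rows:
    record = json.loads(payload)
    try:
        station = record["station"]
        temperature = float(record["temperature"])
    except (KeyError, TypeError, ValueError):
        continue  # the raw row stays available for a later re-transform
    conn.execute(
        "INSERT INTO clean_readings VALUES (?, ?)", (station, temperature)
    )
conn.commit()

raw_count = conn.execute("SELECT COUNT(*) FROM raw_readings").fetchone()[0]
clean_count = conn.execute("SELECT COUNT(*) FROM clean_readings").fetchone()[0]
print(f"raw: {raw_count}, clean: {clean_count}")
```

Because the raw table keeps everything, a fixed or stricter transform can be re-run later without calling the source again.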

Before moving on to the API chapter, spend a few minutes mapping the ingestion flow onto a data source you actually use.

<aside> 鈱笍 Hands on: Pick one data source you interact with every day (a weather app, a streaming service, your bank's transactions page). On paper, sketch the three-step ingestion flow for it: what the extract step would fetch, what fields the validate step would check (types, ranges, required keys), and where the store step would land the data. Then list one transient error and one permanent error you would expect from the source. Keep your notes: you will revisit them as the week progresses.

</aside>

Knowledge Check

<aside> 🚀 Try it in the widget: Interactive Quiz: Introduction to Data Ingestion

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_3_ch1_intro_data_ingestion_quiz&embed=1

Once the basics land, AI can speed up the exploration of a new source:

<aside> 💡 Using AI to help: When exploring a new data source, try pasting a sample response (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and asking: "What are the edge cases and potential failure points I need to handle when ingesting this data?" It will often spot issues you hadn't considered.

</aside>

Extra reading

<aside> 📚 For full courses, books, and community resources, see the optional Going Further page.

</aside>


Next up: Ingesting from APIs, where you fetch real weather data with requests, handle pagination, and survive transient network failures.