Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

📥 Introduction to Data Ingestion

From Pipeline to the Real World

In Week 2, you controlled the whole pipeline, including the data itself and its final destination. In real-world data engineering, however, you often have little control over the data or how it reaches you. APIs go down, CSV files arrive with surprise columns, and JSON payloads report "temperature" sometimes as a number and sometimes as the string "N/A".

This week, you learn how to reliably get data from external sources into your pipeline — pulling from APIs, reading different file formats, validating what you receive, and storing clean results in a database.

<aside> 📘 Core Program Refresher:

This week builds directly on concepts from the Core program. If you need a quick reminder, check out these earlier lessons:

This week, you combine those skills in Python: pulling data from APIs with requests, validating it with Pydantic, and storing it in SQLite.

</aside>

What is Data Ingestion?

Data ingestion is the process of pulling raw data from external sources into your system. It is the first step in any data pipeline: before you can transform, analyze, or visualize data, you need to get it.

Think of it like a kitchen. Before you can cook (transform), you need to buy ingredients (ingest). Sometimes the store is out of stock (API down). Sometimes they give you the wrong item (schema mismatch). Sometimes the packaging is damaged (corrupted data). A good chef checks the ingredients before cooking.

Local Data vs External Data

In Week 2, all your data was local. This week, the rules change.

|              | Local Data (Week 2)   | External Data (Week 3)                |
| ------------ | --------------------- | ------------------------------------- |
| Source       | Files on your machine | APIs, remote databases, cloud storage |
| Control      | You decide the format | Someone else decides the format       |
| Availability | Always there          | Server might be down                  |
| Schema       | You wrote it          | It might change without warning       |
| Quality      | You cleaned it        | Could be anything                     |
| Speed        | Instant               | Network latency, rate limits          |

This is why ingestion is hard. You are writing code that depends on systems you do not control.

The Ingestion Flow

Every ingestion pipeline follows the same three-step pattern:

<aside> 🎬 Animation: Ingestion Flow - Source, Validate, Store

</aside>

[Source] --> [Validate] --> [Store]
   |             |             |
   API       Pydantic       SQLite
   CSV       Type checks    Database
   JSON      Constraints    File
  1. Extract: Pull raw data from the source (API call, file read, database query)
  2. Validate: Check that the data matches your expected schema (types, ranges, required fields)
  3. Store: Write the validated data to a reliable destination (database, file)

If validation fails, you do not store the bad data. You log it, report it, and move on. This is the key difference from Week 2: you cannot crash. You need to handle failures gracefully.
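The three-step pattern can be sketched in plain Python. All names and records below are illustrative stand-ins; the real versions of these functions are built chapter by chapter this week:

```python
# Minimal sketch of the Extract -> Validate -> Store flow.
# Every function here is a placeholder for what you build later this week.

def extract() -> list[dict]:
    # In the real pipeline this is an API call or file read.
    return [
        {"city": "Copenhagen", "temperature": 21.5},
        {"city": "Aarhus", "temperature": "N/A"},  # a bad record slips in
    ]

def validate(records: list[dict]) -> tuple[list[dict], list[str]]:
    valid, errors = [], []
    for record in records:
        if isinstance(record.get("temperature"), (int, float)):
            valid.append(record)
        else:
            # Do not crash: record the problem and keep going.
            errors.append(f"bad temperature in {record!r}")
    return valid, errors

def store(records: list[dict]) -> int:
    # Stand-in for a database write; returns the number of rows "written".
    return len(records)

valid, errors = validate(extract())
stored = store(valid)
```

Notice that the bad record never reaches `store` — it ends up in `errors`, where it can be logged and reported instead of crashing the pipeline.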

Running Example: Weather Data Pipeline

Throughout this week, you will build a weather data ingestion pipeline. The idea is simple:

  1. Fetch weather data from the Open-Meteo API (free, no API key needed)
  2. Read additional weather data from CSV and JSON files
  3. Validate all readings with Pydantic
  4. Store clean data in a SQLite database

This is a realistic scenario. Weather data companies do exactly this: they collect data from many sources, validate it, and store it for analysis.

Here is a preview of what you will build:

# By the end of this week, you can write this:
readings = fetch_weather_api(city="Copenhagen", days=7)
file_readings = read_weather_csv("data/stations.csv")

all_readings = readings + file_readings

valid, errors = validate_readings(all_readings)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")

save_to_database(valid, db_path="weather.db")

Each chapter teaches one piece of this puzzle.

<aside> 💡 Before building a pipeline for any external source, always check the API documentation (like the Open-Meteo API) or file spec (like the data dictionary for the CSV file you are reading) first. Five minutes of reading can save hours of debugging.

</aside>

Azure: Your Cloud Is Also an API

In Week 2, you explored the Azure Portal in a browser. But the portal is just a graphical wrapper around a REST API. Every button you click sends an HTTP request to the Azure Resource Manager API.

This week you learn API ingestion with requests. The Azure ARM API (management.azure.com) is a real API that you already have credentials for. In Week 4, you will start storing pipeline outputs in Azure Blob Storage, and by Week 6 you will deploy your pipelines to Azure Container Apps. Understanding that Azure is "just an API" makes all of that less intimidating.
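As a small taste of what Chapter 2 covers, here is a hedged sketch of fetching weather data from Open-Meteo with requests. The endpoint and query parameters reflect the public Open-Meteo docs at the time of writing; always verify against the current API documentation before relying on them:

```python
# Sketch of fetching daily max temperatures from the Open-Meteo API.

def fetch_weather_api(latitude: float, longitude: float) -> dict:
    import requests  # third-party: `pip install requests`

    response = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={
            "latitude": latitude,
            "longitude": longitude,
            "daily": "temperature_2m_max",
            "timezone": "auto",
        },
        timeout=10,  # never call an external API without a timeout
    )
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.json()

def extract_max_temps(payload: dict) -> list[float]:
    # Pull the daily max temperatures out of the JSON payload;
    # default to an empty list if the shape is not what we expect.
    return payload.get("daily", {}).get("temperature_2m_max", [])
```

Separating the network call (`fetch_weather_api`) from the parsing (`extract_max_temps`) makes the parsing easy to test without hitting the network — a pattern you will use all week.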

<aside> 🤓 Curious Geek: Infrastructure as an API

Every major cloud provider (Azure, AWS, GCP) exposes its entire infrastructure as a REST API. When you click "Create Storage Account" in the portal, it sends a PUT request to management.azure.com. Tools like Terraform and Pulumi automate cloud deployments by calling these same APIs. You will use Terraform in Week 14.

</aside>

Your Journey This Week

| Chapter           | What You Learn                          | Pipeline Piece        |
| ----------------- | --------------------------------------- | --------------------- |
| 2. Ingesting APIs | HTTP requests, pagination, rate limits  | Fetching weather data |
| 3. Error Handling | Retry logic, error accumulation         | Surviving failures    |
| 4. File Formats   | CSV, JSON, Parquet readers              | Reading local files   |
| 5. Pydantic       | Schema validation, type coercion        | Checking every reading|
| 6. Databases      | SQLite, parameterized queries, upserts  | Storing clean data    |

By Friday, you will have a complete ingestion system that handles messy real-world data.

<aside> 🤓 Curious Geek: ELT vs ETL

Classic ETL (Extract, Transform, Load) transforms data before storing it. Modern ELT (Extract, Load, Transform) stores raw data first and transforms it later. ELT is popular because storage is cheap and you can always re-transform. This week, you store raw data in a "raw" table (ELT-style) before any cleanup.

</aside>
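To make the ELT idea concrete, here is a small sqlite3 sketch that loads raw records as-is before any cleanup. The table and column names are made up for illustration, not the course schema:

```python
import sqlite3

# ELT-style: load raw records first, transform later.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_readings (city TEXT, temperature TEXT, ingested_at TEXT)"
)

raw_records = [
    ("Copenhagen", "21.5", "2024-01-15T12:00:00"),
    ("Aarhus", "N/A", "2024-01-15T12:00:00"),  # stored as-is, cleaned later
]

# Parameterized query: never build SQL strings by hand.
conn.executemany("INSERT INTO raw_readings VALUES (?, ?, ?)", raw_records)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM raw_readings").fetchone()[0]
```

Note that temperature is stored as TEXT here on purpose: the raw table keeps even the "N/A" values, so you can always re-run the transform step later without re-fetching.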

🧠 Knowledge Check

  1. Why is ingesting data from an API harder than reading a local CSV file?
  2. What are the three steps in the ingestion flow, and why does validation come before storage?
  3. Why should an ingestion pipeline not crash when it encounters bad data?
  4. Think about a data source you use regularly (a weather app, a news feed, a fitness tracker). List 3 things that could go wrong if you tried to automatically pull data from it every hour. What would you do about each one?

<aside> 💡 Using AI to help: When exploring a new data source, try pasting a sample response (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "What are the edge cases and potential failure points I need to handle when ingesting this data?" It will often spot issues you hadn't considered.

</aside>

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.