Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

📥 Introduction to Data Ingestion

From Pipeline to the Real World

In Week 2, you controlled the whole pipeline, including the data itself and its final destination. In real-world data engineering, however, you often have little control over the data or how it reaches you. APIs go down, CSV files arrive with surprise columns, and JSON payloads report "temperature" sometimes as a number and sometimes as the string "N/A".

This week, you learn how to reliably get data from external sources into your pipeline — pulling from APIs, reading different file formats, validating what you receive, and storing clean results in a database.

<aside> 📘 Core Program Refresher:

This week builds directly on concepts from the Core program. If you need a quick reminder, check out these earlier lessons:

This week, you combine those skills in Python: pulling data from APIs with requests, validating it with Pydantic, and storing it in SQLite.

</aside>

What is Data Ingestion?

Data ingestion is the process of pulling raw data from external sources into your system. It is the first step in any data pipeline: before you can transform, analyze, or visualize data, you need to get it.

Think of it like a kitchen. Before you can cook (transform), you need to buy ingredients (ingest). Sometimes the store is out of stock (API down). Sometimes they give you the wrong item (schema mismatch). Sometimes the packaging is damaged (corrupted data). A good chef checks the ingredients before cooking.

Local Data vs External Data

In Week 2, all your data was local. This week, the rules change.

|              | Local Data (Week 2)   | External Data (Week 3)                |
| ------------ | --------------------- | ------------------------------------- |
| Source       | Files on your machine | APIs, remote databases, cloud storage |
| Control      | You decide the format | Someone else decides the format       |
| Availability | Always there          | Server might be down                  |
| Schema       | You wrote it          | It might change without warning       |
| Quality      | You cleaned it        | Could be anything                     |
| Speed        | Instant               | Network latency, rate limits          |

This is why ingestion is hard. You are writing code that depends on systems you do not control.

The Ingestion Flow

Every ingestion pipeline follows the same three-step pattern:

<aside> 🎬 Animation: Ingestion Flow - Source, Validate, Store

</aside>

[Source] --> [Validate] --> [Store]
   |             |             |
   API       Pydantic       SQLite
   CSV       Type checks    Database
   JSON      Constraints    File
  1. Extract: Pull raw data from the source (API call, file read, database query)
  2. Validate: Check that the data matches your expected schema (types, ranges, required fields)
  3. Store: Write the validated data to a reliable destination (database, file)

If validation fails, you do not store the bad data. You log it, report it, and move on. This is the key difference from Week 2: you cannot crash. You need to handle failures gracefully.
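The three-step pattern can be sketched in plain Python. All names and records below are illustrative stand-ins; the real versions of these functions are built chapter by chapter this week:

```python
# Minimal sketch of the Extract -> Validate -> Store flow.
# Every function here is a placeholder for what you build later this week.

def extract() -> list[dict]:
    # In the real pipeline this is an API call or file read.
    return [
        {"city": "Copenhagen", "temperature": 21.5},
        {"city": "Aarhus", "temperature": "N/A"},  # a bad record slips in
    ]

def validate(records: list[dict]) -> tuple[list[dict], list[str]]:
    valid, errors = [], []
    for record in records:
        if isinstance(record.get("temperature"), (int, float)):
            valid.append(record)
        else:
            # Do not crash: record the problem and keep going.
            errors.append(f"bad temperature in {record!r}")
    return valid, errors

def store(records: list[dict]) -> int:
    # Stand-in for a database write; returns the number of rows "written".
    return len(records)

valid, errors = validate(extract())
stored = store(valid)
```

Notice that the bad record never reaches `store` — it ends up in `errors`, where it can be logged and reported instead of crashing the pipeline.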

Running Example: Weather Data Pipeline

Throughout this week, you will build a weather data ingestion pipeline. The idea is simple:

  1. Fetch weather data from the Open-Meteo API (free, no API key needed)
  2. Read additional weather data from CSV and JSON files
  3. Validate all readings with Pydantic
  4. Store clean data in a SQLite database

This is a realistic scenario. Weather data companies do exactly this: they collect data from many sources, validate it, and store it for analysis.

Here is a preview of what you will build:

# By the end of this week, you can write this:
readings = fetch_weather_api(city="Copenhagen", days=7)
file_readings = read_weather_csv("data/stations.csv")

all_readings = readings + file_readings

valid, errors = validate_readings(all_readings)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")

save_to_database(valid, db_path="weather.db")

Each chapter teaches one piece of this puzzle.

<aside> 💡 Before building a pipeline for any external source, always check the API documentation (like the Open-Meteo API) or file spec (like the data dictionary for the CSV file you are reading) first. Five minutes of reading can save hours of debugging.

</aside>

Azure: Your Cloud Is Also an API

In Week 2, you explored the Azure Portal in a browser. But the portal is just a graphical wrapper around a REST API. Every button you click sends an HTTP request to the Azure Resource Manager API.

This week you learn API ingestion with requests. The Azure ARM API (management.azure.com) is a real API that you already have credentials for. In Week 4, you will start storing pipeline outputs in Azure Blob Storage, and by Week 6 you will deploy your pipelines to Azure Container Apps. Understanding that Azure is "just an API" makes all of that less intimidating.
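As a small taste of what Chapter 2 covers, here is a hedged sketch of fetching weather data from Open-Meteo with requests. The endpoint and query parameters reflect the public Open-Meteo docs at the time of writing; always verify against the current API documentation before relying on them:

```python
# Sketch of fetching daily max temperatures from the Open-Meteo API.

def fetch_weather_api(latitude: float, longitude: float) -> dict:
    import requests  # third-party: `pip install requests`

    response = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={
            "latitude": latitude,
            "longitude": longitude,
            "daily": "temperature_2m_max",
            "timezone": "auto",
        },
        timeout=10,  # never call an external API without a timeout
    )
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.json()

def extract_max_temps(payload: dict) -> list[float]:
    # Pull the daily max temperatures out of the JSON payload;
    # default to an empty list if the shape is not what we expect.
    return payload.get("daily", {}).get("temperature_2m_max", [])
```

Separating the network call (`fetch_weather_api`) from the parsing (`extract_max_temps`) makes the parsing easy to test without hitting the network — a pattern you will use all week.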

<aside> 🤓 Curious Geek: Infrastructure as an API

Every major cloud provider (Azure, AWS, GCP) exposes its entire infrastructure as a REST API. When you click "Create Storage Account" in the portal, it sends a PUT request to management.azure.com. Tools like Terraform and Pulumi automate cloud deployments by calling these same APIs. You will use Terraform in Week 14.

</aside>

Your Journey This Week

| Chapter           | What You Learn                          | Pipeline Piece        |
| ----------------- | --------------------------------------- | --------------------- |
| 2. Ingesting APIs | HTTP requests, pagination, rate limits  | Fetching weather data |
| 3. Error Handling | Retry logic, error accumulation         | Surviving failures    |
| 4. File Formats   | CSV, JSON, Parquet readers              | Reading local files   |
| 5. Pydantic       | Schema validation, type coercion        | Checking every reading|
| 6. Databases      | SQLite, parameterized queries, upserts  | Storing clean data    |

By Friday, you will have a complete ingestion system that handles messy real-world data.

<aside> 🤓 Curious Geek: ELT vs ETL

Classic ETL (Extract, Transform, Load) transforms data before storing it. Modern ELT (Extract, Load, Transform) stores raw data first and transforms it later. ELT is popular because storage is cheap and you can always re-transform. This week, you store raw data in a "raw" table (ELT-style) before any cleanup.

</aside>
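To make the ELT idea concrete, here is a small sqlite3 sketch that loads raw records as-is before any cleanup. The table and column names are made up for illustration, not the course schema:

```python
import sqlite3

# ELT-style: load raw records first, transform later.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_readings (city TEXT, temperature TEXT, ingested_at TEXT)"
)

raw_records = [
    ("Copenhagen", "21.5", "2024-01-15T12:00:00"),
    ("Aarhus", "N/A", "2024-01-15T12:00:00"),  # stored as-is, cleaned later
]

# Parameterized query: never build SQL strings by hand.
conn.executemany("INSERT INTO raw_readings VALUES (?, ?, ?)", raw_records)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM raw_readings").fetchone()[0]
```

Note that temperature is stored as TEXT here on purpose: the raw table keeps even the "N/A" values, so you can always re-run the transform step later without re-fetching.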

🧠 Knowledge Check

  1. Why is ingesting data from an API harder than reading a local CSV file?
  2. What are the three steps in the ingestion flow, and why does validation come before storage?
  3. Why should an ingestion pipeline not crash when it encounters bad data?
  4. Think about a data source you use regularly (a weather app, a news feed, a fitness tracker). List 3 things that could go wrong if you tried to automatically pull data from it every hour. What would you do about each one?

<aside> 💡 Using AI to help: When exploring a new data source, try pasting a sample response (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "What are the edge cases and potential failure points I need to handle when ingesting this data?" It will often spot issues you hadn't considered.

</aside>

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.