Week 3 - Ingesting and Validating Data
Introduction to Data Ingestion
Assignment: Build a Validated Ingestion Pipeline
In Week 2, you controlled the whole pipeline, including the data itself and its final destination. However, in real-world data engineering, you often have little control over the data or how it gets to you. APIs go down, CSV files arrive with surprise columns, and JSON payloads report "temperature" sometimes as a number and sometimes as the string "N/A".
This week, you learn how to reliably get data from external sources into your pipeline — pulling from APIs, reading different file formats, validating what you receive, and storing clean results in a database.
<aside> 📘 Core Program Refresher:
This week builds directly on concepts from the Core program. If you need a quick reminder, revisit the relevant Core lessons.
This week, you combine those skills in Python: pulling data from APIs with requests, validating it with Pydantic, and storing it in SQLite.
</aside>
Data ingestion is the process of pulling raw data from external sources into your system. It is the first step in any data pipeline: before you can transform, analyze, or visualize data, you need to get it.
Think of it like a kitchen. Before you can cook (transform), you need to buy ingredients (ingest). Sometimes the store is out of stock (API down). Sometimes they give you the wrong item (schema mismatch). Sometimes the packaging is damaged (corrupted data). A good chef checks the ingredients before cooking.
In Week 2, all your data was local. This week, the rules change.
| | Local Data (Week 2) | External Data (Week 3) |
|---|---|---|
| Source | Files on your machine | APIs, remote databases, cloud storage |
| Control | You decide the format | Someone else decides the format |
| Availability | Always there | Server might be down |
| Schema | You wrote it | It might change without warning |
| Quality | You cleaned it | Could be anything |
| Speed | Instant | Network latency, rate limits |
This is why ingestion is hard. You are writing code that depends on systems you do not control.
Every ingestion pipeline follows the same three-step pattern:
<aside> 🎬 Animation: Ingestion Flow - Source, Validate, Store
</aside>
```
[Source] --> [Validate] --> [Store]
    |             |            |
   API         Pydantic      SQLite
   CSV       Type checks    Database
  JSON       Constraints      File
```
If validation fails, you do not store the bad data. You log it, report it, and move on. This is the key difference from Week 2: you cannot crash. You need to handle failures gracefully.
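The validate-then-store step can be sketched in plain Python. This is a minimal version: the field names and range rules here are illustrative, and later chapters replace the hand-written checks with Pydantic.

```python
def validate_reading(raw: dict) -> dict:
    """Return a cleaned reading, or raise ValueError for bad data."""
    temp = raw.get("temperature")
    if isinstance(temp, str):  # e.g. "N/A" instead of a number
        raise ValueError(f"non-numeric temperature: {temp!r}")
    if temp is None or not -90 <= temp <= 60:
        raise ValueError(f"temperature out of range: {temp!r}")
    return {"city": raw["city"], "temperature": float(temp)}

def validate_readings(raw_readings: list[dict]) -> tuple[list[dict], list[str]]:
    """Accumulate errors instead of crashing on the first bad record."""
    valid, errors = [], []
    for raw in raw_readings:
        try:
            valid.append(validate_reading(raw))
        except (ValueError, KeyError) as exc:
            errors.append(f"{raw!r}: {exc}")  # log it, report it, move on
    return valid, errors

valid, errors = validate_readings([
    {"city": "Copenhagen", "temperature": 7.5},
    {"city": "Aarhus", "temperature": "N/A"},  # bad: string, not number
])
```

Notice that a bad record never raises past `validate_readings` — it becomes an entry in `errors`, and the pipeline keeps going.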
Throughout this week, you will build a weather data ingestion pipeline. The idea is simple: fetch readings from a weather API and from CSV files, validate every reading, and store the clean results in a database.
This is a realistic scenario. Weather data companies do exactly this: they collect data from many sources, validate it, and store it for analysis.
Here is a preview of what you will build:
```python
# By the end of this week, you can write this:
readings = fetch_weather_api(city="Copenhagen", days=7)
file_readings = read_weather_csv("data/stations.csv")
all_readings = readings + file_readings

valid, errors = validate_readings(all_readings)
print(f"Valid: {len(valid)}, Errors: {len(errors)}")

save_to_database(valid, db_path="weather.db")
```
Each chapter teaches one piece of this puzzle.
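As a taste of the file-reading piece, here is one way `read_weather_csv` might handle a CSV with surprise columns. This sketch reads from any file-like object rather than a path, and the expected column names are an assumption for illustration.

```python
import csv
import io

def read_weather_csv(source) -> list[dict]:
    """Read station readings from a CSV file object, keeping only known columns."""
    expected = {"city", "temperature"}
    rows = []
    for row in csv.DictReader(source):
        # DictReader exposes surprise columns too; keep only what we expect
        rows.append({k: v for k, v in row.items() if k in expected})
    return rows

# Simulated file with an unexpected extra column
sample = io.StringIO("city,temperature,mystery\nCopenhagen,7.5,???\n")
readings = read_weather_csv(sample)
```

Note that `csv` gives you strings (`"7.5"`, not `7.5`) — converting and checking types is exactly what the validation step is for.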
<aside> 💡 Before building a pipeline for any external source, always check the API documentation (like the Open-Meteo API) or file spec (like the data dictionary for the CSV file you are reading) first. Five minutes of reading can save hours of debugging.
</aside>
In Week 2, you explored the Azure Portal in a browser. But the portal is just a graphical wrapper around a REST API. Every button you click sends an HTTP request to the Azure Resource Manager API.
This week you learn API ingestion with requests. The Azure ARM API (management.azure.com) is a real API that you already have credentials for. In Week 4, you will start storing pipeline outputs in Azure Blob Storage, and by Week 6 you will deploy your pipelines to Azure Container Apps. Understanding that Azure is "just an API" makes all of that less intimidating.
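To make "just an API" concrete, here is roughly what an API fetch with requests looks like. This sketch targets the Open-Meteo forecast endpoint mentioned earlier; it takes coordinates directly rather than a city name like the preview's `fetch_weather_api(city=...)`, and the parameter choices are illustrative — check the API docs for the real options.

```python
import requests

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"

def build_params(latitude: float, longitude: float, days: int) -> dict:
    """Query parameters for a daily temperature forecast."""
    return {
        "latitude": latitude,
        "longitude": longitude,
        "daily": "temperature_2m_max",
        "forecast_days": days,
    }

def fetch_weather_api(latitude: float, longitude: float, days: int = 7) -> dict:
    """GET the forecast; surface HTTP errors instead of silently continuing."""
    response = requests.get(
        OPEN_METEO_URL,
        params=build_params(latitude, longitude, days),
        timeout=10,  # never wait forever on a server you do not control
    )
    response.raise_for_status()  # raises on 4xx/5xx status codes
    return response.json()

# fetch_weather_api(55.68, 12.57)  # Copenhagen; requires network access
```

The `timeout` and `raise_for_status()` lines are the important habit: external servers fail, and you want those failures to be loud and catchable.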
<aside> 🤓 Curious Geek: Infrastructure as an API
Every major cloud provider (Azure, AWS, GCP) exposes its entire infrastructure as a REST API. When you click "Create Storage Account" in the portal, it sends a PUT request to management.azure.com. Tools like Terraform and Pulumi automate cloud deployments by calling these same APIs. You will use Terraform in Week 14.
</aside>
| Chapter | What You Learn | Pipeline Piece |
|---|---|---|
| 2. Ingesting APIs | HTTP requests, pagination, rate limits | Fetching weather data |
| 3. Error Handling | Retry logic, error accumulation | Surviving failures |
| 4. File Formats | CSV, JSON, Parquet readers | Reading local files |
| 5. Pydantic | Schema validation, type coercion | Checking every reading |
| 6. Databases | SQLite, parameterized queries, upserts | Storing clean data |
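One possible shape for the Chapter 3 retry logic — a minimal sketch with an injectable `sleep` so it can be tested without real delays, not the chapter's exact implementation:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(); on failure, wait, double the delay, and try again."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the caller handle it
            sleep(delay)
            delay *= 2  # exponential backoff

# A flaky source that fails twice, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server hiccup")
    return "ok"

result = with_retries(flaky_fetch, attempts=5, sleep=lambda s: None)
```

Doubling the delay (exponential backoff) gives a struggling server room to recover instead of hammering it every 100 ms.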
By Friday, you will have a complete ingestion system that handles messy real-world data.
<aside> 🤓 Curious Geek: ELT vs ETL
Classic ETL (Extract, Transform, Load) transforms data before storing it. Modern ELT (Extract, Load, Transform) stores raw data first and transforms it later. ELT is popular because storage is cheap and you can always re-transform. This week, you store raw data in a "raw" table (ELT-style) before any cleanup.
</aside>
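An ELT-style raw table in SQLite might look like the sketch below. The table name, columns, and primary key are illustrative; the parameterized upsert also previews Chapter 6 — re-ingesting the same city and day updates the row instead of duplicating it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory for the sketch; the pipeline uses weather.db
conn.execute("""
    CREATE TABLE IF NOT EXISTS raw_readings (
        city        TEXT,
        day         TEXT,
        temperature REAL,
        PRIMARY KEY (city, day)
    )
""")

def save_reading(city: str, day: str, temperature: float) -> None:
    """Parameterized upsert: ? placeholders prevent SQL injection."""
    conn.execute(
        """INSERT INTO raw_readings (city, day, temperature)
           VALUES (?, ?, ?)
           ON CONFLICT(city, day) DO UPDATE SET temperature = excluded.temperature""",
        (city, day, temperature),
    )

save_reading("Copenhagen", "2024-05-01", 7.5)
save_reading("Copenhagen", "2024-05-01", 8.1)  # same key: updated in place
rows = conn.execute("SELECT temperature FROM raw_readings").fetchall()
```

Because storage is cheap, keeping the raw readings around means you can always re-run a better transformation later without re-fetching anything.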
<aside> 💡 Using AI to help: When exploring a new data source, try pasting a sample response (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "What are the edge cases and potential failure points I need to handle when ingesting this data?" It will often spot issues you hadn't considered.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.