Week 3 - Ingesting and Validating Data
A single place to look up every data-ingestion and validation term you meet this week. Entries use ### term headers so each definition has a stable anchor that other chapters link to. Browse by chapter or use Cmd/Ctrl + F.
The process of pulling raw data from external sources (APIs, files, databases) into your system. It is the first step of any data pipeline: before you can transform, analyze, or visualize data, you have to get it.
Extract-Transform-Load. The classic pattern of pulling data, reshaping it in flight, then writing the cleaned result to a target. Compare with ELT, which loads first and transforms later.
Extract-Load-Transform. The modern variant where raw data is stored as-is first (cheap storage), then transformed inside the warehouse. Week 3's "raw table" pattern is an ELT building block.
The three-step pattern every ingestion pipeline follows: extract from the source, validate each record, store the survivors. Each Week 3 chapter teaches one piece of this flow.
A web API that exposes resources at URLs and is accessed with HTTP methods (GET, POST, ...). Most of the public data APIs used this week (Open-Meteo, Azure ARM, GitHub) are REST APIs.
The numeric result of an HTTP request: 2xx success, 3xx redirect, 4xx client error, 5xx server error. In the requests library, response.raise_for_status() turns 4xx and 5xx responses into an HTTPError exception you can catch.
The technique APIs use to return large datasets in chunks. Offset-based pagination uses page numbers or offsets (?page=2); cursor-based pagination uses an opaque token returned by the previous response.
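
The cursor-based variant can be sketched without a network connection. This is a minimal illustration, not a real client: `FAKE_PAGES`, `fetch_page`, and the `next_cursor` field name are all hypothetical stand-ins for whatever a specific API returns.

```python
# Hypothetical in-memory "API": each page holds items plus the cursor for the next page.
FAKE_PAGES = {
    None: {"items": [1, 2], "next_cursor": "abc"},
    "abc": {"items": [3, 4], "next_cursor": "def"},
    "def": {"items": [5], "next_cursor": None},
}


def fetch_page(cursor=None):
    """Stand-in for an HTTP request such as GET /items?cursor=<token>."""
    return FAKE_PAGES[cursor]


def fetch_all():
    """Follow cursors until the API stops returning one."""
    items, cursor = [], None
    while True:
        page = fetch_page(cursor)
        items.extend(page["items"])
        cursor = page["next_cursor"]
        if cursor is None:  # no token means this was the last page
            return items


all_items = fetch_all()
```

The loop never needs to know how many pages exist; it only follows the token it was handed.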
A maximum number of requests an API will accept per time window. Exceeding it typically returns 429 Too Many Requests, often with a Retry-After header telling you how long to wait before retrying.
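
One way to honor Retry-After is a small helper that reads the header and falls back to a default wait. This is a sketch under assumptions: `retry_after_seconds` is a hypothetical name, `headers` is any dict-like object, and only the seconds form of the header is handled.

```python
def retry_after_seconds(headers, default=1.0):
    """Return how long to wait, based on a Retry-After header given in seconds."""
    value = headers.get("Retry-After")
    if value is None:
        return default  # the server gave no hint, so use our own default
    try:
        return max(0.0, float(value))
    except ValueError:
        # Retry-After may also be an HTTP date; this sketch falls back to the default.
        return default


wait = retry_after_seconds({"Retry-After": "30"})
```

A fuller version would also parse the HTTP-date form instead of ignoring it.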
The maximum time requests.get() will wait for a response before raising requests.exceptions.Timeout. Always set one: without it, a stalled server freezes your pipeline silently.
An error that could plausibly succeed if you retried the same request a moment later, with no code change. Examples: 500, 503, 429, ConnectionError. Retry transient errors with backoff.
An error that will fail every time until you change the request or the data. Examples: 401 Unauthorized, 404 Not Found, a Pydantic ValidationError. Log and skip permanent errors; never retry them.
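
The transient/permanent split can be expressed as a small lookup. This is a rough sketch: `TRANSIENT_STATUSES` and `should_retry` are hypothetical names, and including 502 and 504 next to the 500, 503, and 429 examples is an assumption you may want to adjust per API.

```python
# Status codes treated as transient. 500, 503, and 429 come from the examples above;
# 502 and 504 are an added assumption (gateway errors that often clear on their own).
TRANSIENT_STATUSES = {429, 500, 502, 503, 504}


def should_retry(status_code):
    """Return True for errors that may succeed on a later attempt."""
    return status_code in TRANSIENT_STATUSES


decisions = {code: should_retry(code) for code in (401, 404, 429, 503)}
```

Anything not in the set is treated as permanent: log it, skip it, and move on.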
A retry strategy where the wait between attempts doubles each time (1s, 2s, 4s, 8s). Gives an overloaded server an exponentially growing recovery window instead of hammering it at a fixed rate.
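
The doubling schedule is easy to compute up front. A minimal sketch, where `backoff_delays` and its `base` parameter are hypothetical names chosen for illustration:

```python
def backoff_delays(max_retries, base=1.0):
    """Return the wait before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(max_retries)]


delays = backoff_delays(4)
```

With the defaults, four retries produce the 1s, 2s, 4s, 8s sequence described above.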
A for attempt in range(max_retries) loop wrapped around an external call. Tries the operation, sleeps on transient failure, gives up after a configurable number of attempts.
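
A retry loop with backoff might look roughly like this. The sketch assumes a hypothetical `TransientError` exception to mark retryable failures, and it takes `sleep` as a parameter so the example can run without actually waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying."""


def call_with_retries(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)


# Usage: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


waits = []  # records the requested sleeps instead of really sleeping
result = call_with_retries(flaky, sleep=waits.append)
```

Only `TransientError` is caught here, so a permanent error passes straight through without being retried.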
The pattern of appending failing records to an errors list instead of crashing on the first one. Lets a batch of 10,000 records survive a handful of bad rows and finish with a (valid, errors) tuple.
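
A minimal sketch of error accumulation, using plain Python checks in place of a real validator. The names `parse_reading` and `process_batch`, and the `station` and `temp_c` fields, are made up for illustration.

```python
def parse_reading(row):
    """Raise KeyError or ValueError if the row is missing data or malformed."""
    return {"station": row["station"], "temp_c": float(row["temp_c"])}


def process_batch(rows):
    """Validate every row, collecting failures instead of stopping at the first."""
    valid, errors = [], []
    for index, row in enumerate(rows):
        try:
            valid.append(parse_reading(row))
        except (KeyError, ValueError) as exc:
            errors.append({"index": index, "row": row, "error": str(exc)})
    return valid, errors


rows = [
    {"station": "AMS", "temp_c": "18.5"},
    {"station": "RTM", "temp_c": "n/a"},  # not a number
    {"temp_c": "12.0"},                   # missing station
]
valid, errors = process_batch(rows)
```

Keeping the row index in each error makes it possible to trace a failure back to the source data later.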
Comma-Separated Values: a plain-text tabular format with no schema. Every value comes back as a string and must be parsed or validated downstream. Universal but fragile.
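
A quick way to see the "everything is a string" behavior is the standard library's `csv` module. The sample data below is made up for illustration.

```python
import csv
import io

text = "station,temp_c\nAMS,18.5\n"

# DictReader maps each data row to the column names from the header line.
rows = list(csv.DictReader(io.StringIO(text)))

raw_value = rows[0]["temp_c"]  # still a string at this point
temp = float(raw_value)        # conversion is your job, or your validator's
```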
JavaScript Object Notation: a text format for structured data with {"key": "value"} objects (keys must be quoted strings) and [item, ...] arrays. Preserves numbers, booleans, and nulls; the default for API responses.
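
By contrast with CSV, parsing JSON with the standard library's `json` module keeps the basic types. The sample payload is made up for illustration.

```python
import json

payload = '{"station": "AMS", "temp_c": 18.5, "valid": true, "note": null, "tags": ["a", "b"]}'
record = json.loads(payload)

# JSON numbers, booleans, nulls, and arrays become float/int, bool, None, and list.
types = {key: type(value).__name__ for key, value in record.items()}
```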
A columnar binary file format designed for analytics. Stores the schema in the file, often compresses to 5-10x smaller than the equivalent CSV, and reads only the columns a query needs.
Storing data column-by-column on disk instead of row-by-row. Makes queries that touch a few columns of a wide table dramatically faster, because the I/O skips unused columns entirely.
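
The difference between the two layouts can be mimicked with plain Python structures. This is only an in-memory analogy with made-up data, not how a real columnar file is read from disk.

```python
rows = [
    {"station": "AMS", "temp_c": 18.5, "humidity": 60},
    {"station": "RTM", "temp_c": 17.0, "humidity": 65},
]

# Row-oriented: averaging temp_c walks through every full record.
row_avg = sum(r["temp_c"] for r in rows) / len(rows)

# Column-oriented: one list per column, so a query touches only the list it needs.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_avg = sum(columns["temp_c"]) / len(columns["temp_c"])
```

Both give the same answer; the columnar version simply never looks at `station` or `humidity`.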
A Python library that validates data against typed models at runtime. Define a BaseModel with typed fields; Pydantic coerces inputs, enforces constraints, and raises ValidationError on bad data.
Pydantic's automatic conversion of inputs to the declared field type: "18.5" becomes the float 18.5, "72" becomes the int 72. Fails with a clear ValidationError when the input cannot be parsed.
A @field_validator("field_name") decorator that runs custom logic when a field is validated. Used for business rules (allowed values, format checks) or transformations (strip whitespace, title-case).
The Pydantic exception raised when one or more fields fail validation. Carries a structured list of per-field errors (field name, error type, message) via e.errors(), ideal for accumulation reports.
A SQL statement where values are sent as separate arguments (cursor.execute("INSERT ... VALUES (?)", (value,))) instead of being concatenated into the SQL string. Prevents SQL injection.
A vulnerability where user input concatenated into a SQL string is executed as code. A station name of '; DROP TABLE weather_readings; -- can delete your table. Parameterized queries prevent this, because values are never parsed as SQL.
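
The hostile station name above becomes harmless once it is passed as a parameter. A small sketch with the standard library's `sqlite3` module and an in-memory database; the column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather_readings (station TEXT, temp_c REAL)")

hostile_name = "'; DROP TABLE weather_readings; --"

# The ? placeholders send the values separately, so the name is stored as plain text.
conn.execute(
    "INSERT INTO weather_readings (station, temp_c) VALUES (?, ?)",
    (hostile_name, 18.5),
)

stored = conn.execute("SELECT station, temp_c FROM weather_readings").fetchall()
```

The table still exists afterwards, and the odd-looking name is just another row in it.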
An "update or insert" operation. INSERT ... ON CONFLICT(...) DO UPDATE SET ... inserts the row if it is new and updates it if it conflicts with a unique constraint. The key to idempotent pipelines.
The property that running the same operation twice produces the same end state. An idempotent pipeline can be re-run safely (after a crash, on overlapping windows, for backfills) without duplicating or corrupting data.
The SQLite (and PostgreSQL) clause that defines what happens when an INSERT would violate a unique constraint. ON CONFLICT(station, timestamp) DO UPDATE SET ... is the canonical upsert form.
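
Upserts and idempotency are easiest to see together. The sketch below uses the standard library's `sqlite3` module with an in-memory database and assumes a SQLite version that supports `ON CONFLICT ... DO UPDATE` (3.24 or newer). The table layout and the `load` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather_readings ("
    " station TEXT, timestamp TEXT, temp_c REAL,"
    " UNIQUE (station, timestamp))"
)

UPSERT = (
    "INSERT INTO weather_readings (station, timestamp, temp_c) VALUES (?, ?, ?) "
    "ON CONFLICT(station, timestamp) DO UPDATE SET temp_c = excluded.temp_c"
)


def load(readings):
    """Insert new readings and update existing ones in place."""
    conn.executemany(UPSERT, readings)
    conn.commit()


batch = [("AMS", "2024-01-01T00:00", 18.5), ("RTM", "2024-01-01T00:00", 17.0)]
load(batch)
load(batch)  # re-running the same batch does not add duplicate rows
load([("AMS", "2024-01-01T00:00", 19.0)])  # a corrected value overwrites the old one

rows = conn.execute(
    "SELECT station, timestamp, temp_c FROM weather_readings ORDER BY station"
).fetchall()
```

`excluded` refers to the row that the INSERT tried to write, which is how the new value reaches the UPDATE.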
<aside> 💡 If a word is new, add your own one-sentence note next to it. A glossary you have edited is much stickier than one you have only read.
</aside>
This page is supplementary. Nothing here is tested directly; use it as a lookup while you work through the numbered chapters and the assignment.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors