Week 3 - Ingesting and Validating Data
Indicative as of May 2026: see Sources for current numbers.
This page answers two questions students ask every week: why am I learning this, and how does it help me find a job?
It is scoped to Week 3 content (data ingestion: REST APIs with requests, retry logic and exponential backoff, structured error handling, reading CSV / JSON / Parquet, Pydantic validation, SQLite writes with parameterized queries and upserts). Other weeks' career pages each cover their week's tools, not these. Generic NL junior-data career content (salary bands, day-to-day work, what employers do not expect from juniors) lives in one shared page across the curriculum and is not repeated here.
The numbers below are a rough reading of public NL postings as of May 2026. They are indicative, not measured. A separate project crawls Dutch data postings and will replace the qualitative claims here with measured percentages once the dataset is ready; placeholders are marked ~XX% for that swap.
Ingestion is the part of data engineering that connects your code to the real world. NL postings phrase it in many ways ("integrates with external sources", "builds API connectors", "validates incoming data"), but the underlying skills are the same five: HTTP clients, retry logic, validation, file formats, and database writes. Week 3 covers all five.
| Role | Postings expecting Week 3's ingestion stack | What the posting expects |
|---|---|---|
| Data Engineer (DE) | ~XX% (high, likely 80%+) | "Builds and maintains ingestion pipelines from REST APIs and files", "implements retry logic and error handling", "validates incoming data before it reaches the warehouse". Week 3 is the bread-and-butter of junior DE work; postings expect this on day one and do not teach it in onboarding. |
| Analytics Engineer (AE) | ~XX% (mid, likely 40-50%) | Lighter and skewed toward the validation side. Most AE work is dbt / SQL transformations, but utility scripts that fetch reference data (currency rates, country lookups) from an API and load to the warehouse are a routine AE task. "Comfortable with Python and APIs" usually means the Week 3 register: read a JSON response, validate it, write to a table. |
| Data Scientist (DS) | ~XX% (mid, likely 30-40%) | Notebook-DS roles ingest from APIs to feed feature pipelines; ML-engineer-DS roles routinely build the ingestion side themselves. Postings list "Python, requests, Pydantic" or "experience with REST APIs" on the must-haves. Less common: deep SQL writes. The DS side leans on the API and validation pieces of Week 3 more than the database piece. |
| Data Analyst (DA) | ~XX% (low, likely 10-20%) | Rare. Most DA work consumes data that someone else ingested. Postings occasionally mention "comfortable pulling from an API" for self-service contexts. When they do, the bar is Chapter 2 level (one `requests.get` call plus JSON parsing), not the full retry-and-validation stack. |
The directional shape: Week 3 maps tightly onto the junior DE role and the ingestion-heavy slice of every other data role. If a posting says "builds data pipelines" without further qualification, this is what they mean by "the input side."
The chapters teach the standard-library and Pydantic defaults. NL postings name a wider range of alternatives once you reach medior level.
| Concept | Tool taught | Common NL alternatives | Practical implication |
|---|---|---|---|
| HTTP client | requests | httpx (async + HTTP/2), aiohttp, plain urllib | requests is the universal floor. Postings that name a specific client usually mean httpx, because async ingestion against many sources is the obvious next step. The mental model (status codes, headers, params, timeouts) transfers 1:1. |
| Retry logic | manual `time.sleep(2 ** n)` retry loop | tenacity decorators, `urllib3.Retry` adapter, framework-level retries (Airflow / Prefect task retries) | Hand-rolled retries clear the junior bar. Production teams use tenacity or a framework retry to avoid scattering retry logic across every source. The pattern (transient vs permanent, exponential backoff, jitter) is what they really test for in interviews. |
| Validation | Pydantic v2 with `BaseModel`, `Field`, `@field_validator` | pydantic is the de-facto NL default; alternatives are dataclasses + manual `__post_init__`, marshmallow, cerberus, attrs + custom validators | Postings increasingly name Pydantic explicitly. If they do not, "validates data against a schema" is what they mean. Knowing Pydantic v2 syntax (not v1's `@validator` / `.dict()`) is the medior-level expectation. |
| File formats | stdlib csv, json; pyarrow / pandas for Parquet | `pandas.read_csv` / `read_json` / `read_parquet`, polars, duckdb (in-process SQL over Parquet) | The stdlib csv and json modules are the right place to start. Once volumes grow past a few hundred MB, postings expect pandas (Week 4) or polars. duckdb shows up in NL postings for analytics-engineering roles that want SQL semantics over Parquet without a warehouse. |
| Declarative ingestion | (not taught this week) | dlt (data load tool, very active NL community), Airbyte (connector platform), Singer / Meltano, Fivetran (managed SaaS) | Week 3 builds ingestion by hand on purpose. NL teams adopt dlt or Airbyte once they have 5+ sources to maintain; the SQL and HTTP patterns you learn here are what those frameworks generate under the hood. |
| Database writes | SQLite + `?` parameters + `ON CONFLICT DO UPDATE` | PostgreSQL with psycopg / SQLAlchemy, Azure Database for PostgreSQL (the HYF class DB), MotherDuck / DuckDB for analytical writes | SQLite is the training-wheels choice. Almost every NL DE posting names PostgreSQL specifically; Azure shops list Azure Database for PostgreSQL or Azure SQL. The parameterized-query and upsert patterns you learned transfer unchanged: the connection library is the only thing that swaps. |
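
The "Retry logic" row names a pattern (transient vs permanent, exponential backoff, jitter) rather than a library. A minimal sketch of that pattern using only the standard library is shown here. The names `TransientError`, `PermanentError`, `classify_status`, and `fetch_with_retry` are hypothetical, the status-code split is one common convention rather than a rule from the chapters, and `fetch` stands in for whatever function performs the real `requests` call.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying: timeouts, dropped connections, HTTP 429 or 5xx."""


class PermanentError(Exception):
    """Not worth retrying: for example HTTP 400, 401, or 404."""


def classify_status(status_code):
    """Map an HTTP status code to 'ok', 'transient', or 'permanent'."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429 or 500 <= status_code < 600:
        return "transient"
    return "permanent"  # assumption: everything else is treated as permanent


def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only transient failures."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last transient error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
        # PermanentError is deliberately not caught, so it propagates at once.
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the delays instead of actually waiting. A library such as tenacity expresses the same policy as a decorator.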
What this means for your CV: lead with "Python ingestion (REST APIs with requests + retry, Pydantic validation, SQLite / PostgreSQL writes with upserts)" as a single phrase, not a checklist of modules.
Postings phrase the ingestion bar at three levels:
Week 3 does not yet practice async ingestion (httpx + asyncio), production retry libraries (tenacity), declarative frameworks (dlt), or non-SQLite databases. Those are the bridge from junior to medior and show up later in the track (the cloud-databases week swaps SQLite for Postgres; the orchestration week introduces task-level retries via Airflow). Week 3 is the foundation a hiring manager assumes when they invite a junior to a take-home assignment.
Strong line a student can copy-adapt:
Built a resilient weather-data ingestion pipeline in Python: fetched hourly readings from the Open-Meteo REST API with `requests` (timeout=10, exponential-backoff retry on `ConnectionError` and `Timeout`, classified transient vs permanent HTTP errors), normalized CSV and JSON inputs into a shared schema, validated every record with a Pydantic v2 `BaseModel` (`Field(ge=-90, le=60)` constraints, `@field_validator` for timestamp parsing), and wrote survivors to SQLite using parameterized `INSERT ... ON CONFLICT(station, timestamp) DO UPDATE SET` for idempotent re-runs. Accumulated per-record errors into a structured report instead of crashing on the first bad row.
Recruiter keywords this carries: Python, requests, retry, exponential backoff, transient vs permanent errors, Pydantic, validation, SQLite, ON CONFLICT, upsert, idempotency, ETL, ingestion, CSV, JSON, Parquet, REST API, parameterized queries.
Weaker alternative for contrast (avoid):
Wrote a Python script that downloads weather data and stores it in a database.
The weaker version drops every recruiter keyword and could be claimed by anyone who has ever run pip install requests. The strong version names the specific patterns (retry classification, Pydantic constraints, ON CONFLICT upserts, error accumulation) that signal the candidate has built ingestion code that survives Monday morning when the upstream API is flaky.
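
Error accumulation, the last pattern named above, can be sketched as follows. To stay runnable with the standard library alone, this sketch swaps the Pydantic model for a dataclass with a manual `__post_init__` check (one of the alternatives listed in the table); in the assignment, a Pydantic `BaseModel` and its `ValidationError` would fill the same role. The `Reading` fields and the `validate_all` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Reading:
    station: str
    timestamp: str
    temperature_c: float

    def __post_init__(self):
        # Stand-in for a Field(ge=-90, le=60) constraint.
        self.temperature_c = float(self.temperature_c)
        if not -90 <= self.temperature_c <= 60:
            raise ValueError(f"temperature_c out of range: {self.temperature_c}")
        # Stand-in for a timestamp validator: raises ValueError if malformed.
        datetime.fromisoformat(self.timestamp)


def validate_all(rows):
    """Return (valid_readings, error_report) without stopping at a bad row."""
    valid, errors = [], []
    for index, row in enumerate(rows):
        try:
            valid.append(Reading(**row))
        except (TypeError, ValueError) as exc:
            # Record the failure and keep going with the next row.
            errors.append({"row": index, "input": row, "error": str(exc)})
    return valid, errors
```

The report keeps the row index and the original input next to the error message, so a bad record can be inspected after the run while every good record still reaches the database.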
When asked "tell me about a Python project you have built", the Week 3 assignment gives you a 90-second answer:
I built an ingestion pipeline for a fictional weather-data company. It pulls hourly readings from the Open-Meteo REST API and a CSV partner feed, validates every record against a Pydantic model, and writes the survivors to SQLite. Three things I am proud of. First, it survives transient network failures: the API client classifies errors as transient or permanent, retries transient ones with exponential backoff, and gives up cleanly on permanent ones. Second, the writes are idempotent: re-running the same input produces the same database state, no duplicates, because I used `INSERT ... ON CONFLICT DO UPDATE SET` keyed on `(station, timestamp)`. Third, the pipeline never crashes on a bad row: I accumulate per-record validation errors into a structured report so I can debug after the run instead of losing every good record before the bad one.
This answer hits six interview-relevant concepts in ninety seconds: ingestion, error classification, retry / backoff, validation, idempotency, and error accumulation. Tailor it to whatever the interviewer brings up in follow-up.
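
If the interviewer follows up on idempotency, it helps to be able to sketch the write on a whiteboard. Below is a minimal example with the standard-library `sqlite3` module. The table name `readings`, its three columns, and the `write_readings` helper are assumptions made for illustration; only the `(station, timestamp)` key and the `ON CONFLICT ... DO UPDATE SET` statement come from the answer above.

```python
import sqlite3

# Parameterized upsert: the ? placeholders keep values out of the SQL text.
UPSERT = """
INSERT INTO readings (station, timestamp, temperature_c)
VALUES (?, ?, ?)
ON CONFLICT(station, timestamp) DO UPDATE SET
    temperature_c = excluded.temperature_c
"""


def write_readings(conn, rows):
    """Insert or update each (station, timestamp, temperature_c) tuple."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS readings (
            station TEXT NOT NULL,
            timestamp TEXT NOT NULL,
            temperature_c REAL NOT NULL,
            PRIMARY KEY (station, timestamp)
        )
        """
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(UPSERT, rows)


conn = sqlite3.connect(":memory:")
batch = [
    ("AMS", "2026-05-01T10:00:00", 12.5),
    ("AMS", "2026-05-01T11:00:00", 13.1),
]
write_readings(conn, batch)
write_readings(conn, batch)  # re-run with the same input
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
# row_count is 2: the second run updated the existing rows instead of duplicating them
```

The conflict target must match a primary key or unique index, which is why the sketch declares `PRIMARY KEY (station, timestamp)`. Without that constraint SQLite rejects the `ON CONFLICT` clause.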
Two honest follow-ups if asked "what would you do differently?":

- […] `tenacity` so the retry policy is a decorator on the fetch function instead of an inline `for attempt in range(...)` loop. That makes the retry policy visible at the call site and lets me change it in one place."
- […] `ON CONFLICT` upsert pattern, and add a dead-letter table for the records the Pydantic validator rejects, so the next run can replay them after a schema fix."

Week 3 builds the ingestion side, not the whole pipeline. After this week you are not yet:

- […] `python pipeline.py`. Production ingestion runs on a schedule (Airflow, Prefect, Dagster), retries failed tasks at the task level, alerts on missed SLAs, and shares state across runs through a metadata store. Those concerns arrive in the orchestration week later in the track.

Naming these honestly in an interview ("I have built unattended-style ingestion against a single source; I have not yet wired it into a scheduler or set up dead-letter replay") signals more maturity than overclaiming.
Treat this page as indicative, not statistical. The ~XX% figures will be replaced with measured percentages once the postings-crawler project ships.
<aside> 💭 For generic NL junior data-career content (salary bands, day-to-day work, what employers do not expect from any junior), one shared page across all weeks is the right home. That page does not exist yet; for now, treat this page as Week-3-specific only.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
