Week 2 - Structuring Pipelines
Indicative as of May 2026: see Sources for current numbers.
This page answers two questions students ask every week: why am I learning this, and how does it help me find a job?
It is scoped to Week 2 content (structuring Python pipelines: configuration management, separation of concerns, dataclasses, functional composition, pytest, and ruff). Other weeks' career pages each cover their own week's tools, not these. Generic NL junior-data career content (salary bands, day-to-day work, what employers do not expect from juniors) belongs in one shared page across the curriculum and is not repeated here.
The numbers below are a rough reading of public NL postings as of May 2026. They are indicative, not measured. A separate project crawls Dutch data postings and will replace the qualitative claims here with measured percentages once the dataset is ready; placeholders are marked ~XX% for that swap.
Almost every Python-touching NL data posting expects the patterns from this week: they just rarely name them explicitly. "Clean code", "modular design", "writes tests", "production-ready Python" all map to what you practiced. The frequency-by-role matters less here than the language postings use.
| Role | Postings expecting structured-Python skills | What the posting expects |
|---|---|---|
| Data Engineer (DE) | ~85% | Hands-on. "Production Python", "modular pipeline code", "writes tests for transformation logic", "uses environment variables for config". The Week 2 patterns are table-stakes, not differentiators. |
| Analytics Engineer (AE) | ~30% | Lighter expectation: most AE work happens in dbt/SQL, but utility scripts and CI tooling still ride on Python. "Comfortable with Python" usually means the Week 2 register. |
| Data Scientist (DS) | ~50% | Notebook-first roles often skip these patterns entirely; product-DS roles (recommender systems, ML pipelines) expect them at the DE level. The split is roughly notebook-DS vs ML-engineer-DS. |
| Data Analyst (DA) | ~10% | Rarely required. Some scale-ups hire analysts who maintain shared Python utilities; most do not. |
The directional shape: DE roles assume you write structured Python; AE/DS roles assume you can read it; DA roles do not require it but it is a CV differentiator. If you are aiming at DE postings, this week is the floor: recruiters do not ask "do you separate I/O from logic?" the way they ask "do you know dbt?", but interviewers will see whether you do it in your code-screen submission.
The tools and patterns this week introduces have NL-market-standard alternatives. Most postings do not name a specific library; they name the pattern and expect you to bring tools to it.
| Concept | Tool taught | Common NL alternatives | Practical implication |
|---|---|---|---|
| Config & secrets | `python-dotenv` + `os.environ` | pydantic-settings, dynaconf, AWS Secrets Manager, Azure Key Vault | All build on the same env-var contract. NL teams on Azure typically layer Key Vault on top in production (covered later in the track). The `.env` pattern is what runs locally. |
| Data modelling | `@dataclass` | Pydantic (very common in production, ~XX% of postings name it explicitly when validation comes up), attrs, plain dicts | Pydantic is widely used for validated data models in 2026 NL postings. Dataclasses are the gateway pattern; the next week introduces Pydantic on top. |
| Functional composition | hand-rolled `pipe()` | toolz, returns, pandas method chaining | Most production pipelines compose with simple variable rebinding (the way you learned), not with a pipe library. Pandas chaining is the same pattern with different syntax (introduced when pandas comes in). |
| Testing | pytest | unittest (legacy), nose (deprecated) | pytest is the standard. unittest shows up in older codebases at banks and government employers. |
| Linting/formatting | ruff | black + flake8 + isort (the stack ruff replaced) | ruff has effectively won the NL market in the last 18 months; postings still occasionally name black because the description is older than the codebase. |
What this means for your CV: lead with "structured Python (config, dataclasses, pytest, ruff)" as a single phrase. Recruiters scan for the keywords; interviewers care that you can defend why each pattern is there.
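To make the "env-var contract" from the Config & secrets row concrete, here is a minimal sketch of a centralised config module that fails at startup when a required variable is missing. The names `Settings`, `require`, `load_settings`, `INPUT_PATH`, and `API_KEY` are hypothetical and chosen for illustration; your assignment's `config.py` may look different. The sketch reads from `os.environ` only, so it runs with the standard library; locally you would typically call python-dotenv's `load_dotenv()` first so the `.env` entries land in `os.environ`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical fields, chosen for illustration only.
    input_path: str
    api_key: str


def require(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the variable's value, or fail loudly if it is missing or empty."""
    source = os.environ if env is None else env
    value = source.get(name)
    if not value:
        raise ValueError(f"Missing required environment variable: {name}")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read every required variable once, at startup.

    Passing `env` explicitly keeps the function testable without
    touching the real process environment.
    """
    return Settings(
        input_path=require("INPUT_PATH", env),
        api_key=require("API_KEY", env),
    )
```

Because every variable is read in one place, a missing value surfaces as a `ValueError` before any data is processed. Swapping the source of the values (Key Vault, Secrets Manager) would change where the environment is populated, not how the rest of the pipeline reads its settings.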
Postings phrase the expectation at three levels:
`tests/` folder is more than most juniors arrive with.
The chapter does not yet practice Pydantic, mypy, or full dependency management: the next week introduces Pydantic, and the type-checking and packaging pieces come later. That progression is intentional: Week 2 is the structural foundation everything else builds on.
Strong line a student can copy-adapt:
Refactored a single-script CSV pipeline into a modular Python project: separated I/O from transform logic, modelled `Transaction` records as a `@dataclass` with `__post_init__` validation, composed transforms as pure functions returning new collections (no mutation), wrote a pytest suite covering the pure transforms with `@pytest.fixture` and `@pytest.mark.parametrize`, loaded configuration from `.env` via a centralised `config.py`, and enforced style with `ruff check` + `ruff format`.
Recruiter keywords this carries: Python, dataclasses, pytest, fixtures, parametrize, pure functions, separation of concerns, dependency injection, dotenv, ruff, modular design.
Weaker alternative for contrast (avoid):
Wrote some Python scripts that clean CSV files.
The weaker version drops every recruiter keyword and could be claimed by anyone who has ever opened a CSV in a notebook. The strong version names the specific patterns and the specific tools.
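If an interviewer asks you to back up the strong line, it helps to have the shape of the code in your head. The sketch below is one possible reading of it, not the assignment's reference solution: the `Transaction` fields, the two transform functions, and this particular `pipe()` implementation are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable


@dataclass(frozen=True)
class Transaction:
    # Hypothetical fields; the assignment's real schema may differ.
    id: str
    amount: float
    currency: str

    def __post_init__(self) -> None:
        # Validation runs on every construction, including via replace().
        if not self.id:
            raise ValueError("id must not be empty")
        if len(self.currency) != 3:
            raise ValueError(f"currency must be a 3-letter code, got {self.currency!r}")


Transform = Callable[[list[Transaction]], list[Transaction]]


def drop_zero_amounts(rows: list[Transaction]) -> list[Transaction]:
    """Pure transform: returns a new list, leaves the input untouched."""
    return [r for r in rows if r.amount != 0]


def uppercase_currency(rows: list[Transaction]) -> list[Transaction]:
    """Pure transform: builds new records instead of mutating existing ones."""
    return [replace(r, currency=r.currency.upper()) for r in rows]


def pipe(rows: list[Transaction], *steps: Transform) -> list[Transaction]:
    """Hand-rolled composition: feed the output of each step into the next."""
    return reduce(lambda acc, step: step(acc), steps, rows)


if __name__ == "__main__":
    raw = [Transaction("t1", 10.0, "eur"), Transaction("t2", 0.0, "usd")]

    # Composition with pipe() ...
    cleaned = pipe(raw, drop_zero_amounts, uppercase_currency)

    # ... and the same thing with plain variable rebinding.
    rebound = drop_zero_amounts(raw)
    rebound = uppercase_currency(rebound)

    print(cleaned == rebound)
```

Neither transform reads or writes a file, so a pytest test can call each one with a small in-memory list and compare the result, which is the "test each transform without touching the filesystem" argument in practice.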
Three sentences that cover the assignment cleanly when an interviewer asks "tell me about a project you have built":
`Transaction` dataclass for the row schema, and a chain of pure transform functions for the business rules. The point was that I could test each transform without touching the filesystem."
`.env` file loaded by a centralised `config.py` that raised `ValueError` if a required variable was missing. That pattern was deliberate: I wanted the pipeline to fail loudly at startup rather than crash mid-run with a `NoneType` error."
Two honest follow-ups if asked "what would you do differently?":
`@dataclass` for the row model, but for production I'd reach for Pydantic so I get validation errors with field paths instead of writing `__post_init__` checks by hand. The next week introduces that, and the migration is small because the field declarations transfer directly."
`list[dict]`; for a real workload I'd benchmark against a pandas or polars version, since the row-by-row pattern stops being competitive once the dataset crosses a few hundred MB. The architecture would not change: only the transform internals."
Week 2 is the foundation, not the finished article. After this week you are not yet:
`mypy` or `pyright` against them. Hints without a checker are documentation, not enforcement. Strict static typing is a separate discipline introduced later in the track.
These are the senior-shaped skills the chapter does not yet make you qualified for. Naming them honestly in an interview ("I know where my skills end") is more impressive than a junior overclaim.
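The claim that hints without a checker are documentation, not enforcement, is easy to demonstrate. In this small sketch (the function `double` is a made-up example), Python runs a call that contradicts the annotations without complaint; a static checker such as mypy or pyright would be expected to flag the second call before the code ever runs.

```python
def double(x: int) -> int:
    """Annotated as int -> int, but nothing checks that at runtime."""
    return x * 2


ok = double(3)       # matches the hints
odd = double("ab")   # contradicts the hints, yet runs: str * 2 repeats the string

print(ok, odd)
```

This is why "I wrote type hints" and "my code passes a type checker" are different statements in an interview.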
Treat this page as indicative, not statistical. The numbers will be replaced with measured percentages once the postings-crawler project ships.
<aside> 💭 For generic NL junior data-career content (salary bands, day-to-day work, what employers do not expect from any junior), one shared page across all weeks is the right home. That page does not exist yet; for now, treat this page as Week-2-specific only.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
