Week 2 - Structuring Data Pipelines
- Introduction to Data Pipelines
- Configuration & Secrets (.env)
- Separation of Concerns (I/O vs Logic)
- Linting and Formatting with Ruff
- Assignment: Refactoring to a Clean Pipeline
By the end of this chapter, you should be able to:

- explain what makes a data pipeline different from a one-off script
- move configuration and secrets out of your code and into a .env file
- separate I/O from pure logic so your transformations are testable
- lint and format your code with Ruff
- refactor a messy script into a small, clean pipeline
In Week 1, you learned Python syntax. You wrote scripts that read a file, applied some logic, and printed a result. That's a great starting point, and exactly how most engineers begin.
But in Data Engineering, "working on my machine" is not enough.
You build systems that run unattended: on servers you never see, at hours when nobody is watching, on data you did not create yourself.
If your Week 1 script runs on a server and encounters a missing file, it crashes. If the data format changes slightly, it crashes. If the server runs out of memory, it crashes.
Week 2 is about professionalizing your Python.
You will stop writing "scripts" and start writing "pipelines."
At its simplest, a data pipeline is a process that moves data from A to B reliably.
Robustness > Complexity
A simple script that never fails is better than a complex system that breaks every night.
To achieve this reliability, you need to change your habits. Stop "hardcoding" and start "engineering."
Imagine this script:
<!-- runner:expect-fail -->
```python
# bad_script.py
data = open("/Users/lassebenninga/Downloads/data.csv").read()
print(process(data))
```
This script is a disaster waiting to happen: the path only exists on one laptop, a missing or malformed file kills it with a raw traceback, the file handle is never closed, and the only way to "test" `process` is to run the whole thing against real data.
This week, you will fix these bad habits one by one.
It's not about being clever. It's about being boring.
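To make "boring" concrete, here is one possible rewrite of the script above. It is a minimal sketch, not the official solution: the `INPUT_PATH` environment variable, the `clean_row` helper, and the error message are illustrative assumptions.

```python
# better_script.py: a minimal sketch, not the official solution.
# Assumptions: the input path comes from an INPUT_PATH environment
# variable, and clean_row is a placeholder for your own logic.
import os
import sys


def clean_row(row: str) -> str:
    """Pure logic: easy to test because it touches no files."""
    return row.strip().lower()


def main() -> None:
    path = os.getenv("INPUT_PATH", "data.csv")  # configurable, not hardcoded
    try:
        with open(path, encoding="utf-8") as f:  # file handle is always closed
            rows = f.readlines()
    except FileNotFoundError:
        sys.exit(f"Input file not found: {path}")

    for row in rows:
        print(clean_row(row))


if __name__ == "__main__":
    main()
```

Same behaviour, but the path is configurable, a missing file produces one clear message instead of a traceback, and the logic sits in a small function you can test without touching the filesystem.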
```mermaid
flowchart LR
    subgraph script ["⚠️ Week 1 script (one file, one function)"]
        s1[read CSV<br/>from hardcoded path] --> s2[clean + transform<br/>+ calculate<br/>+ validate]
        s2 --> s3[write CSV<br/>to hardcoded path]
    end
    subgraph system ["✅ Week 2 system (layered, testable)"]
        direction TB
        cfg[config.py<br/><i>paths, secrets</i>] --> orch
        models[models.py<br/><i>row schema</i>] --> orch
        transforms[transforms.py<br/><i>pure functions</i>] --> orch
        io[io_layer.py<br/><i>read / write</i>] --> orch
        tests[tests/<br/><i>verify transforms</i>] -.verifies.-> transforms
        orch[pipeline.py<br/><b>orchestration</b>]
    end
    classDef bad fill:#7a3a3a,stroke:#fff,color:#fff
    classDef good fill:#2d5a87,stroke:#fff,color:#fff
    classDef test fill:#5a8c3a,stroke:#fff,color:#fff
    class s1,s2,s3 bad
    class cfg,models,transforms,io,orch good
    class tests test
```
The system has more files, but each one is small, replaceable, and individually testable. That is what "professional" buys you.
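As a rough sketch, the orchestration file from the diagram could look like this. The module and function names below are assumptions lifted from the diagram's labels (`config`, `io_layer`, `transforms`), not a prescribed API:

```python
# pipeline.py: orchestration only, no parsing and no business logic here.
# Hypothetical imports matching the diagram; your own names may differ.
from config import load_config                 # paths, secrets
from io_layer import read_rows, write_rows     # all file / network I/O
from transforms import clean_rows              # pure functions, unit-tested


def run() -> None:
    cfg = load_config()
    raw_rows = read_rows(cfg.input_path)        # I/O in
    clean = clean_rows(raw_rows)                # logic
    write_rows(cfg.output_path, clean)          # I/O out


if __name__ == "__main__":
    run()
```

Every import points at a file you can swap or test on its own; `pipeline.py` only decides the order in which they run.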
| Amateur Script | Professional Pipeline |
|---|---|
| Fragile: Breaks if a folder moves. | Robust: Configurable paths. |
| Implicit: Hidden logic in 100 lines. | Clean: Small functions with clear names. |
| Unsafe: Passwords in code. | Secure: Credentials in .env. |
| Untestable: Only "test" is running it. | Verified: Unit tests for logic. |
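The last row is the one that pays off fastest. Once the logic lives in small pure functions, a unit test is just a function call. A sketch, assuming the `clean_row` helper from the earlier example has been moved into a `transforms.py` module:

```python
# tests/test_transforms.py: a sketch of a unit test for a pure function.
# Assumes clean_row lives in transforms.py; adjust the import to your layout.
from transforms import clean_row


def test_clean_row_strips_whitespace_and_lowercases():
    assert clean_row("  Amsterdam \n") == "amsterdam"
```

Run it with `pytest` and you get a safety net that "just run the script and see" never gives you.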
You will take a messy script and refactor it step by step: restructure it without changing its behavior.
By Friday, you won't just have a script. You'll have a small data engineering system.
| Concept | The "Script" Mindset | The "System" Mindset |
|---|---|---|
| Failure | "I'll just run it again." | "It should retry automatically." |
| Data | "I know what the CSV looks like." | "I validate the data before processing." |
| State | "I remember which files I processed." | "The system tracks what is done." |
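The second row does not need a framework either. Even a tiny check like the sketch below (the expected column names are made up for illustration) turns a silent schema change into a loud, early failure:

```python
# validate.py: a minimal sketch of "validate the data before processing".
# EXPECTED_HEADER is an illustrative assumption, not a real schema.
EXPECTED_HEADER = ["date", "product", "price"]


def validate_header(header: list[str]) -> None:
    """Fail fast, with a clear message, before any row is processed."""
    if header != EXPECTED_HEADER:
        raise ValueError(
            f"Unexpected CSV header {header!r}, expected {EXPECTED_HEADER!r}"
        )
```

The script mindset discovers the schema change three steps later, in a confusing traceback; the system mindset discovers it here, in one readable error.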
<aside> 💡 Key Takeaway: A data pipeline isn't just code. It's code plus the environment that makes it run reliably.
</aside>
In Week 1, you created your Azure account. This week, you won't deploy anything yet, but you should start understanding where your code will eventually run.
Open the Azure Portal and browse your Resource groups. A resource group is like a project folder in the cloud: it groups related services together. Right now yours is empty, but later in the track you will deploy containers there, then add databases, storage accounts, and scheduled jobs.
<aside> ⌨️ Hands on: Log into portal.azure.com and find your resource group. Click into it and explore the "Overview" tab. Note the Region (where your resources will physically run) and the Subscription name. You will need both later. Also open Cost Management + Billing from the left sidebar and check that your current cost is €0.
</aside>
Understanding the portal now, while there is nothing to break, makes the cloud weeks much less overwhelming.
<aside> 🤓 Curious Geek: The Origin of "Pipeline"
The term "data pipeline" comes from Unix pipes (1973), where Doug McIlroy proposed connecting programs like garden hoses: the output of one becomes the input of the next.
The `|` symbol in your terminal (`cat file.csv | grep "error" | wc -l`) is a literal pipeline. Data engineering just scaled this idea to terabytes!
</aside>
Before checking your understanding, take a moment to apply what you read to the bad script from earlier.
<aside> ⌨️ Hands on: Look at the "works on my machine" script above. List 3 specific changes you would make to turn it into a professional pipeline. Think about config, error handling, and testing.
</aside>
<aside> 📚 For full courses, books, and community resources, see the optional Going Further page.
</aside>
Next up: Configuration & Secrets, where you take the first concrete step away from "works on my machine" by moving credentials and paths out of your code into a .env file.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
