Week 2 - Structuring Data Pipelines
Introduction to Data Pipelines
Configuration & Secrets (.env)
Separation of Concerns (I/O vs Logic)
Linting and Formatting with Ruff
Assignment: Refactoring to a Clean Pipeline
In Week 1, you learned Python syntax. You wrote scripts that read a file, applied some logic, and printed a result. That's a great starting point, and exactly how most engineers begin.
But in Data Engineering, "working on my machine" is not enough.
You build systems that run on servers, on a schedule, without a human watching.
If your Week 1 script runs on a server and encounters a missing file, it crashes. If the data format changes slightly, it crashes. If the server runs out of memory, it crashes.
Week 2 is about professionalizing your Python.
You will stop writing "scripts" and start writing "pipelines."
At its simplest, a data pipeline is a process that moves data from A to B reliably.
Robustness > Complexity
A simple script that never fails is better than a complex system that breaks every night.
To achieve this reliability, you need to change your habits. Stop "hardcoding" and start "engineering."
Imagine this script:
```python
# bad_script.py
data = open("/Users/lassebenninga/Downloads/data.csv").read()
print(process(data))
```
This script is a disaster waiting to happen: the path only exists on one person's laptop, the file handle is never closed, and any missing file or unexpected format ends in a bare traceback.
This week, you will fix these bad habits one by one.
It's not about being clever. It's about being boring.
| Amateur Script | Professional Pipeline |
|---|---|
| Fragile: Breaks if a folder moves. | Robust: Configurable paths. |
| Implicit: Hidden logic in 100 lines. | Clean: Small functions with clear names. |
| Unsafe: Passwords in code. | Secure: Credentials in .env. |
| Untestable: Only "test" is running it. | Verified: Unit tests for logic. |
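To make the contrast concrete, here is a hedged sketch of the bad script above with the professional habits applied. The `PIPELINE_INPUT` environment variable and the line-counting `process` logic are illustrative assumptions, not part of the course starter code:

```python
# pipeline.py - a sketch of bad_script.py rewritten with the habits
# from the table above. Names here are illustrative assumptions.
import os
from pathlib import Path


def process(data: str) -> int:
    """Toy logic: count non-empty lines. Small, pure, and unit-testable."""
    return sum(1 for line in data.splitlines() if line.strip())


def main() -> None:
    # Robust: the path comes from the environment (with a relative default)
    # instead of being hardcoded to one laptop's Downloads folder.
    input_path = Path(os.environ.get("PIPELINE_INPUT", "data.csv"))

    # Clean: fail with a clear message instead of a bare traceback.
    if not input_path.exists():
        raise SystemExit(f"Input file not found: {input_path}")

    # The context manager closes the file even if processing fails.
    with input_path.open(encoding="utf-8") as f:
        data = f.read()

    print(process(data))


if __name__ == "__main__":
    main()
```

Note how the logic lives in a small pure function (`process`) while all the I/O sits in `main` — exactly the separation of concerns this week is about.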
You will take a messy script and refactor it (restructure it without changing its behavior) step by step.
By Friday, you won't just have a script. You'll have a Data Engineering System.
| Concept | The "Script" Mindset | The "System" Mindset |
|---|---|---|
| Failure | "I'll just run it again." | "It should retry automatically." |
| Data | "I know what the CSV looks like." | "I validate the data before processing." |
| State | "I remember which files I processed." | "The system tracks what is done." |
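The "it should retry automatically" row can be sketched as a tiny helper. This is a minimal sketch, not a library function: `with_retry`, the attempt count, and the delay are all illustrative assumptions you would tune per job.

```python
# retry.py - a minimal sketch of the "system mindset": retry transient
# failures automatically instead of rerunning the script by hand.
import time


def with_retry(func, attempts=3, delay_seconds=1.0):
    """Call func(); on failure, wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            print(f"Attempt {attempt} failed ({exc}), retrying...")
            time.sleep(delay_seconds)
```

Usage would look like `with_retry(lambda: download_report())` — the caller never has to "just run it again."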
<aside> 💡 Key Takeaway: A data pipeline isn't just code. It's code plus the environment that makes it run reliably.
</aside>
In Week 1, you created your Azure account. This week, you won't deploy anything yet, but you should start understanding where your code will eventually run.
Open the Azure Portal and browse your Resource groups. A resource group is like a project folder in the cloud: it groups related services together. Right now yours is empty, but by Week 5 you will deploy containers there, and by Week 6 you will create databases, storage accounts, and scheduled jobs.
<aside> ⌨️ Hands on: Log into portal.azure.com and find your resource group. Click into it and explore the "Overview" tab. Note the Region (where your resources will physically run) and the Subscription name. You will need both later. Also open Cost Management + Billing from the left sidebar and check that your current cost is €0.
</aside>
Understanding the portal now, while there is nothing to break, makes Weeks 5 and 6 much less overwhelming.
<aside> 🤓 Curious Geek: The Origin of "Pipeline"
The term "data pipeline" comes from Unix pipes (1973), where Doug McIlroy proposed connecting programs like garden hoses: the output of one becomes the input of the next.
The `|` symbol in your terminal (`cat file.csv | grep "error" | wc -l`) is a literal pipeline. Data engineering just scaled this idea to terabytes!
</aside>
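The same pipe idea can be mirrored with Python generators, where each stage lazily consumes the previous stage's output. A sketch under the assumption of a small in-memory log; the function names just mimic their Unix namesakes:

```python
# pipes.py - the shell pipeline `cat file.csv | grep "error" | wc -l`
# rebuilt with Python generators chained like Unix pipes.
def cat(lines):
    """Stands in for `cat`: emit every line."""
    yield from lines

def grep(pattern, lines):
    """Stands in for `grep`: keep only matching lines."""
    return (line for line in lines if pattern in line)

def wc_l(lines):
    """Stands in for `wc -l`: count the lines."""
    return sum(1 for _ in lines)

log = ["ok", "error: disk full", "ok", "error: timeout"]
print(wc_l(grep("error", cat(log))))  # → 2
```

Because generators are lazy, no stage builds a full intermediate list — the same property that lets Unix pipes (and data pipelines) handle streams far bigger than memory.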
<aside> ⌨️ Hands on: Look at the "works on my machine" script above. List 3 specific changes you would make to turn it into a professional pipeline. Think about config, error handling, and testing.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.