Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Assignment: Refactoring to a Clean Pipeline

Gotchas & Pitfalls

Lesson Plan

🔄 Introduction to Data Pipelines

1. From Script to System

In Week 1, you learned Python syntax. You wrote scripts that read a file, applied some logic, and printed a result. That's a great starting point, and exactly how most engineers begin.

But in Data Engineering, "working on my machine" is not enough.

You build systems that run unattended: on servers, on schedules, without you watching.

If your Week 1 script runs on a server and encounters a missing file, it crashes. If the data format changes slightly, it crashes. If the server runs out of memory, it crashes.

Week 2 is about professionalizing your Python.

You will stop writing "scripts" and start writing "pipelines."

What is a Data Pipeline?

At its simplest, a data pipeline is a process that moves data from A to B reliably.
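That "A to B" shape can be sketched as three small functions, often called extract, transform, and load. The file names and the CSV columns below are purely illustrative:

```python
import csv
from pathlib import Path


def extract(path: Path) -> list[dict]:
    """Read raw rows from a CSV file (the 'A' side)."""
    with path.open() as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[dict]:
    """Keep only valid rows and convert types."""
    return [
        {"city": r["city"], "temp": float(r["temp"])}
        for r in rows
        if r.get("temp")  # skip rows with a missing temperature
    ]


def load(rows: list[dict], out_path: Path) -> None:
    """Write the cleaned rows to a new CSV file (the 'B' side)."""
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["city", "temp"])
        writer.writeheader()
        writer.writerows(rows)
```

Each stage does one job, which is what makes the whole thing testable and reliable.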

Robustness > Complexity

A simple script that never fails is better than a complex system that breaks every night.

To achieve this reliability, you need to change your habits. Stop "hardcoding" and start "engineering."

The "works on my machine" trap

Imagine this script:

```python
# bad_script.py
data = open("/Users/lassebenninga/Downloads/data.csv").read()
print(process(data))
```

This script is a disaster waiting to happen.

  1. Hardcoded Path: It will fail on any other computer.
  2. No Error Handling: What if the file is missing?
  3. No Config: How do I change the input file without editing code?

This week, you will fix these bad habits one by one.
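To preview where this is going, here is one possible fixed version. The `DATA_PATH` variable name is illustrative, and `process` is a stand-in for whatever logic the original script had:

```python
# good_script.py: the same script with the three bad habits fixed
import os
import sys
from pathlib import Path


def process(data: str) -> int:
    """Stand-in for the real logic: count the lines in the file."""
    return data.count("\n")


def main() -> None:
    # 1. No hardcoded path: read it from an environment variable.
    path = Path(os.environ.get("DATA_PATH", "data.csv"))

    # 2. Error handling: a clear message instead of a raw traceback.
    if not path.exists():
        sys.exit(f"Input file not found: {path}")

    data = path.read_text()
    print(process(data))


if __name__ == "__main__":
    main()
```

Same behavior, but now it runs on any machine and fails with a message a human can act on.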

What Does "Professional" Code Look Like?

It's not about being clever. It's about being boring.

| Amateur Script | Professional Pipeline |
| --- | --- |
| **Fragile:** Breaks if a folder moves. | **Robust:** Configurable paths. |
| **Implicit:** Hidden logic in 100 lines. | **Clean:** Small functions with clear names. |
| **Unsafe:** Passwords in code. | **Secure:** Credentials in `.env`. |
| **Untestable:** Only "test" is running it. | **Verified:** Unit tests for logic. |
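The "Credentials in `.env`" row deserves a quick preview. In practice you would use the python-dotenv package, but the idea is simple enough to sketch by hand; the `DB_PASSWORD` key below is illustrative:

```python
# A minimal stand-in for python-dotenv's load_dotenv():
# read KEY=VALUE lines from a .env file into os.environ.
import os
from pathlib import Path


def load_env(path: str = ".env") -> None:
    p = Path(path)
    if not p.exists():
        return
    for line in p.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            # Don't overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())
```

Your code then reads `os.environ["DB_PASSWORD"]` instead of containing the password, and `.env` goes in `.gitignore` so it never reaches version control.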

Your Journey This Week

You will take a messy script and refactor it (restructure it without changing its behavior) step by step.

  1. Config: Move settings out of the code. (Chapter 2)
  2. Logic: Separate "reading data" from "processing data." (Chapter 3)
  3. Structure: Define what a "row" of data looks like. (Chapter 5)
  4. Testing: Verify it works without running the whole thing. (Chapter 7)
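Steps 2-4 fit in one small sketch: a dataclass defines the shape of a row, and a pure function holds the logic, with no file I/O inside it. The names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One row of data with an explicit shape (step 3: Structure)."""
    city: str
    temp_celsius: float


def is_plausible(m: Measurement) -> bool:
    """Pure logic, no I/O (step 2), so it is trivial to unit-test (step 4)."""
    return -60.0 <= m.temp_celsius <= 60.0
```

Because `is_plausible` touches no files or network, a pytest test is a one-liner: `assert is_plausible(Measurement("Oslo", -10.0))`.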

By Friday, you won't just have a script. You'll have a Data Engineering System.

The Engineer's Mindset

| Concept | The "Script" Mindset | The "System" Mindset |
| --- | --- | --- |
| Failure | "I'll just run it again." | "It should retry automatically." |
| Data | "I know what the CSV looks like." | "I validate the data before processing." |
| State | "I remember which files I processed." | "The system tracks what is done." |
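"It should retry automatically" can be just a few lines. Here is a hand-rolled sketch; in real pipelines you might reach for a library such as tenacity instead:

```python
import time


def with_retries(func, attempts: int = 3, delay_seconds: float = 1.0):
    """Call func(); on failure, wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(delay_seconds)
```

Wrapping a flaky network call as `with_retries(fetch_data)` turns "I'll just run it again" into something the system does for you.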

<aside> 💡 Key Takeaway: A data pipeline isn't just code. It's code plus the environment that makes it run reliably.

</aside>

Azure: Where Your Pipelines Will Run

In Week 1, you created your Azure account. This week, you won't deploy anything yet, but you should start understanding where your code will eventually run.

Open the Azure Portal and browse your Resource groups. A resource group is like a project folder in the cloud: it groups related services together. Right now yours is empty, but by Week 5 you will deploy containers there, and by Week 6 you will create databases, storage accounts, and scheduled jobs.

<aside> ⌨️ Hands on: Log into portal.azure.com and find your resource group. Click into it and explore the "Overview" tab. Note the Region (where your resources will physically run) and the Subscription name. You will need both later. Also open Cost Management + Billing from the left sidebar and check that your current cost is €0.

</aside>

Understanding the portal now, while there is nothing to break, makes Week 5-6 much less overwhelming.

<aside> 🤓 Curious Geek: The Origin of "Pipeline"

The term "data pipeline" comes from Unix pipes (1973), where Doug McIlroy proposed connecting programs like garden hoses: the output of one becomes the input of the next.

The `|` symbol in your terminal (`cat file.csv | grep "error" | wc -l`) is a literal pipeline. Data engineering just scaled this idea to terabytes!

</aside>

🧠 Knowledge Check

  1. Why might a script that works perfectly on your laptop fail when running on a server?
  2. What is the difference between "getting the right answer" and "engineering a robust solution"?

<aside> ⌨️ Hands on: Look at the "works on my machine" script above. List 3 specific changes you would make to turn it into a professional pipeline. Think about config, error handling, and testing.

</aside>

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.