Week 2 - Structuring Data Pipelines
Introduction to Data Pipelines
Configuration & Secrets (.env)
Separation of Concerns (I/O vs Logic)
Linting and Formatting with Ruff
Assignment: Refactoring to a Clean Pipeline
These gotchas are specific to the patterns you learned this week. Each one is a subtle trap that catches even experienced developers when working with modules, dataclasses, and composable functions.
You write a clean_data(rows) function that modifies dicts in-place. Your tests pass because the output looks correct.
Your "raw" data is silently corrupted. The original list of dicts was mutated, and any downstream function that depends on the raw data is now working with the cleaned version. The fix is the {**r, "key": value} spread pattern from Chapter 6 (Functional Composition).
```python
raw = [{"price": 100}, {"price": 200}]

# BAD ❌: mutates the original
for r in raw:
    r["price"] = r["price"] * 1.21
# raw is now [{'price': 121.00000000000001}, {'price': 242.00000000000003}]
# - the original data is gone!

# GOOD ✅: creates new dicts
with_vat = [{**r, "price": round(r["price"] * 1.21, 2)} for r in raw]
# raw is still [{"price": 100}, {"price": 200}] - safe!
```
<aside>
💡 The {**row, "key": value} pattern creates a new dict instead of mutating the original, which is what makes your transform functions composable. See Chapter 6 for the full pattern.
</aside>
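Because every step returns new dicts, steps can be chained freely without worrying about order-dependent corruption. A minimal sketch of that composability (the function names add_vat and tag_expensive, and the threshold, are illustrative, not from the chapter):

```python
def add_vat(rows, rate=0.21):
    """Return new dicts with VAT applied; the input rows are untouched."""
    return [{**r, "price": round(r["price"] * (1 + rate), 2)} for r in rows]

def tag_expensive(rows, threshold=150):
    """Return new dicts with an 'expensive' flag; the input rows are untouched."""
    return [{**r, "expensive": r["price"] > threshold} for r in rows]

raw = [{"price": 100}, {"price": 200}]
result = tag_expensive(add_vat(raw))

print(raw)     # [{'price': 100}, {'price': 200}] - still pristine
print(result)  # [{'price': 121.0, 'expensive': False}, {'price': 242.0, 'expensive': True}]
```

Either function can be reused, reordered, or tested on its own precisely because neither one touches its input.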
You add tags: list = [] to a dataclass, just like a normal function default.
Python catches this and raises a ValueError. The fix is to use field(default_factory=...):
```python
# BAD ❌: Python raises ValueError
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    source: str
    tags: list = []  # ValueError: mutable default!

# GOOD ✅: use field(default_factory=...)
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    source: str
    tags: list = field(default_factory=list)  # New list per instance
```
<aside> ⚠️ This is actually Python protecting you from a bug that also exists in regular functions (covered in Week 1). The difference is that dataclasses raise an error the moment the class is defined, while regular functions silently share the same default object across calls.
</aside>
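To see the regular-function version of the bug the aside mentions, here is a small demonstration (the function names add_tag and add_tag_safe are illustrative):

```python
def add_tag(tag, tags=[]):  # BAD: the [] is created ONCE, when def runs
    tags.append(tag)
    return tags

first = add_tag("etl")
second = add_tag("csv")  # same list object as the first call!
print(second)            # ['etl', 'csv'] - 'etl' leaked in from the first call

def add_tag_safe(tag, tags=None):  # GOOD: the conventional fix
    if tags is None:
        tags = []  # a fresh list on every call
    tags.append(tag)
    return tags

print(add_tag_safe("etl"))  # ['etl']
print(add_tag_safe("csv"))  # ['csv'] - no leakage
```

Dataclasses turn this silent sharing into a loud ValueError, which is why the error in the example above is a feature, not a nuisance.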
You split your code into logic.py and io_layer.py (good!). Then logic.py imports from io_layer.py, and io_layer.py imports from logic.py.
Python crashes with ImportError: cannot import name .... When two modules import each other, the second import receives a half-loaded module: module A starts loading and triggers B, but B then asks A for a name that A hasn't defined yet.
```python
# BAD ❌: circular import

# logic.py
from io_layer import save_csv  # io_layer hasn't finished loading!

# io_layer.py
from logic import clean_data  # logic hasn't finished loading!

# GOOD ✅: one-directional dependency

# logic.py: pure functions, imports NOTHING from io_layer
def clean_data(df):
    return df.dropna()

# io_layer.py: handles I/O, can import from logic
from logic import clean_data

# main.py: orchestrates both
from logic import clean_data
from io_layer import save_csv
```
The fix is exactly what Chapter 3 teaches: Separation of Concerns. Logic should never depend on I/O.
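If you want to watch the crash happen, here is a self-contained reproduction sketch: it writes two throwaway modules (reusing the names logic.py and io_layer.py from the example) into a temporary directory and imports one of them.

```python
import pathlib
import sys
import tempfile

# Put a throwaway directory on the import path.
tmp = pathlib.Path(tempfile.mkdtemp())
sys.path.insert(0, str(tmp))

# Each module's first line imports the other -> circular import.
(tmp / "logic.py").write_text("from io_layer import save_csv\n")
(tmp / "io_layer.py").write_text("from logic import clean_data\n")

try:
    import logic
    crashed = False
except ImportError as err:
    crashed = True
    print(err)  # "cannot import name ... from partially initialized module ..."
```

The error message's phrase "partially initialized module" is Python telling you exactly what went wrong: logic was still half-loaded when io_layer asked it for clean_data.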
You write one big function that reads the file, cleans the data, applies business rules, and saves the output. It works! Why split it up?
You cannot test any part of it without running the whole thing. You need actual files on disk, a database connection, and network access, just to check if your VAT calculation is correct. Chapter 3 (Separation of Concerns) shows the full refactoring pattern.
```python
import csv

# BAD ❌: untestable god function
def run_pipeline():
    with open("sales.csv") as f:
        rows = list(csv.DictReader(f))  # I/O mixed with...
    rows = [{**r, "price": float(r["price"]) * 1.21} for r in rows]  # ...logic
    with open("output.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writerows(rows)  # How do you test the VAT calc?

# GOOD ✅: extract the logic into a pure function (see Chapter 3)
def apply_vat(rows: list[dict], rate: float) -> list[dict]:
    return [{**r, "price": round(float(r["price"]) * (1 + rate), 2)} for r in rows]
```
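The payoff of the extraction: the pure function can now be checked with plain lists, with no files, database, or network in sight. (The apply_vat definition is repeated here so the snippet runs on its own; the "sku" field is illustrative.)

```python
def apply_vat(rows, rate):
    return [{**r, "price": round(float(r["price"]) * (1 + rate), 2)} for r in rows]

rows = [{"sku": "A1", "price": "100"}]
result = apply_vat(rows, 0.21)

assert result == [{"sku": "A1", "price": 121.0}]
assert rows == [{"sku": "A1", "price": "100"}]  # input untouched
print("VAT logic verified - without touching the disk")
```

This is exactly the kind of three-line check you cannot write against run_pipeline, because run_pipeline insists on a real sales.csv existing first.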
You create a file named email.py, json.py, or code.py to test some code.
Python treats your local file as a module. When you (or another library) try to import the real standard library module, Python imports your file instead. This leads to confusing AttributeErrors because your file doesn't have the functions the library expects.
```python
# file structure:
# ├── email.py   <-- YOU CREATED THIS
# └── main.py

# main.py
import email  # Imports YOUR file, not the standard library!
from email.message import EmailMessage  # ModuleNotFoundError: 'email' is not a package!
```
<aside>
⚠️ Rule: Never name your files the same as standard Python modules. Common forbidden names: email.py, json.py, csv.py, random.py, math.py, test.py.
</aside>
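When imports behave strangely, a quick diagnostic is to print the module's __file__ attribute, which reveals exactly which file Python loaded (the example paths in the comments are illustrative):

```python
import csv

# Which file did Python actually import?
print(csv.__file__)
# .../lib/python3.x/csv.py  -> the standard library: all good
# .../my_project/csv.py     -> your own file is shadowing it: rename it!
```

Because the directory of the script you run sits at the front of Python's module search path, your local file always wins, which is why renaming is the only reliable fix.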
You define a variable inside a class but outside __init__, thinking it belongs to the object.
Variables defined outside __init__ are Class Attributes. They are shared by all instances of that class. If one pipeline modifies it, every other pipeline sees the change!
```python
# BAD ❌: Shared state
class Pipeline:
    errors = []  # CLASS attribute (shared by all)

    def add_error(self, err):
        self.errors.append(err)

p1 = Pipeline()
p1.add_error("Fail 1")
p2 = Pipeline()
print(p2.errors)  # ['Fail 1'] - WAIT WHAT? p2 shouldn't have p1's errors!

# GOOD ✅: Instance state
class Pipeline:
    def __init__(self):
        self.errors = []  # INSTANCE attribute (unique to this object)

    def add_error(self, err):
        self.errors.append(err)

p1 = Pipeline()
p1.add_error("Fail 1")
p2 = Pipeline()
print(p2.errors)  # [] - Safe!
```
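You can expose the sharing directly with the is operator, which compares object identity rather than contents (the class names Shared and PerInstance are illustrative):

```python
class Shared:
    errors = []  # one list, attached to the class itself

class PerInstance:
    def __init__(self):
        self.errors = []  # a fresh list per object

a, b = Shared(), Shared()
print(a.errors is b.errors)  # True  - the SAME list object: a bug waiting to happen

c, d = PerInstance(), PerInstance()
print(c.errors is d.errors)  # False - independent lists, as intended
```

self.errors.append(err) never triggers the bug alarm because it only reads the attribute; Python happily finds the class attribute when no instance attribute exists, then mutates that shared list in place.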
How does the {**row, "price": new_price} pattern prevent mutations in your pipeline?
Why does Python raise an error for tags: list = [] in a dataclass, but silently allow it in a regular function?
You have logic.py and io_layer.py that both need to call each other. How do you break the circular import?
Why is naming a file csv.py a bad idea?
What happens if you define results = [] directly under class Pipeline: instead of inside __init__?
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0
