Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Assignment: Refactoring to a Clean Pipeline

Gotchas & Pitfalls

Lesson Plan

⚠️ Gotchas & Pitfalls

These gotchas are specific to the patterns you learned this week. Each one is a subtle trap that catches even experienced developers when working with modules, dataclasses, and composable functions.

1. The mutation trap (dict transforms)

The Misconception

You write a clean_data(rows) function that modifies dicts in-place. Your tests pass because the output looks correct.

The Reality

Your "raw" data is silently corrupted. The original list of dicts was mutated, and any downstream function that depends on the raw data is now working with the cleaned version. The fix is the {**r, "key": value} spread pattern from Chapter 6 (Functional Composition).

Example

```python
raw = [{"price": 100}, {"price": 200}]

# BAD ❌: mutates the original
for r in raw:
    r["price"] = r["price"] * 1.21
# raw now holds the VAT-adjusted prices - the original data is gone!

# GOOD ✅: creates new dicts
with_vat = [{**r, "price": round(r["price"] * 1.21, 2)} for r in raw]
# raw is still [{"price": 100}, {"price": 200}] - safe!
```

<aside> 💡 The {**row, "key": value} pattern creates a new dict instead of mutating the original, which is what makes your transform functions composable. See Chapter 6 for the full pattern.

</aside>
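Because every transform returns new dicts, transforms can be chained without one step corrupting the input of another. A minimal sketch (the transform names here are illustrative, not from the assignment):

```python
# Each step returns new dicts, so the raw data is never touched.
def add_vat(rows):
    return [{**r, "price": round(r["price"] * 1.21, 2)} for r in rows]

def flag_expensive(rows):
    return [{**r, "expensive": r["price"] > 150} for r in rows]

raw = [{"price": 100}, {"price": 200}]
result = flag_expensive(add_vat(raw))

print(raw)     # unchanged: [{'price': 100}, {'price': 200}]
print(result)  # [{'price': 121.0, 'expensive': False}, {'price': 242.0, 'expensive': True}]
```

If `add_vat` mutated its input instead, reordering or re-running the steps would silently change the result.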


2. The mutable default in dataclasses

The Misconception

You add tags: list = [] to a dataclass, just like a normal function default.

The Reality

Python catches this and raises a ValueError. The fix is to use field(default_factory=...):

```python
# BAD ❌: Python raises ValueError
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    source: str
    tags: list = []  # ValueError: mutable default!
```

```python
# GOOD ✅: use field(default_factory=...)
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    source: str
    tags: list = field(default_factory=list)  # New list per instance
```
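A quick check that `default_factory` really gives every instance its own list (the file names passed to `source` are just placeholders):

```python
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    source: str
    tags: list = field(default_factory=list)

a = PipelineConfig(source="sales.csv")
b = PipelineConfig(source="orders.csv")

a.tags.append("priority")
print(a.tags)  # ['priority']
print(b.tags)  # [] - b got its own fresh list
```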

<aside> ⚠️ This is actually Python protecting you from a bug that also exists in regular functions (covered in Week 1). The difference is that dataclasses raise an error the moment the class is defined, while regular functions silently share the same default object across calls.

</aside>
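Here is that same bug in a regular function, where Python does not stop you (a minimal sketch; `add_tag` is a hypothetical helper):

```python
# The same bug in a regular function: Python does NOT raise an error here.
def add_tag(tag, tags=[]):          # the default list is created ONCE
    tags.append(tag)
    return tags

print(add_tag("a"))  # ['a']
print(add_tag("b"))  # ['a', 'b'] - the default list is shared across calls!

# The standard fix mirrors default_factory:
def add_tag_safe(tag, tags=None):
    if tags is None:
        tags = []                   # new list on every call
    tags.append(tag)
    return tags
```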


3. The circular import

The Misconception

You split your code into logic.py and io_layer.py (good!). Then logic.py imports from io_layer.py, and io_layer.py imports from logic.py.

The Reality

Python crashes with ImportError: cannot import name .... When two modules import each other, Python gets stuck in a loop: module A needs B to finish loading, but B needs A to finish loading.

Example

```python
# BAD ❌: circular import
# logic.py
from io_layer import save_csv  # io_layer hasn't finished loading!

# io_layer.py
from logic import clean_data   # logic hasn't finished loading!
```

```python
# GOOD ✅: one-directional dependency
# logic.py: pure functions, imports NOTHING from io_layer
def clean_data(df):
    return df.dropna()

# io_layer.py: handles I/O, can import from logic
from logic import clean_data

# main.py: orchestrates both
from logic import clean_data
from io_layer import save_csv
```

The fix is exactly what Chapter 3 teaches: Separation of Concerns. Logic should never depend on I/O.


4. The "god function"

The Misconception

You write one big function that reads the file, cleans the data, applies business rules, and saves the output. It works! Why split it up?

The Reality

You cannot test any part of it without running the whole thing. You need actual files on disk (and, in real pipelines, database connections or network access) just to check whether your VAT calculation is correct. Chapter 3 (Separation of Concerns) shows the full refactoring pattern.

Example

```python
import csv

# BAD ❌: untestable god function
def run_pipeline():
    with open("sales.csv") as f:
        rows = list(csv.DictReader(f))           # I/O mixed with...
    rows = [{**r, "price": float(r["price"]) * 1.21} for r in rows]  # ...logic
    with open("output.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writerows(rows)                    # How do you test the VAT calc?

# GOOD ✅: extract the logic into a pure function (see Chapter 3)
def apply_vat(rows: list[dict], rate: float) -> list[dict]:
    return [{**r, "price": round(float(r["price"]) * (1 + rate), 2)} for r in rows]
```
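The payoff: the pure function can now be tested with pytest using plain in-memory data, no files required. A minimal sketch (`apply_vat` is repeated here so the snippet runs on its own; in the project you would `from logic import apply_vat`):

```python
# test_logic.py - no files, no I/O, just data in and data out
def apply_vat(rows: list[dict], rate: float) -> list[dict]:
    return [{**r, "price": round(float(r["price"]) * (1 + rate), 2)} for r in rows]

def test_apply_vat_adds_21_percent():
    rows = [{"price": "100"}, {"price": "200"}]
    result = apply_vat(rows, 0.21)
    assert result == [{"price": 121.0}, {"price": 242.0}]
    assert rows == [{"price": "100"}, {"price": "200"}]  # input untouched
```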

5. The shadowed standard library

The Misconception

You create a file named email.py, json.py, or code.py to test some code.

The Reality

Python treats your local file as a module. When you (or another library) try to import the real standard library module, Python imports your file instead. This leads to confusing AttributeErrors because your file doesn't have the functions the library expects.

Example

```python
# file structure:
# ├── email.py  <-- YOU CREATED THIS
# └── main.py

# main.py
import email  # Imports YOUR file, not the standard library!
from email.message import EmailMessage  # ModuleNotFoundError: 'email' is not a package
```

<aside> ⚠️ Rule: Never name your files the same as standard Python modules. Common forbidden names: email.py, json.py, csv.py, random.py, math.py, test.py.

</aside>
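When imports behave strangely, checking a module's `__file__` attribute reveals which file Python actually loaded:

```python
import email

# Prints the path of the module Python imported. If this points into
# YOUR project folder instead of the Python installation, a local file
# is shadowing the standard library module.
print(email.__file__)
```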


6. The phantom state (class attributes)

The Misconception

You define a variable inside a class but outside __init__, thinking it belongs to the object.

The Reality

Variables defined outside __init__ are Class Attributes. They are shared by all instances of that class. If one pipeline modifies it, every other pipeline sees the change!

Example

```python
# BAD ❌: Shared state
class Pipeline:
    errors = []  # CLASS attribute (shared by all)

    def add_error(self, err):
        self.errors.append(err)

p1 = Pipeline()
p1.add_error("Fail 1")

p2 = Pipeline()
print(p2.errors)  # ['Fail 1'] - WAIT WHAT? p2 shouldn't have p1's errors!
```

```python
# GOOD ✅: Instance state
class Pipeline:
    def __init__(self):
        self.errors = []  # INSTANCE attribute (unique to this object)

    def add_error(self, err):
        self.errors.append(err)

p1 = Pipeline()
p1.add_error("Fail 1")

p2 = Pipeline()
print(p2.errors)  # [] - Safe!
```

🧠 Knowledge Check

  1. Why does {**row, "price": new_price} prevent mutations in your pipeline?
  2. Why does Python reject tags: list = [] in a dataclass, but silently allow it in a regular function?
  3. You have logic.py and io_layer.py that both need to call each other. How do you break the circular import?
  4. A colleague writes one big function that reads a CSV, cleans it, and saves it. Why is this hard to test?
  5. Why is naming your file csv.py a bad idea?
  6. What happens if you define results = [] directly under class Pipeline: instead of inside __init__?

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.