Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Gotchas & Pitfalls

Assignment: Refactoring to a Clean Pipeline

โš ๏ธ Gotchas & Pitfalls

These gotchas are specific to the patterns you learned this week. Each one is a subtle trap that catches even experienced developers when working with modules, dataclasses, and composable functions.

<aside> 💡 Using AI to help: Paste a function or module (⚠️ no real data, no PII) into an LLM and ask "does this code hit any of the six gotchas in this chapter?". LLMs are good at pattern-matching these. Verify each flagged spot against the relevant section below: the LLM is the senior engineer skimming your code; you're still doing the fix.

</aside>

1. The mutation trap (dict transforms)

The Misconception

You write a clean_data(rows) function that modifies dicts in place. Your tests pass because the output looks correct.

The Reality

Your "raw" data is silently corrupted. The original list of dicts was mutated, and any downstream function that depends on the raw data is now working with the cleaned version. The fix is the {**r, "key": value} spread pattern from Functional Composition.

Example

raw = [{"price": 100}, {"price": 200}]

# BAD โŒ: mutates the original
for r in raw:
    r["price"] = r["price"] * 1.21
# raw is now [{"price": 121.0}, {"price": 242.0}] - original data is gone!

# GOOD ✅: creates new dicts
with_vat = [{**r, "price": round(r["price"] * 1.21, 2)} for r in raw]
# raw is still [{"price": 100}, {"price": 200}] - safe!

<aside> 💡 The {**row, "key": value} pattern creates a new dict instead of mutating the original, which is what makes your transform functions composable. See Functional Composition for the full pattern.

</aside>
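Because every transform returns new dicts, transforms chain without stepping on each other. A minimal sketch (the with_discount and with_currency names here are illustrative, not part of the exercise):

```python
def with_discount(rows, pct):
    # Returns new dicts via the {**r, ...} spread; the input list is untouched
    return [{**r, "price": round(r["price"] * (1 - pct), 2)} for r in rows]

def with_currency(rows, currency):
    # Adds a field without mutating the originals
    return [{**r, "currency": currency} for r in rows]

raw = [{"price": 100}, {"price": 200}]
result = with_currency(with_discount(raw, 0.10), "EUR")

print(result)  # [{'price': 90.0, 'currency': 'EUR'}, {'price': 180.0, 'currency': 'EUR'}]
print(raw)     # [{'price': 100}, {'price': 200}] - still intact
```

Each step can be tested, reordered, or removed independently, precisely because none of them touches the input list.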

Time to write your own non-mutating transform.

<aside> โŒจ๏ธ Hands on: Implement apply_vat(rows, rate) so it returns a new list without mutating the original. The widget's third assertion specifically checks that the input list is unchanged after your function runs.

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=2&chapter=gotchas&exercise=w2_gotchas__no_mutation&lang=python

</aside>


2. The mutable default in dataclasses

The Misconception

You add tags: list = [] to a dataclass, just like a normal function default.

The Reality

Python catches this and raises a ValueError. That ValueError IS the gotcha: the snippet below fails on purpose to demonstrate it. The fix is to use field(default_factory=...):

<!-- runner:expect-fail -->

# BAD โŒ: Python raises ValueError
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    source: str
    tags: list = []  # ValueError: mutable default!

# GOOD ✅: use field(default_factory=...)
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    source: str
    tags: list = field(default_factory=list)  # New list per instance

<aside> โš ๏ธ This is actually Python protecting you from a bug that also exists in regular functions (covered in Week 1). The difference is that dataclasses refuse to compile with mutable defaults, while regular functions silently share the same object across calls.

</aside>
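For comparison, here is the silent version of the same bug in a regular function, the Week 1 pitfall the aside refers to (log_error is a made-up name for illustration):

```python
def log_error(err, errors=[]):  # BAD ❌: ONE shared list across all calls
    errors.append(err)
    return errors

print(log_error("a"))  # ['a']
print(log_error("b"))  # ['a', 'b'] - the default list remembered 'a'!

def log_error_safe(err, errors=None):  # GOOD ✅: fresh list per call
    if errors is None:
        errors = []
    errors.append(err)
    return errors

print(log_error_safe("a"))  # ['a']
print(log_error_safe("b"))  # ['b']
```

The `errors=None` + `if errors is None` idiom is exactly what field(default_factory=list) automates for dataclasses.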


3. The circular import

The Misconception

You split your code into logic.py and io_layer.py (good!). Then logic.py imports from io_layer.py, and io_layer.py imports from logic.py.

The Reality

Python crashes with ImportError: cannot import name .... When two modules import each other, Python gets stuck in a loop: module A needs B to finish loading, but B needs A to finish loading.

Example

The two snippets below show file layout, not standalone scripts. They reference modules (logic.py, io_layer.py, main.py) that exist in your project; running them in isolation will raise ModuleNotFoundError.

<!-- runner:expect-fail -->

# BAD โŒ: circular import
# logic.py
from io_layer import save_csv  # io_layer hasn't finished loading!

# io_layer.py
from logic import clean_data   # logic hasn't finished loading!

<!-- runner:expect-fail -->

# GOOD ✅: one-directional dependency
# logic.py: pure functions, imports NOTHING from io_layer
def clean_data(df):
    return df.dropna()

# io_layer.py: handles I/O, can import from logic
from logic import clean_data

# main.py: orchestrates both
from logic import clean_data
from io_layer import save_csv

The fix is exactly what Separation of Concerns teaches: logic should never depend on I/O. Make the pure-function layer free of I/O imports, and the cycle goes away.


4. The "god function"

The Misconception

You write one big function that reads the file, cleans the data, applies business rules, and saves the output. It works! Why split it up?

The Reality

You cannot test any part of it without running the whole thing. You need actual files on disk, a database connection, and network access, just to check if your VAT calculation is correct. Separation of Concerns shows the full refactoring pattern.

Example

# BAD โŒ: untestable god function
def run_pipeline():
    with open("sales.csv") as f:
        rows = list(csv.DictReader(f))           # I/O mixed with...
    rows = [{**r, "price": float(r["price"]) * 1.21} for r in rows]  # ...logic
    with open("output.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writerows(rows)                    # How do you test the VAT calc?

# GOOD ✅: extract the logic into a pure function (see Separation of Concerns)
def apply_vat(rows: list[dict], rate: float) -> list[dict]:
    return [{**r, "price": round(float(r["price"]) * (1 + rate), 2)} for r in rows]
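Once the logic is a pure function, you can test it without touching the disk at all. A minimal pytest sketch (the test file name and layout are an assumption; apply_vat is repeated inline so the example is self-contained, but in your project it would be imported from logic.py):

```python
# test_logic.py - run with: pytest test_logic.py
def apply_vat(rows: list[dict], rate: float) -> list[dict]:
    # Same pure function as above; in a real project: from logic import apply_vat
    return [{**r, "price": round(float(r["price"]) * (1 + rate), 2)} for r in rows]

def test_apply_vat_adds_vat():
    # No files, no database, no network - just data in, data out
    assert apply_vat([{"price": "100"}], 0.21) == [{"price": 121.0}]

def test_apply_vat_does_not_mutate_input():
    rows = [{"price": "100"}]
    apply_vat(rows, 0.21)
    assert rows == [{"price": "100"}]  # original untouched
```

Compare this with run_pipeline() above: there is no way to check the VAT calculation without creating a real sales.csv first.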

5. The shadowed standard library

The Misconception

You create a file named email.py, json.py, or code.py to test some code.

The Reality

Python treats your local file as a module. When you (or another library) try to import the real standard library module, Python imports your file instead. This leads to confusing ImportError and AttributeError messages, because your file doesn't have the contents the library expects.

Example

# file structure:
# โ”œโ”€โ”€ email.py  <-- YOU CREATED THIS
# โ””โ”€โ”€ main.py

# main.py
import email  # Imports YOUR file, not the standard library!
from email.message import EmailMessage  # ModuleNotFoundError: 'email' is not a package!

<aside> โš ๏ธ Rule: Never name your files the same as standard Python modules. Common forbidden names: email.py, json.py, csv.py, random.py, math.py, test.py.

</aside>
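A quick way to diagnose shadowing is the module's __file__ attribute, which shows the path Python actually loaded:

```python
import json  # works here because nothing in this folder shadows it

# If this prints a path inside YOUR project instead of the standard library's
# install directory, a local json.py is shadowing the real module.
print(json.__file__)
```

The same check works for any suspect module: if the path surprises you, rename your file.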


6. The phantom state (class attributes)

The Misconception

You define a variable inside a class but outside __init__, thinking it belongs to the object.

The Reality

Variables defined outside __init__ are class attributes. They are shared by all instances of that class: if one pipeline instance modifies the list, every other instance sees the change!

Example

# BAD โŒ: Shared state
class Pipeline:
    errors = []  # CLASS attribute (shared by all)

    def add_error(self, err):
        self.errors.append(err)

p1 = Pipeline()
p1.add_error("Fail 1")

p2 = Pipeline()
print(p2.errors)  # ['Fail 1'] - WAIT WHAT? p2 shouldn't have p1's errors!

# GOOD ✅: Instance state
class Pipeline:
    def __init__(self):
        self.errors = []  # INSTANCE attribute (unique to this object)

    def add_error(self, err):
        self.errors.append(err)

p1 = Pipeline()
p1.add_error("Fail 1")

p2 = Pipeline()
print(p2.errors)  # [] - Safe!
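If your pipeline state fits in a dataclass, the field(default_factory=...) fix from gotcha 2 solves this same problem: the generated __init__ builds a fresh list per instance. A sketch combining the two fixes:

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    # default_factory runs once PER INSTANCE, so no sharing (see gotcha 2)
    errors: list = field(default_factory=list)

    def add_error(self, err):
        self.errors.append(err)

p1 = Pipeline()
p1.add_error("Fail 1")

p2 = Pipeline()
print(p2.errors)  # [] - each instance has its own list
```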

🧠 Knowledge Check

Extra reading


Next up: Assignment, where you apply every Week 2 pattern (config, dataclasses, separation of concerns, composable transforms, pytest) to refactor a single messy script into a clean, tested pipeline.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 (https://hackyourfuture.net/)


Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.