Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Assignment: Refactoring to a Clean Pipeline

Gotchas & Pitfalls

Lesson Plan

🗓️ Lesson Plan

Theme: From Script to System

Welcome to Week 2! Students have read the material and written basic Python in Week 1. Today we shift from "writing scripts" to "building systems." The goal is consolidation: review the concepts through live coding, connect them to the assignment, and make sure students understand why structure matters before they build it themselves.

Goals

By the end of this lesson, students should be able to:

  1. Explain why mixing I/O and transform logic makes a script hard to test
  2. Move hard-coded paths and values out of code into a .env file loaded by a config module
  3. Write pure, composable transform functions that return new data instead of mutating their input
  4. Validate records with a dataclass and __post_init__
  5. Decide when a class is warranted versus a plain function

Schedule

| Time | Activity | Duration |
|------|----------|----------|
| 0:00 | Welcome & Kahoot Quiz | 15 min |
| 0:15 | Live Demo: The Disaster Script | 15 min |
| 0:30 | Workshop 1: Config + Separation of Concerns | 25 min |
| 0:55 | Break | 10 min |
| 1:05 | Workshop 2: Composable Transforms + Dataclasses | 25 min |
| 1:30 | Live Demo: Testing Your Pipeline | 10 min |
| 1:40 | Assignment Launch: Connecting the Dots | 15 min |
| 1:55 | Q&A & Wrap Up | 5 min |
| 2:00 | End | - |

Total: 2 hours


Kahoot Quiz (15 min)

Goal: Check understanding of the Week 2 reading material before diving in.

Topics to include

  1. Pipelines: What does ETL stand for? Which step reads data from a source?
  2. Config: What happens if you commit a .env file to GitHub?
  3. Separation of Concerns: Should a transform function open a file? (No!)
  4. Dataclasses: What does __post_init__ do?
  5. Pure Functions: What does a pure function never do? (Mutate input / print / read files)
  6. Composition: What does df.copy() prevent?
  7. Testing: What command runs all pytest tests in a folder?

Live Demo: The Disaster Script (15 min)

Start the class with this script on screen. Ask: "What is wrong with this?"

# disastrous_pipeline.py
import csv

data = []
with open("/Users/steve/Downloads/sales.csv") as f:
    for row in csv.DictReader(f):
        data.append(row)

for row in data:
    row["product_name"] = row["product_name"].strip().title()
    row["price"] = float(row["price"])
    row["revenue"] = row["price"] * int(row["quantity"])
    row["vat"] = row["revenue"] * 0.21

total = sum(r["revenue"] for r in data)
print(f"Total: {total}")

with open("clean.csv", "w") as f:
    w = csv.DictWriter(f, fieldnames=data[0].keys())
    w.writeheader()
    w.writerows(data)

Teaching Points (do these live)

  1. Change the path and run it. It crashes. Ask: "How do we fix this without editing the code?" (Answer: .env)
  2. Ask: "What is 0.21?" Nobody knows. (Answer: Magic number, should be a named constant or config value)
  3. Ask: "How do I test the VAT calculation without a CSV file?" (Answer: You can't, I/O and logic are mixed)
  4. Ask: "If I print data[0] before and after the loop, is it the same?" Run it. It's mutated. (Answer: Side effects)
  5. Ask: "What if price is the string 'ten'?" It crashes with no useful message. (Answer: No validation)

Count the problems together on the whiteboard. This becomes the motivation for every chapter.
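Teaching point 4 (mutation) can be shown with a minimal standalone snippet, separate from the disaster script, so students see the mechanism in isolation (the row data here is made up for the demo):

```python
# Demo: the transform loop changes the original rows in place.
rows = [{"product_name": "  widget  ", "price": "9.99"}]
before = rows[0]  # keep a reference to the "original" row

for row in rows:
    row["product_name"] = row["product_name"].strip().title()
    row["price"] = float(row["price"])

# `before` and rows[0] are the SAME dict object, so the original is gone too.
print(before is rows[0])  # True
print(before)             # {'product_name': 'Widget', 'price': 9.99}
```

Because dicts are mutable and the loop assigns into them, there is no "before" copy left anywhere, which is exactly the side-effect problem the workshops fix.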


Workshop 1: Config + Separation of Concerns (25 min)

Goal: Students refactor the disaster script into config + pure functions. This covers Chapters 2 and 3.

Part A: Extract Config (10 min)

Task: Create a .env file and config.py module.

Instructions for students:

  1. Create .env with INPUT_PATH=data/sales.csv and OUTPUT_PATH=output/clean.csv
  2. Create config.py that loads both variables and raises ValueError if missing
  3. Update the script to use from config import INPUT_PATH, OUTPUT_PATH

Success Criteria: The script runs. Change only .env to point to a different file, and it still works.
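A minimal config.py could look like this, assuming the python-dotenv package is installed and the variable names from Part A:

```python
# config.py - load pipeline paths from .env and fail fast if any are missing.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into os.environ


def _require(name: str) -> str:
    """Return the env variable's value, or raise a clear error if unset."""
    value = os.getenv(name)
    if not value:
        raise ValueError(f"Missing required config variable: {name}")
    return value


INPUT_PATH = _require("INPUT_PATH")
OUTPUT_PATH = _require("OUTPUT_PATH")
```

Importing this module anywhere in the pipeline now either succeeds with valid paths or crashes immediately with a message naming the missing variable.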

Part B: Extract Pure Functions (15 min)

Task: Pull the transform logic out of the loop into composable functions.

Instructions for students:

  1. Create a clean_names(rows) function that strips and title-cases product_name, returns a new list
  2. Create a calculate_revenue(rows) function that adds revenue and vat fields, returns a new list
  3. Chain them: data = clean_names(raw) then data = calculate_revenue(data)
  4. Verify: print raw[0] after the chain. It should be unchanged (no mutation)

Key Moment: Have students print the original data before and after. If they forgot {**row, "key": val}, the original is mutated. This is the "aha" moment for immutability.

Discussion: "Why is it better to have 3 small functions than 1 big loop?" (Testable, reorderable, reusable)
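One possible shape for the two functions (a sketch; the field names follow the disaster script, and the sample row is invented for the demo):

```python
# Pure transforms: each returns a NEW list of NEW dicts, never mutating input.
VAT_RATE = 0.21  # named constant instead of the magic number


def clean_names(rows):
    """Strip and title-case product_name, returning new dicts."""
    return [{**row, "product_name": row["product_name"].strip().title()}
            for row in rows]


def calculate_revenue(rows):
    """Convert price, then add revenue and vat fields, returning new dicts."""
    out = []
    for row in rows:
        price = float(row["price"])
        revenue = price * int(row["quantity"])
        out.append({**row, "price": price,
                    "revenue": revenue, "vat": revenue * VAT_RATE})
    return out


raw = [{"product_name": "  widget  ", "price": "10.0", "quantity": "3"}]
data = calculate_revenue(clean_names(raw))
print(raw[0])   # unchanged original row: the name is still "  widget  "
print(data[0])  # cleaned name, plus 'revenue' and 'vat' fields
```

The `{**row, ...}` spread is what keeps these pure: it builds a fresh dict instead of writing into the one that was passed in.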


Workshop 2: Composable Transforms + Dataclasses (25 min)

Goal: Students add validation with dataclasses and see OOP vs functional in action. This covers Chapters 4, 5, and 6.

Part A: When to Use a Class (5 min)

Quick Discussion: Show two code snippets side by side:

# A: Function (no state)
def apply_vat(price, rate=0.21):
    return round(price * (1 + rate), 2)

# B: Class (has state)
class DatabaseConnector:
    def __init__(self, host, port):
        self.host = host
        self.connected = False
    def connect(self):
        self.connected = True

Ask: "Which one needs to be a class? Why?" (B needs state; A is just input to output, so a function is simpler.)

Rule of Thumb: Data & Config = Classes. Logic & Transforms = Functions.

Part B: Add a Dataclass (10 min)

Task: Create a Transaction dataclass with __post_init__ validation.

Instructions for students:

  1. Define Transaction with fields: product_name: str, price: float, quantity: int, revenue: float
  2. Add __post_init__: raise ValueError if price < 0 or product_name is empty
  3. At the end of the pipeline, convert cleaned dicts into Transaction instances
  4. Try creating Transaction(product_name="", price=-5, quantity=1, revenue=0): it should crash immediately

Key Moment: "Would you rather crash here with a clear ValueError, or discover bad data in your CEO's revenue report next Monday?"
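A sketch of the dataclass from the instructions above (the error messages are illustrative, not prescribed):

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    product_name: str
    price: float
    quantity: int
    revenue: float

    def __post_init__(self):
        # Fail fast: bad data should crash here, not in a downstream report.
        if not self.product_name:
            raise ValueError("product_name must not be empty")
        if self.price < 0:
            raise ValueError(f"price must be >= 0, got {self.price}")


ok = Transaction(product_name="Widget", price=9.99, quantity=2, revenue=19.98)

try:
    Transaction(product_name="", price=-5, quantity=1, revenue=0)
except ValueError as e:
    print(f"Rejected: {e}")  # crashes immediately, as intended
```

Note that __post_init__ runs automatically after the generated __init__, so every construction path goes through the same checks.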

Part C: Build the Full Chain (10 min)

Task: Wire up 3+ composable functions into a pipeline.

Instructions for students:

  1. Add a remove_invalid(rows) function (drops rows with empty names or negative prices)
  2. Chain all functions together: