Week 2 - Structuring Data Pipelines
Introduction to Data Pipelines
Configuration & Secrets (.env)
Separation of Concerns (I/O vs Logic)
Linting and Formatting with Ruff
Assignment: Refactoring to a Clean Pipeline
Welcome to Week 2! Students have read the material and written basic Python in Week 1. Today we shift from "writing scripts" to "building systems." The goal is consolidation: review the concepts through live coding, connect them to the assignment, and make sure students understand why structure matters before they build it themselves.
By the end of this lesson, students should be able to:
- Refactor a script into .env configuration and pure transform functions

| Time | Activity | Duration |
|---|---|---|
| 0:00 | Welcome & Kahoot Quiz | 15 min |
| 0:15 | Live Demo: The Disaster Script | 15 min |
| 0:30 | Workshop 1: Config + Separation of Concerns | 25 min |
| 0:55 | Break | 10 min |
| 1:05 | Workshop 2: Composable Transforms + Dataclasses | 25 min |
| 1:30 | Live Demo: Testing Your Pipeline | 10 min |
| 1:40 | Assignment Launch: Connecting the Dots | 15 min |
| 1:55 | Q&A & Wrap Up | 5 min |
| 2:00 | End | - |
Total: 2 hours
Goal: Check understanding of the Week 2 reading material before diving in.
- Why should you never commit a .env file to GitHub?
- What does __post_init__ do?
- What does df.copy() prevent?

Start the class with this script on screen. Ask: "What is wrong with this?"
# disastrous_pipeline.py
import csv

data = []
with open("/Users/steve/Downloads/sales.csv") as f:
    for row in csv.DictReader(f):
        data.append(row)

for row in data:
    row["product_name"] = row["product_name"].strip().title()
    row["price"] = float(row["price"])
    row["revenue"] = row["price"] * int(row["quantity"])
    row["vat"] = row["revenue"] * 0.21

total = sum(r["revenue"] for r in data)
print(f"Total: {total}")

with open("clean.csv", "w") as f:
    w = csv.DictWriter(f, fieldnames=data[0].keys())
    w.writeheader()
    w.writerows(data)
- [...] .env)
- "If I print data[0] before and after the loop, is it the same?" Run it. It's mutated. (Answer: Side effects)
- "What if price is the string 'ten'?" It crashes with a ValueError that does not say which row is at fault. (Answer: No validation)

Count the problems together on the whiteboard. This becomes the motivation for every chapter.
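The mutation and the crash on a non-numeric price can both be reproduced without the CSV file. The sketch below is a minimal illustration, assuming a made-up in-memory row in place of sales.csv; the sample values and the names `before` and `error_message` are not part of the original script.

```python
import copy

# Hypothetical sample row standing in for one line of sales.csv.
data = [{"product_name": "  blue widget ", "price": "10.0", "quantity": "3"}]

# Snapshot the first row before the loop runs.
before = copy.deepcopy(data[0])

# Same in-place pattern as the disaster script.
for row in data:
    row["product_name"] = row["product_name"].strip().title()
    row["price"] = float(row["price"])
    row["revenue"] = row["price"] * int(row["quantity"])

print(before == data[0])  # False: the loop changed the original row

# A non-numeric price raises ValueError; the message does not name the row.
try:
    float("ten")
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # could not convert string to float: 'ten'
```

Running this live gives students concrete evidence for two of the whiteboard items before any refactoring starts.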
Goal: Students refactor the disaster script into config + pure functions. This covers Chapters 2 and 3.
Task: Create a .env file and config.py module.
Instructions for students:
- Create .env with INPUT_PATH=data/sales.csv and OUTPUT_PATH=output/clean.csv
- Create config.py that loads both variables and raises ValueError if missing
- In the script, replace the hardcoded paths with from config import INPUT_PATH, OUTPUT_PATH

Success Criteria: The script runs. Change only .env to point to a different file, and it still works.
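One way the core of config.py might look is sketched below. It is a stdlib-only illustration: it assumes the variables are already present in the process environment (for example, loaded from .env by a tool such as python-dotenv, which is not shown here), and the helper name `require_env` is hypothetical.

```python
import os
from collections.abc import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable or raise ValueError."""
    value = environ.get(name)
    if not value:
        raise ValueError(f"Missing required setting: {name}")
    return value


# In config.py these would be module-level constants, so that
# `from config import INPUT_PATH, OUTPUT_PATH` works:
#
#     INPUT_PATH = require_env("INPUT_PATH")
#     OUTPUT_PATH = require_env("OUTPUT_PATH")

# Demonstration with an explicit mapping instead of the real environment.
settings = {"INPUT_PATH": "data/sales.csv", "OUTPUT_PATH": "output/clean.csv"}
print(require_env("INPUT_PATH", settings))  # data/sales.csv
```

Failing at import time with a named variable is the point of the exercise: a missing setting surfaces before any data is read.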
Task: Pull the transform logic out of the loop into composable functions.
Instructions for students:
- Write a clean_names(rows) function that strips and title-cases product_name, returns a new list
- Write a calculate_revenue(rows) function that adds revenue and vat fields, returns a new list
- Chain them: data = clean_names(raw) then data = calculate_revenue(data)
- Print raw[0] after the chain. It should be unchanged (no mutation)

Key Moment: Have students print the original data before and after. If they forgot {**row, "key": val}, the original is mutated. This is the "aha" moment for immutability.
Discussion: "Why is it better to have 3 small functions than 1 big loop?" (Testable, reorderable, reusable)
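A possible reference solution for the two transforms is sketched here. It assumes rows are dictionaries of strings as produced by csv.DictReader; the sample row and the `vat_rate` parameter name are illustrative, with 0.21 taken from the disaster script.

```python
def clean_names(rows):
    """Return new rows with product_name stripped and title-cased."""
    return [
        {**row, "product_name": row["product_name"].strip().title()}
        for row in rows
    ]


def calculate_revenue(rows, vat_rate=0.21):
    """Return new rows with numeric price plus revenue and vat fields."""
    result = []
    for row in rows:
        price = float(row["price"])
        revenue = price * int(row["quantity"])
        # {**row, ...} builds a fresh dict, so the input row is untouched.
        result.append(
            {**row, "price": price, "revenue": revenue, "vat": revenue * vat_rate}
        )
    return result


# Hypothetical sample input standing in for rows read from the CSV.
raw = [{"product_name": "  blue widget ", "price": "10.0", "quantity": "3"}]

data = clean_names(raw)
data = calculate_revenue(data)

print(data[0]["product_name"])  # Blue Widget
print(raw[0]["product_name"] == "  blue widget ")  # True: original unchanged
```

Because each function takes rows and returns new rows, the steps can be reordered or tested one at a time with a small hand-written list.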
Goal: Students add validation with dataclasses and see OOP vs functional in action. This covers Chapters 4, 5, and 6.
Quick Discussion: Show two code snippets side by side:
# A: Function (no state)
def apply_vat(price, rate=0.21):
    return round(price * (1 + rate), 2)

# B: Class (has state)
class DatabaseConnector:
    def __init__(self, host, port):
        self.host = host
        self.port = port
        self.connected = False

    def connect(self):
        self.connected = True
Ask: "Which one needs to be a class? Why?" (B needs state. A is just input-output, a function is simpler.)
Rule of Thumb: Data & Config = Classes. Logic & Transforms = Functions.
Task: Create a Transaction dataclass with __post_init__ validation.
Instructions for students:
- Define Transaction with fields: product_name: str, price: float, quantity: int, revenue: float
- In __post_init__: raise ValueError if price < 0 or product_name is empty
- Convert the rows into Transaction instances
- Try Transaction(product_name="", price=-5, quantity=1, revenue=0): it should crash immediately

Key Moment: "Would you rather crash here with a clear ValueError, or discover bad data in your CEO's revenue report next Monday?"
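The dataclass described in the steps above might be sketched as follows. The field list and the two checks come from the task; the error message wording and the `to_transaction` helper are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    product_name: str
    price: float
    quantity: int
    revenue: float

    def __post_init__(self):
        # Runs automatically right after the generated __init__.
        if not self.product_name:
            raise ValueError("product_name must not be empty")
        if self.price < 0:
            raise ValueError(f"price must not be negative, got {self.price}")


def to_transaction(row):
    """Hypothetical helper: build a Transaction from one transformed row."""
    return Transaction(
        product_name=row["product_name"],
        price=float(row["price"]),
        quantity=int(row["quantity"]),
        revenue=float(row["revenue"]),
    )


good = to_transaction(
    {"product_name": "Blue Widget", "price": 10.0, "quantity": "3", "revenue": 30.0}
)
print(good.revenue)  # 30.0

try:
    Transaction(product_name="", price=-5, quantity=1, revenue=0)
except ValueError as exc:
    print(f"Rejected: {exc}")  # Rejected: product_name must not be empty
```

Note that a dataclass does not enforce the type annotations by itself; only the explicit checks in `__post_init__` reject bad values.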
Task: Wire up 3+ composable functions into a pipeline.
Instructions for students:
- Write a remove_invalid(rows) function (drops rows with empty names or negative prices)