Week 2 - Structuring Pipelines

🛠️ Practice

These exercises reinforce the core skills from this week. Each one is short and focused: complete them before starting the assignment. They build on each other, so work through them in order.

The "BAD" snippets in each exercise are intentional anti-patterns: they're written to fail or behave badly so you can refactor them. Don't expect them to run as-is.

Open the workspace once

All Week 2 exercises live under data-track/week-2/ in HYF's Learning-Resources repo. One Codespace covers every exercise on this page.

<aside> 💻 Open in GitHub Codespaces

</aside>

The repo's data-track/.devcontainer/ boots Python 3.11 + ruff + Pylance for every exercise. From the Codespace's Explorer, navigate into data-track/week-2/exercise_N/.

Prefer your own VS Code? Clone locally instead:

git clone https://github.com/HackYourFuture/Learning-Resources.git
cd Learning-Resources/data-track/week-2
code .

Each exercise folder ships its own requirements.txt (when needed) and a per-exercise README with detailed instructions. The chapter prose below is the high-level brief; the folder READMEs are the detailed brief.

Reference solutions (peek only after attempting)

Each exercise_N/solutions/ folder holds the answer in-place. The starter file is filled with the answer code, the original # TODO comments are preserved, and # WHY ...: notes sit under each non-obvious choice. The file you read is the question and the answer side-by-side.

Read the WHY notes, not the code. The point is the reasoning, not the syntax.

Spoiler discipline

The solution sits next to your starter under solutions/ rather than on a separate branch. The folder name and the deliberate "open this folder to see the answer" click are the whole barrier, and they are enough. Time-box yourself if you find yourself peeking reflexively: 10 minutes of honest attempt before you open solutions/. The struggle is where the learning happens.

You can diff your attempt against the reference once you have tried:

diff exercise_3/exercise.py exercise_3/solutions/exercise.py

Exercise 1: Move Secrets to .env

This script has hardcoded credentials. Your job: make it safe.

<aside> ⚠️ Never commit your .env. Add .env to .gitignore before you create it, not after. Removing a leaked .env from git history takes a history rewrite and a force-push, and every key in it must be rotated immediately.

</aside>

# BAD: secrets in code ❌
import requests

API_KEY = "sk-abc123-secret"
BASE_URL = "https://api.example.com"

response = requests.get(f"{BASE_URL}/data", headers={"Authorization": API_KEY})
print(response.json())

Your task:

  1. Create a .env file with API_KEY and BASE_URL
  2. Create a config.py that loads these using python-dotenv
  3. Refactor the script to import from config.py
  4. Add .env to .gitignore

Success criteria: The script works the same way, but no secrets appear in your Python files.
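
One way to shape a config module is to fail fast when a variable is missing, instead of passing None around until a request breaks. The sketch below is not the reference solution. It uses only os.environ so it runs on its own, and it assumes python-dotenv's load_dotenv() has already been called (or the variables are exported in your shell). require_env is a hypothetical helper name, and the values are placeholders.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail loudly.

    Hypothetical helper: in a real config.py you would call
    dotenv.load_dotenv() before this so values from .env are available.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Stand-in for what load_dotenv() would do with a .env file.
# Placeholder values only; setdefault never overwrites a value that is already set.
os.environ.setdefault("API_KEY", "dummy-key-for-local-testing")
os.environ.setdefault("BASE_URL", "https://api.example.com")

API_KEY = require_env("API_KEY")
BASE_URL = require_env("BASE_URL")
```

The rest of the code then imports API_KEY and BASE_URL from this module and never reads the environment directly, so a missing key surfaces at startup with a clear message.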

<aside> 📦 Files: exercise_1/ (use the Codespace you opened at the top of this page).

</aside>


Exercise 2: Model Data with a Dataclass

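This code uses a dictionary to represent a weather reading. Refactor it to use a dataclass. The KeyError below is the whole point: the typo "temprature" goes unnoticed until runtime. With a dataclass, a misspelled field name fails immediately at construction (TypeError), and a type checker can flag a misspelled attribute such as reading.temprature before you run the code.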

# BAD: fragile dictionary ❌
reading = {
    "city": "Amsterdam",
    "temp_celsius": 18.5,
    "humidity": 72,
}

# Typo goes unnoticed until runtime
print(reading["temprature"])  # KeyError!

Your task:

  1. Create a WeatherReading dataclass with city (str), temp_celsius (float), and humidity (int)
  2. Add a __post_init__ check: humidity must be between 0 and 100
  3. Add a method is_hot() that returns True if temp_celsius > 30
  4. Add a method to_dict() using dataclasses.asdict

Success criteria: Creating WeatherReading(city="Amsterdam", temp_celsius=18.5, humidity=150) raises a ValueError.
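
The pattern behind tasks 2 to 4 can be sketched on an unrelated type. BatteryStatus and its fields are hypothetical and chosen so this does not give away the exercise. The block shows a range check in __post_init__, a small method, and dataclasses.asdict.

```python
from dataclasses import asdict, dataclass


@dataclass
class BatteryStatus:
    """Hypothetical example type, unrelated to the exercise's WeatherReading."""

    device_id: str
    charge_percent: int

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so an invalid object
        # is rejected at construction time.
        if not 0 <= self.charge_percent <= 100:
            raise ValueError(
                f"charge_percent must be between 0 and 100, got {self.charge_percent}"
            )

    def is_low(self) -> bool:
        return self.charge_percent < 20


status = BatteryStatus(device_id="sensor-7", charge_percent=15)
status_as_dict = asdict(status)  # plain dict, ready for json.dumps or a CSV writer
```

Validating in __post_init__ means every later function can trust the object without re-checking it.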

<aside> 📦 Files: exercise_2/.

</aside>


Exercise 3: Separate I/O from Logic

This "god function" does everything at once. Split it into three layers.

# BAD: I/O and logic mixed together ❌
import csv

def process_sales():
    with open("sales.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    total = 0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity

    with open("report.txt", "w") as f:
        f.write(f"Total revenue: €{total:.2f}")

Your task:

  1. Create read_sales(path): I/O function that reads and returns the data
  2. Create calculate_revenue(rows): pure function that computes the total (no file access!)
  3. Create write_report(total, path): I/O function that writes the result
  4. Create run_pipeline(input_path, output_path): orchestrator that calls the three functions in order

Success criteria: You can call calculate_revenue() with test data (a list of dicts) without any file on disk.
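
To see why the split pays off, here is a minimal sketch on different data. The function names and the score column are hypothetical. The I/O function accepts any open text source, and the logic function only ever sees a list of dicts, so it can be exercised with in-memory data.

```python
import csv
import io


def read_rows(source) -> list[dict]:
    """I/O layer: accepts any open text file-like object."""
    return list(csv.DictReader(source))


def count_passing(rows: list[dict], threshold: float) -> int:
    """Pure logic: no files, no printing, same input always gives same output."""
    return sum(1 for row in rows if float(row["score"]) >= threshold)


# An in-memory CSV stands in for a file on disk (made-up data).
fake_file = io.StringIO("student,score\nana,7.5\nben,4.0\ncara,9.0\n")
rows = read_rows(fake_file)
passing = count_passing(rows, threshold=5.5)
```

count_passing can also be called with a hand-written list such as [{"score": "6"}], which is exactly what a unit test does.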

<aside> 📦 Files: exercise_3/ (includes a sales.csv with mixed valid / zero-price / negative-price rows).

</aside>


Exercise 4: Write Tests with Pytest

Now test the pure function from Exercise 3.

Your task:

Create a file test_sales.py with the following tests:

  1. test_calculate_revenue_basic: pass in two valid rows, check the total is correct
  2. test_calculate_revenue_skips_invalid: pass in rows with negative prices or zero quantities, verify they are excluded
  3. Use @pytest.fixture to create a reusable sample_rows fixture
  4. Use @pytest.mark.parametrize to test multiple edge cases: empty list, single row, rows with zero price

Run with:

pytest test_sales.py -v

Success criteria: All tests pass, and each test is independent (no test depends on another test's result).
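
If the fixture and parametrize syntax is new to you, the sketch below shows both on a hypothetical function, total_minutes, rather than on calculate_revenue. It assumes pytest is installed. Treat it as one possible layout, not the reference solution.

```python
import pytest


def total_minutes(sessions: list[dict]) -> int:
    """Hypothetical pure function under test: sums positive durations only."""
    return sum(s["minutes"] for s in sessions if s["minutes"] > 0)


@pytest.fixture
def sample_sessions() -> list[dict]:
    # pytest builds a fresh list for every test that requests this fixture,
    # so tests cannot leak state into each other.
    return [{"minutes": 30}, {"minutes": 45}]


def test_total_minutes_basic(sample_sessions):
    assert total_minutes(sample_sessions) == 75


@pytest.mark.parametrize(
    ("sessions", "expected"),
    [
        ([], 0),  # empty input
        ([{"minutes": 10}], 10),  # single row
        ([{"minutes": 0}, {"minutes": -5}], 0),  # nothing valid
    ],
)
def test_total_minutes_edge_cases(sessions, expected):
    assert total_minutes(sessions) == expected
```

Each parametrize case is reported as its own test in the -v output, which makes it easy to see which edge case failed.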

<aside> 📦 Files: exercise_4/ (includes a reference sales.py so you can write tests even before finishing Exercise 3).

</aside>


Exercise 5: Refactor a "God Function"

This is the final boss. Combine everything you've practiced.

# BAD: everything in one place ❌
import json

def run():
    with open("users.json") as f:
        users = json.load(f)

    active = []
    for user in users:
        if user["status"] == "active" and user["age"] >= 18:
            user["email"] = user["email"].lower().strip()
            active.append(user)

    with open("active_users.json", "w") as f:
        json.dump(active, f)

    print(f"Processed {len(active)} users")

run()

Your task:

  1. Create a User dataclass with fields: name, email, age, status
  2. Move file paths to a .env file and load them via config.py
  3. Separate into: read_users(path), filter_active_adults(users), clean_email(user), save_users(users, path)
  4. Write at least 3 tests for your pure functions
  5. Create a main.py orchestrator that ties it all together
  6. Add a short AI_NOTES.md listing one thing an LLM helped you with and one suggestion you rejected (with the reason). One paragraph each, no more.

Success criteria: Running pytest shows all tests green. Running main.py produces the same output as the original script.
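
One detail worth noticing in the original run(): it rewrites user["email"] in place, so the input list is changed as a side effect. A cleaning function is easier to test when it returns a new object instead. The sketch below shows that idea with a hypothetical Contact type and dataclasses.replace. Making the dataclass frozen is an optional design choice, not a requirement of the exercise.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Contact:
    """Hypothetical type; frozen so accidental mutation raises an error."""

    name: str
    email: str


def normalise_email(contact: Contact) -> Contact:
    """Pure function: returns a new Contact and leaves the input untouched."""
    return replace(contact, email=contact.email.strip().lower())


raw = Contact(name="Sam", email="  Sam@Example.COM ")
clean = normalise_email(raw)
```

Because raw is never modified, a test can assert on both the input and the output without worrying about the order in which functions were called.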

<aside> 📦 Files: exercise_5/ (includes the input users.json with 5 users, a mix of active/inactive and adult/minor, so the expected output of "Processed 3 users" is reproducible).

</aside>

This is the largest exercise in the week: give it more time than the others, and use the LLM as a sparring partner.

<aside> 💡 Using AI to help: This is the right exercise to lean on an LLM. Paste one function at a time (⚠️ no real data, no PII) and ask for a refactoring suggestion that splits I/O from logic. Then check: does the suggested split make the logic testable without files? If not, push back and ask again. The point is to argue with the suggestion, not accept it.

</aside>


Exercise 6: OOP vs Functional: Spot the State

Concepts: OOP vs Functional (Ch4)

Three pipeline components are each written as a class. For each one, decide: does it hold state that methods share via self, or is it just input → output with nothing stored?

class NameCleaner:
    def clean(self, rows: list[dict]) -> list[dict]:
        return [{**r, "product_name": r["product_name"].strip().title()} for r in rows]

class DatabaseSink:
    def __init__(self):
        self._written: list[dict] = []
        self.written_count: int = 0

    def write(self, row: dict) -> None:
        self._written.append(row)
        self.written_count += 1

    def flush(self) -> list[dict]:
        return self._written

class ReportFormatter:
    def format(self, rows: list[dict]) -> str:
        lines = [f"{r['product_name']}: €{float(r['price']):.2f}" for r in rows]
        return "\n".join(lines)

Your task:

  1. Write a comment above each class: # KEEP AS CLASS or # REFACTOR TO FUNCTION
  2. For each one marked # REFACTOR TO FUNCTION, rewrite it as a plain function
  3. Run the checks in the starter to verify your refactored functions produce the same output

Success criteria: Both refactored functions produce the same results as the original class methods. The one correct class is left unchanged.
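
The question to ask of each class is whether one call depends on what an earlier call stored. The sketch below illustrates the distinction with two hypothetical classes that are not part of the exercise: one stores nothing and collapses into a function, the other accumulates state and is reasonable to keep as a class.

```python
class Upper:
    """No __init__, nothing stored on self: the class adds no value."""

    def apply(self, words: list[str]) -> list[str]:
        return [w.upper() for w in words]


def upper(words: list[str]) -> list[str]:
    """Same behaviour as Upper.apply, written as a plain function."""
    return [w.upper() for w in words]


class RunningTotal:
    """Each add() depends on earlier calls, so the state belongs in an object."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, amount: int) -> int:
        self.total += amount
        return self.total


counter = RunningTotal()
counter.add(5)
latest = counter.add(7)  # depends on the previous add(5)
```

A quick test: if you can delete self from every method body without changing anything, the class is probably a function in disguise.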

<aside> 📦 Files: exercise_6/ (use the Codespace you opened at the top of this page).

</aside>


Ruff Practice (Ch8)

Concepts: Linting and Formatting (Ch8)

A deliberately messy Python file with intentional violations. Your goal: run ruff check, identify each rule code, fix them one at a time, and re-run until clean.

cd data-track/week-2/ruff_practice
pip install ruff
ruff check messy.py

You should see 6 errors across 5 rule codes: F401 (twice), F841, B006, E501, E711. Fix them one at a time. When stuck, compare against messy_fixed.py.
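
Most of these codes are cosmetic, but B006 (mutable default argument) points at a real bug. The sketch below shows the behaviour it guards against, plus the `is None` comparison that E711 asks for. The functions are hypothetical and not taken from messy.py.

```python
def append_bad(item, bucket=[]):  # B006: the default list is created once and shared
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # E711 would flag `bucket == None`; identity comparison is the idiomatic form.
    if bucket is None:
        bucket = []  # a fresh list on every call
    bucket.append(item)
    return bucket
```

Calling append_bad twice without a bucket returns the same list object both times, so items from the first call leak into the second. append_good returns a new list on each call.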

<aside> 📦 Files: ruff_practice/ (use the Codespace you opened at the top of this page).

</aside>