Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Gotchas & Pitfalls

Assignment: Refactoring to a Clean Pipeline

🛠️ Practice

These exercises reinforce the core skills from this week. Each one is short and focused: complete them before starting the assignment. They build on each other, so work through them in order.

The "BAD" snippets in each exercise are intentional anti-patterns: they're written to fail or behave badly so you can refactor them. Don't expect them to run as-is.

Open the workspace once

The starter code lives in the hyf-data-track-python-exercises repository on the w2 branch. One Codespace covers all five exercises.

<aside> 💻 Open in GitHub Codespaces

</aside>

Prefer your own VS Code? Clone locally instead:

git clone -b w2 https://github.com/lassebenni/hyf-data-track-python-exercises.git
cd hyf-data-track-python-exercises
code .

Each exercise folder ships its own requirements.txt (when needed) and a per-exercise README with detailed instructions. The chapter prose below is the high-level brief; the folder READMEs carry the details.

Reference solutions (peek only after attempting)

When you've made an honest attempt at an exercise (or got stuck for more than 10 minutes), peek at the w2-solutions branch. Each starter file is filled in-place with the answer plus # WHY ...: notes explaining non-obvious choices. The original # TODO comments are preserved so you can read the question and the answer side-by-side.

Read the WHY notes, not the code. The point is the reasoning, not the syntax.

Spoiler discipline

If you switch branches with uncommitted work in your tree, git refuses with Your local changes ... would be overwritten by checkout. Two safe paths:

  1. Commit or git stash first, then git checkout w2-solutions. Switch back with git checkout w2.
  2. Browse the solution file directly on GitHub without switching branches: open the w2-solutions branch in the GitHub web UI and read it there. No local state changes.

Exercise 1: Move Secrets to .env

This script has hardcoded credentials. Your job: make it safe.

<aside> ⚠️ Never commit your .env. Add .env to .gitignore before you create it, not after. A leaked .env in git history needs a force-push to remove plus an immediate rotation of every key in it.

</aside>

<!-- runner:expect-fail -->

# BAD: secrets in code ❌
import requests

API_KEY = "sk-abc123-secret"
BASE_URL = "https://api.example.com"

response = requests.get(f"{BASE_URL}/data", headers={"Authorization": API_KEY})
print(response.json())

Your task:

  1. Create a .env file with API_KEY and BASE_URL
  2. Create a config.py that loads these using python-dotenv
  3. Refactor the script to import from config.py
  4. Add .env to .gitignore

Success criteria: The script works the same way, but no secrets appear in your Python files.

<aside> 📦 Files: exercise_1/ on the w2 branch (use the Codespace you opened at the top of this page).

</aside>


Exercise 2: Model Data with a Dataclass

This code uses a dictionary to represent a weather reading. Refactor it to use a dataclass. The KeyError below is the whole point: the typo "temprature" goes unnoticed until runtime, whereas with a dataclass the same typo is an attribute access your editor and type checker can flag before you run the code.

<!-- runner:expect-fail -->

# BAD: fragile dictionary ❌
reading = {
    "city": "Amsterdam",
    "temp_celsius": 18.5,
    "humidity": 72,
}

# Typo goes unnoticed until runtime
print(reading["temprature"])  # KeyError!

Your task:

  1. Create a WeatherReading dataclass with city (str), temp_celsius (float), and humidity (int)
  2. Add a __post_init__ check: humidity must be between 0 and 100
  3. Add a method is_hot() that returns True if temp_celsius > 30
  4. Add a method to_dict() using dataclasses.asdict

Success criteria: Creating WeatherReading(city="Amsterdam", temp_celsius=18.5, humidity=150) raises a ValueError.
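To see the payoff, here is a minimal sketch of how the same typo fails on a dataclass. It deliberately omits the __post_init__ check and the two methods, which are still your tasks:

```python
from dataclasses import dataclass

@dataclass
class WeatherReading:
    city: str
    temp_celsius: float
    humidity: int

reading = WeatherReading(city="Amsterdam", temp_celsius=18.5, humidity=72)

try:
    print(reading.temprature)  # same typo: now an AttributeError,
except AttributeError as err:  # and editors/type checkers flag it statically
    caught = str(err)
```

A dict silently accepts any key on write and only complains on read; a dataclass fixes the set of field names up front.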

<aside> 📦 Files: exercise_2/ on the w2 branch.

</aside>


Exercise 3: Separate I/O from Logic

This "god function" does everything at once. Split it into three layers.

# BAD: I/O and logic mixed together ❌
import csv

def process_sales():
    with open("sales.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    total = 0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity

    with open("report.txt", "w") as f:
        f.write(f"Total revenue: €{total:.2f}")

Your task:

  1. Create read_sales(path): I/O function that reads and returns the data
  2. Create calculate_revenue(rows): pure function that computes the total (no file access!)
  3. Create write_report(total, path): I/O function that writes the result
  4. Create run_pipeline(input_path, output_path): orchestrator that calls the three functions in order

Success criteria: You can call calculate_revenue() with test data (a list of dicts) without any file on disk.
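One possible shape for the pure layer alone (read_sales, write_report, and run_pipeline are yours to write) shows what "no file access" buys you — the function takes plain dicts and returns a number:

```python
def calculate_revenue(rows: list[dict]) -> float:
    """Sum price * quantity, skipping zero/negative rows (as in the original)."""
    total = 0.0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity
    return total

# Callable with in-memory test data - no sales.csv on disk needed:
sample = [
    {"price": "10.0", "quantity": "2"},  # valid: contributes 20.0
    {"price": "-5.0", "quantity": "1"},  # negative price: skipped
]
```

Because the function never touches a file, Exercise 4's tests can feed it lists like `sample` directly.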

<aside> 📦 Files: exercise_3/ on the w2 branch: includes a sales.csv with mixed valid / zero-price / negative-price rows.

</aside>


Exercise 4: Write Tests with Pytest

Now test the pure function from Exercise 3.

Your task:

Create a file test_sales.py with the following tests:

  1. test_calculate_revenue_basic: pass in two valid rows, check the total is correct
  2. test_calculate_revenue_skips_invalid: pass in rows with negative prices or zero quantities, verify they are excluded
  3. Use @pytest.fixture to create a reusable sample_rows fixture
  4. Use @pytest.mark.parametrize to test multiple edge cases: empty list, single row, rows with zero price

Run with:

pytest test_sales.py -v

Success criteria: All tests pass, and each test is independent (no test depends on another test's result).
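The test-file shape can be sketched like this. The inline calculate_revenue stub stands in for your real `from sales import calculate_revenue`; the fixture and parametrize decorators are the two pytest features the task asks for:

```python
import pytest

def calculate_revenue(rows):
    # Stand-in for your Exercise 3 function.
    return sum(
        float(r["price"]) * int(r["quantity"])
        for r in rows
        if float(r["price"]) > 0 and int(r["quantity"]) > 0
    )

@pytest.fixture
def sample_rows():
    # Reusable test data: pytest injects this into any test that names it.
    return [
        {"price": "10.0", "quantity": "2"},
        {"price": "3.0", "quantity": "1"},
    ]

def test_calculate_revenue_basic(sample_rows):
    assert calculate_revenue(sample_rows) == 23.0

@pytest.mark.parametrize(
    "rows, expected",
    [
        ([], 0.0),                                   # empty list
        ([{"price": "5.0", "quantity": "1"}], 5.0),  # single row
        ([{"price": "0", "quantity": "4"}], 0.0),    # zero price: skipped
    ],
)
def test_calculate_revenue_edge_cases(rows, expected):
    assert calculate_revenue(rows) == expected
```

Note each test builds or receives its own data, which is what keeps them independent.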

<aside> 📦 Files: exercise_4/ on the w2 branch: includes a reference sales.py so you can write tests even before finishing Exercise 3.

</aside>


Exercise 5: Refactor a "god function"

This is the final boss. Combine everything you've practiced.

<!-- runner:expect-fail -->

# BAD: everything in one place ❌
import json

def run():
    with open("users.json") as f:
        users = json.load(f)

    active = []
    for user in users:
        if user["status"] == "active" and user["age"] >= 18:
            user["email"] = user["email"].lower().strip()
            active.append(user)

    with open("active_users.json", "w") as f:
        json.dump(active, f)

    print(f"Processed {len(active)} users")

run()

Your task:

  1. Create a User dataclass with fields: name, email, age, status
  2. Move file paths to a .env file and load them via config.py
  3. Separate into: read_users(path), filter_active_adults(users), clean_email(user), save_users(users, path)
  4. Write at least 3 tests for your pure functions
  5. Create a main.py orchestrator that ties it all together
  6. Add a short AI_NOTES.md listing one thing an LLM helped you with and one suggestion you rejected (with the reason). One paragraph each, no more.

Success criteria: Running pytest shows all tests green. Running main.py produces the same output as the original script.
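The wiring can be sketched as below. This is one possible shape, not the full solution: the I/O layer (read_users, save_users) and the .env/config.py plumbing are deliberately omitted, and the inline demo list stands in for users.json:

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    email: str
    age: int
    status: str

def filter_active_adults(users: list[User]) -> list[User]:
    # Pure: same status/age rule as the original god function.
    return [u for u in users if u.status == "active" and u.age >= 18]

def clean_email(user: User) -> User:
    user.email = user.email.lower().strip()
    return user

def run_pipeline(users: list[User]) -> list[User]:
    # In your main.py, read_users/save_users wrap this with file I/O.
    return [clean_email(u) for u in filter_active_adults(users)]

demo = [
    User("Ada", "  ADA@Example.com ", 36, "active"),
    User("Kid", "kid@example.com", 12, "active"),
    User("Bob", "bob@example.com", 40, "inactive"),
]
result = run_pipeline(demo)
```

The pure middle is exactly what your three-plus tests should target; the file paths and JSON handling stay at the edges.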

<aside> 📦 Files: exercise_5/ on the w2 branch: includes the input users.json (5 users: mix of active/inactive, adult/minor) so the expected output of "Processed 3 users" is reproducible.

</aside>

This is the largest exercise in the week: give it more time than the others, and use the LLM as a sparring partner.

<aside> 💡 Using AI to help: This is the right exercise to lean on an LLM. Paste one function at a time (⚠️ no real data, no PII) and ask for a refactoring suggestion that splits I/O from logic. Then check: does the suggested split make the logic testable without files? If not, push back and ask again. The point is to argue with the suggestion, not accept it.

</aside>