Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Assignment: Refactoring to a Clean Pipeline

Gotchas & Pitfalls

Lesson Plan

🛠️ Practice


These exercises reinforce the core skills from this week. Each one is short and focused: complete them before starting the assignment. They build on each other, so work through them in order.


Exercise 1: Move Secrets to .env

This script has hardcoded credentials. Your job: make it safe.

# BAD: secrets in code ❌
import requests

API_KEY = "sk-abc123-secret"
BASE_URL = "https://api.example.com"

response = requests.get(f"{BASE_URL}/data", headers={"Authorization": API_KEY})
print(response.json())

Your task:

  1. Create a .env file with API_KEY and BASE_URL
  2. Create a config.py that loads these using python-dotenv
  3. Refactor the script to import from config.py
  4. Add .env to .gitignore

Success criteria: The script works the same way, but no secrets appear in your Python files.
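A minimal sketch of what steps 1 and 2 could look like. The variable names match the exercise; the missing-variable check at the bottom is our own addition, not a requirement:

```python
# .env (never commit this file — add it to .gitignore):
#   API_KEY=sk-abc123-secret
#   BASE_URL=https://api.example.com

# config.py — loads the values above using python-dotenv
# (install with: pip install python-dotenv)
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

API_KEY = os.getenv("API_KEY")
BASE_URL = os.getenv("BASE_URL")

if API_KEY is None or BASE_URL is None:
    raise RuntimeError("Missing API_KEY or BASE_URL — did you create a .env file?")
```

Your script then does `from config import API_KEY, BASE_URL` instead of hardcoding the values.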


Exercise 2: Model Data with a Dataclass

This code uses a dictionary to represent a weather reading. Refactor it to use a dataclass.

# BAD: fragile dictionary ❌
reading = {
    "city": "Amsterdam",
    "temp_celsius": 18.5,
    "humidity": 72,
}

# Typo goes unnoticed until runtime
print(reading["temprature"])  # KeyError!

Your task:

  1. Create a WeatherReading dataclass with city (str), temp_celsius (float), and humidity (int)
  2. Add a __post_init__ check: humidity must be between 0 and 100
  3. Add a method is_hot() that returns True if temp_celsius > 30
  4. Add a method to_dict() using dataclasses.asdict

Success criteria: Creating WeatherReading(city="Amsterdam", temp_celsius=18.5, humidity=150) raises a ValueError.
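One way the finished dataclass could look — a sketch covering all four steps; the exact error message is our own choice:

```python
from dataclasses import asdict, dataclass


@dataclass
class WeatherReading:
    city: str
    temp_celsius: float
    humidity: int

    def __post_init__(self) -> None:
        # Reject impossible humidity values at construction time
        if not 0 <= self.humidity <= 100:
            raise ValueError(f"humidity must be between 0 and 100, got {self.humidity}")

    def is_hot(self) -> bool:
        return self.temp_celsius > 30

    def to_dict(self) -> dict:
        return asdict(self)
```

Unlike the dictionary version, a typo like `reading.temprature` now fails loudly as an `AttributeError`, and invalid data never makes it into the object in the first place.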


Exercise 3: Separate I/O from Logic

This "god function" does everything at once. Split it into three layers.

# BAD: I/O and logic mixed together ❌
import csv

def process_sales():
    with open("sales.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    total = 0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity

    with open("report.txt", "w") as f:
        f.write(f"Total revenue: €{total:.2f}")

Your task:

  1. Create read_sales(path): I/O function that reads and returns the data
  2. Create calculate_revenue(rows): pure function that computes the total (no file access!)
  3. Create write_report(total, path): I/O function that writes the result
  4. Create run_pipeline(input_path, output_path): orchestrator that calls the three functions in order

Success criteria: You can call calculate_revenue() with test data (a list of dicts) without any file on disk.
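The four-function split could be sketched like this (function names follow the task; the type hints are our own addition):

```python
import csv


def read_sales(path: str) -> list[dict]:
    # I/O layer: touches the filesystem, nothing else
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def calculate_revenue(rows: list[dict]) -> float:
    # Pure logic: no files, no prints — trivially testable
    total = 0.0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity
    return total


def write_report(total: float, path: str) -> None:
    # I/O layer: writes the result
    with open(path, "w") as f:
        f.write(f"Total revenue: €{total:.2f}")


def run_pipeline(input_path: str, output_path: str) -> None:
    # Orchestrator: wires the layers together in order
    rows = read_sales(input_path)
    total = calculate_revenue(rows)
    write_report(total, output_path)
```

Notice that only the orchestrator knows about file paths; `calculate_revenue` happily accepts a plain list of dicts.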


Exercise 4: Write Tests with Pytest

Now test the pure function from Exercise 3.

Your task:

Create a file test_sales.py with the following tests:

  1. test_calculate_revenue_basic: pass in two valid rows, check the total is correct
  2. test_calculate_revenue_skips_invalid: pass in rows with negative prices or zero quantities, verify they are excluded
  3. Use @pytest.fixture to create a reusable sample_rows fixture
  4. Use @pytest.mark.parametrize to test multiple edge cases: empty list, single row, rows with zero price

Run with:

pytest test_sales.py -v

Success criteria: All tests pass, and each test is independent (no test depends on another test's result).
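A possible shape for `test_sales.py`. In your project you would write `from sales import calculate_revenue`; it is inlined here (as a shortened sketch) so the example is self-contained:

```python
import pytest


def calculate_revenue(rows: list[dict]) -> float:
    # Inlined stand-in for the pure function from Exercise 3
    return sum(
        float(r["price"]) * int(r["quantity"])
        for r in rows
        if float(r["price"]) > 0 and int(r["quantity"]) > 0
    )


@pytest.fixture
def sample_rows() -> list[dict]:
    # Reusable test data: pytest injects this into any test that names it
    return [
        {"price": "10.00", "quantity": "2"},
        {"price": "5.00", "quantity": "1"},
    ]


def test_calculate_revenue_basic(sample_rows):
    assert calculate_revenue(sample_rows) == 25.0


def test_calculate_revenue_skips_invalid():
    rows = [
        {"price": "-3.00", "quantity": "2"},  # negative price: excluded
        {"price": "4.00", "quantity": "0"},   # zero quantity: excluded
    ]
    assert calculate_revenue(rows) == 0.0


@pytest.mark.parametrize(
    "rows, expected",
    [
        ([], 0.0),                                    # empty list
        ([{"price": "2.00", "quantity": "3"}], 6.0),  # single row
        ([{"price": "0", "quantity": "5"}], 0.0),     # zero price
    ],
)
def test_calculate_revenue_edge_cases(rows, expected):
    assert calculate_revenue(rows) == expected
```

Each test builds its own input (or receives a fresh fixture), so no test depends on another's result.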


Exercise 5: Refactor a "god function"

This is the final boss. Combine everything you've practiced.

# BAD: everything in one place ❌
import json

def run():
    with open("users.json") as f:
        users = json.load(f)

    active = []
    for user in users:
        if user["status"] == "active" and user["age"] >= 18:
            user["email"] = user["email"].lower().strip()
            active.append(user)

    with open("active_users.json", "w") as f:
        json.dump(active, f)

    print(f"Processed {len(active)} users")

run()

Your task:

  1. Create a User dataclass with fields: name, email, age, status
  2. Move file paths to a .env file and load them via config.py
  3. Separate into: read_users(path), filter_active_adults(users), clean_email(user), save_users(users, path)
  4. Write at least 3 tests for your pure functions
  5. Create a main.py orchestrator that ties it all together

Success criteria: Running pytest shows all tests green. Running main.py produces the same output as the original script.
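To give a sense of the target shape, here is a sketch of the refactored pieces (steps 1, 3, and the orchestrator; the `.env`/`config.py` wiring and tests follow the earlier exercises). `clean_email` returning a new `User` instead of mutating is our own design choice:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class User:
    name: str
    email: str
    age: int
    status: str


def read_users(path: str) -> list[User]:
    # I/O: load raw JSON and turn each record into a User
    with open(path) as f:
        return [User(**record) for record in json.load(f)]


def clean_email(user: User) -> User:
    # Pure: returns a new User rather than mutating the input
    return User(user.name, user.email.lower().strip(), user.age, user.status)


def filter_active_adults(users: list[User]) -> list[User]:
    # Pure: selection + normalisation, no I/O
    return [clean_email(u) for u in users if u.status == "active" and u.age >= 18]


def save_users(users: list[User], path: str) -> None:
    # I/O: serialise back to JSON
    with open(path, "w") as f:
        json.dump([asdict(u) for u in users], f)


def run(input_path: str, output_path: str) -> None:
    # Orchestrator: your main.py would call this with paths loaded via config.py
    users = read_users(input_path)
    active = filter_active_adults(users)
    save_users(active, output_path)
    print(f"Processed {len(active)} users")
```

Because `filter_active_adults` and `clean_email` never touch a file, all three required tests can run against plain in-memory `User` objects.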


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


https://hackyourfuture.net/
