Week 2 - Structuring Data Pipelines
These exercises reinforce the core skills from this week. Each one is short and focused: complete them before starting the assignment. They build on each other, so work through them in order.
This script has hardcoded credentials. Your job: make it safe.
# BAD: secrets in code ❌
import requests
API_KEY = "sk-abc123-secret"
BASE_URL = "https://api.example.com"
response = requests.get(f"{BASE_URL}/data", headers={"Authorization": API_KEY})
print(response.json())
Your task:
- Create a .env file with API_KEY and BASE_URL
- Create a config.py that loads these using python-dotenv
- Update the script to import its settings from config.py
- Add .env to .gitignore

Success criteria: The script works the same way, but no secrets appear in your Python files.
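One possible shape for config.py, offered as a sketch rather than the reference solution. It assumes python-dotenv is installed and that .env defines API_KEY and BASE_URL. The Settings class and the load_settings function are illustrative names that do not appear in the exercise.

```python
# config.py: a minimal sketch under the assumptions stated above
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the two values the script needs."""

    api_key: str
    base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, or from .env plus the process environment."""
    if env is None:
        # Imported lazily so callers that pass their own mapping
        # need neither python-dotenv nor a .env file.
        from dotenv import load_dotenv

        # Loads KEY=value pairs from .env into os.environ;
        # variables that are already set are left unchanged by default.
        load_dotenv()
        env = os.environ

    # Fail early with a clear message instead of sending an empty key later.
    missing = [key for key in ("API_KEY", "BASE_URL") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(api_key=env["API_KEY"], base_url=env["BASE_URL"])
```

The script would then call settings = load_settings() once and read settings.api_key and settings.base_url instead of the hardcoded constants. Accepting an optional mapping is a design choice that keeps the function testable without touching the real environment.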
This code uses a dictionary to represent a weather reading. Refactor it to use a dataclass.
# BAD: fragile dictionary ❌
reading = {
    "city": "Amsterdam",
    "temp_celsius": 18.5,
    "humidity": 72,
}
# Typo goes unnoticed until runtime
print(reading["temprature"]) # KeyError!
Your task:
- Create a WeatherReading dataclass with city (str), temp_celsius (float), and humidity (int)
- Add a __post_init__ check: humidity must be between 0 and 100
- Add a method is_hot() that returns True if temp_celsius > 30
- Add a method to_dict() using dataclasses.asdict

Success criteria: Creating WeatherReading(city="Amsterdam", temp_celsius=18.5, humidity=150) raises a ValueError.
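The requirements above could be met by something like the following sketch. The field names and method names come from the task list; the wording of the error message is an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass
class WeatherReading:
    city: str
    temp_celsius: float
    humidity: int

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so invalid
        # readings are rejected at construction time.
        if not 0 <= self.humidity <= 100:
            raise ValueError(
                f"humidity must be between 0 and 100, got {self.humidity}"
            )

    def is_hot(self) -> bool:
        return self.temp_celsius > 30

    def to_dict(self) -> dict:
        # asdict converts the dataclass fields to a plain dictionary.
        return asdict(self)
```

With a dataclass, the typo from the dictionary version becomes reading.temprature, which editors and type checkers can flag before the code runs.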
This "god function" does everything at once. Split it into three layers.
# BAD: I/O and logic mixed together ❌
import csv
def process_sales():
    with open("sales.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    total = 0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity
    with open("report.txt", "w") as f:
        f.write(f"Total revenue: €{total:.2f}")
Your task:
- read_sales(path): I/O function that reads and returns the data
- calculate_revenue(rows): pure function that computes the total (no file access!)
- write_report(total, path): I/O function that writes the result
- run_pipeline(input_path, output_path): orchestrator that calls the three functions in order

Success criteria: You can call calculate_revenue() with test data (a list of dicts) without any file on disk.
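The three layers described above might look like this. It is a sketch that keeps the filtering rule of the original function. Two details are assumptions that the original script does not contain: newline="" when reading the CSV and an explicit UTF-8 encoding when writing the report.

```python
import csv


def read_sales(path):
    """I/O layer: read the CSV file and return its rows as a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def calculate_revenue(rows):
    """Logic layer: pure function, no file access."""
    total = 0.0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        # Same rule as the original: skip non-positive prices and quantities.
        if price > 0 and quantity > 0:
            total += price * quantity
    return total


def write_report(total, path):
    """I/O layer: write the formatted total to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"Total revenue: €{total:.2f}")


def run_pipeline(input_path, output_path):
    """Orchestrator: read, compute, write, in that order."""
    rows = read_sales(input_path)
    total = calculate_revenue(rows)
    write_report(total, output_path)
```

Because calculate_revenue only receives a list of dicts, it can be exercised directly, for example calculate_revenue([{"price": "2.50", "quantity": "4"}]).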
Now test the pure function from Exercise 3.
Your task:
Create a file test_sales.py with the following tests:
- test_calculate_revenue_basic: pass in two valid rows, check the total is correct
- test_calculate_revenue_skips_invalid: pass in rows with negative prices or zero quantities, verify they are excluded
- Use @pytest.fixture to create a reusable sample_rows fixture
- Use @pytest.mark.parametrize to test multiple edge cases: empty list, single row, rows with zero price

Run with:
pytest test_sales.py -v
Success criteria: All tests pass, and each test is independent (no test depends on another test's result).
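A test file along these lines would cover the points listed above. It is a sketch: calculate_revenue is repeated inline only to keep the example self-contained, and in a real project it would be imported from the module written in the previous exercise (the module name sales in the comment is an assumption). The name test_calculate_revenue_edge_cases is also illustrative.

```python
# test_sales.py: a sketch, not the reference solution
import pytest


# In the real project: from sales import calculate_revenue
# (module name assumed). Defined here so the example runs on its own.
def calculate_revenue(rows):
    total = 0.0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity
    return total


@pytest.fixture
def sample_rows():
    # Rebuilt for every test that requests it, so tests stay independent.
    return [
        {"price": "10.00", "quantity": "2"},
        {"price": "5.50", "quantity": "4"},
    ]


def test_calculate_revenue_basic(sample_rows):
    # 10.00 * 2 + 5.50 * 4
    assert calculate_revenue(sample_rows) == pytest.approx(42.0)


def test_calculate_revenue_skips_invalid(sample_rows):
    invalid = [
        {"price": "-3.00", "quantity": "1"},  # negative price
        {"price": "4.00", "quantity": "0"},  # zero quantity
    ]
    assert calculate_revenue(sample_rows + invalid) == pytest.approx(42.0)


@pytest.mark.parametrize(
    "rows, expected",
    [
        ([], 0.0),  # empty list
        ([{"price": "2.00", "quantity": "3"}], 6.0),  # single row
        ([{"price": "0", "quantity": "5"}], 0.0),  # zero price
    ],
)
def test_calculate_revenue_edge_cases(rows, expected):
    assert calculate_revenue(rows) == pytest.approx(expected)
```

pytest.approx is used because the totals are floats; an exact comparison can fail on values that are not exactly representable in binary.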
This is the final boss. Combine everything you've practiced.
# BAD: everything in one place ❌
import json
def run():
    with open("users.json") as f:
        users = json.load(f)
    active = []
    for user in users:
        if user["status"] == "active" and user["age"] >= 18:
            user["email"] = user["email"].lower().strip()
            active.append(user)
    with open("active_users.json", "w") as f:
        json.dump(active, f)
    print(f"Processed {len(active)} users")
run()
Your task:
- Create a User dataclass with fields: name, email, age, status
- Move the file paths to a .env file and load them via config.py
- Split the logic into functions: read_users(path), filter_active_adults(users), clean_email(user), save_users(users, path)
- Write a main.py orchestrator that ties it all together

Success criteria: Running pytest shows all tests green. Running main.py produces the same output as the original script.
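As a starting point, the pieces named above could be arranged as in the sketch below. It assumes every record in users.json has exactly the four fields name, email, age, and status; records with extra keys would need different handling. The main function and its parameters are illustrative, and loading the paths from config.py is left out.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class User:
    name: str
    email: str
    age: int
    status: str


def read_users(path):
    """I/O: load the JSON file and turn each record into a User."""
    with open(path) as f:
        return [User(**record) for record in json.load(f)]


def filter_active_adults(users):
    """Logic: keep users who are active and at least 18."""
    return [u for u in users if u.status == "active" and u.age >= 18]


def clean_email(user):
    """Logic: return a copy with a normalised email; the input is not mutated."""
    return replace(user, email=user.email.lower().strip())


def save_users(users, path):
    """I/O: write the users back to disk as a JSON list of dicts."""
    with open(path, "w") as f:
        json.dump([asdict(u) for u in users], f)


def main(input_path, output_path):
    """Orchestrator: same steps as the original run(), in the same order."""
    users = read_users(input_path)
    active = [clean_email(u) for u in filter_active_adults(users)]
    save_users(active, output_path)
    print(f"Processed {len(active)} users")
```

The two logic functions take and return plain User objects, so they can be tested with hand-built data in the same way as calculate_revenue.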
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0
