Week 2 - Structuring Data Pipelines
Introduction to Data Pipelines
Configuration & Secrets (.env)
Separation of Concerns (I/O vs Logic)
Linting and Formatting with Ruff
Assignment: Refactoring to a Clean Pipeline
These exercises reinforce the core skills from this week. Each one is short and focused: complete them before starting the assignment. They build on each other, so work through them in order.
The "BAD" snippets in each exercise are intentional anti-patterns: they're written to fail or behave badly so you can refactor them. Don't expect them to run as-is.
The starter code lives in the hyf-data-track-python-exercises repository on the w2 branch. One Codespace covers all five exercises.
<aside> 💻 Open in GitHub Codespaces
</aside>
Prefer your own VS Code? Clone locally instead:
```bash
git clone -b w2 https://github.com/lassebenni/hyf-data-track-python-exercises.git
cd hyf-data-track-python-exercises
code .
```
Each exercise folder ships its own requirements.txt (when needed) and a per-exercise README with detailed instructions. The chapter prose below is the high-level brief; the folder READMEs are the detailed brief.
When you've made an honest attempt at an exercise (or got stuck for more than 10 minutes), peek at the w2-solutions branch. Each starter file is filled in-place with the answer plus # WHY ...: notes explaining non-obvious choices. The original # TODO comments are preserved so you can read the question and the answer side-by-side.
Read the WHY notes, not the code. The point is the reasoning, not the syntax.
If you switch branches with uncommitted work in your tree, git refuses with Your local changes ... would be overwritten by checkout. Two safe paths:
1. `git stash` first, then `git checkout w2-solutions`. Switch back with `git checkout w2` and restore your work with `git stash pop`.
2. Open the `w2-solutions` branch in the GitHub web UI and read it there. No local state changes.

This script has hardcoded credentials. Your job: make it safe.
<aside>
⚠️ Never commit your .env. Add .env to .gitignore before you create it, not after. A leaked .env in git history needs a force-push to remove plus an immediate rotation of every key in it.
</aside>
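A `.gitignore` entry for this is a single line (add it before you create the file):

```gitignore
# keep local secrets out of version control
.env
```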
<!-- runner:expect-fail -->
```python
# BAD: secrets in code ❌
import requests

API_KEY = "sk-abc123-secret"
BASE_URL = "https://api.example.com"

response = requests.get(f"{BASE_URL}/data", headers={"Authorization": API_KEY})
print(response.json())
```
Your task:
- Create a `.env` file with `API_KEY` and `BASE_URL`
- Create a `config.py` that loads these using python-dotenv
- Update the script to import its settings from `config.py`
- Add `.env` to `.gitignore`

Success criteria: The script works the same way, but no secrets appear in your Python files.
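Not the full solution, just the shape of the loader. This sketch uses only the standard library; `get_setting` is an illustrative name, and in the exercise python-dotenv's `load_dotenv()` would fill `os.environ` from your `.env` before these lookups run:

```python
import os

# In the exercise, python-dotenv's load_dotenv() populates os.environ
# from your .env file before these lookups happen.

def get_setting(name: str) -> str:
    """Read a required setting from the environment, failing loudly if absent."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Missing required setting: {name}")
    return value
```

Failing loudly at import time beats a `None` that surfaces three functions later as a confusing `TypeError`.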
<aside>
📦 Files: exercise_1/ on the w2 branch (use the Codespace you opened at the top of this page).
</aside>
This code uses a dictionary to represent a weather reading. Refactor it to use a dataclass. The KeyError below is the whole point: the typo "temprature" goes unnoticed until runtime, and a dataclass would have caught it before you ran the code.
<!-- runner:expect-fail -->
```python
# BAD: fragile dictionary ❌
reading = {
    "city": "Amsterdam",
    "temp_celsius": 18.5,
    "humidity": 72,
}

# Typo goes unnoticed until runtime
print(reading["temprature"])  # KeyError!
```
Your task:
- Create a `WeatherReading` dataclass with `city` (str), `temp_celsius` (float), and `humidity` (int)
- Add a `__post_init__` check: `humidity` must be between 0 and 100
- Add an `is_hot()` method that returns `True` if `temp_celsius > 30`
- Add a `to_dict()` method using `dataclasses.asdict`

Success criteria: Creating `WeatherReading(city="Amsterdam", temp_celsius=18.5, humidity=150)` raises a `ValueError`.
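Here is the validation pattern on a deliberately different dataclass, so it doesn't hand you the answer (`Product` and its fields are invented for illustration):

```python
from dataclasses import asdict, dataclass

@dataclass
class Product:
    name: str
    price: float

    def __post_init__(self) -> None:
        # Runs on every construction, so bad data fails at creation time
        if self.price < 0:
            raise ValueError(f"price must be non-negative, got {self.price}")

    def is_free(self) -> bool:
        return self.price == 0.0

p = Product(name="coffee", price=2.5)
print(asdict(p))  # {'name': 'coffee', 'price': 2.5}
```

Unlike the dict, a misspelled `p.pricee` raises `AttributeError` at the access site, and your editor flags it before you even run the file.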
<aside>
📦 Files: exercise_2/ on the w2 branch.
</aside>
This "god function" does everything at once. Split it into three layers.
```python
# BAD: I/O and logic mixed together ❌
import csv

def process_sales():
    with open("sales.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    total = 0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity
    with open("report.txt", "w") as f:
        f.write(f"Total revenue: €{total:.2f}")
```
Your task:
- `read_sales(path)`: I/O function that reads and returns the data
- `calculate_revenue(rows)`: pure function that computes the total (no file access!)
- `write_report(total, path)`: I/O function that writes the result
- `run_pipeline(input_path, output_path)`: orchestrator that calls the three functions in order

Success criteria: You can call `calculate_revenue()` with test data (a list of dicts) without any file on disk.
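The layering pattern, sketched on a toy domain so the exercise stays yours to solve (all names here are invented for illustration, not the exercise's functions). Only the middle function holds the logic, so it can be tested with a plain list:

```python
import json

def read_scores(path: str) -> list[dict]:
    # I/O layer: the only place that knows about files
    with open(path) as f:
        return json.load(f)

def average_passing(rows: list[dict]) -> float:
    # Pure logic: no file access, trivially testable with a literal list
    passing = [r["score"] for r in rows if r["score"] >= 55]
    return sum(passing) / len(passing) if passing else 0.0

def write_summary(avg: float, path: str) -> None:
    # I/O layer again: formatting and writing, no business rules
    with open(path, "w") as f:
        f.write(f"Average passing score: {avg:.1f}")

def run_pipeline(input_path: str, output_path: str) -> None:
    # Orchestrator: wires the layers together, holds no logic itself
    write_summary(average_passing(read_scores(input_path)), output_path)

# The pure function needs no files at all:
print(average_passing([{"score": 80}, {"score": 40}, {"score": 60}]))  # 70.0
```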
<aside>
📦 Files: exercise_3/ on the w2 branch: includes a sales.csv with mixed valid / zero-price / negative-price rows.
</aside>
Now test the pure function from Exercise 3.
Your task:
Create a file test_sales.py with the following tests:
- `test_calculate_revenue_basic`: pass in two valid rows, check the total is correct
- `test_calculate_revenue_skips_invalid`: pass in rows with negative prices or zero quantities, verify they are excluded
- Use a `@pytest.fixture` to create a reusable `sample_rows` fixture
- Use `@pytest.mark.parametrize` to test multiple edge cases: empty list, single row, rows with zero price

Run with:
```bash
pytest test_sales.py -v
```
Success criteria: All tests pass, and each test is independent (no test depends on another test's result).
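The shape of such a test file, shown against a throwaway function (`double_positives` is invented here; your tests target `calculate_revenue` instead):

```python
import pytest

def double_positives(nums: list[int]) -> list[int]:
    # Stand-in pure function so the test patterns have something to bite on
    return [2 * n for n in nums if n > 0]

@pytest.fixture
def sample_nums() -> list[int]:
    # A fresh list per test keeps tests independent of each other
    return [1, -2, 3]

def test_basic(sample_nums):
    assert double_positives(sample_nums) == [2, 6]

def test_skips_non_positive(sample_nums):
    assert -4 not in double_positives(sample_nums)

@pytest.mark.parametrize(
    "nums,expected",
    [([], []), ([5], [10]), ([0], [])],
)
def test_edge_cases(nums, expected):
    assert double_positives(nums) == expected
```

Note that the fixture returns a new list each time it's requested, which is exactly what the independence success criterion is asking for.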
<aside>
📦 Files: exercise_4/ on the w2 branch: includes a reference sales.py so you can write tests even before finishing Exercise 3.
</aside>
This is the final boss. Combine everything you've practiced.
<!-- runner:expect-fail -->
```python
# BAD: everything in one place ❌
import json

def run():
    with open("users.json") as f:
        users = json.load(f)
    active = []
    for user in users:
        if user["status"] == "active" and user["age"] >= 18:
            user["email"] = user["email"].lower().strip()
            active.append(user)
    with open("active_users.json", "w") as f:
        json.dump(active, f)
    print(f"Processed {len(active)} users")

run()
```
Your task:
- A `User` dataclass with fields: `name`, `email`, `age`, `status`
- Configuration in a `.env` file, loaded via `config.py`
- Four functions: `read_users(path)`, `filter_active_adults(users)`, `clean_email(user)`, `save_users(users, path)`
- A `main.py` orchestrator that ties it all together
- An `AI_NOTES.md` listing one thing an LLM helped you with and one suggestion you rejected (with the reason). One paragraph each, no more.

Success criteria: Running `pytest` shows all tests green. Running `main.py` produces the same output as the original script.
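One of the four functions is small enough to sketch at the string level without spoiling the rest. The exercise's `clean_email(user)` takes a user; this is only the string transform inside it, under an invented name:

```python
def normalize_email(raw: str) -> str:
    # Pure string transform: the same lowercase-and-strip the BAD script
    # did inline with user["email"].lower().strip()
    return raw.strip().lower()

print(normalize_email("  Ada.Lovelace@Example.COM "))  # ada.lovelace@example.com
```

Because it touches no dataclass and no file, it is the easiest piece to cover with a one-line test, which is a good place to start.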
<aside>
📦 Files: exercise_5/ on the w2 branch: includes the input users.json (5 users: mix of active/inactive, adult/minor) so the expected output of "Processed 3 users" is reproducible.
</aside>
This is the largest exercise in the week: give it more time than the others, and use the LLM as a sparring partner.
<aside> 💡 Using AI to help: This is the right exercise to lean on an LLM. Paste one function at a time (⚠️ no real data, no PII) and ask for a refactoring suggestion that splits I/O from logic. Then check: does the suggested split make the logic testable without files? If not, push back and ask again. The point is to argue with the suggestion, not accept it.
</aside>