Week 2 - Structuring Pipelines
These exercises reinforce the core skills from this week. Each one is short and focused: complete them before starting the assignment. They build on each other, so work through them in order.
The "BAD" snippets in each exercise are intentional anti-patterns: they're written to fail or behave badly so you can refactor them. Don't expect them to run as-is.
All Week 2 exercises live under data-track/week-2/ in HYF's Learning-Resources repo. One Codespace covers every exercise on this page.
<aside> 💻 Open in GitHub Codespaces
</aside>
The repo's data-track/.devcontainer/ boots Python 3.11 + ruff + Pylance for every exercise. From the Codespace's Explorer, navigate into data-track/week-2/exercise_N/.
Prefer your own VS Code? Clone locally instead:
git clone https://github.com/HackYourFuture/Learning-Resources.git
cd Learning-Resources/data-track/week-2
code .
Each exercise folder ships its own requirements.txt (when needed) and a per-exercise README with detailed instructions. The chapter prose below is the high-level brief; the folder READMEs are the detailed brief.
Each exercise_N/solutions/ folder holds the answer in-place. The starter file is filled with the answer code, the original # TODO comments are preserved, and # WHY ...: notes sit under each non-obvious choice. The file you read is the question and the answer side-by-side.
Read the WHY notes, not the code. The point is the reasoning, not the syntax.
The solution sits next to your starter under solutions/ rather than on a separate branch. The folder name and the deliberate "open this folder to see the answer" click are the whole barrier, and they are enough. Time-box yourself if you find yourself peeking reflexively: 10 minutes of honest attempt before you open solutions/. The struggle is where the learning happens.
You can diff your attempt against the reference once you have tried:
diff exercise_3/exercise.py exercise_3/solutions/exercise.py
This script has hardcoded credentials. Your job: make it safe.
<aside>
⚠️ Never commit your .env. Add .env to .gitignore before you create it, not after. Removing a leaked .env from git history takes a history rewrite and a force-push, plus an immediate rotation of every key in it.
</aside>
# BAD: secrets in code ❌
import requests
API_KEY = "sk-abc123-secret"
BASE_URL = "https://api.example.com"
response = requests.get(f"{BASE_URL}/data", headers={"Authorization": API_KEY})
print(response.json())
Your task:
- Create a .env file with API_KEY and BASE_URL
- Create a config.py that loads these using python-dotenv
- Update the script to read both values from config.py
- Add .env to .gitignore

Success criteria: The script works the same way, but no secrets appear in your Python files.
<aside>
📦 Files: exercise_1/ (use the Codespace you opened at the top of this page).
</aside>
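One pattern that pairs well with a config module is failing fast when a setting is missing. The sketch below is only an illustration, not the reference solution: `require` and `MissingConfigError` are hypothetical names, and the code reads from `os.environ` directly, assuming python-dotenv's `load_dotenv()` has already been called earlier in a real config.py.

```python
import os
from typing import Mapping, Optional


class MissingConfigError(RuntimeError):
    """Raised when a required setting is absent or blank."""


def require(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the setting `name`, or fail loudly if it is missing or blank."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise MissingConfigError(f"Missing required setting: {name}")
    return value


# In a real config.py (assumption): call load_dotenv() first, then
#   API_KEY = require("API_KEY")
#   BASE_URL = require("BASE_URL")

# Demonstration with a fake environment, so no real secret is involved.
fake_env = {"API_KEY": "dummy-key", "BASE_URL": "https://api.example.com"}
print(require("BASE_URL", fake_env))  # https://api.example.com
```

Raising at import time means a missing key stops the script immediately with a clear message, rather than surfacing later as a confusing HTTP error.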
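This code uses a dictionary to represent a weather reading. Refactor it to use a dataclass. The KeyError below is the whole point: with a dictionary, the typo "temprature" goes unnoticed until runtime. With a dataclass, the fields are declared up front, so your editor or a type checker (Pylance in the Codespace) can flag a misspelled attribute before you run the code.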
# BAD: fragile dictionary ❌
reading = {
    "city": "Amsterdam",
    "temp_celsius": 18.5,
    "humidity": 72,
}
# Typo goes unnoticed until runtime
print(reading["temprature"]) # KeyError!
Your task:
- Create a WeatherReading dataclass with city (str), temp_celsius (float), and humidity (int)
- Add a __post_init__ check: humidity must be between 0 and 100
- Add a method is_hot() that returns True if temp_celsius > 30
- Add a method to_dict() using dataclasses.asdict

Success criteria: Creating WeatherReading(city="Amsterdam", temp_celsius=18.5, humidity=150) raises a ValueError.
<aside>
📦 Files: exercise_2/.
</aside>
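To see the three dataclass features from the task working together, here is a minimal sketch on a different, hypothetical class (`StockLevel` is invented for illustration, so the exercise itself stays yours to solve). It shows `__post_init__` validation, a small method, and `dataclasses.asdict`.

```python
from dataclasses import asdict, dataclass


@dataclass
class StockLevel:
    """Hypothetical example class, unrelated to the exercise data."""

    product: str
    units: int

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so invalid data is
        # rejected at construction time.
        if self.units < 0:
            raise ValueError(f"units must be >= 0, got {self.units}")

    def is_low(self, threshold: int = 10) -> bool:
        return self.units < threshold


level = StockLevel(product="tulips", units=4)
print(level.is_low())  # True
print(asdict(level))   # {'product': 'tulips', 'units': 4}
```

Because the check lives in `__post_init__`, every instance that exists is known to be valid, and callers do not need to re-validate it.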
This "god function" does everything at once. Split it into three layers.
# BAD: I/O and logic mixed together ❌
import csv
def process_sales():
    with open("sales.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    total = 0
    for row in rows:
        price = float(row["price"])
        quantity = int(row["quantity"])
        if price > 0 and quantity > 0:
            total += price * quantity
    with open("report.txt", "w") as f:
        f.write(f"Total revenue: €{total:.2f}")
Your task:
- read_sales(path): I/O function that reads and returns the data
- calculate_revenue(rows): pure function that computes the total (no file access!)
- write_report(total, path): I/O function that writes the result
- run_pipeline(input_path, output_path): orchestrator that calls the three functions in order

Success criteria: You can call calculate_revenue() with test data (a list of dicts) without any file on disk.
<aside>
📦 Files: exercise_3/ (includes a sales.csv with mixed valid / zero-price / negative-price rows).
</aside>
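The layering idea can be sketched on an unrelated, hypothetical task (averaging a `score` column; all function names here are invented for illustration). The I/O layer accepts any text stream, so `io.StringIO` can stand in for a file, and the pure layer only ever sees plain dicts.

```python
import csv
import io
from typing import Iterable, TextIO


def read_rows(stream: TextIO) -> list[dict]:
    """I/O layer: turn CSV text into a list of dicts."""
    return list(csv.DictReader(stream))


def average_score(rows: Iterable[dict]) -> float:
    """Pure layer: no files, no printing; same input gives same output."""
    scores = [float(row["score"]) for row in rows]
    return sum(scores) / len(scores) if scores else 0.0


def run(stream: TextIO) -> str:
    """Orchestrator: wires the layers together and formats the result."""
    return f"Average score: {average_score(read_rows(stream)):.1f}"


# The pure function accepts plain dicts, so no file is needed to exercise it.
print(average_score([{"score": "8"}, {"score": "6"}]))  # 7.0
print(run(io.StringIO("name,score\nana,9\nbo,7\n")))    # Average score: 8.0
```

The same shape applies to the sales pipeline: only the functions at the edges touch the disk, and the calculation in the middle can be called with a hand-written list.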
Now test the pure function from Exercise 3.
Your task:
Create a file test_sales.py with the following tests:
- test_calculate_revenue_basic: pass in two valid rows, check the total is correct
- test_calculate_revenue_skips_invalid: pass in rows with negative prices or zero quantities, verify they are excluded
- Use @pytest.fixture to create a reusable sample_rows fixture
- Use @pytest.mark.parametrize to test multiple edge cases: empty list, single row, rows with zero price

Run with:
pytest test_sales.py -v
Success criteria: All tests pass, and each test is independent (no test depends on another test's result).
<aside>
📦 Files: exercise_4/ (includes a reference sales.py so you can write tests even before finishing Exercise 3).
</aside>
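If the fixture and parametrize syntax is new to you, the sketch below shows both on a hypothetical function (`total_units` is invented here so the block is self-contained; it is not part of sales.py). It assumes pytest is installed, as it would be for this exercise.

```python
import pytest


def total_units(rows: list[dict]) -> int:
    """Hypothetical function under test: sums the positive 'units' values."""
    return sum(int(r["units"]) for r in rows if int(r["units"]) > 0)


@pytest.fixture
def sample_rows() -> list[dict]:
    # pytest builds a fresh list for every test that requests this fixture,
    # so tests cannot leak state into each other.
    return [{"units": "3"}, {"units": "2"}]


def test_total_units_basic(sample_rows):
    assert total_units(sample_rows) == 5


@pytest.mark.parametrize(
    ("rows", "expected"),
    [
        ([], 0),                                 # empty input
        ([{"units": "4"}], 4),                   # single row
        ([{"units": "-1"}, {"units": "6"}], 6),  # invalid row skipped
    ],
)
def test_total_units_edge_cases(rows, expected):
    assert total_units(rows) == expected
```

Each tuple in the parametrize list becomes its own test case in the `pytest -v` output, which makes it easy to see exactly which edge case failed.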
This is the final boss. Combine everything you've practiced.
# BAD: everything in one place ❌
import json
def run():
    with open("users.json") as f:
        users = json.load(f)
    active = []
    for user in users:
        if user["status"] == "active" and user["age"] >= 18:
            user["email"] = user["email"].lower().strip()
            active.append(user)
    with open("active_users.json", "w") as f:
        json.dump(active, f)
    print(f"Processed {len(active)} users")
run()
Your task:
- Create a User dataclass with fields: name, email, age, status
- Move the hardcoded file paths into a .env file and load them via config.py
- Split the logic into read_users(path), filter_active_adults(users), clean_email(user), save_users(users, path)
- Write a main.py orchestrator that ties it all together
- Add an AI_NOTES.md listing one thing an LLM helped you with and one suggestion you rejected (with the reason). One paragraph each, no more.

Success criteria: Running pytest shows all tests green. Running main.py produces the same output as the original script.
<aside>
📦 Files: exercise_5/ (includes the input users.json with 5 users, a mix of active/inactive and adult/minor, so the expected output of "Processed 3 users" is reproducible).
</aside>
This is the largest exercise in the week: give it more time than the others, and use the LLM as a sparring partner.
<aside> 💡 Using AI to help: This is the right exercise to lean on an LLM. Paste one function at a time (⚠️ no real data, no PII) and ask for a refactoring suggestion that splits I/O from logic. Then check: does the suggested split make the logic testable without files? If not, push back and ask again. The point is to argue with the suggestion, not accept it.
</aside>
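One thing worth checking in any suggested refactor is whether the cleaning step mutates its input, as the original script does with `user["email"] = ...`. A minimal sketch of the non-mutating alternative, using a hypothetical `Contact` class and `normalise_email` function (both invented for illustration), might look like this:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Contact:
    """Hypothetical record, used only for this illustration."""

    name: str
    email: str


def normalise_email(contact: Contact) -> Contact:
    """Pure function: returns a cleaned copy and leaves the input untouched."""
    return replace(contact, email=contact.email.strip().lower())


raw = Contact(name="Sam", email="  Sam@Example.COM ")
clean = normalise_email(raw)
print(clean.email)                        # sam@example.com
print(raw.email == "  Sam@Example.COM ")  # True: the original is unchanged
```

Returning a new object keeps the function easy to test: the assertion only needs the input and the return value, with no hidden side effects to account for.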
Concepts: OOP vs Functional (Ch4)
Three pipeline components are each written as a class. For each one, decide: does it hold state that methods share via self, or is it just input → output with nothing stored?
class NameCleaner:
    def clean(self, rows: list[dict]) -> list[dict]:
        return [{**r, "product_name": r["product_name"].strip().title()} for r in rows]


class DatabaseSink:
    def __init__(self):
        self._written: list[dict] = []
        self.written_count: int = 0

    def write(self, row: dict) -> None:
        self._written.append(row)
        self.written_count += 1

    def flush(self) -> list[dict]:
        return self._written


class ReportFormatter:
    def format(self, rows: list[dict]) -> str:
        lines = [f"{r['product_name']}: €{float(r['price']):.2f}" for r in rows]
        return "\n".join(lines)
Your task:
- Label each class with a comment: # KEEP AS CLASS or # REFACTOR TO FUNCTION
- For each class labelled # REFACTOR TO FUNCTION, rewrite it as a plain function

Success criteria: Both refactored functions produce the same results as the original class methods. The one correct class is left unchanged.
<aside>
📦 Files: exercise_6/ (use the Codespace you opened at the top of this page).
</aside>
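The "does it share state via self?" test can be illustrated on two hypothetical classes that are deliberately unrelated to the exercise (`Doubler`, `double`, and `RunningTotal` are invented names). The first stores nothing, so a plain function does the same job; the second needs `self` because each call builds on the previous one.

```python
class Doubler:
    """Stateless: no __init__, nothing stored on self."""

    def apply(self, values: list[int]) -> list[int]:
        return [v * 2 for v in values]


def double(values: list[int]) -> list[int]:
    """The same behaviour as a plain function."""
    return [v * 2 for v in values]


class RunningTotal:
    """Stateful: add() and the total attribute share data through self."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, value: int) -> None:
        self.total += value


# The class method and the function give identical results.
assert Doubler().apply([1, 2]) == double([1, 2]) == [2, 4]

tally = RunningTotal()
tally.add(3)
tally.add(4)
print(tally.total)  # 7
```

A quick heuristic: if deleting `self` from every method body would change nothing, the class is probably a function in disguise.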
Concepts: Linting and Formatting (Ch8)
A deliberately messy Python file with intentional violations. Your goal: run ruff check, identify each rule code, fix them one at a time, and re-run until clean.
cd data-track/week-2/ruff_practice
pip install ruff
ruff check messy.py
You should see 6 errors across 5 rule codes: F401 (twice), F841, B006, E501, E711. Fix them one at a time. When stuck, compare against messy_fixed.py.
<aside>
📦 Files: ruff_practice/ (use the Codespace you opened at the top of this page).
</aside>
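As a reminder of what two of those rule codes mean, here is a small sketch on a hypothetical function (`add_tag` is invented for illustration and is not taken from messy.py). B006 flags a mutable default argument and E711 flags a comparison to None with `==`; the "before" version is shown in comments.

```python
from typing import Optional

# Before (would trigger B006 and E711):
#   def add_tag(tag, tags=[]):
#       if tag == None:
#           return tags
#       tags.append(tag)
#       return tags


def add_tag(tag: Optional[str], tags: Optional[list[str]] = None) -> list[str]:
    """Fixed version: None as the default (B006) and an identity check (E711)."""
    if tags is None:
        tags = []      # a fresh list per call, not one shared default
    if tag is None:    # compare to None with `is`, not `==`
        return tags
    tags.append(tag)
    return tags


print(add_tag("a"))  # ['a']
print(add_tag("b"))  # ['b'], with no leftover 'a' from the previous call
```

With the original `tags=[]` default, the second call would have returned `['a', 'b']`, because the default list is created once and shared across calls.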