Week 2 - Structuring Pipelines
Week 2 Assignment: Clean Pipeline
You will work in a GitHub repository per assignment, hand in your work as a Pull Request on your own copy, and your teacher reviews your PR directly.
The Week 2 assignment template repo lives in the HYF organization. Your cohort uses a fork of this template in the HackYourAssignment organization: your teacher will share the link to your cohort's fork at the start of the week. Fork your cohort's repo (not the HYF one) so that your PR targets the correct organization.
<aside> 💡 Can't find the cohort link yet? You can still read the task structure and plan your solution in the HYF repo while you wait.
</aside>
Create a working branch with git switch -c week2-attempt, commit your work there, and open your Pull Request against main. Share the PR URL with your teacher once you are ready for review.
<aside> 💻 Open in GitHub Codespaces
</aside>
<aside>
💡 Working in Codespaces? The Azure CLI is pre-installed in the Codespace. Before starting Task 3, run az login --use-device-code and sign in with the HackYourFuture credentials your teacher provided. See AZURE_LOGIN.md for step-by-step instructions.
</aside>
You have just been hired as a Data Engineer at "MessyCorp Inc." The previous engineer left a single script running on his laptop that loads sales data, cleans it, calculates revenue, and dumps the results to a CSV. He emailed you the file and said: "It works fine, just don't change the folder names."
Your manager wants you to turn this into a professional, testable pipeline before the team grows. The data is messy (whitespace, missing fields, bad values), the code is worse, and there are no tests. Time to fix all of it.
The sales data is already in the repo at task-1/data/messy_sales.csv. It contains 15 transactions with intentional problems such as stray whitespace, missing fields, and bad values.
The script below is what you are refactoring. Do not use this code directly. Study it, understand what it does, then rebuild it properly. The path is intentionally broken (it points at the previous engineer's laptop), so the script fails as soon as it tries to open the file, until you replace the path with INPUT_PATH from your config.
import csv

# 1. Hardcoded path!
data = []
with open("/Users/steve/Downloads/sales.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)

# 2. Inline cleaning: no functions!
clean = []
for row in data:
    if row["product_name"].strip() == "":
        continue
    row["product_name"] = row["product_name"].strip().title()
    row["price"] = float(row["price"])
    row["quantity"] = int(row["quantity"])
    if row["price"] < 0:
        continue
    row["revenue"] = row["price"] * row["quantity"]
    row["vat"] = row["revenue"] * 0.21
    clean.append(row)

# 3. Summary: mixed with everything else!
total = 0
for row in clean:
    total += row["revenue"]
print("Total revenue:", total)

# 4. Side effect!
with open("clean.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=clean[0].keys())
    writer.writeheader()
    writer.writerows(clean)
print("Done")
Notice the problems: hardcoded paths, no functions, mutation everywhere, magic numbers, no error handling, no tests.
<aside>
💡 What you write vs. what is provided. The repo gives you stub files with function signatures, docstrings, and raise NotImplementedError placeholders. Your job is to implement them. Read each file before you start writing: the stubs tell you exactly what each function should do and where it lives.
</aside>
All pipeline code lives in task-1/. Work inside that folder.
Open task-1/src/config.py. Implement the two TODOs:
- Load .env using python-dotenv
- Expose INPUT_PATH and OUTPUT_PATH as module-level constants; raise ValueError if either is missing

The .env.example in task-1/ shows the expected variable names. Copy it to .env and fill in the paths before running the pipeline.
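A minimal sketch of the "raise ValueError if missing" half of this step. It assumes python-dotenv's load_dotenv() has already copied the .env values into os.environ, and it uses a plain dict in place of the real environment so it runs anywhere. The helper name require_env is hypothetical: the stub in config.py may be organized differently.

```python
import os
from collections.abc import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of `name`, or raise ValueError if it is missing or blank.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = env.get(name, "").strip()
    if not value:
        raise ValueError(f"Missing required environment variable: {name}")
    return value


# A plain dict stands in for os.environ after load_dotenv() has run.
fake_env = {"INPUT_PATH": "data/messy_sales.csv"}

input_path = require_env("INPUT_PATH", fake_env)
print(input_path)  # data/messy_sales.csv

try:
    require_env("OUTPUT_PATH", fake_env)
except ValueError as exc:
    print(exc)  # Missing required environment variable: OUTPUT_PATH
```

Failing at import time with a clear message is usually easier to debug than a FileNotFoundError halfway through the pipeline.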
<aside>
⚠️ Never commit your .env file. The .gitignore entry in the repo root already excludes it.
</aside>
Open task-1/src/models.py. Replace the stub with the full Transaction dataclass:
from dataclasses import dataclass


@dataclass
class Transaction:
    transaction_id: int
    product_name: str
    category: str
    price: float
    quantity: int
    customer_email: str
    date: str
    revenue: float = 0.0
    vat: float = 0.0
Add __post_init__ validation:
- price must be >= 0 (raise ValueError if negative)
- product_name must not be empty (raise ValueError)

Open task-1/src/transforms.py. Implement the four pure functions, each taking a list of dicts and returning a new list (no mutation):
def remove_invalid(rows: list[dict]) -> list[dict]:
    """Remove rows with empty product_name or negative price."""


def clean_fields(rows: list[dict]) -> list[dict]:
    """Strip/title-case product_name, lowercase email, default missing category to 'Unknown'."""


def calculate_revenue(rows: list[dict], vat_rate: float = 0.21) -> list[dict]:
    """Add 'revenue' (price * quantity) and 'vat' (revenue * vat_rate) fields."""


def filter_zero_quantity(rows: list[dict]) -> list[dict]:
    """Remove rows where quantity is 0."""
Chain them in this order in your pipeline:
data = remove_invalid(raw_rows)
data = clean_fields(data)
data = filter_zero_quantity(data)
data = calculate_revenue(data)
<aside> 💡 This is why pure functions matter: you can always go back to the original data if something goes wrong in one of the steps.
</aside>
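To make "returning a new list (no mutation)" concrete, here is a small sketch of the pattern. The transform uppercase_category is hypothetical and is not one of the four functions you need to implement; it only shows one way to build new rows instead of editing the ones passed in.

```python
def uppercase_category(rows: list[dict]) -> list[dict]:
    """Hypothetical transform: return new rows with 'category' upper-cased."""
    # {**row, ...} builds a fresh dict per row, so the input rows are never modified.
    return [{**row, "category": row["category"].upper()} for row in rows]


raw_rows = [{"product_name": "Laptop", "category": "electronics"}]
result = uppercase_category(raw_rows)

print(raw_rows[0]["category"])  # electronics (original untouched)
print(result[0]["category"])    # ELECTRONICS
```

The `{**row, ...}` copy is shallow, which is enough here because every value in a row is a plain string or number.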
Open task-1/src/pipeline.py. Implement the I/O functions and the run() orchestrator so the pipeline:
- Loads the paths from config.py
- Reads task-1/data/messy_sales.csv into a list of dicts
- Applies the transforms from transforms.py
- Converts the rows to Transaction dataclass instances
- Writes the result to task-1/output/clean_sales.csv

Keep I/O (reading/writing files) separate from logic (transforms). The transform functions should never open files or print anything.
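One possible shape for the I/O side, as a sketch only. The function names read_rows and write_rows are hypothetical (the stubs in pipeline.py define the real names and signatures), and the demo writes to a temporary folder rather than to task-1/output/.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path: Path) -> list[dict]:
    """Read a CSV file into a list of dicts (I/O only, no cleaning)."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def write_rows(path: Path, rows: list[dict]) -> None:
    """Write rows to a CSV file, creating the parent folder if needed."""
    if not rows:
        raise ValueError("No rows to write")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Round trip in a temporary folder so the sketch touches no real project files.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "output" / "demo.csv"
    write_rows(out_path, [{"product_name": "Laptop", "price": 999.99}])
    rows_back = read_rows(out_path)

print(rows_back)  # [{'product_name': 'Laptop', 'price': '999.99'}]
```

Note that csv.DictReader returns every value as a string, so price and quantity need an explicit float() or int() conversion before they fit the Transaction fields.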
Open task-1/tests/test_transforms.py. Implement the four test stubs:
- test_remove_invalid_drops_empty_names: verify rows with empty product names are removed
- test_clean_fields_normalizes_names: verify product names get stripped and title-cased
- test_calculate_revenue_adds_fields: verify revenue and VAT are calculated correctly
- test_no_mutation: verify the original list is unchanged after running transforms

Example structure for the first test:
def test_remove_invalid_drops_empty_names():
    data = [
        {"product_name": "Laptop", "price": 999.99},
        {"product_name": "", "price": 50.0},
        {"product_name": " ", "price": 25.0},
    ]
    result = remove_invalid(data)
    assert len(result) == 1
    assert result[0]["product_name"] == "Laptop"
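A no-mutation check can follow a similar structure: take an independent copy of the input before calling the transform, then compare afterwards. The sketch below uses a hypothetical stand-in transform, strip_names, so that it runs on its own; in your tests you would call the real functions from transforms.py instead.

```python
import copy


def strip_names(rows: list[dict]) -> list[dict]:
    """Hypothetical stand-in transform so the sketch runs on its own."""
    return [{**row, "product_name": row["product_name"].strip()} for row in rows]


def test_strip_names_does_not_mutate_input():
    data = [{"product_name": "  Laptop  ", "price": 999.99}]
    snapshot = copy.deepcopy(data)  # independent copy taken before the call

    result = strip_names(data)

    assert data == snapshot  # the input is exactly as it was
    assert result[0]["product_name"] == "Laptop"
    assert result[0] is not data[0]  # a new dict, not the same object


test_strip_names_does_not_mutate_input()  # pytest would discover this automatically
```

copy.deepcopy matters here: comparing the input against a reference to the same list would always pass, even if the transform had changed it.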
Run from the task-1/ folder:
cd task-1
pytest tests/
<aside>
⌨️ Hands on: Want a challenge? Add a filter_by_date_range(rows, start, end) transform that only keeps transactions within a date range. Write a test for it.
</aside>
Open task-2/AI_DEBUG.md. While building Task 1, capture one debugging session where you used an LLM to fix a real bug. Fill in all four sections of that file.
<aside>
⚠️ Using AI to help: fictional data like messy_sales.csv is safe to paste. Never share real customer data or PII with an LLM.
</aside>
In later weeks you will move data between Python scripts and Azure Blob Storage programmatically. Before you write a single line of SDK code, get comfortable doing it from the terminal using the Azure CLI and manually via the Azure Portal.
<aside>
💡 Codespaces: az is pre-installed. Run az login --use-device-code and follow the prompts (see AZURE_LOGIN.md).
</aside>
Make sure you are in the repo root (cd .. if you are still inside task-1), and run your pipeline so task-1/output/clean_sales.csv exists.
Check which account you are logged in with:
az account show --query "{user:user.name, subscription:name, tenant:homeTenantId}" --output table
If the output does not show your HackYourFuture account, follow the steps in AZURE_LOGIN.md to log in.
Create your own container. Replace <your-github-username> with your exact GitHub username: all students share one storage account, so the name must be unique.
az storage container create \
    --account-name sthyfstudentsdemo \
    --name week2-<your-github-username> \
    --auth-mode login \
    --output json
A successful create prints {"created": true}.
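If the create command rejects your container name, the naming rules are a likely cause: Azure container names must be 3 to 63 characters long, use only lowercase letters, digits, and hyphens, start and end with a letter or digit, and contain no consecutive hyphens. The sketch below checks a name before you run the command. The helper container_name_for is hypothetical, and lowercasing the username is an assumption; confirm with your teacher if your GitHub username contains capitals.

```python
import re

# Azure container naming rules: 3-63 characters, lowercase letters, digits and
# hyphens only, starting and ending with a letter or digit, no consecutive hyphens.
CONTAINER_NAME_RE = re.compile(r"^[a-z0-9](?:[a-z0-9]|-(?=[a-z0-9])){2,62}$")


def container_name_for(username: str, prefix: str = "week2-") -> str:
    """Hypothetical helper: build and validate the container name for a username."""
    # Lowercasing is an assumption made for this sketch, not a course rule.
    name = f"{prefix}{username.strip().lower()}"
    if not CONTAINER_NAME_RE.fullmatch(name):
        raise ValueError(f"Not a valid Azure container name: {name!r}")
    return name


print(container_name_for("OctoCat"))  # week2-octocat
```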
Upload the cleaned CSV to your container:
az storage blob upload \
    --account-name sthyfstudentsdemo \
    --container-name week2-<your-github-username> \
    --file task-1/output/clean_sales.csv \
    --name clean_sales.csv \
    --overwrite true \
    --auth-mode login \
    --output json
A successful upload prints the blob properties as JSON (look for "etag" in the output).
Get the URL of the uploaded blob:
az storage blob url \
    --account-name sthyfstudentsdemo \
    --container-name week2-<your-github-username> \
    --name clean_sales.csv \
    --auth-mode login \
    --output tsv
Save the printed URL in task-3/assets/blob_url.txt.

Now upload the original messy data using the Azure Portal, so you can see the container you just created with the CLI.
- Open the storage account in the Azure Portal (search for sthyfstudentsdemo in the top search bar, or find it in the rg-hyf-students-readonly resource group).
- Open the week2-<your-github-username> container (the one you created via the CLI!). You should see clean_sales.csv already inside it.
- Select task-1/data/messy_sales.csv from your local machine and upload it.
- Take a screenshot of the container with both clean_sales.csv and messy_sales.csv visible. Make sure your account email is visible in the top-right corner.