🎒 Week 2 Assignment: Clean Pipeline

How to start

Each assignment lives in its own GitHub repository. You hand in your work as a Pull Request on your own copy, and your teacher reviews the PR directly.

The Week 2 assignment template repo lives in the HYF organization. Your cohort uses a fork of this template in the HackYourAssignment organization: your teacher will share the link to your cohort's fork at the start of the week. Fork your cohort's repo (not the HYF one) so that your PR targets the correct organization.

<aside> 💡 Can't find the cohort link yet? You can still read the task structure and plan your solution in the HYF repo while you wait.

</aside>

  1. Fork your cohort's repo from the HackYourAssignment organization into your own GitHub account.
  2. Clone your fork locally.
  3. Create a feature branch: git switch -c week2-attempt.
  4. Work through the three tasks below.
  5. Commit, push, and open a Pull Request against your fork's main. Share the PR URL with your teacher once you are ready for review.

<aside> 💡 Working in Codespaces? The Azure CLI is pre-installed in the Codespace. Before starting Task 3, run az login --use-device-code and sign in with the HackYourFuture credentials your teacher provided. See AZURE_LOGIN.md for step-by-step instructions.

</aside>

The Scenario

You have just been hired as a Data Engineer at "MessyCorp Inc." The previous engineer left a single script running on his laptop that loads sales data, cleans it, calculates revenue, and dumps the results to a CSV. He emailed you the file and said: "It works fine, just don't change the folder names."

Your manager wants you to turn this into a professional, testable pipeline before the team grows. The data is messy (whitespace, missing fields, bad values), the code is worse, and there are no tests. Time to fix all of it.

The Input Data

The sales data is already in the repo at task-1/data/messy_sales.csv. It contains 15 transactions with intentional problems such as stray whitespace, missing fields, and bad values.

The Messy Script (Starter Code)

This is what you are refactoring. Do not use this code directly. Study it, understand what it does, then rebuild it properly. The path is intentionally broken (it points at the previous engineer's laptop), so the script fails at the first open() call until you replace it with INPUT_PATH from your config.

import csv

# 1. Hardcoded path!
data = []
with open("/Users/steve/Downloads/sales.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)

# 2. Inline cleaning: no functions!
clean = []
for row in data:
    if row["product_name"].strip() == "":
        continue
    row["product_name"] = row["product_name"].strip().title()
    row["price"] = float(row["price"])
    row["quantity"] = int(row["quantity"])
    if row["price"] < 0:
        continue
    row["revenue"] = row["price"] * row["quantity"]
    row["vat"] = row["revenue"] * 0.21
    clean.append(row)

# 3. Summary: mixed with everything else!
total = 0
for row in clean:
    total += row["revenue"]
print("Total revenue:", total)

# 4. Side effect!
with open("clean.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=clean[0].keys())
    writer.writeheader()
    writer.writerows(clean)
print("Done")

Notice the problems: hardcoded paths, no functions, mutation everywhere, magic numbers, no error handling, no tests.
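The "no error handling" problem shows up in the `float(row["price"])` line: a single non-numeric value raises `ValueError` and stops the whole script. Below is a minimal sketch of one way to handle that. The helper name `parse_price` and the choice to return `None` for unparseable values are assumptions for illustration and are not part of the starter repo.

```python
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Return the price as a float, or None when the text is not a number.

    Hypothetical helper: the name and the None convention are assumptions,
    not something defined by the assignment stubs.
    """
    try:
        return float(raw.strip())
    except (ValueError, AttributeError):
        # ValueError: text such as "abc" or ""; AttributeError: raw is None
        return None


prices = [parse_price(text) for text in [" 19.99 ", "abc", "", "-5"]]
print(prices)  # [19.99, None, None, -5.0]
```

Note that a negative price still parses successfully. Deciding whether a parsed value is acceptable is a separate concern from deciding whether it is a number at all.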

Task 1: The Clean Pipeline (60 points)

<aside> 💡 What you write vs. what is provided. The repo gives you stub files with function signatures, docstrings, and raise NotImplementedError placeholders. Your job is to implement them. Read each file before you start writing: the stubs tell you exactly what each function should do and where it lives.

</aside>

All pipeline code lives in task-1/. Work inside that folder.

1a: Configuration (Configuration & Secrets)

Open task-1/src/config.py. Implement the two TODOs:

The .env.example in task-1/ shows the expected variable names. Copy it to .env and fill in the paths before running the pipeline.

<aside> ⚠️ Never commit your .env file. The .gitignore entry in the repo root already excludes it.

</aside>
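As a rough illustration of the configuration idea, the sketch below reads paths from environment variables and fails early when one is missing. `INPUT_PATH` is mentioned above. `OUTPUT_PATH`, the `Config` class, and `load_config` are hypothetical names, so check .env.example and the stub for the real ones. The sketch also assumes the .env file has already been loaded into the environment, for example with python-dotenv's `load_dotenv()` if that is what your setup uses.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Config:
    """Immutable settings object (hypothetical shape, for illustration only)."""
    input_path: Path
    output_path: Path


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build a Config from environment variables, failing loudly if one is missing."""
    # OUTPUT_PATH is an assumed variable name; INPUT_PATH comes from the assignment text.
    missing = [name for name in ("INPUT_PATH", "OUTPUT_PATH") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Config(Path(env["INPUT_PATH"]), Path(env["OUTPUT_PATH"]))


# A plain dict stands in for the real environment here.
config = load_config(
    {"INPUT_PATH": "data/messy_sales.csv", "OUTPUT_PATH": "output/clean_sales.csv"}
)
print(config.input_path.name)  # messy_sales.csv
```

Accepting the environment as a parameter keeps the function easy to test: a test can pass a small dict instead of modifying the real `os.environ`.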

1b: Data Model (Dataclasses for Data Objects)

Open task-1/src/models.py. Replace the stub with the full Transaction dataclass:

@dataclass
class Transaction:
    transaction_id: int
    product_name: str
    category: str
    price: float
    quantity: int
    customer_email: str
    date: str
    revenue: float = 0.0
    vat: float = 0.0

Add __post_init__ validation:
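The exact checks to implement are specified in the models.py stub. As a general illustration of how `__post_init__` validation works, here is a sketch on a simplified stand-in class. `LineItem` and its three rules are assumptions chosen for the example, not the required rules for `Transaction`.

```python
from dataclasses import dataclass


@dataclass
class LineItem:
    """Simplified stand-in for Transaction; the rules below are assumptions."""
    product_name: str
    price: float
    quantity: int

    def __post_init__(self) -> None:
        # Runs automatically right after the generated __init__ has set the fields.
        if not self.product_name.strip():
            raise ValueError("product_name must not be empty")
        if self.price < 0:
            raise ValueError(f"price must be >= 0, got {self.price}")
        if self.quantity < 0:
            raise ValueError(f"quantity must be >= 0, got {self.quantity}")


item = LineItem(product_name="Laptop", price=999.99, quantity=2)
print(item)

try:
    LineItem(product_name="Mouse", price=-1.0, quantity=1)
except ValueError as error:
    print(f"rejected: {error}")  # rejected: price must be >= 0, got -1.0
```

Because the check lives in the class itself, an invalid object can never be constructed, no matter which part of the pipeline creates it.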

1c: Composable Transform Functions (Separation of Concerns, Functional Composition)

Open task-1/src/transforms.py. Implement the four pure functions, each taking a list of dicts and returning a new list (no mutation):

def remove_invalid(rows: list[dict]) -> list[dict]:
    """Remove rows with empty product_name or negative price."""

def clean_fields(rows: list[dict]) -> list[dict]:
    """Strip/title-case product_name, lowercase email, default missing category to 'Unknown'."""

def calculate_revenue(rows: list[dict], vat_rate: float = 0.21) -> list[dict]:
    """Add 'revenue' (price * quantity) and 'vat' (revenue * vat_rate) fields."""

def filter_zero_quantity(rows: list[dict]) -> list[dict]:
    """Remove rows where quantity is 0."""

Chain them in this order in your pipeline:

data = remove_invalid(raw_rows)
data = clean_fields(data)
data = filter_zero_quantity(data)
data = calculate_revenue(data)

<aside> 💡 This is why pure functions matter: you can always go back to the original data if something goes wrong in one of the steps.

</aside>
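To make the "no mutation" idea concrete without giving away the four assignment functions, the sketch below uses two hypothetical transforms on a made-up `name` field, plus a small `compose` helper built on `functools.reduce`. All three function names are illustrative assumptions.

```python
from functools import reduce
from typing import Callable

Transform = Callable[[list[dict]], list[dict]]


def strip_names(rows: list[dict]) -> list[dict]:
    """Return new rows with whitespace trimmed from 'name' (hypothetical field)."""
    return [{**row, "name": row["name"].strip()} for row in rows]


def drop_empty_names(rows: list[dict]) -> list[dict]:
    """Return copies of the rows whose 'name' is non-empty; inputs stay untouched."""
    return [dict(row) for row in rows if row["name"]]


def compose(*steps: Transform) -> Transform:
    """Combine steps into one function that applies them left to right."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)


raw = [{"name": "  laptop "}, {"name": "   "}]
clean = compose(strip_names, drop_empty_names)(raw)

print(clean)  # [{'name': 'laptop'}]
print(raw)    # [{'name': '  laptop '}, {'name': '   '}]
```

The second print is the point: `raw` is exactly what it was before the chain ran, so any step can be rerun or debugged against the original input.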

1d: Pipeline Entry Point (Separation of Concerns)

Open task-1/src/pipeline.py. Implement the I/O functions and the run() orchestrator so the pipeline:

  1. Loads configuration from config.py
  2. Reads task-1/data/messy_sales.csv into a list of dicts
  3. Runs the transform chain from transforms.py
  4. Converts cleaned dicts to Transaction dataclass instances
  5. Writes results to task-1/output/clean_sales.csv
  6. Prints a summary: total transactions, total revenue, total VAT

Keep I/O (reading/writing files) separate from logic (transforms). The transform functions should never open files or print anything.
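One possible shape for that separation is sketched below, using only the standard library. The function names `read_rows` and `write_rows` and the `steps` parameter are assumptions, since the real signatures are in the pipeline.py stub. The sketch also skips the conversion to `Transaction` instances and the revenue and VAT totals to stay short.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable


def read_rows(path: Path) -> list[dict]:
    """I/O only: load every CSV row as a dict of strings."""
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def write_rows(path: Path, rows: list[dict]) -> None:
    """I/O only: write rows to CSV, creating the output folder if needed."""
    if not rows:
        # The starter script would crash on clean[0] here; fail with a clear message.
        raise ValueError("no rows to write")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def run(
    input_path: Path,
    output_path: Path,
    steps: Iterable[Callable[[list[dict]], list[dict]]],
) -> int:
    """Orchestrator: read, apply each transform in order, write, report."""
    rows = read_rows(input_path)
    for step in steps:
        rows = step(rows)  # transforms never touch files or print
    write_rows(output_path, rows)
    print(f"Wrote {len(rows)} rows to {output_path}")
    return len(rows)
```

With this layout, only `run` knows about both files and transforms. The transforms can be tested with plain lists, and the I/O functions can be tested with a temporary folder.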

1e: Tests (Testing with Pytest)

Open task-1/tests/test_transforms.py. Implement the four test stubs:

  1. test_remove_invalid_drops_empty_names: verify rows with empty product names are removed
  2. test_clean_fields_normalizes_names: verify product names get stripped and title-cased
  3. test_calculate_revenue_adds_fields: verify revenue and VAT are calculated correctly
  4. test_no_mutation: verify the original list is unchanged after running transforms

Example structure for the first test:

def test_remove_invalid_drops_empty_names():
    data = [
        {"product_name": "Laptop", "price": 999.99},
        {"product_name": "", "price": 50.0},
        {"product_name": "  ", "price": 25.0},
    ]
    result = remove_invalid(data)
    assert len(result) == 1
    assert result[0]["product_name"] == "Laptop"

Run from the task-1/ folder:

cd task-1
pytest tests/
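For the no-mutation test, one common pattern is to take a deep copy of the input before calling the transform and compare afterwards. The sketch below shows the pattern on a hypothetical `add_total` transform rather than on the assignment's own functions. The function name and the field names are assumptions.

```python
import copy


def add_total(rows: list[dict]) -> list[dict]:
    """Hypothetical pure transform: returns new rows with a 'total' field added."""
    return [{**row, "total": row["price"] * row["quantity"]} for row in rows]


def test_add_total_does_not_mutate_input():
    original = [{"price": 10.0, "quantity": 3}]
    snapshot = copy.deepcopy(original)  # independent copy taken before the call

    result = add_total(original)

    assert original == snapshot          # input is exactly as it was
    assert result[0]["total"] == 30.0    # output carries the new field
    assert result[0] is not original[0]  # a new dict, not the same object
```

A shallow copy (`list(original)`) would not be enough here: it copies the list but shares the dicts inside it, so an in-place change to a row would alter both and the test would pass by mistake.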

<aside> ⌨️ Hands on: Want a challenge? Add a filter_by_date_range(rows, start, end) transform that only keeps transactions within a date range. Write a test for it.

</aside>

Task 2: AI Debugging Report (20 points)

Open task-2/AI_DEBUG.md. While building Task 1, capture one debugging session where you used an LLM to fix a real bug. Fill in the four sections:

<aside> ⚠️ Using AI to help: fictional data like messy_sales.csv is safe to paste. Never share real customer data or PII with an LLM.

</aside>

Task 3: Upload Your Pipeline Output to Azure Blob Storage (20 points)

In later weeks you will move data between Python scripts and Azure Blob Storage programmatically. Before you write a single line of SDK code, get comfortable doing it from the terminal using the Azure CLI and manually via the Azure Portal.

3a: Programmatic Upload (Azure CLI)

<aside> 💡 Codespaces: az is pre-installed. Run az login --use-device-code and follow the prompts (see AZURE_LOGIN.md).

</aside>

  1. Ensure you are in the repository root (run cd .. if you are still inside task-1), and run your pipeline so task-1/output/clean_sales.csv exists.
  2. Verify you are logged in to the HackYourFuture Azure tenant and make note of your account email (you will need it for the screenshot):
   az account show --query "{user:user.name, subscription:name, tenant:homeTenantId}" --output table

If the output does not show your HackYourFuture account, follow the steps in AZURE_LOGIN.md to log in.

  3. Create a private container in the shared storage account. Replace <your-github-username> with your GitHub username, written in lowercase (Azure container names do not allow uppercase letters). All students share one storage account, so the name must be unique:
   az storage container create \
     --account-name sthyfstudentsdemo \
     --name week2-<your-github-username> \
     --auth-mode login \
     --output json

A successful create prints {"created": true}.

  4. Upload your cleaned CSV:
   az storage blob upload \
     --account-name sthyfstudentsdemo \
     --container-name week2-<your-github-username> \
     --file task-1/output/clean_sales.csv \
     --name clean_sales.csv \
     --overwrite true \
     --auth-mode login \
     --output json

A successful upload prints the blob properties as JSON (look for "etag" in the output).

  5. Print the blob URL and copy it to your clipboard:
   az storage blob url \
     --account-name sthyfstudentsdemo \
     --container-name week2-<your-github-username> \
     --name clean_sales.csv \
     --auth-mode login \
     --output tsv
  6. Paste the URL into task-3/assets/blob_url.txt.
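If you want to sanity-check the URL you pasted, the sketch below builds the URL you would expect for a given username. It assumes the standard Azure public-cloud endpoint format `https://<account>.blob.core.windows.net/<container>/<blob>` and encodes the container naming rules (3 to 63 characters, lowercase letters, digits, and single hyphens). The function name `expected_blob_url` and the example username are hypothetical.

```python
import re

# Assumption: standard public-cloud endpoint, https://<account>.blob.core.windows.net
ACCOUNT = "sthyfstudentsdemo"
BLOB_NAME = "clean_sales.csv"

# Container names: 3-63 chars, lowercase letters, digits and single hyphens,
# starting and ending with a letter or digit.
CONTAINER_RE = re.compile(r"^(?=.{3,63}$)[a-z0-9]+(-[a-z0-9]+)*$")


def expected_blob_url(github_username: str) -> str:
    """Return the URL the uploaded CSV should have for this username."""
    container = f"week2-{github_username}"
    if not CONTAINER_RE.match(container):
        raise ValueError(f"invalid container name: {container!r}")
    return f"https://{ACCOUNT}.blob.core.windows.net/{container}/{BLOB_NAME}"


print(expected_blob_url("octocat"))
# https://sthyfstudentsdemo.blob.core.windows.net/week2-octocat/clean_sales.csv
```

A username containing uppercase letters or a double hyphen raises `ValueError` here, which mirrors the rejection you would get from the container create command.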

3b: Manual Upload (Azure Portal)

Now upload the original messy data using the Azure Portal, so you can see the container you just created with the CLI.

  1. Log into portal.azure.com and switch to the HackYourFuture directory (top-right avatar → Switch directory).
  2. Navigate to your storage account (search for sthyfstudentsdemo in the top search bar, or find it in the rg-hyf-students-readonly resource group).
  3. In the left sidebar, click Storage browser → Blob containers.
  4. Open your week2-<your-github-username> container (the one you created via the CLI!). You should see clean_sales.csv already inside it.
  5. Click the Upload button. Select task-1/data/messy_sales.csv from your local machine and upload it.
  6. Take a screenshot showing your container contents with both clean_sales.csv and messy_sales.csv visible. Make sure your account email is visible in the top-right corner.