🎒 Week 1 Assignment: The Data Cleaning Pipeline

In this assignment, you will build a robust command-line tool to clean a "messy" dataset. This mimics a very common real-world task for data engineers: taking raw, inconsistent data and transforming it into a clean, usable format. You will also document one debugging session with an LLM, and prove that your access to the HackYourFuture Azure tenant works.

How to start

You will work in a fresh GitHub repository per assignment, hand in your work as a Pull Request on your own copy, and the auto-grader posts your score on the PR.

Open the Week 1 assignment template repo and click Use this template → Create a new repository under your own GitHub account.
Clone your new repository locally.
Create a feature branch: git switch -c week1-attempt.
Work through the three tasks below.
Commit, push, and open a Pull Request against your repo's main. The auto-grader runs on PR open + every push and posts a score comment within ~30 seconds.

<aside> 💡 You can run the same checks the auto-grader runs locally before pushing: bash .hyf/test.sh && cat .hyf/score.json. Iterate until pass: true.

</aside>

Task 1: The Cleaner Pipeline (60 points)

<aside> 💡 What you write vs. what is provided. The structural patterns (CLI argument parsing, logging setup, the read/clean/write loop, pathlib, the if __name__ == "__main__": guard, the per-row skip-on-validation-fail logic) are already implemented in task-1/src/cleaner.py. Your job is to fill in the four helper functions in task-1/src/utils.py so the cleaner produces correct output. Read cleaner.py first to understand the orchestration patterns you will reuse in later weeks; they are part of the "Technical Requirements" the auto-grader checks for.

</aside>

Inside task-1/, you will find:

data/messy_users.csv: the input dataset (already committed; do not edit).
src/cleaner.py: the entry point that reads the CSV, calls helpers, and writes JSON. Already implemented for you: read it before changing anything.
src/utils.py: pure functions for cleaning individual fields. You implement these four functions.
output/: the cleaner writes clean_users.json here.

The dataset is full of errors: whitespace issues, inconsistent capitalization, missing fields, and badly formatted numbers. Your job is to fill in the four helpers in utils.py so the cleaner produces a clean JSON file.

A peek at what you are up against (first few rows of messy_users.csv):

id,name,email,department,salary
1,  Alice Johnson ,[email protected],Engineering,85000
4,"David, Jr.",[email protected],Sales,"68,000"
6,  FRANK WILSON  ,[email protected],marketing,  95000

You will find: leading/trailing whitespace, names with commas inside quoted fields, missing departments, salaries wrapped in quotes with thousand-separators, and a few formats the rules below do not literally cover.

<aside> 💡 The split into utils.py (pure functions) plus cleaner.py (orchestration) is the separation of concerns pattern you will use all term. Pure functions are easy to test; the orchestrator glues them together.

</aside>

Cleaning rules

Name: Remove leading/trailing whitespace.
Email: Convert to lowercase. Strip whitespace.
Department: If missing or empty, set to "Unknown".
Salary:

Remove extra characters like " or , (e.g. "68,000" becomes integer 68000).
Convert to int. If the value is empty or N/A, store null (Python None).

Validation:

If name is empty (after cleaning), skip the row and log a warning.
If email is empty (after cleaning), skip the row and log a warning.

<aside> ⚠️ The salary column is the canonical type-confusion trap from this week. You will see "68,000", N/A, blank cells, and quoted strings. Cast safely at the boundary: see Gotcha #10: The CSV String Trap before you start.

</aside>

The cleaning rules above describe the shapes you will encounter, but the dataset has at least one row whose format is not literally covered by the rules. Plan for failure modes:

<aside> 💡 Expect at least one row to break your first attempt. Some salary cells contain unexpected formats the rules above do not literally describe (e.g. a value with a period inside it). When int() raises ValueError on a row, your clean_salary function should return None for that row's salary instead of crashing the whole script: wrap the conversion in try/except ValueError. The exact ValueError message you get the first time is excellent material for your Task 2 AI Debug Report: capture the traceback before you fix it.

</aside>

Technical requirements

Modularity: utils.py should only contain pure helper functions. cleaner.py imports from utils and orchestrates.
Main guard: Use if __name__ == "__main__": in cleaner.py.
Path handling: Use pathlib.Path instead of string concatenation. Path("data") / "messy_users.csv" works on macOS, Linux, and Windows; "data/" + filename does not.
CLI arguments: The script must accept --input and --output flags. The auto-grader runs python3 src/cleaner.py --input data/messy_users.csv --output output/clean_users.json from the task-1/ directory.
Logging: Use the logging module (not print) to report progress. INFO logs for status, WARNING logs for skipped rows.
Type hinting: All functions must have type hints.
Error handling: Use try/except to handle missing input files gracefully.

Output format

output/clean_users.json must be a JSON array of objects, one per cleaned row. Example:

[
  {
    "id": 1,
    "name": "Alice Johnson",
    "email": "[email protected]",
    "department": "Engineering",
    "salary": 85000
  }
]

<aside> 💡 Using AI to help: When pasting a stack trace into an LLM to debug a parsing error, the messy_users.csv data is fictional and safe to share (⚠️ but on real-world data at work, never paste names, emails, IDs, or other PII without redacting first). The training data is fictional; the habit is not.

</aside>

Task 2: AI Debugging Report (20 points)

Practise using AI as a tool for debugging, not just for generating code.

While working on Task 1, capture one debugging session where you used an LLM (ChatGPT, Claude, GitHub Copilot, etc.) to fix a real bug.
Open task-2/AI_DEBUG.md and fill in the four sections:

The Error: What went wrong? Paste the traceback.
The Prompt: What did you ask the AI? Include the code you pasted.
The Solution: What did the AI suggest? Did it work first try?
Reflection: Did you understand why it was broken, or did you just accept the fix?

The auto-grader checks that the file exists, that all four section headers are present, and that the file is non-trivially long (you have actually filled it in).

Task 3: Prove HYF Azure access (20 points)

You will receive an email invite to the HackYourFuture Azure tenant before Week 1 starts. This task verifies the invite worked and you can navigate the portal.

<aside> 💡 Do this task whenever your invite arrives. Tasks 1 and 2 do not depend on Azure access, so if the invite is still in flight, work on the cleaner pipeline first and come back to Task 3 once the email lands.

</aside>

Wait for the invite email from Microsoft. Subject typically: You're invited to join HackYourFuture. Check your spam folder.
Accept the invite. Click the link, sign in with your personal Microsoft account, approve the consent dialog.
Switch directory in the portal. Go to portal.azure.com. Top-right avatar → Switch directory → select HackYourFuture.
Verify Reader access. Search the top bar for rg-hyf-students-readonly and click the resource group. You should see a demo storage account inside.
Capture a screenshot showing all three of:

the directory switcher / breadcrumb showing HackYourFuture,
your avatar / email in the top-right,
the rg-hyf-students-readonly view with at least one resource visible.

Save it as task-3/azure_proof.png in your repo. JPG is also accepted.

<aside> ❗ If you have not received the invite within 24 hours of starting Week 1, message your teacher in Slack with the email address you enrolled with. They will re-issue the invite.

</aside>

Evaluation criteria

The auto-grader runs on every push to your PR and posts a score comment. Total = 100, passing = 60.

Task	Weight	What the grader checks
Task 1	60	`task-1/src/cleaner.py` and `task-1/src/utils.py` exist; the file imports `logging` and uses `pathlib`; `utils.py` defines real helper functions; running the cleaner against `task-1/data/messy_users.csv` writes valid JSON to `task-1/output/clean_users.json`; output passes structural checks (≥10 rows, no whitespace in names, lowercase emails, integer or null salaries, no missing names or emails, no duplicate emails).
Task 2	20	`task-2/AI_DEBUG.md` exists, has all four required sections (`The Error`, `The Prompt`, `The Solution`, `Reflection`), and is non-trivially filled in.
Task 3	20	`task-3/azure_proof.png` (or `.jpg`/`.jpeg`) exists and is >0 bytes. Teachers spot-check the image contents during PR review.

You can run the grader locally with bash .hyf/test.sh && cat .hyf/score.json to see the same per-task breakdown the workflow produces.

<aside> ❗ The auto-grader is a smoke test, not your grade. A passing auto-grader means your submission is reviewable, not that you have passed the assignment. Teachers manually review every PR for code quality, modularity, type hints, real logging use, error handling, AI debugging substance (did the reflection show actual understanding?), and the authenticity of the Azure screenshot. A submission that scores 100 from the auto-grader but consists of LLM-generated boilerplate or a hardcoded JSON output will not pass the manual review. Spend your time understanding what you build, not optimising for the grader.

</aside>

🎒 Week 1 Assignment: The Data Cleaning Pipeline

How to start

Task 1: The Cleaner Pipeline (60 points)

Cleaning rules

Technical requirements

Output format

Task 2: AI Debugging Report (20 points)

Task 3: Prove HYF Azure access (20 points)

Evaluation criteria

Submission