Week 1 Assignment: The Data Cleaning Pipeline
In this assignment, you will build a robust command-line tool to clean a "messy" dataset. This mimics a very common real-world task for data engineers: taking raw, inconsistent data and transforming it into a clean, usable format. You will also document one debugging session with an LLM, and prove that your access to the HackYourFuture Azure tenant works.
You will work in a fresh GitHub repository for each assignment and hand in your work as a Pull Request on your own copy. Create a working branch with git switch -c week1-attempt and open the PR against main. The auto-grader runs on PR open + every push and posts a score comment within ~30 seconds.

<aside>
💡 You can run the same checks the auto-grader runs locally before pushing: bash .hyf/test.sh && cat .hyf/score.json. Iterate until pass: true.
</aside>
<aside>
💡 What you write vs. what is provided. The structural patterns (CLI argument parsing, logging setup, the read/clean/write loop, pathlib, the if __name__ == "__main__": guard, the per-row skip-on-validation-fail logic) are already implemented in task-1/src/cleaner.py. Your job is to fill in the four helper functions in task-1/src/utils.py so the cleaner produces correct output. Read cleaner.py first to understand the orchestration patterns you will reuse in later weeks; they are part of the "Technical Requirements" the auto-grader checks for.
</aside>
Inside task-1/, you will find:
- data/messy_users.csv: the input dataset (already committed; do not edit).
- src/cleaner.py: the entry point that reads the CSV, calls helpers, and writes JSON. Already implemented for you: read it before changing anything.
- src/utils.py: pure functions for cleaning individual fields. You implement these four functions.
- output/: the cleaner writes clean_users.json here.

The dataset is full of errors: whitespace issues, inconsistent capitalization, missing fields, and badly formatted numbers. Your job is to fill in the four helpers in utils.py so the cleaner produces a clean JSON file.
A peek at what you are up against (first few rows of messy_users.csv):
```
id,name,email,department,salary
1, Alice Johnson ,[email protected],Engineering,85000
4,"David, Jr.",[email protected],Sales,"68,000"
6, FRANK WILSON ,[email protected],marketing, 95000
```
You will find: leading/trailing whitespace, names with commas inside quoted fields, missing departments, salaries wrapped in quotes with thousand-separators, and a few formats the rules below do not literally cover.
<aside>
💡 The split into utils.py (pure functions) plus cleaner.py (orchestration) is the separation of concerns pattern you will use all term. Pure functions are easy to test; the orchestrator glues them together.
</aside>
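To make that concrete: a pure function can be checked with a one-line assert, no files or fixtures needed. A hypothetical pytest-style test, assuming utils.py exposes a clean_name helper (only clean_salary is named by this assignment) and that names are title-cased, as the sample output suggests:

```python
# Hypothetical test: pure functions need no I/O, fixtures, or setup.
# clean_name is an assumed helper name; title-casing is an assumption
# based on the sample output.
from utils import clean_name


def test_clean_name_strips_and_normalizes():
    assert clean_name(" FRANK WILSON ") == "Frank Wilson"
```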
"Unknown"." or , (e.g. "68,000" becomes integer 68000).int. If the value is empty or N/A, store null (Python None).name is empty (after cleaning), skip the row and log a warning.email is empty (after cleaning), skip the row and log a warning.<aside>
⚠️ The salary column is the canonical type-confusion trap from this week. You will see "68,000", N/A, blank cells, and quoted strings. Cast safely at the boundary: see Gotcha #10: The CSV String Trap before you start.
</aside>
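A hedged sketch of how three of the four helpers might look under those rules. Only clean_salary is named by the assignment, so the function names here (and the title-casing choice) are assumptions drawn from the rules and sample data above:

```python
# Hypothetical sketches of three of the four helpers. The names
# clean_name, clean_email, and clean_department are assumptions;
# only clean_salary is named by the assignment.
def clean_name(raw: str) -> str:
    """' FRANK WILSON ' -> 'Frank Wilson': strip, then normalize case."""
    return raw.strip().title()


def clean_email(raw: str) -> str:
    """Strip surrounding whitespace and lowercase the address."""
    return raw.strip().lower()


def clean_department(raw: str) -> str:
    """Strip and normalize case; empty or missing cells become 'Unknown'."""
    cleaned = raw.strip().title()
    return cleaned if cleaned else "Unknown"
```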
The cleaning rules above describe the shapes you will encounter, but the dataset has at least one row whose format is not literally covered by the rules. Plan for failure modes:
<aside>
💡 Expect at least one row to break your first attempt. Some salary cells contain unexpected formats the rules above do not literally describe (e.g. a value with a period inside it). When int() raises ValueError on a row, your clean_salary function should return None for that row's salary instead of crashing the whole script: wrap the conversion in try/except ValueError. The exact ValueError message you get the first time is excellent material for your Task 2 AI Debug Report: capture the traceback before you fix it.
</aside>
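A minimal sketch of that pattern; clean_salary is the name the assignment uses, but the exact stripping steps are assumptions about the cells you will meet:

```python
# Minimal sketch of the boundary cast described above. The stripping
# rules (quotes, commas, N/A) are assumptions about the messy cells.
def clean_salary(raw: str) -> int | None:
    """'85000', ' 95000', '"68,000"', 'N/A', '' -> int or None."""
    cleaned = raw.strip().strip('"').replace(",", "")
    if not cleaned or cleaned.upper() == "N/A":
        return None
    try:
        return int(cleaned)
    except ValueError:
        # Unexpected formats (e.g. a value with a period inside it)
        # land here: return None instead of crashing the whole run.
        return None
```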
Technical requirements:

- utils.py should only contain pure helper functions. cleaner.py imports from utils and orchestrates.
- Guard the entry point with if __name__ == "__main__": in cleaner.py.
- Use pathlib.Path instead of string concatenation. Path("data") / "messy_users.csv" works on macOS, Linux, and Windows; "data/" + filename does not.
- Accept --input and --output flags. The auto-grader runs python3 src/cleaner.py --input data/messy_users.csv --output output/clean_users.json from the task-1/ directory.
- Use the logging module (not print) to report progress: INFO logs for status, WARNING logs for skipped rows.
- Use try/except to handle missing input files gracefully.
- output/clean_users.json must be a JSON array of objects, one per cleaned row. Example:
```json
[
  {
    "id": 1,
    "name": "Alice Johnson",
    "email": "[email protected]",
    "department": "Engineering",
    "salary": 85000
  }
]
```
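For orientation only, here is a compressed sketch of the shape those requirements describe. The real task-1/src/cleaner.py is already provided and more complete (read that one); helper names other than clean_salary are assumptions:

```python
# Illustrative sketch only: the provided task-1/src/cleaner.py is the
# version you should read. This just makes the required patterns
# concrete: argparse flags, pathlib, logging, the __main__ guard,
# skip-and-warn rows, and try/except on a missing input file.
import argparse
import csv
import json
import logging
from pathlib import Path

from utils import clean_name, clean_email, clean_department, clean_salary

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()

    in_path, out_path = Path(args.input), Path(args.output)
    try:
        with in_path.open(newline="") as f:
            rows = list(csv.DictReader(f))
    except FileNotFoundError:
        logger.error("Input file not found: %s", in_path)
        return

    cleaned = []
    for row in rows:
        name = clean_name(row["name"])
        email = clean_email(row["email"])
        if not name or not email:
            logger.warning("Skipping row id=%s: missing name or email", row.get("id"))
            continue
        cleaned.append({
            "id": int(row["id"]),
            "name": name,
            "email": email,
            "department": clean_department(row["department"]),
            "salary": clean_salary(row["salary"]),
        })

    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(cleaned, indent=2))
    logger.info("Wrote %d cleaned rows to %s", len(cleaned), out_path)


if __name__ == "__main__":
    main()
```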
<aside> 💡 Using AI to help: when you paste a stack trace into an LLM to debug a parsing error, the messy_users.csv data is fictional and safe to share (⚠️ but on real-world data at work, never paste names, emails, IDs, or other PII without redacting first). The practice data is fictional; the habit is not.
</aside>
Practise using AI as a tool for debugging, not just for generating code.

Create task-2/AI_DEBUG.md and fill in the four sections: The Error, The Prompt, The Solution, and Reflection. The auto-grader checks that the file exists, that all four section headers are present, and that the file is non-trivially long (i.e. that you have actually filled it in).
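If you want a starting point, a skeleton with the four required sections might look like this; the section names come from the grading table, but the heading level and the comment hints are assumptions:

```markdown
# AI Debug Report

## The Error
<!-- The full traceback, captured before you fixed it. -->

## The Prompt
<!-- What you actually asked the LLM, including the context you pasted. -->

## The Solution
<!-- What fixed it, and whether the LLM's first suggestion worked. -->

## Reflection
<!-- What you learned from the exchange. -->
```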
You will receive an email invite to the HackYourFuture Azure tenant before Week 1 starts. This task verifies the invite worked and you can navigate the portal.
<aside> 💡 Do this task whenever your invite arrives. Tasks 1 and 2 do not depend on Azure access, so if the invite is still in flight, work on the cleaner pipeline first and come back to Task 3 once the email lands.
</aside>
1. Find the invite email (subject: "You're invited to join HackYourFuture"). Check your spam folder.
2. Accept the invite and sign in to the Azure portal, switching to the HackYourFuture directory.
3. Search for rg-hyf-students-readonly and click the resource group. You should see a demo storage account inside.
4. Take a screenshot that shows the HackYourFuture directory name and the rg-hyf-students-readonly view with at least one resource visible.
5. Save the screenshot as task-3/azure_proof.png in your repo. JPG is also accepted.

<aside>
</aside>
The auto-grader runs on every push to your PR and posts a score comment. Total = 100, passing = 60.
| Task | Weight | What the grader checks |
|---|---|---|
| Task 1 | 60 | task-1/src/cleaner.py and task-1/src/utils.py exist; cleaner.py imports logging and uses pathlib; utils.py defines real helper functions; running the cleaner against task-1/data/messy_users.csv writes valid JSON to task-1/output/clean_users.json; the output passes structural checks (≥10 rows, no whitespace in names, lowercase emails, integer or null salaries, no missing names or emails, no duplicate emails). |
| Task 2 | 20 | task-2/AI_DEBUG.md exists, has all four required sections (The Error, The Prompt, The Solution, Reflection), and is non-trivially filled in. |
| Task 3 | 20 | task-3/azure_proof.png (or .jpg/.jpeg) exists and is >0 bytes. Teachers spot-check the image contents during PR review. |
You can run the grader locally with bash .hyf/test.sh && cat .hyf/score.json to see the same per-task breakdown the workflow produces.
<aside>
❗ The auto-grader is a smoke test, not your grade. A passing auto-grader means your submission is reviewable, not that you have passed the assignment. Teachers manually review every PR for code quality, modularity, type hints, real logging use, error handling, AI debugging substance (did the reflection show actual understanding?), and the authenticity of the Azure screenshot. A submission that scores 100 from the auto-grader but consists of LLM-generated boilerplate or a hardcoded JSON output will not pass the manual review. Spend your time understanding what you build, not optimising for the grader.
</aside>