Week 1 - Python Foundations


🎒 Week 1 Assignment: The Data Cleaning Pipeline

In this assignment, you will build a robust command-line tool to clean a "messy" dataset. This mimics a very common real-world task for data engineers: taking raw, inconsistent data and transforming it into a clean, usable format. You will also document one debugging session with an LLM, and prove that your access to the HackYourFuture Azure tenant works.

How to start

Each assignment gets a fresh GitHub repository: you hand in your work as a Pull Request on your own copy, and the auto-grader posts your score on the PR.

  1. Open the Week 1 assignment template repo and click Use this template → Create a new repository under your own GitHub account.
  2. Clone your new repository locally.
  3. Create a feature branch: git switch -c week1-attempt.
  4. Work through the three tasks below.
  5. Commit, push, and open a Pull Request against your repo's main. The auto-grader runs on PR open + every push and posts a score comment within ~30 seconds.

<aside> 💡 You can run the same checks the auto-grader runs locally before pushing: bash .hyf/test.sh && cat .hyf/score.json. Iterate until pass: true.

</aside>

Task 1: The Cleaner Pipeline (60 points)

<aside> 💡 What you write vs. what is provided. The structural patterns (CLI argument parsing, logging setup, the read/clean/write loop, pathlib, the if __name__ == "__main__": guard, the per-row skip-on-validation-fail logic) are already implemented in task-1/src/cleaner.py. Your job is to fill in the four helper functions in task-1/src/utils.py so the cleaner produces correct output. Read cleaner.py first to understand the orchestration patterns you will reuse in later weeks; they are part of the "Technical Requirements" the auto-grader checks for.

</aside>
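The orchestration patterns named above can be sketched roughly as follows. This is an illustrative, simplified stand-in, not the actual cleaner.py from the repo: the helper body here is a placeholder for the real ones you will write in utils.py, and the column handling is trimmed to two fields.

```python
from __future__ import annotations

import argparse
import csv
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("cleaner")


def clean_name(raw: str) -> str:
    """Placeholder for one of the four helpers you implement in utils.py."""
    return raw.strip()


def clean_row(row: dict) -> dict | None:
    """Return a cleaned row, or None to skip it (skip-on-validation-fail)."""
    name = clean_name(row.get("name", ""))
    if not name:
        log.warning("Skipping row with missing name: %r", row)
        return None
    return {"id": int(row["id"]), "name": name}


def run(in_path: Path, out_path: Path) -> None:
    """The read/clean/write loop: CSV in, JSON array out."""
    with in_path.open(newline="") as f:
        rows = [r for r in (clean_row(raw) for raw in csv.DictReader(f)) if r]
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(rows, indent=2))
    log.info("Wrote %d rows to %s", len(rows), out_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Clean a messy CSV.")
    parser.add_argument("input", nargs="?", type=Path)
    parser.add_argument("output", nargs="?", type=Path)
    args = parser.parse_args()
    if args.input and args.output:
        run(args.input, args.output)
```

Note the shape: pure helpers at the top, one orchestrating function doing all the I/O, and the `if __name__ == "__main__":` guard keeping the module importable for tests.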

Inside task-1/, you will find:

- src/cleaner.py — the provided orchestrator (read this first)
- src/utils.py — the four helper function stubs you will implement
- data/messy_users.csv — the messy input dataset
- output/ — where clean_users.json must be written

The dataset is full of errors: whitespace issues, inconsistent capitalization, missing fields, and badly formatted numbers. Your job is to fill in the four helpers in utils.py so the cleaner produces a clean JSON file.

A peek at what you are up against (first few rows of messy_users.csv):

id,name,email,department,salary
1,  Alice Johnson ,[email protected],Engineering,85000
4,"David, Jr.",[email protected],Sales,"68,000"
6,  FRANK WILSON  ,[email protected],marketing,  95000

You will find: leading/trailing whitespace, names with commas inside quoted fields, missing departments, salaries wrapped in quotes with thousand-separators, and a few formats the rules below do not literally cover.

<aside> 💡 The split into utils.py (pure functions) plus cleaner.py (orchestration) is the separation of concerns pattern you will use all term. Pure functions are easy to test; the orchestrator glues them together.

</aside>
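For example, a pure helper like the clean_email you will write (rule 2 below: lowercase plus strip) can be exercised with plain asserts, no CSV file or CLI needed. The exact signature here is an assumption; check the stubs in utils.py:

```python
def clean_email(raw: str) -> str:
    """Lowercase and strip surrounding whitespace. Pure: no I/O, no globals."""
    return raw.strip().lower()


# Pure functions are trivially testable in isolation:
assert clean_email("  Alice.J@Example.COM ") == "alice.j@example.com"
assert clean_email("bob@example.com") == "bob@example.com"
```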

Cleaning rules

  1. Name: Remove leading/trailing whitespace.
  2. Email: Convert to lowercase. Strip whitespace.
  3. Department: If missing or empty, set to "Unknown".
  4. Salary: Strip whitespace and surrounding quotes, remove thousand-separator commas, and convert to an integer. If the value is missing, N/A, or cannot be converted, store null (None in Python).
  5. Validation: Skip any row that is missing a name or an email; log a warning for each skipped row instead of crashing.

<aside> ⚠️ The salary column is the canonical type-confusion trap from this week. You will see "68,000", N/A, blank cells, and quoted strings. Cast safely at the boundary: see Gotcha #10: The CSV String Trap before you start.

</aside>

The cleaning rules above describe the shapes you will encounter, but the dataset has at least one row whose format is not literally covered by the rules. Plan for failure modes:

<aside> 💡 Expect at least one row to break your first attempt. Some salary cells contain unexpected formats the rules above do not literally describe (e.g. a value with a period inside it). When int() raises ValueError on a row, your clean_salary function should return None for that row's salary instead of crashing the whole script: wrap the conversion in try/except ValueError. The exact ValueError message you get the first time is excellent material for your Task 2 AI Debug Report: capture the traceback before you fix it.

</aside>
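One possible shape for clean_salary, sketched under the assumptions above (strip quotes and whitespace, drop thousand-separator commas, return None on anything int() still rejects). Treat the exact set of formats as yours to discover in the data:

```python
from typing import Optional


def clean_salary(raw: str) -> Optional[int]:
    """Cast at the boundary: return an int, or None if the cell is unusable."""
    cleaned = raw.strip().strip('"').replace(",", "")
    if not cleaned or cleaned.upper() == "N/A":
        return None
    try:
        return int(cleaned)
    except ValueError:  # e.g. a value with a period in it, not covered by the rules
        return None


assert clean_salary('"68,000"') == 68000
assert clean_salary("  95000") == 95000
assert clean_salary("N/A") is None
assert clean_salary("85000.50") is None
```

The try/except wraps only the conversion, so one bad cell costs you one field, not the whole run.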

Technical requirements

Output format

output/clean_users.json must be a JSON array of objects, one per cleaned row. Example:

[
  {
    "id": 1,
    "name": "Alice Johnson",
    "email": "[email protected]",
    "department": "Engineering",
    "salary": 85000
  }
]
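Producing that file with pathlib and json serialization might look like the sketch below. cleaner.py already implements the write step; the row data here is made up for illustration:

```python
import json
from pathlib import Path

rows = [
    {"id": 1, "name": "Alice Johnson", "email": "alice@example.com",
     "department": "Engineering", "salary": 85000},
]

out = Path("output") / "clean_users.json"
out.parent.mkdir(parents=True, exist_ok=True)  # create output/ if missing
out.write_text(json.dumps(rows, indent=2))
```

indent=2 keeps the output human-readable and diffable in your PR.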

<aside> 💡 Using AI to help: When you paste a stack trace into an LLM to debug a parsing error, the messy_users.csv data is fictional and safe to share (⚠️ but at work, on real-world data, never paste names, emails, IDs, or other PII without redacting first). The sample data is fictional; the habit is not.

</aside>

Task 2: AI Debugging Report (20 points)

Practise using AI as a tool for debugging, not just for generating code.

  1. While working on Task 1, capture one debugging session where you used an LLM (ChatGPT, Claude, GitHub Copilot, etc.) to fix a real bug.
  2. Open task-2/AI_DEBUG.md and fill in the four sections: The Error, The Prompt, The Solution, and Reflection.

The auto-grader checks that the file exists, that all four section headers are present, and that the file is non-trivially long (you have actually filled it in).

Task 3: Prove HYF Azure access (20 points)

You will receive an email invite to the HackYourFuture Azure tenant before Week 1 starts. This task verifies the invite worked and you can navigate the portal.

<aside> 💡 Do this task whenever your invite arrives. Tasks 1 and 2 do not depend on Azure access, so if the invite is still in flight, work on the cleaner pipeline first and come back to Task 3 once the email lands.

</aside>

  1. Wait for the invite email from Microsoft. Subject typically: You're invited to join HackYourFuture. Check your spam folder.
  2. Accept the invite. Click the link, sign in with your personal Microsoft account, approve the consent dialog.
  3. Switch directory in the portal. Go to portal.azure.com. Top-right avatar → Switch directory → select HackYourFuture.
  4. Verify Reader access. Search the top bar for rg-hyf-students-readonly and click the resource group. You should see a demo storage account inside.
  5. Capture a screenshot showing all three of:
  6. Save it as task-3/azure_proof.png in your repo. JPG is also accepted.

<aside> ❗ If you have not received the invite within 24 hours of starting Week 1, message your teacher in Slack with the email address you enrolled with. They will re-issue the invite.

</aside>

Evaluation criteria

The auto-grader runs on every push to your PR and posts a score comment. Total = 100, passing = 60.

| Task | Weight | What the grader checks |
| --- | --- | --- |
| Task 1 | 60 | task-1/src/cleaner.py and task-1/src/utils.py exist; cleaner.py imports logging and uses pathlib; utils.py defines real helper functions; running the cleaner against task-1/data/messy_users.csv writes valid JSON to task-1/output/clean_users.json; the output passes structural checks (≥10 rows, no whitespace in names, lowercase emails, integer or null salaries, no missing names or emails, no duplicate emails). |
| Task 2 | 20 | task-2/AI_DEBUG.md exists, has all four required sections (The Error, The Prompt, The Solution, Reflection), and is non-trivially filled in. |
| Task 3 | 20 | task-3/azure_proof.png (or .jpg/.jpeg) exists and is >0 bytes. Teachers spot-check the image contents during PR review. |

You can run the grader locally with bash .hyf/test.sh && cat .hyf/score.json to see the same per-task breakdown the workflow produces.

<aside> ❗ The auto-grader is a smoke test, not your grade. A passing auto-grader means your submission is reviewable, not that you have passed the assignment. Teachers manually review every PR for code quality, modularity, type hints, real logging use, error handling, AI debugging substance (did the reflection show actual understanding?), and the authenticity of the Azure screenshot. A submission that scores 100 from the auto-grader but consists of LLM-generated boilerplate or a hardcoded JSON output will not pass the manual review. Spend your time understanding what you build, not optimising for the grader.

</aside>

Submission