
🎒 Week 1 Assignment: The Data Cleaning Pipeline

In this assignment, you will build a robust command-line tool to clean a "messy" dataset. This mimics a very common real-world task for data engineers: taking raw, inconsistent data and transforming it into a clean, usable format.

Task 1 – The Cleaner Script

week_1__messy_users.csv

You have been given a file data/messy_users.csv. It contains user data, but it is full of errors: whitespace issues, inconsistent capitalization, missing fields, and badly formatted numbers.

Create a Python script named src/cleaner.py that reads this CSV file, cleans the data according to the rules below, and writes the valid records to a JSON file.

Before writing any code, think about the structure of the clean data you want to produce. In real data engineering work, data is usually cleaned to conform to a target shape that downstream systems expect. With this in mind, your implementation should produce records with a clear and consistent structure.
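One way to pin down a target shape is to write it out as a type before any cleaning logic exists. The sketch below is only an illustration: the class name `CleanUser` and the field names `name`, `email`, `department`, and `salary` are assumptions based on the cleaning rules, not the actual column names of the CSV, and the sample values are made up.

```python
from typing import TypedDict


class CleanUser(TypedDict):
    """Hypothetical target shape for one cleaned record (field names assumed)."""

    name: str        # leading/trailing whitespace removed, never empty
    email: str       # lowercased, never empty
    department: str  # "Unknown" when the source value is missing
    salary: int      # plain integer, e.g. 68000


# Made-up record showing what one entry of the output could look like.
example: CleanUser = {
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "department": "Engineering",
    "salary": 68000,
}
```

Writing the shape down first makes it easier to check that every code path produces the same keys and types.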

Cleaning Rules

Name: Remove any leading/trailing whitespace.
Email: Convert to lowercase.
Department: If missing, set to "Unknown".
Salary:
- Remove any extra characters, such as double quotes (") or commas (,). For example, "68,000" should become the integer 68000.
- Convert to an integer.
Validation:
- If the name is empty (after cleaning), skip the row and log a warning.
- If the email is empty, skip the row and log a warning.
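The rules above could be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names `parse_salary` and `clean_row`, the logger name, and the column keys are hypothetical. Stripping whitespace from the email and department is an extra assumption beyond the stated rules, and the rules do not say what to do with a salary that cannot be parsed, so this sketch simply lets the `ValueError` propagate. It uses built-in generic types, so it assumes Python 3.9 or newer.

```python
import logging
from typing import Optional

logger = logging.getLogger("cleaner")  # hypothetical logger name


def parse_salary(raw: str) -> int:
    """Remove quotes, commas and surrounding whitespace, then convert to int."""
    cleaned = raw.replace('"', "").replace(",", "").strip()
    return int(cleaned)  # raises ValueError if the remainder is not an integer


def clean_row(row: dict[str, Optional[str]], line_no: int) -> Optional[dict[str, object]]:
    """Return the cleaned row, or None if the row must be skipped."""
    # `or ""` guards against missing keys and None values.
    name = (row.get("name") or "").strip()
    email = (row.get("email") or "").strip().lower()  # strip() is an assumption
    department = (row.get("department") or "").strip() or "Unknown"

    if not name:
        logger.warning("Skipping row %d: name is empty", line_no)
        return None
    if not email:
        logger.warning("Skipping row %d: email is empty", line_no)
        return None

    return {
        "name": name,
        "email": email,
        "department": department,
        "salary": parse_salary(row.get("salary") or ""),
    }
```

Including the row number in each warning keeps the log actionable: someone reading it later can go straight to the offending line in the input file.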

Technical Requirements

CLI Arguments: The script must accept input and output file paths as command-line arguments.
- Example: python src/cleaner.py data/messy_users.csv output/clean_users.json
Logging: Use the logging module (not print) to report progress.

Logs should be informative and actionable. Imagine this script runs daily as part of a pipeline and you need to understand what happened by reading the logs.
- Log an INFO message when starting.
- Log a WARNING message for every skipped row.
- Log an INFO message at the end with the total number of processed vs. valid rows.
Type Hinting: All functions must have type hints.
Error Handling: Use try/except blocks to handle file-not-found errors gracefully.
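The requirements above might fit together along these lines. Treat it as a skeleton under assumptions: the names `parse_args`, `run`, and `main`, the log format, and the choice to return exit code 1 for a missing file are all illustrative, and the cleaning step is left as a placeholder. It assumes Python 3.9 or newer.

```python
import argparse
import csv
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("cleaner")  # hypothetical logger name


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    """Read the input and output paths from the command line."""
    parser = argparse.ArgumentParser(description="Clean a messy users CSV.")
    parser.add_argument("input_path", type=Path, help="CSV file to read")
    parser.add_argument("output_path", type=Path, help="JSON file to write")
    return parser.parse_args(argv)


def run(input_path: Path, output_path: Path) -> int:
    """Return an exit code: 0 on success, 1 if the input file is missing."""
    logger.info("Starting cleaning: %s -> %s", input_path, output_path)
    try:
        with input_path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    except FileNotFoundError:
        logger.error("Input file not found: %s", input_path)
        return 1

    # Placeholder: the real cleaning and validation would happen here.
    valid_rows = rows

    # Create the output directory if it does not exist yet (an assumption).
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as handle:
        json.dump(valid_rows, handle, indent=2)

    logger.info("Finished: %d rows processed, %d valid", len(rows), len(valid_rows))
    return 0


def main(argv: Optional[list[str]] = None) -> int:
    """Configure logging, parse arguments and run the pipeline."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    args = parse_args(argv)
    return run(args.input_path, args.output_path)
```

In a script, `main()` would typically be called under an `if __name__ == "__main__":` guard, with its return value passed to `sys.exit` so that a missing input file produces a non-zero exit code for the surrounding pipeline.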

Task 2 – AI Debugging Report

We want you to practice using AI as a tool for debugging, not just generating code.

Introduce a bug into your code intentionally (or use a real one you encountered).
- Examples: salary parsed as a string instead of an int, wrong logging counts, malformed JSON output, rows skipped incorrectly.
Ask an LLM (ChatGPT, Claude, etc.) to help you fix it.
Create a file AI_DEBUG.md and document:
- The Error: What went wrong? (Paste the traceback).
- The Prompt: What did you ask the AI?
- The Solution: What did the AI suggest? Did it work?
- Reflection: Did you understand why it was broken? Did the AI suggest anything incorrect or unnecessary?
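Some of the example bugs, such as a salary stored as a string, produce no traceback at all. A small check over the output records can make them visible and gives you something concrete to paste into a prompt. The helper below is a hypothetical sketch: the name `find_problems` and the field names are assumptions, and it expects the records already loaded from the JSON output as a list of dictionaries (Python 3.9 or newer).

```python
def find_problems(records: list[dict[str, object]]) -> list[str]:
    """Return human-readable descriptions of records that break the rules."""
    problems: list[str] = []
    for index, record in enumerate(records):
        salary = record.get("salary")
        # bool is a subclass of int, so it is excluded explicitly.
        if not isinstance(salary, int) or isinstance(salary, bool):
            problems.append(
                f"record {index}: salary is {type(salary).__name__}, expected int"
            )
        if not record.get("name"):
            problems.append(f"record {index}: name is empty")
        if not record.get("email"):
            problems.append(f"record {index}: email is empty")
    return problems


# Made-up records: the second one carries the "salary as string" bug.
sample = [
    {"name": "Ada", "email": "ada@example.com", "salary": 68000},
    {"name": "Bob", "email": "bob@example.com", "salary": "52000"},
]
print(find_problems(sample))
```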

Task 3 – Azure Setup

Data engineering often happens in the cloud. We need to verify you are ready for the upcoming cloud modules.

Log into portal.azure.com.
Take a screenshot of the portal dashboard showing your account logged in.
Save the image as assets/azure_proof.png.

Submission

Ensure your project structure looks like this:

week1-assignment/
├── src/
│   └── cleaner.py
├── data/
│   └── messy_users.csv
├── output/
│   └── (clean_users.json will be generated here)
├── assets/
│   └── azure_proof.png
├── AI_DEBUG.md
└── README.md

README.md should include how to run the script, example commands, and any assumptions you made during cleaning.

Create a Git branch named week1/your-name.
Commit your changes.
Push to the repository and open a Pull Request.

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*
