In this assignment, you will build a robust command-line tool to clean a "messy" dataset. This mimics a very common real-world task for data engineers: taking raw, inconsistent data and transforming it into a clean, usable format.
You have been given a file data/messy_users.csv. It contains user data, but it is full of errors: whitespace issues, inconsistent capitalization, missing fields, and badly formatted numbers.
Create a Python script named src/cleaner.py that reads this CSV file, cleans the data according to the rules below, and writes the valid records to a JSON file.
Before writing any code, think about the structure of the clean data you want to produce. In real data engineering work, cleaning usually means conforming data to a target shape that downstream systems expect. With this in mind, your implementation should produce one clear, consistent data structure.
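One way to pin down the target shape before writing the cleaning logic is to declare it as a small typed record. The sketch below is only an illustration: the class name `CleanUser` is hypothetical, the field list is limited to the fields this brief mentions (name, email, salary), and representing a missing salary as `None` is an assumption rather than a stated requirement.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CleanUser:
    """Hypothetical target record; extend it with whatever columns your CSV has."""

    name: str
    email: str
    salary: Optional[int]  # assumption: None when the salary is missing or unparseable


# Every cleaned row is converted to this shape before it is serialized.
record = CleanUser(name="Ada Lovelace", email="ada@example.com", salary=68000)
print(json.dumps(asdict(record)))
# prints {"name": "Ada Lovelace", "email": "ada@example.com", "salary": 68000}
```

Declaring the shape first makes it easier to spot rows that do not fit it, and it doubles as documentation for the assumptions you list in the README.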
" or , (e.g. "68,000" should become integer 68000).
If name is empty (after cleaning), skip the row and log a warning.
If email is empty, skip the row and log a warning.
CLI Arguments: The script must accept input and output file paths as command-line arguments.
python src/cleaner.py data/messy_users.csv output/clean_users.json
Logging: Use the logging module (not print) to report progress.
Logs should be informative and actionable. Imagine this script runs daily as part of a pipeline and you need to understand what happened by reading the logs.
Type Hinting: All functions must have type hints.
Error Handling: Use try/except blocks to handle file not found errors gracefully.
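The requirements above could fit together roughly as follows. This is a minimal sketch, not a reference solution: the column names `name`, `email`, and `salary`, the helper names `parse_salary`, `clean_row`, and `main`, the exit codes, and the choice to store an unparseable salary as `None` are all assumptions made for illustration. Cleaning rules not listed in this brief are left out.

```python
import argparse
import csv
import json
import logging
from pathlib import Path
from typing import Dict, List, Optional

logger = logging.getLogger("cleaner")


def parse_salary(raw: str) -> Optional[int]:
    """Strip quotes, commas and whitespace, then convert to int (None on failure)."""
    digits = raw.replace('"', "").replace(",", "").strip()
    try:
        return int(digits)
    except ValueError:
        return None


def clean_row(row: Dict[str, str], line_no: int) -> Optional[Dict[str, object]]:
    """Return a cleaned record, or None when the row must be skipped."""
    # DictReader yields None for missing cells, hence the `or ""`.
    name = (row.get("name") or "").strip()
    email = (row.get("email") or "").strip()
    if not name:
        logger.warning("Line %d skipped: empty name", line_no)
        return None
    if not email:
        logger.warning("Line %d skipped: empty email", line_no)
        return None
    return {"name": name, "email": email, "salary": parse_salary(row.get("salary") or "")}


def main(argv: Optional[List[str]] = None) -> int:
    """Parse CLI arguments, clean the CSV, write JSON; returns a process exit code."""
    parser = argparse.ArgumentParser(description="Clean a messy users CSV into JSON.")
    parser.add_argument("input_path")
    parser.add_argument("output_path")
    args = parser.parse_args(argv)
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

    try:
        with open(args.input_path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
    except FileNotFoundError:
        logger.error("Input file not found: %s", args.input_path)
        return 1

    # Line numbers are approximate: the header is line 1, data starts at line 2.
    candidates = (clean_row(row, i) for i, row in enumerate(rows, start=2))
    cleaned = [record for record in candidates if record is not None]

    # Create the output directory if it does not exist yet.
    Path(args.output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(args.output_path, "w", encoding="utf-8") as handle:
        json.dump(cleaned, handle, indent=2)

    logger.info(
        "Read %d rows, wrote %d, skipped %d",
        len(rows), len(cleaned), len(rows) - len(cleaned),
    )
    return 0
```

In a real script you would call `main()` from an `if __name__ == "__main__":` guard and pass its return value to `sys.exit`. The final summary line, with read, written, and skipped counts, is the kind of log entry that lets someone reading a daily pipeline log see at a glance whether a run behaved normally.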
We want you to practice using AI as a tool for debugging, not just generating code.
Introduce a bug into your code intentionally (or use a real one you encountered).
Examples: salary parsed as a string instead of an int, wrong logging counts, malformed JSON output, rows skipped incorrectly, ...
Ask an LLM (ChatGPT, Claude, etc.) to help you fix it.
Create a file AI_DEBUG.md and document:
Data engineering often happens in the cloud. We need to verify you are ready for the upcoming cloud modules.
assets/azure_proof.png.
Ensure your project structure looks like this:
week1-assignment/
├── src/
│ └── cleaner.py
├── data/
│ └── messy_users.csv
├── output/
│ └── (clean_users.json will be generated here)
├── assets/
│ └── azure_proof.png
├── AI_DEBUG.md
└── README.md
README.md should include: how to run the script, example commands, any assumptions made during cleaning, etc.
Create a git branch week1/your-name.
Commit your changes.
Push to the repository and open a Pull Request.
