These six exercises consolidate everything Week 1 introduced: variables, functions, type hints, logging, debugging, file I/O, and CLI habits. Pick the ones that match what felt shaky on a first read; you can do them in any order.
By the end of this chapter, you should have practiced the core Python skills that the Week 1 assignment in the next chapter combines into one larger task.
All six exercises live as subfolders on the same w1 branch of the practice repo. Open it once for the whole week, not once per exercise:
<aside> 💻 Open in GitHub Codespaces
</aside>
Prefer your own editor? Clone locally:
git clone -b w1 https://github.com/lassebenni/hyf-data-track-python-exercises.git
cd hyf-data-track-python-exercises
code .
Either way, each exercise is its own subfolder: exercise_1/, exercise_2/, ..., exercise_6/. Switch between them in the file explorer; one cold-start covers the whole week.
Reference solutions for all six exercises live on a separate branch, w1-solutions. The w1 branch is intentionally starter-only so you do not accidentally peek before you struggle (which is where the learning happens).
When you finish an exercise, or you are genuinely stuck after ~30 minutes, switch to the solutions branch:
git fetch origin
git checkout w1-solutions
On w1-solutions, each starter file (exercise.py, exercise_2.py, exercise_6.py, assignment_6.py) has been filled in with the answer in place. The original # TODO and # FIXME comments are still there, with the solution code and a # WHY ...: note sitting directly under each one. So the file you read holds the question and the answer together: TODO above, code + commentary below. Read the WHY comments; do not copy the code verbatim. The point of the reference solution is the commentary, not the answer.
If you have uncommitted edits to files that differ on w1-solutions (your exercise files almost certainly do), git checkout w1-solutions will refuse with "Your local changes ... would be overwritten by checkout". That's Git protecting your work, not breaking it. Two safe options:
1. Stash your edits (git stash) first, then run git checkout w1-solutions.
2. Stay on your branch and browse the w1-solutions file tree on GitHub instead.
Concepts: Variables (Ch2), Functions (Ch4), Type hints (Ch5), Logging (Ch8).
You are building a small weather station script. Your task is to write a function that converts Celsius to Fahrenheit, but it must be “production-ready”.
Instructions:
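1. Import the logging module and configure it to level INFO.
2. Define a function named convert_c_to_f.
3. Add type hints: it should accept a float and return a float.
4. Apply the conversion formula ((celsius * 9/5) + 32).
5. Log an info message: "Converting {celsius}°C to {fahrenheit}°F".
<aside>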
📦 Files: exercise_1/ on the w1 branch (use the Codespace you opened at the top of this page).
</aside>
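The steps above combine three habits: configure logging once, annotate the function signature, and log from inside the function. Here is a minimal sketch of that pattern on a different conversion (kilometres to miles), so the exercise itself stays yours to solve. The names convert_km_to_miles and KM_PER_MILE are illustrative and do not appear in the starter files.

```python
import logging

# Configure the root logger once, at program start-up.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

KM_PER_MILE = 1.609344  # illustrative constant: kilometres in one international mile


def convert_km_to_miles(kilometres: float) -> float:
    """Convert a distance in kilometres to miles and log the conversion."""
    miles = kilometres / KM_PER_MILE
    # %-style arguments are only formatted if the message is actually emitted.
    logger.info("Converting %s km to %s miles", kilometres, miles)
    return miles


print(convert_km_to_miles(10.0))
```

The exercise asks for an f-string style message; the %-style call shown here is an alternative that the logging module also accepts.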
Concepts: Lists (Ch2), Loops & conditionals (Ch3), Debugging (Ch7).
You have received a list of user ages from a database, but the data is dirty. Some values are strings, some are numbers, and some are negative (impossible!).
The buggy code:
Copy this into exercise_2.py. It currently crashes.
ages = [25, 30, "40", "not_available", 20, -5]

def calculate_average_age(age_list):
    total = 0
    count = 0
    for age in age_list:
        total += age
        count += 1
    return total / count

print(calculate_average_age(ages))
Instructions:
1. If a value is a numeric string (e.g. "40"), convert it to an int.
2. If a value cannot be converted (e.g. "not_available"), skip it.
<aside>
📦 Files: exercise_2/ on the w1 branch (use the Codespace you opened at the top of this page).
</aside>
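The convert-or-skip rule in the instructions is usually expressed with try / except around int(). A small sketch on unrelated sample data; the helper parse_int_or_none and the list raw_scores are hypothetical names, not part of the starter file.

```python
from typing import Optional


def parse_int_or_none(value: object) -> Optional[int]:
    """Return value as an int, or None when it cannot be converted."""
    try:
        return int(value)  # works for ints and numeric strings such as "12"
    except (TypeError, ValueError):
        return None  # e.g. "n/a" (ValueError) or None (TypeError)


raw_scores = [7, "12", "n/a", 3]
parsed = [parse_int_or_none(item) for item in raw_scores]
clean_scores = [score for score in parsed if score is not None]
print(clean_scores)  # [7, 12, 3]
```

Note that int("-5") converts without error, so a range check (such as rejecting negative ages) has to be a separate condition.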
Concepts: Floating-point math (Ch2), Debugging (Ch7).
You are processing payments for a transaction system. You have a wallet with 0.1 bitcoin, and you receive 0.2 bitcoin. You want to check if you now have exactly 0.3 bitcoin to execute a trade.
The buggy code:
This code prints "Transaction failed?" even though 0.1 + 0.2 should equal 0.3. Even stranger: the wallet balance prints as 0.30000000000000004, not 0.3.
wallet = 0.1
wallet += 0.2
print(f"Wallet balance:{wallet}")
if wallet == 0.3:
    print("Transaction Success!")
else:
    print("Transaction failed?")
Instructions:
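1. Run the script. It prints Wallet balance:0.30000000000000004 (not 0.3), and the if check fails. Both clues point at the same root cause: binary floating point cannot represent 0.1, 0.2, or 0.3 exactly, so 0.1 + 0.2 does not land on the same stored value as 0.3.
2. Set a breakpoint on the if wallet == 0.3: line.
3. Hover over the wallet variable (or look in the Variables pane).
4. What is the actual value of wallet?
5. Use the round() function to fix the comparison (e.g., round to 1 decimal place).
<aside>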
📦 Files: exercise_3/ on the w1 branch (use the Codespace you opened at the top of this page).
</aside>
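If you want to see the root cause without a debugger, the standard library can show the exact value a float holds. The sketch below uses decimal.Decimal for that and compares three ways of testing equality. math.isclose is an alternative the exercise does not ask for; it is included here only for comparison.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Decimal(float) prints the exact binary value stored for that float.
print(Decimal(0.1))
print(Decimal(total))

print(total == 0.3)              # False: the two stored values differ
print(round(total, 1) == 0.3)    # True: both sides round to the same float
print(math.isclose(total, 0.3))  # True: equal within a small relative tolerance
```

Rounding works here because you know how many decimals matter; math.isclose is handy when you do not.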
Concepts: Dictionaries (Ch2), Type hints (Ch5), Logging (Ch8), branching logic (Ch3).
Combine everything! You need to process student grades.
Instructions:
1. Create a dictionary: student = {"name": "Alice", "grades": [85, 90, 78]}
2. Write a function process_student(student_data: dict) -> None.
3. Configure logging to show timestamps.
4. Log an INFO message: "Processing grades for [Name]".
5. Use an if / elif / else ladder so exactly one grade is logged: if the average is greater than 80, log INFO "Grade A"; else if the average is greater than 60, log INFO "Grade B"; otherwise log WARNING "Student requires assistance".
<aside>
⚠️ The order matters. If you write three separate if blocks instead of if / elif / else, an average of 85 will trigger both "Grade A" and "Grade B" because both conditions are true. The elif is what makes the ladder pick exactly one branch.
</aside>
Run your function with Alice's grades and confirm only one log line is emitted per student.
<aside>
📦 Files: exercise_4/ on the w1 branch (use the Codespace you opened at the top of this page).
</aside>
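The warning about separate if blocks is easy to check for yourself. In this sketch the function names separate_ifs and ladder are illustrative, and the messages are collected in a list instead of being logged so the difference is easy to print.

```python
def separate_ifs(average: float) -> list[str]:
    """Three independent checks: more than one can fire."""
    fired = []
    if average > 80:
        fired.append("Grade A")
    if average > 60:
        fired.append("Grade B")
    if average <= 60:
        fired.append("Student requires assistance")
    return fired


def ladder(average: float) -> list[str]:
    """One if / elif / else chain: exactly one branch runs."""
    fired = []
    if average > 80:
        fired.append("Grade A")
    elif average > 60:
        fired.append("Grade B")
    else:
        fired.append("Student requires assistance")
    return fired


print(separate_ifs(85))  # ['Grade A', 'Grade B']
print(ladder(85))        # ['Grade A']
```

In your own solution the append calls would be logging calls, but the branching behaves the same way.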
Concepts: File I/O & context managers (Ch9), Strings (Ch2).
A big part of data engineering is reading data from files. Let's practice reading a raw text file and processing it.
Instructions:
1. Create a file named raw_data.txt in your folder. Add these lines:
   amsterdam
   rotterdam
   the hague
   utrecht
2. Create a script named exercise_5.py.
3. Use the with open(...) pattern to read the raw_data.txt file.
4. Use with open(...) again to write the cleaned names into a new file called processed_data.txt.
<aside>
📦 Files: exercise_5/ on the w1 branch (use the Codespace you opened at the top of this page).
</aside>
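The read-then-write pattern can be sketched as below, on throwaway files in a temporary directory. This assumes that "cleaned" means stripping whitespace and title-casing each name; the exercise text here does not spell out the cleaning rule, so treat that as a guess. The function clean_lines and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def clean_lines(source: Path, target: Path) -> list[str]:
    """Read names from source, clean them, and write one per line to target."""
    # The with block closes the file even if an error occurs while reading.
    with open(source, encoding="utf-8") as infile:
        # Assumed cleaning rule: strip whitespace, drop blank lines, title-case.
        cleaned = [line.strip().title() for line in infile if line.strip()]
    with open(target, "w", encoding="utf-8") as outfile:
        for name in cleaned:
            outfile.write(name + "\n")
    return cleaned


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sample_in.txt"
    target = Path(tmp) / "sample_out.txt"
    source.write_text("  oslo \nbergen\n\nnew york\n", encoding="utf-8")
    result = clean_lines(source, target)
    written = target.read_text(encoding="utf-8")

print(result)  # ['Oslo', 'Bergen', 'New York']
```

Opening the input and output in two separate with blocks keeps each file open only for as long as it is needed.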
Concepts: CLI arguments via argparse (Ch6), Logging levels (Ch8), pathlib (Ch9).
You are given a tiny CSV cleaner that mostly works, but it has three problems any production pipeline would fix on day one:
- The log level is hard-coded to INFO. You cannot quiet it for a clean run, or crank it up to DEBUG when something looks off.
- Every message goes through logging.info() regardless of what it actually is. Skipped rows should warn; per-row trace should be DEBUG; the summary is INFO.
Goal: make the script runnable like this:
python exercise_6.py --input data/messy_users.csv --output cleaned.json --log-level DEBUG
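One way the flags in that command could be wired up, offered as a hedged sketch and not as the reference solution. The helper names build_parser and configure_logging, the help texts, and the list of allowed levels are assumptions; only the three flag names come from the command above.

```python
import argparse
import logging
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the three flags used in the example command."""
    parser = argparse.ArgumentParser(description="Clean a messy CSV file.")
    parser.add_argument("--input", type=Path, required=True, help="CSV file to read")
    parser.add_argument("--output", type=Path, required=True, help="JSON file to write")
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],  # assumed set of levels
        help="minimum severity to print",
    )
    return parser


def configure_logging(level_name: str) -> None:
    # logging.DEBUG, logging.INFO, ... are module attributes named after the level.
    logging.basicConfig(level=getattr(logging, level_name), force=True)


# Passing a list to parse_args() lets the sketch run without a real command line.
args = build_parser().parse_args(
    ["--input", "data/messy_users.csv", "--output", "cleaned.json", "--log-level", "DEBUG"]
)
configure_logging(args.log_level)

logging.debug("per-row trace: shown only when the level is DEBUG")
logging.warning("skipped row: shown when the level is WARNING or lower")
logging.info("summary: shown when the level is INFO or lower")
```

Using type=Path means args.input and args.output arrive as pathlib.Path objects, so the rest of the script never has to handle raw path strings.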
Instructions: