Welcome to the inevitable! No matter how experienced you become, you will write code that crashes, as I’m sure you’ve experienced plenty of times in the core program. In fact, as a Data Engineer, you will often deal with messy data that breaks your pipelines.
Learning to read errors and fix them (debugging) is a superpower. In this chapter, we will learn how Python tells us something went wrong and how to investigate it.
Just like in JavaScript, errors in Python generally fall into three categories:
Syntax errors happen when Python doesn’t understand your code because you broke the rules of the language. Python catches these before it runs your program.
# ❌ SyntaxError: expected ':'
if True
    print("This won't run")
Common Syntax Errors:
- Missing a colon : at the end of if, for, def.
- Unclosed parentheses ( or brackets [.

Since you have the VS Code Python extension installed, your editor will flag these as you type, using Pylance, VS Code’s default Python language support tool.
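To see that Python rejects a script with a syntax error before running any of it, here is a minimal sketch. The `source` string is a hypothetical script made up for illustration, and `exec` is used only so the demonstration fits in one file. Normally you would just run the script and read the error.

```python
import contextlib
import io

# Hypothetical script: a valid first line, then an `if` with its colon missing.
source = 'print("start")\nif True\n    print("This won\'t run")\n'

captured = io.StringIO()
syntax_error = None
try:
    with contextlib.redirect_stdout(captured):
        exec(source)  # Python compiles the whole string before running any of it
except SyntaxError as err:
    syntax_error = err

print(f"Error on line {syntax_error.lineno}: {type(syntax_error).__name__}")
print(f"Output produced before the error: {captured.getvalue()!r}")
```

Even though the first line of the script is valid, "start" is never printed: the syntax error on line 2 stops the whole script from starting.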
Runtime errors happen while the program is running. The syntax is correct, but something illegal happened during execution.
# ❌ ZeroDivisionError: division by zero
result = 10 / 0
# ❌ NameError: name 'x' is not defined
print(x)
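Unlike a syntax error, a runtime error only stops the program when the faulty line is reached, so everything before it has already run. The sketch below illustrates this with messy input, the kind of data that tends to break pipelines. The `raw_values` list is a made-up example, and the `try`/`except` is there only so the demonstration can report what happened instead of crashing.

```python
raw_values = ["10", "20", "n/a", "40"]  # hypothetical messy input
cleaned = []

try:
    for value in raw_values:
        cleaned.append(int(value))  # int("n/a") raises ValueError
except ValueError as err:
    print(f"Crashed after cleaning {len(cleaned)} values: {err}")

print(cleaned)
```

The first two values were converted before the crash, and the value after the bad one was never processed.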
Logical errors are the third category: the program runs without crashing, but it does the wrong thing. These are the hardest to catch because Python won’t give you an error message.
# 🐛 Logical Error: Calculating average incorrectly
numbers = [10, 20, 30]
average = sum(numbers) / 2 # Should be divided by len(numbers), which is 3!
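Because Python stays silent about logical errors, one common way to catch them is to compare the result with an answer you worked out by hand. A minimal sketch, reusing the numbers from the example above (the variable names `wrong_average` and `correct_average` are only for illustration):

```python
numbers = [10, 20, 30]

wrong_average = sum(numbers) / 2               # the bug: hard-coded divisor
correct_average = sum(numbers) / len(numbers)  # divides by the actual count

print(f"wrong: {wrong_average}, correct: {correct_average}")

# By hand: (10 + 20 + 30) / 3 = 20, so this check passes only for the fix.
assert correct_average == 20
```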
When Python crashes, it prints a “Stack Trace” (or Traceback). It looks intimidating, but it’s actually very helpful. It tells you exactly where the problem is.
Unlike some JavaScript error messages which can be vague, Python’s traceback is usually very precise.
Example Traceback:
Traceback (most recent call last):
  File "main.py", line 5, in <module>
    calculate_total(10, 0)
  File "main.py", line 2, in calculate_total
    return a / b
ZeroDivisionError: division by zero
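The source of main.py is not shown, so the sketch below assumes a hypothetical version of it, laid out so that the line numbers match the traceback above. It runs that source under the file name "main.py" and prints the location of each call, using the standard `traceback` module.

```python
import traceback

# Hypothetical contents of main.py (an assumption made for illustration):
# the function sits on lines 1-2 and the call on line 5.
MAIN_PY = '''def calculate_total(a, b):
    return a / b


calculate_total(10, 0)
'''

frames = []
try:
    exec(compile(MAIN_PY, "main.py", "exec"), {})
except ZeroDivisionError as err:
    # Keep only the entries that belong to main.py, oldest call first.
    frames = [
        frame
        for frame in traceback.extract_tb(err.__traceback__)
        if frame.filename == "main.py"
    ]
    for frame in frames:
        print(f'File "{frame.filename}", line {frame.lineno}, in {frame.name}')
    print(f"{type(err).__name__}: {err}")
```

The last entry is where the crash happened (line 2, inside calculate_total). The entry before it is the call that led there (line 5).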
How to read it:
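- Start at the bottom: the last line shows the error type (ZeroDivisionError) and the message (division by zero).
- Look just above it: this entry shows the file (main.py), the line number (line 2), and the code that caused the crash.
- Read upward to see how you got there: each earlier entry is a call that led to the crash. Here, line 5 of main.py called calculate_total.

Copy the following code into a file named buggy.py and run it. Look at the traceback. Which line actually caused the crash? Which line called the function that crashed?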
def greet(name):
    return "Hello " + name

def welcome_users(users):
    for user in users:
        print(greet(user))

# There is a bug here!
user_list = ["Alice", "Bob", 123]
welcome_users(user_list)
The simplest way to debug is often the most effective. If your code isn’t doing what you expect, print() the values of your variables at different steps.
def add_tax(amount):
    print(f"DEBUG: amount is {amount}")  # 👀 Check input
    tax = amount * 0.21
    print(f"DEBUG: tax calculated is {tax}")  # 👀 Check intermediate value
    return amount + tax
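If you are on Python 3.8 or newer, f-strings offer a shortcut for this kind of debug print: putting `=` after a variable name prints both the name and its value. A minimal sketch of the same function using that shortcut (the call with `100` is just an example input):

```python
def add_tax(amount):
    print(f"DEBUG: {amount=}")  # prints the name and the value together
    tax = amount * 0.21
    print(f"DEBUG: {tax=}")
    return amount + tax


total = add_tax(100)
print(f"Total: {total}")
```

This saves typing and avoids labels that drift out of sync with the variable they describe.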
The following code tries to find the largest number in a list, but it returns the wrong answer. Use print() statements to trace the loop and fix the logical error.
numbers = [1, 5, 2, 9, 3]
max_num = 0

for n in numbers:
    if n < max_num:  # 🤔 Is this correct?
        max_num = n

print(f"The largest number is {max_num}")
Using print() is fine for small scripts, but for larger applications, it gets messy. Imagine having to delete 50 print statements before committing your code!
Visual Studio Code has a built-in Debugger. It allows you to pause your code in the middle of execution and look at the variables live.
A breakpoint is a stop sign for your code. To set one, click in the margin to the left of a line number; a red dot appears. When Python reaches that line, it will pause.
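Instead of clicking the “Play” button at the top right, start your script in debug mode: open the Run and Debug view in the sidebar and click Run and Debug, or press F5.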
Your code will start running and freeze at your red dot.
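Once paused, a floating toolbar appears at the top. Here are the most important buttons:
- Continue (F5): run until the next breakpoint.
- Step Over (F10): run the current line and pause on the next one.
- Step Into (F11): enter the function called on the current line.
- Step Out (Shift+F11): finish the current function and pause in its caller.
- Stop (Shift+F5): end the debugging session.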
Look at the Variables panel on the left side. You can see the value of every variable at that exact moment. No more print(variable) needed!
You are writing a script to fetch data records from an API. You want to collect exactly 20 records to form a “batch” before saving them. The API gives you records in chunks of 3.
This code runs forever and crashes your terminal (an infinite loop). Do not fix it by guessing! Use the debugger to find out why it misses the target.
1. Set a breakpoint on the line current_count += 3.
2. Start the debugger and watch the current_count variable in the Variables panel on the left.
3. Step through the loop a few times. What value does current_count have when the loop should stop, but doesn’t?

target_batch_size = 20
current_count = 0

print("--- Starting Batch Collection ---")

# We need exactly 20 records to close the batch
while current_count != target_batch_size:
    print(f"Status: We have {current_count} records...")
    # Simulate fetching 3 records at a time
    current_count += 3

print("Batch successfully collected!")
Once you see the variable skip past 20 in the debugger, you’ll realize why != (not equal) is dangerous here. How would you change the while condition to make it safe?
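If you want to confirm what the debugger showed without risking another hung terminal, one option is to replay the loop with a hard limit on the number of iterations. This is only a sketch: `max_iterations` and `seen_counts` are names made up for this example, and the loop condition is deliberately left unfixed so the question above stays open.

```python
target_batch_size = 20
current_count = 0
max_iterations = 10  # safety cap (an assumption for this demo) so it always ends
seen_counts = []

for _ in range(max_iterations):
    if current_count == target_batch_size:
        break  # the original loop would stop here
    seen_counts.append(current_count)
    current_count += 3  # same step size as the exercise

print(f"Counts seen: {seen_counts}")
print(f"Was the target ever hit? {target_batch_size in seen_counts}")
```

The count goes from 18 straight to 21, so the exact value 20 is never reached.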
- What is the difference between a SyntaxError and a runtime error?

If you want to dive deeper into how Python handles errors and how to use the VS Code debugger efficiently, check out these resources: