Welcome to the "Gotchas" section. These are the subtle traps that every Python developer falls into at least once. We're telling you now so you can spot them when (not if) they happen to you.
This is the most famous Python gotcha.
You think that list=[] in a function argument creates a new empty list every time you call the function.
Python creates the default object once when the function is defined, not when it is called. If you modify that object, the change persists for the next call!
<aside> 🎬 Terminal Tutorial: Mutable Defaults
</aside>
# BAD ❌
def add_student(name, students=[]):
    students.append(name)
    return students

print(add_student("Alice"))  # ['Alice'] - Good
print(add_student("Bob"))  # ['Alice', 'Bob'] - WAIT WHAT? (Alice is still there!)

# GOOD ✅
def add_student(name, students=None):
    if students is None:
        students = []  # Create a new list inside the function
    students.append(name)
    return students
You think 0.1 + 0.2 equals 0.3.
Computers store decimals in binary (base 2), which cannot perfectly represent some fractions like 0.1.
<aside> 🎬 Terminal Tutorial: Float Precision
</aside>
print(0.1 + 0.2 == 0.3) # False!
print(0.1 + 0.2) # 0.30000000000000004
<aside>
⚠️ Data Engineering Rule: Never use float types for money. Use integers (cents) or the decimal module.
</aside>
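As a minimal sketch of the usual workarounds: compare floats with a tolerance via math.isclose, and keep money in Decimal (built from strings) or in integer cents. The prices below are made-up values for illustration only.

```python
import math
from decimal import Decimal

# Comparing floats: use a tolerance instead of ==
floats_match = math.isclose(0.1 + 0.2, 0.3)
print(floats_match)  # True

# Money with Decimal: build values from strings, not from floats
price = Decimal("0.10")  # hypothetical price
tax = Decimal("0.20")    # hypothetical tax
total = price + tax
print(total == Decimal("0.30"))  # True

# Money as integer cents: plain integer arithmetic is exact
total_cents = 10 + 20
print(total_cents == 30)  # True
```

Note that Decimal(0.1) (built from a float) would inherit the float's rounding error, which is why the sketch passes strings.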
You need a variable to store a list of users, so you call it list.
You just overwrote Python's built-in list() function. Now you can't create new lists or convert things to lists anymore.
<aside> 🎬 Terminal Tutorial: Shadowing Built-ins
</aside>
# BAD ❌
list = [1, 2, 3] # You just killed the list() function
my_tuple = (4, 5)
numbers = list(my_tuple) # TypeError: 'list' object is not callable
Common names to avoid: list, str, dict, set, type, id, min, max, sum.
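One way to sidestep the problem, sketched here with made-up variable names: choose a descriptive name, or add a trailing underscore if you really want the built-in's word.

```python
import builtins

user_ids = [1, 2, 3]  # descriptive name instead of `list`
id_ = 42              # trailing underscore keeps the built-in id() usable

my_tuple = (4, 5)
numbers = list(my_tuple)  # list() still works because it was never overwritten
print(numbers)  # [4, 5]

# Sanity check: the name `list` still refers to the built-in type
print(list is builtins.list)  # True
```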
You hardcode file paths using slashes, like data/file.csv or data\file.csv.
Windows uses backslashes (\). Mac/Linux use forward slashes (/). A hardcoded backslash path breaks on Mac/Linux, and gluing paths together by hand is easy to get wrong on any operating system.
# BAD ❌
filename = "data\\exports\\users.csv"  # Breaks on Mac/Linux
# GOOD ✅
import os
filename = os.path.join("data", "exports", "users.csv")
# BETTER (Python 3) ✅
from pathlib import Path
filename = Path("data") / "exports" / "users.csv"
You think list_b = list_a creates a second, independent list.
Both variables point to the same object in memory. If you change one, you change both.
<aside> 🎬 Terminal Tutorial: Reference Trap
</aside>
# BAD ❌
original = [1, 2, 3]
copy = original # This is NOT a copy!
copy.append(4)
print(original) # [1, 2, 3, 4] - The original was changed too!
# GOOD ✅
original = [1, 2, 3]
copy = original.copy() # Or: list(original) or original[:]
copy.append(4)
print(original) # [1, 2, 3] - Safe!
This is even more dangerous when passing lists to functions.
# BAD ❌
def add_bonus(scores):
    scores.append(10)  # Modifies the ORIGINAL list outside the function!

my_scores = [90, 80]
add_bonus(my_scores)
print(my_scores)  # [90, 80, 10] - The original list was changed!

# GOOD ✅
def add_bonus(scores):
    new_scores = scores.copy()  # Create a local copy first
    new_scores.append(10)
    return new_scores

my_scores = [90, 80]
new_scores = add_bonus(my_scores)
print(my_scores)  # [90, 80] - Safe!
You think import my_utils only gives you access to the functions inside it.
Python executes the entire file when you import it. If you have "loose" print statements or file writes at the bottom of your utility script, they will run every time you import it!
<aside> 🎬 Terminal Tutorial: Import Side-effects
</aside>
# my_utils.py
def clean(val):
    return val.strip()

# LOOSE CODE - This runs on every import!
print("Cleanup started...")
The Fix: Always wrap your "running" code in an if __name__ == "__main__": block.
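A minimal sketch of that fix, reusing the my_utils.py example. The main() function is a name added here for illustration; any code placed under the guard behaves the same way.

```python
# my_utils.py
def clean(val):
    return val.strip()


def main():
    # Runs only when the file is executed directly: python my_utils.py
    print("Cleanup started...")
    print(clean("  hello  "))


if __name__ == "__main__":
    main()
```

With the guard in place, import my_utils defines clean() and prints nothing, because __name__ is "my_utils" during an import and "__main__" only when the file is run as a script.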
You think if not value: is a safe way to check if data is missing or empty.
In Python, numbers like 0 and 0.0 are considered "falsy". If you are cleaning numeric data (like prices or counts), if not value: will trigger for a price of zero, even though zero is a perfectly valid number!
<aside> 🎬 Terminal Tutorial: Falsy Surprises
</aside>
# BAD ❌ (drops zeros)
def process_price(price):
    if not price:  # Triggers for 0.0!
        return "MISSING"
    return f"${price}"

print(process_price(0.0))  # "MISSING" - WRONG!
The Fix: Specifically check for None or the expected type.
# GOOD ✅
def process_price(price):
    if price is None:
        return "MISSING"
    return f"${price}"
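The "expected type" half of the fix could look like the sketch below. It assumes prices arrive as int, float, or None; process_price_strict is a hypothetical name, and raising TypeError for anything else is one possible design choice, not the only one.

```python
def process_price_strict(price):
    # None is the only value treated as missing
    if price is None:
        return "MISSING"
    # bool is a subclass of int, so reject it explicitly
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise TypeError(f"Expected a number, got {type(price).__name__}")
    return f"${price}"


print(process_price_strict(0.0))   # $0.0
print(process_price_strict(None))  # MISSING
```

Failing loudly on an unexpected type surfaces bad input at the point where it enters, instead of letting a stray string travel further down the pipeline.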
You use print("Error happened") to debug your code.
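print() writes plain text to stdout (Standard Output) with no timestamp, severity level, or stack trace. Depending on how the job is run (a cron job, a Docker container, a Cloud Function), that output may never be saved anywhere you can find it. If your script fails at 3 AM, your print statements may be gone forever.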
# BAD ❌
try:
    process_data()
except Exception as e:
    print(f"Error: {e}")  # Lost in the void

# GOOD ✅
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    process_data()
except Exception as e:
    logger.error("Processing failed", exc_info=True)  # Saved with stack trace!
You use filename = input("Enter filename: ") to make your script interactive.
Data Engineering scripts are almost never run by humans. They are run by Scheduler Robots (cron, Airflow, GitHub Actions). Robots cannot type on a keyboard. Your script will hang forever waiting for input, or crash with an EOFError if no terminal is attached.
# BAD ❌ (Blocks automation)
filename = input("Which file? ")

# GOOD ✅ (Accepts arguments)
import argparse

parser = argparse.ArgumentParser()
# required=True makes the script exit with a clear message if --file is missing
parser.add_argument("--file", required=True, help="Path to the CSV file")
args = parser.parse_args()
filename = args.file
You think that because a column called age in your CSV contains numbers, you can do row["age"] + 1 directly.
CSV is a text format. JSON booleans survive parsing, but every value coming out of csv.DictReader, csv.reader, or a raw HTTP response body is a string until you explicitly cast it. Forgetting to convert is the #1 day-one bug in the Week 1 assignment, and the error messages can look scary the first time you see them.
# BAD ❌ (TypeError on every row)
import csv

with open("users.csv") as f:
    for row in csv.DictReader(f):
        next_year = row["age"] + 1  # TypeError: can only concatenate str (not "int") to str

# GOOD ✅ (cast at the boundary)
import csv

with open("users.csv") as f:
    for row in csv.DictReader(f):
        next_year = int(row["age"]) + 1
int("3.14") # ❌ ValueError: invalid literal for int() with base 10: '3.14'
int(float("3.14")) # ✅ 3 (go through float first if the string has decimals)
bool("False") # ❌ True (any non-empty string is truthy — Ch2 falsy table)
"False" == "True" # ❌ False (string comparison, not boolean parsing)
# ✅ Parse explicitly:
def parse_bool(s: str) -> bool:
    return s.strip().lower() in {"true", "1", "yes", "y"}

sorted(["10", "2", "30"])           # ❌ ['10', '2', '30'] (lexicographic order)
sorted(["10", "2", "30"], key=int)  # ✅ ['2', '10', '30'] (sort by numeric value)
<aside>
💡 The fix is always the same: cast at the boundary where the data enters your program (the line that reads from CSV, JSON, env vars, or argv). Once values are the right type, the rest of your code works as written. Mixing strings and numbers deeper in the pipeline is much harder to debug.
</aside>
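Casting at the boundary could be sketched like this. The column names (name, age, active), the parse_user helper, and the in-memory CSV are all made up for illustration; the real assignment file may look different.

```python
import csv
import io


def parse_bool(s: str) -> bool:
    return s.strip().lower() in {"true", "1", "yes", "y"}


def parse_user(row: dict) -> dict:
    """Convert one raw CSV row (all strings) into typed values."""
    return {
        "name": row["name"].strip(),
        "age": int(row["age"]),
        "active": parse_bool(row["active"]),
    }


# In-memory stand-in for a users.csv file
raw = io.StringIO("name,age,active\nAlice,30,True\nBob,41,no\n")

# Every row is converted on the line that reads it
users = [parse_user(row) for row in csv.DictReader(raw)]

print(users[0]["age"] + 1)  # 31
print(users[1]["active"])   # False
```

Everything after the list comprehension works with real ints and bools, so a bad value fails in parse_user, right next to the data that caused it.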
In the next chapter you put everything Week 1 introduced to work on the assignment: cleaning a real CSV of users, writing a JSON output, documenting an AI debug session, and recording a screenshot of your Azure access.