Python is dynamically typed: you don't have to declare variable types. Type hints let you annotate your code with the types you expect, which makes it easier to read and lets tools catch bugs early.
<aside> 💡 Type hints are especially valuable in data pipelines where data flows through many functions. They act as documentation and help tools catch type mismatches before runtime.
</aside>
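One point worth making explicit: Python itself does not enforce type hints when the code runs. They are metadata for readers and tools. The following is a minimal sketch, using a small illustrative `add` function, of what that means in practice:

```python
def add(a: int, b: int) -> int:
    """Annotated for ints, but Python does not check this at runtime."""
    return a + b


# The interpreter stores the hints as metadata on the function...
hints = add.__annotations__

# ...but still runs a call that violates them (string concatenation here).
# A type checker such as Pylance or mypy would flag this line.
mismatched = add("data", "pipeline")

print(hints)
print(mismatched)
```

This is why the hints pay off only when a type checker or editor is reading them.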
# Without type hints
def process_data(records, threshold):
    results = []
    for record in records:
        if record["score"] > threshold:
            results.append(record)
    return results
Questions a reader might have:
- records? A list? A dictionary?
- threshold? Integer? Float?

# With type hints
def process_data(records: list[dict], threshold: float) -> list[dict]:
    results = []
    for record in records:
        if record["score"] > threshold:
            results.append(record)
    return results
Now it's clear:
- records is a list of dictionaries
- threshold is a float
- the function returns a list of dictionaries

# Variables
name: str = "Alice"
age: int = 25
price: float = 19.99
is_active: bool = True
# Functions
def greet(name: str) -> str:
    return f"Hello, {name}!"

def add(a: int, b: int) -> int:
    return a + b

def is_valid(value: str) -> bool:
    return len(value) > 0

# Function that returns None
def print_message(message: str) -> None:
    print(message)

# Function that might return None
def find_user(user_id: int) -> str | None:
    users = {1: "Alice", 2: "Bob"}
    return users.get(user_id)  # Returns None if not found
<aside> ⚠️ The str | None syntax (using |) requires Python 3.10+. For Python 3.9, use Optional[str] from the typing module instead.
</aside>
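As a sketch of the Python 3.9 spelling mentioned in the note above, here is the same `find_user` lookup written with `Optional[str]`. The two annotations mean the same thing to a type checker:

```python
from typing import Optional


def find_user(user_id: int) -> Optional[str]:
    """Same lookup as above; Optional[str] means 'str or None'."""
    users = {1: "Alice", 2: "Bob"}
    return users.get(user_id)


found = find_user(1)     # a str
missing = find_user(99)  # None, because the key is absent

print(found)
print(missing)
```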
# List of strings
names: list[str] = ["Alice", "Bob", "Charlie"]
# List of integers
scores: list[int] = [85, 92, 78, 95]
# List of dictionaries
records: list[dict] = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30}
]
# Dictionary with string keys and integer values
ages: dict[str, int] = {"Alice": 25, "Bob": 30}
# Dictionary with string keys and mixed value types (str, int, or bool)
config: dict[str, str | int | bool] = {
    "host": "localhost",
    "port": 8080,
    "debug": True
}
# Tuple with specific types for each position
point: tuple[int, int] = (10, 20)
person: tuple[str, int, bool] = ("Alice", 25, True)
def calculate_discount(
    price: float,
    discount_percent: float,
    min_price: float = 0.0
) -> float:
    """Calculate discounted price with a minimum floor."""
    discounted = price * (1 - discount_percent / 100)
    return max(discounted, min_price)
from typing import Callable

def apply_to_all(
    items: list[str],
    transform: Callable[[str], str]
) -> list[str]:
    """Apply a transformation function to all items."""
    return [transform(item) for item in items]

# Usage
names = ["alice", "bob"]
upper_names = apply_to_all(names, str.upper)  # ["ALICE", "BOB"]
Here's how type hints improve a data pipeline function:
# Before: no type hints
def process_sales(data, min_amount):
    filtered = [r for r in data if r["amount"] >= min_amount]
    total = sum(r["amount"] for r in filtered)
    return {"count": len(filtered), "total": total}
# After: type hints and a docstring
def process_sales(
    data: list[dict[str, float | str]],
    min_amount: float
) -> dict[str, int | float]:
    """
    Filter sales records and calculate summary statistics.

    Args:
        data: List of sales records with 'amount' and other fields.
        min_amount: Minimum amount to include in results.

    Returns:
        Dictionary with 'count' and 'total' keys.
    """
    filtered = [r for r in data if r["amount"] >= min_amount]
    total = sum(r["amount"] for r in filtered)
    return {"count": len(filtered), "total": total}
For complex types, create aliases to keep code readable:
from typing import Callable

# Define type aliases at the top of your module
Record = dict[str, str | int | float | None]
RecordList = list[Record]
TransformFunc = Callable[[Record], Record]

def transform_records(
    records: RecordList,
    transformer: TransformFunc
) -> RecordList:
    """Apply a transformation to each record."""
    return [transformer(record) for record in records]
<aside> 💡 Type aliases are great for data pipelines where you work with the same data structures repeatedly. Define them once and reuse throughout your code.
</aside>
VS Code with the Python extension (and Pylance) can check types as you write code once type checking is turned on.
Add to your .vscode/settings.json:
{
    "python.analysis.typeCheckingMode": "basic"
}
Options:
- "off": No type checking
- "basic": Catch common errors (recommended to start)
- "strict": Comprehensive checking

def calculate_total(prices: list[float]) -> float:
    return sum(prices)

# VS Code will highlight this as an error:
result = calculate_total("not a list")  # Error: str is not list[float]
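For contrast, here is a small sketch of what happens if the same call is actually run without a checker catching it first. The failure only surfaces at runtime, inside `sum`, and the message says nothing about `prices` being the wrong type:

```python
def calculate_total(prices: list[float]) -> float:
    return sum(prices)


runtime_error = None
try:
    # The editor flags this call; Python itself only fails once sum() runs.
    calculate_total("not a list")
except TypeError as exc:
    runtime_error = str(exc)
    print(f"Runtime failure: {runtime_error}")
```

The editor warning points at the call site, which is usually much closer to the real mistake than the runtime traceback.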
<aside> ⌨️ Hands-on: Open VS Code and create a function get_average(numbers: list[float]) -> float. Try calling it with wrong types and see how VS Code highlights the errors.
</aside>
def filter_records(
    records: list[dict[str, str]],
    field: str,
    value: str
) -> list[dict[str, str]]:
    """Filter records where field equals value."""
    return [r for r in records if r.get(field) == value]
def pipeline(
    data: list[dict],
    transformations: list[Callable[[dict], dict]]
) -> list[dict]:
    """Apply a series of transformations to data."""
    result = data
    for transform in transformations:
        result = [transform(record) for record in result]
    return result
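A short usage sketch may help show how the `Callable[[dict], dict]` annotation constrains what can go in the list. The two transformations below, `strip_name` and `add_name_length`, are hypothetical helpers written only for this example:

```python
from typing import Callable


def pipeline(
    data: list[dict],
    transformations: list[Callable[[dict], dict]]
) -> list[dict]:
    """Apply a series of transformations to data."""
    result = data
    for transform in transformations:
        result = [transform(record) for record in result]
    return result


def strip_name(record: dict) -> dict:
    """Hypothetical step: remove surrounding whitespace from 'name'."""
    return {**record, "name": record["name"].strip()}


def add_name_length(record: dict) -> dict:
    """Hypothetical step: add a derived 'name_length' field."""
    return {**record, "name_length": len(record["name"])}


raw = [{"name": " alice "}, {"name": "bob"}]

# Steps run in list order, so the length is computed on the stripped name.
processed = pipeline(raw, [strip_name, add_name_length])
print(processed)
```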
from pathlib import Path

def read_file(filepath: str | Path) -> str:
    """Read and return file contents."""
    with open(filepath, "r") as f:
        return f.read()
Type hints are optional. Skip them when the type is already obvious from the code:
# Type hints not needed here - it's obvious
x = 5
name = "Alice"
# Type hints helpful here - clarifies the function's contract
def process_batch(records: list[dict], batch_size: int = 100) -> list[list[dict]]:
    ...