Python is dynamically typed: you don't have to declare variable types. Type hints let you annotate your code with the types you expect, which makes it easier to understand and helps catch bugs early.
<aside> 💡 Type hints are especially valuable in data pipelines where data flows through many functions. They act as documentation and help tools catch type mismatches before runtime.
</aside>
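As a minimal sketch of what "before runtime" means: Python itself does not enforce annotations, so a mismatched call can run without raising, and only a static checker reports it. The function name `double` is hypothetical, chosen purely for illustration.

```python
def double(value: int) -> int:
    """Return twice the given value."""
    return value * 2


# Matches the annotation: an int goes in, an int comes out.
expected = double(21)

# Mismatch: a str is passed where an int is annotated.
# Python still runs this, because str supports `*` with an int.
# A static type checker would flag the call; the interpreter does not.
surprise = double("ab")

print(expected)
print(surprise)
```

The second call silently produces a repeated string rather than a number, which is the kind of bug that can travel a long way through a pipeline before anyone notices.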
def process_data(records, threshold):
    results = []
    for record in records:
        if record["score"] > threshold:
            results.append(record)
    return results
Questions a reader might have:
- What is records? A list? A dictionary?
- What is threshold? Integer? Float?

def process_data(records: list[dict], threshold: float) -> list[dict]:
    results = []
    for record in records:
        if record["score"] > threshold:
            results.append(record)
    return results
Now it's clear:
- records is a list of dictionaries
- threshold is a float

# Variables
name: str = "Alice"
age: int = 25
price: float = 19.99
is_active: bool = True

# Functions
def greet(name: str) -> str:
    return f"Hello, {name}!"

def add(a: int, b: int) -> int:
    return a + b

def is_valid(value: str) -> bool:
    return len(value) > 0

# Function that returns None
def print_message(message: str) -> None:
    print(message)

# Function that might return None
def find_user(user_id: int) -> str | None:
    users = {1: "Alice", 2: "Bob"}
    return users.get(user_id)  # Returns None if not found
<aside> ⚠️ The str | None syntax (using |) requires Python 3.10+. On Python 3.9 or earlier, use Optional[str] from the typing module instead.
</aside>
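A minimal sketch of the older spelling, plus the check a caller needs before using a value that may be None. `find_user` mirrors the example above; `greeting_for` is a hypothetical helper added only for illustration.

```python
from typing import Optional


def find_user(user_id: int) -> Optional[str]:
    """Same as `-> str | None`, spelled in the pre-3.10 form."""
    users = {1: "Alice", 2: "Bob"}
    return users.get(user_id)


def greeting_for(user_id: int) -> str:
    """Hypothetical caller: handle the None case before using the name."""
    name = find_user(user_id)
    if name is None:
        # Inside this branch the checker knows there is no name.
        return "Hello, stranger!"
    # Past the check, `name` is known to be a str.
    return f"Hello, {name}!"


print(greeting_for(1))
print(greeting_for(3))
```

The `if name is None` check is what lets a type checker treat `name` as a plain str afterwards; without it, using a str method on the result would typically be flagged.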
list[str] is a generic type: a built-in collection (list) annotated with the type of its contents (str). Since Python 3.9, you spell these with built-in lowercase types; before 3.9 you had to import List, Dict, etc. from typing.
# List of strings
names: list[str] = ["Alice", "Bob", "Charlie"]
# List of integers
scores: list[int] = [85, 92, 78, 95]
# List of dictionaries
records: list[dict] = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30}
]
# Dictionary with string keys and integer values
ages: dict[str, int] = {"Alice": 25, "Bob": 30}
# Dictionary with string keys and mixed value types (str, int, or bool)
config: dict[str, str | int | bool] = {
    "host": "localhost",
    "port": 8080,
    "debug": True
}
# Tuple with specific types for each position
point: tuple[int, int] = (10, 20)
person: tuple[str, int, bool] = ("Alice", 25, True)
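Generic types can be nested. As a sketch, the hypothetical function `group_scores` below takes a list of (name, score) tuples and returns a dictionary that maps each name to a list of scores; the sample data is made up for illustration.

```python
def group_scores(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group scores by name, keeping the order in which they appear."""
    grouped: dict[str, list[int]] = {}
    for name, score in pairs:
        # setdefault creates the empty list the first time a name is seen.
        grouped.setdefault(name, []).append(score)
    return grouped


pairs = [("alice", 85), ("bob", 92), ("alice", 78)]
print(group_scores(pairs))
```

Reading the annotations from the outside in is usually enough: a list, whose items are tuples, whose positions are a str and an int.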
def calculate_discount(
    price: float,
    discount_percent: float,
    min_price: float = 0.0
) -> float:
    """Calculate discounted price with a minimum floor."""
    discounted = price * (1 - discount_percent / 100)
    return max(discounted, min_price)
from typing import Callable

def apply_to_all(
    items: list[str],
    transform: Callable[[str], str]
) -> list[str]:
    """Apply a transformation function to all items."""
    return [transform(item) for item in items]

# Usage
names = ["alice", "bob"]
upper_names = apply_to_all(names, str.upper)  # ["ALICE", "BOB"]
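The same notation extends to functions with several parameters: the inner list holds the argument types and the last entry is the return type. A small sketch, where `combine_pairs` and `add` are hypothetical names used only for this illustration:

```python
from typing import Callable


def combine_pairs(
    pairs: list[tuple[int, int]],
    operation: Callable[[int, int], int]
) -> list[int]:
    """Apply a two-argument operation to each (a, b) pair."""
    return [operation(a, b) for a, b in pairs]


def add(a: int, b: int) -> int:
    return a + b


# Both a named function and a lambda fit Callable[[int, int], int].
sums = combine_pairs([(1, 2), (3, 4)], add)
products = combine_pairs([(1, 2), (3, 4)], lambda a, b: a * b)

print(sums)
print(products)
```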
Here's how type hints improve a data pipeline function:
def process_sales(data, min_amount):
    filtered = [r for r in data if r["amount"] >= min_amount]
    total = sum(r["amount"] for r in filtered)
    return {"count": len(filtered), "total": total}
def process_sales(
    data: list[dict[str, float | str]],
    min_amount: float
) -> dict[str, int | float]:
    """
    Filter sales records and calculate summary statistics.

    Args:
        data: List of sales records with 'amount' and other fields.
        min_amount: Minimum amount to include in results.

    Returns:
        Dictionary with 'count' and 'total' keys.
    """
    filtered = [r for r in data if r["amount"] >= min_amount]
    total = sum(r["amount"] for r in filtered)
    return {"count": len(filtered), "total": total}
For complex types, create type aliases to keep code readable:
# Define type aliases at the top of your module
from typing import Callable

Record = dict[str, str | int | float | None]
RecordList = list[Record]
TransformFunc = Callable[[Record], Record]

def transform_records(
    records: RecordList,
    transformer: TransformFunc
) -> RecordList:
    """Apply a transformation to each record."""
    return [transformer(record) for record in records]
<aside> 💡 Type aliases are great for data pipelines where you work with the same data structures repeatedly. Define them once and reuse throughout your code.
</aside>
Pylance, which VS Code installs alongside the Python extension, includes a static type checker: it reads your annotations without running the code and flags type mismatches as you type. The CLI equivalent is mypy. Both rely on the same hint syntax this chapter teaches. Typing is opt-in (a design called gradual typing): you can annotate one function or file at a time, and partially annotated code stays valid Python.
Add to your .vscode/settings.json:
{
    "python.analysis.typeCheckingMode": "basic"
}
Options:
- "off": No type checking
- "basic": Catch common errors (recommended to start)
- "strict": Comprehensive checking

def calculate_total(prices: list[float]) -> float:
    return sum(prices)

# VS Code will highlight this as an error:
result = calculate_total("not a list")  # Error: str is not list[float]
<aside>
⌨️ Hands on: Open VS Code and create a function get_average(numbers: list[float]) -> float. Try calling it with wrong types and see how VS Code highlights the errors.
</aside>
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=1&chapter=type_hints&exercise=w1_type_hints__get_average&lang=python
</aside>
def filter_records(
    records: list[dict[str, str]],
    field: str,
    value: str
) -> list[dict[str, str]]:
    """Filter records where field equals value."""
    return [r for r in records if r.get(field) == value]
def pipeline(
    data: list[dict],
    transformations: list[Callable[[dict], dict]]
) -> list[dict]:
    """Apply a series of transformations to data."""
    result = data
    for transform in transformations:
        result = [transform(record) for record in result]
    return result
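A short usage sketch for the pipeline pattern. `pipeline` repeats the definition above so the block is self-contained; the two transformation functions and the sample data are hypothetical.

```python
from typing import Callable


def pipeline(
    data: list[dict],
    transformations: list[Callable[[dict], dict]]
) -> list[dict]:
    """Apply a series of transformations to data."""
    result = data
    for transform in transformations:
        result = [transform(record) for record in result]
    return result


def strip_name(record: dict) -> dict:
    """Hypothetical step: remove surrounding whitespace from 'name'."""
    return {**record, "name": record["name"].strip()}


def add_is_adult(record: dict) -> dict:
    """Hypothetical step: derive an 'is_adult' flag from 'age'."""
    return {**record, "is_adult": record["age"] >= 18}


people = [{"name": " Alice ", "age": 25}, {"name": "Bob", "age": 16}]

# Steps run in list order: strip names first, then add the flag.
cleaned = pipeline(people, [strip_name, add_is_adult])
print(cleaned)
```

Because every step has the same dict -> dict signature, steps can be reordered or added without changing `pipeline` itself.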
from pathlib import Path

def read_file(filepath: str | Path) -> str:
    """Read and return file contents."""
    with open(filepath, "r") as f:
        return f.read()
Type hints are optional. Skip them when:
# Type hints not needed here - it's obvious
x = 5
name = "Alice"
# Type hints helpful here - clarifies the function's contract
def process_batch(records: list[dict], batch_size: int = 100) -> list[list[dict]]:
    ...
<aside> 🤓 Curious Geek: Python's gradual-typing arc
Python type hints arrived with PEP 484, drafted in 2014 and released with Python 3.5 in 2015. It was co-authored by Guido van Rossum, Jukka Lehtosalo (creator of mypy), and Łukasz Langa. Their goal was gradual typing: opt in piece by piece so a 2-million-line dynamically-typed codebase like Dropbox or Instagram could migrate without a rewrite. Two later PEPs polished the syntax: PEP 585 (Python 3.9, 2020) lets you write list[str] instead of List[str] (no more from typing import List), and PEP 604 (Python 3.10, 2021) lets you write str | None instead of Optional[str]. The runtime still ignores the annotations: tools like mypy and Pylance read them statically. That choice keeps Python's "just write code" feel intact, which is exactly why type hints succeeded where stricter proposals failed.
</aside>
The track expects you to add hints to every public function you write from this point on, so the next reasonable step is to use them on something concrete.
<aside>
📝 Practice: The week's Practice chapter has two exercises that use type hints: Ex 1 (the Temperature Logger: annotate c_to_f(celsius: float) -> float) and Ex 4 (Grade Processor: annotate the helpers you split out). Both take a few minutes.
</aside>
<aside> 🚀 Try it in the widget: Interactive Quiz: Type Hints
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_1_ch5_type_hints_quiz&embed=1
- str or None? Give the modern (3.10+) form and the older typing form.
- [("alice", 25), ("bob", 30)]).
- What does Callable[[int, int], int] describe? Give an example of a function it would match.

Next up: Command-Line Interface Habits, where you leave the editor and start running your typed Python scripts from the terminal, picking up the habits (argument parsing, exit codes, stdout vs stderr) that make a script feel like a proper pipeline step.