Week 1 - Python Foundations

Data Types and Variables

Before diving into functions and modules, you need to understand Python's core building blocks: variables and data types. These fundamentals are essential for every data pipeline you'll build.

<aside> 💡 Core Program Refresher

You've already encountered many of these concepts in the Core program. If you need a quick reminder of how JavaScript handled these, revisit the corresponding Core program lessons.

While the syntax is different, the logic of "storing data in boxes" remains the same!

</aside>

Why Data Types Matter

Python is dynamically typed, so you don't declare variable types explicitly: Python infers the type from the value you assign. This makes Python flexible and quick to write, but understanding data types is still crucial because:

  1. Operations depend on types: You can add two numbers, but adding a number to a string raises a TypeError.
  2. Data pipelines parse raw data: CSV files, JSON responses, and API payloads all come as text. You need to convert them to the right types.
  3. Memory and performance: Different types use different amounts of memory. For large datasets, this matters.
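
The first two points can be sketched as follows. This is a minimal illustration, assuming a hypothetical row of text values as it might arrive from a CSV file; the variable names are made up for the example.

```python
# Hypothetical raw row: every field arrives as text
raw_row = ["Alice", "28", "65000.50"]

name = raw_row[0]
age_text = raw_row[1]

# Adding a number to a string raises a TypeError
try:
    next_year = age_text + 1
except TypeError as error:
    print(f"TypeError: {error}")

# Convert first, then the arithmetic works
age = int(age_text)
salary = float(raw_row[2])
next_year = age + 1

print(type(age_text).__name__, type(age).__name__)  # str int
print(next_year)  # 29
```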

<aside> 💡 In data engineering, you'll constantly convert between types: parsing strings from files into numbers, dates, or booleans for processing.

</aside>

Variables

A variable is a name that refers to a value stored in memory. Think of it as a label on a box.

<aside> 🎬 Animation: Variables as References

</aside>

Creating Variables

# Assign values to variables
name = "Alice"
age = 28
salary = 65000.50
is_active = True

# Print them
print(name)      # Output: Alice
print(age)       # Output: 28

Naming Conventions

Python has rules and conventions for variable names:

| Rule | Example | Valid? |
| --- | --- | --- |
| Start with letter or underscore | my_var, _private | ✅ |
| Start with number | 2nd_value | ❌ |
| Contain spaces | my var | ❌ |
| Use snake_case | user_count | ✅ (recommended) |
| Use camelCase | userCount | ✅ (not Pythonic) |

<aside> ⚠️ Always use snake_case for variable names in Python. This is the community standard defined in PEP 8.

</aside>
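
If you are unsure whether a name is allowed, you can ask Python itself. The sketch below checks the example names from the naming rules above with `str.isidentifier()`. It also checks one extra case that is not in the rules above: reserved keywords such as `class` pass the identifier check but still cannot be used as variable names, so the example rules them out with the standard `keyword` module.

```python
import keyword

# Example names from the naming rules, plus one reserved keyword
candidates = ["my_var", "_private", "2nd_value", "my var", "user_count", "userCount", "class"]

results = {}
for candidate in candidates:
    # Valid means: legal identifier syntax and not a reserved keyword
    is_valid = candidate.isidentifier() and not keyword.iskeyword(candidate)
    results[candidate] = is_valid
    print(f"{candidate!r}: {'valid' if is_valid else 'invalid'}")
```

Note that this only checks what Python accepts. `userCount` is reported as valid even though snake_case is the convention.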

<aside> 🤓 Curious Geek: PEP 8

PEP stands for Python Enhancement Proposal.

It's the mechanism used to propose new features, collect community input, and document design decisions for Python.

PEP 8 is famous for style, but there are hundreds of PEPs that shaped the language!

</aside>

Multiple Assignment

Python lets you assign multiple variables at once:

# Assign the same value to multiple variables
x = y = z = 0

# Assign different values in one line (unpacking)
name, age, city = "Bob", 30, "Amsterdam"

# Swap values without a temporary variable
a, b = 10, 20
a, b = b, a  # Now a=20, b=10
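
Unpacking is also handy when you split a line of text into parts. A small sketch, using made-up sample strings: the number of names on the left must match the number of values on the right, otherwise Python raises a ValueError.

```python
# Unpack the three parts of a date string
year, month, day = "2024-01-15".split("-")
print(year, month, day)  # 2024 01 15

# Mismatched counts raise a ValueError
try:
    first, second = "a,b,c".split(",")
except ValueError as error:
    print(f"ValueError: {error}")
```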

<aside> ⌨️ Hands on: You receive min_age = 45 and max_age = 18. If min_age > max_age, swap them (using tuple unpacking) and print both values.

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=1&chapter=data_types_variables&exercise=w1_data_types_variables__swap_min_max&lang=python

</aside>

Numeric Types

Python has three built-in numeric types: int, float, and complex. For data engineering, you'll mostly use integers and floats.

Integers (int)

Whole numbers, positive or negative, without decimals:

record_count = 1500
offset = -10
year = 2024

# Arithmetic operations
total = record_count + 500      # 2000
doubled = record_count * 2      # 3000
per_batch = record_count // 100 # 15 (floor division)
remainder = record_count % 100  # 0 (modulo)

Floats (float)

Numbers with decimal points:

price = 19.99
temperature = -3.5
percentage = 0.75

# Be careful with float precision
result = 0.1 + 0.2
print(result)  # Output: 0.30000000000000004 (not exactly 0.3!)

<aside> ⚠️ Floats have precision limitations. For financial calculations, use Python's decimal module instead.

</aside>
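
A short sketch of both options, with illustrative amounts: `math.isclose` compares floats within a tolerance, and `Decimal` values built from strings stay exact.

```python
from decimal import Decimal
import math

# Floats: the binary representation introduces a tiny error
float_total = 0.1 + 0.2
print(float_total == 0.3)              # False
print(math.isclose(float_total, 0.3))  # True (compares within a tolerance)

# Decimal: build values from strings so they stay exact
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total)                     # 0.30
print(decimal_total == Decimal("0.30"))  # True
```

Create `Decimal` values from strings rather than floats. `Decimal(0.1)` would carry over the float's rounding error.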

Type Conversion

Convert between strings and numeric types:

# String to int
count_str = "42"
count = int(count_str)  # 42

# String to float
price_str = "19.99"
price = float(price_str)  # 19.99

# Float to int (truncates, doesn't round)
value = int(3.9)  # 3

# Int to float
whole = float(5)  # 5.0
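
Raw data is rarely this clean. The sketch below assumes a hypothetical list of raw text values and shows what happens when a conversion fails: `int()` raises a ValueError for text that is not a whole number, including decimal text such as "3.9".

```python
# Hypothetical raw values, as they might come out of a file
raw_values = ["42", " 7 ", "3.9", "n/a"]

converted = []
for raw in raw_values:
    try:
        converted.append(int(raw))  # Surrounding whitespace is allowed
    except ValueError:
        converted.append(None)      # Mark values that could not be parsed

print(converted)  # [42, 7, None, None]

# To turn decimal text into an int, go through float first
print(int(float("3.9")))  # 3
```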

<aside> ⌨️ Hands on: Create variables for a pipeline's metrics: total_records = 1000, processed = 850, failed = 150. Calculate and print the success rate as a percentage.

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=1&chapter=data_types_variables&exercise=w1_data_types_variables__success_rate&lang=python

</aside>

Strings

Strings are sequences of characters, used for text data. In data engineering, you'll work with strings constantly: file paths, column names, log messages, and parsed data.

Creating Strings

# Single or double quotes (no difference)
name = "Alice"
name = 'Alice'

# Multi-line strings with triple quotes
query = """
SELECT *
FROM users
WHERE active = true
"""

# f-strings for formatting (Python 3.6+)
user = "Bob"
count = 42
message = f"User {user} processed {count} records"
print(message)  # Output: User Bob processed 42 records

<aside> ⌨️ Hands on: Build a folder path like data/2024/01/05/events.csv using year = 2024, month = 1, day = 5 (month/day should be zero-padded to 2 digits).

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=1&chapter=data_types_variables&exercise=w1_data_types_variables__path_padding&lang=python

</aside>

Common String Methods

These methods are essential for cleaning data:

raw_value = "  HELLO World  "

# Remove whitespace
cleaned = raw_value.strip()       # "HELLO World"

# Change case
lower = raw_value.lower()         # "  hello world  "
upper = raw_value.upper()         # "  HELLO WORLD  "

# Replace characters
fixed = raw_value.replace(" ", "_")  # "__HELLO_World__"

# Split into list
parts = "a,b,c".split(",")        # ["a", "b", "c"]

# Join list into string
joined = "-".join(["2024", "01", "15"])  # "2024-01-15"
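
In practice these methods are usually combined. A minimal sketch, assuming a made-up comma-separated line with messy spacing:

```python
raw_line = "  42 , Alice Smith ,AMSTERDAM \n"

# Split on commas, then strip whitespace from each field
fields = raw_line.split(",")
user_id = fields[0].strip()
full_name = fields[1].strip()
city = fields[2].strip().lower()

print(user_id)    # 42
print(full_name)  # Alice Smith
print(city)       # amsterdam

# Methods return new strings, so calls can be chained
column_name = " Total Amount ".strip().lower().replace(" ", "_")
print(column_name)  # total_amount
```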

Indexing and Slicing

Access individual characters or substrings:

text = "Python"

# Indexing (0-based)
first = text[0]   # "P"
last = text[-1]   # "n"

# Slicing [start:end] (end is exclusive)
first_three = text[0:3]   # "Pyt"
last_three = text[-3:]    # "hon"
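
Slicing is useful for fixed-width values such as dates. The sketch below uses an illustrative date string and also shows one difference between the two operations: an out-of-range slice is silently clipped, while an out-of-range index raises an IndexError.

```python
date_text = "2024-01-15"

year = date_text[0:4]    # "2024"
month = date_text[5:7]   # "01"
day = date_text[-2:]     # "15"

# Omitting the start slices from the beginning
prefix = date_text[:7]   # "2024-01"

# Slices are clipped to the string length
print(date_text[8:100])  # 15

# Indexing past the end raises an error
try:
    date_text[100]
except IndexError as error:
    print(f"IndexError: {error}")
```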

<aside> ⌨️ Hands on: Given raw_name = " john DOE ", clean it to produce "John Doe" (title case, no extra whitespace).

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=1&chapter=data_types_variables&exercise=w1_data_types_variables__clean_name&lang=python

</aside>

Booleans

Booleans represent truth values: True or False. They're used in conditions and validation.

is_valid = True
has_errors = False

# Comparison operators return booleans
print(10 > 5)      # True
print(10 == 5)     # False
print(10 != 5)     # True
print("a" in "abc") # True

# Combine with logical operators
print(True and False)  # False
print(True or False)   # True
print(not True)        # False

Truthiness

Python considers some values "truthy" (act like True) and others "falsy" (act like False):

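| Falsy | Truthy |
| --- | --- |
| False, None | True |
| 0, 0.0 | Any non-zero number |
| "" (empty string) | Any non-empty string |
| [], {}, () (empty collections) | Any non-empty collection |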

# This is useful for validation
data = []
if data:
    print("Has data")
else:
    print("Empty!")  # This prints
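
You can check how Python treats any value with `bool()`. The sample values below are chosen to highlight a common surprise when parsing text: a string containing only a space, or the text "0", is not empty and therefore truthy.

```python
values = [0, 1, -1, 0.0, "", " ", "0", [], [0], {}, None]

truthy_values = []
for value in values:
    print(f"{value!r:>6} -> {bool(value)}")
    if value:
        truthy_values.append(value)

print(truthy_values)  # [1, -1, ' ', '0', [0]]
```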

<aside> ⌨️ Hands on: Given raw_email = "" and raw_phone = None, print "Missing contact info" if both values are missing (falsy), otherwise print "OK".

</aside>

Lists

Lists are ordered, mutable collections. They're your go-to for storing sequences of items.

<aside> 🎬 Animation: Mutable vs Immutable

</aside>

Creating and Modifying Lists

# Create a list
fruits = ["apple", "banana", "cherry"]
numbers = [1, 2, 3, 4, 5]
mixed = [1, "two", 3.0, True]  # Can mix types (but avoid this)

# Access elements
first = fruits[0]      # "apple"
last = fruits[-1]      # "cherry"

# Modify elements
fruits[0] = "avocado"  # ["avocado", "banana", "cherry"]

# Add elements
fruits.append("date")           # Add to end
fruits.insert(0, "apricot")     # Add at position

# Remove elements
fruits.remove("banana")         # Remove by value
popped = fruits.pop()           # Remove and return last item
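
Because lists are mutable, assigning a list to a second name does not copy it: both names refer to the same list. A minimal sketch with made-up file names, contrasting this with `.copy()` and with immutable strings:

```python
batch = ["a.csv", "b.csv"]

alias = batch            # Same list, second name
snapshot = batch.copy()  # New list with the same items

batch.append("c.csv")

print(alias)     # ['a.csv', 'b.csv', 'c.csv']
print(snapshot)  # ['a.csv', 'b.csv']
print(alias is batch, snapshot is batch)  # True False

# Strings are immutable: methods return a new string instead
name = "alice"
upper_name = name.upper()
print(name, upper_name)  # alice ALICE
```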

List Operations

numbers = [3, 1, 4, 1, 5]

# Length
print(len(numbers))  # 5

# Check membership
print(3 in numbers)  # True

# Sort (modifies in place)
numbers.sort()       # [1, 1, 3, 4, 5]

# Iterate
for num in numbers:
    print(num)
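
A common stumbling block is the difference between `list.sort()` and the built-in `sorted()`. The sketch below shows both on the same sample numbers:

```python
numbers = [3, 1, 4, 1, 5]

# sorted() returns a new list and leaves the original untouched
ordered = sorted(numbers)
print(ordered)  # [1, 1, 3, 4, 5]
print(numbers)  # [3, 1, 4, 1, 5]

# list.sort() changes the list in place and returns None
result = numbers.sort(reverse=True)
print(result)   # None
print(numbers)  # [5, 4, 3, 1, 1]
```

Writing `numbers = numbers.sort()` therefore replaces your list with None.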

Slicing Lists

Just like strings, lists support slicing:

data = [0, 1, 2, 3, 4, 5]

first_three = data[:3]   # [0, 1, 2]
last_two = data[-2:]     # [4, 5]
every_other = data[::2]  # [0, 2, 4]
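
Slices are a simple way to process data in fixed-size batches. A minimal sketch, assuming an illustrative list of records and a batch size of 3:

```python
records = [0, 1, 2, 3, 4, 5, 6]
batch_size = 3

batches = []
# Step through the list in jumps of batch_size
for start in range(0, len(records), batch_size):
    batches.append(records[start:start + batch_size])

print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last batch is shorter because slices are clipped at the end of the list instead of raising an error.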

<aside> ⌨️ Hands on: Given ids = [101, 102, 103, 104, 105], create batch = ids[-3:] and print it. Then reverse the batch and print the reversed list.

</aside>

Dictionaries

Dictionaries store key-value pairs. They're essential for data engineering because JSON data maps directly to Python dicts.

Creating and Accessing Dictionaries

# Create a dictionary
user = {
    "name": "Alice",
    "age": 28,
    "city": "Amsterdam"
}

# Access values by key
print(user["name"])      # "Alice"
print(user.get("email")) # None (no KeyError if missing)
print(user.get("email", "N/A"))  # "N/A" (default value)

# Add or modify
user["email"] = "alice@example.com"
user["age"] = 29

# Remove
del user["city"]
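
To illustrate the JSON connection, the sketch below parses a made-up JSON payload with the standard `json` module. JSON objects become dicts, and `true` and `null` become `True` and `None`.

```python
import json

# Hypothetical API payload, received as text
payload = '{"name": "Alice", "age": 28, "active": true, "email": null}'

user = json.loads(payload)
print(type(user).__name__)  # dict
print(user["age"] + 1)      # 29
print(user["active"])       # True
print(user["email"])        # None

# And back to JSON text
print(json.dumps({"name": "Bob", "active": False}))  # {"name": "Bob", "active": false}
```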

Iterating Over Dictionaries

user = {"name": "Alice", "age": 28}

# Keys
for key in user:
    print(key)

# Values
for value in user.values():
    print(value)

# Both
for key, value in user.items():
    print(f"{key}: {value}")
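
Iteration and `.get()` combine well for counting. A small sketch, assuming a hypothetical list of record statuses:

```python
statuses = ["ok", "failed", "ok", "ok"]

counts = {}
for status in statuses:
    # .get() returns 0 the first time a status is seen
    counts[status] = counts.get(status, 0) + 1

print(counts)  # {'ok': 3, 'failed': 1}

for status, count in counts.items():
    print(f"{status}: {count}")
```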

Nested Dictionaries

Real-world data often has nested structures: