
Data Types and Variables

Before diving into functions, you need to understand Python's core building blocks: variables and data types. These fundamentals are essential for every data pipeline you'll build.

Why Data Types Matter

Python is dynamically typed: you don't declare variable types explicitly. The type belongs to the value, and Python determines it at runtime from whatever you assign. This makes Python flexible and quick to write, but understanding data types is still crucial because:

Operations depend on types: You can add two numbers, but adding a number to a string causes an error.
Data pipelines parse raw data: CSV files, JSON responses, and API payloads all come as text. You need to convert them to the right types.
Memory and performance: Different types use different amounts of memory. For large datasets, this matters.

💡 In data engineering, you'll constantly convert between types: parsing strings from files into numbers, dates, or booleans for processing.
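
A minimal sketch of the first two points above. The names `raw_count` and `raw_price` are hypothetical fields, standing in for text read from a file:

```python
# Values read from a CSV file or API payload typically arrive as strings
raw_count = "42"      # hypothetical field
raw_price = "19.99"   # hypothetical field

# Adding a number to a string raises a TypeError
try:
    total = 10 + raw_count
except TypeError as exc:
    print(f"Cannot add int and str: {exc}")

# Convert to the right type first, then compute
count = int(raw_count)
price = float(raw_price)
total = 10 + count
print(total)         # Output: 52
print(type(price))   # Output: <class 'float'>
```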

Variables

A variable is a name that refers to a value stored in memory. Think of it as a label on a box.
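
Because a variable is a reference rather than a container, two names can point to the same value. The sketch below uses a list (covered later in this lesson) to make that visible; the names `batch` and `alias` are illustrative only:

```python
# Two names can refer to the same list object
batch = [1, 2, 3]
alias = batch          # no copy is made; both names point to one list
alias.append(4)
print(batch)           # Output: [1, 2, 3, 4]
print(alias is batch)  # Output: True

# Rebinding a name points it at a new object and leaves the other name alone
alias = [9, 9]
print(batch)           # Output: [1, 2, 3, 4]
```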

🎬 Animation: Variables as References

Creating Variables

# Assign values to variables
name = "Alice"
age = 28
salary = 65000.50
is_active = True

# Print them
print(name)      # Output: Alice
print(age)       # Output: 28

Naming Conventions

Python has rules and conventions for variable names:

Rule	Example	Valid?
Start with letter or underscore	`my_var`, `_private`	✅
Start with number	`2nd_value`	❌
Contain spaces	`my var`	❌
Use snake_case	`user_count`	✅ (recommended)
Use camelCase	`userCount`	✅ (not Pythonic)

⚠️ Use snake_case for variable names in Python. This is the convention recommended by PEP 8, the community style guide.

Multiple Assignment

Python lets you assign multiple variables at once:

# Assign the same value to multiple variables
x = y = z = 0

# Assign different values in one line (unpacking)
name, age, city = "Bob", 30, "Amsterdam"

# Swap values without a temporary variable
a, b = 10, 20
a, b = b, a  # Now a=20, b=10

Numeric Types

Python has three built-in numeric types: int, float, and complex. For data engineering, you'll mostly use integers and floats.

Integers (`int`)

Whole numbers, positive or negative, without decimals:

record_count = 1500
offset = -10
year = 2024

# Arithmetic operations
total = record_count + 500      # 2000
doubled = record_count * 2      # 3000
per_batch = record_count // 100 # 15 (floor division)
remainder = record_count % 100  # 0 (modulo)
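
The difference between true division and floor division is easy to miss. A short sketch, assuming a hypothetical `batch_size` of 100 and a record count that does not divide evenly:

```python
record_count = 1550
batch_size = 100   # hypothetical batch size

# True division always returns a float
print(record_count / batch_size)    # Output: 15.5

# Floor division and modulo return ints when both operands are ints
print(record_count // batch_size)   # Output: 15
print(record_count % batch_size)    # Output: 50

# divmod() returns both results at once
full_batches, leftover = divmod(record_count, batch_size)
print(full_batches, leftover)       # Output: 15 50
```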

Floats (`float`)

Numbers with decimal points:

price = 19.99
temperature = -3.5
percentage = 0.75

# Be careful with float precision
result = 0.1 + 0.2
print(result)  # Output: 0.30000000000000004 (not exactly 0.3!)

⚠️ Floats have precision limitations. For financial calculations, use Python's decimal module instead.
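
A minimal sketch of the decimal module in use. The amounts are made up; the point is that building `Decimal` values from strings keeps the exact decimal digits:

```python
from decimal import Decimal

# Build Decimals from strings so the exact decimal digits are kept
subtotal = Decimal("0.10") + Decimal("0.20")
print(subtotal)                      # Output: 0.30
print(subtotal == Decimal("0.30"))   # Output: True

# The same sum with floats does not compare equal to 0.3
print(0.1 + 0.2 == 0.3)              # Output: False
```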

Type Conversion

Convert between strings and numeric types, and between the numeric types themselves:

# String to int
count_str = "42"
count = int(count_str)  # 42

# String to float
price_str = "19.99"
price = float(price_str)  # 19.99

# Float to int (truncates toward zero, doesn't round)
value = int(3.9)  # 3

# Int to float
whole = float(5)  # 5.0
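
Raw data is rarely clean, and `int()` raises a `ValueError` when the text is not a valid integer. One possible way to handle that, sketched with a hypothetical list of raw fields:

```python
raw_values = ["42", " 7 ", "19.99", "n/a"]  # hypothetical raw fields

parsed = []
for raw in raw_values:
    try:
        parsed.append(int(raw))   # int() tolerates surrounding whitespace
    except ValueError:
        parsed.append(None)       # mark values that could not be parsed

print(parsed)  # Output: [42, 7, None, None]

# A decimal string must go through float() before int()
print(int(float("19.99")))  # Output: 19
```

Storing `None` for bad values is just one choice; a pipeline might instead log the value or skip the record.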

⌨️ Hands-on: Create variables for a pipeline's metrics: total_records = 1000, processed = 850, failed = 150. Calculate and print the success rate as a percentage.

Strings

Strings are sequences of characters, used for text data. In data engineering, you'll work with strings constantly — file paths, column names, log messages, and parsed data.

Creating Strings

# Single or double quotes (no difference)
name = "Alice"
name = 'Alice'

# Multi-line strings with triple quotes
query = """
SELECT *
FROM users
WHERE active = true
"""

# f-strings for formatting (Python 3.6+)
user = "Bob"
count = 42
message = f"User {user} processed {count} records"
print(message)  # Output: User Bob processed 42 records

Common String Methods

These methods are essential for cleaning data:

raw_value = "  HELLO World  "

# Remove whitespace
cleaned = raw_value.strip()       # "HELLO World"

# Change case
lower = raw_value.lower()         # "  hello world  "
upper = raw_value.upper()         # "  HELLO WORLD  "

# Replace characters
fixed = raw_value.replace(" ", "_")  # "__HELLO_World__"

# Split into list
parts = "a,b,c".split(",")        # ["a", "b", "c"]

# Join list into string
joined = "-".join(["2024", "01", "15"])  # "2024-01-15"
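
These methods are often combined. The sketch below cleans one line of comma-separated text; `raw_line` is a hypothetical input, not taken from a real file:

```python
raw_line = "  Alice , 28 ,AMSTERDAM \n"  # hypothetical line from a CSV file

# Split on commas, then strip whitespace from each field
fields = []
for field in raw_line.split(","):
    fields.append(field.strip())
print(fields)  # Output: ['Alice', '28', 'AMSTERDAM']

name = fields[0]
age = int(fields[1])
city = fields[2].lower()
print(name, age, city)  # Output: Alice 28 amsterdam

# String methods return new strings; the original is unchanged
print(raw_line == "  Alice , 28 ,AMSTERDAM \n")  # Output: True
```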

Indexing and Slicing

Access individual characters or substrings:

text = "Python"

# Indexing (0-based)
first = text[0]   # "P"
last = text[-1]   # "n"

# Slicing [start:end] (end is exclusive)
first_three = text[0:3]   # "Pyt"
last_three = text[-3:]    # "hon"
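
Slicing is handy for fixed-position text. A small sketch, assuming a hypothetical date string in `YYYY-MM-DD` form:

```python
date_str = "2024-01-15"  # hypothetical ISO-formatted date string

year = int(date_str[:4])     # characters 0-3
month = int(date_str[5:7])   # characters 5-6
day = int(date_str[-2:])     # last two characters
print(year, month, day)      # Output: 2024 1 15
```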

⌨️ Hands-on: Given raw_name = " john DOE ", clean it to produce "John Doe" (title case, no extra whitespace).

Booleans

Booleans represent truth values: True or False. They're used in conditions and validation.

is_valid = True
has_errors = False

# Comparison operators return booleans
print(10 > 5)      # True
print(10 == 5)     # False
print(10 != 5)     # True
print("a" in "abc") # True

# Combine with logical operators
print(True and False)  # False
print(True or False)   # True
print(not True)        # False

Truthiness

Python considers some values "truthy" (act like True) and others "falsy" (act like False):

Falsy Values	Truthy Values
`False`	`True`
`0`, `0.0`	Any non-zero number
`""` (empty string)	Any non-empty string
`[]` (empty list)	Any non-empty list
`{}` (empty dict)	Any non-empty dict
`()`, `set()` (empty tuple, empty set)	Any non-empty tuple or set
`None`	Most other objects

# This is useful for validation
data = []
if data:
    print("Has data")
else:
    print("Empty!")  # This prints
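
One caveat worth illustrating: `0` and `None` are both falsy, so a plain truthiness test cannot tell "zero" from "missing". The variable names below are hypothetical:

```python
failed_count = 0   # a valid measurement that happens to be zero
missing = None     # no measurement at all

# A plain truthiness test treats both as "no value"
print(bool(failed_count), bool(missing))  # Output: False False

# Compare against None explicitly to tell them apart
print(failed_count is not None)  # Output: True
print(missing is not None)       # Output: False
```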

Lists

Lists are ordered, mutable collections. They're your go-to for storing sequences of items.

🎬 Animation: Mutable vs Immutable

Creating and Modifying Lists

# Create a list
fruits = ["apple", "banana", "cherry"]
numbers = [1, 2, 3, 4, 5]
mixed = [1, "two", 3.0, True]  # Can mix types (but avoid this)

# Access elements
first = fruits[0]      # "apple"
last = fruits[-1]      # "cherry"

# Modify elements
fruits[0] = "avocado"  # ["avocado", "banana", "cherry"]

# Add elements
fruits.append("date")           # Add to end
fruits.insert(0, "apricot")     # Add at position

# Remove elements
fruits.remove("banana")         # Remove by value
popped = fruits.pop()           # Remove and return last item

List Operations

numbers = [3, 1, 4, 1, 5]

# Length
print(len(numbers))  # 5

# Check membership
print(3 in numbers)  # True

# Sort (modifies in place)
numbers.sort()       # [1, 1, 3, 4, 5]

# Iterate
for num in numbers:
    print(num)
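
Since `sort()` changes the list itself, it helps to compare it with the built-in `sorted()`, which returns a new list. A short sketch:

```python
numbers = [3, 1, 4, 1, 5]

# sorted() returns a new list and leaves the original untouched
ordered = sorted(numbers)
print(ordered)   # Output: [1, 1, 3, 4, 5]
print(numbers)   # Output: [3, 1, 4, 1, 5]

# list.sort() sorts in place and returns None
result = numbers.sort(reverse=True)
print(result)    # Output: None
print(numbers)   # Output: [5, 4, 3, 1, 1]
```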

Slicing Lists

Just like strings, lists support slicing:

data = [0, 1, 2, 3, 4, 5]

first_three = data[:3]   # [0, 1, 2]
last_two = data[-2:]     # [4, 5]
every_other = data[::2]  # [0, 2, 4]
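
Slices also give a simple way to process a list in batches. This is a sketch under assumed values; `records` and `batch_size` are hypothetical:

```python
records = [0, 1, 2, 3, 4, 5, 6]  # stand-in for rows read from a source
batch_size = 3                   # hypothetical batch size

batches = []
for start in range(0, len(records), batch_size):
    # A slice that runs past the end simply stops at the last element
    batches.append(records[start:start + batch_size])

print(batches)  # Output: [[0, 1, 2], [3, 4, 5], [6]]
```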

Dictionaries

Dictionaries store key-value pairs. They're essential for data engineering because JSON data maps directly to Python dicts.

Creating and Accessing Dictionaries

# Create a dictionary
user = {
    "name": "Alice",
    "age": 28,
    "city": "Amsterdam"
}

# Access values by key
print(user["name"])      # "Alice"
print(user.get("email")) # None (no KeyError if missing)
print(user.get("email", "N/A"))  # "N/A" (default value)

# Add or modify
user["email"] = "[email protected]"
user["age"] = 29

# Remove
del user["city"]

Iterating Over Dictionaries

user = {"name": "Alice", "age": 28}

# Keys
for key in user:
    print(key)

# Values
for value in user.values():
    print(value)

# Both
for key, value in user.items():
    print(f"{key}: {value}")

Nested Dictionaries

Real-world data often has nested structures:

record = {
    "id": 123,
    "user": {
        "name": "Bob",
        "email": "[email protected]"
    },
    "tags": ["premium", "active"]
}

# Access nested values
user_name = record["user"]["name"]  # "Bob"
first_tag = record["tags"][0]       # "premium"
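
To show how JSON text becomes nested dicts and lists, here is a minimal sketch using the standard library's json module. The `payload` string is a hypothetical API response modeled on the record above:

```python
import json

# Hypothetical JSON text, as it might arrive from an API
payload = '{"id": 123, "user": {"name": "Bob"}, "tags": ["premium", "active"]}'

record = json.loads(payload)
print(type(record))            # Output: <class 'dict'>
print(record["user"]["name"])  # Output: Bob

# Chain .get() with an empty-dict default to avoid a KeyError
# when a nested key may be missing
email = record.get("user", {}).get("email", "N/A")
print(email)                   # Output: N/A
```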

⌨️ Hands-on: Create a dictionary representing a data pipeline run with keys: pipeline_name, status, records_processed, and errors. Print a summary message using f-strings.

Type Checking and Conversion

Use type() to check what type a value is:
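
A minimal sketch of `type()` on the types covered in this lesson, followed by `isinstance()`, which is usually the more convenient choice inside a condition:

```python
print(type(42))        # Output: <class 'int'>
print(type(3.14))      # Output: <class 'float'>
print(type("hello"))   # Output: <class 'str'>
print(type(True))      # Output: <class 'bool'>
print(type([1, 2]))    # Output: <class 'list'>
print(type({"a": 1}))  # Output: <class 'dict'>
print(type(None))      # Output: <class 'NoneType'>

# isinstance() returns a boolean and accepts a tuple of types
value = "42"
print(isinstance(value, str))           # Output: True
print(isinstance(value, (int, float)))  # Output: False
```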
