Before diving into functions, you need to understand Python's core building blocks: variables and data types. These fundamentals are essential for every data pipeline you'll build.
Python is dynamically typed: you don't declare variable types explicitly, and Python infers the type from the value you assign. This makes Python flexible and quick to write, but you still need to understand data types, because every value in a pipeline behaves according to its type.
💡 In data engineering, you'll constantly convert between types: parsing strings from files into numbers, dates, or booleans for processing.
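As a rough illustration of that kind of conversion, here is a minimal sketch. The row and its field names are made up for this example, and `date.fromisoformat` assumes Python 3.7 or newer.

```python
from datetime import date

# A hypothetical row as it might arrive from a CSV file: every field is a string.
raw_row = {
    "id": "42",
    "amount": "19.99",
    "signup_date": "2024-01-15",
    "is_active": "true",
}

# Convert each field to the type the rest of the pipeline expects.
parsed_row = {
    "id": int(raw_row["id"]),
    "amount": float(raw_row["amount"]),
    "signup_date": date.fromisoformat(raw_row["signup_date"]),
    # bool("false") would be True (non-empty string), so compare the text instead.
    "is_active": raw_row["is_active"].strip().lower() == "true",
}

for field, value in parsed_row.items():
    print(field, type(value).__name__, value)
```

Note the boolean field: calling `bool()` on any non-empty string returns `True`, so text flags need an explicit comparison.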
A variable is a name that refers to a value stored in memory. Think of it as a label on a box.
# Assign values to variables
name = "Alice"
age = 28
salary = 65000.50
is_active = True
# Print them
print(name) # Output: Alice
print(age) # Output: 28
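The "label" picture matters most when two names point at the same value. The sketch below uses a list (covered later in this lesson) because lists can be changed in place; the variable names are illustrative only.

```python
# Two names can label the same object.
scores = [90, 85]
alias = scores           # no copy is made; both names refer to one list
alias.append(70)

print(scores)            # [90, 85, 70]
print(alias is scores)   # True

# Rebinding a name points it at a new object and leaves the old one alone.
alias = [1, 2, 3]
print(scores)            # [90, 85, 70]
```

Assignment never copies a value; it only attaches a name to it.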
Python has rules and conventions for variable names:
| Rule | Example | Valid? |
|---|---|---|
| Start with letter or underscore | my_var, _private | ✅ |
| Start with number | 2nd_value | ❌ |
| Contain spaces | my var | ❌ |
| Use snake_case | user_count | ✅ (recommended) |
| Use camelCase | userCount | ✅ (not Pythonic) |
⚠️ Use snake_case for variable names in Python. This is the community convention recommended by PEP 8.
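If you want to check the syntax rules yourself, one possible approach is the built-in `str.isidentifier()` method combined with the standard `keyword` module. The candidate names below are examples only.

```python
import keyword

candidates = ["my_var", "_private", "2nd_value", "my var", "user_count", "class"]

results = {}
for name in candidates:
    # str.isidentifier() checks the syntax rules; reserved words such as
    # "class" pass that check but still cannot be used as variable names.
    usable = name.isidentifier() and not keyword.iskeyword(name)
    results[name] = usable
    print(f"{name!r}: {usable}")
```

This only tests whether a name is legal. Whether it follows snake_case is a style question that Python itself does not enforce.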
Python lets you assign multiple variables at once:
# Assign the same value to multiple variables
x = y = z = 0
# Assign different values in one line (unpacking)
name, age, city = "Bob", 30, "Amsterdam"
# Swap values without a temporary variable
a, b = 10, 20
a, b = b, a # Now a=20, b=10
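One caveat with chained assignment: every name ends up bound to the same object. That is harmless for numbers, but it can surprise you with mutable values such as lists. A small sketch, with hypothetical variable names:

```python
# Chained assignment binds every name to the SAME object.
valid_rows = rejected_rows = []
valid_rows.append("row-1")
print(rejected_rows)   # ['row-1']  (same list, so the change shows up here too)

# Give each name its own list when they should be independent.
kept, dropped = [], []
kept.append("row-1")
print(dropped)         # []
```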
Python has three built-in numeric types: int, float, and complex. For data engineering, you'll mostly use integers and floats.
Integers (int) are whole numbers, positive or negative, without decimals:
record_count = 1500
offset = -10
year = 2024
# Arithmetic operations
total = record_count + 500 # 2000
doubled = record_count * 2 # 3000
per_batch = record_count // 100 # 15 (floor division)
remainder = record_count % 100 # 0 (modulo)
Floats (float) are numbers with decimal points:
price = 19.99
temperature = -3.5
percentage = 0.75
# Be careful with float precision
result = 0.1 + 0.2
print(result) # Output: 0.30000000000000004 (not exactly 0.3!)
⚠️ Floats have precision limitations. For financial calculations, use Python's decimal module instead.
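A minimal sketch of the `decimal` approach, using the same 0.1 + 0.2 sum. The variable name is illustrative.

```python
from decimal import Decimal

# Build Decimals from strings; Decimal(0.1) would inherit the float's error.
subtotal = Decimal("0.10") + Decimal("0.20")

print(subtotal)                      # 0.30
print(subtotal == Decimal("0.30"))   # True
print(0.1 + 0.2 == 0.3)              # False
```

Passing strings rather than floats to `Decimal` is the important detail here, since a float literal has already lost precision before `Decimal` sees it.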
Convert between numeric types:
# String to int
count_str = "42"
count = int(count_str) # 42
# String to float
price_str = "19.99"
price = float(price_str) # 19.99
# Float to int (truncates, doesn't round)
value = int(3.9) # 3
# Int to float
whole = float(5) # 5.0
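Conversions can also fail or behave in ways you might not expect. The sketch below shows a few cases worth knowing. The `raw` value is a made-up example, and the `try`/`except` block is just one way to handle a bad value.

```python
# int() truncates toward zero; round() rounds to the nearest integer.
print(int(-3.9))    # -3
print(round(3.9))   # 4

# int() rejects strings that contain a decimal point.
raw = "3.9"
try:
    count = int(raw)
except ValueError:
    count = None    # mark the value as unusable instead of crashing
print(count)        # None

# Going through float first works when truncation is acceptable.
print(int(float(raw)))   # 3
```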
⌨️ Hands-on: Create variables for a pipeline's metrics: total_records = 1000, processed = 850, failed = 150. Calculate and print the success rate as a percentage.
Strings are sequences of characters, used for text data. In data engineering, you'll work with strings constantly — file paths, column names, log messages, and parsed data.
# Single or double quotes (no difference)
name = "Alice"
name = 'Alice'
# Multi-line strings with triple quotes
query = """
SELECT *
FROM users
WHERE active = true
"""
# f-strings for formatting (Python 3.6+)
user = "Bob"
count = 42
message = f"User {user} processed {count} records"
print(message) # Output: User Bob processed 42 records
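f-strings also accept format specifiers after a colon, which is handy for log messages and reports. A short sketch with invented values:

```python
records = 1234567
success_rate = 0.8512
elapsed = 3.14159

line_count = f"Processed {records:,} records"      # thousands separator
line_rate = f"Success rate: {success_rate:.1%}"    # percentage, 1 decimal
line_time = f"Elapsed: {elapsed:.2f} s"            # fixed 2 decimals

print(line_count)   # Processed 1,234,567 records
print(line_rate)    # Success rate: 85.1%
print(line_time)    # Elapsed: 3.14 s
```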
These methods are essential for cleaning data:
raw_value = " HELLO World "
# Remove whitespace
cleaned = raw_value.strip() # "HELLO World"
# Change case
lower = raw_value.lower() # " hello world "
upper = raw_value.upper() # " HELLO WORLD "
# Replace characters
fixed = raw_value.replace(" ", "_") # "_HELLO_World_"
# Split into list
parts = "a,b,c".split(",") # ["a", "b", "c"]
# Join list into string
joined = "-".join(["2024", "01", "15"]) # "2024-01-15"
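These methods are often chained. As a sketch, here is one way to clean a messy delimited line; the input line and the pipe delimiter are assumptions for the example.

```python
raw_line = "  Alice , AMSTERDAM,  alice_01 \n"

# Split on commas, then strip and lowercase each field.
fields = []
for field in raw_line.split(","):
    fields.append(field.strip().lower())

print(fields)             # ['alice', 'amsterdam', 'alice_01']

# Join the cleaned fields back into a single delimited line.
clean_line = "|".join(fields)
print(clean_line)         # alice|amsterdam|alice_01
```

String methods return new strings and never modify the original, so `raw_line` is unchanged after this runs.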
Access individual characters or substrings:
text = "Python"
# Indexing (0-based)
first = text[0] # "P"
last = text[-1] # "n"
# Slicing [start:end] (end is exclusive)
first_three = text[0:3] # "Pyt"
last_three = text[-3:] # "hon"
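Slicing is useful for fixed-format text such as dated file names. The file name below is hypothetical, and the positions only work because its layout is fixed.

```python
file_name = "sales_2024-01-15.csv"

extension = file_name[-4:]    # ".csv"
date_part = file_name[6:16]   # "2024-01-15"
year = int(date_part[:4])     # 2024
month = int(date_part[5:7])   # 1

print(extension, date_part, year, month)
```

For file names with variable-length prefixes, splitting on a separator is usually more robust than hard-coded positions.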
⌨️ Hands-on: Given raw_name = " john DOE ", clean it to produce "John Doe" (title case, no extra whitespace).
Booleans represent truth values: True or False. They're used in conditions and validation.
is_valid = True
has_errors = False
# Comparison operators return booleans
print(10 > 5) # True
print(10 == 5) # False
print(10 != 5) # True
print("a" in "abc") # True
# Combine with logical operators
print(True and False) # False
print(True or False) # True
print(not True) # False
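In validation code, comparisons and logical operators are usually combined into named checks. A small sketch with made-up values and rules:

```python
age = 28
email = "alice@example.com"

is_adult = age >= 18
has_email = "@" in email
in_range = 0 <= age <= 120   # chained comparison: 0 <= age and age <= 120

is_valid = is_adult and has_email and in_range
print(is_valid)              # True

# `and` short-circuits: the right side is skipped when the left is False,
# so rows[0] is never evaluated for an empty list.
rows = []
has_header = len(rows) > 0 and rows[0] == "header"
print(has_header)            # False
```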
Python considers some values "truthy" (act like True) and others "falsy" (act like False):
| Falsy Values | Truthy Values |
|---|---|
| False | True |
| 0, 0.0 | Any non-zero number |
| "" (empty string) | Any non-empty string |
| [] (empty list) | Any non-empty list |
| {} (empty dict) | Any non-empty dict |
| None | Everything else |
# This is useful for validation
data = []
if data:
    print("Has data")
else:
    print("Empty!") # This prints
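Truthiness checks have one common pitfall: a legitimate value such as 0 is falsy, so `if value:` treats it as missing. The sketch below uses a hypothetical `discount` field to show the difference.

```python
discount = 0   # a legitimate value that happens to be falsy

if discount:
    message = "Discount provided"
else:
    message = "Looks missing"
print(message)        # Looks missing

# Compare against None when 0 or "" are valid inputs.
if discount is not None:
    message = "Discount provided"
print(message)        # Discount provided
```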
Lists are ordered, mutable collections. They're your go-to for storing sequences of items.
# Create a list
fruits = ["apple", "banana", "cherry"]
numbers = [1, 2, 3, 4, 5]
mixed = [1, "two", 3.0, True] # Can mix types (but avoid this)
# Access elements
first = fruits[0] # "apple"
last = fruits[-1] # "cherry"
# Modify elements
fruits[0] = "avocado" # ["avocado", "banana", "cherry"]
# Add elements
fruits.append("date") # Add to end
fruits.insert(0, "apricot") # Add at position
# Remove elements
fruits.remove("banana") # Remove by value
popped = fruits.pop() # Remove and return last item
numbers = [3, 1, 4, 1, 5]
# Length
print(len(numbers)) # 5
# Check membership
print(3 in numbers) # True
# Sort (modifies in place)
numbers.sort() # [1, 1, 3, 4, 5]
# Iterate
for num in numbers:
    print(num)
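Because `sort()` modifies the list in place, it returns `None`. The built-in `sorted()` returns a new list and leaves the original untouched. A quick comparison, with an illustrative list name:

```python
latencies = [3, 1, 4, 1, 5]

ordered = sorted(latencies)             # returns a new list
print(ordered)     # [1, 1, 3, 4, 5]
print(latencies)   # [3, 1, 4, 1, 5]  (unchanged)

result = latencies.sort(reverse=True)   # sorts in place and returns None
print(result)      # None
print(latencies)   # [5, 4, 3, 1, 1]
```

Writing `numbers = numbers.sort()` is a frequent mistake, since it replaces the list with `None`.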
Just like strings, lists support slicing:
data = [0, 1, 2, 3, 4, 5]
first_three = data[:3] # [0, 1, 2]
last_two = data[-2:] # [4, 5]
every_other = data[::2] # [0, 2, 4]
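Slicing is also a simple way to split a list into batches. This is a sketch, not the only approach; `records` and `batch_size` are stand-in values.

```python
records = list(range(10))   # stand-in for 10 records
batch_size = 4

batches = []
for start in range(0, len(records), batch_size):
    # A slice that runs past the end simply returns what is left.
    batches.append(records[start:start + batch_size])

print(batches)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```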
Dictionaries store key-value pairs. They're essential for data engineering because JSON data maps directly to Python dicts.
# Create a dictionary
user = {
    "name": "Alice",
    "age": 28,
    "city": "Amsterdam"
}
# Access values by key
print(user["name"]) # "Alice"
print(user.get("email")) # None (no KeyError if missing)
print(user.get("email", "N/A")) # "N/A" (default value)
# Add or modify
user["email"] = "[email protected]"
user["age"] = 29
# Remove
del user["city"]
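To see how JSON maps to a dict, here is a minimal sketch using the standard `json` module. The payload is invented for the example.

```python
import json

payload = '{"name": "Alice", "age": 28, "active": true, "email": null}'

parsed = json.loads(payload)    # JSON object -> Python dict
print(type(parsed).__name__)    # dict
print(parsed["active"])         # True  (JSON true -> Python True)
print(parsed["email"])          # None  (JSON null -> Python None)

# json.dumps turns the dict back into a JSON string.
round_trip = json.loads(json.dumps(parsed))
print(round_trip == parsed)     # True
```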
user = {"name": "Alice", "age": 28}
# Keys
for key in user:
    print(key)
# Values
for value in user.values():
    print(value)
# Both
for key, value in user.items():
    print(f"{key}: {value}")
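Combining `get()` with a default value and `items()` gives a common counting pattern. The status values below are made up for illustration.

```python
statuses = ["ok", "failed", "ok", "ok", "failed"]

counts = {}
for status in statuses:
    # get() returns 0 the first time a status is seen.
    counts[status] = counts.get(status, 0) + 1

for status, count in counts.items():
    print(f"{status}: {count}")
```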
Real-world data often has nested structures:
record = {
    "id": 123,
    "user": {
        "name": "Bob",
        "email": "[email protected]"
    },
    "tags": ["premium", "active"]
}
# Access nested values
user_name = record["user"]["name"] # "Bob"
first_tag = record["tags"][0] # "premium"
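Nested fields are often missing in real data, and direct indexing then raises `KeyError` or `IndexError`. One defensive option, sketched here with a hypothetical incomplete record, is to chain `get()` calls with defaults.

```python
incomplete = {"id": 123, "user": {"name": "Bob"}, "tags": []}

# incomplete["user"]["email"] would raise KeyError here.
email = incomplete.get("user", {}).get("email", "N/A")
print(email)       # N/A

# Indexing an empty list raises IndexError, so check it first.
tags = incomplete.get("tags", [])
first_tag = tags[0] if tags else None
print(first_tag)   # None
```

Whether a missing field should get a default or stop the pipeline depends on the data contract, so treat this as one option.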
⌨️ Hands-on: Create a dictionary representing a data pipeline run with keys: pipeline_name, status, records_processed, and errors. Print a summary message using f-strings.
Use type() to check what type a value is: