Functions and modules are the building blocks of maintainable data pipelines. In this section, you'll learn how to structure your code so it's reusable, testable, and easy to understand.
You've already worked with functions in the Core program. Let's review the basics and focus on patterns specific to data engineering.
def greet(name):
    """Return a greeting message."""
    return f"Hello, {name}!"

# Call the function
message = greet("Data Engineer")
print(message)  # Output: Hello, Data Engineer!
def calculate_average(numbers):
    """Calculate the average of a list of numbers."""
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)

# Usage
values = [10, 20, 30, 40, 50]
avg = calculate_average(values)
print(f"Average: {avg}")  # Output: Average: 30.0
Default parameters make functions more flexible:
def read_data(filepath, delimiter=",", skip_header=True):
    """Read data from a file with configurable options."""
    print(f"Reading {filepath}")
    print(f"Delimiter: {delimiter}")
    print(f"Skip header: {skip_header}")

# All of these work:
read_data("data.csv")
read_data("data.tsv", delimiter="\t")
read_data("data.csv", skip_header=False)
💡 In data pipelines, default parameters are great for common configurations. Most CSV files use commas, so make that the default!
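Here's a minimal sketch of how a default delimiter looks in a reader built on the standard library's csv module (read_rows is a hypothetical helper used only for this illustration, not part of the lesson's files):

import csv

def read_rows(filepath, delimiter=",", skip_header=True):
    """Read rows from a delimited text file (illustrative sketch)."""
    with open(filepath, newline="") as file:
        reader = csv.reader(file, delimiter=delimiter)
        rows = list(reader)
    return rows[1:] if skip_header else rows

# Callers only override what differs from the common case:
# rows = read_rows("data.csv")
# rows = read_rows("data.tsv", delimiter="\t")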
Docstrings document what your function does. While regular comments (#) are for developers reading your code, docstrings are formal documentation that Python can read.
💡 Why use docstrings?
- Self-documenting: You can use the help() function to read them (see the example below).
- Tool integration: IDEs and automated documentation tools (like Sphinx) use them to generate manuals.
- Consistency: They provide a standardized way to describe parameters and return values.
def transform_record(record, uppercase_fields=None):
    """
    Transform a data record by applying formatting rules.

    Args:
        record: A dictionary representing a single data record.
        uppercase_fields: List of field names to convert to uppercase.
            If None, no fields are uppercased.

    Returns:
        A new dictionary with the transformed record.

    Example:
        >>> transform_record({"name": "alice"}, ["name"])
        {"name": "ALICE"}
    """
    result = record.copy()
    if uppercase_fields:
        for field in uppercase_fields:
            if field in result:
                result[field] = result[field].upper()
    return result
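This docstring is exactly what help() shows. A quick interactive check (a minimal sketch, assuming transform_record is defined as above):

help(transform_record)           # prints the full docstring shown above
print(transform_record.__doc__)  # the raw docstring is also available via __doc__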
⚠️ Always write docstrings for functions that will be used by others (or by future you!). In data pipelines, this includes any transformation or validation function.
A module is simply a Python file containing code. Projects in data engineering quickly grow beyond a single script, so we use modules to keep things organized.
💡 Why use modules?
- Maintainability: It's easier to find code in small, focused files than in one 2,000-line script.
- Reusability: You can import the same utility functions into many different pipelines.
- Collaboration: Multiple team members can work on different modules at the same time without conflicting.
Create a file called transformations.py:
# transformations.py
"""Data transformation functions for the pipeline."""
def clean_string(value):
    """Remove leading/trailing whitespace and normalize to lowercase."""
    if value is None:
        return ""
    return str(value).strip().lower()

def parse_number(value, default=0):
    """Parse a string as a number, returning default if parsing fails."""
    try:
        return float(value)
    except (ValueError, TypeError):
        return default
# main.py
from transformations import clean_string, parse_number
# Now you can use the functions
name = clean_string(" ALICE ") # Returns: "alice"
price = parse_number("19.99") # Returns: 19.99
invalid = parse_number("N/A", default=-1) # Returns: -1
# Import specific functions (recommended)
from transformations import clean_string, parse_number
# Import the whole module
import transformations
name = transformations.clean_string(" BOB ")
# Import with alias
import transformations as tf
name = tf.clean_string(" BOB ")
# Import everything (avoid this - makes code harder to read)
from transformations import *
💡 Prefer explicit imports (from module import function) - they make it clear where each function comes from.
The __name__ and __main__ Pattern
This pattern lets you write code that works both as a module and as a standalone script.
# transformations.py
def clean_string(value):
    return str(value).strip().lower()
# Test code - runs every time the file is imported!
print(clean_string(" TEST "))
If you import this module, the test code runs automatically - not what you want!
The fix is to wrap the test code in an if __name__ == "__main__": block:
# transformations.py
def clean_string(value):
    """Remove whitespace and convert to lowercase."""
    return str(value).strip().lower()

def parse_number(value, default=0):
    """Parse a string as a number."""
    try:
        return float(value)
    except (ValueError, TypeError):
        return default

# This only runs when the file is executed directly
if __name__ == "__main__":
    # Test/demo code here
    print("Testing transformations...")
    print(clean_string(" HELLO "))  # Output: hello
    print(parse_number("42"))       # Output: 42.0
    print(parse_number("invalid"))  # Output: 0
💡 Think of if __name__ == "__main__": as saying "only run this code if I'm the main script, not if I'm being imported."
- When you run python transformations.py, Python sets __name__ to "__main__".
- When you import transformations, Python sets __name__ to "transformations".
Watch the pattern in action: This terminal demo shows the difference between running a file directly vs. importing it:
🎬 Terminal Tutorial: __name__ == "__main__"
https://gist.githack.com/lassebenni/98ed9f6f746a53a320eb3426b5ce2883/raw/name_main_terminal.html
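You can also reproduce the idea in a couple of lines (whoami.py is a hypothetical file name used only for this demo):

# whoami.py
print(f"My __name__ is {__name__}")

# python whoami.py   -> prints: My __name__ is __main__
# import whoami      -> prints: My __name__ is whoami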
Here's a practical example of a well-structured module for data processing:
# pipeline.py
"""
A simple data processing pipeline.
This module provides functions for reading, transforming, and writing data.
"""
import json
import csv
def read_csv(filepath):
    """Read a CSV file and return a list of dictionaries."""
    with open(filepath, "r") as file:
        reader = csv.DictReader(file)
        return list(reader)

def transform_records(records):
    """Apply transformations to a list of records."""
    transformed = []
    for record in records:
        transformed.append({
            key.lower(): value.strip()
            for key, value in record.items()
        })
    return transformed

def write_json(data, filepath):
    """Write data to a JSON file."""
    with open(filepath, "w") as file:
        json.dump(data, file, indent=2)

def run_pipeline(input_path, output_path):
    """Execute the complete pipeline."""
    print(f"Reading from {input_path}...")
    data = read_csv(input_path)
    print(f"Transforming {len(data)} records...")
    transformed = transform_records(data)
    print(f"Writing to {output_path}...")
    write_json(transformed, output_path)
    print("Pipeline complete!")

if __name__ == "__main__":
    run_pipeline("data/input.csv", "data/output.json")
⌨️ Hands-on: Create a new file called utils.py with two functions: is_empty(value) that returns True if a value is None or an empty string, and safe_int(value) that converts a value to int or returns 0. Add an if __name__ == "__main__": block with test code.
As your data pipelines grow from single scripts to multiple modules, organizing them correctly becomes vital. Professional Python projects use the src/ layout to ensure a clean separation between source code and metadata.
my-data-project/
├── .venv/              # Virtual environment
├── src/                # All source code lives here
│   └── my_pipeline/    # Your package name
│       ├── __init__.py
│       ├── main.py
│       └── utils.py
├── data/               # Data files (raw, intermediate, processed)
│   ├── 01_raw/
│   └── 02_processed/
├── tests/              # Mirror of src/ structure
├── pyproject.toml      # Modern project metadata & dependencies
├── .gitignore          # Files to ignore in git
└── README.md           # Project documentation
Always include a .gitignore to keep your repository clean. At a minimum, you should ignore your virtual environment and Python cache:
# Virtual environment
.venv/
venv/
# Python cache
__pycache__/
*.pyc
# Data files (if large or sensitive)
data/
💡 Why the src layout?
It prevents accidental imports from the root directory and ensures that you are testing the code as it will be installed. This is the gold standard for data engineering pipelines.
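For example, once the package is installed into your environment (typically in editable mode with pip install -e .), tests and other projects import it by its package name rather than by file path. A minimal sketch, assuming the my_pipeline package from the tree above and the is_empty helper from the earlier hands-on exercise:

# tests/test_utils.py  (hypothetical test module)
from my_pipeline.utils import is_empty  # resolved via the installed package, not a relative file

def test_is_empty():
    assert is_empty(None)
    assert is_empty("")
    assert not is_empty("data")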
Next lesson: Type Hints

Found a mistake or have a suggestion? Let us know in the feedback form.