Functions and modules are the building blocks of maintainable data pipelines. In this section, you'll learn how to structure your code so it's reusable, testable, and easy to understand.
You've already worked with functions in the Core program. Let's review the basics and focus on patterns specific to data engineering.
def greet(name):
    """Return a greeting message."""
    return f"Hello, {name}!"

# Call the function
message = greet("Data Engineer")
print(message)  # Output: Hello, Data Engineer!
def calculate_average(numbers):
    """Calculate the average of a list of numbers."""
    if not numbers:
        return 0
    return sum(numbers) / len(numbers)

# Usage
values = [10, 20, 30, 40, 50]
avg = calculate_average(values)
print(f"Average: {avg}")  # Output: Average: 30.0
Default parameters make functions more flexible:
def read_data(filepath, delimiter=",", skip_header=True):
    """Read data from a file with configurable options."""
    print(f"Reading {filepath}")
    print(f"Delimiter: {delimiter}")
    print(f"Skip header: {skip_header}")

# All of these work:
read_data("data.csv")
read_data("data.tsv", delimiter="\t")
read_data("data.csv", skip_header=False)
💡 In data pipelines, default parameters are great for common configurations. Most CSV files use commas, so make that the default!
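Here's a minimal sketch of how a default delimiter looks in a reader built on the standard library's csv module (read_rows is a hypothetical helper used only for this illustration, not part of the lesson's files):

import csv

def read_rows(filepath, delimiter=",", skip_header=True):
    """Read rows from a delimited text file (illustrative sketch)."""
    with open(filepath, newline="") as file:
        reader = csv.reader(file, delimiter=delimiter)
        rows = list(reader)
    return rows[1:] if skip_header else rows

# Callers only override what differs from the common case:
# rows = read_rows("data.csv")
# rows = read_rows("data.tsv", delimiter="\t")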
Docstrings document what your function does. While regular comments (#) are for developers reading your code, docstrings are formal documentation that Python can read.
💡 Why use docstrings?
- Self-documenting: You can use the help() function to read them (see the example below).
- Tool integration: IDEs and automated documentation tools (like Sphinx) use them to generate manuals.
- Consistency: They provide a standardized way to describe parameters and return values.
def transform_record(record, uppercase_fields=None):
    """
    Transform a data record by applying formatting rules.

    Args:
        record: A dictionary representing a single data record.
        uppercase_fields: List of field names to convert to uppercase.
            If None, no fields are uppercased.

    Returns:
        A new dictionary with the transformed record.

    Example:
        >>> transform_record({"name": "alice"}, ["name"])
        {"name": "ALICE"}
    """
    result = record.copy()
    if uppercase_fields:
        for field in uppercase_fields:
            if field in result:
                result[field] = result[field].upper()
    return result
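This docstring is exactly what help() shows. A quick interactive check (a minimal sketch, assuming transform_record is defined as above):

help(transform_record)           # prints the full docstring shown above
print(transform_record.__doc__)  # the raw docstring is also available via __doc__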
⚠️ Always write docstrings for functions that will be used by others (or by future you!). In data pipelines, this includes any transformation or validation function.
A module is simply a Python file containing code. Projects in data engineering quickly grow beyond a single script, so we use modules to keep things organized.
💡 Why use modules?
- Maintainability: It's easier to find code in small, focused files than in one 2,000-line script.
- Reusability: You can import the same utility functions into many different pipelines.
- Collaboration: Multiple team members can work on different modules at the same time without conflicting.
Create a file called transformations.py:
# transformations.py
"""Data transformation functions for the pipeline."""
def clean_string(value):
    """Remove leading/trailing whitespace and normalize to lowercase."""
    if value is None:
        return ""
    return str(value).strip().lower()

def parse_number(value, default=0):
    """Parse a string as a number, returning default if parsing fails."""
    try:
        return float(value)
    except (ValueError, TypeError):
        return default
# main.py
from transformations import clean_string, parse_number
# Now you can use the functions
name = clean_string(" ALICE ") # Returns: "alice"
price = parse_number("19.99") # Returns: 19.99
invalid = parse_number("N/A", default=-1) # Returns: -1
# Import specific functions (recommended)
from transformations import clean_string, parse_number
# Import the whole module
import transformations
name = transformations.clean_string(" BOB ")
# Import with alias
import transformations as tf
name = tf.clean_string(" BOB ")
# Import everything (avoid this - makes code harder to read)
from transformations import *
💡 Prefer explicit imports (from module import function) - they make it clear where each function comes from.
The __name__ and __main__ Pattern
This pattern lets you write code that works both as a module and as a standalone script.
# transformations.py
def clean_string(value):
    return str(value).strip().lower()
# Test code - runs every time the file is imported!
print(clean_string(" TEST "))
If you import this module, the test code runs automatically - not what you want!
The fix is to wrap the test code in an if __name__ == "__main__": block:
# transformations.py
def clean_string(value):
    """Remove whitespace and convert to lowercase."""
    return str(value).strip().lower()

def parse_number(value, default=0):
    """Parse a string as a number."""
    try:
        return float(value)
    except (ValueError, TypeError):
        return default

# This only runs when the file is executed directly
if __name__ == "__main__":
    # Test/demo code here
    print("Testing transformations...")
    print(clean_string(" HELLO "))  # Output: hello
    print(parse_number("42"))       # Output: 42.0
    print(parse_number("invalid"))  # Output: 0
💡 Think of if __name__ == "__main__": as saying "only run this code if I'm the main script, not if I'm being imported."
- When you run python transformations.py, Python sets __name__ to "__main__".
- When you import transformations, Python sets __name__ to "transformations".
Watch the pattern in action: This terminal demo shows the difference between running a file directly vs. importing it:
🎬 Terminal Tutorial: __name__ == "__main__"
https://gist.githack.com/lassebenni/98ed9f6f746a53a320eb3426b5ce2883/raw/name_main_terminal.html
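You can also reproduce the idea in a couple of lines (whoami.py is a hypothetical file name used only for this demo):

# whoami.py
print(f"My __name__ is {__name__}")

# python whoami.py   -> prints: My __name__ is __main__
# import whoami      -> prints: My __name__ is whoami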
Here's a practical example of a well-structured module for data processing:
# pipeline.py
"""
A simple data processing pipeline.
This module provides functions for reading, transforming, and writing data.
"""
import json
import csv
def read_csv(filepath):
    """Read a CSV file and return a list of dictionaries."""
    with open(filepath, "r") as file:
        reader = csv.DictReader(file)
        return list(reader)

def transform_records(records):
    """Apply transformations to a list of records."""
    transformed = []
    for record in records:
        transformed.append({
            key.lower(): value.strip()
            for key, value in record.items()
        })
    return transformed

def write_json(data, filepath):
    """Write data to a JSON file."""
    with open(filepath, "w") as file:
        json.dump(data, file, indent=2)

def run_pipeline(input_path, output_path):
    """Execute the complete pipeline."""
    print(f"Reading from {input_path}...")
    data = read_csv(input_path)
    print(f"Transforming {len(data)} records...")
    transformed = transform_records(data)
    print(f"Writing to {output_path}...")
    write_json(transformed, output_path)
    print("Pipeline complete!")

if __name__ == "__main__":
    run_pipeline("data/input.csv", "data/output.json")
⌨️ Hands-on: Create a new file called utils.py with two functions: is_empty(value) that returns True if a value is None or an empty string, and safe_int(value) that converts a value to int or returns 0. Add an if __name__ == "__main__": block with test code.
As your data pipelines grow from single scripts to multiple modules, organizing them correctly becomes vital. Professional Python projects use the src/ layout to ensure a clean separation between source code and metadata.
my-data-project/
├── .venv/              # Virtual environment
├── src/                # All source code lives here
│   └── my_pipeline/    # Your package name
│       ├── __init__.py
│       ├── main.py
│       └── utils.py
├── data/               # Data files (raw, intermediate, processed)
│   ├── 01_raw/
│   └── 02_processed/
├── tests/              # Mirror of src/ structure
├── pyproject.toml      # Modern project metadata & dependencies
├── .gitignore          # Files to ignore in git
└── README.md           # Project documentation
Always include a .gitignore to keep your repository clean. At a minimum, you should ignore your virtual environment and Python cache:
# Virtual environment
.venv/
venv/
# Python cache
__pycache__/
*.pyc
# Data files (if large or sensitive)
data/
💡 Why the src layout?
It prevents accidental imports from the root directory and ensures that you are testing the code as it will be installed. This is the gold standard for data engineering pipelines.
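For example, once the package is installed into your environment (typically in editable mode with pip install -e .), tests and other projects import it by its package name rather than by file path. A minimal sketch, assuming the my_pipeline package from the tree above and the is_empty helper from the earlier hands-on exercise:

# tests/test_utils.py  (hypothetical test module)
from my_pipeline.utils import is_empty  # resolved via the installed package, not a relative file

def test_is_empty():
    assert is_empty(None)
    assert is_empty("")
    assert not is_empty("data")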
Next lesson: Type Hints

Found a mistake or have a suggestion? Let us know in the feedback form.