Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Gotchas & Pitfalls

Assignment: Refactoring to a Clean Pipeline

Testing with Pytest

By the end of this chapter, you should be able to:

Everything you've built this week, separating I/O from logic, writing pure functions, using dataclasses, was leading to this moment. The whole point of clean architecture is that your code becomes testable.

Testing isn't just about catching bugs. It's about confidence. When you refactor a function, add a feature, or fix a bug, tests tell you instantly whether you broke something. In professional pipelines, tests are the safety net that lets you move fast without fear.

<aside> 📘 Core Program Refresher: In the Core program you used Jest's describe / it / expect pattern (see Unit testing). Pytest follows the same arrange / act / assert mental model: organise tests by behaviour, set up the inputs, call the function under test, assert on the result. The big difference is the syntax: Pytest uses plain functions and a single assert keyword instead of expect().toBe() style matchers.

</aside>

Why Pytest?

Python has a built-in testing framework called unittest, but the industry standard is pytest. Why?

Install it:

pip install pytest

How Pytest Discovers Tests

Pytest uses a simple naming convention called test discovery to find your tests automatically:

What Convention Example
Files Start with test_ test_pipeline.py
Functions Start with test_ test_clean_name()
Folders Named tests/ (by convention) tests/test_pipeline.py

To run all tests, just type:

pytest

Pytest will scan your project, find every file and function matching the convention, and run them all.

<aside> 🎬 Terminal Tutorial: Running Pytest for the First Time

</aside>

Your First Test

Remember the clean_name function from OOP vs Functional Programming?

def clean_name(raw_name: str) -> str:
    return raw_name.strip().title()

Here's how you test it. The test-file snippets in this chapter assume pytest is installed in your environment and that the imported modules (pipeline, logic) exist in your project: they're shape demos, not standalone scripts.

<!-- runner:expect-fail -->

# test_pipeline.py
from pipeline import clean_name

def test_clean_name_strips_whitespace():
    assert clean_name("  alice  ") == "Alice"

def test_clean_name_converts_to_title_case():
    assert clean_name("bob smith") == "Bob Smith"

That's it. No classes, no boilerplate. Just assert (Python's built-in truth check) and a clear function name that describes what you're testing.

Run it:

pytest test_pipeline.py -v

The -v flag gives you verbose output, showing each test name and its result.

Testing Pure Functions (The Payoff)

This is where the architecture from the previous chapters (Separation of Concerns through Functional Composition) pays off. Because we separated I/O from logic, we can test our business rules without touching files, databases, or APIs.

# logic.py
from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

def apply_vat(product: Product, vat_rate: float) -> Product:
    return Product(
        product_id=product.product_id,
        price=round(product.price * (1 + vat_rate), 2),
        currency=product.currency
    )

Testing this is trivial because the function is pure: no files, no network, no database:

<!-- runner:expect-fail -->

# test_logic.py
from logic import Product, apply_vat

def test_apply_vat_calculates_correctly():
    product = Product(product_id=1, price=100.0, currency="EUR")
    result = apply_vat(product, 0.21)
    assert result.price == 121.0

def test_apply_vat_does_not_mutate_original():
    original = Product(product_id=1, price=100.0, currency="EUR")
    apply_vat(original, 0.21)
    assert original.price == 100.0  # unchanged!

Compare this to testing a "god function" that reads from CSV, transforms, and writes to disk. You'd need actual files, cleanup logic, and your tests would be slow and fragile.

Fixtures: Reusable Test Setup

When multiple tests need the same data, you can use fixtures instead of repeating yourself:

<!-- runner:expect-fail -->

import pytest
from logic import Product

@pytest.fixture
def sample_product():
    return Product(product_id=1, price=100.0, currency="EUR")

def test_apply_vat(sample_product):
    result = apply_vat(sample_product, 0.21)
    assert result.price == 121.0

def test_product_has_currency(sample_product):
    assert sample_product.currency == "EUR"

A fixture is a function decorated with @pytest.fixture. Pytest automatically calls it and passes the result to any test that lists its name as a parameter.

Fixtures are especially useful for:

Parametrize: Test Multiple Inputs

Instead of writing a separate test for each edge case, use @pytest.mark.parametrize:

<!-- runner:expect-fail -->

import pytest

@pytest.mark.parametrize("raw, expected", [
    ("  alice  ", "Alice"),
    ("BOB", "Bob"),
    ("charlie brown", "Charlie Brown"),
    ("", ""),
])
def test_clean_name(raw, expected):
    assert clean_name(raw) == expected

This runs the same test logic with four different inputs. If one fails, pytest tells you exactly which input caused the failure.

This is far better than a for loop inside a single test, because:

<aside> 💡 Using AI to help: Paste a function signature (⚠️ no real data, no PII) into an LLM and ask for parametrize edge-case sets: empty inputs, boundary values, type mismatches, unicode oddities. Then prune the suggestions yourself: pytest will happily run twenty cases, but only some of them exercise different code paths. The cases worth keeping are the ones that would fail under different bugs.

</aside>

Time to write some tests of your own.

<aside> ⌨️ Hands on: Implement a validate_email(email) function that checks for a valid format (one @, non-empty parts, no spaces). Think about edge cases, then the tests will check them for you!

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=2&chapter=testing_pytest&exercise=w2_testing_pytest__write_tests&lang=python

</aside>

Organizing Your Tests

A typical project structure:

my_pipeline/
├── config.py
├── logic.py
├── io_layer.py
├── main.py
└── tests/
    ├── test_logic.py
    └── test_config.py

Keep your tests close to what they test. Name them clearly: test_logic.py tests logic.py.

<aside> 🤓 Curious Geek: Why Pytest Won Over unittest

Python's built-in unittest was modeled after Java's JUnit. It requires you to write test classes, use self.assertEqual() methods, and follow a rigid structure.

Pytest was created in 2004 by Holger Krekel as a simpler alternative. It just uses plain assert statements and functions, no classes required.

Today, pytest is used by projects like Django, Flask, and even parts of CPython itself. It won because it made testing feel like writing normal Python code!

</aside>

Now turn the patterns into your own test file.

<aside> 📝 Practice: Apply this pattern in Practice Exercise 4: Write Tests with Pytest.

</aside>

🧠 Knowledge Check

Extra reading