Week 2 - Structuring Data Pipelines
Everything you've built this week, separating I/O from logic, writing pure functions, using dataclasses, was leading to this moment. The whole point of clean architecture is that your code becomes testable.
Testing isn't just about catching bugs. It's about confidence. When you refactor a function, add a feature, or fix a bug, tests tell you instantly whether you broke something. In professional pipelines, tests are the safety net that lets you move fast without fear.
Python has a built-in testing framework called unittest, but the industry standard is pytest. Why?
You write checks with plain assert, with no special method names to memorize.
Install it:
pip install pytest
Pytest finds your tests automatically through test discovery, which relies on a simple naming convention:
| What | Convention | Example |
|---|---|---|
| Files | Start with test_ | test_pipeline.py |
| Functions | Start with test_ | test_clean_name() |
| Folders | Named tests/ (by convention) | tests/test_pipeline.py |
To run all tests, just type:
pytest
Pytest will scan your project, find every file and function matching the convention, and run them all.
<aside> 🎬 Terminal Tutorial: Running Pytest for the First Time
</aside>
Remember the clean_name function from Chapter 4?
def clean_name(raw_name: str) -> str:
    return raw_name.strip().title()
Here's how you test it:
# test_pipeline.py
from pipeline import clean_name


def test_clean_name_strips_whitespace():
    assert clean_name(" alice ") == "Alice"


def test_clean_name_converts_to_title_case():
    assert clean_name("bob smith") == "Bob Smith"
That's it. No classes, no boilerplate. Just assert (Python's built-in truth check) and a clear function name that describes what you're testing.
Run it:
pytest test_pipeline.py -v
The -v flag gives you verbose output, showing each test name and its result.
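To see what the plain-function style saves you, here is a rough sketch of the same two checks written for the built-in unittest framework. The class name TestCleanName is made up for this illustration, and clean_name is copied inline so the block runs on its own.

```python
import unittest


def clean_name(raw_name: str) -> str:
    # Copied from the example above so this block is self-contained.
    return raw_name.strip().title()


class TestCleanName(unittest.TestCase):  # hypothetical class name
    def test_strips_whitespace(self):
        # unittest needs a class and an assertion method instead of a bare assert
        self.assertEqual(clean_name(" alice "), "Alice")

    def test_converts_to_title_case(self):
        self.assertEqual(clean_name("bob smith"), "Bob Smith")
```

You would run this with python -m unittest. Pytest can also collect unittest-style classes, so older test suites keep working while you write new tests as plain functions.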
This is where the architecture from Chapters 3-6 pays off. Because we separated I/O from logic, we can test our business rules without touching files, databases, or APIs.
# logic.py
from dataclasses import dataclass


@dataclass
class Product:
    product_id: int
    price: float
    currency: str


def apply_vat(product: Product, vat_rate: float) -> Product:
    # Return a new Product instead of modifying the one passed in
    return Product(
        product_id=product.product_id,
        price=round(product.price * (1 + vat_rate), 2),
        currency=product.currency,
    )
Testing this is trivial because the function is pure (no files, no network, no database):
# test_logic.py
from logic import Product, apply_vat


def test_apply_vat_calculates_correctly():
    product = Product(product_id=1, price=100.0, currency="EUR")
    result = apply_vat(product, 0.21)
    assert result.price == 121.0


def test_apply_vat_does_not_mutate_original():
    original = Product(product_id=1, price=100.0, currency="EUR")
    apply_vat(original, 0.21)
    assert original.price == 100.0  # unchanged!
Compare this to testing a "god function" that reads from CSV, transforms, and writes to disk. You'd need actual files, cleanup logic, and your tests would be slow and fragile.
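The exact == comparison on prices works in the tests above because apply_vat rounds its result. When a calculation is not rounded, floating-point noise can make exact equality fail, and pytest.approx is the usual tool for that. The sketch below assumes a hypothetical apply_vat_unrounded variant that exists only for this illustration.

```python
from dataclasses import dataclass

import pytest


@dataclass
class Product:
    product_id: int
    price: float
    currency: str


def apply_vat_unrounded(product: Product, vat_rate: float) -> Product:
    # Hypothetical variant of apply_vat that skips round(), for illustration only.
    return Product(
        product_id=product.product_id,
        price=product.price * (1 + vat_rate),
        currency=product.currency,
    )


def test_float_sum_needs_approx():
    # 0.1 + 0.2 is stored as 0.30000000000000004, so a plain == 0.3 fails.
    assert 0.1 + 0.2 != 0.3
    assert 0.1 + 0.2 == pytest.approx(0.3)


def test_apply_vat_unrounded_is_close_to_expected():
    product = Product(product_id=1, price=19.99, currency="EUR")
    result = apply_vat_unrounded(product, 0.21)
    # approx compares within a small relative tolerance instead of bit-for-bit
    assert result.price == pytest.approx(24.1879)
```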
When multiple tests need the same data, you can use fixtures instead of repeating yourself:
import pytest
from logic import Product, apply_vat


@pytest.fixture
def sample_product():
    return Product(product_id=1, price=100.0, currency="EUR")


def test_apply_vat(sample_product):
    result = apply_vat(sample_product, 0.21)
    assert result.price == 121.0


def test_product_has_currency(sample_product):
    assert sample_product.currency == "EUR"
A fixture is a function decorated with @pytest.fixture. Pytest automatically calls it and passes the result to any test that lists its name as a parameter.
Fixtures are especially useful for setup that several tests share, such as the sample Product above.
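Fixtures can also build on other fixtures, including the ones pytest ships with. As a sketch, the example below uses pytest's built-in tmp_path fixture (a temporary directory unique to each test) to test a small I/O function without leaving files behind. The names read_prices, write_sample_csv, and products_csv are assumptions made up for this illustration, not part of the course project.

```python
import csv
from pathlib import Path

import pytest


def read_prices(path: Path) -> list[float]:
    # Hypothetical I/O helper: read the "price" column from a CSV file.
    with path.open(newline="") as f:
        return [float(row["price"]) for row in csv.DictReader(f)]


def write_sample_csv(directory: Path) -> Path:
    # Hypothetical helper that creates a small CSV file for the tests.
    path = directory / "products.csv"
    path.write_text("product_id,price\n1,100.0\n2,59.5\n")
    return path


@pytest.fixture
def products_csv(tmp_path: Path) -> Path:
    # tmp_path is provided by pytest; each test gets its own fresh directory.
    return write_sample_csv(tmp_path)


def test_read_prices(products_csv):
    assert read_prices(products_csv) == [100.0, 59.5]
```

Keeping the file creation in a plain helper and wrapping it in a fixture means the test body stays one line, and pytest handles where the temporary directory lives.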
Instead of writing a separate test for each edge case, use @pytest.mark.parametrize:
import pytest
from pipeline import clean_name


@pytest.mark.parametrize("raw, expected", [
    (" alice ", "Alice"),
    ("BOB", "Bob"),
    ("charlie brown", "Charlie Brown"),
    ("", ""),
])
def test_clean_name(raw, expected):
    assert clean_name(raw) == expected
This runs the same test logic with four different inputs. If one fails, pytest tells you exactly which input caused the failure.
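This is far better than a for loop inside a single test, because each input is collected and reported as its own test, and a failing input does not stop the remaining inputs from running.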
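The same technique works for the VAT logic. The sketch below also uses pytest.param with an id so each case shows up under a readable name in the -v output. Product and apply_vat are copied inline so the block runs on its own, and the ids are invented for this illustration.

```python
from dataclasses import dataclass

import pytest


@dataclass
class Product:
    product_id: int
    price: float
    currency: str


def apply_vat(product: Product, vat_rate: float) -> Product:
    # Same logic as logic.py, copied here to keep the example self-contained.
    return Product(
        product_id=product.product_id,
        price=round(product.price * (1 + vat_rate), 2),
        currency=product.currency,
    )


@pytest.mark.parametrize(
    "price, vat_rate, expected",
    [
        # The ids are illustrative labels; pytest appends them to the test name.
        pytest.param(100.0, 0.21, 121.0, id="standard-rate"),
        pytest.param(100.0, 0.0, 100.0, id="zero-rate"),
        pytest.param(0.0, 0.21, 0.0, id="free-product"),
    ],
)
def test_apply_vat(price, vat_rate, expected):
    product = Product(product_id=1, price=price, currency="EUR")
    assert apply_vat(product, vat_rate).price == expected
```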
<aside>
⌨️ Hands on: Implement a validate_email(email) function that checks for a valid format (one @, non-empty parts, no spaces). Think about edge cases, then the tests will check them for you!
Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=2&chapter=testing_pytest&exercise=w2_testing_pytest__write_tests&lang=python
</aside>
A typical project structure:
my_pipeline/
├── config.py
├── logic.py
├── io_layer.py
├── main.py
└── tests/
    ├── test_logic.py
    └── test_config.py
Keep your tests close to what they test. Name them clearly: test_logic.py tests logic.py.
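Once several test files need the same fixture, a common pattern is to move it into a conftest.py file inside tests/. Pytest loads that file automatically, so test files in the same folder can request its fixtures by name without importing them. The following is a minimal sketch of what such a file might contain. Product is redefined inline only so the block runs alone, and build_sample_product is a hypothetical helper.

```python
# tests/conftest.py (sketch)
from dataclasses import dataclass

import pytest


@dataclass
class Product:
    # Inline copy for this sketch; in the project you would import it from logic.
    product_id: int
    price: float
    currency: str


def build_sample_product() -> Product:
    # Hypothetical helper that holds the shared sample data.
    return Product(product_id=1, price=100.0, currency="EUR")


@pytest.fixture
def sample_product() -> Product:
    # Any test in tests/ can now take sample_product as a parameter.
    return build_sample_product()
```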
<aside> 🤓 Curious Geek: Why Pytest Won Over unittest
Python's built-in unittest was modeled after Java's JUnit. It requires you to write test classes, use self.assertEqual() methods, and follow a rigid structure.
Pytest was created in 2004 by Holger Krekel as a simpler alternative. It just uses plain assert statements and functions, no classes required.
Today, pytest is used by projects like Flask, Requests, and pandas. It won because it made testing feel like writing normal Python code!
</aside>
How does pytest know which files and functions to run as tests?
Why is pytest.mark.parametrize considered better than writing a for loop inside a single test function?
<aside>
💡 In the wild: The pytest project itself is a great example of well-structured Python. Browse its testing/ directory to see how the maintainers use fixtures, parametrize, and conftest.py in a real project with thousands of tests.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0
