Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Assignment: Refactoring to a Clean Pipeline

Gotchas & Pitfalls

Lesson Plan

📦 Dataclasses for Data Objects

In the previous chapter, you learned the difference between OOP and functional programming. You saw how Python classes use __init__ and self to manage state and configuration.

But writing __init__ by hand gets repetitive fast, especially for data objects where you just want to store values. An important question appears:

What shape does the data inside the pipeline actually have?

Using dictionaries can work, but it quickly becomes fragile as pipelines grow.

In this chapter, you'll learn about dataclasses: a Python shortcut that auto-generates the __init__ boilerplate for you, while adding type safety and validation.

By the end of this chapter, you'll understand how dataclasses help you write pipelines that are easier to read and safer to modify.

The Problem with Dictionaries

Dictionaries are flexible, but that flexibility comes at a cost.

row = {
    "product_id": 123,
    "price": 19.99,
    "currency": "EUR"
}

Here are some of the limitations:

No protection against typos: A simple misspelling like row["prize"] instead of row["price"] causes a KeyError at runtime. Your code runs fine… until it doesn't. Most likely, you'd only discover the mistake when the pipeline crashes.

No documentation of structure: Looking at a dictionary, you don't know which keys are required, which are optional, or what types the values should be.

Editors can't help you: Your editor can't autocomplete keys or warn you about mistakes. Everything is discovered at runtime, often in production.
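To make the typo problem concrete, here's a minimal sketch: the misspelled key is perfectly valid Python until the moment the line actually runs.

```python
row = {"product_id": 123, "price": 19.99, "currency": "EUR"}

try:
    total = row["prize"] * 2  # typo: "prize" instead of "price"
except KeyError as err:
    # The mistake only surfaces at runtime, as a crash.
    print(f"KeyError: {err}")
```

No editor or linter flags `row["prize"]`, because any string is a legal dictionary key.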

Model Data as Objects

Instead of passing raw dictionaries around, you want to say:

“This is a Product, and it has a price, a currency, and an ID.”

Python gives us a tool that does exactly this: dataclasses. Here’s the same data modeled as a dataclass:

from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

Creating an object now looks like this:

product = Product(
    product_id=123,
    price=19.99,
    currency="EUR"
)

And accessing fields becomes:

product.price
product.currency

Your editor now knows the shape of a Product: which fields exist and what type each one has.

This is already a great improvement! Bugs are way easier to spot and mistakes are caught earlier.

For example, if you were now to type product.prize, your editor will warn you before you run the code.

<aside> 🎬 Animation: Dictionary vs Dataclass: Autocomplete and Safety

</aside>

<aside> 🤓 Curious Geek: Dataclasses vs NamedTuples

Before Python 3.7 introduced Dataclasses, Pythonistas often used collections.namedtuple for simple data objects.

NamedTuples are immutable by default and behave like tuples (you can index them: p[0]), while Dataclasses are mutable classes by default.

Today, Dataclasses are preferred because they support type hints, inheritance, and default values much more elegantly!

💡 Dataclasses make your data structure self-documenting: its shape is explicit and clear, so new team members can understand the pipeline without digging through code.

</aside>
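To make the comparison from the aside concrete, here is a minimal sketch of both styles (the `Point` types are illustrative, not part of the pipeline):

```python
from collections import namedtuple
from dataclasses import dataclass

# The older tuple-based style: immutable, indexable like a tuple.
PointNT = namedtuple("PointNT", ["x", "y"])
p1 = PointNT(1, 2)
print(p1[0])   # positional access works: prints 1

# The dataclass style: type-hinted, mutable, with natural defaults.
@dataclass
class PointDC:
    x: int
    y: int = 0  # default values read naturally

p2 = PointDC(x=1)
p2.y = 5       # mutation is allowed by default
```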

Adding Validation with __post_init__

Dataclasses can also help to enforce rules through validation. Imagine a product with a negative price:

Product(product_id=1, price=-10, currency="EUR")

That shouldn't exist! We can prevent this by adding a __post_init__ method.

Remember how the @dataclass decorator auto-generates __init__ for you? The __post_init__ method runs right after that generated __init__ finishes, giving you a hook to validate or modify the data:

from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def __post_init__(self):
        if self.price <= 0:
            raise ValueError("Price must be positive")

This way, invalid data is rejected immediately, and you can of course set all the rules you want!

Being able to stop bad data early in the process is extremely powerful in pipelines!
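It can help to watch the rejection happen. This sketch constructs one valid and one invalid Product using the class defined above:

```python
from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def __post_init__(self):
        # Runs right after the generated __init__.
        if self.price <= 0:
            raise ValueError("Price must be positive")

# Valid data passes through untouched.
ok = Product(product_id=1, price=19.99, currency="EUR")

# Invalid data is rejected at construction time, before it can
# travel any further through the pipeline.
try:
    Product(product_id=2, price=-10, currency="EUR")
except ValueError as err:
    print(err)  # Price must be positive
```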

<aside> ⌨️ Hands on: Create a Product dataclass with __post_init__ validation: reject negative prices and empty names. Make bad data impossible!

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=2&chapter=dataclasses&exercise=w2_dataclasses__product_model&lang=python

</aside>

Now take this one step further.

Encapsulation

So far, we’ve used dataclasses mainly to store data in a structured way. But dataclasses can do more than hold values: they can also own behavior through methods. This lets you keep logic close to the data it belongs to!

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def price_with_vat(self, vat_rate: float) -> float:
        return self.price * (1 + vat_rate)

Usage:

product.price_with_vat(0.21)

This is called encapsulation: keeping data and its behavior together.

<aside> 💡 If you're unsure what methods to add, ask yourself:

“What should this data object be able to do?”

</aside>
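As a sketch of answering that question, here is the Product with one extra illustrative method (`formatted` is a made-up example, not part of the chapter's API):

```python
from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def price_with_vat(self, vat_rate: float) -> float:
        # Behavior lives next to the data it operates on.
        return self.price * (1 + vat_rate)

    def formatted(self) -> str:
        # The object knows how to present itself.
        return f"{self.price:.2f} {self.currency}"

product = Product(product_id=123, price=19.99, currency="EUR")
print(product.price_with_vat(0.21))  # about 24.19 (19.99 plus 21% VAT)
print(product.formatted())           # 19.99 EUR
```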

Serialization: Going Back to Dicts

Inside your pipeline, dataclasses are great.

But sooner or later, data needs to leave the pipeline: written to a database, sent to an API or saved as JSON. Those systems don't understand Python objects. They understand plain data: dicts, JSON, rows. This is where serialization (converting objects to portable formats) comes in!

Luckily, Python makes this easy.

from dataclasses import asdict

product_dict = asdict(product)

Now your Product object becomes:

{
    "product_id": 123,
    "price": 19.99,
    "currency": "EUR"
}

And here you go, you just got clean, structured data again!

This makes it easy to insert into a database, convert to JSON, or pass along to another system. For example:

import json

json.dumps(asdict(product))

This pattern is extremely common in real pipelines:

Use rich objects inside your logic, convert to plain data at the boundaries.

Dataclasses make this transition explicit and intentional.
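One detail worth knowing: asdict converts nested dataclasses recursively, which matters once your pipeline carries composite records. A sketch (the `Order` type here is illustrative, not from the chapter):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

@dataclass
class Order:
    order_id: int
    items: list[Product]

order = Order(
    order_id=1,
    items=[Product(product_id=123, price=19.99, currency="EUR")],
)

# asdict walks the whole structure: rich objects inside the pipeline,
# plain dicts and lists at the boundary.
payload = asdict(order)
print(json.dumps(payload))
```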

🧠 Knowledge Check

  1. What advantage does accessing row.price have over row["price"] when using dataclasses?
  2. When would you use __post_init__ in a dataclass, and what kind of checks belong there?
  3. Why is it useful to put transformation logic (like vat calculations or validations) inside a dataclass instead of separate functions?
  4. At which points in a data pipeline should dataclasses usually be converted back to dictionaries or JSON?

Extra reading

<aside> 💡 In the wild: FastAPI uses Pydantic models (which build on the same idea as dataclasses) to validate every API request automatically. When a user sends bad data, FastAPI rejects it before your code even runs, exactly like __post_init__ validation.

</aside>


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.