Week 2 - Structuring Data Pipelines

Introduction to Data Pipelines

Configuration & Secrets (.env)

Separation of Concerns (I/O vs Logic)

OOP vs Functional Programming

Dataclasses for Data Objects

Functional Composition

Testing with Pytest

Linting and Formatting with Ruff

Practice

Assignment: Refactoring to a Clean Pipeline

Gotchas & Pitfalls

Lesson Plan

📦 Dataclasses for Data Objects

In the previous chapter, you learned the difference between OOP and functional programming. You saw how Python classes use __init__ and self to manage state and configuration.

But writing __init__ by hand gets repetitive fast, especially for data objects where you just want to store values. An important question appears:

What shape does the data inside the pipeline actually have?

Using dictionaries can work, but it quickly becomes fragile as pipelines grow.

In this chapter, you'll learn about dataclasses: a Python shortcut that auto-generates the __init__ boilerplate for you, while adding type safety and validation.

By the end of this chapter, you'll understand how dataclasses help you write pipelines that are easier to read and safer to modify.

The Problem with Dictionaries

Dictionaries are flexible, but that flexibility comes at a cost.

row = {
    "product_id": 123,
    "price": 19.99,
    "currency": "EUR"
}

Here are some of the limitations:

No protection against typos: A simple misspelling like row["prize"] instead of row["price"] causes a KeyError at runtime. Your code runs fine… until it doesn't. Most likely, you'd only discover the mistake when the pipeline crashes.

No documentation of structure: Looking at a dictionary, you don't know which keys are required, which are optional, or what types the values should be.

Editors can't help you: Your editor can't autocomplete keys or warn you about mistakes. Everything is discovered at runtime, often in production.
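To make the typo problem concrete, here's a minimal sketch: the misspelled key is perfectly valid Python until the moment the line actually runs.

```python
row = {"product_id": 123, "price": 19.99, "currency": "EUR"}

try:
    total = row["prize"] * 2  # typo: "prize" instead of "price"
except KeyError as err:
    # The mistake only surfaces at runtime, as a crash.
    print(f"KeyError: {err}")
```

No editor or linter flags `row["prize"]`, because any string is a legal dictionary key.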

Model Data as Objects

Instead of passing raw dictionaries around, you want to say:

“This is a Product, and it has a price, a currency, and an ID.”

Python gives us a tool that does exactly this: dataclasses. Here’s the same data modeled as a dataclass:

from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

Creating an object now looks like this:

product = Product(
    product_id=123,
    price=19.99,
    currency="EUR"
)

And accessing fields becomes:

product.price
product.currency

Your editor now knows the shape of a Product: which fields exist and what type each one has.

This is already a great improvement! Bugs are way easier to spot and mistakes are caught earlier.

For example, if you were now to type product.prize, your editor will warn you before you run the code.

<aside> 🎬 Animation: Dictionary vs Dataclass: Autocomplete and Safety

</aside>

<aside> 🤓 Curious Geek: Dataclasses vs NamedTuples

Before Python 3.7 introduced Dataclasses, Pythonistas often used collections.namedtuple for simple data objects.

NamedTuples are immutable by default and behave like tuples (you can index them: p[0]), while Dataclasses are mutable classes by default.

Today, Dataclasses are preferred because they support type hints, inheritance, and default values much more elegantly!

💡 Dataclasses make your data structure self-documenting: its shape is explicit and clear, so new team members can understand the pipeline without digging through code.

</aside>
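To make the comparison from the aside concrete, here is a minimal sketch of both styles (the `Point` types are illustrative, not part of the pipeline):

```python
from collections import namedtuple
from dataclasses import dataclass

# The older tuple-based style: immutable, indexable like a tuple.
PointNT = namedtuple("PointNT", ["x", "y"])
p1 = PointNT(1, 2)
print(p1[0])   # positional access works: prints 1

# The dataclass style: type-hinted, mutable, with natural defaults.
@dataclass
class PointDC:
    x: int
    y: int = 0  # default values read naturally

p2 = PointDC(x=1)
p2.y = 5       # mutation is allowed by default
```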

Adding Validation with __post_init__

Dataclasses can also help to enforce rules through validation. Imagine a product with a negative price:

Product(product_id=1, price=-10, currency="EUR")

That shouldn't exist! We can prevent this by adding a __post_init__ method.

Remember how the @dataclass decorator auto-generates __init__ for you? The __post_init__ method runs right after that generated __init__ finishes, giving you a hook to validate or modify the data:

from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def __post_init__(self):
        if self.price <= 0:
            raise ValueError("Price must be positive")

This way, invalid data is rejected immediately, and you can of course set all the rules you want!

Being able to stop bad data early in the process is extremely powerful in pipelines!
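It can help to watch the rejection happen. This sketch constructs one valid and one invalid Product using the class defined above:

```python
from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def __post_init__(self):
        # Runs right after the generated __init__.
        if self.price <= 0:
            raise ValueError("Price must be positive")

# Valid data passes through untouched.
ok = Product(product_id=1, price=19.99, currency="EUR")

# Invalid data is rejected at construction time, before it can
# travel any further through the pipeline.
try:
    Product(product_id=2, price=-10, currency="EUR")
except ValueError as err:
    print(err)  # Price must be positive
```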

<aside> ⌨️ Hands on: Create a Product dataclass with __post_init__ validation: reject negative prices and empty names. Make bad data impossible!

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=2&chapter=dataclasses&exercise=w2_dataclasses__product_model&lang=python

</aside>

Now take this one step further.

Encapsulation

So far, we’ve used dataclasses mainly to store data in a structured way. But dataclasses can do more than hold values: they can also own behavior through methods. This lets you keep logic close to the data it belongs to!

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def price_with_vat(self, vat_rate: float) -> float:
        return self.price * (1 + vat_rate)

Usage:

product.price_with_vat(0.21)

This is called encapsulation: keeping data and its behavior together.

<aside> 💡 If you're unsure what methods to add, ask yourself:

“What should this data object be able to do?”

</aside>
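As a sketch of answering that question, here is the Product with one extra illustrative method (`formatted` is a made-up example, not part of the chapter's API):

```python
from dataclasses import dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

    def price_with_vat(self, vat_rate: float) -> float:
        # Behavior lives next to the data it operates on.
        return self.price * (1 + vat_rate)

    def formatted(self) -> str:
        # The object knows how to present itself.
        return f"{self.price:.2f} {self.currency}"

product = Product(product_id=123, price=19.99, currency="EUR")
print(product.price_with_vat(0.21))  # about 24.19 (19.99 plus 21% VAT)
print(product.formatted())           # 19.99 EUR
```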

Serialization: Going Back to Dicts

Inside your pipeline, dataclasses are great.

But sooner or later, data needs to leave the pipeline: written to a database, sent to an API or saved as JSON. Those systems don't understand Python objects. They understand plain data: dicts, JSON, rows. This is where serialization (converting objects to portable formats) comes in!

Luckily, Python makes this easy.

from dataclasses import asdict

product_dict = asdict(product)

Now your Product object becomes:

{
    "product_id": 123,
    "price": 19.99,
    "currency": "EUR"
}

And here you go, you just got clean, structured data again!

This makes it easy to insert into a database, convert to JSON, or pass along to another system. For example:

import json

json.dumps(asdict(product))

This pattern is extremely common in real pipelines:

Use rich objects inside your logic, convert to plain data at the boundaries.

Dataclasses make this transition explicit and intentional.
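One detail worth knowing: asdict converts nested dataclasses recursively, which matters once your pipeline carries composite records. A sketch (the `Order` type here is illustrative, not from the chapter):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Product:
    product_id: int
    price: float
    currency: str

@dataclass
class Order:
    order_id: int
    items: list[Product]

order = Order(
    order_id=1,
    items=[Product(product_id=123, price=19.99, currency="EUR")],
)

# asdict walks the whole structure: rich objects inside the pipeline,
# plain dicts and lists at the boundary.
payload = asdict(order)
print(json.dumps(payload))
```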

🧠 Knowledge Check

  1. What advantage does accessing row.price have over row["price"] when using dataclasses?
  2. When would you use __post_init__ in a dataclass, and what kind of checks belong there?
  3. Why is it useful to put transformation logic (like vat calculations or validations) inside a dataclass instead of separate functions?
  4. At which points in a data pipeline should dataclasses usually be converted back to dictionaries or JSON?

Extra reading

<aside> 💡 In the wild: FastAPI uses Pydantic models (which build on the same idea as dataclasses) to validate every API request automatically. When a user sends bad data, FastAPI rejects it before your code even runs, exactly like __post_init__ validation.

</aside>


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.