Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Raw tables are noisy. Aggregation turns raw events into business metrics like revenue per month or churn per region. Pandas makes this fast with groupby.
By the end of this chapter, you should be able to group a DataFrame by one or more keys, apply multiple aggregation functions in a single call, and add group-level statistics back to individual rows with transform.
<aside>
📦 Run the examples: companion_ch3_grouping_aggregation.py: run in the Codespace or clone locally to follow along with this chapter.
</aside>
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4, 5, 6, 7],
"customer_id": [10, 10, 10, 20, 20, 30, 30],
"region": ["NL", "NL", "BE", "DE", "NL", "BE", "NL"],
"month": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02", "2024-02", "2024-02"],
"order_date": pd.to_datetime(
["2024-01-05", "2024-01-12", "2024-01-20", "2024-02-03",
"2024-02-14", "2024-02-18", "2024-02-25"]
),
"amount": [120, 80, 90, 200, 150, 50, 95],
}
)
An aggregation function reduces a column of many values to a single value.
| Function | What it returns |
|---|---|
sum |
Total of all values |
mean |
Average |
count |
Number of non-null values |
min / max |
Smallest / largest value |
std |
Standard deviation |
The simplest use is applying one function to one column directly:
orders["amount"].sum() # 785
orders["amount"].mean() # 112.14...
These same functions power groupby: instead of aggregating the whole column, they aggregate within each group.
groupby PatternA groupby splits data into groups, applies a function, then combines the results.
Start with the simplest form: one column to group by, one function.
orders.groupby("region")["amount"].sum()
region
BE 140
DE 200
NL 445
Name: amount, dtype: int64
Each region becomes one row; sum collapses all orders in that region to a single total. From here you can add more columns and more functions using .agg().
summary = (
orders.groupby(["region", "month"], as_index=False)
.agg(
total_revenue=("amount", "sum"),
avg_order_value=("amount", "mean"),
order_count=("order_id", "count"),
)
)
<aside>
💡 Use as_index=False to keep group keys as normal columns.
</aside>
This pattern (split by key, apply a function, combine results) is called Split-Apply-Combine.
<aside> 🤓 Curious Geek: Split-Apply-Combine
This pattern was popularized by Hadley Wickham and is the mental model behind groupby.
</aside>
You can mix functions per column.
metrics = orders.groupby("region").agg(
total=("amount", "sum"),
max_order=("amount", "max"),
first_order_date=("order_date", "min"),
)
size() vs count()size() counts all rows, including NaN.count() ignores NaN values in the column.orders.groupby("region").size()
orders.groupby("region")["amount"].count()
transform keeps the original number of rows, which is useful for adding group-level stats back to each row.
orders["region_avg"] = orders.groupby("region")["amount"].transform("mean")
You can filter groups based on group-level metrics.
high_volume = orders.groupby("customer_id").filter(lambda group: len(group) >= 3)
<aside>
⚠️ groupby drops rows where grouping keys are NaN. Use dropna=False or fill missing values before grouping.
</aside>
Filtering groups based on their size uses filter, not a boolean mask on the original DataFrame.
Use this sample table. Group by region and calculate total revenue and average order value. Add a column with each order's region average using transform.
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4, 5, 6],
"region": ["NL", "NL", "BE", "DE", "BE", "NL"],
"amount": [120, 80, 90, 200, 50, 150],
}
)
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=grouping_aggregation&exercise=w4_grouping_aggregation__region_summary&lang=python
</aside>
Use the hands-on orders DataFrame (order_id, region, amount) for all exercises.
region and count the number of orders using both size() and count(). Check whether the results differ if you introduce a NaN into the amount column first.amount per region in a single agg call with named output columns.region_avg with each order's region average using transform("mean") on amount.filter.as_index=False do in a groupby call?groupby("region").size() and groupby("region")["amount"].count()?transform instead of agg?customer_id drops 5% of orders silently. What is the most likely cause?groupby("customer_id").filter(lambda df: len(df) >= 3) return?Test your recall before moving on.
<aside> 🚀 Try it in the widget: Interactive Quiz: Grouping and Aggregation
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch3_grouping_aggregation_quiz&embed=1
If groupby or transform felt unclear, this video walks through the same concepts step by step.
<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=txMdrV1Ut64
You can also describe the business metric you want to an LLM to get a starting point.
<aside>
💡 Using AI to help: Describe the business metric you want (⚠️ Ensure no PII or sensitive company data is included!): for example "revenue per region per month, with a running total column", and ask an LLM to write the groupby + transform chain. Check that the row count matches your original DataFrame if using transform.
</aside>
Ready to apply these skills? Try the practice exercise before moving on.
<aside> ⌨️ Hands on: Practice with Exercise 3: Grouping and Aggregation.
</aside>
Next up: Joining and Merging DataFrames, where you combine multiple tables using different join types and track unmatched rows with indicator.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.