Week 4 - Data Processing with Pandas
Introduction to Pandas and DataFrames
Selecting, Filtering, and Sorting Data
Joining and Merging DataFrames
Working with Strings and Dates
Assignment: MessyCorp Goes Pandas
Raw tables are noisy. Aggregation turns raw events into business metrics like revenue per month or churn per region. Pandas makes this fast with groupby.
groupby PatternA groupby splits data into groups, applies a function, then combines the results.
summary = (
orders.groupby(["region", "month"], as_index=False)
.agg(
total_revenue=("amount", "sum"),
avg_order_value=("amount", "mean"),
order_count=("order_id", "count"),
)
)
<aside>
💡 Use as_index=False to keep group keys as normal columns.
</aside>
<aside> 🤓 Curious Geek: Split-Apply-Combine
This pattern was popularized by Hadley Wickham and is the mental model behind groupby.
</aside>
You can mix functions per column.
metrics = orders.groupby("region").agg(
total=("amount", "sum"),
max_order=("amount", "max"),
first_order_date=("order_date", "min"),
)
size() vs count()size() counts all rows, including NaN.count() ignores NaN values in the column.orders.groupby("region").size()
orders.groupby("region")["amount"].count()
transform keeps the original number of rows, which is useful for adding group-level stats back to each row.
orders["region_avg"] = orders.groupby("region")["amount"].transform("mean")
You can filter groups based on group-level metrics.
high_volume = orders.groupby("customer_id").filter(lambda df: len(df) >= 3)
<aside>
⚠️ groupby drops rows where grouping keys are NaN. Use dropna=False or fill missing values before grouping.
</aside>
<aside>
⌨️ Hands on: Use this sample table. Group by region and calculate total revenue and average order value. Add a column with each order's region average using transform.
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4, 5, 6],
"region": ["NL", "NL", "BE", "DE", "BE", "NL"],
"amount": [120, 80, 90, 200, 50, 150],
}
)
🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=grouping_aggregation&exercise=w4_grouping_aggregation__region_summary&lang=python
💭 The widget uses plain Python lists of dictionaries to mimic a table.
</aside>
product_category and count the number of orders in each group.groupby(["year", "month"]).transform("sum") on amount.The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.