Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Grouping and Aggregation

Raw tables are noisy. Aggregation turns raw events into business metrics like revenue per month or churn per region. Pandas makes this fast with groupby.

By the end of this chapter, you should be able to group a DataFrame by one or more keys, apply multiple aggregation functions in a single call, and add group-level statistics back to individual rows with transform.

<aside> 📦 Run the examples: companion_ch3_grouping_aggregation.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6, 7],
        "customer_id": [10, 10, 10, 20, 20, 30, 30],
        "region": ["NL", "NL", "BE", "DE", "NL", "BE", "NL"],
        "month": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02", "2024-02", "2024-02"],
        "order_date": pd.to_datetime(
            ["2024-01-05", "2024-01-12", "2024-01-20", "2024-02-03",
             "2024-02-14", "2024-02-18", "2024-02-25"]
        ),
        "amount": [120, 80, 90, 200, 150, 50, 95],
    }
)

Aggregation Functions

An aggregation function reduces a column of many values to a single value.

Function What it returns
sum Total of all values
mean Average
count Number of non-null values
min / max Smallest / largest value
std Standard deviation

The simplest use is applying one function to one column directly:

orders["amount"].sum()   # 785
orders["amount"].mean()  # 112.14...

These same functions power groupby: instead of aggregating the whole column, they aggregate within each group.

The groupby Pattern

A groupby splits data into groups, applies a function, then combines the results.

Start with the simplest form: one column to group by, one function.

orders.groupby("region")["amount"].sum()
region
BE    140
DE    200
NL    445
Name: amount, dtype: int64

Each region becomes one row; sum collapses all orders in that region to a single total. From here you can add more columns and more functions using .agg().

summary = (
    orders.groupby(["region", "month"], as_index=False)
    .agg(
        total_revenue=("amount", "sum"),
        avg_order_value=("amount", "mean"),
        order_count=("order_id", "count"),
    )
)

<aside> 💡 Use as_index=False to keep group keys as normal columns.

</aside>

This pattern (split by key, apply a function, combine results) is called Split-Apply-Combine.

<aside> 🤓 Curious Geek: Split-Apply-Combine

This pattern was popularized by Hadley Wickham and is the mental model behind groupby.

</aside>

Multiple Aggregations

You can mix functions per column.

metrics = orders.groupby("region").agg(
    total=("amount", "sum"),
    max_order=("amount", "max"),
    first_order_date=("order_date", "min"),
)

size() vs count()

orders.groupby("region").size()
orders.groupby("region")["amount"].count()

Transform vs Aggregate

transform keeps the original number of rows, which is useful for adding group-level stats back to each row.

orders["region_avg"] = orders.groupby("region")["amount"].transform("mean")

Filtering Groups

You can filter groups based on group-level metrics.

high_volume = orders.groupby("customer_id").filter(lambda group: len(group) >= 3)

<aside> ⚠️ groupby drops rows where grouping keys are NaN. Use dropna=False or fill missing values before grouping.

</aside>

Filtering groups based on their size uses filter, not a boolean mask on the original DataFrame.

⌨️ Hands on: Group and Transform

Use this sample table. Group by region and calculate total revenue and average order value. Add a column with each order's region average using transform.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "region": ["NL", "NL", "BE", "DE", "BE", "NL"],
        "amount": [120, 80, 90, 200, 50, 150],
    }
)

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=grouping_aggregation&exercise=w4_grouping_aggregation__region_summary&lang=python

</aside>

Exercises

Use the hands-on orders DataFrame (order_id, region, amount) for all exercises.

  1. Group by region and count the number of orders using both size() and count(). Check whether the results differ if you introduce a NaN into the amount column first.
  2. Compute total and average amount per region in a single agg call with named output columns.
  3. Add a column region_avg with each order's region average using transform("mean") on amount.
  4. Filter the grouped DataFrame to keep only regions that have more than 1 order using filter.

Extra reading

Knowledge Check

Test your recall before moving on.

<aside> 🚀 Try it in the widget: Interactive Quiz: Grouping and Aggregation

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch3_grouping_aggregation_quiz&embed=1

If groupby or transform felt unclear, this video walks through the same concepts step by step.

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=txMdrV1Ut64

You can also describe the business metric you want to an LLM to get a starting point.

<aside> 💡 Using AI to help: Describe the business metric you want (⚠️ Ensure no PII or sensitive company data is included!): for example "revenue per region per month, with a running total column", and ask an LLM to write the groupby + transform chain. Check that the row count matches your original DataFrame if using transform.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> ⌨️ Hands on: Practice with Exercise 3: Grouping and Aggregation.

</aside>


Next up: Joining and Merging DataFrames, where you combine multiple tables using different join types and track unmatched rows with indicator.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.