Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

📊 Grouping and Aggregation

Raw tables are noisy. Aggregation turns raw events into business metrics like revenue per month or churn per region. Pandas makes this fast with groupby.

Concepts

The groupby Pattern

A groupby splits data into groups, applies a function, then combines the results.

summary = (
    orders.groupby(["region", "month"], as_index=False)
    .agg(
        total_revenue=("amount", "sum"),
        avg_order_value=("amount", "mean"),
        order_count=("order_id", "count"),
    )
)

<aside> 💡 Use as_index=False to keep group keys as normal columns.

</aside>

<aside> 🤓 Curious Geek: Split-Apply-Combine

This pattern was popularized by Hadley Wickham and is the mental model behind groupby.

</aside>

Multiple Aggregations

You can mix functions per column.

metrics = orders.groupby("region").agg(
    total=("amount", "sum"),
    max_order=("amount", "max"),
    first_order_date=("order_date", "min"),
)

size() vs count()

orders.groupby("region").size()
orders.groupby("region")["amount"].count()

Transform vs Aggregate

transform keeps the original number of rows, which is useful for adding group-level stats back to each row.

orders["region_avg"] = orders.groupby("region")["amount"].transform("mean")

Filtering Groups

You can filter groups based on group-level metrics.

high_volume = orders.groupby("customer_id").filter(lambda df: len(df) >= 3)

<aside> ⚠️ groupby drops rows where grouping keys are NaN. Use dropna=False or fill missing values before grouping.

</aside>

<aside> ⌨️ Hands on: Use this sample table. Group by region and calculate total revenue and average order value. Add a column with each order's region average using transform.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "region": ["NL", "NL", "BE", "DE", "BE", "NL"],
        "amount": [120, 80, 90, 200, 50, 150],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=grouping_aggregation&exercise=w4_grouping_aggregation__region_summary&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

  1. Group by product_category and count the number of orders in each group.
  2. Compute monthly revenue using groupby(["year", "month"]).
  3. Add a column with each customer's lifetime value using transform("sum") on amount.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.