Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

⚙️ Advanced Transformations

Once your data is clean, you often need to reshape it for dashboards, machine learning features, or warehouse tables. These transformations are where Pandas shines, but also where performance traps appear.

Concepts

Pivot and Melt

Use pivot to go from long to wide. Use melt to go from wide to long.

# Long to wide
pivoted = orders.pivot(index="customer_id", columns="month", values="amount")

# Wide to long
long = pivoted.reset_index().melt(
    id_vars="customer_id",
    var_name="month",
    value_name="amount",
)

<aside> 🤓 Curious Geek: Pivot tables

The term comes from spreadsheet tools, where pivot tables helped analysts summarize large tables quickly.

</aside>

pivot_table for Aggregations

summary = orders.pivot_table(
    index="region",
    columns="channel",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

Mapping and Applying Functions

Use map or replace for simple lookups. Use apply only when needed.

region_map = {"NL": "Europe", "US": "North America"}
orders["region_group"] = orders["country"].map(region_map).fillna("Other")

Window Functions

Rolling and cumulative metrics are common in analytics.

orders = orders.sort_values("order_date")
orders["rolling_7d"] = orders["amount"].rolling(7, min_periods=1).sum()
orders["cumulative"] = orders["amount"].cumsum()
orders["prev_amount"] = orders["amount"].shift(1)

Vectorization vs Loops

Vectorized operations are faster and clearer.

# Slow loop
orders["amount_usd"] = [amount * 1.1 for amount in orders["amount"]]

# Vectorized
orders["amount_usd"] = orders["amount"] * 1.1

<aside> ⚠️ apply can be much slower than vectorized operations. Reach for apply only when no vectorized option exists.

</aside>

<aside> ⌨️ Hands on: Use this sample table. Pivot by region and month, then melt it back into a long format.

import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "BE"],
        "month": ["2024-01", "2024-02", "2024-01", "2024-02"],
        "amount": [120, 150, 90, 110],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=advanced_transformations&exercise=w4_advanced_transformations__pivot_sales&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

  1. Create a pivot table that shows total amount per region and channel.
  2. Add a rolling_30d column that sums the last 30 days of orders.
  3. Map product codes to readable names using a dictionary and map.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.