Advanced Transformations

Once your data is clean, you often need to reshape it for dashboards, machine learning features, or warehouse tables. These transformations are where Pandas shines, but also where performance traps appear.

By the end of this chapter, you should be able to reshape DataFrames with pivot and melt, apply rolling and cumulative window functions, and replace slow loops with vectorized operations.

<aside> 📦 Run the examples: companion_ch6_advanced_transformations.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "customer_id": [10, 10, 20, 20, 30, 30],
        "country": ["NL", "BE", "NL", "DE", "BE", "NL"],
        "month": ["2024-01", "2024-01", "2024-02", "2024-02", "2024-03", "2024-03"],
        "channel": ["web", "app", "web", "app", "web", "app"],
        "order_date": pd.to_datetime(
            ["2024-01-05", "2024-01-12", "2024-02-03", "2024-02-18",
             "2024-03-07", "2024-03-22"]
        ),
        "amount": [120, 80, 200, 150, 90, 95],
    }
)

Pivot and Melt

Long format has one row per observation: one region per month means many rows. Wide format has one row per entity with a separate column for each category value: each month becomes its own column. Long format is easier to filter and aggregate; wide format is easier to read as a report.

Use pivot to go from long to wide. Use melt to go from wide to long.

# pivot requires one unique value per (index, column) combination
sales = pd.DataFrame(
    {
        "customer_id": [10, 10, 20, 20],
        "month": ["2024-01", "2024-02", "2024-01", "2024-02"],
        "amount": [120, 150, 90, 110],
    }
)

# Long to wide
pivoted = sales.pivot(index="customer_id", columns="month", values="amount")

# Wide to long
long = pivoted.reset_index().melt(
    id_vars="customer_id",
    var_name="month",
    value_name="amount",
)

<aside> 🤓 Curious Geek: Pivot tables

The term comes from spreadsheet tools, where pivot tables helped analysts summarize large tables quickly.

</aside>

`pivot_table` for Aggregations

pivot_table is like pivot but handles duplicate key combinations by aggregating them. Where pivot requires exactly one value per (index, column) pair, pivot_table accepts many and reduces them with a function like sum or mean.

summary = orders.pivot_table(
    index="country",
    columns="channel",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

Mapping and Applying Functions

When you create a new column, choose the narrowest tool that fits the job. Most transformations should use a Pandas built-in instead of a custom Python function.

Use map when each value in one Series should be translated through a dictionary or a simple function. Use replace when you want to swap known values and keep unmatched values unchanged.

region_map = {"NL": "Europe", "BE": "Europe", "DE": "Europe", "US": "North America"}
orders["region_group"] = orders["country"].map(region_map).fillna("Other")

country_names = {"NL": "Netherlands", "BE": "Belgium", "DE": "Germany"}
orders["country_name"] = orders["country"].replace(country_names)

Use vectorized built-ins when Pandas already has the operation. This connects back to earlier chapters: arithmetic from Pandas and DataFrames, .str and .dt from Working with Strings and Dates, and groupby().transform() from Grouping and Aggregation.

# Prefer built-ins for column-wise operations
orders["amount_usd"] = orders["amount"] * 1.1
orders["country_lower"] = orders["country"].str.lower()
orders["order_month"] = orders["order_date"].dt.to_period("M").astype(str)
orders["customer_avg"] = orders.groupby("customer_id")["amount"].transform("mean")

Use Series.apply() only when each single value needs custom Python logic that Pandas does not already provide. Use DataFrame.apply(axis=1) even more carefully: it passes one row at a time into Python, so it is usually slower and harder to test than vectorized code.

def order_band(amount):
    if amount >= 150:
        return "large"
    if amount >= 100:
        return "medium"
    return "small"

orders["order_band"] = orders["amount"].apply(order_band)

def customer_channel_label(row):
    return f"{row['customer_id']}-{row['channel']}"

orders["customer_channel"] = orders.apply(customer_channel_label, axis=1)

<aside> ⚠️ apply can be much slower than vectorized operations. Reach for apply only when no vectorized option exists.

</aside>

Replacing apply with a vectorized operation is one of the highest-value Pandas performance improvements you can make.

Window Functions

A rolling function aggregates over a sliding window of rows, for example the sum of the last 7 rows. A cumulative function aggregates from the first row up to the current one, growing with each row. Both are common in time-series analytics for smoothing noise or tracking running totals.

orders = orders.sort_values("order_date")
orders["rolling_7d"] = orders["amount"].rolling(7, min_periods=1).sum()
orders["cumulative"] = orders["amount"].cumsum()
orders["prev_amount"] = orders["amount"].shift(1)

Vectorization vs Loops

Vectorized operations are faster and clearer.

# Slow loop
orders["amount_usd"] = [amount * 1.1 for amount in orders["amount"]]

# Vectorized
orders["amount_usd"] = orders["amount"] * 1.1

⌨️ Hands on: Pivot and Melt

Use this sample table. Pivot by region and month, then melt it back into a long format.

import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "BE"],
        "month": ["2024-01", "2024-02", "2024-01", "2024-02"],
        "amount": [120, 150, 90, 110],
    }
)

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=advanced_transformations&exercise=w4_advanced_transformations__pivot_sales&lang=python

</aside>

Exercises

Create a pivot table that shows total amount per region and channel.
Add a rolling_30d column that sums the last 30 days of orders.
Map product codes to readable names using a dictionary and map.

Extra reading

Knowledge Check

1. What is the difference between pivot and melt?
1. Why should you avoid apply when a vectorized alternative exists?
1. What does orders["amount"].rolling(7, min_periods=1).sum() compute?
1. When would you use map instead of apply?
1. A pivot produces NaN in cells where there are no matching rows. How do you fill them?

Test your recall before moving on.

<aside> 🚀 Try it in the widget: Interactive Quiz: Advanced Transformations

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch6_advanced_transformations_quiz&embed=1

You can also describe the shape you want to an LLM to get a starting point.

<aside> 💡 Using AI to help: Describe the shape you want (⚠️ Ensure no PII or sensitive company data is included!): for example "I have one row per region per month and I want a pivot table where months are columns", and ask an LLM to write the pivot call. Verify the output shape with .shape and confirm no rows were silently dropped.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> ⌨️ Hands on: Practice with Exercise 5: Reshape and Export.

</aside>

Next up: Writing Data, where you export cleaned DataFrames to CSV, Parquet, SQLite, and Azure Blob Storage.

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.

Advanced Transformations

Concepts

Pivot and Melt

pivot_table for Aggregations

Mapping and Applying Functions

Window Functions

Vectorization vs Loops

⌨️ Hands on: Pivot and Melt

Exercises

Extra reading

Knowledge Check

`pivot_table` for Aggregations