Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Once your data is clean, you often need to reshape it for dashboards, machine learning features, or warehouse tables. These transformations are where Pandas shines, but also where performance traps appear.
By the end of this chapter, you should be able to reshape DataFrames with pivot and melt, apply rolling and cumulative window functions, and replace slow loops with vectorized operations.
<aside>
📦 Run the examples: companion_ch6_advanced_transformations.py: run in the Codespace or clone locally to follow along with this chapter.
</aside>
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4, 5, 6],
"customer_id": [10, 10, 20, 20, 30, 30],
"country": ["NL", "BE", "NL", "DE", "BE", "NL"],
"month": ["2024-01", "2024-01", "2024-02", "2024-02", "2024-03", "2024-03"],
"channel": ["web", "app", "web", "app", "web", "app"],
"order_date": pd.to_datetime(
["2024-01-05", "2024-01-12", "2024-02-03", "2024-02-18",
"2024-03-07", "2024-03-22"]
),
"amount": [120, 80, 200, 150, 90, 95],
}
)
Long format has one row per observation: one region per month means many rows. Wide format has one row per entity with a separate column for each category value: each month becomes its own column. Long format is easier to filter and aggregate; wide format is easier to read as a report.
Use pivot to go from long to wide. Use melt to go from wide to long.
# pivot requires one unique value per (index, column) combination
sales = pd.DataFrame(
{
"customer_id": [10, 10, 20, 20],
"month": ["2024-01", "2024-02", "2024-01", "2024-02"],
"amount": [120, 150, 90, 110],
}
)
# Long to wide
pivoted = sales.pivot(index="customer_id", columns="month", values="amount")
# Wide to long
long = pivoted.reset_index().melt(
id_vars="customer_id",
var_name="month",
value_name="amount",
)
<aside> 🤓 Curious Geek: Pivot tables
The term comes from spreadsheet tools, where pivot tables helped analysts summarize large tables quickly.
</aside>
pivot_table for Aggregationspivot_table is like pivot but handles duplicate key combinations by aggregating them. Where pivot requires exactly one value per (index, column) pair, pivot_table accepts many and reduces them with a function like sum or mean.
summary = orders.pivot_table(
index="country",
columns="channel",
values="amount",
aggfunc="sum",
fill_value=0,
)
When you create a new column, choose the narrowest tool that fits the job. Most transformations should use a Pandas built-in instead of a custom Python function.
Use map when each value in one Series should be translated through a dictionary or a simple function. Use replace when you want to swap known values and keep unmatched values unchanged.
region_map = {"NL": "Europe", "BE": "Europe", "DE": "Europe", "US": "North America"}
orders["region_group"] = orders["country"].map(region_map).fillna("Other")
country_names = {"NL": "Netherlands", "BE": "Belgium", "DE": "Germany"}
orders["country_name"] = orders["country"].replace(country_names)
Use vectorized built-ins when Pandas already has the operation. This connects back to earlier chapters: arithmetic from Pandas and DataFrames, .str and .dt from Working with Strings and Dates, and groupby().transform() from Grouping and Aggregation.
# Prefer built-ins for column-wise operations
orders["amount_usd"] = orders["amount"] * 1.1
orders["country_lower"] = orders["country"].str.lower()
orders["order_month"] = orders["order_date"].dt.to_period("M").astype(str)
orders["customer_avg"] = orders.groupby("customer_id")["amount"].transform("mean")
Use Series.apply() only when each single value needs custom Python logic that Pandas does not already provide. Use DataFrame.apply(axis=1) even more carefully: it passes one row at a time into Python, so it is usually slower and harder to test than vectorized code.
def order_band(amount):
if amount >= 150:
return "large"
if amount >= 100:
return "medium"
return "small"
orders["order_band"] = orders["amount"].apply(order_band)
def customer_channel_label(row):
return f"{row['customer_id']}-{row['channel']}"
orders["customer_channel"] = orders.apply(customer_channel_label, axis=1)
<aside>
⚠️ apply can be much slower than vectorized operations. Reach for apply only when no vectorized option exists.
</aside>
Replacing apply with a vectorized operation is one of the highest-value Pandas performance improvements you can make.
A rolling function aggregates over a sliding window of rows, for example the sum of the last 7 rows. A cumulative function aggregates from the first row up to the current one, growing with each row. Both are common in time-series analytics for smoothing noise or tracking running totals.
orders = orders.sort_values("order_date")
orders["rolling_7d"] = orders["amount"].rolling(7, min_periods=1).sum()
orders["cumulative"] = orders["amount"].cumsum()
orders["prev_amount"] = orders["amount"].shift(1)
Vectorized operations are faster and clearer.
# Slow loop
orders["amount_usd"] = [amount * 1.1 for amount in orders["amount"]]
# Vectorized
orders["amount_usd"] = orders["amount"] * 1.1
Use this sample table. Pivot by region and month, then melt it back into a long format.
import pandas as pd
sales = pd.DataFrame(
{
"region": ["NL", "NL", "BE", "BE"],
"month": ["2024-01", "2024-02", "2024-01", "2024-02"],
"amount": [120, 150, 90, 110],
}
)
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=advanced_transformations&exercise=w4_advanced_transformations__pivot_sales&lang=python
</aside>
amount per region and channel.rolling_30d column that sums the last 30 days of orders.map.pivot and melt?apply when a vectorized alternative exists?orders["amount"].rolling(7, min_periods=1).sum() compute?map instead of apply?NaN in cells where there are no matching rows. How do you fill them?Test your recall before moving on.
<aside> 🚀 Try it in the widget: Interactive Quiz: Advanced Transformations
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch6_advanced_transformations_quiz&embed=1
You can also describe the shape you want to an LLM to get a starting point.
<aside>
💡 Using AI to help: Describe the shape you want (⚠️ Ensure no PII or sensitive company data is included!): for example "I have one row per region per month and I want a pivot table where months are columns", and ask an LLM to write the pivot call. Verify the output shape with .shape and confirm no rows were silently dropped.
</aside>
Ready to apply these skills? Try the practice exercise before moving on.
<aside> ⌨️ Hands on: Practice with Exercise 5: Reshape and Export.
</aside>
Next up: Writing Data, where you export cleaned DataFrames to CSV, Parquet, SQLite, and Azure Blob Storage.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.