Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

⚠️ Gotchas & Pitfalls

Pandas makes data work feel simple, but there are traps that can silently break your results. Read these carefully and you will save yourself hours.

1. Chained Indexing and SettingWithCopyWarning

The Misconception

You can filter a DataFrame and then assign a new column directly.

The Reality

Pandas might return a view, not a copy. The assignment can fail silently.

# BAD
nl_orders = orders[orders["region"] == "NL"]
nl_orders["amount_eur"] = nl_orders["amount"] * 0.92

The Fix: Use .loc on the original DataFrame or call .copy().

# GOOD
nl_orders = orders.loc[orders["region"] == "NL"].copy()
nl_orders["amount_eur"] = nl_orders["amount"] * 0.92

2. and vs & in Filters

The Misconception

Python and works with Pandas Series.

The Reality

You must use & and wrap each condition in parentheses.

# BAD
orders[orders["region"] == "NL" and orders["amount"] > 100]

# GOOD
orders[(orders["region"] == "NL") & (orders["amount"] > 100)]

3. Index Alignment Surprises

The Misconception

Pandas adds two Series row by row like a list.

The Reality

Pandas aligns by index labels, not position.

s1 = pd.Series([10, 20], index=["a", "b"])
s2 = pd.Series([1, 2], index=["b", "a"])

s1 + s2  # Result aligns on labels, not row order

The Fix: Reset or align indices intentionally.

4. NaN Does Not Equal NaN

The Misconception

You can compare missing values with ==.

The Reality

NaN != NaN. Use isna() or notna().

orders[orders["amount"].isna()]

5. Many-to-Many Joins Multiply Rows

The Misconception

Merging two tables with duplicate keys is safe.

The Reality

If both tables have duplicates, the result can explode.

merged = orders.merge(customers, on="customer_id", how="left")
len(merged)  # Might be much larger than expected

The Fix: Deduplicate keys or aggregate before merging.

6. astype(int) Fails with Missing Values

The Misconception

You can cast a column with missing values to int directly.

The Reality

Regular int columns cannot contain NaN.

# BAD
orders["amount"].astype(int)

The Fix: Use Int64 (nullable) or fill missing values first.

orders["amount"].astype("Int64")

7. Grouping Drops Missing Keys

The Misconception

groupby keeps rows where group keys are missing.

The Reality

Rows with NaN in group keys are dropped by default.

orders.groupby("region").size()

The Fix: Use dropna=False or fill missing values before grouping.

8. Time Zones Break Comparisons

The Misconception

You can compare timezone-aware and timezone-naive timestamps.

The Reality

Pandas raises errors when you mix them.

# BAD
orders[orders["order_date"] > pd.Timestamp("2024-01-01", tz="UTC")]

The Fix: Localize or convert time zones consistently before comparisons.

9. Sorting Strings vs Numbers

The Misconception

A numeric column stored as strings sorts numerically.

The Reality

Strings sort lexicographically, so "100" comes before "2".

orders.sort_values("amount")  # Wrong if amount is a string

The Fix: Clean types with pd.to_numeric before sorting.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.