Week 4 - Data Processing with Pandas
Pandas makes data work feel simple, but there are traps that can silently break your results. Read these carefully and you will save yourself hours.
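SettingWithCopyWarning
The Assumption: You can filter a DataFrame and then assign a new column directly to the result.
The Reality: Pandas cannot always tell whether the filtered result is a view of the original or a copy. It raises SettingWithCopyWarning, and the assignment may silently fail to reach the data you meant to change.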
# BAD
nl_orders = orders[orders["region"] == "NL"]
nl_orders["amount_eur"] = nl_orders["amount"] * 0.92
The Fix: Use .loc on the original DataFrame or call .copy().
# GOOD
nl_orders = orders.loc[orders["region"] == "NL"].copy()
nl_orders["amount_eur"] = nl_orders["amount"] * 0.92
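The two fixes serve different goals, and a small sketch may make the difference clearer. The sample data and the amount_doubled column are made up for illustration. Option 1 writes into the original DataFrame; option 2 works on an independent copy.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the examples above.
orders = pd.DataFrame(
    {"region": ["NL", "DE", "NL"], "amount": [100.0, 50.0, 200.0]}
)

# Option 1: write into the original DataFrame with a single .loc call.
is_nl = orders["region"] == "NL"
orders.loc[is_nl, "amount_eur"] = orders.loc[is_nl, "amount"] * 0.92

# Option 2: take an explicit copy and modify it independently.
nl_orders = orders.loc[is_nl].copy()
nl_orders["amount_doubled"] = nl_orders["amount"] * 2

print(orders)     # non-NL rows hold NaN in amount_eur
print(nl_orders)  # amount_doubled exists only on the copy
```

Use option 1 when the new values belong in the original table, and option 2 when the filtered subset should live on its own.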
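and vs & in Filters
The Assumption: Python's and works with Pandas Series.
The Reality: and raises a ValueError because the truth value of a whole Series is ambiguous. You must use & and wrap each condition in parentheses, because & binds more tightly than comparison operators such as == and >.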
# BAD
orders[orders["region"] == "NL" and orders["amount"] > 100]
# GOOD
orders[(orders["region"] == "NL") & (orders["amount"] > 100)]
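The same rule applies to or and not, which become | and ~. The sketch below uses made-up sample data to show the error and the element-wise operators side by side. Naming each condition first is one way to sidestep the parentheses problem.

```python
import pandas as pd

# Hypothetical sample data for illustration.
orders = pd.DataFrame(
    {"region": ["NL", "DE", "NL"], "amount": [150, 300, 80]}
)

# `and` asks Python for a single True/False from a whole Series,
# which Pandas refuses to give.
try:
    orders[orders["region"] == "NL" and orders["amount"] > 100]
except ValueError as err:
    print("ValueError:", err)

# Naming each condition keeps the filter readable.
is_nl = orders["region"] == "NL"
is_large = orders["amount"] > 100

both = orders[is_nl & is_large]    # NL and over 100
either = orders[is_nl | is_large]  # NL or over 100
not_nl = orders[~is_nl]            # everything except NL
```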
Index Alignment in Arithmetic
The Assumption: Pandas adds two Series row by row, like a list.
The Reality: Pandas aligns by index labels, not position.
s1 = pd.Series([10, 20], index=["a", "b"])
s2 = pd.Series([1, 2], index=["b", "a"])
s1 + s2 # Result aligns on labels, not row order
The Fix: Reset or align indices intentionally.
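As a rough illustration of what "intentionally" can mean, the sketch below contrasts label-based addition with positional addition. It also shows what happens when a label exists on only one side. The third Series, s3, is made up for the example.

```python
import pandas as pd

s1 = pd.Series([10, 20], index=["a", "b"])
s2 = pd.Series([1, 2], index=["b", "a"])

# Default behaviour: values are matched by label.
by_label = s1 + s2  # a: 10 + 2, b: 20 + 1

# Positional addition: drop the labels on both sides first.
by_position = s1.reset_index(drop=True) + s2.reset_index(drop=True)

# A label present on only one side produces NaN unless you supply a fill value.
s3 = pd.Series([5], index=["a"])  # hypothetical Series with no "b" label
with_gap = s1 + s3                 # b becomes NaN
filled = s1.add(s3, fill_value=0)  # b keeps its value from s1
```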
NaN Does Not Equal NaN
The Assumption: You can compare missing values with ==.
The Reality: NaN != NaN, so an == comparison never matches a missing value. Use isna() or notna().
orders[orders["amount"].isna()]
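A short sketch with made-up data shows why the == filter quietly returns nothing, while isna() and notna() split the rows as intended.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing amount.
orders = pd.DataFrame({"amount": [100.0, np.nan, 250.0]})

print(np.nan == np.nan)  # False

# An == comparison against NaN is False on every row, so this selects nothing.
wrong = orders[orders["amount"] == np.nan]

missing = orders[orders["amount"].isna()]   # rows without an amount
present = orders[orders["amount"].notna()]  # rows with an amount
```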
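Merges with Duplicate Keys
The Assumption: Merging two tables with duplicate keys is safe.
The Reality: If both tables have duplicates for the same key, every matching pair of rows appears in the result, so the row count can explode.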
merged = orders.merge(customers, on="customer_id", how="left")
len(merged) # Might be much larger than expected
The Fix: Deduplicate keys or aggregate before merging.
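One way to catch this early is the validate argument of merge, which raises an error when the keys are not as unique as you expect. The sketch below uses made-up tables. Keeping the first duplicate is an arbitrary choice for illustration; real data may need an aggregation instead.

```python
import pandas as pd

# Hypothetical tables: customer 10 appears twice on both sides.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 10, 20]})
customers = pd.DataFrame(
    {"customer_id": [10, 10, 20], "name": ["Ann", "Ann B.", "Bob"]}
)

# Count duplicated keys on the lookup side before merging.
dup_count = customers["customer_id"].duplicated().sum()

# Without a check, each order for customer 10 matches two customer rows.
merged = orders.merge(customers, on="customer_id", how="left")

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
except pd.errors.MergeError as err:
    print("MergeError:", err)

# Deduplicate the lookup table, then merge with the check in place.
unique_customers = customers.drop_duplicates(subset="customer_id", keep="first")
safe = orders.merge(
    unique_customers, on="customer_id", how="left", validate="many_to_one"
)
```

Here the unchecked merge turns 3 orders into 5 rows, while the deduplicated merge keeps one row per order.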
astype(int) Fails with Missing Values
The Assumption: You can cast a column with missing values to int directly.
The Reality: Regular int columns cannot contain NaN, so the cast raises an error.
# BAD
orders["amount"].astype(int)
The Fix: Use Int64 (nullable) or fill missing values first.
# GOOD
orders["amount"].astype("Int64")
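The sketch below, using a made-up Series, shows the failure and both fixes. Filling with 0 is only an illustration; whether 0 is a sensible stand-in depends on what a missing amount means in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
amount = pd.Series([10.0, np.nan, 30.0])

# The plain cast fails because NaN has no integer representation.
try:
    amount.astype(int)
except ValueError as err:
    print("ValueError:", err)

# Fix 1: the nullable integer dtype keeps the gap as <NA>.
nullable = amount.astype("Int64")

# Fix 2: replace missing values first, then cast to a regular int.
filled = amount.fillna(0).astype(int)
```

Note the capital I in "Int64": it names the nullable Pandas dtype, not the regular NumPy int64. Values with a fractional part may need rounding before either cast.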
groupby Drops Missing Keys
The Assumption: groupby keeps rows where group keys are missing.
The Reality: Rows with NaN in group keys are dropped by default.
orders.groupby("region").size()
The Fix: Use dropna=False or fill missing values before grouping.
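A minimal sketch with made-up data shows how a row disappears from the default counts and how both fixes bring it back. The "UNKNOWN" label is a placeholder chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one order has no region.
orders = pd.DataFrame(
    {"region": ["NL", "DE", np.nan, "NL"], "amount": [10, 20, 30, 40]}
)

# Default: the row with a missing region is left out of every group.
default_counts = orders.groupby("region").size()

# Fix 1: keep missing keys as their own group.
all_counts = orders.groupby("region", dropna=False).size()

# Fix 2: give missing keys an explicit label before grouping.
labelled = orders.fillna({"region": "UNKNOWN"}).groupby("region").size()
```

Comparing the sum of the group sizes with len(orders) is a quick check for rows lost this way.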
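Mixing Timezone-Aware and Timezone-Naive Timestamps
The Assumption: You can compare timezone-aware and timezone-naive timestamps.
The Reality: Pandas raises a TypeError when you mix them in a comparison such as >.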
# BAD
orders[orders["order_date"] > pd.Timestamp("2024-01-01", tz="UTC")]
The Fix: Localize or convert time zones consistently before comparisons.
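The sketch below assumes the stored order dates are naive but were recorded in UTC. That assumption is made up for the example and must be checked against the real data source. Localizing attaches a time zone without shifting the clock time, while converting moves an aware timestamp to another zone.

```python
import pandas as pd

# Hypothetical naive timestamps, assumed here to be recorded in UTC.
orders = pd.DataFrame(
    {"order_date": pd.to_datetime(["2023-12-31 23:00", "2024-01-01 12:00"])}
)
cutoff = pd.Timestamp("2024-01-01", tz="UTC")

# Comparing a naive column with an aware timestamp fails.
try:
    orders[orders["order_date"] > cutoff]
except TypeError as err:
    print("TypeError:", err)

# Localize: declare which zone the naive values are in.
orders["order_date_utc"] = orders["order_date"].dt.tz_localize("UTC")
recent = orders[orders["order_date_utc"] > cutoff]

# Convert: express the same instants in another zone for display.
orders["order_date_ams"] = orders["order_date_utc"].dt.tz_convert(
    "Europe/Amsterdam"
)
```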
Numbers Stored as Strings
The Assumption: A numeric column stored as strings sorts numerically.
The Reality: Strings sort lexicographically, so "100" comes before "2".
orders.sort_values("amount") # Wrong if amount is a string
The Fix: Clean types with pd.to_numeric before sorting.
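A small sketch with made-up data shows the difference. errors="coerce" turns values that cannot be parsed into NaN instead of raising, which is convenient but can hide bad input, so it is worth counting the resulting NaN values.

```python
import pandas as pd

# Hypothetical column read in as text, including one unparseable value.
orders = pd.DataFrame({"amount": ["100", "2", "35", "n/a"]})

# Sorting the text column compares character by character.
as_text = orders.sort_values("amount")["amount"].tolist()
# ['100', '2', '35', 'n/a']

# Convert to numbers; unparseable values become NaN.
orders["amount_num"] = pd.to_numeric(orders["amount"], errors="coerce")

# Numeric sort; NaN values are placed last by default.
as_numbers = orders.sort_values("amount_num")["amount_num"].tolist()
```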
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0
