Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
A quick-reference for the patterns you use most in Week 4. Each section maps to a chapter. Copy the snippets into your assignment pipeline as a starting point.
df.shape # (rows, columns)
df.dtypes # column types
df.info() # types + non-null counts in one call
df.describe() # numeric stats (min, max, mean, percentiles)
df.head(10) # first 10 rows
df.isna().sum() # missing values per column
# Select columns
df["col"] # Series
df[["col_a", "col_b"]] # DataFrame
# Filter rows: always use & and | with parentheses, never `and` / `or`
df[df["region"] == "NL"]
df[(df["region"] == "NL") & (df["amount"] > 100)]
df[df["amount"].isna()]
# loc (label-based, end inclusive) vs iloc (position-based, end exclusive)
df.loc[0:5, "region":"amount"] # rows 0-5, columns region through amount
df.iloc[0:5, 1:3] # rows 0-4, columns at positions 1-2
# Sort
df.sort_values("amount", ascending=False)
df.sort_values(["region", "amount"])
# Named aggregations: preferred form for reports
df.groupby("region").agg(
total_revenue=("amount", "sum"),
order_count=("order_id", "size"), # size() counts rows; count() skips NaN
avg_order=("amount", "mean"),
)
# Multi-key groupby with as_index=False (returns a DataFrame, not a Series)
df.groupby(["region", "month"], as_index=False).agg(
total=("amount", "sum"),
)
# Quick single-column aggregate
df.groupby("region")["amount"].sum()
df.groupby("region").size() # row count per group
# Inner join: keeps only matching rows
merged = orders.merge(customers, on="customer_email", how="inner")
# Left join: keeps all orders; unmatched customer fields become NaN
merged = orders.merge(customers, on="customer_email", how="left")
# Different column names in each table
merged = orders.merge(customers,
left_on="cust_email",
right_on="customer_email")
# Check for row explosion before and after
print(len(orders), "orders")
print(len(merged), "after merge") # should not be much larger
# Deduplicate a key column before merging
customers = customers.drop_duplicates(subset="customer_email", keep="first")
# String normalization
df["email"] = df["email"].str.strip().str.lower()
df["name"] = df["name"].str.title()
df["has_gmail"] = df["email"].str.contains("@gmail.com", na=False)
df["domain"] = df["email"].str.extract(r"@([\w.]+)")
# Parse dates: errors="coerce" turns bad values into NaT (not a crash)
df["date"] = pd.to_datetime(df["date"], errors="coerce", format="%Y-%m-%d")
# Extract date components
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()
df["week"] = df["date"].dt.isocalendar().week
# Drop rows with invalid dates
df = df[df["date"].notna()]
# Add a column derived from group statistics (transform keeps original row count)
df["regional_avg"] = df.groupby("region")["amount"].transform("mean")
# Pivot table: rows=region, columns=month, values=sum of amount
pivot = df.pivot_table(index="region", columns="month",
values="amount", aggfunc="sum")
# Melt: wide to long (inverse of pivot)
long = df.melt(id_vars=["order_id", "region"],
value_vars=["q1_sales", "q2_sales"],
var_name="quarter", value_name="sales")
# Rolling window
df = df.sort_values("date")
df["7d_rolling_sum"] = df["amount"].rolling(window=7).sum()
from pathlib import Path
import matplotlib
matplotlib.use("Agg") # headless: no display needed
import matplotlib.pyplot as plt
output = Path("output")
output.mkdir(exist_ok=True)
# CSV: always index=False to avoid the "Unnamed: 0" column
df.to_csv(output / "report.csv", index=False)
# Parquet: requires pyarrow; much smaller than CSV for large data
df.to_parquet(output / "report.parquet", index=False)
# Read back
df2 = pd.read_csv(output / "report.csv")
df3 = pd.read_parquet(output / "report.parquet")
# Chart: sort before plotting so bars are meaningful
(df.groupby("category")["amount"].sum()
.sort_values(ascending=False)
.plot(kind="bar", title="Revenue by category"))
plt.tight_layout()
plt.savefig(output / "revenue.png", bbox_inches="tight")
plt.close()
| Trap | Bad | Good |
|---|---|---|
| Boolean operators | df[a == 1 and b == 2] |
df[(a == 1) & (b == 2)] |
| Chained assignment | df[mask]["col"] = x |
df.loc[mask, "col"] = x |
| Unnamed: 0 | df.to_csv("f.csv") |
df.to_csv("f.csv", index=False) |
| NaN comparison | df[df["x"] == None] |
df[df["x"].isna()] |
| Join explosion | merge without dedup | df.drop_duplicates(subset=key) before merge |
See Gotchas & Pitfalls for the full list with code examples.
| Operation | Pandas | Polars |
|---|---|---|
| Filter | df[df["col"] > 1] |
df.filter(pl.col("col") > 1) |
| Groupby sum | df.groupby("r")["a"].sum() |
df.group_by("r").agg(pl.col("a").sum()) |
| Add column | df["x"] = df["a"] * 2 |
df.with_columns((pl.col("a") * 2).alias("x")) |
| Read CSV | pd.read_csv("f.csv") |
pl.read_csv("f.csv") |
| To pandas | (n/a) | df.to_pandas() |
<aside> 📚 For the authoritative reference, see the official Pandas cheatsheet (PDF) (BSD 3-Clause).
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.