Pandas Cheatsheet

A quick-reference for the patterns you use most in Week 4. Each section maps to a chapter. Copy the snippets into your assignment pipeline as a starting point.

Ch1: Inspect a DataFrame

df.shape          # (rows, columns)
df.dtypes         # column types
df.info()         # types + non-null counts in one call
df.describe()     # numeric stats (min, max, mean, percentiles)
df.head(10)       # first 10 rows
df.isna().sum()   # missing values per column

Ch2: Select, Filter, and Sort

# Select columns
df["col"]                    # Series
df[["col_a", "col_b"]]      # DataFrame

# Filter rows: always use & and | with parentheses, never `and` / `or`
df[df["region"] == "NL"]
df[(df["region"] == "NL") & (df["amount"] > 100)]
df[df["amount"].isna()]

# loc (label-based, end inclusive) vs iloc (position-based, end exclusive)
df.loc[0:5, "region":"amount"]   # rows 0-5, columns region through amount
df.iloc[0:5, 1:3]                # rows 0-4, columns at positions 1-2

# Sort
df.sort_values("amount", ascending=False)
df.sort_values(["region", "amount"])

Ch3: Groupby and Aggregation

# Named aggregations: preferred form for reports
df.groupby("region").agg(
    total_revenue=("amount", "sum"),
    order_count=("order_id", "size"),   # size() counts rows; count() skips NaN
    avg_order=("amount", "mean"),
)

# Multi-key groupby with as_index=False (returns a DataFrame, not a Series)
df.groupby(["region", "month"], as_index=False).agg(
    total=("amount", "sum"),
)

# Quick single-column aggregate
df.groupby("region")["amount"].sum()
df.groupby("region").size()          # row count per group

Ch4: Join and Merge

# Inner join: keeps only matching rows
merged = orders.merge(customers, on="customer_email", how="inner")

# Left join: keeps all orders; unmatched customer fields become NaN
merged = orders.merge(customers, on="customer_email", how="left")

# Different column names in each table
merged = orders.merge(customers,
                      left_on="cust_email",
                      right_on="customer_email")

# Check for row explosion before and after
print(len(orders), "orders")
print(len(merged), "after merge")   # should not be much larger

# Deduplicate a key column before merging
customers = customers.drop_duplicates(subset="customer_email", keep="first")

Ch5: Strings and Dates

# String normalization
df["email"] = df["email"].str.strip().str.lower()
df["name"]  = df["name"].str.title()
df["has_gmail"] = df["email"].str.contains("@gmail.com", na=False)
df["domain"] = df["email"].str.extract(r"@([\w.]+)")

# Parse dates: errors="coerce" turns bad values into NaT (not a crash)
df["date"] = pd.to_datetime(df["date"], errors="coerce", format="%Y-%m-%d")

# Extract date components
df["year"]    = df["date"].dt.year
df["month"]   = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()
df["week"]    = df["date"].dt.isocalendar().week

# Drop rows with invalid dates
df = df[df["date"].notna()]

Ch6: Advanced Transformations

# Add a column derived from group statistics (transform keeps original row count)
df["regional_avg"] = df.groupby("region")["amount"].transform("mean")

# Pivot table: rows=region, columns=month, values=sum of amount
pivot = df.pivot_table(index="region", columns="month",
                       values="amount", aggfunc="sum")

# Melt: wide to long (inverse of pivot)
long = df.melt(id_vars=["order_id", "region"],
               value_vars=["q1_sales", "q2_sales"],
               var_name="quarter", value_name="sales")

# Rolling window
df = df.sort_values("date")
df["7d_rolling_sum"] = df["amount"].rolling(window=7).sum()

Ch7: Write Results

from pathlib import Path
import matplotlib
matplotlib.use("Agg")         # headless: no display needed
import matplotlib.pyplot as plt

output = Path("output")
output.mkdir(exist_ok=True)

# CSV: always index=False to avoid the "Unnamed: 0" column
df.to_csv(output / "report.csv", index=False)

# Parquet: requires pyarrow; much smaller than CSV for large data
df.to_parquet(output / "report.parquet", index=False)

# Read back
df2 = pd.read_csv(output / "report.csv")
df3 = pd.read_parquet(output / "report.parquet")

# Chart: sort before plotting so bars are meaningful
(df.groupby("category")["amount"].sum()
   .sort_values(ascending=False)
   .plot(kind="bar", title="Revenue by category"))
plt.tight_layout()
plt.savefig(output / "revenue.png", bbox_inches="tight")
plt.close()

Top 5 Gotchas

Trap	Bad	Good
Boolean operators	`df[a == 1 and b == 2]`	`df[(a == 1) & (b == 2)]`
Chained assignment	`df[mask]["col"] = x`	`df.loc[mask, "col"] = x`
Unnamed: 0	`df.to_csv("f.csv")`	`df.to_csv("f.csv", index=False)`
NaN comparison	`df[df["x"] == None]`	`df[df["x"].isna()]`
Join explosion	merge without dedup	`df.drop_duplicates(subset=key)` before merge

See Gotchas & Pitfalls for the full list with code examples.

Pandas vs Polars (Ch9 preview)

Operation	Pandas	Polars
Filter	`df[df["col"] > 1]`	`df.filter(pl.col("col") > 1)`
Groupby sum	`df.groupby("r")["a"].sum()`	`df.group_by("r").agg(pl.col("a").sum())`
Add column	`df["x"] = df["a"] * 2`	`df.with_columns((pl.col("a") * 2).alias("x"))`
Read CSV	`pd.read_csv("f.csv")`	`pl.read_csv("f.csv")`
To pandas	(n/a)	`df.to_pandas()`

<aside> 📚 For the authoritative reference, see the official Pandas cheatsheet (PDF) (BSD 3-Clause).

</aside>

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.