Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Pandas Cheatsheet

A quick-reference for the patterns you use most in Week 4. Each section maps to a chapter. Copy the snippets into your assignment pipeline as a starting point.

Ch1: Inspect a DataFrame

df.shape          # (rows, columns)
df.dtypes         # column types
df.info()         # types + non-null counts in one call
df.describe()     # numeric stats (min, max, mean, percentiles)
df.head(10)       # first 10 rows
df.isna().sum()   # missing values per column

Ch2: Select, Filter, and Sort

# Select columns
df["col"]                    # Series
df[["col_a", "col_b"]]      # DataFrame

# Filter rows: always use & and | with parentheses, never `and` / `or`
df[df["region"] == "NL"]
df[(df["region"] == "NL") & (df["amount"] > 100)]
df[df["amount"].isna()]

# loc (label-based, end inclusive) vs iloc (position-based, end exclusive)
df.loc[0:5, "region":"amount"]   # rows 0-5, columns region through amount
df.iloc[0:5, 1:3]                # rows 0-4, columns at positions 1-2

# Sort
df.sort_values("amount", ascending=False)
df.sort_values(["region", "amount"])

Ch3: Groupby and Aggregation

# Named aggregations: preferred form for reports
df.groupby("region").agg(
    total_revenue=("amount", "sum"),
    order_count=("order_id", "size"),   # size() counts rows; count() skips NaN
    avg_order=("amount", "mean"),
)

# Multi-key groupby with as_index=False (returns a DataFrame, not a Series)
df.groupby(["region", "month"], as_index=False).agg(
    total=("amount", "sum"),
)

# Quick single-column aggregate
df.groupby("region")["amount"].sum()
df.groupby("region").size()          # row count per group

Ch4: Join and Merge

# Inner join: keeps only matching rows
merged = orders.merge(customers, on="customer_email", how="inner")

# Left join: keeps all orders; unmatched customer fields become NaN
merged = orders.merge(customers, on="customer_email", how="left")

# Different column names in each table
merged = orders.merge(customers,
                      left_on="cust_email",
                      right_on="customer_email")

# Check for row explosion before and after
print(len(orders), "orders")
print(len(merged), "after merge")   # should not be much larger

# Deduplicate a key column before merging
customers = customers.drop_duplicates(subset="customer_email", keep="first")

Ch5: Strings and Dates

# String normalization
df["email"] = df["email"].str.strip().str.lower()
df["name"]  = df["name"].str.title()
df["has_gmail"] = df["email"].str.contains("@gmail.com", na=False)
df["domain"] = df["email"].str.extract(r"@([\w.]+)")

# Parse dates: errors="coerce" turns bad values into NaT (not a crash)
df["date"] = pd.to_datetime(df["date"], errors="coerce", format="%Y-%m-%d")

# Extract date components
df["year"]    = df["date"].dt.year
df["month"]   = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()
df["week"]    = df["date"].dt.isocalendar().week

# Drop rows with invalid dates
df = df[df["date"].notna()]

Ch6: Advanced Transformations

# Add a column derived from group statistics (transform keeps original row count)
df["regional_avg"] = df.groupby("region")["amount"].transform("mean")

# Pivot table: rows=region, columns=month, values=sum of amount
pivot = df.pivot_table(index="region", columns="month",
                       values="amount", aggfunc="sum")

# Melt: wide to long (inverse of pivot)
long = df.melt(id_vars=["order_id", "region"],
               value_vars=["q1_sales", "q2_sales"],
               var_name="quarter", value_name="sales")

# Rolling window
df = df.sort_values("date")
df["7d_rolling_sum"] = df["amount"].rolling(window=7).sum()

Ch7: Write Results

from pathlib import Path
import matplotlib
matplotlib.use("Agg")         # headless: no display needed
import matplotlib.pyplot as plt

output = Path("output")
output.mkdir(exist_ok=True)

# CSV: always index=False to avoid the "Unnamed: 0" column
df.to_csv(output / "report.csv", index=False)

# Parquet: requires pyarrow; much smaller than CSV for large data
df.to_parquet(output / "report.parquet", index=False)

# Read back
df2 = pd.read_csv(output / "report.csv")
df3 = pd.read_parquet(output / "report.parquet")

# Chart: sort before plotting so bars are meaningful
(df.groupby("category")["amount"].sum()
   .sort_values(ascending=False)
   .plot(kind="bar", title="Revenue by category"))
plt.tight_layout()
plt.savefig(output / "revenue.png", bbox_inches="tight")
plt.close()

Top 5 Gotchas

Trap Bad Good
Boolean operators df[a == 1 and b == 2] df[(a == 1) & (b == 2)]
Chained assignment df[mask]["col"] = x df.loc[mask, "col"] = x
Unnamed: 0 df.to_csv("f.csv") df.to_csv("f.csv", index=False)
NaN comparison df[df["x"] == None] df[df["x"].isna()]
Join explosion merge without dedup df.drop_duplicates(subset=key) before merge

See Gotchas & Pitfalls for the full list with code examples.

Pandas vs Polars (Ch9 preview)

Operation Pandas Polars
Filter df[df["col"] > 1] df.filter(pl.col("col") > 1)
Groupby sum df.groupby("r")["a"].sum() df.group_by("r").agg(pl.col("a").sum())
Add column df["x"] = df["a"] * 2 df.with_columns((pl.col("a") * 2).alias("x"))
Read CSV pd.read_csv("f.csv") pl.read_csv("f.csv")
To pandas (n/a) df.to_pandas()

<aside> 📚 For the authoritative reference, see the official Pandas cheatsheet (PDF) (BSD 3-Clause).

</aside>


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.