Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Week 4 Glossary

A single place to look up every data-processing term you meet this week. Entries use ### term headers so each definition has a stable anchor that other chapters link to. Browse by chapter or use Cmd/Ctrl + F.

Ch1: Pandas and DataFrames

DataFrame

The central data structure in Pandas: a two-dimensional table of rows and columns. Each column is a Series. A DataFrame is roughly equivalent to a spreadsheet tab or a SQL table, but held in memory and processed with vectorized Python operations.

Series

A one-dimensional labeled array in Pandas. Each column of a DataFrame is a Series. A Series has a name (the column label) and an index (the row labels).

index

The row labels of a DataFrame or Series. The default index is integers starting at 0, but it can be any hashable value (strings, dates, multi-level). The index is used by loc to select rows by label and by join() to align rows when merging on the index. merge() aligns on column values by default; pass left_index=True or right_index=True to use the index instead.

dtype

The data type of a Series column: int64, float64, object (strings), bool, datetime64[ns], category. Pandas infers dtypes on load; mismatched dtypes (e.g. numbers stored as object because of a stray letter) are the most common source of cleaning errors.

vectorized operation

An operation that applies to every element of a column simultaneously, without a Python for loop. Pandas runs vectorized operations in compiled C/Cython code, making them 10-100x faster than loops. The .str, .dt, and arithmetic operators (+, *, >) are all vectorized.


Ch2: Selecting and Filtering Data

boolean indexing

Filtering rows by passing a Series of True/False values inside df[...]. The mask is created by a comparison: df["amount"] > 100 produces a boolean Series, and df[df["amount"] > 100] returns only the rows where it is True.

loc

Label-based row and column selection: df.loc[row_label, column_label]. Use loc with the index label or a boolean mask. df.loc[:, "city"] selects all rows, column "city".

iloc

Integer-position-based selection: df.iloc[row_number, column_number]. Use iloc when you need the Nth row regardless of its index label. df.iloc[0] is always the first row.

chaining

Calling multiple Pandas methods in sequence on the same DataFrame: df.dropna().reset_index().rename(...). Chaining avoids intermediate variable names and keeps the pipeline readable, but every step returns a new object (Pandas is mostly immutable).

SettingWithCopyWarning

A Pandas warning raised when you try to set a value on a slice of a DataFrame. The root cause is that slices are sometimes copies and sometimes views of the original, and Pandas cannot guarantee which. Fix: use .copy() on slices you intend to modify, or use .loc to assign directly to the original.


Ch3: Grouping and Aggregation

groupby

The Pandas operation that implements split-apply-combine: split rows into groups by one or more columns, apply a function to each group, and combine the results into a new DataFrame. df.groupby("region")["revenue"].sum() gives total revenue per region.

split-apply-combine

The three-step mental model behind every groupby operation. (1) Split: divide the DataFrame into groups by a key column. (2) Apply: compute a function on each group independently. (3) Combine: stack the per-group results into one output DataFrame.

aggregation

A function that reduces a group of values to a single value: sum, mean, count, min, max, std. Passed to .agg() after groupby.

named aggregation

A syntax for giving the output column an explicit name: agg(total_revenue=("amount", "sum")). The output column is named total_revenue instead of amount. Preferred over renaming after the fact because the intent is clear at the call site.

transform

A groupby method that applies a function per group but returns a result with the same shape as the input: one value per original row. Used for group-level normalization or adding a group-level total back as a new column (e.g. "each row's share of its group's total").


Ch4: Joining and Merging DataFrames

merge key

The column (or columns) used to match rows between two DataFrames in a join. Both DataFrames must have a column with the same name and comparable values. The key must be cleaned (same case, no leading/trailing spaces) before joining or mismatches silently drop rows.

inner join

A join that returns only rows where the merge key exists in both DataFrames. Rows in either DataFrame that have no match in the other are dropped silently. The most common join type; produces the intersection.

left join

A join that returns all rows from the left DataFrame, plus matching rows from the right. Rows in the left DataFrame with no match in the right get NaN for the right DataFrame's columns. Use when you want to keep all left rows regardless of whether the right side has a match.

indicator

A parameter on pd.merge() (indicator=True) that adds a _merge column to the result, showing whether each row came from "left_only", "right_only", or "both". Used to audit joins: df[df["_merge"] == "left_only"] shows orphan rows dropped by an inner join.

orphan row

A row in the left DataFrame whose merge key value has no match in the right DataFrame. An inner join silently drops orphan rows. Discovering them with indicator=True before committing to join type is a best practice.


Ch5: Working with Strings and Dates

str accessor

The .str attribute on a Pandas string (object-dtype) Series. It exposes vectorized string methods: .str.lower(), .str.strip(), .str.replace(), .str.contains(), .str.split(). These operate on every element of the column without a Python loop.

dt accessor

The .dt attribute on a Pandas datetime (datetime64[ns]) Series. It exposes vectorized date/time components: .dt.year, .dt.month, .dt.day, .dt.isocalendar().week, .dt.strftime(fmt).

datetime parsing

Converting a string column (e.g. "2024-03-15") to a datetime64[ns] dtype using pd.to_datetime(). The errors="coerce" argument converts unparseable strings to NaT instead of raising an error, which is the safe default for messy data.

NaT

Not a Time: the Pandas equivalent of NaN for datetime columns. Produced when pd.to_datetime(errors="coerce") encounters a value it cannot parse. Treated as missing: isna() returns True for NaT values.

ISO 8601

The international date format standard: YYYY-MM-DD for dates, YYYY-MM-DDTHH:MM:SS for datetimes. When you control the format, write dates in ISO 8601: it sorts lexicographically as well as chronologically and is unambiguous across locales.


Ch6: Advanced Transformations

pivot

Reshaping a "long" DataFrame (one row per observation) into a "wide" table (one row per unique value of one column, new columns for each value of another). df.pivot(index="region", columns="month", values="revenue") produces a matrix with regions as rows and months as columns.

melt

The inverse of pivot: converting a "wide" table (many columns) into a "long" table (one row per original value). pd.melt(df, id_vars=["region"], value_vars=["jan", "feb", "mar"]) produces one row per region-month pair.

window function

An operation that computes a value for each row using a sliding "window" of surrounding rows. df["revenue"].rolling(7).mean() computes a 7-day rolling average. df.groupby("region")["revenue"].expanding().sum() computes a cumulative sum per group. Window functions are vectorized and do not reduce the number of rows.

rank

A window-style operation that assigns each row a position within its group. df.groupby("region")["revenue"].rank(method="dense") assigns 1 to the highest-revenue row in each region. Used for "top N per group" queries.


Ch7: Writing Data

Parquet

A binary columnar file format. Parquet stores each column's values together on disk (not row by row), which makes column-selection queries fast and compression ratios high. Standard format for data warehouses and lakehouses. Write with df.to_parquet("file.parquet").