Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
A single place to look up every data-processing term you meet this week. Entries use ### term headers so each definition has a stable anchor that other chapters link to. Browse by chapter or use Cmd/Ctrl + F.
The central data structure in Pandas: a two-dimensional table of rows and columns. Each column is a Series. A DataFrame is roughly equivalent to a spreadsheet tab or a SQL table, but held in memory and processed with vectorized Python operations.
A one-dimensional labeled array in Pandas. Each column of a DataFrame is a Series. A Series has a name (the column label) and an index (the row labels).
The row labels of a DataFrame or Series. The default index is integers starting at 0, but it can be any hashable value (strings, dates, multi-level). The index is used by loc to select rows by label and by join() to align rows when merging on the index. merge() aligns on column values by default; pass left_index=True or right_index=True to use the index instead.
The data type of a Series column: int64, float64, object (strings), bool, datetime64[ns], category. Pandas infers dtypes on load; mismatched dtypes (e.g. numbers stored as object because of a stray letter) are the most common source of cleaning errors.
An operation that applies to every element of a column simultaneously, without a Python for loop. Pandas runs vectorized operations in compiled C/Cython code, making them 10-100x faster than loops. The .str, .dt, and arithmetic operators (+, *, >) are all vectorized.
Filtering rows by passing a Series of True/False values inside df[...]. The mask is created by a comparison: df["amount"] > 100 produces a boolean Series, and df[df["amount"] > 100] returns only the rows where it is True.
Label-based row and column selection: df.loc[row_label, column_label]. Use loc with the index label or a boolean mask. df.loc[:, "city"] selects all rows, column "city".
Integer-position-based selection: df.iloc[row_number, column_number]. Use iloc when you need the Nth row regardless of its index label. df.iloc[0] is always the first row.
Calling multiple Pandas methods in sequence on the same DataFrame: df.dropna().reset_index().rename(...). Chaining avoids intermediate variable names and keeps the pipeline readable, but every step returns a new object (Pandas is mostly immutable).
A Pandas warning raised when you try to set a value on a slice of a DataFrame. The root cause is that slices are sometimes copies and sometimes views of the original, and Pandas cannot guarantee which. Fix: use .copy() on slices you intend to modify, or use .loc to assign directly to the original.
The Pandas operation that implements split-apply-combine: split rows into groups by one or more columns, apply a function to each group, and combine the results into a new DataFrame. df.groupby("region")["revenue"].sum() gives total revenue per region.
The three-step mental model behind every groupby operation. (1) Split: divide the DataFrame into groups by a key column. (2) Apply: compute a function on each group independently. (3) Combine: stack the per-group results into one output DataFrame.
A function that reduces a group of values to a single value: sum, mean, count, min, max, std. Passed to .agg() after groupby.
A syntax for giving the output column an explicit name: agg(total_revenue=("amount", "sum")). The output column is named total_revenue instead of amount. Preferred over renaming after the fact because the intent is clear at the call site.
A groupby method that applies a function per group but returns a result with the same shape as the input: one value per original row. Used for group-level normalization or adding a group-level total back as a new column (e.g. "each row's share of its group's total").
The column (or columns) used to match rows between two DataFrames in a join. Both DataFrames must have a column with the same name and comparable values. The key must be cleaned (same case, no leading/trailing spaces) before joining or mismatches silently drop rows.
A join that returns only rows where the merge key exists in both DataFrames. Rows in either DataFrame that have no match in the other are dropped silently. The most common join type; produces the intersection.
A join that returns all rows from the left DataFrame, plus matching rows from the right. Rows in the left DataFrame with no match in the right get NaN for the right DataFrame's columns. Use when you want to keep all left rows regardless of whether the right side has a match.
A parameter on pd.merge() (indicator=True) that adds a _merge column to the result, showing whether each row came from "left_only", "right_only", or "both". Used to audit joins: df[df["_merge"] == "left_only"] shows orphan rows dropped by an inner join.
A row in the left DataFrame whose merge key value has no match in the right DataFrame. An inner join silently drops orphan rows. Discovering them with indicator=True before committing to join type is a best practice.
The .str attribute on a Pandas string (object-dtype) Series. It exposes vectorized string methods: .str.lower(), .str.strip(), .str.replace(), .str.contains(), .str.split(). These operate on every element of the column without a Python loop.
The .dt attribute on a Pandas datetime (datetime64[ns]) Series. It exposes vectorized date/time components: .dt.year, .dt.month, .dt.day, .dt.isocalendar().week, .dt.strftime(fmt).
Converting a string column (e.g. "2024-03-15") to a datetime64[ns] dtype using pd.to_datetime(). The errors="coerce" argument converts unparseable strings to NaT instead of raising an error, which is the safe default for messy data.
Not a Time: the Pandas equivalent of NaN for datetime columns. Produced when pd.to_datetime(errors="coerce") encounters a value it cannot parse. Treated as missing: isna() returns True for NaT values.
The international date format standard: YYYY-MM-DD for dates, YYYY-MM-DDTHH:MM:SS for datetimes. When you control the format, write dates in ISO 8601: it sorts lexicographically as well as chronologically and is unambiguous across locales.
Reshaping a "long" DataFrame (one row per observation) into a "wide" table (one row per unique value of one column, new columns for each value of another). df.pivot(index="region", columns="month", values="revenue") produces a matrix with regions as rows and months as columns.
The inverse of pivot: converting a "wide" table (many columns) into a "long" table (one row per original value). pd.melt(df, id_vars=["region"], value_vars=["jan", "feb", "mar"]) produces one row per region-month pair.
An operation that computes a value for each row using a sliding "window" of surrounding rows. df["revenue"].rolling(7).mean() computes a 7-day rolling average. df.groupby("region")["revenue"].expanding().sum() computes a cumulative sum per group. Window functions are vectorized and do not reduce the number of rows.
A window-style operation that assigns each row a position within its group. df.groupby("region")["revenue"].rank(method="dense") assigns 1 to the highest-revenue row in each region. Used for "top N per group" queries.
A binary columnar file format. Parquet stores each column's values together on disk (not row by row), which makes column-selection queries fast and compression ratios high. Standard format for data warehouses and lakehouses. Write with df.to_parquet("file.parquet").