Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

🔗 Joining and Merging DataFrames

Real datasets almost never live in a single table. You will combine customers, orders, products, and payments all the time. If you join incorrectly, your numbers are wrong.

Concepts

Join Types

merged = orders.merge(customers, on="customer_id", how="left")

<aside> 🤓 Curious Geek: Join diagrams

Many tutorials use Venn diagrams for joins. They are a helpful metaphor, but real joins are based on keys, not sets.

</aside>

Joining on Multiple Keys

merged = orders.merge(
    prices,
    on=["product_id", "price_date"],
    how="left",
)

Tracking Matches with indicator

merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
merged["_merge"].value_counts()

_merge helps you spot rows that failed to match.

Index-Based Joins

If your key is the index, use join.

customers = customers.set_index("customer_id")
orders = orders.set_index("customer_id")
merged = orders.join(customers, how="left")

Concatenation

concat stacks DataFrames vertically or horizontally.

y2024 = pd.read_csv("data/orders_2024.csv")
y2025 = pd.read_csv("data/orders_2025.csv")
all_orders = pd.concat([y2024, y2025], ignore_index=True)

<aside> ⚠️ Many-to-many joins can multiply rows. If both tables have duplicate keys, the result size can explode.

</aside>

<aside> ⌨️ Hands on: Use these sample tables. Join orders with customers on customer_id. Use indicator=True and count how many rows are left_only.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [100, 101, 999, 102],
        "amount": [120, 90, 200, 75],
    }
)

customers = pd.DataFrame(
    {
        "customer_id": [100, 101, 102],
        "name": ["Alice", "Bob", "Chloe"],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=joining_merging&exercise=w4_joining_merging__left_join_customers&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

  1. Join a customers DataFrame to orders using a left join.
  2. Add suffixes for conflicting column names using suffixes=("_order", "_customer").
  3. Use concat to combine two monthly order files into one DataFrame.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.