Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Joining and Merging DataFrames

Real datasets almost never live in a single table. You will combine customers, orders, products, and payments all the time. If you join incorrectly, your numbers are wrong.

By the end of this chapter, you should be able to join two DataFrames on one or more keys, choose the right join type, detect unmatched rows with indicator=True, and stack DataFrames vertically with concat.

<aside> ๐Ÿ“ฆ Run the examples: companion_ch4_joining_merging.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [100, 101, 999, 102],
        "product_id": ["A", "B", "A", "C"],
        "price_date": ["2024-01", "2024-01", "2024-02", "2024-02"],
        "amount": [120, 90, 200, 75],
    }
)

customers = pd.DataFrame(
    {
        "customer_id": [100, 101, 102],
        "name": ["Alice", "Bob", "Chloe"],
    }
)

prices = pd.DataFrame(
    {
        "product_id": ["A", "B", "C", "A"],
        "price_date": ["2024-01", "2024-01", "2024-02", "2024-02"],
        "unit_price": [10.0, 15.0, 20.0, 11.0],
    }
)

Join Types

A join combines rows from two tables by matching values in a shared column (the key). The join type controls what happens to rows that have no match.

merged = orders.merge(customers, on="customer_id", how="left")

<aside> ๐Ÿ“˜ Core program connection: If you wrote SQL JOIN in the Core program, the semantics are identical: how="inner" is INNER JOIN, how="left" is LEFT JOIN, and so on. The difference is syntax: .merge() returns a DataFrame instead of a query result, and you chain it in Python instead of writing SQL. The concepts transfer directly.

</aside>

The four join types behave the same in SQL and Pandas: once you know one, you know both.

<aside> ๐Ÿค“ Curious Geek: Join diagrams

Many tutorials use Venn diagrams for joins. They are a helpful metaphor, but real joins are based on keys, not sets.

</aside>

The choice of join type determines which rows survive, so always verify with indicator=True on your first run.

Joining on Multiple Keys

merged = orders.merge(
    prices,
    on=["product_id", "price_date"],
    how="left",
)

Tracking Matches with indicator

merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
merged["_merge"].value_counts()

_merge helps you spot rows that failed to match.

Index-Based Joins

If your key is the index, use join.

customers = customers.set_index("customer_id")
orders = orders.set_index("customer_id")
merged = orders.join(customers, how="left")

Concatenation

concat stacks DataFrames vertically or horizontally.

y2024 = pd.DataFrame({"order_id": [1, 2], "amount": [120, 90]})
y2025 = pd.DataFrame({"order_id": [3, 4], "amount": [200, 75]})
all_orders = pd.concat([y2024, y2025], ignore_index=True)  # resets index to 0,1,2,3

<aside> โš ๏ธ Many-to-many joins can multiply rows. If both tables have duplicate keys, the result size can explode.

</aside>

Use assert len(merged) == len(orders) after a left join to catch unexpected row multiplication early.

โŒจ๏ธ Hands on: Left Join with Indicator

Use these sample tables. Join orders with customers on customer_id. Use indicator=True and count how many rows are left_only.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [100, 101, 999, 102],
        "amount": [120, 90, 200, 75],
    }
)

customers = pd.DataFrame(
    {
        "customer_id": [100, 101, 102],
        "name": ["Alice", "Bob", "Chloe"],
    }
)

<aside> ๐Ÿš€ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=joining_merging&exercise=w4_joining_merging__left_join_customers&lang=python

</aside>

Exercises

Use the hands-on orders, customers, and prices DataFrames from the Hands on section for all exercises.

  1. Join customers to orders using a left join on customer_id. Use indicator=True and count how many orders have no matching customer.
  2. Add suffixes=("_order", "_customer") to the merge and check the resulting column names.
  3. Create a second customers2 DataFrame with one new customer and use pd.concat to combine both customer tables into one, resetting the index.
  4. Join orders with prices on both product_id and price_date to add unit_price to each order.

Extra reading

Knowledge Check

Test your recall before moving on.

<aside> ๐Ÿš€ Try it in the widget: Interactive Quiz: Joining and Merging DataFrames

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch4_joining_merging_quiz&embed=1

If the join types or indicator column felt unclear, this video walks through merge and join from scratch.

<aside> ๐ŸŽฌ Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=iYWKfUOtGaw

You can also describe the join you need to an LLM to get a starting point.

<aside> ๐Ÿ’ก Using AI to help: Describe the two tables and the relationship between them (โš ๏ธ Ensure no PII or sensitive company data is included!): for example "orders table with customerid, customers table with customerid and name, left join to add name to each order", and ask an LLM to write the merge call. Always verify the row count of the result matches your expectations.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> โŒจ๏ธ Hands on: Practice with Exercise 4: Joining Two Tables.

</aside>


Next up: Working with Strings and Dates, where you clean messy text columns and parse dates using vectorized string and datetime operations.


The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.