Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Real datasets almost never live in a single table. You will combine customers, orders, products, and payments all the time. If you join incorrectly, your numbers are wrong.
By the end of this chapter, you should be able to join two DataFrames on one or more keys, choose the right join type, detect unmatched rows with indicator=True, and stack DataFrames vertically with concat.
<aside>
๐ฆ Run the examples: companion_ch4_joining_merging.py: run in the Codespace or clone locally to follow along with this chapter.
</aside>
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4],
"customer_id": [100, 101, 999, 102],
"product_id": ["A", "B", "A", "C"],
"price_date": ["2024-01", "2024-01", "2024-02", "2024-02"],
"amount": [120, 90, 200, 75],
}
)
customers = pd.DataFrame(
{
"customer_id": [100, 101, 102],
"name": ["Alice", "Bob", "Chloe"],
}
)
prices = pd.DataFrame(
{
"product_id": ["A", "B", "C", "A"],
"price_date": ["2024-01", "2024-01", "2024-02", "2024-02"],
"unit_price": [10.0, 15.0, 20.0, 11.0],
}
)
A join combines rows from two tables by matching values in a shared column (the key). The join type controls what happens to rows that have no match.
merged = orders.merge(customers, on="customer_id", how="left")
<aside>
๐ Core program connection: If you wrote SQL JOIN in the Core program, the semantics are identical: how="inner" is INNER JOIN, how="left" is LEFT JOIN, and so on. The difference is syntax: .merge() returns a DataFrame instead of a query result, and you chain it in Python instead of writing SQL. The concepts transfer directly.
</aside>
The four join types behave the same in SQL and Pandas: once you know one, you know both.
<aside> ๐ค Curious Geek: Join diagrams
Many tutorials use Venn diagrams for joins. They are a helpful metaphor, but real joins are based on keys, not sets.
</aside>
The choice of join type determines which rows survive, so always verify with indicator=True on your first run.
merged = orders.merge(
prices,
on=["product_id", "price_date"],
how="left",
)
indicatormerged = orders.merge(customers, on="customer_id", how="left", indicator=True)
merged["_merge"].value_counts()
_merge helps you spot rows that failed to match.
If your key is the index, use join.
customers = customers.set_index("customer_id")
orders = orders.set_index("customer_id")
merged = orders.join(customers, how="left")
concat stacks DataFrames vertically or horizontally.
y2024 = pd.DataFrame({"order_id": [1, 2], "amount": [120, 90]})
y2025 = pd.DataFrame({"order_id": [3, 4], "amount": [200, 75]})
all_orders = pd.concat([y2024, y2025], ignore_index=True) # resets index to 0,1,2,3
<aside> โ ๏ธ Many-to-many joins can multiply rows. If both tables have duplicate keys, the result size can explode.
</aside>
Use assert len(merged) == len(orders) after a left join to catch unexpected row multiplication early.
Use these sample tables. Join orders with customers on customer_id. Use indicator=True and count how many rows are left_only.
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4],
"customer_id": [100, 101, 999, 102],
"amount": [120, 90, 200, 75],
}
)
customers = pd.DataFrame(
{
"customer_id": [100, 101, 102],
"name": ["Alice", "Bob", "Chloe"],
}
)
<aside> ๐ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=joining_merging&exercise=w4_joining_merging__left_join_customers&lang=python
</aside>
Use the hands-on orders, customers, and prices DataFrames from the Hands on section for all exercises.
customers to orders using a left join on customer_id. Use indicator=True and count how many orders have no matching customer.suffixes=("_order", "_customer") to the merge and check the resulting column names.customers2 DataFrame with one new customer and use pd.concat to combine both customer tables into one, resetting the index.orders with prices on both product_id and price_date to add unit_price to each order.Test your recall before moving on.
concat instead of merge?indicator=True add to a merged DataFrame, and why is it useful?merge and join in Pandas?<aside> ๐ Try it in the widget: Interactive Quiz: Joining and Merging DataFrames
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch4_joining_merging_quiz&embed=1
If the join types or indicator column felt unclear, this video walks through merge and join from scratch.
<aside> ๐ฌ Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=iYWKfUOtGaw
You can also describe the join you need to an LLM to get a starting point.
<aside>
๐ก Using AI to help: Describe the two tables and the relationship between them (โ ๏ธ Ensure no PII or sensitive company data is included!): for example "orders table with customerid, customers table with customerid and name, left join to add name to each order", and ask an LLM to write the merge call. Always verify the row count of the result matches your expectations.
</aside>
Ready to apply these skills? Try the practice exercise before moving on.
<aside> โจ๏ธ Hands on: Practice with Exercise 4: Joining Two Tables.
</aside>
Next up: Working with Strings and Dates, where you clean messy text columns and parse dates using vectorized string and datetime operations.
The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.