Week 4 - Data Processing with Pandas
Introduction to Pandas and DataFrames
Selecting, Filtering, and Sorting Data
Joining and Merging DataFrames
Working with Strings and Dates
Assignment: MessyCorp Goes Pandas
Most bugs in analytics come from selecting the wrong rows. If you can slice DataFrames confidently, you can trust your results.
This chapter shows you the main selection tools in Pandas and how to filter and sort data without surprises.
Use square brackets for column selection. A single column gives a Series. A list gives a DataFrame.
cols = ["order_id", "amount"]
orders["amount"]
orders[cols]
<aside>
💡 Prefer bracket selection over orders.amount because column names with spaces or keywords will break attribute access.
</aside>
loc and ilocloc is label-based.iloc is position-based.orders.loc[0] # Row with index label 0
orders.loc[0:3, ["order_id", "amount"]]
orders.iloc[0] # First row by position
orders.iloc[0:3, 0:2]
<aside>
⚠️ loc includes the end label in slices. iloc does not include the end position.
</aside>
Filters are just boolean masks.
big_orders = orders[orders["amount"] > 100]
nl_big = orders[(orders["country"] == "NL") & (orders["amount"] > 100)]
<aside>
⚠️ Always use & and | with parentheses. Python and and or do not work with Series.
</aside>
<aside> 🤓 Curious Geek: Boolean masks
Pandas masking uses NumPy-style boolean arrays under the hood, which is why & and | are required.
</aside>
query Methodquery is readable for complex filters.
orders.query("country == 'NL' and amount > 100")
<aside>
💡 Use query when you are explaining logic to a non-Python audience.
</aside>
orders.sort_values("amount", ascending=False)
orders.sort_values(["country", "amount"], ascending=[True, False])
orders.sort_index()
orders["amount_rank"] = orders["amount"].rank(ascending=False, method="dense")
orders.duplicated(subset=["order_id"]).sum()
orders = orders.drop_duplicates(subset=["order_id"])
<aside>
⌨️ Hands on: Use this sample DataFrame and filter it to keep only rows where amount is between 50 and 200 and country is either NL or BE. Sort the result by amount descending.
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4, 5, 6],
"country": ["NL", "BE", "DE", "NL", "BE", "NL"],
"amount": [45, 120, 75, 210, 180, 95],
}
)
🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__filter_orders&lang=python
💭 The widget uses plain Python lists of dictionaries to mimic a table.
</aside>
order_id, country, and amount from your DataFrame.amount > 100 and country == "NL".amount descending and add a rank column.The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.