Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Most bugs in analytics come from selecting the wrong rows. If you can slice DataFrames confidently, you can trust your results.
This chapter shows you the main selection tools in Pandas and how to filter and sort data without surprises.
By the end of this chapter, you should be able to select rows and columns with loc and iloc, write boolean filters combining multiple conditions, and sort and deduplicate a DataFrame.
<aside>
๐ฆ Run the examples: companion_ch2_selecting_filtering_sorting.py: run in the Codespace or clone locally to follow along with this chapter.
</aside>
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4, 5],
"country": ["NL", "BE", "DE", "NL", "BE"],
"amount": [45, 120, 75, 210, 180],
}
)
Use square brackets for column selection. A single column gives a Series. A list gives a DataFrame.
cols = ["order_id", "amount"]
orders["amount"]
orders[cols]
<aside>
๐ก Prefer bracket selection over orders.amount because column names with spaces or keywords will break attribute access.
</aside>
loc and ilocloc is label-based.iloc is position-based.orders.loc[0] # Row with index label 0 (label-based)
orders.loc[0:3, ["order_id", "amount"]] # Labels 0-3 inclusive, named columns
orders.iloc[0] # First row by position (position-based)
orders.iloc[0:3, 0:2] # Positions 0-2 exclusive, first 2 columns
<aside>
โ ๏ธ loc includes the end label in slices. iloc does not include the end position.
</aside>
Practice selecting specific rows and columns with loc:
<aside>
โจ๏ธ Hands on: Implement loc_select(df) using df.loc[condition, columns] to return only the rows where order_id is in [2, 4], keeping only the "order_id" and "amount" columns.
</aside>
<aside> ๐ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__loc_select&lang=python
</aside>
Filters are just boolean masks.
big_orders = orders[orders["amount"] > 100]
nl_big = orders[(orders["country"] == "NL") & (orders["amount"] > 100)]
<aside>
โ ๏ธ Always use & and | with parentheses. Python and and or do not work with Series.
</aside>
NumPy's boolean array model is why the syntax differs from plain Python conditionals.
<aside> ๐ค Curious Geek: Boolean masks
Pandas masking uses NumPy-style boolean arrays under the hood, which is why & and | are required.
</aside>
query Methodquery is readable for complex filters.
orders.query("country == 'NL' and amount > 100")
<aside>
๐ก Use query when you are explaining logic to a non-Python audience.
</aside>
orders.sort_values("amount", ascending=False)
orders.sort_values(["country", "amount"], ascending=[True, False])
orders.sort_index()
orders["amount_rank"] = orders["amount"].rank(ascending=False, method="dense")
# method="dense": tied values share a rank and the next rank is not skipped (1, 1, 2)
# method="min": tied values share a rank but the next rank IS skipped (1, 1, 3)
orders.duplicated(subset=["order_id"]).sum()
orders = orders.drop_duplicates(subset=["order_id"])
Use this sample DataFrame and filter it to keep only rows where amount is between 50 and 200 and country is either NL or BE. Sort the result by amount descending.
import pandas as pd
orders = pd.DataFrame(
{
"order_id": [1, 2, 3, 4, 5, 6],
"country": ["NL", "BE", "DE", "NL", "BE", "NL"],
"amount": [45, 120, 75, 210, 180, 95],
}
)
<aside> ๐ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__filter_orders&lang=python
</aside>
order_id, country, and amount from your DataFrame.amount > 100 and country == "NL".amount descending and add a rank column.order_id, then remove them keeping the first occurrence.loc and iloc?orders[orders["amount"] > 100 and orders["country"] == "NL"] raise a ValueError?query() over a boolean mask?drop_duplicates() removes all duplicate rows. What does drop_duplicates(subset=["order_id"]) do differently?loc and an iloc expression that achieve the same result, assuming the default integer index.Test your recall before moving on.
<aside> ๐ Try it in the widget: Interactive Quiz: Selecting and Filtering Data
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch2_selecting_filtering_quiz&embed=1
If the loc/iloc distinction or boolean mask syntax felt shaky, this video walks through the same concepts from a different angle.
<aside> ๐ฌ Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=Lw2rlcxScZY
You can also describe your filter requirement to an LLM to get a starting point.
<aside> ๐ก Using AI to help: Paste a filter requirement in plain English (โ ๏ธ Ensure no PII or sensitive company data is included!): for example "keep rows where city is Amsterdam and amount is above 100", and ask an LLM to write the boolean mask. Always verify the output against a small test DataFrame.
</aside>
Ready to apply these skills? Try the practice exercise before moving on.
<aside> โจ๏ธ Hands on: Practice with Exercise 2: Filtering and Sorting.
</aside>
Next up: Grouping and Aggregation, where you learn to summarise data by category using groupby and produce business metrics like revenue per region.
The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.