Selecting and Filtering Data

Most bugs in analytics come from selecting the wrong rows. If you can slice DataFrames confidently, you can trust your results.

This chapter shows you the main selection tools in Pandas and how to filter and sort data without surprises.

By the end of this chapter, you should be able to select rows and columns with loc and iloc, write boolean filters combining multiple conditions, and sort and deduplicate a DataFrame.

<aside> 📦 Run the examples: companion_ch2_selecting_filtering_sorting.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5],
        "country": ["NL", "BE", "DE", "NL", "BE"],
        "amount": [45, 120, 75, 210, 180],
    }
)

Selecting Columns

Use square brackets for column selection. A single column gives a Series. A list gives a DataFrame.

cols = ["order_id", "amount"]
orders["amount"]
orders[cols]

<aside> 💡 Prefer bracket selection over orders.amount because column names with spaces or keywords will break attribute access.

</aside>

Selecting Rows with `loc` and `iloc`

loc is label-based.
iloc is position-based.

orders.loc[0]                           # Row with index label 0 (label-based)
orders.loc[0:3, ["order_id", "amount"]] # Labels 0-3 inclusive, named columns

orders.iloc[0]                          # First row by position (position-based)
orders.iloc[0:3, 0:2]                   # Positions 0-2 exclusive, first 2 columns

<aside> ⚠️ loc includes the end label in slices. iloc does not include the end position.

</aside>

Practice selecting specific rows and columns with loc:

<aside> ⌨️ Hands on: Implement loc_select(df) using df.loc[condition, columns] to return only the rows where order_id is in [2, 4], keeping only the "order_id" and "amount" columns.

</aside>

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__loc_select&lang=python

</aside>

Boolean Filtering

Filters are just boolean masks.

big_orders = orders[orders["amount"] > 100]

nl_big = orders[(orders["country"] == "NL") & (orders["amount"] > 100)]

<aside> ⚠️ Always use & and | with parentheses. Python and and or do not work with Series.

</aside>

NumPy's boolean array model is why the syntax differs from plain Python conditionals.

<aside> 🤓 Curious Geek: Boolean masks

Pandas masking uses NumPy-style boolean arrays under the hood, which is why & and | are required.

</aside>

The `query` Method

query is readable for complex filters.

orders.query("country == 'NL' and amount > 100")

<aside> 💡 Use query when you are explaining logic to a non-Python audience.

</aside>

Sorting and Ranking

orders.sort_values("amount", ascending=False)
orders.sort_values(["country", "amount"], ascending=[True, False])
orders.sort_index()

orders["amount_rank"] = orders["amount"].rank(ascending=False, method="dense")
# method="dense": tied values share a rank and the next rank is not skipped (1, 1, 2)
# method="min":   tied values share a rank but the next rank IS skipped   (1, 1, 3)

Duplicates

orders.duplicated(subset=["order_id"]).sum()
orders = orders.drop_duplicates(subset=["order_id"])

⌨️ Hands on: Filter and Sort Orders

Use this sample DataFrame and filter it to keep only rows where amount is between 50 and 200 and country is either NL or BE. Sort the result by amount descending.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "country": ["NL", "BE", "DE", "NL", "BE", "NL"],
        "amount": [45, 120, 75, 210, 180, 95],
    }
)

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__filter_orders&lang=python

</aside>

Exercises

Select the columns order_id, country, and amount from your DataFrame.
Filter orders to only include amount > 100 and country == "NL".
Sort the filtered data by amount descending and add a rank column.
Count how many rows have a duplicate order_id, then remove them keeping the first occurrence.

Extra reading

Knowledge Check

1. What is the difference between loc and iloc?
1. Why does orders[orders["amount"] > 100 and orders["country"] == "NL"] raise a ValueError?
1. When would you use query() over a boolean mask?
1. drop_duplicates() removes all duplicate rows. What does drop_duplicates(subset=["order_id"]) do differently?
1. You want to select rows 5 through 9 by position. Write both a loc and an iloc expression that achieve the same result, assuming the default integer index.

Test your recall before moving on.

<aside> 🚀 Try it in the widget: Interactive Quiz: Selecting and Filtering Data

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch2_selecting_filtering_quiz&embed=1

If the loc/iloc distinction or boolean mask syntax felt shaky, this video walks through the same concepts from a different angle.

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=Lw2rlcxScZY

You can also describe your filter requirement to an LLM to get a starting point.

<aside> 💡 Using AI to help: Paste a filter requirement in plain English (⚠️ Ensure no PII or sensitive company data is included!): for example "keep rows where city is Amsterdam and amount is above 100", and ask an LLM to write the boolean mask. Always verify the output against a small test DataFrame.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> ⌨️ Hands on: Practice with Exercise 2: Filtering and Sorting.

</aside>

Next up: Grouping and Aggregation, where you learn to summarise data by category using groupby and produce business metrics like revenue per region.

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.

Selecting and Filtering Data

Concepts

Selecting Columns

Selecting Rows with loc and iloc

Boolean Filtering

The query Method

Sorting and Ranking

Duplicates

⌨️ Hands on: Filter and Sort Orders

Exercises

Extra reading

Knowledge Check

Selecting Rows with `loc` and `iloc`

The `query` Method