Selecting, Filtering, and Sorting Data

Most bugs in analytics come from selecting the wrong rows. If you can slice DataFrames confidently, you can trust your results.

This chapter shows you the main selection tools in Pandas and how to filter and sort data without surprises.

Concepts

Selecting Columns

Use square brackets for column selection. A single column gives a Series. A list gives a DataFrame.

cols = ["order_id", "amount"]
orders["amount"]
orders[cols]

<aside> 💡 Prefer bracket selection over orders.amount because column names with spaces or keywords will break attribute access.

</aside>

Selecting Rows with `loc` and `iloc`

loc is label-based.
iloc is position-based.

orders.loc[0]               # Row with index label 0
orders.loc[0:3, ["order_id", "amount"]]

orders.iloc[0]              # First row by position
orders.iloc[0:3, 0:2]

<aside> ⚠️ loc includes the end label in slices. iloc does not include the end position.

</aside>

Boolean Filtering

Filters are just boolean masks.

big_orders = orders[orders["amount"] > 100]

nl_big = orders[(orders["country"] == "NL") & (orders["amount"] > 100)]

<aside> ⚠️ Always use & and | with parentheses. Python and and or do not work with Series.

</aside>

<aside> 🤓 Curious Geek: Boolean masks

Pandas masking uses NumPy-style boolean arrays under the hood, which is why & and | are required.

</aside>

The `query` Method

query is readable for complex filters.

orders.query("country == 'NL' and amount > 100")

<aside> 💡 Use query when you are explaining logic to a non-Python audience.

</aside>

Sorting and Ranking

orders.sort_values("amount", ascending=False)
orders.sort_values(["country", "amount"], ascending=[True, False])
orders.sort_index()

orders["amount_rank"] = orders["amount"].rank(ascending=False, method="dense")

Duplicates

orders.duplicated(subset=["order_id"]).sum()
orders = orders.drop_duplicates(subset=["order_id"])

<aside> ⌨️ Hands on: Use this sample DataFrame and filter it to keep only rows where amount is between 50 and 200 and country is either NL or BE. Sort the result by amount descending.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "country": ["NL", "BE", "DE", "NL", "BE", "NL"],
        "amount": [45, 120, 75, 210, 180, 95],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__filter_orders&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

Select the columns order_id, country, and amount from your DataFrame.
Filter orders to only include amount > 100 and country == "NL".
Sort the filtered data by amount descending and add a rank column.

Extra reading

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.

Selecting, Filtering, and Sorting Data

Concepts

Selecting Columns

Selecting Rows with loc and iloc

Boolean Filtering

The query Method

Sorting and Ranking

Duplicates

Exercises

Extra reading

Selecting Rows with `loc` and `iloc`

The `query` Method