Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

🔍 Selecting, Filtering, and Sorting Data

Most bugs in analytics come from selecting the wrong rows. If you can slice DataFrames confidently, you can trust your results.

This chapter shows you the main selection tools in Pandas and how to filter and sort data without surprises.

Concepts

Selecting Columns

Use square brackets for column selection. A single column gives a Series. A list gives a DataFrame.

cols = ["order_id", "amount"]
orders["amount"]
orders[cols]

<aside> 💡 Prefer bracket selection over orders.amount because column names with spaces or keywords will break attribute access.

</aside>

Selecting Rows with loc and iloc

orders.loc[0]               # Row with index label 0
orders.loc[0:3, ["order_id", "amount"]]

orders.iloc[0]              # First row by position
orders.iloc[0:3, 0:2]

<aside> ⚠️ loc includes the end label in slices. iloc does not include the end position.

</aside>

Boolean Filtering

Filters are just boolean masks.

big_orders = orders[orders["amount"] > 100]

nl_big = orders[(orders["country"] == "NL") & (orders["amount"] > 100)]

<aside> ⚠️ Always use & and | with parentheses. Python and and or do not work with Series.

</aside>

<aside> 🤓 Curious Geek: Boolean masks

Pandas masking uses NumPy-style boolean arrays under the hood, which is why & and | are required.

</aside>

The query Method

query is readable for complex filters.

orders.query("country == 'NL' and amount > 100")

<aside> 💡 Use query when you are explaining logic to a non-Python audience.

</aside>

Sorting and Ranking

orders.sort_values("amount", ascending=False)
orders.sort_values(["country", "amount"], ascending=[True, False])
orders.sort_index()

orders["amount_rank"] = orders["amount"].rank(ascending=False, method="dense")

Duplicates

orders.duplicated(subset=["order_id"]).sum()
orders = orders.drop_duplicates(subset=["order_id"])

<aside> ⌨️ Hands on: Use this sample DataFrame and filter it to keep only rows where amount is between 50 and 200 and country is either NL or BE. Sort the result by amount descending.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "country": ["NL", "BE", "DE", "NL", "BE", "NL"],
        "amount": [45, 120, 75, 210, 180, 95],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__filter_orders&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

  1. Select the columns order_id, country, and amount from your DataFrame.
  2. Filter orders to only include amount > 100 and country == "NL".
  3. Sort the filtered data by amount descending and add a rank column.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.