Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Selecting and Filtering Data

Most bugs in analytics come from selecting the wrong rows. If you can slice DataFrames confidently, you can trust your results.

This chapter shows you the main selection tools in Pandas and how to filter and sort data without surprises.

By the end of this chapter, you should be able to select rows and columns with loc and iloc, write boolean filters combining multiple conditions, and sort and deduplicate a DataFrame.

<aside> ๐Ÿ“ฆ Run the examples: companion_ch2_selecting_filtering_sorting.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5],
        "country": ["NL", "BE", "DE", "NL", "BE"],
        "amount": [45, 120, 75, 210, 180],
    }
)

Selecting Columns

Use square brackets for column selection. A single column gives a Series. A list gives a DataFrame.

cols = ["order_id", "amount"]
orders["amount"]
orders[cols]

<aside> ๐Ÿ’ก Prefer bracket selection over orders.amount because column names with spaces or keywords will break attribute access.

</aside>

Selecting Rows with loc and iloc

orders.loc[0]                           # Row with index label 0 (label-based)
orders.loc[0:3, ["order_id", "amount"]] # Labels 0-3 inclusive, named columns

orders.iloc[0]                          # First row by position (position-based)
orders.iloc[0:3, 0:2]                   # Positions 0-2 exclusive, first 2 columns

<aside> โš ๏ธ loc includes the end label in slices. iloc does not include the end position.

</aside>

Practice selecting specific rows and columns with loc:

<aside> โŒจ๏ธ Hands on: Implement loc_select(df) using df.loc[condition, columns] to return only the rows where order_id is in [2, 4], keeping only the "order_id" and "amount" columns.

</aside>

<aside> ๐Ÿš€ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__loc_select&lang=python

</aside>

Boolean Filtering

Filters are just boolean masks.

big_orders = orders[orders["amount"] > 100]

nl_big = orders[(orders["country"] == "NL") & (orders["amount"] > 100)]

<aside> โš ๏ธ Always use & and | with parentheses. Python and and or do not work with Series.

</aside>

NumPy's boolean array model is why the syntax differs from plain Python conditionals.

<aside> ๐Ÿค“ Curious Geek: Boolean masks

Pandas masking uses NumPy-style boolean arrays under the hood, which is why & and | are required.

</aside>

The query Method

query is readable for complex filters.

orders.query("country == 'NL' and amount > 100")

<aside> ๐Ÿ’ก Use query when you are explaining logic to a non-Python audience.

</aside>

Sorting and Ranking

orders.sort_values("amount", ascending=False)
orders.sort_values(["country", "amount"], ascending=[True, False])
orders.sort_index()

orders["amount_rank"] = orders["amount"].rank(ascending=False, method="dense")
# method="dense": tied values share a rank and the next rank is not skipped (1, 1, 2)
# method="min":   tied values share a rank but the next rank IS skipped   (1, 1, 3)

Duplicates

orders.duplicated(subset=["order_id"]).sum()
orders = orders.drop_duplicates(subset=["order_id"])

โŒจ๏ธ Hands on: Filter and Sort Orders

Use this sample DataFrame and filter it to keep only rows where amount is between 50 and 200 and country is either NL or BE. Sort the result by amount descending.

import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4, 5, 6],
        "country": ["NL", "BE", "DE", "NL", "BE", "NL"],
        "amount": [45, 120, 75, 210, 180, 95],
    }
)

<aside> ๐Ÿš€ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=selecting_filtering_sorting&exercise=w4_selecting_filtering_sorting__filter_orders&lang=python

</aside>

Exercises

  1. Select the columns order_id, country, and amount from your DataFrame.
  2. Filter orders to only include amount > 100 and country == "NL".
  3. Sort the filtered data by amount descending and add a rank column.
  4. Count how many rows have a duplicate order_id, then remove them keeping the first occurrence.

Extra reading

Knowledge Check

Test your recall before moving on.

<aside> ๐Ÿš€ Try it in the widget: Interactive Quiz: Selecting and Filtering Data

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch2_selecting_filtering_quiz&embed=1

If the loc/iloc distinction or boolean mask syntax felt shaky, this video walks through the same concepts from a different angle.

<aside> ๐ŸŽฌ Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=Lw2rlcxScZY

You can also describe your filter requirement to an LLM to get a starting point.

<aside> ๐Ÿ’ก Using AI to help: Paste a filter requirement in plain English (โš ๏ธ Ensure no PII or sensitive company data is included!): for example "keep rows where city is Amsterdam and amount is above 100", and ask an LLM to write the boolean mask. Always verify the output against a small test DataFrame.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> โŒจ๏ธ Hands on: Practice with Exercise 2: Filtering and Sorting.

</aside>


Next up: Grouping and Aggregation, where you learn to summarise data by category using groupby and produce business metrics like revenue per region.


The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.