Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

Content

🛠️ Practice

Exercise 1: Quick EDA on Orders

Concepts: DataFrame creation, info(), describe(), missing values.

Create a small dataset and explore it.

from io import StringIO
import pandas as pd

csv_data = StringIO(
    """order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05
"""
)

orders = pd.read_csv(csv_data)

Instructions:

  1. Run orders.info() and orders.describe().
  2. Count missing values per column.
  3. Fill missing amount values with 0.

Exercise 2: Filtering and Sorting

Concepts: Boolean masks, loc, sorting.

Instructions:

  1. Filter orders where region is NL and amount is greater than 80.
  2. Sort the filtered data by amount descending.
  3. Add a column is_big_order where amount >= 150.

Exercise 3: Grouping and Aggregation

Concepts: groupby, agg, transform.

Instructions:

  1. Group by region and calculate total revenue and order count.
  2. Add a column region_avg with the average order amount per region.
  3. Verify that region_avg is the same for all rows in the same region.

Exercise 4: Joining Two Tables

Concepts: merge, join types.

Create a customer table:

customers = pd.DataFrame(
    {
        "customer_id": [100, 101, 102, 103],
        "name": ["Alice", "Bob", "Chloe", "Daan"],
        "segment": ["retail", "retail", "b2b", "b2b"],
    }
)

Instructions:

  1. Join orders with customers using a left join on customer_id.
  2. Count how many orders belong to each segment.

Exercise 5: Reshape and Export

Concepts: pivot_table, file output.

Instructions:

  1. Create a pivot table of total amount by region and order_date.
  2. Save the pivot table to output/pivot.csv and output/pivot.parquet.
  3. Read back the Parquet file and confirm the row count.

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.