Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Content

🛠️ Practice

<aside> 💡 Using AI to help: When you get stuck, paste the failing snippet (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask it to explain the error and suggest a fix. Then verify the fix yourself by running the corrected code.

</aside>

Exercises 2–5 build on the orders DataFrame from Exercise 1. Run Exercise 1 first, or paste its setup block at the top of your script.

Open the workspace once

All Week 4 exercises live under data-track/week-4/ in HYF's Learning-Resources repo. One Codespace covers all 7 exercises.

<aside> 💻 Open in GitHub Codespaces

</aside>

The repo's data-track/.devcontainer/ boots Python 3.11 + ruff + Pylance for every exercise. From the Codespace's Explorer, navigate into data-track/week-4/exercise_N/.

Prefer your own VS Code? Clone locally instead:

git clone <https://github.com/HackYourFuture/Learning-Resources.git>
cd Learning-Resources/data-track/week-4
code .

Each exercise folder ships its own requirements.txt (when needed) and a per-exercise README with detailed instructions.

Reference solutions (peek only after attempting)

Each exercise_N/solutions/ folder holds the answer in-place. The starter file is filled with the answer code, the original # TODO comments are preserved, and # WHY ...: notes sit under each non-obvious choice.

Read the WHY notes, not the code. The point is the reasoning, not the syntax.

Spoiler discipline

The solution sits next to your starter under solutions/ rather than on a separate branch. The folder name and the deliberate "open this folder to see the answer" click are the whole barrier, and they are enough. Time-box yourself: 10-30 minutes of honest attempt before you open solutions/. The struggle is where the learning happens.

You can diff your attempt against the reference once you have tried:

diff exercise_1/exercise.py exercise_1/solutions/exercise.py

Exercise 1: Quick EDA on Orders

Concepts: DataFrame creation, info(), describe(), missing values.

Create a small dataset and explore it.

from io import StringIO
import pandas as pd

csv_data = StringIO(
    """order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05
"""
)

orders = pd.read_csv(csv_data)

Instructions:

  1. Run orders.info() and orders.describe().
  2. Count missing values per column.
  3. Fill missing amount values with 0.

<aside> 📦 Files: exercise_1/: use the Codespace you opened at the top of this page.

</aside>


Exercise 2: Filtering and Sorting

Concepts: Boolean masks, loc, sorting.

Instructions:

  1. Filter orders where region is NL and amount is greater than 80.
  2. Sort the filtered data by amount descending.
  3. Add a column is_big_order where amount >= 150.

<aside> 📦 Files: exercise_2/: use the Codespace you opened at the top of this page.

</aside>


Exercise 3: Grouping and Aggregation

Concepts: groupby, agg, transform.

Instructions:

  1. Group by region and calculate total revenue and order count.
  2. Add a column region_avg with the average order amount per region.
  3. Verify that region_avg is the same for all rows in the same region.

<aside> 📦 Files: exercise_3/: use the Codespace you opened at the top of this page.

</aside>


Exercise 4: Joining Two Tables

Concepts: merge, join types.

Create a customer table:

customers = pd.DataFrame(
    {
        "customer_id": [100, 101, 102, 103],
        "name": ["Alice", "Bob", "Chloe", "Daan"],
        "segment": ["retail", "retail", "b2b", "b2b"],
    }
)

Instructions:

  1. Join orders with customers using a left join on customer_id.
  2. Count how many orders belong to each segment.

<aside> 📦 Files: exercise_4/: use the Codespace you opened at the top of this page.

</aside>


Exercise 5: Reshape and Export

Concepts: pivot_table, file output.

Run this setup before writing files:

from pathlib import Path
Path("output").mkdir(exist_ok=True)

Instructions:

  1. Create a pivot table of total amount by region and order_date.
  2. Save the pivot table to output/pivot.csv and output/pivot.parquet.
  3. Read back the Parquet file and confirm the row count.

<aside> 📦 Files: exercise_5/: use the Codespace you opened at the top of this page. Includes pyarrow in requirements.txt for Parquet support.

</aside>


Exercise 6: Strings and Dates

Concepts: .str accessor, pd.to_datetime, .dt accessor.

Use the orders DataFrame from Exercise 1, extended with an email column:

orders['email'] = ['[email protected]', '[email protected]', None, '[email protected]', '[email protected]']
orders['order_date'] = pd.to_datetime(orders['order_date'])

Instructions:

  1. Normalize email to lowercase with no leading or trailing whitespace. Both .str.strip() and .str.lower() skip NaN values by default, so missing emails stay as NaN.
  2. Extract the year and month from order_date into two new columns: year and month.
  3. Filter to orders placed in January 2024 using the .dt accessor.

<aside> 📦 Files: exercise_6/: use the Codespace you opened at the top of this page.

</aside>


Exercise 7: Visualize Revenue

Concepts: groupby + plot, savefig, headless backend.

import matplotlib
matplotlib.use("Agg")   # must come before importing pyplot
import matplotlib.pyplot as plt
from pathlib import Path
Path("output").mkdir(exist_ok=True)