Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

🗓️ Working with Strings and Dates

Real-world data is messy. Names have extra spaces, emails have mixed case, and dates come in ten different formats. Pandas gives you vectorized tools to clean this without loops.

Concepts

Vectorized String Operations

Use the .str accessor to apply string logic to a full column.

users["email"] = users["email"].str.strip().str.lower()
users["country"] = users["country"].str.upper()

users["has_gmail"] = users["email"].str.contains("@gmail.com", na=False)
users["first_name"] = users["name"].str.split(" ").str[0]

Pattern Matching and Extraction

users["phone"] = users["phone"].str.replace(" ", "", regex=False)
users["area_code"] = users["phone"].str.extract(r"(\\d{2,3})")

<aside> 💡 If a pattern is complex, ask an LLM to draft the regex, then test it on real data before using it in production.

</aside>

<aside> 🤓 Curious Geek: ISO 8601

The YYYY-MM-DD format is part of the ISO 8601 date standard. It sorts correctly as a string.

</aside>

Parsing Dates

pd.to_datetime converts strings into datetime objects.

orders["order_date"] = pd.to_datetime(
    orders["order_date"],
    errors="coerce",
    format="%Y-%m-%d",
)

errors="coerce" turns invalid values into NaT so you can handle them later.

Date Components and Arithmetic

orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["weekday"] = orders["order_date"].dt.day_name()

orders["days_since_order"] = (pd.Timestamp("today") - orders["order_date"]).dt.days

Time Zones and Resampling

orders["order_date"] = orders["order_date"].dt.tz_localize("UTC")
orders["order_date_eu"] = orders["order_date"].dt.tz_convert("Europe/Amsterdam")

monthly = orders.set_index("order_date").resample("M")["amount"].sum()

<aside> ⌨️ Hands on: Use this sample table. Clean the city column by stripping spaces and converting to title case. Then parse signup_date and extract the year.

import pandas as pd

users = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Chloe"],
        "city": [" amsterdam", "ROTTERDAM ", " the hague "],
        "signup_date": ["2024-01-05", "2024/01/10", "invalid"],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=strings_dates&exercise=w4_strings_dates__clean_city_year&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

  1. Clean a column email by lowercasing and trimming spaces.
  2. Extract a numeric invoice_id from a column like INV-2024-0012.
  3. Parse a created_at column with mixed formats and count invalid rows.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.