Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Working with Strings and Dates

Real-world data is messy. Names have extra spaces, emails have mixed case, and dates come in ten different formats. Pandas gives you vectorized tools to clean this without loops.

By the end of this chapter, you should be able to clean string columns with the .str accessor, parse dates with pd.to_datetime, and extract date components for analysis.

<aside> ๐Ÿ“ฆ Run the examples: companion_ch5_strings_dates.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

import pandas as pd

users = pd.DataFrame(
    {
        "name": ["Alice Smith", "Bob Jones", "Chloe van Dam"],
        "email": ["  [email protected]", "[email protected] ", "[email protected]"],
        "phone": ["06 12345678", "06 23456789", "06 34567890"],
        "country": ["nl", "be", "de"],
    }
)

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "order_date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-03-22"]),
        "amount": [120, 90, 200],
    }
)

Vectorized String Operations

Use the .str accessor to apply string logic to a full column.

users["email"] = users["email"].str.strip().str.lower()
users["country"] = users["country"].str.upper()

users["has_gmail"] = users["email"].str.contains("@gmail.com", na=False)
users["first_name"] = users["name"].str.split(" ").str[0]

Pattern Matching and Extraction

The .str accessor also supports both literal and regex-based replacement and extraction for cases where simple string methods are not enough.

users["phone"] = users["phone"].str.replace(" ", "", regex=False)
users["area_code"] = users["phone"].str.extract(r"(\d{2,3})")

<aside> ๐Ÿค“ Curious Geek: ISO 8601

The YYYY-MM-DD format is part of the ISO 8601 date standard. It sorts correctly as a string.

</aside>

Parsing Dates

pd.to_datetime converts strings into datetime objects.

orders["order_date"] = pd.to_datetime(
    orders["order_date"],
    errors="coerce",
    format="%Y-%m-%d",
)

errors="coerce" turns invalid values into NaT so you can handle them later.

Practice parsing a date column and extracting its components:

<aside> โŒจ๏ธ Hands on: Implement extract_year_month(df) that calls pd.to_datetime on the "order_date" column, then uses .dt.year and .dt.month to add integer "year" and "month" columns. Return the updated DataFrame.

</aside>

<aside> ๐Ÿš€ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=strings_dates&exercise=w4_strings_dates__parse_dates&lang=python

</aside>

Date Components and Arithmetic

Once a column is parsed as datetime, the .dt accessor gives you individual components: year, month, day, weekday, and more.

orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["weekday"] = orders["order_date"].dt.day_name()

orders["days_since_order"] = (pd.Timestamp("today") - orders["order_date"]).dt.days

โŒจ๏ธ Hands on: Clean Strings and Parse Dates

Use this sample table. Clean the city column by stripping spaces and converting to title case. Then parse signup_date and extract the year.

import pandas as pd

users = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Chloe"],
        "city": [" amsterdam", "ROTTERDAM ", " the hague "],
        "signup_date": ["2024-01-05", "2024/01/10", "invalid"],
    }
)

<aside> ๐Ÿš€ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=strings_dates&exercise=w4_strings_dates__clean_city_year&lang=python

</aside>

Exercises

  1. Clean a column email by lowercasing and trimming spaces.
  2. Extract a numeric invoice_id from a column like INV-2024-0012.
  3. Parse a created_at column with mixed formats and count invalid rows.

Extra reading

Knowledge Check

Test your recall before moving on.

<aside> ๐Ÿš€ Try it in the widget: Interactive Quiz: Working with Strings and Dates

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch5_strings_dates_quiz&embed=1

If the .str accessor or pd.to_datetime felt unclear, this video walks through dates and time series in Pandas step by step.

<aside> ๐ŸŽฌ Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=UFuo7EHI8zc

You can also describe a messy column to an LLM to get a starting point.

<aside> ๐Ÿ’ก Using AI to help: Paste a few sample values from your string or date column (โš ๏ธ Ensure no PII or sensitive company data is included!): for example [" [email protected]", "[email protected] "], and ask an LLM to write the .str chain to clean it. Always test the result on the actual column before using it in production.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> โŒจ๏ธ Hands on: Practice with Exercise 6: Strings and Dates.

</aside>


Next up: Advanced Transformations, where you reshape DataFrames with pivot, melt, and window functions for dashboards and reporting.


The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.