Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Real-world data is messy. Names have extra spaces, emails have mixed case, and dates come in ten different formats. Pandas gives you vectorized tools to clean this without loops.
By the end of this chapter, you should be able to clean string columns with the .str accessor, parse dates with pd.to_datetime, and extract date components for analysis.
<aside>
๐ฆ Run the examples: companion_ch5_strings_dates.py: run in the Codespace or clone locally to follow along with this chapter.
</aside>
import pandas as pd
users = pd.DataFrame(
{
"name": ["Alice Smith", "Bob Jones", "Chloe van Dam"],
"email": [" [email protected]", "[email protected] ", "[email protected]"],
"phone": ["06 12345678", "06 23456789", "06 34567890"],
"country": ["nl", "be", "de"],
}
)
orders = pd.DataFrame(
{
"order_id": [1, 2, 3],
"order_date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-03-22"]),
"amount": [120, 90, 200],
}
)
Use the .str accessor to apply string logic to a full column.
users["email"] = users["email"].str.strip().str.lower()
users["country"] = users["country"].str.upper()
users["has_gmail"] = users["email"].str.contains("@gmail.com", na=False)
users["first_name"] = users["name"].str.split(" ").str[0]
The .str accessor also supports both literal and regex-based replacement and extraction for cases where simple string methods are not enough.
users["phone"] = users["phone"].str.replace(" ", "", regex=False)
users["area_code"] = users["phone"].str.extract(r"(\d{2,3})")
<aside> ๐ค Curious Geek: ISO 8601
The YYYY-MM-DD format is part of the ISO 8601 date standard. It sorts correctly as a string.
</aside>
pd.to_datetime converts strings into datetime objects.
orders["order_date"] = pd.to_datetime(
orders["order_date"],
errors="coerce",
format="%Y-%m-%d",
)
errors="coerce" turns invalid values into NaT so you can handle them later.
Practice parsing a date column and extracting its components:
<aside>
โจ๏ธ Hands on: Implement extract_year_month(df) that calls pd.to_datetime on the "order_date" column, then uses .dt.year and .dt.month to add integer "year" and "month" columns. Return the updated DataFrame.
</aside>
<aside> ๐ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=strings_dates&exercise=w4_strings_dates__parse_dates&lang=python
</aside>
Once a column is parsed as datetime, the .dt accessor gives you individual components: year, month, day, weekday, and more.
orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month
orders["weekday"] = orders["order_date"].dt.day_name()
orders["days_since_order"] = (pd.Timestamp("today") - orders["order_date"]).dt.days
Use this sample table. Clean the city column by stripping spaces and converting to title case. Then parse signup_date and extract the year.
import pandas as pd
users = pd.DataFrame(
{
"name": ["Alice", "Bob", "Chloe"],
"city": [" amsterdam", "ROTTERDAM ", " the hague "],
"signup_date": ["2024-01-05", "2024/01/10", "invalid"],
}
)
<aside> ๐ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=strings_dates&exercise=w4_strings_dates__clean_city_year&lang=python
</aside>
email by lowercasing and trimming spaces.invoice_id from a column like INV-2024-0012.created_at column with mixed formats and count invalid rows.Test your recall before moving on.
users["email"].str.lower() instead of users["email"].lower()?errors="coerce" do in pd.to_datetime()?.dt.year and .dt.day_name()?str.contains("@gmail.com") call raises a ValueError on a column with missing values. How do you fix it?YYYY-MM-DD strings sort correctly, while DD/MM/YYYY does not?<aside> ๐ Try it in the widget: Interactive Quiz: Working with Strings and Dates
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch5_strings_dates_quiz&embed=1
If the .str accessor or pd.to_datetime felt unclear, this video walks through dates and time series in Pandas step by step.
<aside> ๐ฌ Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=UFuo7EHI8zc
You can also describe a messy column to an LLM to get a starting point.
<aside>
๐ก Using AI to help: Paste a few sample values from your string or date column (โ ๏ธ Ensure no PII or sensitive company data is included!): for example [" [email protected]", "[email protected] "], and ask an LLM to write the .str chain to clean it. Always test the result on the actual column before using it in production.
</aside>
Ready to apply these skills? Try the practice exercise before moving on.
<aside> โจ๏ธ Hands on: Practice with Exercise 6: Strings and Dates.
</aside>
Next up: Advanced Transformations, where you reshape DataFrames with pivot, melt, and window functions for dashboards and reporting.
The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.