Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Content
<aside> 💡 Using AI to help: When you get stuck, paste the failing snippet (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask it to explain the error and suggest a fix. Then verify the fix yourself by running the corrected code.
</aside>
Exercises 2–5 build on the orders DataFrame from Exercise 1. Run Exercise 1 first, or paste its setup block at the top of your script.
All Week 4 exercises live under data-track/week-4/ in HYF's Learning-Resources repo. One Codespace covers all 7 exercises.
<aside> 💻 Open in GitHub Codespaces
</aside>
The repo's data-track/.devcontainer/ boots Python 3.11 + ruff + Pylance for every exercise. From the Codespace's Explorer, navigate into data-track/week-4/exercise_N/.
Prefer your own VS Code? Clone locally instead:
git clone <https://github.com/HackYourFuture/Learning-Resources.git>
cd Learning-Resources/data-track/week-4
code .
Each exercise folder ships its own requirements.txt (when needed) and a per-exercise README with detailed instructions.
Each exercise_N/solutions/ folder holds the answer in-place. The starter file is filled with the answer code, the original # TODO comments are preserved, and # WHY ...: notes sit under each non-obvious choice.
Read the WHY notes, not the code. The point is the reasoning, not the syntax.
The solution sits next to your starter under solutions/ rather than on a separate branch. The folder name and the deliberate "open this folder to see the answer" click are the whole barrier, and they are enough. Time-box yourself: 10-30 minutes of honest attempt before you open solutions/. The struggle is where the learning happens.
You can diff your attempt against the reference once you have tried:
diff exercise_1/exercise.py exercise_1/solutions/exercise.py
Concepts: DataFrame creation, info(), describe(), missing values.
Create a small dataset and explore it.
from io import StringIO
import pandas as pd
csv_data = StringIO(
"""order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05
"""
)
orders = pd.read_csv(csv_data)
Instructions:
orders.info() and orders.describe().amount values with 0.<aside>
📦 Files: exercise_1/: use the Codespace you opened at the top of this page.
</aside>
Concepts: Boolean masks, loc, sorting.
Instructions:
region is NL and amount is greater than 80.amount descending.is_big_order where amount >= 150.<aside>
📦 Files: exercise_2/: use the Codespace you opened at the top of this page.
</aside>
Concepts: groupby, agg, transform.
Instructions:
region and calculate total revenue and order count.region_avg with the average order amount per region.region_avg is the same for all rows in the same region.<aside>
📦 Files: exercise_3/: use the Codespace you opened at the top of this page.
</aside>
Concepts: merge, join types.
Create a customer table:
customers = pd.DataFrame(
{
"customer_id": [100, 101, 102, 103],
"name": ["Alice", "Bob", "Chloe", "Daan"],
"segment": ["retail", "retail", "b2b", "b2b"],
}
)
Instructions:
orders with customers using a left join on customer_id.segment.<aside>
📦 Files: exercise_4/: use the Codespace you opened at the top of this page.
</aside>
Concepts: pivot_table, file output.
Run this setup before writing files:
from pathlib import Path
Path("output").mkdir(exist_ok=True)
Instructions:
amount by region and order_date.output/pivot.csv and output/pivot.parquet.<aside>
📦 Files: exercise_5/: use the Codespace you opened at the top of this page. Includes pyarrow in requirements.txt for Parquet support.
</aside>
Concepts: .str accessor, pd.to_datetime, .dt accessor.
Use the orders DataFrame from Exercise 1, extended with an email column:
orders['email'] = ['[email protected]', '[email protected]', None, '[email protected]', '[email protected]']
orders['order_date'] = pd.to_datetime(orders['order_date'])
Instructions:
email to lowercase with no leading or trailing whitespace. Both .str.strip() and .str.lower() skip NaN values by default, so missing emails stay as NaN.order_date into two new columns: year and month..dt accessor.<aside>
📦 Files: exercise_6/: use the Codespace you opened at the top of this page.
</aside>
Concepts: groupby + plot, savefig, headless backend.
import matplotlib
matplotlib.use("Agg") # must come before importing pyplot
import matplotlib.pyplot as plt
from pathlib import Path
Path("output").mkdir(exist_ok=True)