Week 4 - Data Processing with Pandas
Introduction to Pandas and DataFrames
Selecting, Filtering, and Sorting Data
Joining and Merging DataFrames
Working with Strings and Dates
Assignment: MessyCorp Goes Pandas
Pandas is the default tool for working with tabular data in Python. Once your data is ingested, you still need to clean it, inspect it, join it, and transform it into something useful. Pandas gives you a fast, expressive way to do all of that without writing slow Python loops.
By the end of this chapter, you should be able to create a DataFrame, understand its basic building blocks, and inspect data quality quickly.
In data engineering, ETL means Extract, Transform, Load. You clean and transform data before it reaches the warehouse. ELT means Extract, Load, Transform. You load raw data first, then transform it inside the warehouse.
Pandas is a common tool for the Transform step. In ETL, Pandas runs before loading. In ELT, Pandas can still be used for smaller transformations or local validation before pushing data to the warehouse.
Real-world examples:
<aside> 💡 Use ETL when data quality must be enforced before loading. Use ELT when you want flexibility and fast iteration inside the warehouse.
</aside>
A Series is a single labeled column. A DataFrame is a table made of multiple Series. Every DataFrame has an Index, which is the row label used for alignment and joins.
import pandas as pd
scores = pd.Series([90, 85, 70], name="score")
users = pd.DataFrame(
{
"user_id": [1, 2, 3],
"name": ["Alice", "Bob", "Chloe"],
"score": [90, 85, 70],
}
)
<aside> 💡 The Index is not the same as a column. It is a label layer. You can keep it or reset it, but you should always be aware of it.
</aside>
graph TD
DF[DataFrame] --> Index
DF --> Columns
Columns --> Series1[Series: user_id]
Columns --> Series2[Series: name]
Columns --> Series3[Series: score]
subgraph Structure
Index -- Label --> Rows
Series1 -- Data --> Rows
Series2 -- Data --> Rows
Series3 -- Data --> Rows
end
<aside> 🤓 Curious Geek: Why the name Pandas?
It comes from the term panel data, which describes multi-dimensional data in econometrics.
</aside>
You can build DataFrames from dictionaries, lists, or files. Use the simplest option for your data source.
# From a dict of lists
users = pd.DataFrame(
{
"user_id": [1, 2, 3],
"country": ["NL", "BE", "NL"],
}
)
# From a list of dicts
orders = pd.DataFrame(
[
{"order_id": 101, "amount": 120.0},
{"order_id": 102, "amount": 90.0},
]
)
# From a CSV file
orders = pd.read_csv("data/orders.csv")
<aside> ⌨️ Hands on: Use this sample data, then add two more rows and print the DataFrame.
import pandas as pd
products = pd.DataFrame(
{
"product": ["laptop", "mouse", "keyboard"],
"price": [1200, 25, 85],
"in_stock": [True, True, False],
}
)
🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=intro_pandas_dataframes&exercise=w4_intro_pandas_dataframes__table_shape&lang=python
💭 The widget uses plain Python lists of dictionaries to mimic a table.
</aside>
Fast inspection saves hours of debugging later.
orders.head(3)
orders.tail(3)
orders.sample(3, random_state=42)
orders.shape
orders.dtypes
orders.info()
orders.describe(numeric_only=True)
Use info() to check missing values and data types, and describe() to spot outliers.
<aside>
💡 If info() shows object for a numeric column, it probably needs type cleaning.
</aside>
Pandas represents missing values as NaN (for numbers) or NaT (for datetimes).
orders.isna().sum()
orders["amount"] = orders["amount"].fillna(0)
orders = orders.dropna(subset=["order_id"])
<aside>
⌨️ Hands on: Load this CSV with missing values. Use isna().sum() to count missing data per column, then decide which columns to fill and which rows to drop.
from io import StringIO
import pandas as pd
csv_data = StringIO(
\\"\\"\\"order_id,amount,region
1,120,NL
2,,BE
3,75,
4,200,DE
\\"\\"\\"
)
orders = pd.read_csv(csv_data)
</aside>
Vectorized operations are fast and readable.
orders["amount_eur"] = orders["amount"] * 0.92
orders["is_big_order"] = orders["amount"] > 100
orders = orders.assign(amount_usd=lambda df: df["amount"] * 1.1)
city, population, and country.head(), info(), and describe() to summarize the DataFrame.population_millions and round it to two decimals.The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.