Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Pandas is the default tool for working with tabular data in Python. Once your data is ingested, you still need to clean it, inspect it, join it, and transform it into something useful. Pandas gives you a fast, expressive way to do all of that without writing slow Python loops.
By the end of this chapter, you should be able to create a DataFrame, understand its basic building blocks, and inspect data quality quickly.
<aside> ๐ก Pandas handles the Transform step in data pipelines: after data is ingested, you clean and reshape it before loading results into a database or file. You will see this pattern throughout the week.
</aside>
The companion script for this chapter covers all the code examples below. You can run it as a plain .py file or, if you prefer a notebook environment, follow the Jupyter Notebooks chapter first.
<aside>
๐ฆ Run the examples: companion_ch1_intro_pandas_dataframes.py: run in the Codespace or clone locally to follow along with this chapter.
</aside>
In the Core program you learned SQL Basics. SQL is still the right tool when structured data already lives in a database and you want to filter, join, or aggregate it close to where it is stored.
Pandas solves a different part of the workflow. Many data pipelines start with CSV files, Excel exports, API responses, JSON payloads, or messy local files. Before that data is ready for SQL, you often need to inspect it, fix types, handle missing values, clean text, validate assumptions, reshape tables, and write reusable outputs. Pandas gives you that in Python, in the same script that ingests and validates the data.
Later in the track you return to SQL for warehouse-scale transformations. This week focuses on the Python-side transformation work that happens before data is clean enough to load, or when the output needs to be a file, chart, or local report.
A Series is a single labeled column. A DataFrame is a table made of multiple Series. Every DataFrame has an Index, which is the row label used for alignment and joins.
import pandas as pd
scores = pd.Series([90, 85, 70], name="score")
users = pd.DataFrame(
{
"user_id": [1, 2, 3],
"name": ["Alice", "Bob", "Chloe"],
"score": [90, 85, 70],
}
)
<aside> ๐ก The Index is not the same as a column. It is a label layer. You can keep it or reset it, but you should always be aware of it.
</aside>
graph TD
DF[DataFrame] --> Index
DF --> Columns
Columns --> Series1[Series: user_id]
Columns --> Series2[Series: name]
Columns --> Series3[Series: score]
subgraph Structure
Index -- Label --> Rows
Series1 -- Data --> Rows
Series2 -- Data --> Rows
Series3 -- Data --> Rows
end
<aside> ๐ค Curious Geek: Why the name Pandas?
It comes from the term panel data, which describes multi-dimensional data in econometrics.
</aside>
You can build DataFrames from dictionaries, lists, or files. Use the simplest option for your data source.
import pandas as pd
from io import StringIO
# From a dict of lists
users = pd.DataFrame(
{
"user_id": [1, 2, 3],
"country": ["NL", "BE", "NL"],
}
)
# From a list of dicts
orders = pd.DataFrame(
[
{"order_id": 101, "amount": 120.0},
{"order_id": 102, "amount": 90.0},
]
)
# From a CSV file
orders = pd.read_csv(StringIO("order_id,amount\n101,120.0\n102,90.0\n103,45.5"))
Use this sample data, then add two more rows and print the DataFrame.
import pandas as pd
products = pd.DataFrame(
{
"product": ["laptop", "mouse", "keyboard"],
"price": [1200, 25, 85],
"in_stock": [True, True, False],
}
)
<aside> ๐ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=intro_pandas_dataframes&exercise=w4_intro_pandas_dataframes__table_shape&lang=python
</aside>
Fast inspection saves hours of debugging later.
<aside>
๐ก The snippets below use the orders DataFrame created in the block above. Run them in sequence or keep state from the previous block.
</aside>
orders.head(3)
orders.tail(3)
orders.sample(3, random_state=42)
orders.shape
orders.dtypes
orders.info()
orders.describe()
Use info() to check missing values and data types, and describe() to spot outliers.
<aside>
๐ก If info() shows object for a numeric column, it probably needs type cleaning.
</aside>
Pandas represents missing values as NaN (for numbers) or NaT (for datetimes).
orders.isna().sum()
orders["amount"] = orders["amount"].fillna(0)
orders = orders.dropna(subset=["order_id"])
Load this CSV with missing values. Use isna().sum() to count missing data per column, then decide which columns to fill and which rows to drop.
from io import StringIO
import pandas as pd
csv_data = StringIO(
"""order_id,amount,region
1,120,NL
2,,BE
3,75,
4,200,DE
"""
)
orders = pd.read_csv(csv_data)
Vectorized operations are fast and readable.
orders["amount_eur"] = orders["amount"] * 0.92
orders["is_big_order"] = orders["amount"] > 100
orders = orders.assign(amount_usd=lambda df: df["amount"] * 1.1)
city, population, and country.head(), info(), and describe() to summarize the DataFrame.population_millions and round it to two decimals.info() show that describe() does not?fillna(0) do, and when should you avoid it?orders.amount * 0.92 instead of a loop. Why is this faster?Test your recall before moving on.
<aside> ๐ Try it in the widget: Interactive Quiz: Pandas and DataFrames
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch1_intro_pandas_dataframes_quiz&embed=1
If the chapter felt fast, a short video covers the same ground from a different angle.
<aside> ๐ฌ Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=ZyhVh-qRZPA
You can also use an LLM to explore specific concepts further.
<aside>
๐ก Using AI to help: Paste a sample DataFrame (โ ๏ธ Ensure no PII or sensitive company data is included!) into an LLM and ask it to explain what info() output means, or to suggest which columns need cleaning. Describing the shape of your data often leads to better debugging suggestions than pasting the full dataset.
</aside>
Ready to apply these skills? Try the practice exercise before moving on.
<aside> โจ๏ธ Hands on: Practice with Exercise 1: Quick EDA on Orders.
</aside>
Next up: Selecting and Filtering Data, where you learn to slice rows and columns precisely using loc, iloc, and boolean masks.
The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.