Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Pandas and DataFrames

Pandas is the default tool for working with tabular data in Python. Once your data is ingested, you still need to clean it, inspect it, join it, and transform it into something useful. Pandas gives you a fast, expressive way to do all of that without writing slow Python loops.

By the end of this chapter, you should be able to create a DataFrame, understand its basic building blocks, and inspect data quality quickly.

<aside> ๐Ÿ’ก Pandas handles the Transform step in data pipelines: after data is ingested, you clean and reshape it before loading results into a database or file. You will see this pattern throughout the week.

</aside>

The companion script for this chapter covers all the code examples below. You can run it as a plain .py file or, if you prefer a notebook environment, follow the Jupyter Notebooks chapter first.

<aside> ๐Ÿ“ฆ Run the examples: companion_ch1_intro_pandas_dataframes.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

Why not just use SQL?

In the Core program you learned SQL Basics. SQL is still the right tool when structured data already lives in a database and you want to filter, join, or aggregate it close to where it is stored.

Pandas solves a different part of the workflow. Many data pipelines start with CSV files, Excel exports, API responses, JSON payloads, or messy local files. Before that data is ready for SQL, you often need to inspect it, fix types, handle missing values, clean text, validate assumptions, reshape tables, and write reusable outputs. Pandas gives you that in Python, in the same script that ingests and validates the data.

Later in the track you return to SQL for warehouse-scale transformations. This week focuses on the Python-side transformation work that happens before data is clean enough to load, or when the output needs to be a file, chart, or local report.

Pandas, Series, DataFrames, and Index

A Series is a single labeled column. A DataFrame is a table made of multiple Series. Every DataFrame has an Index, which is the row label used for alignment and joins.

import pandas as pd

scores = pd.Series([90, 85, 70], name="score")
users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Chloe"],
        "score": [90, 85, 70],
    }
)

<aside> ๐Ÿ’ก The Index is not the same as a column. It is a label layer. You can keep it or reset it, but you should always be aware of it.

</aside>

graph TD
    DF[DataFrame] --> Index
    DF --> Columns
    Columns --> Series1[Series: user_id]
    Columns --> Series2[Series: name]
    Columns --> Series3[Series: score]

    subgraph Structure
    Index -- Label --> Rows
    Series1 -- Data --> Rows
    Series2 -- Data --> Rows
    Series3 -- Data --> Rows
    end

<aside> ๐Ÿค“ Curious Geek: Why the name Pandas?

It comes from the term panel data, which describes multi-dimensional data in econometrics.

</aside>

Creating DataFrames

You can build DataFrames from dictionaries, lists, or files. Use the simplest option for your data source.

import pandas as pd
from io import StringIO

# From a dict of lists
users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "country": ["NL", "BE", "NL"],
    }
)

# From a list of dicts
orders = pd.DataFrame(
    [
        {"order_id": 101, "amount": 120.0},
        {"order_id": 102, "amount": 90.0},
    ]
)

# From a CSV file
orders = pd.read_csv(StringIO("order_id,amount\n101,120.0\n102,90.0\n103,45.5"))

โŒจ๏ธ Hands on: Build a DataFrame

Use this sample data, then add two more rows and print the DataFrame.

import pandas as pd

products = pd.DataFrame(
    {
        "product": ["laptop", "mouse", "keyboard"],
        "price": [1200, 25, 85],
        "in_stock": [True, True, False],
    }
)

<aside> ๐Ÿš€ Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=intro_pandas_dataframes&exercise=w4_intro_pandas_dataframes__table_shape&lang=python

</aside>

Inspecting Data Quickly

Fast inspection saves hours of debugging later.

<aside> ๐Ÿ’ก The snippets below use the orders DataFrame created in the block above. Run them in sequence or keep state from the previous block.

</aside>

orders.head(3)
orders.tail(3)
orders.sample(3, random_state=42)
orders.shape
orders.dtypes
orders.info()
orders.describe()

Use info() to check missing values and data types, and describe() to spot outliers.

<aside> ๐Ÿ’ก If info() shows object for a numeric column, it probably needs type cleaning.

</aside>

Missing Data Basics

Pandas represents missing values as NaN (for numbers) or NaT (for datetimes).

orders.isna().sum()
orders["amount"] = orders["amount"].fillna(0)
orders = orders.dropna(subset=["order_id"])

โŒจ๏ธ Hands on: Handle Missing Values

Load this CSV with missing values. Use isna().sum() to count missing data per column, then decide which columns to fill and which rows to drop.

from io import StringIO
import pandas as pd

csv_data = StringIO(
    """order_id,amount,region
1,120,NL
2,,BE
3,75,
4,200,DE
"""
)

orders = pd.read_csv(csv_data)

Basic Column Operations

Vectorized operations are fast and readable.

orders["amount_eur"] = orders["amount"] * 0.92
orders["is_big_order"] = orders["amount"] > 100
orders = orders.assign(amount_usd=lambda df: df["amount"] * 1.1)

Exercises

  1. Create a DataFrame from a dictionary with columns city, population, and country.
  2. Use head(), info(), and describe() to summarize the DataFrame.
  3. Add a new column population_millions and round it to two decimals.

Extra reading

Knowledge Check

Test your recall before moving on.

<aside> ๐Ÿš€ Try it in the widget: Interactive Quiz: Pandas and DataFrames

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch1_intro_pandas_dataframes_quiz&embed=1

If the chapter felt fast, a short video covers the same ground from a different angle.

<aside> ๐ŸŽฌ Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=ZyhVh-qRZPA

You can also use an LLM to explore specific concepts further.

<aside> ๐Ÿ’ก Using AI to help: Paste a sample DataFrame (โš ๏ธ Ensure no PII or sensitive company data is included!) into an LLM and ask it to explain what info() output means, or to suggest which columns need cleaning. Describing the shape of your data often leads to better debugging suggestions than pasting the full dataset.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> โŒจ๏ธ Hands on: Practice with Exercise 1: Quick EDA on Orders.

</aside>


Next up: Selecting and Filtering Data, where you learn to slice rows and columns precisely using loc, iloc, and boolean masks.


The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.