Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

🐼 Introduction to Pandas and DataFrames

Pandas is the default tool for working with tabular data in Python. Once your data is ingested, you still need to clean it, inspect it, join it, and transform it into something useful. Pandas gives you a fast, expressive way to do all of that without writing slow Python loops.

By the end of this chapter, you should be able to create a DataFrame, understand its basic building blocks, and inspect data quality quickly.

Concepts

ETL vs ELT and where Pandas fits

In data engineering, ETL means Extract, Transform, Load. You clean and transform data before it reaches the warehouse. ELT means Extract, Load, Transform. You load raw data first, then transform it inside the warehouse.

Pandas is a common tool for the Transform step. In ETL, Pandas runs before loading. In ELT, Pandas can still be used for smaller transformations or local validation before pushing data to the warehouse.

Real-world examples:

<aside> 💡 Use ETL when data quality must be enforced before loading. Use ELT when you want flexibility and fast iteration inside the warehouse.

</aside>

Pandas, Series, DataFrames, and Index

A Series is a single labeled column. A DataFrame is a table made of multiple Series. Every DataFrame has an Index, which is the row label used for alignment and joins.

import pandas as pd

scores = pd.Series([90, 85, 70], name="score")
users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Chloe"],
        "score": [90, 85, 70],
    }
)

<aside> 💡 The Index is not the same as a column. It is a label layer. You can keep it or reset it, but you should always be aware of it.

</aside>

graph TD
    DF[DataFrame] --> Index
    DF --> Columns
    Columns --> Series1[Series: user_id]
    Columns --> Series2[Series: name]
    Columns --> Series3[Series: score]

    subgraph Structure
    Index -- Label --> Rows
    Series1 -- Data --> Rows
    Series2 -- Data --> Rows
    Series3 -- Data --> Rows
    end

<aside> 🤓 Curious Geek: Why the name Pandas?

It comes from the term panel data, which describes multi-dimensional data in econometrics.

</aside>

Creating DataFrames

You can build DataFrames from dictionaries, lists, or files. Use the simplest option for your data source.

# From a dict of lists
users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "country": ["NL", "BE", "NL"],
    }
)

# From a list of dicts
orders = pd.DataFrame(
    [
        {"order_id": 101, "amount": 120.0},
        {"order_id": 102, "amount": 90.0},
    ]
)

# From a CSV file
orders = pd.read_csv("data/orders.csv")

<aside> ⌨️ Hands on: Use this sample data, then add two more rows and print the DataFrame.

import pandas as pd

products = pd.DataFrame(
    {
        "product": ["laptop", "mouse", "keyboard"],
        "price": [1200, 25, 85],
        "in_stock": [True, True, False],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=intro_pandas_dataframes&exercise=w4_intro_pandas_dataframes__table_shape&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Inspecting Data Quickly

Fast inspection saves hours of debugging later.

orders.head(3)
orders.tail(3)
orders.sample(3, random_state=42)
orders.shape
orders.dtypes
orders.info()
orders.describe(numeric_only=True)

Use info() to check missing values and data types, and describe() to spot outliers.

<aside> 💡 If info() shows object for a numeric column, it probably needs type cleaning.

</aside>

Missing Data Basics

Pandas represents missing values as NaN (for numbers) or NaT (for datetimes).

orders.isna().sum()
orders["amount"] = orders["amount"].fillna(0)
orders = orders.dropna(subset=["order_id"])

<aside> ⌨️ Hands on: Load this CSV with missing values. Use isna().sum() to count missing data per column, then decide which columns to fill and which rows to drop.

from io import StringIO
import pandas as pd

csv_data = StringIO(
    \\"\\"\\"order_id,amount,region
1,120,NL
2,,BE
3,75,
4,200,DE
\\"\\"\\"
)

orders = pd.read_csv(csv_data)

</aside>

Basic Column Operations

Vectorized operations are fast and readable.

orders["amount_eur"] = orders["amount"] * 0.92
orders["is_big_order"] = orders["amount"] > 100
orders = orders.assign(amount_usd=lambda df: df["amount"] * 1.1)

Exercises

  1. Create a DataFrame from a dictionary with columns city, population, and country.
  2. Use head(), info(), and describe() to summarize the DataFrame.
  3. Add a new column population_millions and round it to two decimals.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.