Introduction to Pandas and DataFrames

Pandas is the default tool for working with tabular data in Python. Once your data is ingested, you still need to clean it, inspect it, join it, and transform it into something useful. Pandas gives you a fast, expressive way to do all of that without writing slow Python loops.

By the end of this chapter, you should be able to create a DataFrame, understand its basic building blocks, and inspect data quality quickly.

Concepts

ETL vs ELT and where Pandas fits

In data engineering, ETL means Extract, Transform, Load. You clean and transform data before it reaches the warehouse. ELT means Extract, Load, Transform. You load raw data first, then transform it inside the warehouse.

Pandas is a common tool for the Transform step. In ETL, Pandas runs before loading. In ELT, Pandas can still be used for smaller transformations or local validation before pushing data to the warehouse.

Real-world examples:

Zillow uses Apache Airflow to author and manage ETL pipelines for large-scale data processing and analytics.
Uber describes incremental ETL in its data lakehouse architecture using Apache Hudi to keep data fresh and cost-effective.
Spendesk loads data into Snowflake and uses dbt to transform it into data marts, which is a classic ELT pattern.
Exporo describes leaving ELT to Fivetran in its case study, another ELT pattern.

<aside> 💡 Use ETL when data quality must be enforced before loading. Use ELT when you want flexibility and fast iteration inside the warehouse.

</aside>

Pandas, Series, DataFrames, and Index

A Series is a single labeled column. A DataFrame is a table made of multiple Series. Every DataFrame has an Index, which is the row label used for alignment and joins.

import pandas as pd

scores = pd.Series([90, 85, 70], name="score")
users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Chloe"],
        "score": [90, 85, 70],
    }
)

<aside> 💡 The Index is not the same as a column. It is a label layer. You can keep it or reset it, but you should always be aware of it.

</aside>

graph TD
    DF[DataFrame] --> Index
    DF --> Columns
    Columns --> Series1[Series: user_id]
    Columns --> Series2[Series: name]
    Columns --> Series3[Series: score]

    subgraph Structure
    Index -- Label --> Rows
    Series1 -- Data --> Rows
    Series2 -- Data --> Rows
    Series3 -- Data --> Rows
    end

<aside> 🤓 Curious Geek: Why the name Pandas?

It comes from the term panel data, which describes multi-dimensional data in econometrics.

</aside>

Creating DataFrames

You can build DataFrames from dictionaries, lists, or files. Use the simplest option for your data source.

# From a dict of lists
users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "country": ["NL", "BE", "NL"],
    }
)

# From a list of dicts
orders = pd.DataFrame(
    [
        {"order_id": 101, "amount": 120.0},
        {"order_id": 102, "amount": 90.0},
    ]
)

# From a CSV file
orders = pd.read_csv("data/orders.csv")

<aside> ⌨️ Hands on: Use this sample data, then add two more rows and print the DataFrame.

import pandas as pd

products = pd.DataFrame(
    {
        "product": ["laptop", "mouse", "keyboard"],
        "price": [1200, 25, 85],
        "in_stock": [True, True, False],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=intro_pandas_dataframes&exercise=w4_intro_pandas_dataframes__table_shape&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Inspecting Data Quickly

Fast inspection saves hours of debugging later.

orders.head(3)
orders.tail(3)
orders.sample(3, random_state=42)
orders.shape
orders.dtypes
orders.info()
orders.describe(numeric_only=True)

Use info() to check missing values and data types, and describe() to spot outliers.

<aside> 💡 If info() shows object for a numeric column, it probably needs type cleaning.

</aside>

Missing Data Basics

Pandas represents missing values as NaN (for numbers) or NaT (for datetimes).

orders.isna().sum()
orders["amount"] = orders["amount"].fillna(0)
orders = orders.dropna(subset=["order_id"])

<aside> ⌨️ Hands on: Load this CSV with missing values. Use isna().sum() to count missing data per column, then decide which columns to fill and which rows to drop.

from io import StringIO
import pandas as pd

csv_data = StringIO(
    \\"\\"\\"order_id,amount,region
1,120,NL
2,,BE
3,75,
4,200,DE
\\"\\"\\"
)

orders = pd.read_csv(csv_data)

</aside>

Basic Column Operations

Vectorized operations are fast and readable.

orders["amount_eur"] = orders["amount"] * 0.92
orders["is_big_order"] = orders["amount"] > 100
orders = orders.assign(amount_usd=lambda df: df["amount"] * 1.1)

Exercises

Create a DataFrame from a dictionary with columns city, population, and country.
Use head(), info(), and describe() to summarize the DataFrame.
Add a new column population_millions and round it to two decimals.

Extra reading

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.