Week 4 - Data Processing with Pandas


⚡ Alternatives to Pandas

Pandas is the standard, but it is not always the best tool for large or highly parallel workloads. Knowing the alternatives helps you choose the right tool when data grows beyond laptop scale.

Concepts

When Pandas Starts to Hurt

Pandas loads entire datasets into memory and runs most operations eagerly on a single core. Once your data approaches the size of your RAM, or a pipeline needs every core you have, operations slow down sharply or fail outright.

Polars

Polars is a DataFrame library written in Rust with a Python API. It is fast, memory efficient, and parallel by default, and it offers an optional lazy mode (via scan_csv or .lazy()) that optimizes a whole query before executing it. The example below uses the eager API:

import polars as pl

orders = pl.read_csv("data/orders.csv")
result = (
    orders
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_revenue"))
)

When to use: large datasets, performance-critical pipelines, or when you want lazy evaluation.

DuckDB

DuckDB is an embedded analytical database. It runs SQL directly on files like Parquet and CSV.

import duckdb

result = duckdb.sql(
    """
    SELECT region, SUM(amount) AS total_revenue
    FROM 'data/orders.csv'
    GROUP BY region
    """
).df()

<aside> 🤓 Curious Geek: Embedded analytics

DuckDB is embedded like SQLite, but optimized for analytical queries on columnar data.

</aside>

When to use: SQL-first analytics, quick exploration of Parquet files, and complex joins.

Dask

Dask scales a Pandas-like API across multiple cores or machines.

import dask.dataframe as dd

orders = dd.read_csv("data/orders_*.csv")
result = orders.groupby("region")["amount"].sum().compute()

When to use: datasets larger than RAM and existing Pandas pipelines that need to scale.

Quick Comparison

| Tool   | Strengths                 | Trade-offs                             |
| ------ | ------------------------- | -------------------------------------- |
| Pandas | Simple, huge ecosystem    | Single machine, eager execution        |
| Polars | Fast, lazy, parallel      | Newer API, no index                    |
| DuckDB | SQL on files, fast joins  | SQL mindset, not a full DataFrame tool |
| Dask   | Scales Pandas code        | Overhead for small data                |

<aside> 💡 Learn Pandas first. Switch only when you can explain the performance problem you are solving.

</aside>

<aside> ⌨️ Hands on: Use this sample table. Recreate the groupby in DuckDB and compare the output to Pandas.

import pandas as pd
import duckdb

orders = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"],
        "amount": [120, 80, 90, 200],
    }
)

pandas_result = orders.groupby("region")["amount"].sum().reset_index()

con = duckdb.connect()
con.register("orders", orders)
duckdb_result = con.execute(
    "SELECT region, SUM(amount) AS total_revenue FROM orders GROUP BY region"
).df()

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=pandas_alternatives&exercise=w4_pandas_alternatives__choose_tool&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

  1. Install one alternative library and run a simple groupby query.
  2. Convert a Pandas DataFrame to Polars and back.
  3. Use DuckDB to query a CSV file without loading it into Pandas.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.