# Week 4 - Data Processing with Pandas

- Introduction to Pandas and DataFrames
- Selecting, Filtering, and Sorting Data
- Joining and Merging DataFrames
- Working with Strings and Dates
- Assignment: MessyCorp Goes Pandas
Pandas is the standard, but it is not always the best tool for large or highly parallel workloads. Knowing the alternatives helps you choose the right tool when data grows beyond laptop scale.
Polars is a Rust-based DataFrame library with a Python API. It is fast, memory efficient, and supports both eager execution (the default) and an optional lazy mode that optimizes the whole query before running it.
import polars as pl
orders = pl.read_csv("data/orders.csv")
result = (
    orders
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_revenue"))
)
When to use: large datasets, performance-critical pipelines, or when you want lazy evaluation.
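The lazy idea itself is easy to demonstrate without Polars: each step only describes work, and nothing runs until you ask for a result. A minimal sketch using plain Python generators (the data here is made up for illustration):

```python
# Lazy evaluation sketch: each step builds a plan (a generator chain);
# nothing is computed until we materialize the result with sum().
rows = [
    {"region": "NL", "amount": 120},
    {"region": "NL", "amount": 80},
    {"region": "BE", "amount": 90},
]

# Build the "query" lazily: filter, then project a column.
nl_rows = (row for row in rows if row["region"] == "NL")  # not executed yet
amounts = (row["amount"] for row in nl_rows)              # still not executed

# Only this line actually walks the data.
total = sum(amounts)
print(total)  # 200
```

Lazy engines like Polars take this further: because they see the whole plan before running it, they can reorder and fuse steps for speed.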
DuckDB is an embedded analytical database. It runs SQL directly on files like Parquet and CSV.
import duckdb
result = duckdb.sql(
    """
    SELECT region, SUM(amount) AS total_revenue
    FROM 'data/orders.csv'
    GROUP BY region
    """
).df()
<aside> 🤓 Curious Geek: Embedded analytics
DuckDB is embedded like SQLite, but optimized for analytical queries on columnar data.
</aside>
When to use: SQL-first analytics, quick exploration of Parquet files, and complex joins.
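For comparison, the same group-by works in Python's built-in sqlite3, the embedded row-store that DuckDB is often compared to. A small sketch with made-up values, so you can try the SQL without installing anything:

```python
import sqlite3

# An embedded, in-process database: no server, just like DuckDB.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("NL", 120), ("NL", 80), ("BE", 90), ("DE", 200)],
)

rows = con.execute(
    """
    SELECT region, SUM(amount) AS total_revenue
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()
print(rows)  # [('BE', 90), ('DE', 200), ('NL', 200)]
```

The difference is under the hood: SQLite stores data row by row for transactional work, while DuckDB stores it in columns so analytical scans and aggregations are much faster.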
Dask scales a Pandas-like API across multiple cores or machines.
import dask.dataframe as dd
orders = dd.read_csv("data/orders_*.csv")
result = orders.groupby("region")["amount"].sum().compute()
When to use: datasets larger than RAM and existing Pandas pipelines that need to scale.
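Dask's core trick, splitting the data into partitions, aggregating each partition independently, then combining the partial results, can be sketched in plain Python (the data is made up; real Dask also schedules the partitions in parallel and across machines):

```python
from collections import Counter

# Pretend each chunk is one partition of a much larger dataset,
# e.g. one of the orders_*.csv files.
partitions = [
    [("NL", 120), ("NL", 80)],
    [("BE", 90), ("DE", 200)],
]

def partial_sum(chunk):
    """Aggregate one partition on its own (this step can run in parallel)."""
    totals = Counter()
    for region, amount in chunk:
        totals[region] += amount
    return totals

# Combine step: merge the per-partition results into the final answer.
result = Counter()
for partial in map(partial_sum, partitions):
    result.update(partial)  # Counter.update adds the counts together

print(dict(result))  # {'NL': 200, 'BE': 90, 'DE': 200}
```

This split/aggregate/combine pattern is why `.compute()` exists: like the lazy Polars plan, the Dask graph is only executed when you ask for the final result.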
| Tool | Strengths | Trade-offs |
|---|---|---|
| Pandas | Simple, huge ecosystem | Single machine, eager execution |
| Polars | Fast, lazy, parallel | Newer API, no index |
| DuckDB | SQL on files, fast joins | SQL mindset, not a full DataFrame tool |
| Dask | Scales Pandas code | Overhead for small data |
<aside> 💡 Learn Pandas first. Switch only when you can explain the performance problem you are solving.
</aside>
<aside> ⌨️ Hands on: Use this sample table. Recreate the groupby in DuckDB and compare the output to Pandas.
import pandas as pd
import duckdb
orders = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"],
        "amount": [120, 80, 90, 200],
    }
)
pandas_result = orders.groupby("region")["amount"].sum().reset_index()
con = duckdb.connect()
con.register("orders", orders)
duckdb_result = con.execute(
    "SELECT region, SUM(amount) AS total_revenue FROM orders GROUP BY region"
).df()
🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=pandas_alternatives&exercise=w4_pandas_alternatives__choose_tool&lang=python
💭 The widget uses plain Python lists of dictionaries to mimic a table.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.