Week 4: Data Processing

Alternatives to Pandas

This chapter helps you decide when Pandas is still the right tool and when a larger workload may need a different DataFrame engine. For the Week 4 assignment, Pandas is the right choice.

By the end of this chapter, you should be able to name two common Pandas alternatives, explain when Pandas starts to hurt, and make a practical tool choice based on data size and pipeline constraints.

<aside> 📦 Run the examples: companion_ch9_pandas_alternatives.py. Run it in the Codespace or clone it locally to follow along with this chapter.

</aside>

Concepts

When Pandas Starts to Hurt

Pandas loads the whole dataset into memory and does most of its work on one core of a single machine. For most Week 4 tasks this is perfect: the code is readable, the ecosystem is mature, and every teammate can understand the pipeline.

Consider an alternative only when you can name the actual problem, for example a dataset that no longer fits in memory or a step that is still too slow after profiling.

<aside> 💡 Learn Pandas first. Switch tools only when Pandas is the bottleneck, not because another tool sounds faster in theory.

</aside>

Polars

Polars is a Rust-based DataFrame library with a Python API. It is fast, memory-efficient, and supports lazy execution via LazyFrame: operations are collected into a query plan and optimized before the data is processed.

The API is similar to Pandas but not identical. A simple select and filter:

import polars as pl

# Small sample table of orders.
orders = pl.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"],
        "amount": [120, 80, 90, 200],
    }
)

# Keep orders above 100, then select the columns to return.
result = orders.filter(pl.col("amount") > 100).select(["region", "amount"])

For aggregations, Polars uses expressions such as pl.col():

# Sum the amounts per region and name the output column total_revenue.
result = (
    orders
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_revenue"))
)
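
For comparison, here is a sketch of the same filter and aggregation written in Pandas. It is one possible translation, not the only one; the names large_orders and totals are chosen for this illustration.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"],
        "amount": [120, 80, 90, 200],
    }
)

# Filter rows with a boolean mask and keep two columns in one .loc call.
# Polars version: orders.filter(pl.col("amount") > 100).select([...])
large_orders = orders.loc[orders["amount"] > 100, ["region", "amount"]]

# as_index=False keeps "region" as a regular column instead of the index,
# which is closer to the Polars result (Polars has no index).
totals = (
    orders
    .groupby("region", as_index=False)
    .agg(total_revenue=("amount", "sum"))
)

print(large_orders)
print(totals)
```

The main difference is visible in the filter: Pandas uses a boolean mask on the DataFrame itself, while Polars describes the condition as an expression with pl.col().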

When to consider it: larger local datasets, performance-critical pipelines, or projects where lazy evaluation helps avoid unnecessary work.

Dask

Dask scales a Pandas-like API across multiple cores or machines. It builds a task graph and runs the work in pieces, which helps when one machine cannot process the whole job comfortably.

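import dask.dataframe as dd
import pandas as pd

# Start from a regular Pandas DataFrame.
pandas_orders = pd.DataFrame(
    {"region": ["NL", "NL", "BE", "DE"], "amount": [120, 80, 90, 200]}
)

# One partition is enough for this tiny table; real workloads use more.
orders = dd.from_pandas(pandas_orders, npartitions=1)
# Nothing is calculated until .compute() runs the task graph.
result = orders.groupby("region")["amount"].sum().compute()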

When to consider it: datasets larger than memory, or existing Pandas-style pipelines that need parallel execution.

Quick Comparison

| Tool | Strengths | Trade-offs |
| --- | --- | --- |
| Pandas | Simple, mature, huge ecosystem | Single-machine workflows |
| Polars | Fast, lazy, parallel | Newer API, no index |
| Dask | Scales Pandas-like code | Overhead for small data |

For a small assignment dataset, stay in Pandas. For a production dataset that no longer fits in memory or takes too long after profiling, compare Polars and Dask with a small proof of concept.

<aside> 🤓 Curious Geek: Many "fast DataFrame" benchmarks measure one narrow operation on one machine. Your pipeline may be limited by file reading, joins, memory, network storage, or code clarity instead. Benchmark the slow step you actually have before changing tools.

</aside>
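
One way to benchmark the step you actually have is to time each stage of the pipeline separately. The sketch below uses only the standard library and Pandas. The helper time_step is hypothetical (written for this example), and the data is the sample table repeated to make the timings measurable, not real order data.

```python
import time

import pandas as pd

# Synthetic data: the four sample rows repeated 50,000 times.
orders = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"] * 50_000,
        "amount": [120, 80, 90, 200] * 50_000,
    }
)


def time_step(label, func):
    """Run func once, print the elapsed wall-clock time, return its result."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result


# Time each pipeline step on its own to see which one is slow.
filtered = time_step("filter", lambda: orders[orders["amount"] > 100])
totals = time_step(
    "groupby", lambda: filtered.groupby("region")["amount"].sum()
)
```

A single run is noisy, so treat the numbers as a rough indication. If one step clearly dominates, that step is the one worth testing in another tool.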

⌨️ Hands on: Decide if Pandas Fits

Use this sample table. Check its shape and memory usage, then decide whether Pandas is still appropriate.

import pandas as pd

orders = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"],
        "amount": [120, 80, 90, 200],
    }
)

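# shape is a (rows, columns) tuple.
rows, columns = orders.shape
# deep=True also measures the string values; bytes / 1,000,000 gives MB.
memory_mb = orders.memory_usage(deep=True).sum() / 1_000_000
print(f"rows={rows}, columns={columns}, memory_mb={memory_mb:.4f}")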

For this dataset, Pandas is the right tool: it is tiny, readable, and easy to run anywhere.
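
If the total looks large on a real dataset, a per-column breakdown shows where the memory goes. The sketch below compares the default measurement with deep=True. The name report is chosen for this illustration, and the exact byte counts depend on your Pandas version and how it stores strings.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"],
        "amount": [120, 80, 90, 200],
    }
)

# Bytes per column; index=False leaves out the row index.
shallow = orders.memory_usage(index=False)
# deep=True also inspects the values themselves, which matters for
# text columns stored as Python objects.
deep = orders.memory_usage(index=False, deep=True)

report = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(report)
```

Numeric columns report the same size either way. Text columns are usually the ones where the deep measurement is noticeably larger.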

Exercises

  1. Inspect a DataFrame with .shape and .memory_usage(deep=True).
  2. Write one sentence explaining whether Pandas is still a good fit.
  3. Name the specific bottleneck you would need to see before considering Polars or Dask.

Extra reading

<aside> 💡 Using AI to help: Describe your data scale and pipeline constraints, for example "500 MB CSV files, groupby and join, runs on a laptop", and ask an LLM to recommend Pandas, Polars, or Dask with a one-paragraph rationale. ⚠️ Do not include PII or sensitive company data in the prompt. Verify the recommendation against the comparison table above before adding a new dependency.

</aside>

Test your recall before moving on.

<aside> 🚀 Try it in the widget: Interactive Quiz: Alternatives to Pandas

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch9_pandas_alternatives_quiz&embed=1

Next up: Jupyter Notebooks, where you learn when notebooks are useful and when to graduate work into scripts.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
