Week 4 - Data Processing with Pandas

8. Alternatives to Pandas

Content coming soon...

Suggested Topics

Why consider alternatives: Pandas limitations (performance, memory, parallelization)
Industry context: Pandas remains the standard, but alternatives are gaining traction

Polars: The Modern DataFrame Library

What is Polars: Rust-based DataFrame library with Python API
Key advantages: often much faster than Pandas (speedups of 10-100x are reported for some workloads, but results vary), lazy evaluation, better memory efficiency, parallel by default
When to use: Large datasets (>1GB), performance-critical pipelines, new projects
Basic syntax comparison: grouping (group_by in Polars, groupby in Pandas), filtering, aggregation vs Pandas
Lazy vs eager evaluation: scan_csv vs read_csv, building query plans, collect()
Key differences: no index, null values kept distinct from NaN, operations return new DataFrames instead of modifying in place, expression syntax
Converting between Polars and Pandas

DuckDB: SQL on DataFrames

What is DuckDB: In-process analytical database, embedded like SQLite but column-oriented and built for data analysis
Key advantages: SQL interface, no server, fast columnar engine, Parquet native
When to use: SQL analytics, querying Parquet files, joining multiple sources
Querying DataFrames: using SQL on Pandas DataFrames without conversion
Querying files directly: CSV and Parquet without first loading them into a DataFrame
Joining multiple data sources: mixing CSV, Parquet, and DataFrames
Converting results back to Pandas

Dask: Parallel Pandas for Large Datasets

What is Dask: Parallel computing library with Pandas-like API
Key advantages: Handles out-of-memory data, familiar API, parallel execution
When to use: Datasets larger than RAM, distributed computing
Basic operations: read_csv with blocksize, lazy evaluation, compute()
Partitions: how Dask splits data for parallel processing
Limitations: not all Pandas operations supported, overhead for small datasets
Working with multiple files: processing sales_*.csv patterns

Comparison and Decision Guide

Comparison table: speed, memory limits, API style, learning curve, use cases
Decision framework: when to stick with Pandas vs choosing an alternative
Conversion patterns: moving data between Pandas, Polars, DuckDB, Dask
Practical example: same task (filter, group, export) in all four tools
Installation and setup: pip install commands
Key takeaway: Learn Pandas first, add alternatives when needed

CC BY-NC-SA 4.0

*https://hackyourfuture.net/*