📚 Going Further: Optional Deep Dives
This page is optional. Nothing here is required for Week 4's learning goals or the assignment. Use it after you finish the week if you want to keep learning, or come back later when a specific topic starts mattering in your day-to-day work.
Sections are grouped by topic: full courses and tutorials for the long-form route, deep dives by concept for going one layer below this week's chapters, videos for talks worth watching, and community and books for the bigger picture.
<aside>
💡 Links that already appear inside a Week 4 chapter's Extra reading section are not duplicated here. This page is the home for resources that are too broad to fit any single chapter, or that go meaningfully deeper than the chapter's tightly-scoped reading.
</aside>
Full courses and tutorials
The chapters' Extra reading sections deliberately stay short. The big ones live here.
- DataTalks.Club: Data Engineering Zoomcamp: free multi-week course. The dbt and Spark modules are the natural extensions of Week 4's transformation patterns: dbt turns your groupby().agg() logic into version-controlled SQL models; Spark scales the same logic across a cluster.
- Real Python: Pandas tutorials: a full learning path covering everything from basic DataFrames through time-series, advanced groupby, and Pandas internals. The chapters here are tightly scoped; this path goes deeper on every topic.
- Pandas official 10 minutes to pandas: the official tour. Covers everything in Week 4's scope in a single scrollable page: useful as a quick-reference companion when you hit the assignment.
- Jake VanderPlas: Python Data Science Handbook (Pandas chapters): free online. Covers Pandas from first principles through advanced indexing. Chapter 3 maps directly to Week 4.
Deep dives by concept
One layer below what each chapter taught.
Polars
- Polars user guide: the official docs. Start with "Getting started", then read the "Expressions" and "Lazy API" chapters. The lazy API is a large part of why Polars is faster than Pandas on large files: it defers computation until .collect() and optimizes the query plan in between.
- Polars migration guide: coming from Pandas: the official migration reference. Maps common Pandas operations to their Polars equivalents side by side. The mental model shift from eager (Pandas) to lazy (Polars) is the hardest part: look for the "Lazy / Eager" section.
Dask
- Dask documentation: the official docs. Start with "Why Dask?" then "Dask DataFrames". Dask DataFrames implement a large subset of the Pandas API, with one key difference: operations build lazy task graphs and nothing runs until .compute() is called, at which point Dask parallelizes the work across cores or nodes.
- Dask tutorial (GitHub): interactive Jupyter notebooks. Run them locally after pip install "dask[complete]" (the quotes keep shells such as zsh from interpreting the brackets). The DataFrame section is directly relevant to Week 4.
Spark and PySpark
- PySpark official documentation: PySpark mirrors Pandas in structure (DataFrames, select, groupby, join) but runs on a distributed cluster. The API differences (.select() instead of df["col"], lazy evaluation by default) are the main mental shift.
- Databricks: Introduction to PySpark: if your future employer uses Databricks, start here. Databricks wraps Spark with a managed notebook environment and Delta Lake storage.
- DataTalks.Club Spark Module: free Spark + PySpark module from the DE Zoomcamp. Builds on the same groupby/join mental model you built in Week 4.
Visualization beyond Matplotlib
- Seaborn documentation: Seaborn wraps Matplotlib with a higher-level API for statistical plots (heatmaps, regression plots, violin plots). It reads directly from Pandas DataFrames. Start with the "Gallery" to see what it produces, then "Tutorial" to understand when to reach for it vs Matplotlib.
- Plotly Express documentation: Plotly makes interactive HTML charts (hover, zoom, filter) with an API similar to Seaborn's. Used heavily in data analyst roles where stakeholders want explorable charts rather than saved PNGs.
- Altair documentation: grammar-of-graphics style charting for Python (similar in spirit to R's ggplot2). Declarative and elegant for multi-dimensional data. Worth knowing as a vocabulary, even if you use Plotly day to day.
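To show what "reads directly from Pandas DataFrames" looks like in practice, here is a small Seaborn sketch. The sales data is made up, and the non-interactive Agg backend and temporary output path are assumptions chosen so the script runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to file, no window needed

import pandas as pd
import seaborn as sns

# Hypothetical sample data.
sales = pd.DataFrame(
    {
        "region": ["EU", "EU", "US", "US"],
        "revenue": [100, 140, 90, 130],
    }
)

# Seaborn takes the DataFrame plus column names; by default barplot
# aggregates each category with the mean and labels the axes for you.
ax = sns.barplot(data=sales, x="region", y="revenue")
ax.set_title("Mean revenue per region")

# Save the chart as a PNG in a temporary directory.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "revenue_by_region.png"
ax.figure.savefig(out_path)
print(out_path)
```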
Parquet and Delta Lake
- PyArrow documentation: Parquet: Pandas' to_parquet uses PyArrow under the hood by default when it is installed. This page shows the lower-level API: writing with explicit schemas, controlling row-group size, and reading with predicate pushdown (only load the rows you need).
- delta-rs documentation: Delta Lake adds ACID transactions and time-travel on top of Parquet. If you join a Databricks or Microsoft Fabric team, you will write Delta tables instead of plain Parquet. delta-rs is the Rust implementation of Delta Lake; its Python bindings are published as the deltalake package.
dbt transformations
- dbt: What is dbt?: dbt turns your groupby().agg() logic into version-controlled SQL models with dependency graphs, tests, and documentation. Your Week 4 Python transformations and dbt's SQL models solve the same problem; dbt does it inside the warehouse, with automatic lineage.
- dbt Learn: Quickstart for dbt Core: walks you through your first dbt project from dbt init to dbt build. Takes about an hour.
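A dbt model is, at its core, a SELECT statement. The sketch below does not use dbt at all: it assumes an in-memory SQLite database as a stand-in for the warehouse and a made-up orders table, purely to show that a groupby().agg() and the equivalent GROUP BY query produce the same result.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data.
orders = pd.DataFrame(
    {
        "region": ["EU", "EU", "US", "US"],
        "amount": [10.0, 20.0, 5.0, 15.0],
    }
)

# The Week 4 way: aggregate in Pandas.
pandas_result = (
    orders.groupby("region", as_index=False)
    .agg(total_amount=("amount", "sum"), order_count=("amount", "count"))
    .sort_values("region")
    .reset_index(drop=True)
)

# The SQL way: the kind of SELECT a dbt model file would contain.
sql = """
SELECT region,
       SUM(amount)   AS total_amount,
       COUNT(amount) AS order_count
FROM orders
GROUP BY region
ORDER BY region
"""

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
try:
    orders.to_sql("orders", conn, index=False)
    sql_result = pd.read_sql_query(sql, conn)
finally:
    conn.close()

print(pandas_result)
print(sql_result)
```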
Videos
Longer talks that go beyond what fits in a chapter.
Community
Books
- Python for Data Analysis (Wes McKinney, 3rd ed., O'Reilly 2022): the canonical Pandas reference, written by Pandas' creator. Dense, comprehensive, and still current on the 2.x API. Available free via many library subscriptions (O'Reilly Learning via Safari, Amsterdam Openbare Bibliotheek).
- Effective Pandas (Matt Harrison, 2021): short, opinionated, focused on the patterns that matter in production: method chaining, named aggregations, and .assign() over df[...] = assignments. A fast read that directly overlaps with Week 4.
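The three patterns named above (method chaining, named aggregations, and .assign() instead of column assignment) fit together in one short pipeline. This is an illustrative sketch on made-up data, not an excerpt from the book.

```python
import pandas as pd

# Hypothetical sample data.
orders = pd.DataFrame(
    {
        "region": ["EU", "EU", "US", "US"],
        "quantity": [2, 1, 3, 4],
        "unit_price": [10.0, 20.0, 5.0, 2.5],
    }
)

summary = (
    orders
    # .assign() returns a new DataFrame, so the original is left unchanged.
    .assign(revenue=lambda df: df["quantity"] * df["unit_price"])
    .groupby("region", as_index=False)
    # Named aggregation: output_column=(input_column, function).
    .agg(
        total_revenue=("revenue", "sum"),
        largest_order=("revenue", "max"),
    )
    .sort_values("total_revenue", ascending=False)
    .reset_index(drop=True)
)

print(summary)
```

Because every step returns a new DataFrame, the whole transformation reads top to bottom and the input table never picks up a stray revenue column.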
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0
*https://hackyourfuture.net/*