Career relevance: Week 4

Indicative as of May 2026: see Sources for current numbers.

This page answers two questions students ask every week: why am I learning this, and how does it help me find a job?

It is scoped to Week 4 content: tabular data processing with Pandas DataFrames (selecting and filtering, groupby and aggregation, joining and merging, string and datetime cleaning, advanced transformations with pivot and melt), writing CSV / Parquet / SQLite, visualization with Matplotlib, and a brief introduction to Polars and Dask. Other weeks' career pages cover their own tools, not these. Generic NL junior-data career content (salary bands, day-to-day work, what employers do not expect from juniors) belongs on a single shared page for the whole curriculum and is not repeated here.

The numbers below are a rough reading of public NL postings as of May 2026. They are indicative, not measured. A separate project crawls Dutch data postings and will replace the qualitative claims here with measured percentages once the dataset is ready; placeholders are marked ~XX% for that swap.

How "data processing with Pandas" shows up in NL postings

Pandas appears on the tool list of nearly every data role. NL postings phrase it in many ways ("transforms data for reporting", "cleans and aggregates datasets", "builds transformation logic in Python"), but the underlying skills are the same five: load, clean, join, aggregate, and write. Week 4 covers all five.

| Role | Postings expecting Week 4's processing stack | What the posting expects |
| --- | --- | --- |
| Data Engineer (DE) | ~XX% (high, likely 85%+) | "Transforms raw data into clean outputs for the warehouse", "writes pandas-based ETL scripts", "familiar with Parquet and columnar storage". DE postings treat Pandas as a baseline, not a differentiator. They expect it without asking. |
| Analytics Engineer (AE) | ~XX% (high, likely 70%+) | "Comfortable in Python for data wrangling", "builds aggregations and joins for dbt seeds or staging tables". AE roles lean heavily on groupby, merge, and pivot because those are the Python equivalents of the SQL transformations they do in dbt. |
| Data Scientist (DS) | ~XX% (high, likely 75%+) | "Python / Pandas for EDA and feature engineering", "cleans training datasets, engineers features from raw events". DS postings name Pandas almost universally; it is where raw event data becomes model inputs. |
| Data Analyst (DA) | ~XX% (mid, likely 50-60%) | "Python for data manipulation and reporting", "creates summary tables from SQL exports". DA roles expect simpler Pandas use: read a CSV, clean it, groupby, export. The advanced transformations (pivot, window, multi-key joins) appear in medior DA postings. |

The directional shape: Week 4 maps onto every data role, and Pandas is the most widely expected Python skill in NL data postings. If a posting for a data role says "Python", it almost always means Pandas.

The Week 4 stack vs alternatives in NL

The chapters teach Pandas as the default and introduce alternatives at the end. NL postings name a wider range of tools at medior level.

| Concept | Tool taught | Common NL alternatives | Practical implication |
| --- | --- | --- | --- |
| DataFrame manipulation | pandas | polars, dask, plain numpy | Pandas is universal. Polars is growing fast in NL: postings from performance-focused teams (FinTech, logistics) name it explicitly. Dask shows up in postings that mention "large-scale" or "distributed". The mental model (filter, groupby, join) transfers; the API differs. |
| Joining and merging | pd.merge | SQL joins via dbt models, SQLAlchemy, or Spark | At scale, the join often moves out of Python and into SQL (dbt) or Spark. The Week 4 join patterns are the foundation that makes dbt's ref() + join strategy readable. |
| String/date cleaning | .str, .dt | re, dateutil, arrow | The .str and .dt accessors cover 90% of real-world cleaning. Regex (re) appears in postings that mention "log parsing" or "unstructured text". arrow for timezone-aware datetime work shows up in FinTech and scheduling roles. |
| Aggregation | groupby + agg | SQL GROUP BY via dbt models or a metrics layer | The groupby mental model (split → apply → combine) is identical to SQL GROUP BY. Knowing both lets you choose: do this in Python or push it to SQL? |
| Writing outputs | to_csv, to_parquet, to_sql | pyarrow (lower-level Parquet), delta-rs (Delta Lake), fastparquet | Pandas' Parquet writer uses PyArrow under the hood. Postings that name Parquet explicitly often also name PyArrow. Delta Lake (delta-rs) shows up in NL postings from Databricks-adjacent teams. |
| Visualization | matplotlib via df.plot | seaborn, plotly, altair, Power BI / Tableau for dashboards | Matplotlib is the floor; seaborn (statistical) and plotly (interactive) are common in DS and DA roles. Dashboard delivery via Power BI or Tableau is a separate skill: Week 4 teaches quick diagnostic charts, not BI tool authoring. |
| Big-data processing | Polars, Dask (intro) | spark (PySpark), Databricks, ray | PySpark appears in NL postings at medior/senior DE level, especially in Databricks shops. Week 4's Polars and Dask intro plants the "not always Pandas" mindset; PySpark is the next step for large-scale pipelines. |

What this means for your CV: lead with "Python data processing (Pandas: groupby, merge, pivot; Parquet/SQLite writes; Matplotlib charts)" as a single phrase. Mention Polars or Dask only if you explored them beyond the Ch9 reading; leave PySpark for later.
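
The aggregation row in the comparison above says that groupby follows the same mental model as SQL GROUP BY. A minimal sketch of that equivalence follows, assuming a hypothetical `orders` table with `region` and `amount` columns (both names are made up for illustration) and an in-memory SQLite database standing in for a real warehouse:

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; the column names are assumptions for this sketch.
orders = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    }
)

# Option 1: aggregate in Python (split by region, apply sum, combine).
in_python = (
    orders.groupby("region", as_index=False)
    .agg(total=("amount", "sum"))
    .sort_values("region")
    .reset_index(drop=True)
)

# Option 2: push the same aggregation down to SQL.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
in_sql = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region",
    conn,
)
conn.close()

# Both routes should yield the same totals per region.
print(in_python["total"].tolist() == in_sql["total"].tolist())
```

Being able to explain why both routes give the same result, and when you would pick one over the other, is the transferable part; the specific table here is only a stand-in.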

Junior vs medior expectations

Postings phrase the data-processing bar at three levels:

Week 4 does not yet practice production-scale Pandas (chunked reads, Parquet partitioning), Spark / Databricks, or dbt transformations. Those appear in later weeks (the warehouse week introduces dbt; the distributed-compute module handles Spark). Week 4 is the transformation vocabulary that every later week assumes.

How Week 4 work signals on a CV

Strong line a student can copy-adapt:

Rebuilt a sales-data pipeline from pure-Python loops to Pandas: loaded two CSV sources, cleaned messy strings and dates with .str and .dt vectorized operations, joined sales to customers on email with pd.merge, produced four aggregated report tables using named groupby().agg() calls, and wrote outputs to CSV, Parquet, and SQLite. Added a bar chart saved to output/ with Matplotlib. Uploaded the final report to Azure Blob Storage using the Azure SDK, reading credentials from environment variables.

Recruiter keywords this carries: Python, pandas, ETL, data cleaning, groupby, merge, join, Parquet, SQLite, Matplotlib, Azure Blob Storage, vectorized operations, datetime parsing, data pipeline, data wrangling.

Weaker alternative for contrast (avoid):

Wrote a Python script that processes sales data.

The weaker version drops every keyword. The strong version names the specific patterns (named aggregations, vectorized cleaning, Parquet output, Azure upload) that signal the candidate has moved beyond beginner tutorials.
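
To back up a CV line like this in conversation, it helps to be able to write the two named patterns from memory. The sketch below shows vectorized cleaning with `.str` and `.dt` followed by a named `groupby().agg()` call. The sample rows and column names (`email`, `order_date`, `amount`) are assumptions for illustration and are not taken from the assignment data:

```python
import pandas as pd

# Hypothetical sample rows; the real assignment files will look different.
sales = pd.DataFrame(
    {
        "email": ["  Ana@Example.com", "bob@example.com ", "ANA@example.com"],
        "order_date": ["2026-01-05", "2026-01-17", "2026-02-02"],
        "amount": [120.0, 80.0, 50.0],
    }
)

# Vectorized cleaning: each call runs on the whole column, no Python loop.
sales["email"] = sales["email"].str.strip().str.lower()
sales["order_date"] = pd.to_datetime(sales["order_date"], format="%Y-%m-%d")
sales["month"] = sales["order_date"].dt.to_period("M").astype(str)

# Named aggregation: output_column=(input_column, function).
report = sales.groupby("email", as_index=False).agg(
    total_amount=("amount", "sum"),
    order_count=("amount", "count"),
    last_order=("order_date", "max"),
)
print(report)
```

The named form makes the output column names explicit, which is the detail the strong CV line is pointing at.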

Interview phrasing for the Week 4 assignment

When asked "tell me about a Python data project you have built", the Week 4 assignment gives you a 90-second answer:

I rebuilt a sales-data pipeline using Pandas for a fictional company called MessyCorp. It loads two CSV sources (sales and customers), cleans messy strings and dates with vectorized .str and .dt operations, joins them on a customer email key, then produces four aggregated report tables using named groupby().agg() calls. Three things I am proud of. First, the cleaning is fully vectorized: no Python loops, every operation runs on the whole column at once. Second, the joins are explicit: I used indicator=True on a trial run to count orphan orders before committing to the join type. Third, the output lands in multiple formats: CSV for the reporting team, Parquet for downstream analytics, and a bar chart saved with Matplotlib. I also uploaded the final report to Azure Blob Storage using the Azure SDK, reading credentials from an environment variable.

This answer hits six interview-relevant concepts: vectorized cleaning, join type awareness, named aggregations, multi-format output, cloud upload, and secrets management.
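
If an interviewer asks how the orphan count with `indicator=True` works, a small sketch along these lines may help. The two DataFrames and their columns are hypothetical stand-ins for the sales and customers sources, and the choice of a left join with a `validate` check for the final step is one reasonable option, not necessarily what the assignment requires:

```python
import pandas as pd

# Hypothetical stand-ins for the two sources; names and values are assumptions.
sales = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "email": [
            "ana@example.com",
            "bob@example.com",
            "carl@example.com",
            "ana@example.com",
        ],
    }
)
customers = pd.DataFrame(
    {
        "email": ["ana@example.com", "bob@example.com", "dana@example.com"],
        "city": ["Utrecht", "Rotterdam", "Amsterdam"],
    }
)

# Trial run: an outer join with indicator=True adds a _merge column that
# records whether each row matched on both sides or only one.
trial = sales.merge(customers, on="email", how="outer", indicator=True)
counts = trial["_merge"].value_counts()
orphan_orders = int(counts["left_only"])  # orders with no customer record
customers_without_orders = int(counts["right_only"])

# With the counts known, commit to a join type. A left join keeps every
# order; validate raises if customer emails turn out not to be unique.
joined = sales.merge(customers, on="email", how="left", validate="many_to_one")
print(orphan_orders, customers_without_orders, len(joined))
```

The point to make in the interview is the order of operations: measure the mismatch first, then choose the join type deliberately.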

Two honest follow-ups if asked "what would you do differently?":

What Week 4 does not make you

Week 4 builds the transformation step, not the whole pipeline stack:

Naming these honestly in an interview ("I built batch transformations with Pandas; I have not yet run these at warehouse scale or wired them into dbt") signals more maturity than overclaiming.

Sources

Treat this page as indicative, not statistical. The ~XX% figures will be replaced with measured percentages once the postings-crawler project ships.


<aside> 💭 For generic NL junior data-career content (salary bands, day-to-day work, what employers do not expect from any junior), one shared page across all weeks is the right home. That page does not exist yet; for now, treat this page as Week-4-specific only.

</aside>


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
