Indicative as of May 2026: see Sources for current numbers.
This page answers two questions students ask every week: why am I learning this, and how does it help me find a job?
It is scoped to Week 4 content (tabular data processing: Pandas DataFrames, selecting and filtering, groupby and aggregation, joining and merging, string and datetime cleaning, advanced transformations with pivot and melt, writing CSV / Parquet / SQLite, visualization with Matplotlib, and a brief introduction to Polars and Dask). Other weeks' career pages each cover their own week's tools, not these. Generic NL junior-data career content (salary bands, day-to-day work, what employers do not expect from juniors) belongs on one shared page across the curriculum and is not repeated here.
The numbers below are a rough reading of public NL postings as of May 2026. They are indicative, not measured. A separate project crawls Dutch data postings and will replace the qualitative claims here with measured percentages once the dataset is ready; placeholders are marked ~XX% for that swap.
Pandas sits at or near the center of the tool list for most data roles. NL postings phrase it in many ways ("transforms data for reporting", "cleans and aggregates datasets", "builds transformation logic in Python"), but the underlying skills are the same: load, clean, join, aggregate, and write. Week 4 covers all five.
| Role | Postings expecting Week 4's processing stack | What the posting expects |
|---|---|---|
| Data Engineer (DE) | ~XX% (high, likely 85%+) | "Transforms raw data into clean outputs for the warehouse", "writes pandas-based ETL scripts", "familiar with Parquet and columnar storage". DE postings treat Pandas as a baseline, not a differentiator. They expect it even when the posting does not name it. |
| Analytics Engineer (AE) | ~XX% (high, likely 70%+) | "Comfortable in Python for data wrangling", "builds aggregations and joins for dbt seeds or staging tables". AE roles lean heavily on groupby, merge, and pivot because those are the Python equivalents of the SQL transformations they do in dbt. |
| Data Scientist (DS) | ~XX% (high, likely 75%+) | "Python / Pandas for EDA and feature engineering", "cleans training datasets, engineers features from raw events". DS postings name Pandas almost universally; it is where raw event data becomes model inputs. |
| Data Analyst (DA) | ~XX% (mid, likely 50-60%) | "Python for data manipulation and reporting", "creates summary tables from SQL exports". DA roles expect simpler Pandas use: read a CSV, clean it, groupby, export. The advanced transformations (pivot, window, multi-key joins) appear in medior DA postings. |
The directional shape: Week 4 maps onto every data role, and Pandas is the most widely expected Python skill in NL data postings. If a posting says "Python" and lists a data role, it almost always implies Pandas.
The chapters teach Pandas as the default and introduce alternatives at the end. NL postings name a wider range of tools at medior level.
| Concept | Tool taught | Common NL alternatives | Practical implication |
|---|---|---|---|
| DataFrame manipulation | pandas | polars, dask, plain numpy | Pandas is universal. Polars is growing fast in NL: postings from performance-focused teams (FinTech, logistics) name it explicitly. Dask shows up in postings that mention "large-scale" or "distributed". The mental model (filter, groupby, join) transfers; the API differs. |
| Joining and merging | pd.merge | SQL joins via dbt models, SQLAlchemy, or Spark | At scale, the join often moves out of Python and into SQL (dbt) or Spark. The Week 4 join patterns are the foundation that makes dbt's ref() + join strategy readable. |
| String/date cleaning | .str, .dt | re, dateutil, arrow | The .str and .dt accessors cover 90% of real-world cleaning. Regex (re) appears in postings that mention "log parsing" or "unstructured text". arrow for timezone-aware datetime work shows up in FinTech and scheduling roles. |
| Aggregation | groupby + agg | SQL GROUP BY via dbt models or a metrics layer | The groupby mental model (split → apply → combine) is identical to SQL GROUP BY. Knowing both lets you choose: do this in Python or push it to SQL? |
| Writing outputs | to_csv, to_parquet, to_sql | pyarrow (lower-level Parquet), delta-rs (Delta Lake), fastparquet | Pandas' Parquet writer uses PyArrow under the hood. Postings that name Parquet explicitly often also name PyArrow. Delta Lake (delta-rs) shows up in NL postings from Databricks-adjacent teams. |
| Visualization | matplotlib via df.plot | seaborn, plotly, altair, Power BI / Tableau for dashboards | Matplotlib is the floor; seaborn (statistical) and plotly (interactive) are common in DS and DA roles. Dashboard delivery via Power BI or Tableau is a separate skill: Week 4 teaches quick diagnostic charts, not BI tool authoring. |
| Big-data processing | Polars, Dask (intro) | spark (PySpark), Databricks, ray | PySpark appears in NL postings at medior/senior DE level, especially in Databricks shops. Week 4's Polars and Dask intro plants the "not always Pandas" mindset; PySpark is the next step for large-scale pipelines. |
What this means for your CV: lead with "Python data processing (Pandas: groupby, merge, pivot; Parquet/SQLite writes; Matplotlib charts)" as a single phrase. Mention Polars or Dask only if you explored them beyond the Ch9 reading; leave PySpark for later.
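The Aggregation row in the table above says the groupby mental model matches SQL GROUP BY. A minimal sketch of that comparison follows, assuming a small invented `sales` table (the column names `region` and `amount` are hypothetical) and an in-memory SQLite database as a stand-in for a warehouse:

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; column names are invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "amount": [100.0, 50.0, 20.0, 30.0, 10.0],
    }
)

# Pandas: split by region, apply the aggregations, combine into one table.
by_region_pandas = sales.groupby("region", as_index=False).agg(
    total_amount=("amount", "sum"),
    order_count=("amount", "count"),
)

# SQL: the same aggregation with GROUP BY, run against in-memory SQLite.
conn = sqlite3.connect(":memory:")
try:
    sales.to_sql("sales", conn, index=False)
    by_region_sql = pd.read_sql_query(
        """
        SELECT region,
               SUM(amount)   AS total_amount,
               COUNT(amount) AS order_count
        FROM sales
        GROUP BY region
        ORDER BY region
        """,
        conn,
    )
finally:
    conn.close()

print(by_region_pandas)
print(by_region_sql)
```

Both results contain one row per region with the same totals and counts, which is the point an interviewer is usually probing when they ask whether you would aggregate in Python or in SQL.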
Postings phrase the data-processing bar at three levels:
Week 4 does not yet practice production-scale Pandas (chunked reads, Parquet partitioning), Spark / Databricks, or dbt transformations. Those appear in later weeks (the warehouse week introduces dbt; the distributed-compute module handles Spark). Week 4 is the transformation vocabulary that every later week assumes.
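For orientation only, a chunked read might look roughly like the sketch below. It is not part of the Week 4 material; the CSV content is invented, and an in-memory string stands in for a file too large to load at once.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
rows = [f"{'north' if i % 2 == 0 else 'south'},{i}" for i in range(10)]
csv_text = "region,amount\n" + "\n".join(rows)

totals = {}
# chunksize makes read_csv yield DataFrames of at most 4 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partial = chunk.groupby("region")["amount"].sum()
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + int(amount)

print(totals)
```

The pattern is the same groupby you already know, applied per chunk and then combined, which is why the Week 4 vocabulary carries over.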
Strong line a student can copy-adapt:
Rebuilt a sales-data pipeline from pure-Python loops to Pandas: loaded two CSV sources, cleaned messy strings and dates with `.str` and `.dt` vectorized operations, joined sales to customers on email with `pd.merge`, produced four aggregated report tables using named `groupby().agg()` calls, and wrote outputs to CSV, Parquet, and SQLite. Added a bar chart saved to `output/` with Matplotlib. Uploaded the final report to Azure Blob Storage using the Azure SDK, reading credentials from environment variables.
Recruiter keywords this carries: Python, pandas, ETL, data cleaning, groupby, merge, join, Parquet, SQLite, Matplotlib, Azure Blob Storage, vectorized operations, datetime parsing, data pipeline, data wrangling.
Weaker alternative for contrast (avoid):
Wrote a Python script that processes sales data.
The weaker version drops every keyword except Python. The strong version names the specific patterns (named aggregations, vectorized cleaning, Parquet output, Azure upload) that signal the candidate has moved beyond beginner tutorials.
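If an interviewer asks what "vectorized cleaning" and "named aggregations" mean in practice, a small sketch like the following can help you explain them. The data, column names, and output file names are invented for illustration and are not the assignment's actual schema.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical messy input; the column names and values are invented.
sales = pd.DataFrame(
    {
        "customer_email": ["  Ann@Example.com", "bob@example.com ", "ANN@example.com"],
        "order_date": ["2026-01-05", "2026-01-17", "2026-02-02"],
        "amount": [120.0, 80.0, 40.0],
    }
)

# Vectorized string cleaning: each .str call runs on the whole column.
sales["customer_email"] = sales["customer_email"].str.strip().str.lower()

# Vectorized date parsing, then a .dt accessor to derive a month label.
sales["order_date"] = pd.to_datetime(sales["order_date"], format="%Y-%m-%d")
sales["order_month"] = sales["order_date"].dt.strftime("%Y-%m")

# Named aggregations: each output column gets an explicit name.
per_customer = sales.groupby("customer_email", as_index=False).agg(
    total_spent=("amount", "sum"),
    order_count=("amount", "count"),
)

# Write the report to CSV and SQLite in a temporary directory.
# Parquet output would use per_customer.to_parquet(...), which needs
# pyarrow or fastparquet installed, so it is left out of this sketch.
out_dir = Path(tempfile.mkdtemp())
per_customer.to_csv(out_dir / "per_customer.csv", index=False)
conn = sqlite3.connect(out_dir / "report.db")
try:
    per_customer.to_sql("per_customer", conn, index=False, if_exists="replace")
finally:
    conn.close()

print(per_customer)
```

Note that no Python loop touches individual rows: every cleaning step is a single call on a column.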
When asked "tell me about a Python data project you have built", the Week 4 assignment gives you a 90-second answer:
I rebuilt a sales-data pipeline using Pandas for a fictional company called MessyCorp. It loads two CSV sources (sales and customers), cleans messy strings and dates with vectorized `.str` and `.dt` operations, joins them on a customer email key, then produces four aggregated report tables using named `groupby().agg()` calls. Three things I am proud of. First, the cleaning is fully vectorized: no Python loops, every operation runs on the whole column at once. Second, the joins are explicit: I used `indicator=True` on a trial run to count orphan orders before committing to the join type. Third, the output lands in multiple formats: CSV for the reporting team, Parquet for downstream analytics, and a bar chart saved with Matplotlib. I also uploaded the final report to Azure Blob Storage using the Azure SDK, reading credentials from an environment variable.
This answer hits six interview-relevant concepts: vectorized cleaning, join type awareness, named aggregations, multi-format output, cloud upload, and secrets management.
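The "join type awareness" point is the one interviewers most often ask you to demonstrate. A minimal sketch of the trial run described in the answer, assuming invented `sales` and `customers` tables keyed on a hypothetical `customer_email` column:

```python
import pandas as pd

# Hypothetical tables; the email key mirrors the join described above.
sales = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_email": [
            "ann@example.com",
            "bob@example.com",
            "ann@example.com",
            "zed@example.com",
        ],
        "amount": [120.0, 80.0, 40.0, 15.0],
    }
)
customers = pd.DataFrame(
    {
        "customer_email": ["ann@example.com", "bob@example.com", "cat@example.com"],
        "city": ["Amsterdam", "Utrecht", "Rotterdam"],
    }
)

# Trial run: an outer join with indicator=True adds a _merge column that
# labels each row as "both", "left_only", or "right_only".
trial = sales.merge(customers, on="customer_email", how="outer", indicator=True)
orphan_orders = int((trial["_merge"] == "left_only").sum())
customers_without_orders = int((trial["_merge"] == "right_only").sum())
print(f"orphan orders: {orphan_orders}, idle customers: {customers_without_orders}")

# Committed join: keep every order; validate raises if an email appears
# more than once in the customers table.
joined = sales.merge(
    customers, on="customer_email", how="left", validate="many_to_one"
)
```

Counting the `left_only` rows before choosing `how="left"` or `how="inner"` turns the join type into a decision you can justify rather than a default you accepted.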
Two honest follow-ups if asked "what would you do differently?":
Week 4 builds the transformation step, not the whole pipeline stack:
The `groupby().agg()` pattern is the Python equivalent of a dbt model. dbt adds version control for SQL models, dependency graphs, tests, and documentation. The transformation thinking you built this week transfers directly; the tooling is different.

Naming these honestly in an interview ("I built batch transformations with Pandas; I have not yet run these at warehouse scale or wired them into dbt") signals more maturity than overclaiming.
Mark this page indicative, not statistical. The ~XX% figures will be replaced with measured percentages once the postings-crawler project ships.
<aside> 💭 For generic NL junior data-career content (salary bands, day-to-day work, what employers do not expect from any junior), one shared page across all weeks is the right home. That page does not exist yet; for now, treat this page as Week-4-specific only.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
