Visualizing Data with Pandas

Visualization helps you spot outliers, trends, and data quality issues quickly. Even in data engineering, a simple chart can reveal broken joins, missing values, or unexpected spikes before your pipeline ships bad data.

By the end of this chapter, you should be able to create quick diagnostic charts from a DataFrame, group data before plotting, and save charts to files using Matplotlib.

<aside> 📦 Run the examples: companion_ch8_visualizing_data.py. Run it in the Codespace or clone it locally to follow along with this chapter.

</aside>

Concepts

import matplotlib
matplotlib.use("Agg")  # headless backend: no display required
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

Path("output").mkdir(exist_ok=True)

Quick Plots with DataFrame.plot

DataFrame.plot is built on top of Matplotlib, so a single line of code is enough to produce a quick chart.

import pandas as pd

sales = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "amount": [120, 90, 150],
    }
)

sales.plot(x="date", y="amount", kind="line", title="Daily revenue")

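Common plot types, selected with the kind argument, include line (trends over time), bar (comparing categories), hist (distributions), and scatter (relationships between two numeric columns).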
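As a rough sketch of a few common kinds, the snippet below draws one small table as a line chart, a bar chart, a histogram, and a scatter plot. The units column and the panel titles are made up for illustration; the kind values themselves are standard pandas options.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no display required
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data: "units" is a hypothetical column added for the scatter plot.
demo = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "amount": [120, 90, 150],
        "units": [4, 3, 5],
    }
)

# One figure with a 2x2 grid; each panel uses a different kind.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

demo.plot(x="date", y="amount", kind="line", ax=axes[0, 0], title="line: trend over time")
demo.plot(x="date", y="amount", kind="bar", ax=axes[0, 1], title="bar: compare categories")
demo["amount"].plot(kind="hist", bins=3, ax=axes[1, 0], title="hist: distribution")
demo.plot(x="units", y="amount", kind="scatter", ax=axes[1, 1], title="scatter: relationship")

fig.tight_layout()
plot_titles = [ax.get_title() for ax in axes.flat]
plt.close(fig)
```

Passing ax= places each chart in a specific panel, which is handy when you want to compare several views of the same data side by side.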

<aside> ⚠️ Always sort by date before plotting time series. Unsorted dates create misleading lines.

</aside>
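A minimal sketch of that advice, assuming the dates arrive as strings in arbitrary order (the unsorted table below is invented for illustration): convert the column to real datetimes, sort, and only then plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no display required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical input: rows arrive out of chronological order.
unsorted_sales = pd.DataFrame(
    {
        "date": ["2024-01-03", "2024-01-01", "2024-01-02"],
        "amount": [150, 120, 90],
    }
)

# Parse strings into datetimes so sorting is chronological, not alphabetical.
unsorted_sales["date"] = pd.to_datetime(unsorted_sales["date"])
sorted_sales = unsorted_sales.sort_values("date").reset_index(drop=True)

# A line plot connects points in row order, so the sorted frame gives a clean line.
ax = sorted_sales.plot(x="date", y="amount", kind="line", title="Daily revenue")
plt.close(ax.figure)
```

ISO-formatted strings such as 2024-01-03 happen to sort correctly as text, but other formats (for example 03/01/2024) do not, which is why the sketch converts to datetimes first.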

Even a simple line chart can reveal pipeline issues that summary statistics hide.

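<aside> 🤓 Curious Geek: Anscombe's quartet

Anscombe's quartet is a set of four small datasets with nearly identical summary statistics (mean, variance, correlation) that look completely different when plotted. Visualization protects you from false confidence in the numbers alone.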

</aside>
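To see this for yourself, the sketch below hard-codes the published values of Anscombe's quartet (typed in here rather than loaded from a library), computes the mean, variance, and correlation for each dataset, and saves one scatter plot per dataset. The output file name is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no display required
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": pd.DataFrame({"x": x123, "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]}),
    "II": pd.DataFrame({"x": x123, "y": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]}),
    "III": pd.DataFrame({"x": x123, "y": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]}),
    "IV": pd.DataFrame(
        {
            "x": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            "y": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
        }
    ),
}

# One row per dataset: the summary numbers come out almost the same.
summary = pd.DataFrame(
    {
        name: {
            "mean_y": df["y"].mean(),
            "var_y": df["y"].var(),
            "corr_xy": df["x"].corr(df["y"]),
        }
        for name, df in quartet.items()
    }
).T
print(summary.round(2))

# The scatter plots, in contrast, look nothing alike.
fig, axes = plt.subplots(1, 4, figsize=(12, 3), sharex=True, sharey=True)
for ax, (name, df) in zip(axes, quartet.items()):
    df.plot(x="x", y="y", kind="scatter", ax=ax, title=f"Dataset {name}")
fig.tight_layout()

Path("output").mkdir(exist_ok=True)
fig.savefig("output/anscombe.png")
plt.close(fig)
```

Open output/anscombe.png after running it and compare the four panels with the printed table.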

Group Then Plot

Most charts show aggregated data rather than raw rows. Aggregate first with groupby, then plot the result.

daily = sales.groupby("date", as_index=False)["amount"].sum()
daily.plot(x="date", y="amount", kind="line")

Matplotlib for More Control

Use Matplotlib directly when you need labels, size control, or export.

ax = daily.plot(x="date", y="amount", kind="line", figsize=(6, 3))
ax.set_xlabel("Date")
ax.set_ylabel("Revenue")
plt.tight_layout()
plt.savefig("output/daily_revenue.png")
plt.close()
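Another common pattern, sketched here as an alternative rather than the chapter's required approach, is to create the Figure and Axes yourself with plt.subplots and hand the Axes to pandas through the ax argument. The table, file name, and dpi value below are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no display required
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

# Illustrative stand-in for the aggregated daily table.
daily_demo = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
        "amount": [120, 90, 150],
    }
)

# Hypothetical output location; create the folder if it is missing.
out_path = Path("output") / "daily_revenue_explicit.png"
out_path.parent.mkdir(exist_ok=True)

# Create the figure explicitly, then let pandas draw onto its Axes.
fig, ax = plt.subplots(figsize=(6, 3))
daily_demo.plot(x="date", y="amount", kind="line", ax=ax, legend=False)
ax.set_title("Daily revenue")
ax.set_xlabel("Date")
ax.set_ylabel("Revenue")

fig.tight_layout()
fig.savefig(out_path, dpi=150)  # higher dpi gives a sharper image
plt.close(fig)  # free memory when generating many charts in a loop
```

Holding on to fig and ax makes it explicit which figure you are labelling and saving, which helps once a script produces more than one chart.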

<aside> 💡 Saving charts to files is useful for automated reports and dashboards.

</aside>

⌨️ Hands on: Plot Revenue by Region

Use this sample table. Group by region, sum amount, and plot a bar chart. Return the Axes object so it can be inspected.

import matplotlib
matplotlib.use("Agg")   # headless: must come before importing pyplot
import matplotlib.pyplot as plt
import pandas as pd

orders = pd.DataFrame(
    {
        "region": ["NL", "NL", "BE", "DE"],
        "amount": [120, 80, 200, 50],
    }
)

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=visualizing_data&exercise=w4_visualizing_data__daily_totals&lang=python

</aside>

Exercises

  1. Create a bar chart of total revenue by region.
  2. Plot a histogram of order amounts and identify the range with the most values.
  3. Save a chart to output/ and verify the file exists.

Extra reading

<aside> 📚 When your data grows past what Pandas handles comfortably, see the optional Alternatives to Pandas chapter or the Going Further page for deep dives on Polars and Dask.

</aside>

Knowledge Check

Test your recall before moving on.

<aside> 🚀 Try it in the widget: Interactive Quiz: Visualizing Data with Pandas

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch8_visualizing_data_quiz&embed=1

If Matplotlib's API felt unfamiliar, this video walks through creating and customizing your first plots from scratch.

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=UO98lJQ3QGI

You can also describe your chart goal to an LLM to get a starting point.

<aside> 💡 Using AI to help: Paste a description of your chart goal, for example "bar chart of revenue per region, sorted descending, saved to output/", and ask an LLM to write the Matplotlib code. ⚠️ Make sure the description contains no PII or sensitive company data. Run the generated code and tweak the labels before using the chart in a report.

</aside>

Ready to apply these skills? Try the practice exercise before moving on.

<aside> ⌨️ Hands on: Practice with Exercise 7: Visualize Revenue.

</aside>


Next up: Alternatives to Pandas, where you learn when Pandas is still the right tool and when a larger workload may need something else.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0. https://hackyourfuture.net/

Built with ❤️ by the HackYourFuture community · Thank you, contributors