Week 4 - Data Processing with Pandas

📈 Visualizing Data with Pandas

Visualization helps you spot outliers, trends, and data quality issues quickly. Even in data engineering, a simple chart can reveal broken joins, missing values, or unexpected spikes before your pipeline ships bad data.

Concepts

Quick Plots with DataFrame.plot

Pandas builds its plotting on top of Matplotlib, so with Matplotlib installed you can create a quick chart from a DataFrame in one line.

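import pandas as pd

sales = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "amount": [120, 90, 150],
    }
)
# Parse the date strings so the x-axis is a real time axis
sales["date"] = pd.to_datetime(sales["date"])

# One call draws a line chart of amount over date
sales.plot(x="date", y="amount", kind="line", title="Daily revenue")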

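Common plot types, chosen with the kind argument: "line" for trends over time, "bar" for comparing categories, "hist" for the distribution of one column, "box" for spread and outliers, and "scatter" for the relationship between two numeric columns.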

<aside> ⚠️ Always sort by date before plotting time series. A line chart connects points in row order, so unsorted rows produce a line that doubles back or an axis that is out of order.

</aside>
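The warning above can be sketched with a tiny example. The `unsorted` frame and its values are made up for illustration; the point is only that parsing the strings with `pd.to_datetime` and calling `sort_values` puts the rows in chronological order before anything is drawn.

```python
import pandas as pd

# Hypothetical data: rows arrive out of chronological order.
unsorted = pd.DataFrame(
    {
        "date": ["2024-01-03", "2024-01-01", "2024-01-02"],
        "amount": [150, 120, 90],
    }
)

# Parse the strings into real datetimes, then sort chronologically.
unsorted["date"] = pd.to_datetime(unsorted["date"])
ordered = unsorted.sort_values("date").reset_index(drop=True)

print(ordered["amount"].tolist())  # [120, 90, 150]
```

Calling `ordered.plot(x="date", y="amount", kind="line")` on the sorted frame then draws the line from left to right in time order.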

<aside> 🤓 Curious Geek: Anscombe's quartet

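Anscombe's quartet is a set of four small datasets with nearly identical summary statistics (mean, variance, correlation, and fitted regression line) that look completely different when plotted. Visualization protects you from the false confidence that summary numbers alone can give.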

</aside>

Group Then Plot

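Most charts show an aggregate rather than raw rows, so group first and plot the result. With as_index=False the grouping column stays a regular column, which lets you pass it as x.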

daily = sales.groupby("date", as_index=False)["amount"].sum()
daily.plot(x="date", y="amount", kind="line")

Matplotlib for More Control

DataFrame.plot returns a Matplotlib Axes object. Use it, together with Matplotlib directly, when you need axis labels, size control, or export to a file.

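from pathlib import Path

import matplotlib.pyplot as plt

# DataFrame.plot returns a Matplotlib Axes that can be customized further
ax = daily.plot(x="date", y="amount", kind="line", figsize=(6, 3))
ax.set_xlabel("Date")
ax.set_ylabel("Revenue")
plt.tight_layout()

# savefig does not create missing folders, so make sure output/ exists
Path("output").mkdir(exist_ok=True)
plt.savefig("output/daily_revenue.png")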

<aside> 💡 Saving charts to files is useful for automated reports and dashboards.

</aside>
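For an automated report, the script usually runs without a screen. The sketch below is one way to handle that, not the only one: it assumes the non-interactive `Agg` backend is acceptable, rebuilds a small `daily` frame so it runs on its own, and uses `out_path` as an illustrative name. It also closes the figure and checks that the file was written.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for the aggregated table (values are illustrative).
daily = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
        "amount": [120, 90, 150],
    }
)

# Make sure the target folder exists before saving.
out_path = Path("output") / "daily_revenue.png"
out_path.parent.mkdir(parents=True, exist_ok=True)

ax = daily.plot(x="date", y="amount", kind="line", figsize=(6, 3))
ax.set_xlabel("Date")
ax.set_ylabel("Revenue")

fig = ax.get_figure()
fig.tight_layout()
fig.savefig(out_path)
plt.close(fig)  # free the figure so long-running jobs do not accumulate them

print(out_path.exists())
```

Closing each figure after saving matters mostly in loops that produce many charts, where open figures would otherwise keep using memory.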

<aside> ⌨️ Hands on: Use this sample table. Group by date, calculate total revenue, and plot a line chart.

import pandas as pd

sales = pd.DataFrame(
    {
        "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03"],
        "amount": [120, 80, 90, 150],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=visualizing_data&exercise=w4_visualizing_data__daily_totals&lang=python

💭 The widget uses plain Python lists of dictionaries to produce line-chart points.

</aside>

Exercises

  1. Create a bar chart of total revenue by region.
  2. Plot a histogram of order amounts and identify the range with the most values.
  3. Save a chart to output/ and verify the file exists.



The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*
