Week 4 - Data Processing with Pandas


💾 Writing Data

Processing data is only useful if the result can be reused. That means writing clean outputs to files, databases, or cloud storage in a repeatable way.

Concepts

Writing CSV Files

output_path = "output/monthly_sales.csv"
monthly_sales.to_csv(output_path, index=False)

Use index=False unless you intentionally want the index as a column.
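
The difference is easy to see in a round trip. A minimal sketch (the `io.StringIO` buffer stands in for a file on disk):

```python
import io

import pandas as pd

df = pd.DataFrame({"region": ["NL", "BE"], "total_revenue": [320, 140]})

# Default: the index is written as an unnamed first column,
# which read_csv then picks up as "Unnamed: 0".
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False: the file round-trips with exactly the original columns.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)
```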

Writing Parquet Files

Parquet is a columnar format that compresses well and loads fast. Writing it requires a Parquet engine such as pyarrow (`pip install pyarrow`).

monthly_sales.to_parquet("output/monthly_sales.parquet", index=False)

<aside> 🤓 Curious Geek: Columnar storage

Columnar formats read only the columns you query, which is why analytics scans are faster.

</aside>

Writing to SQLite

SQLite is a great local database for analytics and testing.

import sqlite3

with sqlite3.connect("output/analytics.db") as conn:
    monthly_sales.to_sql("monthly_sales", conn, if_exists="replace", index=False)
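
The table can be queried straight back with `pd.read_sql`. A round-trip sketch using an in-memory database so nothing touches disk:

```python
import sqlite3

import pandas as pd

monthly_sales = pd.DataFrame({"region": ["NL", "BE"], "total_revenue": [320, 140]})

with sqlite3.connect(":memory:") as conn:
    monthly_sales.to_sql("monthly_sales", conn, if_exists="replace", index=False)

    # Any SQL works here: filters, joins, aggregations.
    restored = pd.read_sql(
        "SELECT * FROM monthly_sales ORDER BY total_revenue DESC", conn
    )
```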

Validate What You Wrote

Always read back a sample to confirm schema and content.

check = pd.read_parquet("output/monthly_sales.parquet")
assert len(check) == len(monthly_sales)

Azure Blob Storage (Cloud Output)

Azure Blob Storage is a straightforward way to store processed data in the cloud. You can upload CSV or Parquet outputs after your pipeline finishes.

from pathlib import Path
import os
from azure.storage.blob import BlobServiceClient

connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container_name = "week4-outputs"

file_path = Path("output/monthly_sales.parquet")
blob_name = file_path.name

service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)

if not container.exists():
    container.create_container()

with open(file_path, "rb") as file_handle:
    container.upload_blob(blob_name, file_handle, overwrite=True)

<aside> ⚠️ Never hardcode secrets. Store connection strings in environment variables or a .env file that is excluded from git.

</aside>
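
One way to honour that rule is to fail fast with a clear message when the variable is missing, instead of hitting a `KeyError` halfway through the pipeline. A small helper sketch (`require_env` is a hypothetical name, not part of any library):

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable, failing fast with a clear message."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Set {name} before running the pipeline")
    return value


# Usage (hypothetical):
# connection_string = require_env("AZURE_STORAGE_CONNECTION_STRING")
```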

<aside> ⌨️ Hands on: Use this sample table. Write it to both CSV and Parquet, then read both back and compare their row counts.

import pandas as pd

summary = pd.DataFrame(
    {
        "region": ["NL", "BE", "DE"],
        "total_revenue": [320, 140, 200],
        "order_count": [4, 2, 1],
    }
)

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=writing_results&exercise=w4_writing_results__row_count_match&lang=python

💭 The widget uses plain Python lists of dictionaries to mimic a table.

</aside>

Exercises

  1. Write an aggregated DataFrame to CSV and Parquet with index=False.
  2. Save the same data into a SQLite table called sales_summary.
  3. Upload one output file to Azure Blob Storage if you have access.

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.