Week 4 - Data Processing with Pandas
Introduction to Pandas and DataFrames
Selecting, Filtering, and Sorting Data
Joining and Merging DataFrames
Working with Strings and Dates
Assignment: MessyCorp Goes Pandas
Processing data is only useful if the result can be reused. That means writing clean outputs to files, databases, or cloud storage in a repeatable way.
output_path = "output/monthly_sales.csv"
monthly_sales.to_csv(output_path, index=False)
Use index=False unless you intentionally want the index as a column.
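A quick way to see the difference is to write the same frame twice, once with the default and once with `index=False`, using an in-memory buffer instead of a file (the small frame here is a stand-in for `monthly_sales`):

```python
import io

import pandas as pd

# Stand-in for monthly_sales
df = pd.DataFrame({"month": ["2024-01"], "revenue": [100]})

with_index = io.StringIO()
df.to_csv(with_index)  # default index=True writes an extra, unnamed leading column

without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # → ",month,revenue"
print(without_index.getvalue().splitlines()[0])  # → "month,revenue"
```

The stray empty header in the first output is the index column; anyone reading the CSV back would get it as an extra `Unnamed: 0` column.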
Parquet is a columnar format that compresses well and loads fast.
monthly_sales.to_parquet("output/monthly_sales.parquet", index=False)
<aside> 🤓 Curious Geek: Columnar storage
Columnar formats read only the columns you query, which is why analytics scans are faster.
</aside>
SQLite is a great local database for analytics and testing.
import sqlite3
with sqlite3.connect("output/analytics.db") as conn:
    monthly_sales.to_sql("monthly_sales", conn, if_exists="replace", index=False)
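The same connection can be used to read the table straight back into a DataFrame with `pd.read_sql_query`. A self-contained sketch, using an in-memory database as a stand-in for `output/analytics.db`:

```python
import sqlite3

import pandas as pd

# Stand-in for the real monthly_sales frame
monthly_sales = pd.DataFrame({"month": ["2024-01", "2024-02"], "revenue": [100, 150]})

# ":memory:" keeps the example self-contained; use a file path in your pipeline
with sqlite3.connect(":memory:") as conn:
    monthly_sales.to_sql("monthly_sales", conn, if_exists="replace", index=False)
    check = pd.read_sql_query("SELECT * FROM monthly_sales", conn)

print(len(check))  # → 2
```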
Always read back a sample to confirm schema and content.
check = pd.read_parquet("output/monthly_sales.parquet")
assert len(check) == len(monthly_sales)
Azure Blob Storage is a straightforward way to store processed data in the cloud. You can upload CSV or Parquet outputs after your pipeline finishes.
from pathlib import Path
import os
from azure.storage.blob import BlobServiceClient
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container_name = "week4-outputs"
file_path = Path("output/monthly_sales.parquet")
blob_name = file_path.name
service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)
if not container.exists():
    container.create_container()
with open(file_path, "rb") as file_handle:
    container.upload_blob(blob_name, file_handle, overwrite=True)
<aside>
⚠️ Never hardcode secrets. Store connection strings in environment variables or a .env file that is excluded from git.
</aside>
<aside> ⌨️ Hands on: Use this sample table. Write it to both CSV and Parquet, then read both back and compare their row counts.
import pandas as pd
summary = pd.DataFrame(
    {
        "region": ["NL", "BE", "DE"],
        "total_revenue": [320, 140, 200],
        "order_count": [4, 2, 1],
    }
)
🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=writing_results&exercise=w4_writing_results__row_count_match&lang=python
💭 The widget uses plain Python lists of dictionaries to mimic a table.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.