Week 4: Data Processing

Pandas and DataFrames

Selecting and Filtering Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Jupyter Notebooks

Practice

Assignment: MessyCorp Pandas

Gotchas & Pitfalls

Week 4 Kickoff Slides

Career relevance: Week 4

Pandas Cheatsheet

Week 4 Glossary

Going Further: Optional Deep Dives

Writing Data

Processing data is only useful if the result can be reused. That means writing clean outputs to files, databases, or cloud storage in a repeatable way.

By the end of this chapter, you should be able to write DataFrames to CSV, Parquet, and SQLite, validate the output with a read-back check, and understand when to use each format.

<aside> 📦 Run the examples: companion_ch7_writing_results.py: run in the Codespace or clone locally to follow along with this chapter.

</aside>

Concepts

import pandas as pd
from pathlib import Path

monthly_sales = pd.DataFrame(
    {
        "region": ["NL", "BE", "DE"],
        "month": ["2024-01", "2024-01", "2024-01"],
        "total_revenue": [320.0, 140.0, 200.0],
        "order_count": [4, 2, 1],
    }
)

# Create output directory so all write snippets below can run
Path("output").mkdir(exist_ok=True)

Writing CSV Files

output_path = "output/monthly_sales.csv"
monthly_sales.to_csv(output_path, index=False)

Use index=False unless you intentionally want the index as a column.

Writing Parquet Files

Parquet is a binary file format designed for analytical workloads. Unlike CSV, which stores data row by row, Parquet stores each column separately on disk. When you query only two columns from a 20-column dataset, Parquet skips the other 18 entirely. CSV must read every row in full regardless of how many columns you need.

CSV (row-oriented):        Parquet (column-oriented):
row1: NL, 2024-01, 320    column "region":       NL, BE, DE
row2: BE, 2024-01, 140    column "month":        2024-01, ...
row3: DE, 2024-01, 200    column "total_revenue": 320, 140, 200

Parquet also compresses each column independently, which can make files 5–10× smaller than CSV for typical analytical data.

Two things to know before using it: Parquet is a binary format, so you cannot open it in a text editor or inspect it with cat. And because it stores type information alongside the data, it preserves column dtypes (int, float, datetime) exactly, which CSV cannot do.

When to use each format:

Format Use when
CSV Small files, human-readable exchange, quick inspection
Parquet Large datasets, pipelines, cloud storage, fast queries
monthly_sales.to_parquet("output/monthly_sales.parquet", index=False)

<aside> 🤓 Curious Geek: Columnar storage

Columnar formats read only the columns you query, which is why analytics scans are faster.

</aside>

This is why querying a 1 GB Parquet file for two columns is faster than reading the same data from a row-oriented CSV.

Writing to SQLite

SQLite is a great local database for analytics and testing.

import sqlite3

with sqlite3.connect("output/analytics.db") as conn:
    monthly_sales.to_sql("monthly_sales", conn, if_exists="replace", index=False)

Validate What You Wrote

Always read back a sample to confirm schema and content.

check = pd.read_parquet("output/monthly_sales.parquet")
assert len(check) == len(monthly_sales)

Azure Blob Storage (Cloud Output)

Azure Blob Storage is the simplest way to store processed data in the cloud. You can upload CSV or Parquet outputs after your pipeline finishes.

import os
from pathlib import Path

# Requires AZURE_STORAGE_CONNECTION_STRING in your environment.
# Skip this block if you do not have Azure credentials set up yet.
connection_string = os.environ.get("AZURE_STORAGE_CONNECTION_STRING")
if connection_string:
    from azure.storage.blob import BlobServiceClient

    container_name = "week4-outputs"
    file_path = Path("output/monthly_sales.parquet")
    blob_name = file_path.name

    service = BlobServiceClient.from_connection_string(connection_string)
    container = service.get_container_client(container_name)

    if not container.exists():
        container.create_container()

    with open(file_path, "rb") as file_handle:
        container.upload_blob(blob_name, file_handle, overwrite=True)

<aside> ⚠️ Never hardcode secrets. Store connection strings in environment variables or a .env file that is excluded from git.

</aside>

Always read back a sample after writing to confirm the schema survived the round-trip.

⌨️ Hands on: Write and Verify

Use this sample table. Write it to both CSV and Parquet, then read both back and compare their row counts.

import pandas as pd

summary = pd.DataFrame(
    {
        "region": ["NL", "BE", "DE"],
        "total_revenue": [320, 140, 200],
        "order_count": [4, 2, 1],
    }
)

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=writing_results&exercise=w4_writing_results__clean_index_column&lang=python

</aside>

Exercises

  1. Write an aggregated DataFrame to CSV and Parquet with index=False.
  2. Save the same data into a SQLite table called sales_summary.
  3. Upload one output file to Azure Blob Storage if you have access.

Extra reading

Knowledge Check

Test your recall before moving on.

<aside> 🚀 Try it in the widget: Interactive Quiz: Writing Data

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch7_writing_results_quiz&embed=1

If the format differences or the SQLite write felt unclear, this video covers reading and writing to CSV, JSON, SQL, and more in Pandas.

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

Watch on YouTube

</aside>

https://www.youtube.com/watch?v=N6hyN6BW6ao

You can also describe your schema to an LLM to get a starting point.

<aside> 💡 Using AI to help: Paste a schema description (⚠️ Ensure no PII or sensitive company data is included!): for example "a DataFrame with columns region, date, revenue: float", and ask an LLM to generate the to_parquet write + read-back assertion. Always test on a local file before pointing at cloud storage.

</aside>

The export step in the practice exercise covers writing DataFrames to CSV and Parquet.

<aside> ⌨️ Hands on: Practice with Exercise 5: Reshape and Export.

</aside>


Next up: Visualizing Data with Pandas, where you create quick charts directly from DataFrames to validate trends and spot data quality issues.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.