Joining and Merging DataFrames
Working with Strings and Dates
Going Further: Optional Deep Dives
Processing data is only useful if the result can be reused. That means writing clean outputs to files, databases, or cloud storage in a repeatable way.
By the end of this chapter, you should be able to write DataFrames to CSV, Parquet, and SQLite, validate the output with a read-back check, and understand when to use each format.
<aside>
📦 Run the examples: companion_ch7_writing_results.py: run in the Codespace or clone locally to follow along with this chapter.
</aside>
import pandas as pd
from pathlib import Path
monthly_sales = pd.DataFrame(
{
"region": ["NL", "BE", "DE"],
"month": ["2024-01", "2024-01", "2024-01"],
"total_revenue": [320.0, 140.0, 200.0],
"order_count": [4, 2, 1],
}
)
# Create output directory so all write snippets below can run
Path("output").mkdir(exist_ok=True)
output_path = "output/monthly_sales.csv"
monthly_sales.to_csv(output_path, index=False)
Use index=False unless you intentionally want the index as a column.
Parquet is a binary file format designed for analytical workloads. Unlike CSV, which stores data row by row, Parquet stores each column separately on disk. When you query only two columns from a 20-column dataset, Parquet skips the other 18 entirely. CSV must read every row in full regardless of how many columns you need.
CSV (row-oriented): Parquet (column-oriented):
row1: NL, 2024-01, 320 column "region": NL, BE, DE
row2: BE, 2024-01, 140 column "month": 2024-01, ...
row3: DE, 2024-01, 200 column "total_revenue": 320, 140, 200
Parquet also compresses each column independently, which can make files 5–10× smaller than CSV for typical analytical data.
Two things to know before using it: Parquet is a binary format, so you cannot open it in a text editor or inspect it with cat. And because it stores type information alongside the data, it preserves column dtypes (int, float, datetime) exactly, which CSV cannot do.
When to use each format:
| Format | Use when |
|---|---|
| CSV | Small files, human-readable exchange, quick inspection |
| Parquet | Large datasets, pipelines, cloud storage, fast queries |
monthly_sales.to_parquet("output/monthly_sales.parquet", index=False)
<aside> 🤓 Curious Geek: Columnar storage
Columnar formats read only the columns you query, which is why analytics scans are faster.
</aside>
This is why querying a 1 GB Parquet file for two columns is faster than reading the same data from a row-oriented CSV.
SQLite is a great local database for analytics and testing.
import sqlite3
with sqlite3.connect("output/analytics.db") as conn:
monthly_sales.to_sql("monthly_sales", conn, if_exists="replace", index=False)
Always read back a sample to confirm schema and content.
check = pd.read_parquet("output/monthly_sales.parquet")
assert len(check) == len(monthly_sales)
Azure Blob Storage is the simplest way to store processed data in the cloud. You can upload CSV or Parquet outputs after your pipeline finishes.
import os
from pathlib import Path
# Requires AZURE_STORAGE_CONNECTION_STRING in your environment.
# Skip this block if you do not have Azure credentials set up yet.
connection_string = os.environ.get("AZURE_STORAGE_CONNECTION_STRING")
if connection_string:
from azure.storage.blob import BlobServiceClient
container_name = "week4-outputs"
file_path = Path("output/monthly_sales.parquet")
blob_name = file_path.name
service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)
if not container.exists():
container.create_container()
with open(file_path, "rb") as file_handle:
container.upload_blob(blob_name, file_handle, overwrite=True)
<aside>
⚠️ Never hardcode secrets. Store connection strings in environment variables or a .env file that is excluded from git.
</aside>
Always read back a sample after writing to confirm the schema survived the round-trip.
Use this sample table. Write it to both CSV and Parquet, then read both back and compare their row counts.
import pandas as pd
summary = pd.DataFrame(
{
"region": ["NL", "BE", "DE"],
"total_revenue": [320, 140, 200],
"order_count": [4, 2, 1],
}
)
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=4&chapter=writing_results&exercise=w4_writing_results__clean_index_column&lang=python
</aside>
index=False.sales_summary.if_exists="replace" do when writing to a SQLite table?Test your recall before moving on.
<aside> 🚀 Try it in the widget: Interactive Quiz: Writing Data
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch7_writing_results_quiz&embed=1
If the format differences or the SQLite write felt unclear, this video covers reading and writing to CSV, JSON, SQL, and more in Pandas.
<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=N6hyN6BW6ao
You can also describe your schema to an LLM to get a starting point.
<aside>
💡 Using AI to help: Paste a schema description (⚠️ Ensure no PII or sensitive company data is included!): for example "a DataFrame with columns region, date, revenue: float", and ask an LLM to generate the to_parquet write + read-back assertion. Always test on a local file before pointing at cloud storage.
</aside>
The export step in the practice exercise covers writing DataFrames to CSV and Parquet.
<aside> ⌨️ Hands on: Practice with Exercise 5: Reshape and Export.
</aside>
Next up: Visualizing Data with Pandas, where you create quick charts directly from DataFrames to validate trends and spot data quality issues.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.