Azure Blob Storage

Azure Blob Storage is object storage for the cloud. It stores files (called blobs) of any type: JSON, CSV, images, Parquet, logs. Object storage is one of the core reasons cloud platforms exist. It is cheap, scales to petabytes, and almost every cloud data pipeline reads from or writes to it.

By the end of this chapter, you should be able to upload files to Azure Blob Storage from Python and verify them in the portal.

Concepts

Object storage vs file system storage

You already know file system storage: it is the Documents/ folder on your laptop. Files live in directories, you can rename them, move them between folders, and edit them in place. That works well for day-to-day use, but it has limits when you need to store data at cloud scale.

Object storage works differently. Each object is stored as a single unit with a key (its name) and metadata. You cannot edit part of an object; you replace the whole thing. There are no real directories, only key prefixes that look like paths (e.g. raw/weather/2024-01-15.json).

	File system	Object storage
Structure	Directories and files in a tree	Flat namespace with key prefixes
Editing	Edit a file in place	Replace the entire object
Scale	Limited by disk size	Scales to petabytes automatically
Access	Local or network mount	HTTP API (REST calls)
Redundancy	You manage backups	Automatic replication across disks/regions
Cost	Pay for the disk, used or not	Pay per GB stored + requests
Best for	Application files, code, config	Data pipeline output, logs, media, backups
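
To make the object storage column concrete, here is a minimal sketch of a toy in-memory object store. It is not how Azure implements Blob Storage; `ToyObjectStore` and its methods are hypothetical names used only to illustrate three ideas from the table: a flat namespace of keys, whole-object replacement, and "folders" that are really just key prefixes.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: a flat mapping of key -> (bytes, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # Writing to an existing key replaces the whole object; there is no partial edit.
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key):
        # Return only the stored bytes for this key.
        return self._objects[key][0]

    def list(self, prefix=""):
        # "Folders" are just a filter on the start of the key.
        return sorted(key for key in self._objects if key.startswith(prefix))


store = ToyObjectStore()
store.put("raw/weather/2024-01-15.json", b'{"temp": 3}')
store.put("raw/weather/2024-01-16.json", b'{"temp": 5}')
store.put("processed/summary.csv", b"date,temp\n")

# Replace the whole object stored under an existing key.
store.put("raw/weather/2024-01-15.json", b'{"temp": 4}')

print(store.list("raw/"))
print(store.get("raw/weather/2024-01-15.json"))
```

Listing with the prefix `raw/` returns only the two weather keys, even though the store never created a `raw` directory.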

Why use object storage?

In Weeks 3 and 4, your pipeline saved output to local files. That works on your machine, but local files are not accessible to other services, not backed up, and not shareable.

Object storage solves this problem. Instead of files on a single disk, your data lives in a central cloud location that any authorized (allowed) service can read, write, and share.

Amazon launched S3 (Simple Storage Service) in 2006, and it changed how companies think about storing data. Before S3, you had to buy and manage your own file servers. S3 introduced the idea that storage could be a service: you upload objects (files), the provider handles replication, durability, and scaling. You pay only for what you use.

Every major cloud provider now offers the same concept: Amazon S3, Azure Blob Storage, and Google Cloud Storage. The APIs differ, but the mental model is identical: you put objects in, you get objects out. In this track we use Azure Blob Storage, but if you learn one, the others will feel familiar. For the full history of how object storage replaced file servers, see History of Cloud Computing.

<aside> 🤓 Curious Geek: The history of S3

S3 was one of the first AWS services ever launched and is widely considered the starting point of modern cloud computing. It grew from Amazon's internal need to store massive amounts of data reliably. If you want to see how it evolved over the years, check out this AWS S3 history and timeline.

</aside>

What this means for data pipelines: Your pipeline produces output files (JSON, CSV, Parquet) that need to be stored reliably, accessed by other services, and kept for a long time. Object storage is designed for exactly this pattern. You write the file once, read it many times, and never edit it in place. File system storage is designed for interactive use where you open, edit, and save files repeatedly.

Storage accounts, containers, and blobs

Azure Blob Storage is organized in three levels:

Storage account (unique name, like a top-level namespace)
└── Container (like a folder)
    ├── raw/weather_2024-01-15.json
    ├── raw/weather_2024-01-16.json
    └── processed/summary.csv

Storage account: a namespace for all your storage. Your teacher has created a shared one.
Container: a logical grouping of blobs (not to be confused with Docker containers). Think of it like a top-level folder. Your teacher has pre-created raw and processed containers for you.
Blob: a single file.

<aside> ⚠️ "Container" in Azure Blob Storage means a storage container (folder), not a Docker container. The naming is confusing, but context makes it clear.

</aside>
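
One way to see how the three levels fit together is to look at the address of a blob. The sketch below assumes the default endpoint format of the public Azure cloud (`https://<account>.blob.core.windows.net`); other Azure environments or custom domains use different hosts. The helper name `blob_url` is hypothetical, and the account name is the one used in the CLI examples of this chapter.

```python
from urllib.parse import quote


def blob_url(account, container, blob_name):
    """Build the HTTPS address of a blob from the three levels of the hierarchy."""
    # quote() leaves "/" untouched by default, so prefixes such as raw/ stay readable.
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


print(blob_url("hyfstoragedev", "raw", "weather_2024-01-15.json"))
```

Knowing the URL is not enough to read the blob. Requests still need to be authorized, for example with the connection string your teacher provides.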

Here is how the hierarchy looks in practice:

<aside> 🖼️ Visual: Blob Storage hierarchy

</aside>

Uploading from Python

To interact with Azure Blob Storage from Python, you use the azure-storage-blob package. It is part of the Azure SDK for Python, a collection of libraries maintained by Microsoft that cover most Azure services. The blob storage library gives you classes to create clients, upload and download blobs, list containers, and manage metadata, all from Python instead of the CLI.

Install it with pip:

pip install azure-storage-blob

Upload a file:

import os
from azure.storage.blob import BlobServiceClient

connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # your teacher provides this value
client = BlobServiceClient.from_connection_string(connection_string)

container_client = client.get_container_client("raw")

with open("weather_data.json", "rb") as f:
    container_client.upload_blob(name="weather_2024-01-15.json", data=f, overwrite=True)

That is the core pattern: create a client, pick a container, upload a file.

<aside> 💡 Your teacher will provide the AZURE_STORAGE_CONNECTION_STRING value. Set it as an environment variable before running the code: export AZURE_STORAGE_CONNECTION_STRING="<value>". In Chapter 5 you will see how to retrieve it from Azure Key Vault.

</aside>
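
If the variable is not set, `os.environ[...]` raises a bare `KeyError`, which can be confusing the first time you see it. A minimal sketch of a friendlier check is shown below. The function name `get_connection_string` and the wording of the error message are illustrative choices, not part of the Azure SDK.

```python
import os


def get_connection_string(env=os.environ):
    """Return the connection string, or raise a readable error if it is not set."""
    value = env.get("AZURE_STORAGE_CONNECTION_STRING", "").strip()
    if not value:
        raise RuntimeError(
            "AZURE_STORAGE_CONNECTION_STRING is not set. "
            "Export it in your shell before running the script."
        )
    return value
```

You could then pass the result to `BlobServiceClient.from_connection_string(get_connection_string())`. Passing `env` as a parameter makes the function easy to try out with a plain dictionary.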

Downloading and listing blobs

# List blobs in a container (reuses container_client from the upload example)
for blob in container_client.list_blobs():
    print(blob.name)

# Download a blob; readall() returns its full contents as bytes
blob_client = container_client.get_blob_client("weather_2024-01-15.json")
data = blob_client.download_blob().readall()
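
Building on the calls above, here is a small sketch of two helpers you might find useful when verifying an upload. The function names are hypothetical. They expect a container client like the one created in the upload example, and they rely on the `name_starts_with` parameter of `list_blobs` to filter by prefix.

```python
def list_blob_names(container_client, prefix=None):
    """Return the names of blobs in a container, optionally limited to a prefix."""
    return [blob.name for blob in container_client.list_blobs(name_starts_with=prefix)]


def blob_matches_file(container_client, blob_name, local_path):
    """Download a blob and compare it byte for byte with a local file."""
    remote = container_client.get_blob_client(blob_name).download_blob().readall()
    with open(local_path, "rb") as f:
        local = f.read()
    return remote == local
```

For example, `blob_matches_file(container_client, "weather_2024-01-15.json", "weather_data.json")` should return True right after the upload shown earlier, assuming nobody has overwritten the blob in the meantime. Reading the whole blob into memory is fine for small files; large blobs would call for a streaming comparison instead.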

<aside> ⌨️ Hands on: Upload a small JSON file to the shared storage account, then list blobs in the container to verify it appeared.

</aside>

Using the CLI for blob storage

You can also manage blobs from the command line:

# List containers in a storage account
az storage container list \
  --account-name hyfstoragedev \
  --output table

# Upload a file
az storage blob upload \
  --account-name hyfstoragedev \
  --container-name raw \
  --file weather_data.json \
  --name weather_2024-01-15.json

# List blobs
az storage blob list \
  --account-name hyfstoragedev \
  --container-name raw \
  --output table

Organizing blobs

Use a clear naming convention so you can find files later:

raw/weather/2024-01-15.json
raw/weather/2024-01-16.json
processed/weather/summary.csv

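Prefixes like raw/ and processed/ act like folders. By default, Azure Blob Storage is flat (no real directories), but prefixes give you logical structure. In the shared storage account, raw and processed already exist as containers, so inside them a prefix such as weather/ plays the same role.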

<aside> 💡 Using AI to help: Ask an LLM to suggest a blob naming convention for your data source. (⚠️ Ensure no PII or sensitive company data is included!)

</aside>

Blob naming is flexible, but a consistent convention saves you debugging time later.
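
One way to keep names consistent is to build them in a single helper instead of typing them by hand. The sketch below follows the stage/source/date layout shown above. The function name `blob_name` and its parameters are illustrative, not a required convention.

```python
from datetime import date


def blob_name(stage, source, day, extension="json"):
    """Build a name like raw/weather/2024-01-15.json."""
    return f"{stage}/{source}/{day.isoformat()}.{extension}"


names = [blob_name("raw", "weather", date(2024, 1, d)) for d in (16, 15)]

# ISO dates (YYYY-MM-DD) sort chronologically when sorted as plain text.
print(sorted(names))
```

Putting the date in ISO format means a plain alphabetical listing of a prefix is also in chronological order.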

<aside> 🤓 Curious Geek: Why object storage is so cheap

Object storage runs on commodity hardware with built-in redundancy. Your file is automatically replicated across multiple disks (and optionally across regions). Because providers operate this hardware at very large scale and expose it through a simple API rather than a full file system, they can offer storage for roughly a few cents per gigabyte per month or less, depending on the access tier.

</aside>

Exercises

Upload a JSON file to the shared storage account using Python.
List all blobs in your container using the CLI.
Download the file you uploaded and verify the contents match.

<aside> 💡 In the wild: The Meltano open source ELT platform can write extracted data to blob storage (such as S3 or Azure Blob Storage). In that setup, raw API responses are written as JSON objects to a raw/ prefix, then transformed and loaded into a warehouse. This is the same extract-to-blob pattern your pipeline uses.

</aside>

🧠 Knowledge Check

What is the difference between a storage account, a container, and a blob? How do they relate to each other?
Why does object storage use "replace the whole object" instead of "edit in place" like a file system? What advantage does this give data pipelines?
Your upload script runs without errors, but when you list blobs in the container, your file is not there. What are two possible causes?

Extra reading

Azure Blob Storage documentation: official guide
Quickstart: Azure Blob Storage with Python: official Python quickstart
Azure Storage pricing: see how cheap blob storage is

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
