Week 6 - Cloud and Azure Essentials
Introduction to Cloud and Azure
Azure Blob Storage is object storage for the cloud. It stores files (called blobs) of any type: JSON, CSV, images, Parquet, logs. Object storage is one of the core reasons cloud platforms exist. It is cheap, scales to petabytes, and every cloud data pipeline reads from or writes to it.
By the end of this chapter, you should be able to upload files to Azure Blob Storage from Python and verify them in the portal.
You already know file system storage: it is the Documents/ folder on your laptop. Files live in directories, you can rename them, move them between folders, and edit them in place. That works well for day-to-day use, but it has limits when you need to store data at cloud scale.
Object storage works differently. Each object is stored as a single unit with a key (its name) and metadata. You cannot edit part of an object; you replace the whole thing. There are no real directories, only key prefixes that look like paths (e.g. raw/weather/2024-01-15.json).
| Aspect | File system | Object storage |
|---|---|---|
| Structure | Directories and files in a tree | Flat namespace with key prefixes |
| Editing | Edit a file in place | Replace the entire object |
| Scale | Limited by disk size | Scales to petabytes automatically |
| Access | Local or network mount | HTTP API (REST calls) |
| Redundancy | You manage backups | Automatic replication across disks/regions |
| Cost | Pay for the disk, used or not | Pay per GB stored + requests |
| Best for | Application files, code, config | Data pipeline output, logs, media, backups |
In Weeks 3 and 4, your pipeline saved output to local files. That works on your machine, but local files are not accessible to other services, not backed up, and not shareable.
Object storage solves this problem. Instead of files on a single disk, your data lives in a central cloud location that any authorized service can read, write, and share.
Amazon launched S3 (Simple Storage Service) in 2006, and it changed how companies think about storing data. Before S3, you had to buy and manage your own file servers. S3 introduced the idea that storage could be a service: you upload objects (files), and the provider handles replication, durability, and scaling. You pay only for what you use.
Every major cloud provider now offers the same concept: Amazon S3, Azure Blob Storage, and Google Cloud Storage. The APIs differ, but the mental model is identical: you put objects in, you get objects out. In this track we use Azure Blob Storage, but if you learn one, the others will feel familiar. For the full history of how object storage replaced file servers, see History of Cloud Computing.
<aside> 🤓 Curious Geek: The history of S3
S3 was one of the first AWS services ever launched and is widely considered the starting point of modern cloud computing. It grew from Amazon's internal need to store massive amounts of data reliably. If you want to see how it evolved over the years, check out this AWS S3 history and timeline.
</aside>
What this means for data pipelines: Your pipeline produces output files (JSON, CSV, Parquet) that need to be stored reliably, accessed by other services, and kept for a long time. Object storage is designed for exactly this pattern. You write the file once, read it many times, and never edit it in place. File system storage is designed for interactive use where you open, edit, and save files repeatedly.
Azure Blob Storage is organized in three levels:
Storage account (unique name, like a top-level namespace)
└── Container (like a folder)
    ├── raw/weather_2024-01-15.json
    ├── raw/weather_2024-01-16.json
    └── processed/summary.csv
Your teacher has already created the raw and processed containers for you.
<aside> ⚠️ "Container" in Azure Blob Storage means a storage container (folder), not a Docker container. The naming is confusing, but context makes it clear.
</aside>
Here is how the hierarchy looks in practice:
<aside> 🖼️ Visual: Blob Storage hierarchy
</aside>
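Each level of the hierarchy also appears in the blob's URL, which is how other services address your data over HTTP. As an illustration (the account name hyfstoragedev is taken from the CLI examples later in this chapter):
https://<storage-account>.blob.core.windows.net/<container>/<blob-name>
https://hyfstoragedev.blob.core.windows.net/raw/weather_2024-01-15.json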
To interact with Azure Blob Storage from Python, you use the azure-storage-blob package. It is part of the Azure SDK for Python, a collection of libraries maintained by Microsoft that cover every Azure service. The blob storage library gives you classes to create clients, upload and download blobs, list containers, and manage metadata, all from Python instead of the CLI.
Install it with pip:
pip install azure-storage-blob
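To check that the package landed in your environment, ask pip for its metadata:
pip show azure-storage-blob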
Upload a file:
import os
from azure.storage.blob import BlobServiceClient

connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # your teacher provides this value
client = BlobServiceClient.from_connection_string(connection_string)
container_client = client.get_container_client("raw")

with open("weather_data.json", "rb") as f:
    container_client.upload_blob(name="weather_2024-01-15.json", data=f, overwrite=True)
That is the core pattern: create a client, pick a container, upload a file.
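Your pipeline does not have to write a local file first: upload_blob also accepts a string or bytes, so you can serialize in memory and upload directly. A minimal sketch, reusing the container_client from above (the payload is a made-up example):
import json

# Hypothetical pipeline output, held in memory instead of on disk
payload = {"city": "Amsterdam", "temperature_c": 8.5}

# upload_blob accepts str/bytes as well as file objects
container_client.upload_blob(
    name="weather_2024-01-15.json",
    data=json.dumps(payload),
    overwrite=True,  # replace the object if it already exists
)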
<aside>
💡 Your teacher will provide the AZURE_STORAGE_CONNECTION_STRING value. Set it as an environment variable before running the code: export AZURE_STORAGE_CONNECTION_STRING="<value>". In Chapter 5 you will see how to retrieve it from Azure Key Vault.
</aside>
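If the variable is not set, os.environ[...] fails with a bare KeyError. A small guard gives a friendlier message; this is an optional defensive sketch, not part of the required code:
import os

connection_string = os.environ.get("AZURE_STORAGE_CONNECTION_STRING")
if not connection_string:
    raise SystemExit(
        "AZURE_STORAGE_CONNECTION_STRING is not set. "
        "Ask your teacher for the value and export it before running this script."
    )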
# List blobs in a container
for blob in container_client.list_blobs():
    print(blob.name)

# Download a blob
blob_client = container_client.get_blob_client("weather_2024-01-15.json")
data = blob_client.download_blob().readall()
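Downloading a blob that does not exist raises ResourceNotFoundError from azure.core.exceptions. If you want to handle that case explicitly instead of crashing, a sketch:
from azure.core.exceptions import ResourceNotFoundError

try:
    data = blob_client.download_blob().readall()
except ResourceNotFoundError:
    print("Blob not found: check the blob name and the container.")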
<aside> ⌨️ Hands on: Upload a small JSON file to the shared storage account, then list blobs in the container to verify it appeared.
</aside>
You can also manage blobs from the command line:
# List containers in a storage account
az storage container list \
  --account-name hyfstoragedev \
  --output table

# Upload a file
az storage blob upload \
  --account-name hyfstoragedev \
  --container-name raw \
  --file weather_data.json \
  --name weather_2024-01-15.json

# List blobs
az storage blob list \
  --account-name hyfstoragedev \
  --container-name raw \
  --output table
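The CLI can also pull a blob back down to a local file, which is handy for spot-checking what your pipeline wrote (same account and container as above; the local filename is arbitrary):
# Download a blob to a local file
az storage blob download \
  --account-name hyfstoragedev \
  --container-name raw \
  --name weather_2024-01-15.json \
  --file weather_downloaded.json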
Use a clear naming convention so you can find files later:
raw/weather/2024-01-15.json
raw/weather/2024-01-16.json
processed/weather/summary.csv
Prefixes like raw/ and processed/ act like folders. Azure Blob Storage is flat (no real directories), but prefixes give you logical structure.
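Prefixes also work in queries: list_blobs accepts a name_starts_with argument, so you can list one "folder" at a time. A minimal sketch, assuming blobs named with the raw/weather/ prefix from the convention above:
# List only the blobs under the raw/weather/ prefix, as if it were a folder
for blob in container_client.list_blobs(name_starts_with="raw/weather/"):
    print(blob.name)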
<aside> 💡 Using AI to help: Ask an LLM to suggest a blob naming convention for your data source. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Blob naming is flexible, but a consistent convention saves you debugging time later.
<aside> 🤓 Curious Geek: Why object storage is so cheap
Object storage uses commodity hardware with built-in redundancy. Your file is automatically replicated across multiple disks (and optionally across regions). Because there is no file system overhead, providers can offer storage at fractions of a cent per gigabyte per month.
</aside>
<aside>
💡 In the wild: The Meltano open source ELT platform uses blob storage (S3 and Azure Blob Storage) as the default destination for extracted data. Raw API responses are written as JSON objects to a raw/ prefix, then transformed and loaded into a warehouse. This is the same extract-to-blob pattern your pipeline uses.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.