Week 6 - Cloud and Azure Essentials
Introduction to Cloud and Azure
Azure Blob Storage is object storage for the cloud. It stores files (called blobs) of any type: JSON, CSV, images, Parquet, logs. Object storage is one of the core building blocks of every cloud platform. It is cheap, scales to petabytes, and almost every cloud data pipeline reads from or writes to it.
By the end of this chapter, you should be able to upload files to Azure Blob Storage from Python and verify them in the portal.
You already know file system storage: it is the Documents/ folder on your laptop. Files live in directories, you can rename them, move them between folders, and edit them in place. That works well for day-to-day use, but it has limits when you need to store data at cloud scale.
Object storage works differently. Each object is stored as a single unit with a key (its name) and metadata. You cannot edit part of an object; you replace the whole thing. There are no real directories, only key prefixes that look like paths (e.g. raw/weather/2024-01-15.json).
| | File system | Object storage |
|---|---|---|
| Structure | Directories and files in a tree | Flat namespace with key prefixes |
| Editing | Edit a file in place | Replace the entire object |
| Scale | Limited by disk size | Scales to petabytes automatically |
| Access | Local or network mount | HTTP API (REST calls) |
| Redundancy | You manage backups | Automatic replication across disks/regions |
| Cost | Pay for the disk, used or not | Pay per GB stored + requests |
| Best for | Application files, code, config | Data pipeline output, logs, media, backups |
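To make the object-storage column concrete, here is a toy in-memory model. It is only a sketch of the mental model (flat keys, whole-object replacement, prefixes as pseudo-folders), not how Azure implements it. The class name `ToyObjectStore` and its methods are invented for illustration.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in: a flat mapping of key -> (bytes, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # A put stores the whole object; an existing key is replaced, never patched.
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key):
        data, _metadata = self._objects[key]
        return data

    def list_keys(self, prefix=""):
        # "Folders" are just a string-prefix filter over the flat key space.
        return sorted(key for key in self._objects if key.startswith(prefix))


store = ToyObjectStore()
store.put("raw/weather/2024-01-15.json", b'{"temp": 3}', {"source": "api"})
store.put("raw/weather/2024-01-16.json", b'{"temp": 5}')
store.put("processed/summary.csv", b"date,temp\n")

# "Editing" an object means writing the full new content under the same key.
store.put("raw/weather/2024-01-15.json", b'{"temp": 4}')

print(store.list_keys("raw/weather/"))
print(store.get("raw/weather/2024-01-15.json"))
```

Note that there is no `mkdir` or `rename` here: a key such as `raw/weather/2024-01-15.json` is one opaque string, and listing a "folder" is a prefix match.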
In Weeks 3 and 4, your pipeline saved output to local files. That works on your machine, but local files are not accessible to other services, not backed up, and not shareable.
Object storage solves this problem. Instead of files on a single disk, your data lives in a central cloud location that any authorized (allowed) service can read, write, and share.
Amazon launched S3 (Simple Storage Service) in 2006, and it changed how companies think about storing data. Before S3, you had to buy and manage your own file servers. S3 introduced the idea that storage could be a service: you upload objects (files), the provider handles replication, durability, and scaling. You pay only for what you use.
Every major cloud provider now offers the same concept: Amazon S3, Azure Blob Storage, and Google Cloud Storage. The APIs differ, but the mental model is identical: you put objects in, you get objects out. In this track we use Azure Blob Storage, but if you learn one, the others will feel familiar. For the full history of how object storage replaced file servers, see History of Cloud Computing.
<aside> 🤓 Curious Geek: The history of S3
S3 launched in 2006 and is widely considered the starting point of modern cloud computing. For the full timeline, see History of Cloud Computing: Object storage.
</aside>
What this means for data pipelines: Your pipeline produces output files (JSON, CSV, Parquet) that need to be stored reliably, accessed by other services, and kept for a long time. Object storage is designed for exactly this pattern. You write the file once, read it many times, and never edit it in place. File system storage is designed for interactive use where you open, edit, and save files repeatedly.
Azure Blob Storage is organized in three levels:
Storage account (unique name, like a top-level namespace)
└── Container (like a folder)
    ├── raw/weather_2024-01-15.json
    ├── raw/weather_2024-01-16.json
    └── processed/summary.csv
The shared storage account already has raw and processed containers for you. Here is how the hierarchy looks in practice:
<aside> 🖼️ Visual: Blob Storage hierarchy
</aside>
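The three levels also show up in a blob's HTTPS address. The sketch below assembles that address from its parts, assuming the default public-cloud endpoint pattern `https://<account>.blob.core.windows.net`. The helper name `blob_url` is made up for this example.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the HTTPS address of a blob (illustrative helper, not part of the SDK)."""
    # quote() percent-encodes unsafe characters but leaves "/" alone,
    # so prefix-style names stay readable.
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


print(blob_url("hyfstoragedev", "raw", "weather_2024-01-15.json"))
# https://hyfstoragedev.blob.core.windows.net/raw/weather_2024-01-15.json
```

Reading the URL left to right gives storage account, then container, then blob name.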
By Week 6 you have met three different things named container. The word is overloaded; context tells you which one someone means:
| | What it is | Week | Example phrase |
|---|---|---|---|
| Blob Storage container | A folder inside a storage account | Ch3 | "Upload to the raw container" |
| Docker container | A running process from your image | Week 5 | docker run your weather pipeline |
| Container App Job | Azure runs your image on demand in the cloud | Ch5 | az containerapp job create |
<aside> 🖼️ Visual: Three containers in Week 6
</aside>
When a teammate says "push it to the raw container," they mean the Blob Storage folder, not a Docker container and not a Container App Job.
To interact with Azure Blob Storage from Python, you use the azure-storage-blob package. It is part of the Azure SDK for Python, a collection of libraries maintained by Microsoft that covers the main Azure services. The blob storage library gives you classes to create clients, upload and download blobs, list containers, and manage metadata, all through Python instead of the CLI.
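Install it with pip, together with azure-identity, which provides the DefaultAzureCredential class used in the upload example below:
pip install azure-storage-blob azure-identity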
Upload a file:
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
# When running locally, DefaultAzureCredential automatically picks up your `az login` credentials.
# No connection string or secret is required on your laptop!
credential = DefaultAzureCredential()
client = BlobServiceClient(account_url="https://hyfstoragedev.blob.core.windows.net", credential=credential)
container_client = client.get_container_client("raw")
with open("weather_data.json", "rb") as f:
    container_client.upload_blob(name="weather_2024-01-15.json", data=f, overwrite=True)
That is the core pattern: create a client, pick a container, upload a file.
<aside> 💡 Locally vs. Cloud authentication:
Locally: with DefaultAzureCredential(), Python reads your active az login token.
In the cloud: the job has no az login session. Pass AZURE_STORAGE_CONNECTION_STRING as an env var on the job (Chapter 5 shows the Key Vault commands).
</aside>
# List blobs in a container
for blob in container_client.list_blobs():
    print(blob.name)
# Download a blob
blob_client = container_client.get_blob_client("weather_2024-01-15.json")
data = blob_client.download_blob().readall()
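The local-versus-cloud split from the authentication note above could be captured in one small helper. This is a sketch under assumptions: the function names `pick_auth_mode` and `make_blob_service_client` are invented here, and it treats the presence of AZURE_STORAGE_CONNECTION_STRING as the signal that the code runs in the cloud job.

```python
import os


def pick_auth_mode(env=os.environ):
    """Decide which credential path to use (hypothetical helper)."""
    if env.get("AZURE_STORAGE_CONNECTION_STRING"):
        return "connection_string"
    return "default_credential"


def make_blob_service_client(account_url, env=os.environ):
    """Build a BlobServiceClient for whichever environment we are in."""
    # Imported inside the function so this file loads even without the Azure packages.
    from azure.storage.blob import BlobServiceClient

    if pick_auth_mode(env) == "connection_string":
        return BlobServiceClient.from_connection_string(
            env["AZURE_STORAGE_CONNECTION_STRING"]
        )

    from azure.identity import DefaultAzureCredential

    return BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())


# The decision itself can be checked without touching Azure:
print(pick_auth_mode({}))
print(pick_auth_mode({"AZURE_STORAGE_CONNECTION_STRING": "<placeholder>"}))
```

Keeping the decision in a separate function means it can be tested without network access, while the rest of the pipeline only ever calls `make_blob_service_client`.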
<aside> ⌨️ Hands on: Upload a small JSON file to the shared storage account, then list blobs in the container to verify it appeared.
</aside>
Use the Python SDK in scripts and pipelines (where you can wrap it in retries, logging, and config); use the Azure CLI for one-off uploads, debugging, or quick verification of what landed.
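As a rough illustration of "wrap it in retries and logging", the sketch below retries the same `upload_blob` call used earlier. The helper `upload_with_retry` and the `FlakyContainerClient` test double are hypothetical, and the SDK already applies its own retry policy to some transient errors, so treat this as a pattern rather than a requirement. Catching every `Exception` keeps the example short; real code would more likely narrow this to the SDK's error types.

```python
import logging
import os
import tempfile
import time

logger = logging.getLogger("blob_upload")


def upload_with_retry(container_client, local_path, blob_name, attempts=3, delay_seconds=1.0):
    """Upload local_path as blob_name; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            # Reopen the file on every attempt so each upload starts at byte 0.
            with open(local_path, "rb") as f:
                container_client.upload_blob(name=blob_name, data=f, overwrite=True)
            logger.info("Uploaded %s as %s", local_path, blob_name)
            return attempt
        except Exception as exc:  # assumption: broad catch for brevity
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds * attempt)  # simple linear backoff


class FlakyContainerClient:
    """Hypothetical test double: fails once, then accepts the upload."""

    def __init__(self):
        self.calls = 0
        self.stored = {}

    def upload_blob(self, name, data, overwrite=False):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("simulated transient failure")
        self.stored[name] = data.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weather_data.json")
    with open(path, "wb") as f:
        f.write(b'{"temp": 3}')
    fake = FlakyContainerClient()
    attempts_used = upload_with_retry(fake, path, "weather_2024-01-15.json", delay_seconds=0)

print(attempts_used, fake.stored)
```

In a real pipeline you would pass the `container_client` from the upload example instead of the test double.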
All three commands below pass --auth-mode login, which routes through your Entra ID identity (and the HYF Data Track Student role's blob dataActions). Without --auth-mode login, the CLI tries account-key auth and fails on the shared account.
# List containers in a storage account
az storage container list \
--account-name hyfstoragedev \
--auth-mode login \
--output table
# Upload a file
az storage blob upload \
--account-name hyfstoragedev \
--container-name raw \
--file weather_data.json \
--name weather_2024-01-15.json \
--auth-mode login
# List blobs
az storage blob list \
--account-name hyfstoragedev \
--container-name raw \
--auth-mode login \
--output table
Use a clear naming convention so you can find files later:
raw/weather/2024-01-15.json
raw/weather/2024-01-16.json
processed/weather/summary.csv
Prefixes like raw/ and processed/ act like folders. Azure Blob Storage is flat (no real directories), but prefixes give you logical structure.
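One way to keep names consistent is to generate them from a single function instead of typing them by hand. The sketch below follows the layout shown above. The function name `blob_name` and its parameters are illustrative choices, not a course requirement.

```python
from datetime import date


def blob_name(layer: str, source: str, day: date, extension: str = "json") -> str:
    """Build a name like raw/weather/2024-01-15.json (hypothetical convention helper)."""
    # isoformat() yields YYYY-MM-DD, so names sort chronologically as plain strings.
    return f"{layer}/{source}/{day.isoformat()}.{extension}"


names = [blob_name("raw", "weather", date(2024, 1, d)) for d in (16, 15)]
print(sorted(names))
# ['raw/weather/2024-01-15.json', 'raw/weather/2024-01-16.json']
```

With names built this way, `container_client.list_blobs(name_starts_with="raw/weather/")` returns only the blobs under that prefix.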
<aside> 💡 Using AI to help: Ask an LLM to suggest a blob naming convention for your data source. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Blob naming is flexible, but a consistent convention saves you debugging time later.
Azure Data Lake Storage Gen2 (ADLS) is the hierarchical-namespace cousin of blob storage; later weeks use it for Parquet datasets, and its abfss://<container>@<account>.dfs.core.windows.net/<path> URI bakes the same account/container/blob hierarchy you just learned into a single string.
<aside> 🚀 Try it in the widget: Build ADLS URI exercise
</aside>
Practicing the URI structure makes the account, container, and blob hierarchy concrete.
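The same structure can be expressed as a small function. This sketch only formats the string shown above. The helper name `adls_uri` is invented, and using hyfstoragedev here is purely illustrative, since a storage account needs the hierarchical namespace enabled before abfss:// paths work against it.

```python
def adls_uri(account: str, container: str, path: str) -> str:
    """Format an abfss:// URI from its three parts (illustrative helper)."""
    # Strip a leading slash so the result never contains "//" before the path.
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


print(adls_uri("hyfstoragedev", "raw", "weather/2024-01-15.parquet"))
# abfss://raw@hyfstoragedev.dfs.core.windows.net/weather/2024-01-15.parquet
```

Note the order: the container comes before the `@`, the account after it, and the blob path last.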
<aside> 🤓 Curious Geek: Why object storage is so cheap
Object storage uses commodity hardware with built-in redundancy. Your file is automatically replicated across multiple disks (and optionally across regions). Because there is no file system overhead, providers can offer storage for roughly a few cents per gigabyte per month or less, with archival tiers costing fractions of a cent.
</aside>
The same extract-to-blob pattern shows up in production pipelines.
<aside>
💡 In the wild: Pipelines built with the Meltano open source ELT platform can use blob storage (S3 or Azure Blob Storage) as the destination for extracted data. Raw API responses are written as JSON objects to a raw/ prefix, then transformed and loaded into a warehouse. This is the same extract-to-blob pattern your pipeline uses.
</aside>
<aside> 🚀 Try it in the widget: Interactive Quiz: Azure Blob Storage
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_6_ch3_azure_blob_storage&embed=1
If the storage-account / container / blob hierarchy still feels abstract, this AZ-900 episode walks through the Azure storage services with diagrams and a portal demo.
<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:
</aside>
https://www.youtube.com/watch?v=_Qlkvd4ZQuo
Ready to apply the concepts? Upload and verify a blob end to end in the next practice exercise.
<aside> ⌨️ Hands on: Practice with Exercise 2: End-to-end blob verification, where you upload a JSON blob from Python and verify it from the CLI.
</aside>
Next up: Azure PostgreSQL Databases, where you connect to a managed Postgres server over SSL, create a table, and upsert rows from Python.