Week 6 - Cloud and Azure Essentials

Azure Blob Storage

Azure Blob Storage is object storage for the cloud. It stores files (called blobs) of any type: JSON, CSV, images, Parquet, logs. Object storage is one of the core reasons cloud platforms exist. It is cheap, scales to petabytes, and nearly every cloud data pipeline reads from or writes to it.

By the end of this chapter, you should be able to upload files to Azure Blob Storage from Python and verify them in the portal.

Concepts

Object storage vs file system storage

You already know file system storage: it is the Documents/ folder on your laptop. Files live in directories, you can rename them, move them between folders, and edit them in place. That works well for day-to-day use, but it has limits when you need to store data at cloud scale.

Object storage works differently. Each object is stored as a single unit with a key (its name) and metadata. You cannot edit part of an object; you replace the whole thing. There are no real directories, only key prefixes that look like paths (e.g. raw/weather/2024-01-15.json).

| Aspect | File system | Object storage |
| --- | --- | --- |
| Structure | Directories and files in a tree | Flat namespace with key prefixes |
| Editing | Edit a file in place | Replace the entire object |
| Scale | Limited by disk size | Scales to petabytes automatically |
| Access | Local or network mount | HTTP API (REST calls) |
| Redundancy | You manage backups | Automatic replication across disks/regions |
| Cost | Pay for the disk, used or not | Pay per GB stored + requests |
| Best for | Application files, code, config | Data pipeline output, logs, media, backups |
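To make the comparison concrete, here is a minimal in-memory sketch of the object storage model. `ToyObjectStore` is a hypothetical teaching class, not part of any Azure library; it only illustrates the flat namespace, whole-object replacement, and prefix matching described above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObjectStore:
    """In-memory stand-in for an object store: one flat dict of key -> bytes."""

    objects: dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        # There is no partial edit: writing a key replaces the whole object.
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Folders" are only a string match on the start of the key.
        return sorted(k for k in self.objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("raw/weather/2024-01-15.json", b'{"temp_c": 3}')
store.put("raw/weather/2024-01-16.json", b'{"temp_c": 5}')
store.put("processed/summary.csv", b"date,temp_c\n")

# Correcting a value means uploading a complete new version of the object.
store.put("raw/weather/2024-01-15.json", b'{"temp_c": 4}')

print(store.list_keys("raw/"))
print(store.get("raw/weather/2024-01-15.json"))
```

A real object store adds durability, access control, and an HTTP API on top, but the mental model of keys mapping to whole objects is the same.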

Why use object storage?

In Weeks 3 and 4, your pipeline saved output to local files. That works on your machine, but local files are not accessible to other services, not backed up, and not shareable.

Object storage solves this problem. Instead of files on a single disk, your data lives in a central cloud location that any authorized (allowed) service can read, write, and share.

Amazon launched S3 (Simple Storage Service) in 2006, and it changed how companies think about storing data. Before S3, you had to buy and manage your own file servers. S3 introduced the idea that storage could be a service: you upload objects (files), the provider handles replication, durability, and scaling. You pay only for what you use.

Every major cloud provider now offers the same concept: Amazon S3, Azure Blob Storage, and Google Cloud Storage. The APIs differ, but the mental model is identical: you put objects in, you get objects out. In this track we use Azure Blob Storage, but if you learn one, the others will feel familiar. For the full history of how object storage replaced file servers, see History of Cloud Computing.

<aside> 🤓 Curious Geek: The history of S3

S3 launched in 2006 and is widely considered the starting point of modern cloud computing. For the full timeline, see History of Cloud Computing: Object storage.

</aside>

What this means for data pipelines: Your pipeline produces output files (JSON, CSV, Parquet) that need to be stored reliably, accessed by other services, and kept for a long time. Object storage is designed for exactly this pattern. You write the file once, read it many times, and never edit it in place. File system storage is designed for interactive use where you open, edit, and save files repeatedly.

Storage accounts, containers, and blobs

Azure Blob Storage is organized in three levels:

Storage account (unique name, like a top-level namespace)
└── Container (like a folder)
    ├── raw/weather_2024-01-15.json
    ├── raw/weather_2024-01-16.json
    └── processed/summary.csv
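The three levels also appear in the HTTPS address of every blob. The sketch below assumes the public Azure cloud endpoint format `https://<account>.blob.core.windows.net/<container>/<blob>`; the helper name `blob_url` is hypothetical, and the account and container names are only examples.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str) -> str:
    """Compose the HTTPS address of a blob from the three levels."""
    # quote() leaves "/" untouched, so prefixes such as raw/ stay readable.
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


print(blob_url("hyfstoragedev", "raw", "weather_2024-01-15.json"))
```

Knowing the address does not grant access; requests still need valid credentials unless a container has been made public.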

Here is how the hierarchy looks in practice:

<aside> 🖼️ Visual: Blob Storage hierarchy

</aside>

Three things called "container"

By Week 6 you have met three different things named container. The word is overloaded; context tells you which one someone means:

| Name | What it is | Week | Example phrase |
| --- | --- | --- | --- |
| Blob Storage container | A folder inside a storage account | Ch3 | "Upload to the raw container" |
| Docker container | A running process from your image | Week 5 | docker run your weather pipeline |
| Container App Job | Azure runs your image on demand in the cloud | Ch5 | az containerapp job create |

<aside> 🖼️ Visual: Three containers in Week 6

</aside>

When a teammate says "push it to the raw container," they mean the Blob Storage folder, not a Docker container and not a Container App Job.

Uploading from Python

To interact with Azure Blob Storage from Python, you use the azure-storage-blob package. It is part of the Azure SDK for Python, a collection of libraries maintained by Microsoft that covers most Azure services. The blob storage library gives you classes to create clients, upload and download blobs, list containers, and manage metadata, all from Python instead of the CLI.

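Install it with pip, together with azure-identity, which provides the DefaultAzureCredential class used for authentication:

pip install azure-storage-blob azure-identity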

Upload a file:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# When running locally, DefaultAzureCredential automatically picks up your `az login` credentials.
# No connection string or secret is required on your laptop!
credential = DefaultAzureCredential()
client = BlobServiceClient(account_url="https://hyfstoragedev.blob.core.windows.net", credential=credential)

container_client = client.get_container_client("raw")

with open("weather_data.json", "rb") as f:
    container_client.upload_blob(name="weather_2024-01-15.json", data=f, overwrite=True)

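That is the core pattern: create a client, pick a container, upload a file.

<aside> 💡 Local vs. cloud authentication: On your laptop, DefaultAzureCredential uses the session from az login. When the same code runs in Azure, it can authenticate with a managed identity instead, so the upload code itself does not change.

</aside>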

Downloading and listing blobs

# List blobs in a container
for blob in container_client.list_blobs():
    print(blob.name)

# Download a blob
blob_client = container_client.get_blob_client("weather_2024-01-15.json")
data = blob_client.download_blob().readall()
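Because `readall()` returns raw bytes, a pipeline usually needs a decoding step before it can work with the data. The sketch below shows a JSON round trip under that assumption; `to_blob_bytes` and `from_blob_bytes` are hypothetical helper names, and the sample record is made up for illustration.

```python
import json


def to_blob_bytes(record: dict) -> bytes:
    """Serialise a dict to UTF-8 JSON bytes, ready to pass as upload data."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def from_blob_bytes(payload: bytes) -> dict:
    """Parse bytes such as those returned by download_blob().readall()."""
    return json.loads(payload.decode("utf-8"))


reading = {"city": "Amsterdam", "temp_c": 3.5}  # made-up sample record
payload = to_blob_bytes(reading)

# The round trip gives back the original record.
print(from_blob_bytes(payload) == reading)
```

With these helpers, the bytes from `to_blob_bytes` can be passed as the `data` argument of `upload_blob` instead of an open file, which avoids writing a temporary file to disk first.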

<aside> ⌨️ Hands on: Upload a small JSON file to the shared storage account, then list blobs in the container to verify it appeared.

</aside>

Using the CLI for blob storage

Use the Python SDK in scripts and pipelines (where you can wrap it in retries, logging, and config); use the Azure CLI for one-off uploads, debugging, or quick verification of what landed.

All three commands below pass --auth-mode login, which routes through your Entra ID identity (and the HYF Data Track Student role's blob dataActions). Without --auth-mode login, the CLI tries account-key auth and fails on the shared account.

# List containers in a storage account
az storage container list \
  --account-name hyfstoragedev \
  --auth-mode login \
  --output table

# Upload a file
az storage blob upload \
  --account-name hyfstoragedev \
  --container-name raw \
  --file weather_data.json \
  --name weather_2024-01-15.json \
  --auth-mode login

# List blobs
az storage blob list \
  --account-name hyfstoragedev \
  --container-name raw \
  --auth-mode login \
  --output table
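On the Python side, the "retries, logging, and config" wrapping mentioned at the start of this section could look roughly like the sketch below. `upload_with_retries` is a hypothetical helper, not an SDK function, and the flaky uploader only simulates network failures so the example runs without an Azure account.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("blob_upload")


def upload_with_retries(
    upload: Callable[[], None],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call upload() until it succeeds; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        try:
            upload()
            logger.info("upload succeeded on attempt %d", attempt)
            return attempt
        except Exception as exc:  # a real pipeline would catch narrower errors
            logger.warning("attempt %d failed: %s", attempt, exc)
            if attempt == attempts:
                raise
            # Exponential backoff: 1s, 2s, 4s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Simulated uploader that fails twice, then succeeds.
calls = {"n": 0}


def flaky_upload() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network error")


# sleep is replaced with a no-op so the demo finishes instantly.
used = upload_with_retries(flaky_upload, sleep=lambda seconds: None)
print(used)
```

In a pipeline, `upload` would be a small function that opens the file and calls `container_client.upload_blob(...)`, so that each attempt starts reading the file from the beginning.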

Organizing blobs

Use a clear naming convention so you can find files later:

raw/weather/2024-01-15.json
raw/weather/2024-01-16.json
processed/weather/summary.csv

Prefixes like raw/ and processed/ act like folders. Azure Blob Storage is flat (no real directories), but prefixes give you logical structure.
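One way to keep names consistent is to generate them in code rather than typing them by hand. This is a minimal sketch; the function name `blob_name` and its stage/source/date layout are illustrative assumptions based on the examples above.

```python
from datetime import date


def blob_name(stage: str, source: str, day: date, ext: str = "json") -> str:
    """Build a name like raw/weather/2024-01-15.json."""
    # isoformat() gives YYYY-MM-DD, so names sort chronologically as text.
    return f"{stage}/{source}/{day.isoformat()}.{ext}"


names = [blob_name("raw", "weather", date(2024, 1, d)) for d in (16, 15)]
names.append(blob_name("processed", "weather", date(2024, 1, 15), ext="csv"))

print(sorted(names))
print([n for n in names if n.startswith("raw/weather/")])
```

The `startswith` filter mirrors what a prefix listing does on the service side: `list_blobs` accepts a `name_starts_with` argument for this purpose.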

<aside> 💡 Using AI to help: Ask an LLM to suggest a blob naming convention for your data source. (⚠️ Ensure no PII or sensitive company data is included!)

</aside>

Blob naming is flexible, but a consistent convention saves you debugging time later.

Azure Data Lake Storage Gen2 (ADLS) is the hierarchical-namespace cousin of Blob Storage, and later weeks use it for Parquet datasets. Its URI format, abfss://<container>@<account>.dfs.core.windows.net/<path>, packs the same account/container/blob hierarchy you just learned into a single string.
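The URI format above can be assembled and taken apart with a few lines of standard-library Python. The helper names `build_abfss_uri` and `parse_abfss_uri` are hypothetical, and the account, container, and path values are only examples.

```python
from urllib.parse import urlsplit


def build_abfss_uri(account: str, container: str, path: str) -> str:
    """Compose abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def parse_abfss_uri(uri: str) -> tuple[str, str, str]:
    """Split an abfss URI back into (account, container, path)."""
    parts = urlsplit(uri)
    if parts.scheme != "abfss" or not parts.username or not parts.hostname:
        raise ValueError(f"not an abfss URI: {uri}")
    # The account is the first label of the host name.
    account = parts.hostname.split(".", 1)[0]
    # The container sits where a user name would be in an ordinary URL.
    return account, parts.username, parts.path.lstrip("/")


uri = build_abfss_uri("hyfstoragedev", "raw", "weather/2024-01-15.parquet")
print(uri)
print(parse_abfss_uri(uri))
```

Note the order: the container comes before the `@` and the account after it, which is the reverse of how the hierarchy is usually drawn.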

<aside> 🚀 Try it in the widget: Build ADLS URI exercise

</aside>

Practicing the URI structure makes the account, container, and blob hierarchy concrete.

<aside> 🤓 Curious Geek: Why object storage is so cheap

Object storage runs on commodity hardware with built-in redundancy. Your file is automatically replicated across multiple disks (and optionally across regions). Because the service only has to store and return whole objects over HTTP, rather than support in-place edits and a directory tree, providers can offer it for around a couple of cents per gigabyte per month, and less on archive tiers.

</aside>

The same extract-to-blob pattern shows up in production pipelines.

<aside> 💡 In the wild: The Meltano open source ELT platform uses blob storage (S3 and Azure Blob Storage) as the default destination for extracted data. Raw API responses are written as JSON objects to a raw/ prefix, then transformed and loaded into a warehouse. This is the same extract-to-blob pattern your pipeline uses.

</aside>

Knowledge Check

<aside> 🚀 Try it in the widget: Interactive Quiz: Azure Blob Storage

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_6_ch3_azure_blob_storage&embed=1

If the storage account / container / blob hierarchy still feels abstract, this AZ-900 episode walks through the Azure storage services with diagrams and a portal demo.

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

</aside>

https://www.youtube.com/watch?v=_Qlkvd4ZQuo

Extra reading

Ready to apply the concepts? Upload and verify a blob end to end in the next practice exercise.

<aside> ⌨️ Hands on: Practice with Exercise 2: End-to-end blob verification, where you upload a JSON blob from Python and verify it from the CLI.

</aside>


Next up: Azure PostgreSQL Databases, where you connect to a managed Postgres server over SSL, create a table, and upsert rows from Python.