Week 6 - Cloud and Azure Essentials

Introduction to Cloud and Azure

Azure CLI and the Portal

Azure Blob Storage

Azure PostgreSQL Databases

Azure Container Apps Jobs

Cost Awareness

History of Cloud Computing

Week 6 Gotchas & Pitfalls

Practice

Week 6 Assignment: Deploy Your Pipeline to Azure

Week 6 Lesson Plan

Azure Blob Storage

Azure Blob Storage is object storage for the cloud. It stores files (called blobs) of any type: JSON, CSV, images, Parquet, logs. Object storage is one of the core reasons cloud platforms exist. It is cheap, scales to petabytes, and every cloud data pipeline reads from or writes to it.

By the end of this chapter, you should be able to upload files to Azure Blob Storage from Python and verify them in the portal.

Concepts

Object storage vs file system storage

You already know file system storage: it is the Documents/ folder on your laptop. Files live in directories, you can rename them, move them between folders, and edit them in place. That works well for day-to-day use, but it has limits when you need to store data at cloud scale.

Object storage works differently. Each object is stored as a single unit with a key (its name) and metadata. You cannot edit part of an object; you replace the whole thing. There are no real directories, only key prefixes that look like paths (e.g. raw/weather/2024-01-15.json).

                File system                       Object storage
Structure       Directories and files in a tree   Flat namespace with key prefixes
Editing         Edit a file in place              Replace the entire object
Scale           Limited by disk size              Scales to petabytes automatically
Access          Local or network mount            HTTP API (REST calls)
Redundancy      You manage backups                Automatic replication across disks/regions
Cost            Pay for the disk, used or not     Pay per GB stored + requests
Best for        Application files, code, config   Data pipeline output, logs, media, backups
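The "flat namespace with key prefixes" and "replace the entire object" rows are easiest to see in a toy sketch. This is not the Azure SDK, just a plain Python dict standing in for an object store:

```python
# A toy in-memory "object store": a flat mapping from key to bytes.
# Illustration only; the real thing is an HTTP API, not a dict.
store = {}

# "Upload": each object is one unit, stored under a key.
store["raw/weather/2024-01-15.json"] = b'{"temp": 4}'
store["raw/weather/2024-01-16.json"] = b'{"temp": 6}'
store["processed/summary.csv"] = b"date,temp\n"

# There are no directories: "raw/weather/" is just part of the key.
# "Listing a folder" is really filtering keys by prefix.
raw_keys = [key for key in store if key.startswith("raw/")]
print(sorted(raw_keys))

# "Editing" means replacing the entire object, not patching bytes in place.
store["raw/weather/2024-01-15.json"] = b'{"temp": 5}'
```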

Why use object storage?

In Weeks 3 and 4, your pipeline saved output to local files. That works on your machine, but local files are not accessible to other services, not backed up, and not shareable.

Object storage solves this problem. Instead of files on a single disk, your data lives in a central cloud location that any authorized service can read, write, and share.

Amazon launched S3 (Simple Storage Service) in 2006, and it changed how companies think about storing data. Before S3, you had to buy and manage your own file servers. S3 introduced the idea that storage could be a service: you upload objects (files), and the provider handles replication, durability, and scaling. You pay only for what you use.

Every major cloud provider now offers the same concept: Amazon S3, Azure Blob Storage, and Google Cloud Storage. The APIs differ, but the mental model is identical: you put objects in, you get objects out. In this track we use Azure Blob Storage, but if you learn one, the others will feel familiar. For the full history of how object storage replaced file servers, see History of Cloud Computing.

<aside> ๐Ÿค“ Curious Geek: The history of S3

S3 was one of the first AWS services ever launched and is widely considered the starting point of modern cloud computing. It grew from Amazon's internal need to store massive amounts of data reliably. If you want to see how it evolved over the years, check out this AWS S3 history and timeline.

</aside>

What this means for data pipelines: Your pipeline produces output files (JSON, CSV, Parquet) that need to be stored reliably, accessed by other services, and kept for a long time. Object storage is designed for exactly this pattern. You write the file once, read it many times, and never edit it in place. File system storage is designed for interactive use where you open, edit, and save files repeatedly.

Storage accounts, containers, and blobs

Azure Blob Storage is organized in three levels:

Storage account (unique name, like a top-level namespace)
โ””โ”€โ”€ Container (like a folder)
    โ”œโ”€โ”€ raw/weather_2024-01-15.json
    โ”œโ”€โ”€ raw/weather_2024-01-16.json
    โ””โ”€โ”€ processed/summary.csv

<aside> โš ๏ธ "Container" in Azure Blob Storage means a storage container (folder), not a Docker container. The naming is confusing, but context makes it clear.

</aside>

Here is how the hierarchy looks in practice:

<aside> ๐Ÿ–ผ๏ธ Visual: Blob Storage hierarchy

</aside>

Uploading from Python

To interact with Azure Blob Storage from Python, you use the azure-storage-blob package. It is part of the Azure SDK for Python, a collection of libraries maintained by Microsoft that covers every Azure service. The blob storage library gives you classes to create clients, upload and download blobs, list containers, and manage metadata, all from Python instead of the CLI.

Install it with pip:

pip install azure-storage-blob

Upload a file:

import os
from azure.storage.blob import BlobServiceClient

connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # your teacher provides this value
client = BlobServiceClient.from_connection_string(connection_string)

container_client = client.get_container_client("raw")

with open("weather_data.json", "rb") as f:
    container_client.upload_blob(name="weather_2024-01-15.json", data=f, overwrite=True)

That is the core pattern: create a client, pick a container, upload a file.

<aside> ๐Ÿ’ก Your teacher will provide the AZURE_STORAGE_CONNECTION_STRING value. Set it as an environment variable before running the code: export AZURE_STORAGE_CONNECTION_STRING="<value>". In Chapter 5 you will see how to retrieve it from Azure Key Vault.

</aside>
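Because the script reads the connection string from the environment, a missing variable surfaces as a bare KeyError. It can help to check for it up front and fail with a clear hint. A small sketch; the helper name is ours, not part of the SDK:

```python
import os

def get_connection_string() -> str:
    """Return the connection string, or fail with a clear hint."""
    value = os.environ.get("AZURE_STORAGE_CONNECTION_STRING")
    if not value:
        raise RuntimeError(
            "AZURE_STORAGE_CONNECTION_STRING is not set. "
            'Run: export AZURE_STORAGE_CONNECTION_STRING="<value>"'
        )
    return value
```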

Downloading and listing blobs

# List blobs in a container
for blob in container_client.list_blobs():
    print(blob.name)

# Download a blob
blob_client = container_client.get_blob_client("weather_2024-01-15.json")
data = blob_client.download_blob().readall()

<aside> โŒจ๏ธ Hands on: Upload a small JSON file to the shared storage account, then list blobs in the container to verify it appeared.

</aside>

Using the CLI for blob storage

You can also manage blobs from the command line:

# List containers in a storage account
az storage container list \
  --account-name hyfstoragedev \
  --output table

# Upload a file
az storage blob upload \
  --account-name hyfstoragedev \
  --container-name raw \
  --file weather_data.json \
  --name weather_2024-01-15.json

# List blobs
az storage blob list \
  --account-name hyfstoragedev \
  --container-name raw \
  --output table

Organizing blobs

Use a clear naming convention so you can find files later:

raw/weather/2024-01-15.json
raw/weather/2024-01-16.json
processed/weather/summary.csv

Prefixes like raw/ and processed/ act like folders. Azure Blob Storage is flat (no real directories), but prefixes give you logical structure.
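One way to apply a convention like this is to build blob keys in a single helper instead of formatting strings all over your pipeline. A sketch; the function name and key layout are just one possible convention:

```python
from datetime import date

def blob_key(stage: str, source: str, run_date: date, ext: str = "json") -> str:
    """Build a blob key like raw/weather/2024-01-15.json."""
    return f"{stage}/{source}/{run_date.isoformat()}.{ext}"

print(blob_key("raw", "weather", date(2024, 1, 15)))
# raw/weather/2024-01-15.json
```

With every key built the same way, listing one day's raw files is a single prefix filter, and a typo in a prefix becomes a bug in one function instead of a lost file.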

<aside> ๐Ÿ’ก Using AI to help: Ask an LLM to suggest a blob naming convention for your data source. (โš ๏ธ Ensure no PII or sensitive company data is included!)

</aside>

Blob naming is flexible, but a consistent convention saves you debugging time later.

<aside> ๐Ÿค“ Curious Geek: Why object storage is so cheap

Object storage uses commodity hardware with built-in redundancy. Your file is automatically replicated across multiple disks (and optionally across regions). Because there is no file system overhead, providers can offer storage at fractions of a cent per gigabyte per month.

</aside>

Exercises

  1. Upload a JSON file to the shared storage account using Python.
  2. List all blobs in your container using the CLI.
  3. Download the file you uploaded and verify the contents match.
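For exercise 3, one simple way to check that the contents match is to compare checksums. A sketch, assuming you already have the local file's bytes and the bytes returned by download_blob().readall(); the byte strings below are stand-ins:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest of a byte string, for comparing file contents."""
    return hashlib.sha256(data).hexdigest()

# Stand-ins for the real bytes: read your local file with open(..., "rb"),
# and get the downloaded bytes from blob_client.download_blob().readall().
local_bytes = b'{"temp": 5}'
downloaded_bytes = b'{"temp": 5}'

assert sha256_hex(local_bytes) == sha256_hex(downloaded_bytes)
print("contents match")
```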

<aside> ๐Ÿ’ก In the wild: The Meltano open source ELT platform uses blob storage (S3 and Azure Blob Storage) as the default destination for extracted data. Raw API responses are written as JSON objects to a raw/ prefix, then transformed and loaded into a warehouse. This is the same extract-to-blob pattern your pipeline uses.

</aside>

๐Ÿง  Knowledge Check

  1. What is the difference between a storage account, a container, and a blob? How do they relate to each other?
  2. Why does object storage use "replace the whole object" instead of "edit in place" like a file system? What advantage does this give data pipelines?
  3. Your upload script runs without errors, but when you list blobs in the container, your file is not there. What are two possible causes?

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 (https://hackyourfuture.net/)


Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.