Week 6 - Cloud and Azure Essentials
Introduction to Cloud and Azure
Azure Container Apps is a managed platform to run containers without managing servers. For data pipelines, Container Apps offers jobs: containers that run on demand or on a schedule, then exit. That is exactly what a data pipeline needs.
By the end of this chapter, you should understand why serverless computing matters for data pipelines, know the difference between a Container App and a Container App Job, be able to create and run a job from your ACR image, and verify its output end to end.
In Week 5, you built a Docker image for your pipeline. You can run it locally with docker run, but your laptop is not always on, and manual runs do not scale. If the pipeline needs to run every morning at 06:00, someone (or something) has to start it.
Data pipelines have a specific pattern: start, process data, write output, stop. They are not like web servers that need to stay running and wait for requests. You need a service that can run your container on demand or on a schedule, then shut it down when it finishes.
That is exactly what Azure Container Apps Jobs provides: give it your Docker image, tell it when to run, and it handles the rest.
Container Apps Jobs is a serverless container service: you provide a container image, define a trigger, and the platform handles provisioning, execution, and teardown. You pay only for the seconds your code runs. The term "serverless" is misleading: servers still exist, but the cloud provider manages them, not you.
Computing went through several steps to get here: bare metal → VMs → containers → serverless. If you want the full story, see History of Cloud Computing.
Your weather pipeline runs for about 30 seconds: it fetches data, transforms it, writes to Postgres, and exits. On a VM, you would pay for 24 hours to run 30 seconds of work. With Container Apps Jobs, you pay for those 30 seconds (and Azure has a generous free tier: 180,000 vCPU-seconds/month, so small pipelines cost nothing).
This is the same IaaS-to-PaaS shift you saw in Introduction to Cloud and Azure: instead of managing infrastructure yourself, you let the platform handle it. Every cloud provider has an equivalent service (AWS Fargate, Google Cloud Run); the concept is the same.
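The cost comparison above can be worked through as plain arithmetic. This is a rough sketch, not a billing calculator: the 30-second run time and the 180,000 vCPU-second free grant come from the text, while the one-run-per-day schedule, the 30-day month, and the 0.5 vCPU allocation are assumptions chosen for illustration.

```python
# Back-of-the-envelope comparison: always-on VM vs. a job billed per second.
SECONDS_PER_DAY = 24 * 60 * 60

RUN_SECONDS = 30             # from the text: one pipeline run takes about 30 seconds
DAYS_PER_MONTH = 30          # assumption: a 30-day month with one run per day
VCPU_PER_RUN = 0.5           # assumption: hypothetical CPU allocation for the job
FREE_VCPU_SECONDS = 180_000  # monthly free grant quoted in the text

vm_seconds_billed = SECONDS_PER_DAY * DAYS_PER_MONTH   # the VM is billed around the clock
job_seconds_billed = RUN_SECONDS * DAYS_PER_MONTH      # the job is billed only while running
job_vcpu_seconds = job_seconds_billed * VCPU_PER_RUN
free_tier_share = job_vcpu_seconds / FREE_VCPU_SECONDS

print(f"VM billed seconds per month:  {vm_seconds_billed:,}")
print(f"Job billed seconds per month: {job_seconds_billed:,}")
print(f"Job vCPU-seconds per month:   {job_vcpu_seconds:,.0f}")
print(f"Share of free grant used:     {free_tier_share:.2%}")
```

Under these assumptions the job uses a fraction of one percent of the monthly free grant, which is why small pipelines cost nothing.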
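Container Apps supports two modes:
App: a long-running service, such as a web server, that stays up and waits for requests.
Job: a container that starts on demand or on a schedule, runs to completion, and exits.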
Your data pipeline is not a web server. It runs, ingests data, writes to storage, and stops. That makes it a job.
<aside> 🖼️ Visual: Container Apps – App vs Job lifecycle
</aside>
Before you can create a job, you need a Container Apps environment. This is a shared hosting layer that manages networking and logging for all your apps and jobs. Your teacher has already created a shared environment for the class, so you do not need to create one yourself.
Before creating a job, confirm that your image is in Azure Container Registry (ACR), which was introduced in Azure Container Registry (Week 5). In Week 5 you pushed to a repository named after your GitHub handle (lowercase), not a shared class repo. It does not matter whether you pushed manually (docker push) or through CI (GitHub Actions). List the tags in your repository:
az acr repository show-tags \
--name hyfregistry \
--repository <your-handle>-weather-pipeline \
--output table
You should see the tag you pushed (e.g. 1.0 or a commit SHA). Use that repository and tag in the --image flag below.
<aside>
💡 Fallback image: If your Week 5 image is missing or fails to pull, your teacher maintains a reference image at hyfregistry.azurecr.io/weather-pipeline:assignment. Use it to learn the Container Apps Job workflow, then switch back to your own image for the assignment deliverable. Verify it exists with az acr repository show-tags --name hyfregistry --repository weather-pipeline -o table.
</aside>
On your laptop, Week 5 used docker run --env-file .env so the container could read POSTGRES_URL and other secrets from a file on disk. That file does not exist on Azure's servers. In the cloud you must inject the same variables explicitly:
Local: uv run python …
Cloud: --env-vars at job create time
If you skip --env-vars, the job starts with an empty environment and crashes on the first os.environ[...] lookup, even when the same code worked locally.
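One way to make a missing variable easier to diagnose is to check for all of them at startup and fail with a single readable message. The sketch below is a minimal illustration, not the course's reference implementation: the variable names come from this chapter, while the `load_config` function and the choice to treat LOG_LEVEL as optional are assumptions.

```python
import os

# Variable names taken from the text; LOG_LEVEL is treated as optional here (assumption).
REQUIRED_VARS = ("POSTGRES_URL", "AZURE_STORAGE_CONNECTION_STRING")


def load_config(environ=None):
    """Return the pipeline settings, or raise one clear error listing what is missing."""
    env = os.environ if environ is None else environ
    # Treat unset and empty values the same way: both are unusable.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Pass them with --env-vars when creating the job."
        )
    config = {name: env[name] for name in REQUIRED_VARS}
    config["LOG_LEVEL"] = env.get("LOG_LEVEL", "INFO")
    return config
```

Calling `load_config()` as the first step of the pipeline turns a bare KeyError deep in the code into an error message that names every missing variable at once.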
<aside> 🖼️ Visual: Local vs cloud environment variables
</aside>
See also Three things called "container" if "container" terminology is still fuzzy.
Before creating a job, you need the actual values for POSTGRES_URL and AZURE_STORAGE_CONNECTION_STRING. Your teacher stores these in Azure Key Vault. Ask your teacher for the values, or if you have Key Vault access, retrieve them with the CLI:
# Postgres connection string
az keyvault secret show \
--vault-name kv-hyf-data \
--name postgres-url \
--query value -o tsv
# Storage account connection string
az keyvault secret show \
--vault-name kv-hyf-data \
--name storage-connection-string \
--query value -o tsv
Pass the values as --env-vars when you create the job (see below).
Use the pre-provisioned environment and resource group to create a job that points to your ACR image:
az containerapp job create \
--name <your-handle>-weather-job \
--resource-group rg-hyf-data \
--environment env-hyf-data \
--image hyfregistry.azurecr.io/<your-handle>-weather-pipeline:<tag> \
--registry-server hyfregistry.azurecr.io \
--trigger-type Manual \
--replica-timeout 300 \
--replica-retry-limit 0 \
--env-vars \
POSTGRES_URL="$(az keyvault secret show --vault-name kv-hyf-data --name postgres-url --query value -o tsv)" \
AZURE_STORAGE_CONNECTION_STRING="$(az keyvault secret show --vault-name kv-hyf-data --name storage-connection-string --query value -o tsv)" \
LOG_LEVEL=INFO
Replace <your-handle> with your lowercase GitHub username and <tag> with the tag from show-tags above. Pick a unique job name per student so you do not overwrite a classmate's job.
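The placeholder substitution described above can be sketched as a small helper that derives both names from a GitHub handle. The `job_names` function is hypothetical, and the naming rule it checks (lowercase letters, digits, and hyphens, starting with a letter, no double hyphens, at most 32 characters) is an assumption about what Container Apps accepts; check the CLI error message if a real name is rejected.

```python
import re

REGISTRY = "hyfregistry.azurecr.io"  # registry host used throughout this chapter

# Assumption: job names use lowercase letters, digits, and hyphens, start with
# a letter, end with a letter or digit, and are at most 32 characters long.
JOB_NAME_PATTERN = re.compile(r"^[a-z]([a-z0-9-]*[a-z0-9])?$")


def job_names(github_handle, tag):
    """Build the job name and image reference from a GitHub handle and image tag."""
    handle = github_handle.strip().lower()
    job_name = f"{handle}-weather-job"
    image = f"{REGISTRY}/{handle}-weather-pipeline:{tag}"
    if not JOB_NAME_PATTERN.match(job_name) or "--" in job_name or len(job_name) > 32:
        raise ValueError(f"{job_name!r} is unlikely to be accepted as a job name")
    return job_name, image
```

For example, `job_names("Ada-L", "1.0")` lowercases the handle before building the `--name` and `--image` values, which avoids a mismatch with the repository you pushed to in Week 5.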
Key flags:
--registry-server: tells Container Apps where to pull the image from. The CLI will look up credentials for your ACR automatically.
--trigger-type Manual: you start the job yourself (or from CI). Other options include Schedule for cron-based runs.
--replica-timeout 300: the job is killed after 300 seconds if it has not finished.
--env-vars: pass configuration the same way you used -e with docker run in Week 5.
The two flags that most often trip students up are --registry-server (image pull falls back to Docker Hub without it) and --replica-timeout (a stuck job is never killed without it). The next exercise gives you practice spotting them.
<aside> 🚀 Try it in the widget: Validate Container App Job create command
</aside>
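If you want the same kind of check outside the widget, a few lines of Python can scan a drafted command for the two flags singled out above. This is a rough sketch: `missing_flags` is a hypothetical helper that only looks for flag names, and it does not validate their values or call Azure.

```python
import shlex

# Flags the text singles out as easy to forget.
REQUIRED_FLAGS = ("--registry-server", "--replica-timeout")


def missing_flags(command):
    """Return the required flags that do not appear in an az CLI command string."""
    # Join backslash-continued lines so a multi-line command parses as one.
    tokens = shlex.split(command.replace("\\\n", " "))
    # Accept both "--flag value" and "--flag=value" spellings.
    present = {token.split("=", 1)[0] for token in tokens if token.startswith("--")}
    return [flag for flag in REQUIRED_FLAGS if flag not in present]
```

An empty list means both flags are present; anything else names the flag to add before you run the command.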
Container App Jobs are the one resource you create yourself in this track. Unlike the pre-provisioned infrastructure, every job you create costs money while it runs. Jobs are billed per vCPU-second and GiB-second of active execution time. The consumption plan includes a generous free tier (180,000 vCPU-seconds/month), but careless usage can exceed it.
Follow these rules:
Always set --replica-timeout so a stuck job is killed automatically. 300 seconds is a good default for data pipelines.
Delete jobs you no longer need with az containerapp job delete --name <name> --resource-group <rg>.
Do not leave scheduled jobs (--trigger-type Schedule) running after class.
Update your existing job with az containerapp job update rather than creating a new one every time.
Your teacher has budget alerts and policies on the subscription to catch runaway costs, but being mindful of resource usage is a professional habit worth building now.
Start the job:
az containerapp job start \
--name <your-handle>-weather-job \
--resource-group rg-hyf-data
Then check the execution history:
az containerapp job execution list \
--name <your-handle>-weather-job \
--resource-group rg-hyf-data \
--output table
This shows whether each run succeeded or failed, and how long it took.
<aside> ⌨️ Hands on: Create a job from your Week 5 image, start it, and check the execution history.
</aside>
Container App logs are available in the Azure portal under Log stream, or through the CLI. Use the portal Log stream when you want to watch a live execution scroll by; use the CLI in scripts and when iterating quickly on a job:
az containerapp job logs show \
--name <your-handle>-weather-job \
--resource-group rg-hyf-data \
--container <your-handle>-weather-job
<aside>
💡 The --container flag is required. By default, the container name matches the job name you passed to az containerapp job create (not the image repository name). If the job already exited, the log stream may be unavailable. Check the Azure portal Log stream instead.
</aside>
Use logging in your Python code (not print) so log messages include timestamps and levels (see Logging Basics, Week 1).
Azure SDK libraries (like azure-storage-blob) log HTTP request details at the INFO level, which can make your logs very noisy. Silence them by raising their log level in your pipeline setup:
logging.getLogger("azure").setLevel(logging.WARNING)
Place this right after your logging.basicConfig(...) call. This keeps your own INFO messages visible while hiding the SDK's HTTP headers and request details.
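Put together, the logging setup might look like the sketch below. It assumes the level comes from the LOG_LEVEL variable passed with --env-vars; the `configure_logging` function, the log format, and the "pipeline" logger name are illustrative choices, not part of the assignment.

```python
import logging
import os


def configure_logging():
    """Set up root logging from LOG_LEVEL, then quiet the Azure SDK loggers."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back when LOG_LEVEL is not a known level name
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers installed by an earlier basicConfig call
    )
    # Child loggers such as azure.storage.blob inherit this level.
    logging.getLogger("azure").setLevel(logging.WARNING)


configure_logging()
logging.getLogger("pipeline").info("Pipeline starting")
```

Because the "azure" logger is the parent of every SDK logger, one call is enough to quiet all of them while your own loggers keep the level you configured.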
If you are unsure which flags to use, draft the command with an LLM first.
<aside> 💡 Using AI to help: Paste a deployment error and the related Azure CLI command into an LLM and ask for possible causes. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Verify the suggested flags with az containerapp job create --help.
<aside> 📘 Core program connection: You learned continuous deployment in the Systems week. Deploying a container job is the Python data pipeline version of that idea. Review here: https://www.notion.so/hackyourfuture/Continuous-deployment-2af50f64ffc981cebd04d416a58af7f6
</aside>
After the job finishes, check three things: the execution status and logs, the blob written to storage, and the rows written to Postgres.
# Check blob storage
az storage blob list \
--account-name hyfstoragedev \
--container-name raw \
--output table
# Connect to Postgres and verify
export POSTGRES_URL="$(az keyvault secret show --vault-name kv-hyf-data --name postgres-url --query value -o tsv)"
psql "$POSTGRES_URL" -c "SELECT COUNT(*) FROM weather_readings"
<aside> ⌨️ Hands on: Add a log line to your pipeline that prints the number of rows written and the blob name uploaded, then deploy and confirm you see it in the logs.
</aside>
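A summary line for the exercise above could look like the following sketch. The `log_run_summary` function, the "pipeline" logger name, and the example values are hypothetical; adapt them to however your pipeline tracks its row count and blob name.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def log_run_summary(rows_written, blob_name):
    """Emit one INFO line that is easy to find in the job's log output."""
    # Lazy %-style arguments let logging skip formatting when INFO is disabled.
    logger.info("Run complete: rows_written=%d blob_name=%s", rows_written, blob_name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    log_run_summary(24, "raw/weather-example.json")  # illustrative values only
```

Keeping the summary on a single line with key=value pairs makes it easy to find in the Log stream and to compare against the row count you query from Postgres.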
When you push a new image version to ACR, update the job instead of creating a new one:
az containerapp job update \
--name <your-handle>-weather-job \
--resource-group rg-hyf-data \
--image hyfregistry.azurecr.io/<your-handle>-weather-pipeline:<new-tag>
Then start it again to run the updated code.
Week 6 passes connection strings as --env-vars on the job. In production, teams often wire Key Vault secret references and Managed Identity so the container never embeds raw credentials in its definition.
<aside>
💡 Keep secrets out of your code and CI logs. Environment variables set through az containerapp job create are stored in Azure, not in your repository.
</aside>
For production pipelines, you often want the job to run on a schedule:
az containerapp job create \
--name <your-handle>-weather-job \
--resource-group rg-hyf-data \
--environment env-hyf-data \
--image hyfregistry.azurecr.io/<your-handle>-weather-pipeline:<tag> \
--registry-server hyfregistry.azurecr.io \
--trigger-type Schedule \
--cron-expression "0 6 * * *" \
--replica-timeout 300 \
--replica-retry-limit 0 \
--env-vars \
POSTGRES_URL="$(az keyvault secret show --vault-name kv-hyf-data --name postgres-url --query value -o tsv)" \
AZURE_STORAGE_CONNECTION_STRING="$(az keyvault secret show --vault-name kv-hyf-data --name storage-connection-string --query value -o tsv)" \
LOG_LEVEL=INFO
This runs the job every day at 06:00 UTC. The assignment pipeline writes to both Blob Storage and Postgres, so pass both connection strings even for scheduled jobs. Scheduling is explored in more depth later in the track when we introduce orchestration with Airflow.
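To read the cron expression used above, it helps to label its five fields. The sketch below only splits and names the fields under the standard five-field cron layout; `describe_cron` is a hypothetical helper and does not validate ranges or special syntax.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into a field-name -> value mapping."""
    values = expression.split()
    if len(values) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(values)}")
    return dict(zip(CRON_FIELDS, values))


schedule = describe_cron("0 6 * * *")
print(schedule)
```

Minute 0 of hour 6, with wildcards for day, month, and weekday, is what produces a daily 06:00 run.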
<aside> 🤓 Curious Geek: Scale to zero
</aside>
The pay-per-second model is also how production orchestrators wire batch workloads into Azure.
<aside> 💡 In the wild: Dagster, an open source data orchestrator, supports Azure Container Apps Jobs as a compute backend. Teams define pipelines in Python and Dagster launches them as container jobs on demand. The pattern is the same as what you do manually here: push an image, create a job with env vars, trigger it, and check logs.
</aside>
Why should you set --replica-timeout when creating a job? What happens if you skip it and your pipeline hangs?