Week 6 - Cloud and Azure Essentials
Introduction to Cloud and Azure
Week 6 Assignment: Deploy Your Pipeline to Azure
Azure Container Apps is a managed platform to run containers without managing servers. For data pipelines, Container Apps offers jobs: containers that run on demand or on a schedule, then exit. That is exactly what a data pipeline needs.
By the end of this chapter, you should understand why serverless computing matters for data pipelines, know the difference between a Container App and a Container App Job, be able to create and run a job from your ACR image, and verify its output end to end.
In Week 5, you built a Docker image for your pipeline. You can run it locally with docker run, but your laptop is not always on, and manual runs do not scale. If the pipeline needs to run every morning at 06:00, someone (or something) has to start it.
Data pipelines have a specific pattern: start, process data, write output, stop. They are not like web servers that need to stay running and wait for requests. You need a service that can run your container on demand or on a schedule, then shut it down when it finishes.
That is exactly what Azure Container Apps Jobs provides: give it your Docker image, tell it when to run, and it handles the rest.
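To make the pattern concrete, here is a minimal run-to-completion sketch (the pipeline body is a placeholder, not your actual Week 5 code):

# Minimal sketch of the run-to-completion pattern: the process starts, does its
# work once, reports what it did, and exits. There is no server loop waiting
# for requests -- when main() returns, the container stops.
import logging
import sys


def run_pipeline() -> int:
    """Placeholder for the real fetch -> transform -> load steps."""
    records = [{"city": "Amsterdam", "temp_c": 12.5}]  # pretend we fetched and transformed data
    return len(records)


def main() -> int:
    logging.basicConfig(level=logging.INFO)
    rows_written = run_pipeline()
    logging.info("Pipeline finished, wrote %d rows", rows_written)
    return 0  # the container exits here


if __name__ == "__main__":
    sys.exit(main())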
Container Apps Jobs is a serverless container service: you provide a container image, define a trigger, and the platform handles provisioning, execution, and teardown. You pay only for the seconds your code runs. The term "serverless" is misleading (servers still exist), but the cloud provider manages them, not you.
Computing went through several steps to get here: bare metal → VMs → containers → serverless. If you want the full story, see History of Cloud Computing.
Your weather pipeline runs for about 30 seconds: it fetches data, transforms it, writes to Postgres, and exits. On a VM, you would pay for 24 hours to run 30 seconds of work. With Container Apps Jobs, you pay for those 30 seconds (and Azure has a generous free tier: 180,000 vCPU-seconds/month, so small pipelines cost nothing).
This is the same IaaS-to-PaaS shift you saw in Chapter 1: instead of managing infrastructure yourself, you let the platform handle it. Every cloud provider has an equivalent service (AWS Fargate, Google Cloud Run); the concept is the same.
Container Apps supports two modes:

- Apps: long-running containers that stay up to serve requests, like APIs and web apps.
- Jobs: containers that run to completion when triggered, manually or on a schedule, and then stop.
Your data pipeline is not a web server. It runs, ingests data, writes to storage, and stops. That makes it a job.
<aside> 🖼️ Visual: Container Apps – App vs Job lifecycle
</aside>
Before you can create a job, you need a Container Apps environment. This is a shared hosting layer that manages networking and logging for all your apps and jobs. Your teacher has already created a shared environment for the class, so you do not need to create one yourself.
Before creating a job, confirm that your image from Week 5 is in the registry. It does not matter whether you pushed it manually (docker push) or through CI (GitHub Actions). Both end up in the same registry:
az acr repository show-tags \
  --name hyfregistry \
  --repository weather-pipeline \
  --output table
You should see the tag you pushed (e.g. 1.0 or a commit SHA). Use that tag in the --image flag below.
Before creating a job, you need the actual values for POSTGRES_URL and AZURE_STORAGE_CONNECTION_STRING. Your teacher stores these in Azure Key Vault. Ask your teacher for the values, or if you have Key Vault access, retrieve them with the CLI:
# Postgres connection string
az keyvault secret show \
  --vault-name kv-hyf-data \
  --name postgres-url \
  --query value -o tsv

# Storage account connection string
az keyvault secret show \
  --vault-name kv-hyf-data \
  --name storage-connection-string \
  --query value -o tsv
You will learn how Key Vault works later in the track. For now, copy the values and use them as --env-vars when creating your job.
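Inside the container these values arrive as ordinary environment variables, exactly like with docker run -e. As a minimal sketch (the helper function is illustrative, not part of the starter code), your pipeline might read them like this:

# Sketch: read the configuration that --env-vars injects into the container.
# The variable names match the ones passed to az containerapp job create;
# failing fast with a clear error beats a confusing crash later in the run.
import os


def get_required_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


postgres_url = get_required_env("POSTGRES_URL")
storage_connection_string = get_required_env("AZURE_STORAGE_CONNECTION_STRING")
log_level = os.environ.get("LOG_LEVEL", "INFO")  # optional, defaults to INFO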
Use the pre-provisioned environment and resource group to create a job that points to your ACR image:
az containerapp job create \
  --name weather-job \
  --resource-group rg-weather-dev \
  --environment env-weather-dev \
  --image hyfregistry.azurecr.io/weather-pipeline:1.0 \
  --registry-server hyfregistry.azurecr.io \
  --trigger-type Manual \
  --replica-timeout 300 \
  --replica-retry-limit 0 \
  --env-vars \
    POSTGRES_URL="<connection-string>" \
    AZURE_STORAGE_CONNECTION_STRING="<storage-connection-string>" \
    LOG_LEVEL=INFO
Key flags:
- --registry-server: tells Container Apps where to pull the image from. The CLI will look up credentials for your ACR automatically.
- --trigger-type Manual: you start the job yourself (or from CI). Other options include Schedule for cron-based runs.
- --replica-timeout 300: the job is killed after 300 seconds if it has not finished.
- --env-vars: pass configuration the same way you used -e with docker run in Week 5.

Container App Jobs are the one resource you create yourself in this track. Unlike the pre-provisioned infrastructure, every job you create costs money while it runs. Jobs are billed per vCPU-second and GiB-second of active execution time. The consumption plan includes a generous free tier (180,000 vCPU-seconds/month), but careless usage can exceed it.
Follow these rules:
- Always set --replica-timeout so a stuck job is killed automatically. 300 seconds is a good default for data pipelines.
- Delete jobs you no longer need: az containerapp job delete --name <name> --resource-group <rg>.
- Do not leave scheduled jobs (--trigger-type Schedule) running after class.
- When you push a new image version, use az containerapp job update rather than creating a new job every time.

Your teacher has budget alerts and policies on the subscription to catch runaway costs, but being mindful of resource usage is a professional habit worth building now.
Start the job manually:

az containerapp job start \
  --name weather-job \
  --resource-group rg-weather-dev

Then list its executions to check the result:

az containerapp job execution list \
  --name weather-job \
  --resource-group rg-weather-dev \
  --output table
This shows whether each run succeeded or failed, and how long it took.
<aside> ⌨️ Hands on: Create a job from your Week 5 image, start it, and check the execution history.
</aside>
Container App logs are available in the Azure portal under Log stream, or through the CLI:
az containerapp job logs show \
  --name weather-job \
  --resource-group rg-weather-dev \
  --container weather-job
<aside>
💡 The --container flag is required. By default, the container name matches the job name you passed to az containerapp job create. If the job already exited, the log stream may be unavailable. Check the Azure portal Log stream instead.
</aside>
Use logging in your Python code (not print) so log messages include timestamps and levels (see Logging Basics, Week 1).
Azure SDK libraries (like azure-storage-blob) log HTTP request details at the INFO level, which can make your logs very noisy. Silence them by raising their log level in your pipeline setup:
logging.getLogger("azure").setLevel(logging.WARNING)
Place this right after your logging.basicConfig(...) call. This keeps your own INFO messages visible while hiding the SDK's HTTP headers and request details.
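Putting the two together, a minimal logging setup at the top of your pipeline could look like this (the format string is just one reasonable choice):

# Logging setup for the pipeline: timestamps and levels on every message,
# with the Azure SDK's chatty HTTP logging raised to WARNING so it stays quiet.
import logging
import os

logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logging.getLogger("azure").setLevel(logging.WARNING)

logger = logging.getLogger(__name__)
logger.info("Logging configured")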
If you are unsure which flags to use, draft the command with an LLM first.
<aside> 💡 Using AI to help: Paste a deployment error and the related Azure CLI command into an LLM and ask for possible causes. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Verify the suggested flags with az containerapp job create --help.
<aside> 📘 Core program connection: You learned continuous deployment in the Systems week. Deploying a container job is the Python data pipeline version of that idea. Review here: https://www.notion.so/hackyourfuture/Continuous-deployment-2af50f64ffc981cebd04d416a58af7f6
</aside>
After the job finishes, check three things: the execution history shows a success (above), the raw file landed in blob storage, and the rows arrived in Postgres:
# Check blob storage
az storage blob list \
  --account-name hyfstoragedev \
  --container-name raw \
  --output table

# Connect to Postgres and verify
psql "postgresql://pipeline_user:<PASSWORD>@<host>:5432/weather?sslmode=require" \
  -c "SELECT COUNT(*) FROM weather_readings"
<aside> ⌨️ Hands on: Add a log line to your pipeline that prints the number of rows written and the blob name uploaded, then deploy and confirm you see it in the logs.
</aside>
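A sketch of what that log line could look like; rows_written and blob_name stand in for values your pipeline already computes:

# Sketch for the hands-on: log what the run produced so you can verify it
# from the Container Apps logs alone. rows_written and blob_name are
# placeholders for values your pipeline already has at this point.
import logging

logger = logging.getLogger(__name__)

rows_written = 42                          # e.g. the count returned by your load step
blob_name = "2024-06-01-weather.json"      # e.g. the name you uploaded to the raw container

logger.info("Wrote %d rows to Postgres, uploaded blob %s", rows_written, blob_name)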
When you push a new image version to ACR, update the job instead of creating a new one:
az containerapp job update \
  --name weather-job \
  --resource-group rg-weather-dev \
  --image hyfregistry.azurecr.io/weather-pipeline:1.1
Then start it again to run the updated code.
Right now you pass connection strings as plain environment variables. That works for learning, but in production you would store secrets in Azure Key Vault and use Managed Identity so your container never handles raw credentials. You will learn Key Vault later in the track. For now, environment variables are fine.
<aside>
💡 Keep secrets out of your code and CI logs. Environment variables set through az containerapp job create are stored in Azure, not in your repository.
</aside>
For production pipelines, you often want the job to run on a schedule:
az containerapp job create \
  --name weather-job \
  --resource-group rg-weather-dev \
  --environment env-weather-dev \
  --image hyfregistry.azurecr.io/weather-pipeline:1.0 \
  --registry-server hyfregistry.azurecr.io \
  --trigger-type Schedule \
  --cron-expression "0 6 * * *" \
  --env-vars POSTGRES_URL="<connection-string>"
This runs the job every day at 06:00 UTC. You will explore scheduling in more depth in Week 11 (Orchestration with Airflow).
<aside> 🤓 Curious Geek: Scale to zero
</aside>
<aside> 💡 In the wild: Dagster, an open source data orchestrator, supports Azure Container Apps Jobs as a compute backend. Teams define pipelines in Python and Dagster launches them as container jobs on demand. The pattern is the same as what you do manually here: push an image, create a job with env vars, trigger it, and check logs.
</aside>
Why should you always set --replica-timeout when creating a job? What happens if you skip it and your pipeline hangs?