Week 6 - Cloud and Azure Essentials

Introduction to Cloud and Azure

Azure CLI and the Portal

Azure Blob Storage

Azure PostgreSQL Databases

Azure Container Apps Jobs

Cost Awareness

History of Cloud Computing

Week 6 Gotchas & Pitfalls

Practice

Week 6 Assignment: Deploy Your Pipeline to Azure

Week 6 Lesson Plan

Azure Container Apps Jobs

Azure Container Apps is a managed platform to run containers without managing servers. For data pipelines, Container Apps offers jobs: containers that run on demand or on a schedule, then exit. That is exactly what a data pipeline needs.

By the end of this chapter, you should understand why serverless computing matters for data pipelines, know the difference between a Container App and a Container App Job, be able to create and run a job from your ACR image, and verify its output end to end.

Concepts

Why data pipelines need on-demand containers

In Week 5, you built a Docker image for your pipeline. You can run it locally with docker run, but your laptop is not always on, and manual runs do not scale. If the pipeline needs to run every morning at 06:00, someone (or something) has to start it.

Data pipelines have a specific pattern: start, process data, write output, stop. They are not like web servers that need to stay running and wait for requests. You need a service that can run your container on demand or on a schedule, then shut it down when it finishes.
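The start-process-write-stop pattern can be sketched as a run-to-completion script. This is an illustrative skeleton, not your actual pipeline: the function bodies are stand-ins, and the real fetch/write steps would call the weather API, Blob Storage, and Postgres.

```python
import sys
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

def fetch():
    # Stand-in for the real API call.
    return [{"city": "Amsterdam", "temp_c": 12.5}]

def transform(rows):
    # Keep only well-formed records.
    return [r for r in rows if "temp_c" in r]

def main() -> int:
    rows = transform(fetch())
    log.info("processed %d rows", len(rows))
    # In the real pipeline: write to blob storage and Postgres here.
    return 0  # exit code 0 signals success

if __name__ == "__main__":
    sys.exit(main())
```

The exit code matters: Container Apps Jobs uses the container's exit code to decide whether an execution succeeded or failed, so a pipeline should exit non-zero on error rather than swallowing exceptions.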

That is exactly what Azure Container Apps Jobs provides: give it your Docker image, tell it when to run, and it handles the rest.

How Container Apps Jobs works

Container Apps Jobs is a serverless container service: you provide a container image, define a trigger, and the platform handles provisioning, execution, and teardown. You pay only for the seconds your code runs. The term "serverless" is misleading (servers still exist), but the cloud provider manages them, not you.

Computing went through several steps to get here: bare metal → VMs → containers → serverless. If you want the full story, see History of Cloud Computing.

Your weather pipeline runs for about 30 seconds: it fetches data, transforms it, writes to Postgres, and exits. On a VM, you would pay for 24 hours to run 30 seconds of work. With Container Apps Jobs, you pay for those 30 seconds (and Azure has a generous free tier: 180,000 vCPU-seconds/month, so small pipelines cost nothing).
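A quick back-of-the-envelope check, assuming one 30-second run per day on a single vCPU (the numbers from the example above):

```python
runs_per_month = 30      # one run per day
seconds_per_run = 30     # pipeline runtime
vcpus = 1                # vCPUs allocated to the job

used = runs_per_month * seconds_per_run * vcpus   # 900 vCPU-seconds
free_tier = 180_000                               # free vCPU-seconds per month

print(f"{used} of {free_tier} free vCPU-seconds used "
      f"({used / free_tier:.2%})")  # well under 1% of the free tier
```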

This is the same IaaS-to-PaaS shift you saw in Chapter 1: instead of managing infrastructure yourself, you let the platform handle it. Every cloud provider has an equivalent service (AWS Fargate, Google Cloud Run); the concept is the same.

Apps vs Jobs

Container Apps supports two modes:

  - Apps: long-running containers that stay up to serve requests (web servers, APIs). They scale with traffic and are expected to keep running.
  - Jobs: containers that start on a trigger (manual, scheduled, or event-based), run to completion, and exit.

Your data pipeline is not a web server. It runs, ingests data, writes to storage, and stops. That makes it a job.

<aside> 🖼️ Visual: Container Apps – App vs Job lifecycle

</aside>

The Container Apps environment

Before you can create a job, you need a Container Apps environment. This is a shared hosting layer that manages networking and logging for all your apps and jobs. Your teacher has already created a shared environment for the class, so you do not need to create one yourself.

Verify your image in ACR

Before creating a job, confirm that your image from Week 5 is in the registry. It does not matter whether you pushed it manually (docker push) or through CI (GitHub Actions). Both end up in the same registry:

az acr repository show-tags \
  --name hyfregistry \
  --repository weather-pipeline \
  --output table

You should see the tag you pushed (e.g. 1.0 or a commit SHA). Use that tag in the --image flag below.

Getting your connection strings

Before creating a job, you need the actual values for POSTGRES_URL and AZURE_STORAGE_CONNECTION_STRING. Your teacher stores these in Azure Key Vault. Ask your teacher for the values, or if you have Key Vault access, retrieve them with the CLI:

# Postgres connection string
az keyvault secret show \
  --vault-name kv-hyf-data \
  --name postgres-url \
  --query value -o tsv

# Storage account connection string
az keyvault secret show \
  --vault-name kv-hyf-data \
  --name storage-connection-string \
  --query value -o tsv

You will learn how Key Vault works later in the track. For now, copy the values and use them as --env-vars when creating your job.
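Inside the pipeline, it pays to read these values from the environment and fail fast when one is missing, so a misconfigured job dies at startup with a clear message instead of halfway through a run. A minimal sketch, using the variable names from this chapter:

```python
import os

REQUIRED = ["POSTGRES_URL", "AZURE_STORAGE_CONNECTION_STRING"]

def load_config() -> dict:
    """Collect required settings, raising a clear error if any are absent."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED}
```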

Create a job from ACR

Use the pre-provisioned environment and resource group to create a job that points to your ACR image:

az containerapp job create \
  --name weather-job \
  --resource-group rg-weather-dev \
  --environment env-weather-dev \
  --image hyfregistry.azurecr.io/weather-pipeline:1.0 \
  --registry-server hyfregistry.azurecr.io \
  --trigger-type Manual \
  --replica-timeout 300 \
  --replica-retry-limit 0 \
  --env-vars \
    POSTGRES_URL="<connection-string>" \
    AZURE_STORAGE_CONNECTION_STRING="<storage-connection-string>" \
    LOG_LEVEL=INFO

Key flags:

  - --trigger-type Manual: the job runs only when you start it explicitly.
  - --replica-timeout 300: the maximum run time in seconds; a hung container is killed after this limit instead of running (and billing) forever.
  - --replica-retry-limit 0: do not retry failed runs automatically; inspect the logs instead.
  - --registry-server: tells the job which registry to pull the image from; forgetting it leads to image pull failures.
  - --env-vars: configuration passed into the container, including the connection strings.

Keeping costs under control

Container Apps Jobs are the one resource you create yourself in this track. Unlike the pre-provisioned infrastructure, every job you create costs money while it runs. Jobs are billed per vCPU-second and GiB-second of active execution time. The consumption plan includes a generous free tier (180,000 vCPU-seconds/month), but careless usage can exceed it.

Follow these rules:

  - Always set --replica-timeout so a hung run is stopped instead of billed indefinitely.
  - Keep --replica-retry-limit low (0 while learning) so a broken pipeline does not re-run in a loop.
  - Use a Manual trigger while developing; switch to Schedule only once the pipeline is stable.
  - Delete jobs you no longer need with az containerapp job delete.

Your teacher has budget alerts and policies on the subscription to catch runaway costs, but being mindful of resource usage is a professional habit worth building now.

Run the job

az containerapp job start \
  --name weather-job \
  --resource-group rg-weather-dev

Check execution history

az containerapp job execution list \
  --name weather-job \
  --resource-group rg-weather-dev \
  --output table

This shows whether each run succeeded or failed, and how long it took.

<aside> ⌨️ Hands on: Create a job from your Week 5 image, start it, and check the execution history.

</aside>

Viewing logs

Container App logs are available in the Azure portal under Log stream, or through the CLI:

az containerapp job logs show \
  --name weather-job \
  --resource-group rg-weather-dev \
  --container weather-job

<aside> 💡 The --container flag is required. By default, the container name matches the job name you passed to az containerapp job create. If the job already exited, the log stream may be unavailable. Check the Azure portal Log stream instead.

</aside>

Use logging in your Python code (not print) so log messages include timestamps and levels (see Logging Basics, Week 1).

Azure SDK libraries (like azure-storage-blob) log HTTP request details at the INFO level, which can make your logs very noisy. Silence them by raising their log level in your pipeline setup:

logging.getLogger("azure").setLevel(logging.WARNING)

Place this right after your logging.basicConfig(...) call. This keeps your own INFO messages visible while hiding the SDK's HTTP headers and request details.

If you are unsure which flags to use, draft the command with an LLM first.

<aside> 💡 Using AI to help: Paste a deployment error and the related Azure CLI command into an LLM and ask for possible causes. (⚠️ Ensure no PII or sensitive company data is included!)

</aside>

Verify the suggested flags with az containerapp job create --help.

<aside> 📘 Core program connection: You learned continuous deployment in the Systems week. Deploying a container job is the Python data pipeline version of that idea. Review here: https://www.notion.so/hackyourfuture/Continuous-deployment-2af50f64ffc981cebd04d416a58af7f6

</aside>

Verifying the run

After the job finishes, check three things:

  1. Logs: confirm the job completed without errors. Look for your log lines that print row counts and upload confirmations.
  2. Blob storage: verify your output file appeared in the container.
  3. Database: run a quick query to verify rows were inserted.

# Check blob storage
az storage blob list \
  --account-name hyfstoragedev \
  --container-name raw \
  --output table

# Connect to Postgres and verify
psql "postgresql://pipeline_user:<PASSWORD>@<host>:5432/weather?sslmode=require" \
  -c "SELECT COUNT(*) FROM weather_readings"

<aside> ⌨️ Hands on: Add a log line to your pipeline that prints the number of rows written and the blob name uploaded, then deploy and confirm you see it in the logs.

</aside>

Updating your job

When you push a new image version to ACR, update the job instead of creating a new one:

az containerapp job update \
  --name weather-job \
  --resource-group rg-weather-dev \
  --image hyfregistry.azurecr.io/weather-pipeline:1.1

Then start it again to run the updated code.

A note on secrets management

Right now you pass connection strings as plain environment variables. That works for learning, but in production you would store secrets in Azure Key Vault and use Managed Identity so your container never handles raw credentials. You will learn Key Vault later in the track. For now, environment variables are fine.

<aside> 💡 Keep secrets out of your code and CI logs. Environment variables set through az containerapp job create are stored in Azure, not in your repository.

</aside>

Scheduled runs

For production pipelines, you often want the job to run on a schedule:

az containerapp job create \
  --name weather-job \
  --resource-group rg-weather-dev \
  --environment env-weather-dev \
  --image hyfregistry.azurecr.io/weather-pipeline:1.0 \
  --registry-server hyfregistry.azurecr.io \
  --trigger-type Schedule \
  --cron-expression "0 6 * * *" \
  --env-vars POSTGRES_URL="<connection-string>"

This runs the job every day at 06:00 UTC. You will explore scheduling in more depth in Week 11 (Orchestration with Airflow).
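The five cron fields in "0 6 * * *" read minute, hour, day of month, month, day of week, in that order. A tiny helper to label them (illustrative only, not part of the pipeline):

```python
FIELDS = ["minute", "hour", "day of month", "month", "day of week"]

def describe_cron(expr: str) -> dict:
    """Map each cron field to its name, e.g. for a quick sanity check."""
    values = expr.split()
    if len(values) != 5:
        raise ValueError("expected 5 cron fields")
    return dict(zip(FIELDS, values))

print(describe_cron("0 6 * * *"))
# {'minute': '0', 'hour': '6', 'day of month': '*', 'month': '*', 'day of week': '*'}
```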

<aside> 🤓 Curious Geek: Scale to zero

</aside>

Exercises

  1. Create a Container App Job from your Week 5 image.
  2. Start the job manually and check its execution status.
  3. View the logs and confirm your pipeline ran.
  4. Verify that a blob appeared in storage and rows appeared in Postgres.
  5. Push a new image version, update the job, and confirm the new version runs.

<aside> 💡 In the wild: Dagster, an open source data orchestrator, supports Azure Container Apps Jobs as a compute backend. Teams define pipelines in Python and Dagster launches them as container jobs on demand. The pattern is the same as what you do manually here: push an image, create a job with env vars, trigger it, and check logs.

</aside>

🧠 Knowledge Check

  1. What is the difference between a Container App and a Container App Job? Which one fits a data pipeline, and why?
  2. You create a Container App Job successfully, but every execution fails with an image pull error. What flag did you likely forget?
  3. Your pipeline runs for 30 seconds per day. Explain why a Container App Job is cheaper than a VM for this workload.
  4. Why is it important to set --replica-timeout when creating a job? What happens if you skip it and your pipeline hangs?

Extra reading