Going Further

This page is optional. Nothing here is required for Week 5's learning goals or the assignment. It collects deeper dives and developer-experience improvements for students who finish the assignment early and want to explore more.

Developer tooling

A Makefile to tame Docker commands

The Docker commands you learned in Docker Fundamentals get long fast. A real invocation often looks like:

docker run --rm -v $(pwd):/app -p 8000:8000 --env-file .env weather-pipeline:1.0

Retyping that dozens of times per day is tedious and typo-prone. A Makefile gives each long command a short name, so you run make run instead.

Create a file called Makefile in your project root:

IMAGE := weather-pipeline
TAG   := 1.0

.PHONY: build run shell clean rebuild

build:
	docker build -t $(IMAGE):$(TAG) .

run:
	docker run --rm --env-file .env $(IMAGE):$(TAG)

shell:
	docker run -it --rm --env-file .env $(IMAGE):$(TAG) /bin/bash

# The leading "-" tells make to carry on if the image does not exist yet
clean:
	-docker rmi $(IMAGE):$(TAG)

rebuild: clean build run

Now make build, make run, make shell, and make rebuild replace the long docker commands. A few details worth knowing:

.PHONY tells make these targets are command aliases, not files on disk. Without it, if a file or directory named build ever appeared in your project, make build would report that build is up to date and skip the docker command.
Variables like $(IMAGE) let you change the image name in one place.
rebuild: clean build run is a target that depends on three other targets. make runs them left to right, as long as you do not pass -j for parallel execution. This is the pattern you will see most often in real Makefiles: compose named steps into higher-level workflows.

<aside> ⚠️ Makefile indentation must be a real tab character, not spaces. If you see Makefile:3: *** missing separator, line 3 is indented with spaces, usually because your editor converted the tab. Most editors have a "show whitespace" option, or you can configure them to preserve tabs specifically for Makefiles.

</aside>
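If "show whitespace" is not convenient, a few lines of Python can point at the offending lines. This is a rough sketch, not a Makefile parser: the function name find_space_indented_lines is made up for illustration, and the check is a heuristic that flags any non-blank line starting with spaces, including legal ones such as indented variable assignments.

```python
import re

# Heuristic: recipe lines must start with a tab, so a non-blank line that
# starts with one or more spaces is suspicious. Legal indented assignments
# or conditionals are reported too.
SPACE_INDENT = re.compile(r"^ +\S")


def find_space_indented_lines(makefile_text: str) -> list[int]:
    """Return 1-based line numbers that begin with spaces instead of a tab."""
    return [
        number
        for number, line in enumerate(makefile_text.splitlines(), start=1)
        if SPACE_INDENT.match(line)
    ]
```

To check a real file, read it first, for example find_space_indented_lines(open("Makefile").read()), and compare the reported numbers with the line number in make's error message.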

Once the Makefile is in place, practice using it deliberately until reaching for a make target becomes a habit.

<aside> ⌨️ Hands on: Drop this Makefile into your Week 5 project. Replace your next five docker build / docker run invocations with make targets. Then add a ci-local target that runs ruff check ., ruff format --check ., and make build in sequence, so you can check your CI will pass before pushing.

</aside>

`just` as an alternative to `make`

just is a modern task runner that fixes make's main annoyance: the tab-indentation trap. A Justfile can be indented with normal spaces, has a built-in --list flag to show available recipes, and supports recipe parameters. The same tasks as above:

image := "weather-pipeline"
tag   := "1.0"

build:
    docker build -t {{image}}:{{tag}} .

run:
    docker run --rm --env-file .env {{image}}:{{tag}}

shell:
    docker run -it --rm --env-file .env {{image}}:{{tag}} /bin/bash

Run just --list and you see every available recipe. Add a # comment line above any recipe to include a description in that output. No .PHONY, no tab trap, no implicit rules.

So why does this page lead with make and not just? Two practical reasons:

make is already on most development machines. It ships with the Xcode Command Line Tools on macOS and is preinstalled or one package away on most Linux distros. just needs brew install just or cargo install just. For a classroom setup, one less install is one less support question.
Real repos you will read use Makefiles. Many dbt starters, Airflow projects, FastAPI templates, and data-engineering Zoomcamp repos on GitHub ship a Makefile. Knowing make lets you navigate other people's code. Knowing just does not, yet.

If you control your own project and your team agrees, just is the nicer tool. If you want a skill that transfers to any repo you will read in your first job, learn make first and pick up just later (it takes ten minutes).

`act`: run GitHub Actions locally

Pushing a commit just to find out your workflow has a typo is slow and clutters your Git history. act runs GitHub Actions workflows inside a local Docker container, so you can debug a workflow in seconds instead of minutes.

Install and run:

# macOS
brew install act

# Linux — see <https://github.com/nektos/act#installation-via-package-manager>
# Windows
# winget install nektos.act

# Run the default `push` event against your workflows
act

# Run a specific job
act -j lint

# Pass secrets from a local file (never commit this)
act --secret-file .secrets

What act gets right: most actions/checkout, setup-python, and shell steps behave identically to GitHub. What it cannot do: reproduce GitHub-hosted runner hardware, issue a real GITHUB_TOKEN, or run Marketplace actions that hit the GitHub API heavily. Use it for the fast iteration loop; push to GitHub for the final check.

<aside> ⚠️ act needs Docker running. On Apple Silicon Macs (M1 and later), pass --container-architecture linux/amd64 on each run, or add that flag to an .actrc file so it applies automatically. Otherwise, workflows that pull x86-only images or actions can fail with a cryptic exec format error.

</aside>

Going deeper with Docker

Multi-stage Dockerfiles for smaller images

The single-stage Dockerfile from Docker Fundamentals produces an image around 1 GB. A lot of that is build-time tooling you do not need at runtime: pip caches, compilers, header files. A multi-stage Dockerfile builds your app in one image and copies only the result into a minimal runtime image.

# Stage 1: build
FROM python:3.11-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

COPY src/ ./src/

# Stage 2: runtime
FROM python:3.11-slim

WORKDIR /app

# Copy only the installed packages and the source
COPY --from=builder /root/.local /root/.local
COPY --from=builder /app/src ./src

ENV PATH=/root/.local/bin:$PATH

CMD ["python", "-m", "src.pipeline"]

Two things are happening:

AS builder names the first stage. The final image is whatever the last FROM produces, so the builder stage is discarded from the output.
COPY --from=builder pulls artifacts out of the build stage into the runtime image. You only copy what you need to run, not the entire build toolchain.

Compare sizes before and after:

docker build -t weather-pipeline:slim .
docker images weather-pipeline

A typical result: single-stage builds land around 1 GB; the slim multi-stage version above comes in around 180 MB. The gap widens further when your app compiles native dependencies.
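To put a number on the comparison instead of eyeballing the docker images table, something like the following could work. It is a sketch under assumptions: both helper names are invented for this page, it needs a running Docker daemon for the first function, and it relies on docker image inspect reporting Size in bytes.

```python
import subprocess


def image_size_bytes(image: str) -> int:
    """Ask the local Docker daemon for an image's size in bytes."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,           # raise if the image does not exist
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def size_reduction_percent(before_bytes: int, after_bytes: int) -> float:
    """Percentage by which the image shrank, relative to the original size."""
    return (before_bytes - after_bytes) * 100 / before_bytes
```

With the rough figures quoted above, size_reduction_percent(1_000_000_000, 180_000_000) comes out at 82.0, so the slim image is a bit less than a fifth of the original.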

<aside> 💡 You already saw the COPY --from=... pattern in Ch3's uv Dockerfile (COPY --from=ghcr.io/astral-sh/uv:0.6 /uv /usr/local/bin/uv). Multi-stage builds are the same mechanism, applied to your own build stage instead of a published image.

</aside>

Running as a non-root user

By default, containers run as root. If your code has a remote code execution bug, an attacker who exploits it gets root inside the container: full write access to the container filesystem and an easier path to attacking the host. Production images should run as an unprivileged user. Add three lines to your Dockerfile:

RUN useradd --create-home --shell /bin/bash appuser
USER appuser
WORKDIR /home/appuser/app

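Place these after you install dependencies (root can install system packages; appuser cannot) but before CMD. If your app writes files, create those directories and hand them to appuser while the build still runs as root, that is, before the USER line: for example RUN mkdir -p /data && chown -R appuser:appuser /data. If you combine this with the multi-stage Dockerfile above, note that its packages live under /root/.local, which appuser cannot read, so install them somewhere the new user can access instead.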

You will not see a dramatic local difference, but the platforms you are likely to deploy to (Azure Container Apps, Kubernetes, AWS Fargate) all support non-root containers, and many teams require them. Kubernetes clusters with a strict security policy will refuse to start a container that runs as root.
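One way to notice a forgotten USER line is to have the pipeline check its own user at startup. The sketch below is an illustration only: running_as_root and warn_if_root are hypothetical helpers, not part of the Week 5 code, and os.geteuid exists only on POSIX systems such as the Linux containers used here.

```python
import os
import sys
from typing import Optional


def running_as_root(euid: Optional[int] = None) -> bool:
    """True when the effective user ID is 0 (root). POSIX only."""
    if euid is None:
        euid = os.geteuid()
    return euid == 0


def warn_if_root() -> None:
    """Print a warning to stderr when the process runs as root."""
    if running_as_root():
        print(
            "warning: running as root; add a USER line to the Dockerfile",
            file=sys.stderr,
        )
```

Calling warn_if_root() at the top of the entry point keeps the check visible in the container logs without stopping the run.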

Container lifecycle: running detached

The Week 5 pipeline is a batch job: it runs, prints logs, and exits. Long-running services (a REST API, a scheduled worker) behave differently, and you manage them with a few more commands.

By default, docker run runs in the foreground: your terminal stays attached until the container exits. Run in detached mode with -d to get your terminal back. The Week 5 image exits on its own, so these examples use a genuinely long-running image (nginx) to show the lifecycle commands doing something:

# Foreground: serves until you press Ctrl+C
docker run --rm nginx

# Detached: runs in the background and returns your terminal
docker run -d --name web-demo nginx

Once a container runs detached, you manage its lifecycle by name:

docker ps -a                       # List all containers, including stopped ones
docker exec -it web-demo /bin/bash # Open a shell inside the running container
docker stop web-demo               # Gracefully stop a running container
docker rm web-demo                 # Remove a stopped container

docker exec only works while the container is still running. It is how you poke around inside a long-running service. For the Week 5 batch job, which exits on its own, docker run -it ... /bin/bash (covered in Docker Fundamentals) is the right tool instead.
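"Gracefully" in docker stop means Docker sends the main process SIGTERM and follows up with SIGKILL only if the process is still alive after a grace period (10 seconds by default). A long-running Python service can use that window to finish its current unit of work. The sketch below shows one possible shape; ShutdownFlag and serve are names invented for this example, and do_work stands in for whatever your service does.

```python
import signal
import time


class ShutdownFlag:
    """Records that a stop signal arrived so a loop can exit cleanly."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        # Signal handler: only set a flag, let the main loop do the cleanup.
        self.requested = True

    def install(self) -> None:
        # docker stop sends SIGTERM; Ctrl+C in the foreground sends SIGINT.
        signal.signal(signal.SIGTERM, self.handle)
        signal.signal(signal.SIGINT, self.handle)


def serve(flag: ShutdownFlag, do_work, interval_seconds: float = 1.0) -> int:
    """Call do_work() repeatedly until a stop is requested; return the run count."""
    runs = 0
    while not flag.requested:
        do_work()
        runs += 1
        time.sleep(interval_seconds)
    return runs
```

A service would create one ShutdownFlag, call install() once at startup, and then run serve(flag, do_work). Without a handler like this, the process is likely to ignore SIGTERM when it is the container's main process, so docker stop waits out the full grace period before killing it.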

<aside> 💡 You will use detached mode again with docker compose up -d below. Week 6's Azure Container Apps Jobs are the opposite pattern: a scheduled run that exits, like the Week 5 batch job.

</aside>

Multi-service stacks with Docker Compose

Docker Fundamentals covered single-container setups. Most real projects are not single-container: a Python app also needs a database, a cache, or a message broker. Docker Compose lets you describe that whole stack in one YAML file and bring it up with a single command. Nearly every open-source data project on GitHub (Airflow, Superset, MinIO, the dbt labs example repos) ships a docker-compose.yml as the "try it locally" starting point.

Create a docker-compose.yml that pairs your ingestion app with a local Postgres:

services:
  app:
    build: .
    env_file: .env
    depends_on:
      - db   # starts db first, but does not wait for Postgres to accept connections

  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"

volumes:
  pgdata:

Key commands:

docker compose up          # Start all services, stream logs in the foreground
docker compose up -d       # Start in the background (detached)
docker compose ps          # List running services
docker compose logs -f app # Tail logs for one service
docker compose down        # Stop and remove containers (keeps volumes)
docker compose down -v     # Also remove named volumes (wipes the DB)

Two details that make Compose worth learning:

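Service-name DNS. From inside the app container, db resolves to the Postgres container automatically. You do not need to look up IP addresses, publish ports, or create a docker network by hand. Your connection string is simply postgresql://app:app@db:5432/app. The ports mapping in the file is only there so tools on your host can reach the database at localhost:5432.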
Named volumes outlive containers. pgdata keeps your Postgres data when you run docker compose down. Only docker compose down -v wipes it. This is why most tutorials can say "reset your state with down -v".
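In application code, the service name usually arrives as configuration and is not hard-coded. Here is a minimal sketch, assuming hypothetical variable names (DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME) that you would define yourself in .env; the defaults mirror the Compose file above.

```python
import os
from typing import Mapping, Optional


def build_db_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Assemble a Postgres URL, defaulting to the Compose service name as host."""
    env = os.environ if env is None else env
    user = env.get("DB_USER", "app")
    password = env.get("DB_PASSWORD", "app")
    host = env.get("DB_HOST", "db")  # Compose service name, resolved by Docker's DNS
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "app")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Inside the app container the defaults produce postgresql://app:app@db:5432/app. Running the same code directly on your host with DB_HOST=localhost reaches the same database through the published port. A password containing characters such as @ or / would need URL-encoding first, which this sketch leaves out.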

In later weeks you will meet Compose again. Week 11's Airflow setup starts life as a docker-compose file before moving to managed infrastructure; it is the standard way most data teams run open-source tools locally.

<aside> ⚠️ docker-compose (with a hyphen) is the legacy Python CLI, now deprecated. docker compose (no hyphen) is the newer Go plugin bundled with Docker Desktop and Docker Engine. Both accept the same YAML. Always prefer docker compose in new work.

</aside>

Kubernetes: a primer

Docker Compose runs a multi-service stack on one machine. Kubernetes (often abbreviated K8s) runs containers across many machines, with automatic restarts, scaling, and rolling updates. You will not deploy to Kubernetes in this track, but knowing the vocabulary lets you read real production infrastructure without getting lost.

The core concepts map loosely onto things you already know:

Kubernetes concept	Rough analogy
Pod	A running container (technically: one or more containers sharing a network)
Deployment	A spec that keeps N copies of a Pod running
Service	A stable DNS name and IP in front of Pods (like Compose's service-name DNS, but cluster-wide)
ConfigMap / Secret	Env-var config, injected into Pods at runtime
Ingress	Routes external HTTP traffic to Services
Node	A machine (VM or physical) in the cluster
Cluster	The full set of Nodes managed together

A minimal Deployment looks like this. It says "run three copies of the weather-pipeline:1.0 image, and keep them running":

apiVersion: apps/v1
kind: Deployment
metadata:
  name: weather-pipeline
spec:
  replicas: 3
  selector:
    matchLabels:
      app: weather-pipeline
  template:
    metadata:
      labels:
        app: weather-pipeline
    spec:
      containers:
        - name: app
          image: myregistry.azurecr.io/weather-pipeline:1.0
          envFrom:
            - secretRef:
                name: weather-secrets

Daily commands use kubectl:

kubectl apply -f deployment.yaml       # Apply desired state to the cluster
kubectl get pods                       # List running Pods
kubectl logs <pod-name>                # Print logs from one Pod (add -f to stream)
kubectl exec -it <pod-name> -- /bin/bash  # Shell into a Pod (like `docker exec`)
kubectl delete -f deployment.yaml      # Remove the Deployment

Why this matters even though you will not deploy to K8s this track:

Many managed data platforms run on Kubernetes under the hood, and open-source tools you will meet later integrate with it directly: Airflow can run tasks as Pods through its KubernetesExecutor, and Spark can use Kubernetes as its cluster manager.
