Week 11 - Orchestration

Introduction to Orchestration

Airflow Fundamentals

Scheduling and Triggers

Sequential Pipeline Steps

Parameterized Runs and Backfills

Testing DAGs

Monitoring and Debugging

Deploying to Shared Airflow

Practice

Gotchas & Pitfalls

Assignment: Build an Orchestrated Data Pipeline

Week 11 Lesson Plan (Teachers)

Airflow Fundamentals

In Introduction to Orchestration you saw why a pipeline needs an orchestrator. This chapter gets you to a running one: local Airflow, a first DAG, and the UI views you will spend the rest of the week in. Theory about schedulers and metadata databases only makes sense once you have watched those components start up on your own machine, so this chapter is hands-on first and architecture second.

By the end of this chapter, you should be able to:

- set up the Astro CLI on your machine
- start and stop a local Airflow stack with astro dev start
- write and trigger a first TaskFlow DAG
- navigate the grid, graph, and log views in the Airflow UI
- name the five components of a local Airflow 3 stack and what each one does

Set up Astro CLI

Astro CLI is the Astronomer command-line tool that wraps Docker Compose around a full Airflow stack (scheduler, dag-processor, api-server, workers, Postgres metadata DB). You run one command and get the same Airflow image real teams use in production, without installing Python packages into your system environment. This is also the stack the shared class VM runs, so skills transfer directly.

Prerequisites

<aside> ⚠️ Out of scope: running Airflow without Docker. A uv-based fallback using airflow standalone + SQLite exists for machines that genuinely cannot run Docker (corporate restrictions, old Windows, no virtualization support). It diverges from the shared class VM and is only documented in Going Further. Use Docker + Astro if at all possible.

</aside>

Install by OS

Pick the row for your operating system. Install Docker first (the link goes to the vendor's instructions), then run the Astro CLI command.

| OS | Docker | Astro CLI | Notes |
| --- | --- | --- | --- |
| macOS (Intel or Apple Silicon) | Docker Desktop for Mac, or OrbStack if you want a lighter alternative | `brew install astro` | OrbStack uses less RAM and starts faster; either works. |
| Linux (Ubuntu/Debian) | Docker Engine install guide. Add yourself to the `docker` group so `docker ps` works without `sudo`. | `curl -sSL install.astronomer.io \| sudo bash -s` | Use Docker Engine (the native daemon), not Docker Desktop, on Linux. |
| Windows 10/11 (WSL2 recommended) | Docker Desktop for Windows with the WSL2 backend, then install Ubuntu on WSL2 and run everything inside the Ubuntu shell | Inside WSL2 Ubuntu: `curl -sSL install.astronomer.io \| sudo bash -s` | Running Astro from Windows PowerShell directly works, but performance is significantly worse than inside WSL2. |
| Windows (no WSL2, PowerShell) | Docker Desktop for Windows, Hyper-V backend | `winget install -e --id Astronomer.Astro` | Fallback only if WSL2 is blocked on your machine. Expect slow startup. |

Verify the install:

astro version

You should see a version banner like Astro CLI Version: 1.41.0. Any version from 1.40 upward works for Week 11 (older versions ran Airflow 2.x; the chapter assumes Airflow 3).

<aside> ⌨️ Hands on: Install Docker and Astro CLI using the row for your OS. Confirm both docker ps (header row visible, no error) and astro version (version string printed) work before you move on. Everything else in the week depends on this.

</aside>

Start a local Airflow stack

Create an empty folder for your Week 11 work and scaffold an Astro project inside it:

mkdir week11-airflow && cd week11-airflow
astro dev init

astro dev init writes a minimal project layout (you will tour it in a moment). Then start the stack:

astro dev start

The first start downloads the Airflow runtime image (1-2 GB) and can take 3-5 minutes. Subsequent starts take about 30 seconds.

When it finishes, Astro prints the URLs it exposed. The exact URL depends on your project directory name and a random port chosen at init:

➤ Airflow UI: <http://week11-airflow.localhost:6563>
➤ Postgres Database: postgresql://localhost:17733/postgres
➤ The default Postgres DB credentials are: postgres:postgres

Open the URL printed in your terminal (not localhost:8080): Astro uses a per-project subdomain and random port so multiple projects can run side by side. Airflow 3's local auth logs you straight into the UI with no password prompt. Copy the Airflow UI URL now; you will open it many times this week.

<aside> ⚠️ First startup is slow because Docker is pulling the Airflow image. If you see "Waiting for Airflow to be healthy" for over 5 minutes, hit Ctrl+C and check Docker Desktop's resource settings: Airflow needs at least 4 GB of RAM allocated to Docker.

</aside>

Once the stack is healthy, take 30 seconds to poke around the UI before creating your own DAG.

<aside> 🎬 Terminal Tutorial: Astro CLI startup sequence

</aside>

Once you have watched the sequence, run it yourself.

<aside> ⌨️ Hands on: Run astro dev start, wait for Astro to print the three URL lines, and open the Airflow UI URL from your own terminal in a browser. Confirm you see the DAGs page (it will be mostly empty except for one example_astronauts DAG that Astro ships by default).

</aside>

In another terminal, confirm the five Airflow components are now running as containers:

docker ps --format "table {{.Names}}\t{{.Status}}"

You should see five containers whose names end in scheduler, triggerer, api-server, dag-processor, and postgres. Each one maps to a piece of the architecture this chapter talks about next.
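If you want to script that check instead of eyeballing it, here is a small hypothetical helper (not part of the chapter's required steps) that takes the text output of the `docker ps` command above and reports which of the five expected components are missing. The sample output below is made up for illustration; your container names will include your project name.

```python
# Hypothetical helper: given the tab-separated output of
#   docker ps --format "table {{.Names}}\t{{.Status}}"
# report which of the five expected Airflow components have no container.
EXPECTED = {"scheduler", "triggerer", "api-server", "dag-processor", "postgres"}

def missing_components(ps_output: str) -> set[str]:
    lines = ps_output.strip().splitlines()[1:]  # skip the header row
    names = [line.split("\t", 1)[0] for line in lines]
    return {c for c in EXPECTED if not any(c in n for n in names)}

# Made-up sample; real names come from your `docker ps` output.
sample = (
    "NAMES\tSTATUS\n"
    "week11-airflow-scheduler\tUp 2 minutes\n"
    "week11-airflow-triggerer\tUp 2 minutes\n"
    "week11-airflow-api-server\tUp 2 minutes\n"
    "week11-airflow-dag-processor\tUp 2 minutes\n"
    "week11-airflow-postgres\tUp 2 minutes\n"
)
print(missing_components(sample))  # set() -> all five components present
```

An empty set means all five components are up; anything else names the container you should go look for.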

Run your first DAG

Astro's astro dev init generates a starter DAG at dags/exampledag.py (the file is named exampledag.py, the DAG ID inside it is example_astronauts). It is functional but noisy (pulls from a NASA API). Replace it with a minimal DAG so the moving parts are obvious:

rm dags/exampledag.py

Create dags/hello_pipeline.py:

# dags/hello_pipeline.py
from datetime import datetime

from airflow.sdk import dag, task

@dag(
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["week11", "intro"],
)
def hello_pipeline():
    @task()
    def ingest() -> int:
        return 42

    @task()
    def transform(count: int) -> None:
        print(f"Processed {count} rows")

    transform(ingest())

hello_pipeline()

Save the file. Astro's dag-processor container parses dags/ on a schedule (roughly every 30 seconds in the default local config). New DAGs can take up to a minute to appear in the UI. If you are impatient, force an immediate re-scan:

astro dev run dags reserialize
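A cheap complement (our suggestion, not part of the Astro workflow): before waiting on the next scan, you can catch plain Python syntax errors locally with the standard library's py_compile. This only checks syntax; the airflow imports themselves resolve only inside the containers, so do not try a full import on your host.

```python
# Syntax-check a DAG file without importing it (importing would fail on the
# host, where the `airflow` package is not installed). Catches typos like a
# missing colon before the dag-processor's next scan.
import py_compile

def check_syntax(path: str) -> bool:
    """Return True if the file parses as Python; print the error otherwise."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError as err:
        print(f"syntax error: {err}")
        return False

# After saving your DAG file, run:
# check_syntax("dags/hello_pipeline.py")
```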

<aside> ⌨️ Hands on: Create dags/hello_pipeline.py with the code above and wait up to 60 seconds (or run astro dev run dags reserialize to speed it up). Confirm hello_pipeline appears in the UI list, unpause it (toggle at the left of the row), then click ▶ Trigger DAG. Wait for the run to turn dark green.

</aside>

A successful run looks like this in the grid view. Each column is one DAG run; each row inside the column is one task. Dark green means the task succeeded. You may see two run columns instead of one: when you unpause a @daily DAG mid-day, Airflow fires the scheduled run for today alongside your manual trigger.
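Why you get one scheduled run rather than a backfill all the way to January 1 comes down to catchup=False: Airflow schedules only the most recent fully completed @daily interval. A plain-datetime sketch of that interval math, using a hypothetical clock value and no Airflow imports:

```python
from datetime import datetime, timedelta

def latest_daily_interval(now: datetime) -> tuple[datetime, datetime]:
    """Most recent fully completed @daily interval as of `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=1), midnight

# Unpausing at 14:30 on 2025-06-15 fires the run for the 06-14 -> 06-15 interval.
start, end = latest_daily_interval(datetime(2025, 6, 15, 14, 30))
print(start.date(), "->", end.date())  # 2025-06-14 -> 2025-06-15
```

With catchup=True instead, Airflow would create one run per missed interval back to start_date, which is exactly what the Parameterized Runs and Backfills chapter covers.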

Airflow grid view showing two successful hello_pipeline runs with ingest and transform tasks green

Now spend five minutes exploring the UI around that one run. The four views you will use every day this week:

| View | What it shows | When to use |
| --- | --- | --- |
| DAG list | pause/unpause, trigger, last run status | start of every session |
| Grid view | one column per run, one cell per task | checking historical runs at a glance |
| Graph view | dependency diagram | understanding task order |
| Task instance → Logs | stdout/stderr of one task | debugging a red cell |

<aside> ⌨️ Hands on: Click into your hello_pipeline run and open each of the four views above. Find the log output Processed 42 rows in the task-instance log panel.

</aside>

The five components you just started

Now the architecture discussion has something concrete to point at. Airflow 3's local stack runs five processes, and docker ps already showed them to you:

flowchart LR
    dag["<b>DAG files</b><br/>dags/ folder"] --> dp["<b>DAG Processor</b><br/>parses files"]
    dp --> db[("<b>Metadata DB</b><br/>Postgres, task state")]
    db <--> sch["<b>Scheduler</b><br/>decides what runs"]
    sch --> exec["<b>Executor / Worker</b><br/>runs the task"]
    exec --> db
    db <--> api["<b>API Server</b><br/>UI + REST API"]
    db <--> trg["<b>Triggerer</b><br/>async sensors"]
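To make the scheduler/executor split concrete, here is a toy, Airflow-free simulation of the loop the diagram describes: a dict of task states stands in for the metadata DB, the "scheduler" queues tasks whose upstreams have all succeeded, and the "executor" runs queued tasks and writes the result back. Every name here is invented for illustration; real Airflow is vastly more involved.

```python
# Toy simulation of the scheduler loop. `state` stands in for the metadata DB;
# `deps` maps each task to its upstream tasks (like the inferred DAG edges).
deps = {"ingest": [], "transform": ["ingest"]}
state = {t: "none" for t in deps}          # none -> queued -> success
actions = {"ingest": lambda: 42, "transform": lambda: print("Processed rows")}

while any(s != "success" for s in state.values()):
    # Scheduler: queue every task whose upstreams have all succeeded.
    for t, ups in deps.items():
        if state[t] == "none" and all(state[u] == "success" for u in ups):
            state[t] = "queued"
    # Executor/worker: run queued tasks and write results back to "the DB".
    for t, s in state.items():
        if s == "queued":
            actions[t]()
            state[t] = "success"

print(state)  # {'ingest': 'success', 'transform': 'success'}
```

Notice that the scheduler never runs task code itself; it only reads and writes state. That separation is why the UI can be healthy while nothing executes.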

<aside> 💡 If you ever see the UI but DAG runs do not start, two components can be the cause. astro dev ps shows container health; astro dev logs --scheduler and astro dev logs --dag-processor show recent activity for each. The API server can be up while the scheduler or DAG processor is down: the UI will look fine and do nothing.

</aside>

DAG file anatomy

Re-read your hello_pipeline.py. Four things matter:

1. The @dag decorator carries run policy: schedule="@daily", start_date, and catchup=False (do not backfill past intervals).
2. Each @task-decorated function becomes one task in the DAG.
3. Calling transform(ingest()) at definition time executes nothing; it tells Airflow that transform consumes ingest's output, which is how the dependency edge is inferred.
4. The hello_pipeline() call at the bottom instantiates the DAG. Without it, the file defines a function and the DAG processor finds nothing.

To see the inferred edge, click the graph icon at the top-left of the DAG page (or press g on your keyboard). The graph view renders ingest and transform as two nodes connected by the edge Airflow derived from the transform(ingest()) call:

Airflow graph view showing ingest arrow to transform
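The inference mechanism can be mimicked in a few lines of plain Python. This is a conceptual sketch, not Airflow's real classes: calling a decorated function at DAG-definition time returns a reference object instead of running the body, and passing that reference into another call is what records the edge.

```python
# Toy mimic of TaskFlow dependency inference. TaskRef and `task` are invented
# stand-ins for Airflow's internals (roughly, XComArg and the @task decorator).
class TaskRef:
    def __init__(self, name, upstream=()):
        self.name = name
        self.upstream = list(upstream)

def task(fn):
    def at_definition_time(*args):
        # The body is NOT executed here; we only record which refs flow in.
        deps = [a for a in args if isinstance(a, TaskRef)]
        return TaskRef(fn.__name__, upstream=deps)
    return at_definition_time

@task
def ingest():
    return 42

@task
def transform(count):
    print(f"Processed {count} rows")

ref = transform(ingest())
print(ref.name, "<-", [t.name for t in ref.upstream])  # transform <- ['ingest']
```

The print inside transform never fires at definition time; the bodies only run later, when the executor picks the tasks up. That is also why top-level code in a DAG file should stay cheap: it runs on every dag-processor scan.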

TaskFlow vs classic operators

TaskFlow (@task) is the right default for pure-Python work. For shell commands or third-party integrations, use classic operators: