In Introduction to Orchestration you saw why a pipeline needs an orchestrator. This chapter gets you to a running one: local Airflow, a first DAG, and the UI views you will spend the rest of the week in. Theory about schedulers and metadata databases only makes sense once you have watched those components start up on your own machine, so this chapter is hands-on first and architecture second.
By the end of this chapter, you should be able to:

- Install Docker and the Astro CLI and verify both from the terminal.
- Scaffold an Astro project and start a local Airflow stack with it.
- Write a minimal TaskFlow DAG, trigger it, and read its logs in the UI.
- Name the five components of a local Airflow 3 stack and match each one to a container in docker ps.
- Explain TaskFlow (@task) versus classic operators, using your own starter DAG as the example.

Astro CLI is the Astronomer command-line tool that wraps Docker Compose around a full Airflow stack (scheduler, dag-processor, api-server, triggerer, Postgres metadata DB). You run one command and get the same Airflow image real teams use in production, without installing Python packages into your system environment. This is also the stack the shared class VM runs, so skills transfer directly.

A working Docker install is the prerequisite for everything below. You can sanity-check it at any time with docker ps, which should print a table header even when you have no containers (sample output follows the note below).

<aside>
⚠️ Out of scope: running Airflow without Docker. A uv-based fallback using airflow standalone + SQLite exists for machines that genuinely cannot run Docker (corporate restrictions, old Windows, no virtualization support). It diverges from the shared class VM and is only documented in Going Further. Use Docker + Astro if at all possible.
</aside>
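For reference, on a healthy Docker install with nothing running, docker ps prints only the header row:

```text
$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
```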
Pick the row for your operating system. Run the Docker install first (link opens to the vendor), then the Astro CLI command.
| OS | Docker | Astro CLI | Notes |
|---|---|---|---|
| macOS (Intel or Apple Silicon) | Docker Desktop for Mac, or OrbStack if you want a lighter alternative | `brew install astro` | OrbStack uses less RAM and starts faster; either works. |
| Linux (Ubuntu/Debian) | Docker Engine install guide. Add yourself to the `docker` group so `docker ps` works without `sudo`. | `curl -sSL install.astronomer.io \| sudo bash -s` | Use Docker Engine (the native daemon), not Docker Desktop, on Linux. |
| Windows 10/11 (WSL2 recommended) | Docker Desktop for Windows with the WSL2 backend, then install Ubuntu on WSL2 and run everything inside the Ubuntu shell. | Inside WSL2 Ubuntu: `curl -sSL install.astronomer.io \| sudo bash -s` | Running Astro from Windows PowerShell directly works, but performance is significantly worse than inside WSL2. |
| Windows (no WSL2, PowerShell) | Docker Desktop for Windows, Hyper-V backend. | `winget install -e --id Astronomer.Astro` | Fallback only if WSL2 is blocked on your machine. Expect slow startup. |
Verify the install:
```
astro version
```

You should see a version banner like `Astro CLI Version: 1.41.0`. Any version from 1.40 upward works for Week 11 (older versions ran Airflow 2.x; the chapter assumes Airflow 3).
<aside>
⌨️ Hands on: Install Docker and Astro CLI using the row for your OS. Confirm both docker ps (header row visible, no error) and astro version (version string printed) work before you move on. Everything else in the week depends on this.
</aside>
Create an empty folder for your Week 11 work and scaffold an Astro project inside it:
```
mkdir week11-airflow && cd week11-airflow
astro dev init
```
`astro dev init` writes a minimal project layout.
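For orientation, the scaffold looks roughly like this; treat the exact file list as version-dependent, since CLI releases add and rename files:

```text
week11-airflow/
├── dags/              # DAG files; Astro ships exampledag.py here
├── include/           # extra project files available inside the containers
├── plugins/           # custom Airflow plugins (empty by default)
├── tests/             # pytest-based DAG tests
├── Dockerfile         # pins the Astro Runtime (Airflow) image
├── requirements.txt   # extra Python packages for your DAGs
└── packages.txt       # extra OS-level packages
```

Then start the stack: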
```
astro dev start
```
The first start downloads the Airflow runtime image (1-2 GB) and can take 3-5 minutes. Subsequent starts take about 30 seconds.
When it finishes, Astro prints the URLs it exposed. The exact URL depends on your project directory name and a random port chosen at init:
```
➤ Airflow UI: http://week11-airflow.localhost:6563
➤ Postgres Database: postgresql://localhost:17733/postgres
➤ The default Postgres DB credentials are: postgres:postgres
```
Open the URL printed in your terminal (not `localhost:8080`): Astro uses a per-project subdomain and random port so multiple projects can run side by side. Airflow 3's local auth logs you straight into the UI with no password prompt. Copy the Airflow UI URL now; you will open it many times this week.
<aside>
⚠️ First startup is slow because Docker is pulling the Airflow image. If you see "Waiting for Airflow to be healthy" for over 5 minutes, hit Ctrl+C and check Docker Desktop's resource settings: Airflow needs at least 4 GB of RAM allocated to Docker.
</aside>
Once the stack is healthy, take 30 seconds to poke around the UI before creating your own DAG.
<aside>
🎬 Terminal Tutorial: Astro CLI startup sequence
</aside>
Once you have watched the sequence, run it yourself.
<aside>
⌨️ Hands on: Run astro dev start, wait for Astro to print the three ➤ URL lines, and open the Airflow UI URL from your own terminal in a browser. Confirm you see the DAGs page (it will be mostly empty except for one example_astronauts DAG that Astro ships by default).
</aside>
In another terminal, confirm the five Airflow components are now running as containers:
```
docker ps --format "table {{.Names}}\t{{.Status}}"
```
You should see five containers whose names end in scheduler, triggerer, api-server, dag-processor, and postgres. Each one maps to a piece of the architecture this chapter talks about next.
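The names carry a project-specific prefix, so expect output shaped like this rather than these literal strings (the hash here is made up):

```text
NAMES                                   STATUS
week11-airflow_a1b2c3-scheduler-1       Up 2 minutes
week11-airflow_a1b2c3-triggerer-1       Up 2 minutes
week11-airflow_a1b2c3-api-server-1      Up 2 minutes
week11-airflow_a1b2c3-dag-processor-1   Up 2 minutes
week11-airflow_a1b2c3-postgres-1        Up 2 minutes
```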
Astro's `astro dev init` generates a starter DAG at `dags/exampledag.py` (the file is named `exampledag.py`; the DAG ID inside it is `example_astronauts`). It is functional but noisy (it pulls from a NASA API). Replace it with a minimal DAG so the moving parts are obvious:
```
rm dags/exampledag.py
```
Create dags/hello_pipeline.py:
```python
# dags/hello_pipeline.py
from datetime import datetime

from airflow.sdk import dag, task


@dag(
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["week11", "intro"],
)
def hello_pipeline():
    @task()
    def ingest() -> int:
        # the return value is published as an XCom for downstream tasks
        return 42

    @task()
    def transform(count: int) -> None:
        print(f"Processed {count} rows")

    # calling transform on ingest's output is what creates the dependency edge
    transform(ingest())


hello_pipeline()
```
Save the file. Astro's dag-processor container parses dags/ on a schedule (roughly every 30 seconds in the default local config). New DAGs can take up to a minute to appear in the UI. If you are impatient, force an immediate re-scan:
```
astro dev run dags reserialize
```
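The same `astro dev run` passthrough works for any Airflow CLI command. If you prefer to confirm registration from the terminal instead of the UI, `dags list` prints every DAG the processor has picked up:

```text
astro dev run dags list
```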
<aside>
⌨️ Hands on: Create dags/hello_pipeline.py with the code above and wait up to 60 seconds (or run astro dev run dags reserialize to speed it up). Confirm hello_pipeline appears in the UI list, unpause it (toggle at the left of the row), then click ▶ Trigger DAG. Wait for the run to turn dark green.
</aside>
A successful run looks like this in the grid view. Each column is one DAG run; each row inside the column is one task. Dark green means the task succeeded. You may see two run columns instead of one: when you unpause a @daily DAG mid-day, Airflow fires the scheduled run for today alongside your manual trigger.

*(Screenshot: Airflow grid view showing two successful hello_pipeline runs, with the ingest and transform tasks green.)*
Now spend five minutes exploring the UI around that one run. The four views you will use every day this week:
| View | What it shows | When to use |
|---|---|---|
| DAG list | pause/unpause, trigger, last run status | start of every session |
| Grid view | one column per run, one cell per task | checking historical runs at a glance |
| Graph view | dependency diagram | understanding task order |
| Task instance → Logs | stdout/stderr of one task | debugging a red cell |
<aside>
⌨️ Hands on: Click into your hello_pipeline run and open each of the four views above. Find the log output Processed 42 rows in the task-instance log panel.
</aside>
Now the architecture discussion has something concrete to point at. Airflow 3's local stack runs five processes, and docker ps already showed them to you:
```mermaid
flowchart LR
    dag["<b>DAG files</b><br/>dags/ folder"] --> dp["<b>DAG Processor</b><br/>parses files"]
    dp --> db[("<b>Metadata DB</b><br/>Postgres, task state")]
    db <--> sch["<b>Scheduler</b><br/>decides what runs"]
    sch --> exec["<b>Executor / Worker</b><br/>runs the task"]
    exec --> db
    db <--> api["<b>API Server</b><br/>UI + REST API"]
    db <--> trg["<b>Triggerer</b><br/>async sensors"]
```
- The DAG Processor (…-dag-processor) parses every file in dags/ on a schedule and writes the serialized DAG into the metadata DB. In Airflow 2 this lived inside the scheduler; Airflow 3 split it out so parsing crashes cannot take the scheduler down.
- The Scheduler (…-scheduler) reads serialized DAGs from the DB and queues tasks that are due to run.
- The Metadata DB (…-postgres) stores every DAG run, task state transition, and retry decision.
- The API Server (…-api-server, renamed from "webserver" in Airflow 3) serves the UI you just used and a REST API for programmatic access.
- The Triggerer (…-triggerer) runs asynchronous sensors efficiently. Not load-bearing in Week 11; mentioned so you know what the fifth container does.

<aside>
💡 If you ever see the UI but DAG runs do not start, two components can be the cause. astro dev ps shows container health; astro dev logs scheduler and astro dev logs dag-processor show recent activity for each. The API server can be up while the scheduler or DAG processor is down: the UI will look fine and do nothing.
</aside>
Re-read your hello_pipeline.py. Four things matter:
- `@dag(...)` decorates a Python function to mark it as a DAG definition. The decorator arguments (`schedule`, `start_date`, `catchup`, `tags`) set the DAG's behaviour. Scheduling and Triggers unpacks `schedule` and `catchup` in detail.
- `@task()` decorates plain Python functions. Return values become inputs to the next task via the TaskFlow API. This is the modern, Pythonic way to write Airflow code. In Airflow 3 the canonical import is `from airflow.sdk import dag, task` (see airflow.sdk in the glossary); the older `from airflow.decorators import ...` still works but emits a deprecation warning.
- `transform(ingest())` wires the dependency: Airflow infers that `transform` depends on `ingest` from the function call, and draws the edge in the graph view automatically.
- The final call (`hello_pipeline()`) is what actually registers the DAG with Airflow. Forget it and the DAG silently does not appear in the UI.

To see the inferred edge, click the graph icon at the top-left of the DAG page (or press `g` on your keyboard). The graph view renders `ingest` and `transform` as two nodes connected by the edge Airflow derived from the `transform(ingest())` call:

*(Screenshot: Airflow graph view showing ingest → transform.)*
TaskFlow (`@task`) is the right default for pure-Python work. For shell commands or third-party integrations, use classic operators (a minimal sketch follows the list):

- `BashOperator` for shell commands. You will use this to run `dbt run` / `dbt test` in Sequential Pipelines.
- `PythonOperator` when you need an explicit callable and cannot use `@task` (rare in new code).
- Provider operators and hooks such as `S3Hook`, `PostgresOperator`, and `KubernetesPodOperator`. Week 11 does not need these.
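As a preview, here is a minimal sketch of a classic operator living alongside TaskFlow tasks. The DAG, its filename, and the echo command are illustrative, not part of this week's exercises; the import assumes the `standard` provider path where Airflow 3 keeps `BashOperator`:

```python
# dags/hello_bash.py -- illustrative sketch, not one of this chapter's files
from datetime import datetime

from airflow.providers.standard.operators.bash import BashOperator
from airflow.sdk import dag, task


@dag(
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    tags=["week11", "sketch"],
)
def hello_bash():
    @task()
    def ingest() -> int:
        return 42

    # Classic operators are instantiated, not called: you create the object
    # and wire dependencies explicitly with >> instead of a function call.
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'ran after ingest'",
    )

    ingest() >> say_hello


hello_bash()
```

The `>>` line does by hand what `transform(ingest())` did implicitly; both end up as the same kind of edge in the graph view.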