Everything you have built so far runs on your laptop. That is the right place to develop a DAG: fast feedback, no blast radius when you break something, no one else watching the scheduler trip over your typo. But the point of orchestration is to run on schedule, without someone opening a laptop. That means your DAG has to live on shared infrastructure.
This chapter walks you through deploying your taxi_pipeline to the class's shared Airflow instance (running on the teacher's Azure VM via Docker Compose) and working alongside your classmates on the same scheduler. The workflow maps directly to how you will deploy Airflow DAGs in a real team: a Git repo, a code-review step, auto-deploy on merge, and an Airflow UI that multiple people share without stepping on each other.
By the end of this chapter, you should be able to:

- deploy your taxi_pipeline DAG to the class's shared Airflow instance through the class Git repo
- namespace your dag_id, tags, and target schema so thirty students can coexist on one scheduler
- explain how a push to main ends up in the VM's dags/ folder
- debug a failed run on infrastructure you do not control

You have been running Airflow two ways already, possibly without naming the difference:
| | Local (`astro dev start`) | Shared (class Azure VM) |
|---|---|---|
| Who runs the scheduler? | You | Teacher, 24/7 |
| How many DAGs in `dags/`? | Yours only | One per student, plus class demos |
| How do you deploy a change? | Save the file | Push to shared repo, wait for poll |
| Who sees failures? | You | Everyone logged in to the shared UI |
| Who can pause your DAG? | You | Anyone with UI access |
| Blast radius of a bad DAG | Your laptop | Shared scheduler |
The shared model is what production looks like. Your team's ETL runs on a shared cluster that all data engineers deploy to; you do not each get your own Airflow. The habits you build this week (pushing through a repo, tagging your DAGs, not pausing teammates' work) are the same habits that keep a real production Airflow healthy.
<aside>
📘 Core program connection: You already learned in the Core program that pushing to main on a shared repo affects everyone. Shared Airflow is the same pattern applied to pipelines: your Git commit goes through review and then becomes live infrastructure. The extra step is that "live" now means a scheduler runs it on a cron instead of serving an HTTP request.
</aside>
The class's shared Airflow stack runs on a single Azure VM (Docker Compose with the Airflow 3 services: api-server, scheduler, dag-processor, plus a metadata Postgres). The dags/ folder on the VM is a Git working copy of the class repo. A cron job on the VM runs git pull every minute. When it finds new commits, the dag-processor container sees the file change and re-parses. Your DAG appears in the UI.
```mermaid
flowchart LR
    laptop["**your laptop**<br/>`git push`"]:::dev
    repo[("**GitHub**<br/>class repo `main`")]:::repo
    cron["**teacher VM**<br/>`cron: git pull` (every 60s)"]:::vm
    parser["dag-processor<br/>re-parses changed files"]:::parser
    ui["DAG appears in UI"]:::ui

    laptop -->|push| repo
    repo -->|pull| cron
    cron --> parser --> ui

    classDef dev fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef repo fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
    classDef vm fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
    classDef parser fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
    classDef ui fill:#f5f5f5,stroke:#666,stroke-width:2px,color:#000
```
No CI pipeline, no Docker image rebuild, no teacher intervention needed for a DAG change. The trade-off is a ~60-second deploy lag; in exchange you get a workflow that is trivially easy to debug ("git pull on my laptop matches what's in prod") and uses zero infrastructure budget.
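For the curious: the VM side of that loop can be a single crontab line. This is an illustrative sketch, not the literal class VM config; the repo path and log file are assumptions:

```bash
# Illustrative crontab entry on the VM (edit with `crontab -e`); paths are assumptions.
# Every minute: fast-forward the dags/ working copy to match the class repo on GitHub.
* * * * * cd /opt/class-airflow && git pull --ff-only >> /var/log/class-airflow-pull.log 2>&1
```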
The class repo is structured so each student has an isolated subdirectory. Conceptually:
```
class-airflow/
├── dags/
│   ├── lasse/
│   │   └── taxi_pipeline.py
│   ├── maria/
│   │   └── taxi_pipeline.py
│   └── student-N/
│       └── ...
├── include/
│   └── dbt_project/            # shared dbt project, read-only for students
├── tests/
│   └── test_dag_integrity.py   # from Testing DAGs: runs against the whole dags/ tree
└── requirements.txt            # shared
```
The Airflow dag-processor scans dags/ recursively, so a DAG at dags/lasse/taxi_pipeline.py is picked up exactly the same as one at dags/taxi_pipeline.py: Airflow does not care about subdirectories; it only cares about finding .py files that define a DAG.
The subdirectory is your sandbox. You only ever edit files under dags/<your-username>/. If you see your classmate's DAG in the shared UI, treat it as read-only: do not trigger it, do not pause it, do not look at its XCom unless asked. Same rule as a shared production environment with multiple teams.
Thirty students each deploying a taxi_pipeline DAG means the Airflow UI grid lists thirty rows of taxi_pipeline, and every teacher-triggered demo adds more noise. Two conventions prevent a cluttered UI:
1. Prefix your dag_id with your username. Edit the decorator at the bottom of your DAG file:
```python
# before (local development)
@dag(
    dag_id="taxi_pipeline",
    schedule="@monthly",
    ...
)

# after (shared deployment)
@dag(
    dag_id="lasse_taxi_pipeline",
    schedule="@monthly",
    ...
)
```
The UI sorts alphabetically; your DAGs land next to each other.
2. Tag your DAGs with your username. The tags= list you have been using for ["week11", "taxi"] gets one more entry:
```python
@dag(
    dag_id="lasse_taxi_pipeline",
    tags=["week11", "taxi", "student:lasse"],
    ...
)
```
Airflow's UI has a tag filter. Click student:lasse and the grid shows only your DAGs. This is the 10-second "find my stuff" tool you will use every class.
You already set AIRFLOW_STUDENT locally and your DAG writes to airflow_<name> on the shared Azure Postgres (from Sequential Pipelines). On the shared VM there is no per-student env variable; the snapshot's STUDENT constant falls back to the parent directory name of the DAG file, so a DAG at dags/<yourname>/taxi_pipeline.py automatically picks up airflow_<yourname> without any manual configuration. Your ingest_taxi_month task writes to your schema, not the class's public tables and not your classmates' schemas.
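A minimal sketch of how that fallback can work (the snapshot's actual constant may differ in detail; the env-var name and the directory-name fallback are the behavior described above):

```python
import os
from pathlib import Path

# Local dev: AIRFLOW_STUDENT is set explicitly.
# Shared VM: fall back to the DAG file's parent directory, e.g.
# dags/lasse/taxi_pipeline.py -> "lasse".
STUDENT = os.environ.get("AIRFLOW_STUDENT") or Path(__file__).resolve().parent.name
SCHEMA = f"airflow_{STUDENT}"  # e.g. airflow_lasse
```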
The schema separation is why "thirty students deploying the same DAG" does not blow up into thirty fights over the raw_trips table. Each student's DAG writes to a different database schema. You can SQL-query your own airflow_lasse.raw_trips and compare against a classmate's airflow_maria.raw_trips for a sanity check without either of you overwriting the other's data.
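If you want to run that comparison from a script rather than a SQL client, a sketch like this works. The connection parameters are placeholders for the shared azure_pg credentials, and psycopg2 is an assumption about your local environment:

```python
import psycopg2  # assumes psycopg2-binary is installed locally

# Read-only sanity check: compare row counts across two student schemas.
with psycopg2.connect(
    host="<azure-pg-host>", dbname="<dbname>",
    user="<user>", password="<password>", sslmode="require",
) as conn:
    with conn.cursor() as cur:
        for schema in ("airflow_lasse", "airflow_maria"):
            cur.execute(f"SELECT count(*) FROM {schema}.raw_trips")
            print(schema, cur.fetchone()[0])
```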
<aside>
⚠️ You still share the azure_pg Airflow Connection with everyone. Do not change its host, password, or sslmode on the shared UI: you would break every classmate's run at the same time. If you need to experiment with connection settings, do it on your local stack first.
</aside>
On the shared UI, every student has a login with the same permission level: full read and write access to everything. This is both practical (thirty students do not need thirty separate Airflow RBAC roles for a Week 11 exercise) and deliberate (you learn shared-infra etiquette by being able to misuse it and choosing not to).
The four rules of shared-Airflow etiquette:

1. Trigger only your own DAGs.
2. Pause only your own DAGs. If you pause maria_taxi_pipeline to "clean up the UI", you just stopped Maria's scheduled run. She will not thank you.
3. Treat classmates' runs as read-only: no retries, no clearing task instances, no XCom browsing unless asked.
4. Leave shared resources alone. If something shared looks broken (e.g. the azure_pg connection is wrong), tell the teacher in the class channel. Do not "fix" it yourself from the UI.

The single exception: the teacher occasionally triggers a class-demo DAG in front of everyone. That is expected.
<aside> 🤓 Curious Geek: Airflow's "Multi-Tenancy" story
Airflow has never had a great multi-tenant story. For years, DAG parsing was global: one broken DAG file could crash the scheduler for everyone. The standalone dag-processor was introduced in Airflow 2.3 as opt-in, then made mandatory in Airflow 3.0. Your class VM runs it as a required service because the Airflow 3 scheduler no longer parses DAG files itself. Isolated parse failures mean one student's typo no longer takes down the shared UI. For real production multi-tenancy (dozens of teams, thousands of DAGs), teams still usually split into multiple Airflow instances, one per environment or team, rather than trying to slice access controls inside one.
</aside>
The class repo is lassebenni/class-airflow-reference (public; you can browse and clone without auth, but you need push access from the teacher to open PRs). Your UI login is student-<yourname> with the initial password sent in the class channel on day one.

1. Clone the repo and create a feature branch:

```bash
git clone <repo-url> class-airflow
cd class-airflow
git checkout -b <yourname>/taxi-pipeline
```

2. Copy your local DAG into your own subdirectory:

```bash
mkdir -p dags/<yourname>
cp /path/to/your/local/dags/taxi_pipeline.py dags/<yourname>/taxi_pipeline.py
```

3. Open dags/<yourname>/taxi_pipeline.py and update the @dag(...) decorator:
   - Set dag_id to "<yourname>_taxi_pipeline" (the default is your function name, so either rename the function or pass dag_id= explicitly).
   - Add "student:<yourname>" to the tags= list.

4. Run the integrity test locally:

```bash
astro dev pytest tests/test_dag_integrity.py --args "-v"
```
This catches import errors before your commit hits the shared scheduler. A broken push does not crash the class VM (the separate dag-processor isolates parse failures), but it does show up as a red 1 badge on the DAGs list and a passive-aggressive Slack message from the teacher.
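If you are curious what that integrity test boils down to, its core is usually a DagBag import check along these lines (a sketch, not necessarily the class repo's exact test):

```python
from airflow.models import DagBag

def test_no_import_errors():
    # Parse every .py file under dags/ exactly like the dag-processor would.
    bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not bag.import_errors, f"DAG import failures: {bag.import_errors}"
```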
5. Commit, push, and open a PR:

```bash
git add dags/<yourname>/taxi_pipeline.py
git commit -m "add <yourname> taxi_pipeline"
git push -u origin <yourname>/taxi-pipeline
gh pr create --title "<yourname>: add taxi_pipeline" --body "Week 11 shared deploy"
```
If your cohort has a teacher-approval rule, wait for the PR to be merged. If merges are self-service for Week 11, merge your own PR after the integrity-test CI passes.
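In the self-service case, merging from the same terminal is one more gh command (check the cohort rule first):

```bash
gh pr merge --squash --delete-branch
```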

Shared Airflow DAG list: class_demo_hello (teacher, paused) and lasse_taxi_pipeline (student:lasse tag, unpaused) both visible on the same UI
To find only yours, click the Filter by tag dropdown and pick student:<yourname>:

Shared Airflow DAG list filtered by the student:lasse tag, showing 1 Dag
Wait up to 60 seconds for the cron poll + dag-processor re-parse after your merge. You should see <yourname>_taxi_pipeline appear.
Once it does, trigger a run and watch the three tasks: ingest_taxi_month → dbt_run → dbt_test. The ingest task runs against the shared Azure Postgres, writing to airflow_<yourname> (same schema as your local runs):
Shared Airflow task instances view: ingest_taxi_month=Success running against the shared Azure Postgres, dbt_run=Up For Retry because the shared VM does not ship dbt, dbt_test still queued
<aside>
💡 On the current class VM, ingest_taxi_month goes green but dbt_run and dbt_test stay orange (Up For Retry): the shared VM does not include dbt-core in its pip requirements by design (dbt-core + Airflow's transitive OTEL pins cause pip to backtrack indefinitely every container restart). Students whose DAGs need dbt on the shared VM either install it inline in the BashOperator (pip install --quiet dbt-postgres==1.10.* && dbt run …) or the teacher bakes a custom Airflow image. The Going Further page covers the custom-image pattern. For this step, seeing ingest_taxi_month green is the signal the deploy loop works end-to-end; the dbt tasks are expected orange.
</aside>
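For reference, the inline-install workaround from the aside looks roughly like this in the DAG file. The dbt project path is an assumption (Astro-style images usually mount the project under /usr/local/airflow), so adjust it to wherever include/dbt_project lands on the VM:

```python
# Airflow 3 import path (Airflow 2 would use airflow.operators.bash).
from airflow.providers.standard.operators.bash import BashOperator

dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command=(
        # Install dbt at task runtime instead of baking it into the image.
        "pip install --quiet 'dbt-postgres==1.10.*' && "
        # Path is an assumption; match your VM's dbt project location.
        "cd /usr/local/airflow/include/dbt_project && dbt run"
    ),
)
```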
Finally, find one classmate's DAG in the UI (filter by their student:<name> tag). Do not trigger or pause it. Note how it coexists with yours: same scheduler, same UI, separate schema writes, separate log streams. This is what "shared production" feels like.

Completing these steps once reproduces the core deployment loop you will use every day as a data engineer: edit locally, integrity-test locally, push, code-review, auto-deploy, verify on prod UI, iterate.
When a shared deploy fails, the debugging steps from Monitoring and Debugging apply unchanged: open the UI, find the red task, read the log upward, check the rendered template tab. The only difference is that the log file physically lives on the teacher's VM, not your laptop, and the UI streams it to you. You cannot tail -f the log from your terminal unless you have SSH access.
Two new failure modes you did not have locally:

1. Missing Python dependency. Your DAG imports a package that exists in your local environment but is missing from the shared requirements.txt. Fix: add the dependency, open a PR, wait for the teacher to restart the scheduler + api-server containers on merge (one docker compose up -d --force-recreate, ~30 seconds). Container restart re-runs _PIP_ADDITIONAL_REQUIREMENTS; no image rebuild needed.