Week 11 - Orchestration

Introduction to Orchestration

Airflow Fundamentals

Scheduling and Triggers

Sequential Pipeline Steps

Parameterized Runs and Backfills

Testing DAGs

Monitoring and Debugging

Deploying to Shared Airflow

Practice

Gotchas & Pitfalls

Assignment: Build an Orchestrated Data Pipeline

Week 11 Lesson Plan (Teachers)

Deploying to Shared Airflow

Everything you have built so far runs on your laptop. That is the right place to develop a DAG: fast feedback, no blast radius when you break something, no one else watching the scheduler trip over your typo. But the point of orchestration is to run on schedule, without someone opening a laptop. That means your DAG has to live on shared infrastructure.

This chapter walks you through deploying your taxi_pipeline to the class's shared Airflow instance (running on the teacher's Azure VM via Docker Compose) and working alongside your classmates on the same scheduler. The workflow maps directly to how you will deploy Airflow DAGs in a real team: a Git repo, a code-review step, auto-deploy on merge, and an Airflow UI that multiple people share without stepping on each other.

By the end of this chapter, you should be able to:

- deploy your taxi_pipeline to the class's shared Airflow instance through the shared Git repo and pull-request flow
- namespace your dag_id and tags so your DAGs are easy to find on a UI you share with thirty classmates
- follow shared-infrastructure etiquette: pause, trigger, and clear only your own DAGs
- debug a failed shared deploy from the UI, without SSH access to the VM

Concepts

Local Airflow vs shared Airflow

You have been running Airflow two ways already, possibly without naming the difference:

|  | Local (astro dev start) | Shared (class Azure VM) |
| --- | --- | --- |
| Who runs the scheduler? | You | Teacher, 24/7 |
| How many DAGs in dags/? | Yours only | One per student, plus class demos |
| How do you deploy a change? | Save the file | Push to shared repo, wait for poll |
| Who sees failures? | You | Everyone logged in to the shared UI |
| Who can pause your DAG? | You | Anyone with UI access |
| Blast radius of a bad DAG | Your laptop | Shared scheduler |

The shared model is what production looks like. Your team's ETL runs on a shared cluster that all data engineers deploy to; you do not each get your own Airflow. The habits you build this week (pushing through a repo, tagging your DAGs, not pausing teammates' work) are the same habits that keep a real production Airflow healthy.

<aside> 📘 Core program connection: You already learned in the Core program that pushing to main on a shared repo affects everyone. Shared Airflow is the same pattern applied to pipelines: your Git commit goes through review and then becomes live infrastructure. The extra step is that "live" now means a scheduler runs it on a cron instead of serving an HTTP request.

</aside>

How deploy-on-push works

The class's shared Airflow stack runs on a single Azure VM (Docker Compose with the Airflow 3 services: api-server, scheduler, dag-processor, plus a metadata Postgres). The dags/ folder on the VM is a Git working copy of the class repo. A cron job on the VM runs git pull every minute. When it finds new commits, the dag-processor container sees the file change and re-parses. Your DAG appears in the UI.

flowchart LR
    laptop["**your laptop**<br/>`git push`"]:::dev
    repo[("**GitHub**<br/>class repo `main`")]:::repo
    cron["**teacher VM**<br/>`cron: git pull` (every 60s)"]:::vm
    parser["dag-processor<br/>re-parses changed files"]:::parser
    ui["DAG appears in UI"]:::ui

    laptop -->|push| repo
    repo -->|pull| cron
    cron --> parser --> ui

    classDef dev fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef repo fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
    classDef vm fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
    classDef parser fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
    classDef ui fill:#f5f5f5,stroke:#666,stroke-width:2px,color:#000

No deploy pipeline, no Docker image rebuild, no teacher intervention needed for a DAG change. The trade-off is a ~60-second deploy lag; in exchange you get a workflow that is trivially easy to debug ("git pull on my laptop matches what's in prod") and uses zero extra infrastructure budget.
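For concreteness, the sync job on the VM amounts to a single crontab entry along these lines (the repo path and log location here are assumptions, not the VM's actual values):

* * * * * cd /opt/class-airflow && git pull >> /var/log/dag-sync.log 2>&1

Because dags/ is just a Git working copy, the deploy history is the Git history: whatever commit is checked out on the VM is exactly what the dag-processor parses.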

The shared repo layout

The class repo is structured so each student has an isolated subdirectory. Conceptually:

class-airflow/
├── dags/
│   ├── lasse/
│   │   └── taxi_pipeline.py
│   ├── maria/
│   │   └── taxi_pipeline.py
│   └── student-N/
│       └── ...
├── include/
│   └── dbt_project/          # shared dbt project, read-only for students
├── tests/
│   └── test_dag_integrity.py # from Testing DAGs: runs against the whole dags/ tree
└── requirements.txt          # shared

The Airflow dag-processor scans dags/ recursively, so a DAG at dags/lasse/taxi_pipeline.py is picked up exactly the same as one at dags/taxi_pipeline.py: Airflow does not care about subdirectories; it only cares about finding .py files that define a DAG.

The subdirectory is your sandbox. You only ever edit files under dags/<your-username>/. If you see your classmate's DAG in the shared UI, treat it as read-only: do not trigger it, do not pause it, do not look at its XCom unless asked. Same rule as a shared production environment with multiple teams.

Namespacing your DAGs (so you can find them)

Thirty students each deploying a DAG named taxi_pipeline is a problem twice over: dag_ids must be unique across an Airflow instance, so identical IDs collide with each other instead of listing as thirty separate rows, and even with unique IDs the grid becomes a wall of look-alike names that every teacher-triggered demo adds to. Two conventions prevent a cluttered UI:

1. Prefix your dag_id with your username. Edit the decorator at the bottom of your DAG file:

# before (local development)
@dag(
    dag_id="taxi_pipeline",
    schedule="@monthly",
    ...
)

# after (shared deployment)
@dag(
    dag_id="lasse_taxi_pipeline",
    schedule="@monthly",
    ...
)

The UI sorts alphabetically; your DAGs land next to each other.

2. Tag your DAGs with your username. The tags= list you have been using for ["week11", "taxi"] gets one more entry:

@dag(
    dag_id="lasse_taxi_pipeline",
    tags=["week11", "taxi", "student:lasse"],
    ...
)

Airflow's UI has a tag filter. Click student:lasse and the grid shows only your DAGs. This is the 10-second "find my stuff" tool you will use every class.

Schema isolation, again

You already set AIRFLOW_STUDENT locally and your DAG writes to airflow_<name> on the shared Azure Postgres (from Sequential Pipeline Steps). On the shared VM there is no per-student env variable; the snapshot's STUDENT constant falls back to the parent directory name of the DAG file, so a DAG at dags/<yourname>/taxi_pipeline.py automatically picks up airflow_<yourname> without any manual configuration. Your ingest_taxi_month task writes to your schema, not the class's public tables and not your classmates' schemas.
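A minimal sketch of what that fallback can look like (illustrative only; the snapshot's actual code may differ in details):

import os
from pathlib import Path

# Prefer the explicit AIRFLOW_STUDENT env var (set in your local stack);
# otherwise fall back to the DAG file's parent directory name, which on the
# shared VM is dags/<yourname>/.
STUDENT = os.getenv("AIRFLOW_STUDENT") or Path(__file__).resolve().parent.name
TARGET_SCHEMA = f"airflow_{STUDENT}"  # e.g. airflow_lasse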

The schema separation is why "thirty students deploying the same DAG" does not blow up into thirty fights over the raw_trips table. Each student's DAG writes to a different database schema. You can SQL-query your own airflow_lasse.raw_trips and compare against a classmate's airflow_maria.raw_trips for a sanity check without either of you overwriting the other's data.

<aside> ⚠️ You still share the azure_pg Airflow Connection with everyone. Do not change its host, password, or sslmode on the shared UI: you would break every classmate's run at the same time. If you need to experiment with connection settings, do it on your local stack first.

</aside>

Who can pause whose DAG?

On the shared UI, every student has a login with the same permission level: full read and write access to everything. This is both practical (thirty students do not need thirty separate Airflow RBAC roles for a Week 11 exercise) and deliberate (you learn shared-infra etiquette by being able to misuse it and choosing not to).

The four rules of shared-Airflow etiquette:

  1. Pause and unpause only your own DAGs. If you pause maria_taxi_pipeline to "clean up the UI", you just stopped Maria's scheduled run. She will not thank you.
  2. Trigger only your own DAGs. Clicking Trigger on someone else's DAG wastes their shared Postgres quota and adds rows to their schema they did not ask for.
  3. Clear only your own task instances. The Clear button resets state and re-queues tasks. Doing it on a classmate's DAG gives them a fresh surprise run.
  4. If something shared is broken (the scheduler is down, the shared Postgres is full, the azure_pg connection is wrong), tell the teacher in the class channel. Do not "fix" it yourself from the UI.

The single exception: the teacher occasionally triggers a class-demo DAG in front of everyone. That is expected.

<aside> 🤓 Curious Geek: Airflow's "Multi-Tenancy" story

Airflow has never had a great multi-tenant story. For most of Airflow 2's life, DAG parsing ran inside the scheduler process, so one pathological DAG file (top-level code that hangs or eats memory) could degrade scheduling for everyone. A standalone dag-processor service arrived as an opt-in option in Airflow 2.3 and became mandatory in Airflow 3.0: the Airflow 3 scheduler no longer parses DAG files at all, which is why your class VM runs the dag-processor as a required service. Isolated parsing means one student's typo surfaces as an import error on that one DAG rather than a problem for the shared scheduler. For real production multi-tenancy (dozens of teams, thousands of DAGs), teams still usually split into multiple Airflow instances, one per environment or team, rather than trying to slice access controls inside one.

</aside>

⌨️ Hands on: Deploy your taxi_pipeline to shared Airflow

  1. Get the class repo URL and your UI login from the teacher if you do not have them yet. For this cohort the repo is lassebenni/class-airflow-reference (public; you can browse and clone without auth, but you need push access from the teacher to push branches and open PRs). Your UI login is student-<yourname> with the initial password sent in the class channel on day one.
  2. Clone and branch. From anywhere on your laptop:
   git clone <repo-url> class-airflow
   cd class-airflow
   git checkout -b <yourname>/taxi-pipeline
  3. Create your subdirectory. Inside the repo, make your sandbox and copy your taxi_pipeline into it:
   mkdir -p dags/<yourname>
   cp /path/to/your/local/dags/taxi_pipeline.py dags/<yourname>/taxi_pipeline.py
  4. Namespace the DAG. Open dags/<yourname>/taxi_pipeline.py and update the @dag(...) decorator:
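   The result should mirror the namespacing conventions from earlier in this chapter, for example:

   @dag(
       dag_id="<yourname>_taxi_pipeline",
       tags=["week11", "taxi", "student:<yourname>"],
       schedule="@monthly",
       ...
   )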
  5. Run the integrity test locally before pushing. The one from Testing DAGs:
   astro dev pytest tests/test_dag_integrity.py --args "-v"

This catches import errors before your commit hits the shared scheduler. A broken push does not crash the class VM (the separate dag-processor isolates parse failures), but it does show up as a red 1 badge on the DAGs list and a passive-aggressive Slack message from the teacher.

  6. Push and open a pull request. Standard Git flow:
   git add dags/<yourname>/taxi_pipeline.py
   git commit -m "add <yourname> taxi_pipeline"
   git push -u origin <yourname>/taxi-pipeline
   gh pr create --title "<yourname>: add taxi_pipeline" --body "Week 11 shared deploy"

If your cohort has a teacher-approval rule, wait for the PR to be merged. If merges are self-service for Week 11, merge your own PR after the integrity-test CI passes.

  7. Watch for your DAG in the shared UI. Open the class Airflow UI. The DAG list shows everyone's DAGs: yours plus classmates' plus the teacher's class-demo.

Shared Airflow DAG list: class_demo_hello (teacher, paused) and lasse_taxi_pipeline (student:lasse tag, unpaused) both visible on the same UI

To find only yours, click the Filter by tag dropdown and pick student:<yourname>:

Shared Airflow DAG list filtered by the student:lasse tag, showing 1 Dag

Wait up to 60 seconds for the cron poll + dag-processor re-parse after your merge. You should see <yourname>_taxi_pipeline appear.

  8. Unpause and trigger. Flip the DAG switch from paused to unpaused. Click Trigger. Open the run and watch the tasks execute: ingest_taxi_month → dbt_run → dbt_test. The ingest task runs against the shared Azure Postgres, writing to airflow_<yourname> (same schema as your local runs):

Shared Airflow task instances view: ingest_taxi_month=Success running against the shared Azure Postgres, dbt_run=Up For Retry because the shared VM does not ship dbt, dbt_test still queued

<aside> 💡 On the current class VM, ingest_taxi_month goes green but dbt_run and dbt_test stay orange (Up For Retry): the shared VM does not include dbt-core in its pip requirements by design (dbt-core + Airflow's transitive OTEL pins cause pip to backtrack indefinitely every container restart). Students whose DAGs need dbt on the shared VM either install it inline in the BashOperator (pip install --quiet dbt-postgres==1.10.* && dbt run …) or the teacher bakes a custom Airflow image. The Going Further page covers the custom-image pattern. For this step, seeing ingest_taxi_month green is the signal the deploy loop works end-to-end; the dbt tasks are expected orange.

</aside>
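If your DAG does need dbt on the shared VM and you go the inline-install route, a rough sketch of the dbt_run task follows; the project and profiles paths are placeholders, so adapt them to however your taxi_pipeline already invokes dbt:

# Airflow 3 import path; on Airflow 2 the same operator lives in airflow.operators.bash.
from airflow.providers.standard.operators.bash import BashOperator

# Sketch only: installs dbt into the worker environment on every run, which is
# slow but avoids baking a custom image. Paths are placeholders, not the VM's.
dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command=(
        "pip install --quiet 'dbt-postgres==1.10.*' && "
        "dbt run --project-dir /path/to/include/dbt_project "
        "--profiles-dir /path/to/include/dbt_project"
    ),
)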

  9. Verify isolation. Go back to the full DAG list and open a classmate's DAG (different student:<name> tag). Do not trigger or pause it. Note how it coexists with yours: same scheduler, same UI, separate schema writes, separate log streams. This is what "shared production" feels like.
  10. Clean up only your own work. If your DAG failed, debug it using the patterns from Monitoring and Debugging. Push a fix to your branch, open a new PR, merge, watch the re-parse, re-trigger. Do not delete or rename other students' DAG files to "tidy up".

Completing these ten steps once reproduces the core deployment loop you will use every day as a data engineer: edit locally, integrity-test locally, push, code-review, auto-deploy, verify on prod UI, iterate.

Debugging on shared vs local

When a shared deploy fails, the debugging steps from Monitoring and Debugging apply unchanged: open the UI, find the red task, read the log upward, check the rendered template tab. The only difference is that the log file physically lives on the teacher's VM, not your laptop, and the UI streams it to you. You cannot tail -f the log from your terminal unless you have SSH access.

Two new failure modes you did not have locally: