Week 11 Glossary

A single place to look up every Airflow and orchestration term you meet this week. Entries are grouped by the chapter that first introduces the term: each term lives in exactly one place, and each chapter back-links to its glossary entry on first use. Skim the section for the chapter you are reading; the sections above it cover terms you have already met.

Chapter 1 - Introduction to Orchestration

Orchestration

Coordinating pipeline tasks, dependencies, retries, and run history across a full workflow. This is the job Airflow does. Compare with scheduling, which only answers "when to start."

Pipeline

The end-to-end sequence of data steps you want to run together. In Week 11 the pipeline is ingest → load → dbt run → dbt test.

DAG (Directed Acyclic Graph)

The step-by-step map of your pipeline expressed as a Python file in dags/. Acyclic means no loops: step B can depend on step A, but A cannot also depend on B.
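As a rough illustration of what "acyclic" rules out, the sketch below models this week's pipeline as a plain dictionary and orders it with Python's standard-library graphlib. The dictionary layout and the task names are assumptions made for illustration; Airflow builds its own graph from the DAG file and does not use this code.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical model: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "load": {"ingest"},
    "dbt_run": {"load"},
    "dbt_test": {"dbt_run"},
}

# A valid DAG has at least one order that respects every dependency.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Making "ingest" depend on "dbt_test" closes a loop, so no valid order exists.
cyclic = {**pipeline, "ingest": {"dbt_test"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The second graph is rejected because every task in the loop would have to wait for another task in the same loop.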

Task

One unit of work inside a DAG, usually a Python function decorated with @task. Deeper coverage in Airflow Fundamentals.

Dependency

The order relation between tasks, written as a >> b or implied by passing one task's return value into another. Detailed in Sequential Pipeline Steps.
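To see why a >> b >> c can be chained, here is a toy class that overloads the >> operator. ToyTask and its downstream list are hypothetical names for this sketch only; real Airflow tasks carry far more state, but the chaining idea (the expression returns its right-hand side) is the same.

```python
class ToyTask:
    """Minimal stand-in for a task; NOT Airflow's implementation."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # ids of tasks that must run after this one

    def __rshift__(self, other):
        # Record "self runs before other", then return the right-hand task
        # so that a >> b >> c is evaluated as (a >> b) >> c.
        self.downstream.append(other.task_id)
        return other


ingest, load, dbt_run = ToyTask("ingest"), ToyTask("load"), ToyTask("dbt_run")
ingest >> load >> dbt_run

print(ingest.downstream, load.downstream, dbt_run.downstream)
```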

Retry

Automatic re-attempt of a task after a temporary failure, configured via retries=N and retry_delay. Tested in Testing DAGs.
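The retry idea can be sketched in plain Python. The function run_with_retries below is a hypothetical helper, not Airflow's scheduler logic; it only mirrors the meaning of the two settings, where retries counts the extra attempts after the first one and retry_delay is the pause between attempts (in seconds here, for simplicity).

```python
import time


def run_with_retries(fn, retries=2, retry_delay=0.0):
    """Call fn; on failure wait retry_delay seconds and try again.

    Returns (result, attempts_used). Re-raises once the retries are used up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except Exception:
            if attempts > retries:  # first try plus `retries` extra tries done
                raise
            time.sleep(retry_delay)


calls = {"n": 0}


def flaky():
    """Fails twice with a temporary error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result, attempts = run_with_retries(flaky, retries=2, retry_delay=0.0)
print(result, attempts)
```

With retries=2 a task gets three attempts in total, which is why the third call in this example is still allowed to run.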

Idempotent

Rerunning the same task or DAG gives the same correct result without duplicates. Usually achieved with a DELETE-by-partition-key before the INSERT. Drilled in Sequential Pipeline Steps and the assignment.
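A minimal sketch of the DELETE-by-partition-key pattern, using an in-memory SQLite database so it runs anywhere. The table name events, its columns, and the load_partition helper are assumptions for illustration; in the course the target is Postgres, where the same two statements apply.

```python
import sqlite3


def load_partition(conn, ds, values):
    """Replace all rows for one date partition inside a single transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO events (ds, value) VALUES (?, ?)",
            [(ds, v) for v in values],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, value INTEGER)")

load_partition(conn, "2024-03-10", [1, 2, 3])
load_partition(conn, "2024-03-10", [1, 2, 3])  # rerun of the same date

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)
```

The rerun leaves the row count unchanged because the delete clears exactly the partition that the insert is about to refill.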

Chapter 2 - Airflow Fundamentals

Task instance

One execution of a task for one DAG run. The same task can have many task instances across runs (one per scheduled date, plus retries).

Operator

A pre-built task class such as BashOperator or PythonOperator. The @task decorator is sugar over PythonOperator.

TaskFlow API

The decorator-based way of writing DAGs (@dag, @task). Pairs naturally with XCom for passing values.

airflow.sdk

The Python import path you see in every Week 11 DAG (from airflow.sdk import dag, task). This is Airflow 3's public API surface; the older airflow.decorators is 2.x-era.

Scheduler

The Airflow service that decides which task instances are ready to run and hands them to workers. Distinct from "scheduler" in the generic sense (a thing that runs commands on a clock), which is only one of orchestration's responsibilities.

api-server

The Airflow 3 service that hosts the web UI and REST API. Renamed from webserver in 2.x. Detailed in Deploying to Shared Airflow.

dag-processor

The separate Airflow 3 service that parses .py files in dags/ into the scheduler's DAG list. A broken DAG fails parsing here without crashing the scheduler for everyone. Tested in Testing DAGs.

Worker

The process that actually executes the task code. In Astro CLI's local stack, workers and scheduler share a container.

Triggerer

The Airflow service that runs async-style sensors and deferrable operators without holding worker slots. Sits next to scheduler and dag-processor in the Astro CLI's local stack.

Metadata DB

The Postgres database Airflow itself uses to store DAG definitions, run history, task states, and XComs. Not to be confused with the Azure Postgres where your data lives.

Upstream / downstream

"Upstream of B" means "runs before B" (A in A >> B). A failure upstream puts the downstream task into upstream_failed. Drilled in Monitoring and Debugging.

Airflow UI

The web interface (served by the api-server) where you trigger DAGs, view logs, and inspect task states. Key views: DAG list, Grid view, Graph view, Task instance → Logs, Rendered Templates tab.

Chapter 3 - Scheduling and Triggers

Schedule

The cadence at which Airflow should create DAG runs. Accepts cron strings ("0 6 * * *"), presets ("@daily"), or cron-based timetables.

Cron expressions

The five-field syntax min hour day month weekday that Airflow's schedule= accepts. Inherited from Paul Vixie's 1987 cron rewrite; read the optional history chapter for the story.

Presets

Convenience aliases like "@daily", "@hourly", "@monthly" that expand to standard cron strings. Use them for clarity unless the cadence is non-standard.
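The relationship between presets and the five cron fields can be sketched as a small lookup. The PRESETS table and the explain_schedule helper are hypothetical and cover only the three presets named above; Airflow does its own parsing.

```python
# Standard cron equivalents of the three presets mentioned above.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@monthly": "0 0 1 * *",
}

FIELDS = ("minute", "hour", "day", "month", "weekday")


def explain_schedule(schedule):
    """Expand a preset if needed, then label the five cron fields."""
    cron = PRESETS.get(schedule, schedule)
    parts = cron.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}: {cron!r}")
    return dict(zip(FIELDS, parts))


print(explain_schedule("@daily"))
print(explain_schedule("0 6 * * *"))
```

Reading the second result field by field gives "minute 0 of hour 6, every day", which is the "0 6 * * *" example from the Schedule entry.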

Logical date (ds)

The date label for the data interval being processed. Written {{ ds }} in templates. Stable across retries and manual reruns: this is what makes backfills idempotent. Detailed in Parameterized Runs and Backfills.

Data interval

The [start, end) window of data a run is responsible for. For a daily schedule, the run whose interval covers 2024-03-10 starts only after 2024-03-10 is complete.
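A small sketch of the half-open window, assuming a daily schedule aligned to midnight UTC. The helper names daily_interval and in_interval are hypothetical; Airflow computes the interval for you.

```python
from datetime import date, datetime, timedelta, timezone


def daily_interval(day):
    """Return the half-open [start, end) window for one calendar day in UTC."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def in_interval(ts, start, end):
    """True when ts falls inside [start, end): start included, end excluded."""
    return start <= ts < end


start, end = daily_interval(date(2024, 3, 10))
last_second = datetime(2024, 3, 10, 23, 59, 59, tzinfo=timezone.utc)

print(start.isoformat(), end.isoformat())
print(in_interval(last_second, start, end), in_interval(end, start, end))
```

Because end is excluded, midnight on 2024-03-11 belongs to the next interval, and no timestamp is counted twice.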

Catchup

When set to True, Airflow creates a run for every missed interval from the DAG's start_date up to now. When set to False, Airflow skips the missed intervals and schedules only from the most recent one onward. Default is False in this course; flip to True only when you want automatic backfilling on un-pause.

Manual trigger

Starting a DAG run from the UI or CLI outside its schedule. On Airflow 3, manually triggered runs may have logical_date=None, so code that reads ds should fall back to dag_run.run_after via the _ds_from_context() helper.

Sensor

A task type that waits for an external condition (a file, a time, an API response) before letting the DAG continue. Use poke_interval and mode="reschedule" so sensors do not hold worker slots.

Dataset-driven scheduling

The pattern of triggering one DAG when another DAG updates a Dataset, instead of on a clock. Airflow introduced Datasets in 2.4 and renamed them Assets in Airflow 3. Sits next to cron schedules and sensors as a third kind of trigger. Not used this week; see Going Further for context.

Chapter 4 - Sequential Pipeline Steps

Connection

An Airflow object that stores a host, credentials, and metadata under a named ID (for example azure_pg). Tasks reference connections by ID rather than reading environment variables directly. Detailed in Monitoring and Debugging.

Hook

The Python client that uses a Connection to talk to a system. PostgresHook("azure_pg") is the one you use most this week.

XCom

(cross-communication) Airflow's built-in mechanism for passing small values (row counts, IDs) between tasks via the metadata DB. Good for int/str; bad for DataFrames or filesystem paths.

AIRFLOW_STUDENT / airflow_<name>

The per-student schema-isolation pattern. Your DAG's STUDENT constant reads this env var (or falls back to the DAG's subdirectory name on the shared VM) and writes to an airflow_<your-name> schema in Azure Postgres so you do not clobber classmates' data. Used again in Deploying to Shared Airflow.

Per-student schema isolation

The convention of writing to airflow_<your-name> instead of a shared analytics schema during development. Pairs with AIRFLOW_STUDENT.
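The lookup order described above (environment variable first, DAG subdirectory name second) might be sketched like this. The student_schema helper, the env parameter, and the example path dags/alice/pipeline.py are hypothetical; the course's DAG template may resolve the name differently.

```python
import os
from pathlib import Path


def student_schema(dag_file, env=None):
    """Build the per-student schema name.

    Uses AIRFLOW_STUDENT when set, otherwise the name of the directory
    that contains the DAG file (sketch; assumed layout dags/<name>/...).
    """
    env = os.environ if env is None else env
    student = env.get("AIRFLOW_STUDENT") or Path(dag_file).parent.name
    return f"airflow_{student}"


# Passing env explicitly keeps the example independent of your shell.
print(student_schema("dags/alice/pipeline.py", env={}))
print(student_schema("dags/alice/pipeline.py", env={"AIRFLOW_STUDENT": "bob"}))
```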

Chapter 5 - Parameterized Runs and Backfills

Backfill

Running historical dates after the fact, typically with astro dev run backfill create. Airflow 3 uses deterministic backfill__<logical-date> run IDs so repeat backfills are idempotent.
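Conceptually, a backfill walks a range of logical dates and runs the DAG once per date. The sketch below only enumerates those dates; logical_dates is a hypothetical helper, the inclusive end date is a choice made for this example, and the real work is done by Airflow's backfill command.

```python
from datetime import date, timedelta


def logical_dates(start, end):
    """List every date from start to end (both inclusive) as YYYY-MM-DD strings."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


dates = logical_dates(date(2024, 2, 28), date(2024, 3, 1))
print(dates)
```

Each string plays the role of ds for one run, so an idempotent task can be replayed for any of these dates without creating duplicates.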