A single place to look up every Airflow and orchestration term you meet this week. Entries are ordered by the chapter that first introduces the term: each term lives in exactly one place, and each chapter back-links to its glossary entry on first use. Skim the section for the chapter you are reading; the section above it covers terms you have already met.
Coordinating pipeline tasks, dependencies, retries, and run history across a full workflow. This is what Airflow does. Compare with scheduling, which only answers "when to start."
The end-to-end sequence of data steps you want to run together. In Week 11 the pipeline is ingest → load → dbt run → dbt test.
The step-by-step map of your pipeline expressed as a Python file in dags/. Acyclic means no loops: step B can depend on step A, but A cannot also depend on B.
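The acyclic rule can be illustrated without Airflow at all. The sketch below uses Python's standard-library graphlib, not Airflow's own parser, and the step names are taken from the Week 11 pipeline purely for illustration. It shows that a loop-free graph has a run order and that adding a back-edge makes ordering impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Map each step to the set of steps it depends on (its upstream steps).
pipeline = {
    "load": {"ingest"},
    "dbt_run": {"load"},
    "dbt_test": {"dbt_run"},
}

# A graph without loops always has at least one valid run order.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Making "ingest" depend on "dbt_test" closes a loop, so no order exists.
looped = {**pipeline, "ingest": {"dbt_test"}}
try:
    list(TopologicalSorter(looped).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Airflow performs a comparable check when it parses a DAG file, which is why a cyclic dependency is reported as a parse-time error rather than a runtime one.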
One unit of work inside a DAG, usually a Python function decorated with @task. Deeper coverage in Airflow Fundamentals.
The order relation between tasks, written as a >> b or implied by passing one task's return value into another. Detailed in Sequential Pipeline Steps.
Automatic re-attempt of a task after a temporary failure, configured via retries=N and retry_delay. Tested in Testing DAGs.
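To make the retry behaviour concrete, here is a plain-Python sketch of what "re-attempt after a temporary failure" means. It is not Airflow's implementation: run_with_retries and flaky are hypothetical names, and retry_delay is given in seconds here, whereas Airflow takes a timedelta.

```python
import time


def run_with_retries(func, retries=2, retry_delay=0.0):
    """Call func; on failure wait retry_delay seconds and try again.

    `retries` counts the extra attempts after the first one,
    mirroring the meaning of retries=N on an Airflow task.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: let the failure surface
            attempt += 1
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky():
    """Fail on the first two calls, succeed on the third."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = run_with_retries(flaky, retries=2, retry_delay=0.0)
print(result, calls["count"])
```

With retries=2 the function is called at most three times in total, so a task that fails twice and then succeeds still ends in a success state.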
Rerunning the same task or DAG gives the same correct result without duplicates. Usually achieved with a DELETE-by-partition-key before the INSERT. Drilled in Sequential Pipeline Steps and the assignment.
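The DELETE-by-partition-key pattern might look like the sketch below. It uses an in-memory SQLite database so it runs anywhere; the sales table, its columns, and the load_partition function are made up for illustration, and the course pipeline targets Postgres instead.

```python
import sqlite3


def load_partition(conn, ds, amounts):
    """Replace all rows for one date partition inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO sales (ds, amount) VALUES (?, ?)",
            [(ds, amount) for amount in amounts],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ds TEXT, amount INTEGER)")

load_partition(conn, "2024-03-10", [10, 20])
load_partition(conn, "2024-03-10", [10, 20])  # rerun of the same date

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```

Because the delete and the insert share one transaction, a rerun replaces the partition rather than appending to it, and a failure midway leaves the previous rows in place.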
One execution of a task for one DAG run. The same task can have many task instances across runs (one per scheduled date, plus retries).
A pre-built task class such as BashOperator or PythonOperator. The @task decorator is sugar over PythonOperator.
The decorator-based way of writing DAGs (@dag, @task). Pairs naturally with XCom for passing values.
The Python import path you see in every Week 11 DAG (from airflow.sdk import dag, task). This is Airflow 3's public API surface; the older airflow.decorators is 2.x-era.
The Airflow service that decides which task instances are ready to run and hands them to workers. Distinct from "scheduler" in the generic sense (a thing that runs commands on a clock), which is only one of orchestration's responsibilities.
The Airflow 3 service that hosts the web UI and REST API. It was called the webserver in Airflow 2.x. Detailed in Deploying to Shared Airflow.
The separate Airflow 3 service that parses .py files in dags/ into the scheduler's DAG list. A broken DAG fails parsing here without crashing the scheduler for everyone. Tested in Testing DAGs.
The process that actually executes the task code. In Astro CLI's local stack, workers and scheduler share a container.
The Airflow service that runs async-style sensors and deferrable operators without holding worker slots. Sits next to scheduler and dag-processor in the Astro CLI's local stack.
The Postgres database Airflow itself uses to store DAG definitions, run history, task states, and XComs. Not to be confused with the Azure Postgres where your data lives.
"Upstream of B" means "runs before B" (A in A >> B). A failure upstream puts the downstream task into upstream_failed. Drilled in Monitoring and Debugging.
The web interface (the api-server) where you trigger DAGs, view logs, and inspect task states. Key views: DAG list, Grid view, Graph view, Task instance → Logs, Rendered Templates tab.
The cadence at which Airflow should create DAG runs. Accepts cron strings ("0 6 * * *"), presets ("@daily"), or cron-based timetables.
The five-field syntax (minute, hour, day of month, month, day of week) that Airflow's schedule= accepts. Inherited from Paul Vixie's 1987 cron rewrite; read the optional history chapter for the story.
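As a reading aid for the field order, the following sketch splits a cron string into named fields. It only labels the five positions and does not validate or evaluate them; split_cron and the field names are hypothetical helpers, not part of Airflow.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression):
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


fields = split_cron("0 6 * * *")
print(fields)
```

Read this way, "0 6 * * *" is minute 0 of hour 6 on every day of every month, that is, daily at 06:00.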
Convenience aliases like "@daily", "@hourly", "@monthly" that expand to standard cron strings. Use them for clarity unless the cadence is non-standard.
The date label (ds) for the data interval being processed. Written {{ ds }} in templates. Stable across retries and manual reruns: this is what makes backfills idempotent. Detailed in Parameterized Runs and Backfills.
The [start, end) window of data a run is responsible for. On a daily schedule, the run that covers 2024-03-10 starts only after 2024-03-10 is complete.
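The half-open window can be worked out by hand. The sketch below assumes a daily schedule aligned to midnight UTC, which is a simplification; daily_interval is a hypothetical helper, and Airflow computes intervals through its timetables rather than with this code.

```python
from datetime import date, datetime, timedelta, timezone


def daily_interval(ds):
    """Return the [start, end) window for one day, assuming UTC midnights."""
    start = datetime.combine(
        date.fromisoformat(ds), datetime.min.time(), tzinfo=timezone.utc
    )
    end = start + timedelta(days=1)
    return start, end


start, end = daily_interval("2024-03-10")

# Half-open: the start belongs to the interval, the end does not.
last_second = datetime(2024, 3, 10, 23, 59, 59, tzinfo=timezone.utc)
in_window = start <= last_second < end
end_in_window = start <= end < end
print(start.isoformat(), end.isoformat(), in_window, end_in_window)
```

The run for this interval cannot start before end, which is why the data for 2024-03-10 is processed on 2024-03-11 at the earliest.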
When set to True, Airflow creates historical runs from the DAG's start_date up to now. False means "only the next scheduled interval." Default is False in this course; flip to True only when you want automatic backfilling on un-pause.
Starting a DAG run from the UI or CLI outside its schedule. On Airflow 3, manually triggered runs may have logical_date=None, so code that reads ds should fall back to dag_run.run_after via the _ds_from_context() helper.
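The course's actual _ds_from_context() helper is not reproduced here, so the following is only a guess at its shape: use logical_date when present, otherwise fall back to dag_run.run_after. The context is modelled as a plain dict with stand-in objects so the sketch runs without Airflow installed.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def _ds_from_context(context):
    """Return a YYYY-MM-DD string, preferring logical_date over run_after.

    Hypothetical sketch: .get() covers both a missing key and an
    explicit None, since a manual run may carry no logical date.
    """
    logical_date = context.get("logical_date")
    if logical_date is None:
        logical_date = context["dag_run"].run_after
    return logical_date.strftime("%Y-%m-%d")


# Stand-ins for the task context of a scheduled run and a manual run.
scheduled = {
    "logical_date": datetime(2024, 3, 10, tzinfo=timezone.utc),
    "dag_run": SimpleNamespace(run_after=datetime(2024, 3, 11, tzinfo=timezone.utc)),
}
manual = {
    "logical_date": None,
    "dag_run": SimpleNamespace(
        run_after=datetime(2024, 3, 12, 9, 30, tzinfo=timezone.utc)
    ),
}

scheduled_ds = _ds_from_context(scheduled)
manual_ds = _ds_from_context(manual)
print(scheduled_ds, manual_ds)
```

Inside a real task the same function would receive the context Airflow passes in, so the task body never has to branch on how the run was started.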
A task type that waits for an external condition (a file, a time, an API response) before letting the DAG continue. Use poke_interval and mode="reschedule" so sensors do not hold worker slots.
Airflow 3's pattern of triggering one DAG when another DAG updates an Asset (called a Dataset in Airflow 2.x). Sits next to cron schedules and sensors as a third kind of trigger. Not used this week; see Going Further for context.
An Airflow object that stores a host, credentials, and metadata under a named ID (for example azure_pg). Tasks reference connections by ID rather than reading environment variables directly. Detailed in Monitoring and Debugging.
The Python client that uses a Connection to talk to a system. PostgresHook("azure_pg") is the one you use most this week.
Short for cross-communication: Airflow's built-in mechanism for passing small values (row counts, IDs, short strings) between tasks via the metadata DB. Good for int/str; bad for DataFrames, and bad for local filesystem paths, which another worker may not be able to read.
The per-student schema-isolation pattern. Your DAG's STUDENT constant reads this env var (or falls back to the DAG's subdirectory name on the shared VM) and writes to an airflow_<your-name> schema in Azure Postgres so you do not clobber classmates' data. Used again in Deploying to Shared Airflow.
The convention of writing to airflow_<your-name> instead of a shared analytics schema during development. Pairs with AIRFLOW_STUDENT.
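The lookup described in the two entries above could be sketched as follows. The function names, the example path, and the exact fallback logic are assumptions for illustration; only the AIRFLOW_STUDENT variable and the airflow_<your-name> naming come from the course material.

```python
import os
from pathlib import Path


def resolve_student(dag_file, environ=os.environ):
    """Prefer AIRFLOW_STUDENT; otherwise use the DAG file's parent folder name."""
    student = environ.get("AIRFLOW_STUDENT")
    if not student:
        student = Path(dag_file).parent.name
    return student


def schema_for(student):
    """Build the per-student schema name."""
    return f"airflow_{student}"


# Hypothetical path on the shared VM, with no env var set.
from_folder = resolve_student("/opt/airflow/dags/alice/pipeline.py", environ={})
# Local development, where the env var takes precedence.
from_env = resolve_student(
    "/opt/airflow/dags/alice/pipeline.py", environ={"AIRFLOW_STUDENT": "bob"}
)
print(schema_for(from_folder), schema_for(from_env))
```

In a DAG file the first argument would typically be __file__, so the same code works unchanged locally and on the shared VM.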
Running historical dates after the fact, typically with astro dev run backfill create. Airflow 3 uses deterministic backfill__<logical-date> run IDs so repeat backfills are idempotent.