Assignment: Build an Orchestrated Data Pipeline
You already built ingestion and transformation logic in earlier weeks. Your next step is to operate that logic like a real data platform: automatic runs, clear dependencies, historical backfills, and failure visibility.
Your task is to build a production-style Airflow DAG that orchestrates a multi-step pipeline end-to-end.
Choose your level at the start. You can always move up later.
- Minimum tier: one DAG with a three-task chain (ingest, dbt_run and dbt_test), a schedule, retries, one manual rerun, and one successful run on your local Astro stack, plus the DAG integrity test from Testing DAGs (it is 20 lines, catches the "visible in UI without import errors" check for free, and deserves to be in every Week 11 project).
- Target tier: everything in Minimum, plus {{ ds }}-parameterized runs, a 7-day backfill, a clear runbook, and one successful deploy to the shared class Airflow (ingest_taxi_month green; dbt_run / dbt_test orange-by-design on the stock VM is acceptable, see Ch8).

<aside> 💡 Start with Minimum, then upgrade to Target. This reduces overwhelm and improves completion quality.
</aside>
The Minimum tier stays completable even if the shared class Airflow is offline or your cohort has not set it up. The shared-deploy step is mandatory at Target and above; if your cohort does not have a shared VM running, ask your teacher to confirm Target tier is achievable, or aim for Minimum.
Build one DAG that includes:

- A DAG file placed in dags/.
- schedule="0 6 * * *" (or another justified daily time)
- An explicit start_date
- catchup=False for normal operation

At Target tier and above, also:

- {{ ds }}-parameterized execution so reruns are idempotent.

Create at least three tasks in a strict dependency chain:

- ingest: produce or fetch the raw input your DAG will process
- dbt_run: run dbt run --project-dir <path> --profiles-dir <path> against your Week 10 project using BashOperator
- dbt_test: run dbt test with the same flags

The dbt integration is required; it is the realistic use case for Week 11 and the mechanics were covered in Sequential Pipelines. Mount your Week 10 project under include/dbt_project/ (Astro's convention) and use --project-dir / --profiles-dir flags rather than cd.
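To make the expected shape concrete, here is a minimal sketch of such a DAG. The dag_id, the student tag, and the include/dbt_project path are placeholders, and the imports are the classic Airflow 2 paths (on Airflow 3 images, BashOperator and PythonOperator live under the standard provider); adapt all of them to your stack.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Astro mounts include/ into the scheduler and worker containers at this path.
DBT = (
    "--project-dir /usr/local/airflow/include/dbt_project "
    "--profiles-dir /usr/local/airflow/include/dbt_project"
)


def _ingest(ds: str, **_) -> None:
    # Placeholder: fetch or produce the partition for this run's logical date.
    print(f"ingesting partition for {ds}")


with DAG(
    dag_id="week_11_pipeline",          # placeholder dag_id
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    tags=["student:yourname"],          # placeholder student tag
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=_ingest)
    dbt_run = BashOperator(task_id="dbt_run", bash_command=f"dbt run {DBT}")
    dbt_test = BashOperator(task_id="dbt_test", bash_command=f"dbt test {DBT}")

    ingest >> dbt_run >> dbt_test
```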
If your Week 10 project is not runnable, the class's shared Airflow repo ships a working dbt project at lassebenni/class-airflow-reference under include/dbt_project/. Copy that directory into your assignment project:
```bash
git clone https://github.com/lassebenni/class-airflow-reference /tmp/class-ref
cp -r /tmp/class-ref/include/dbt_project include/dbt_project
```
Document which project you used (your Week 10 project or the class reference) in ASSIGNMENT_REPORT.md.
Your DAG should process one month (or day, depending on your cadence choice) of data per run, with the partition derived from Airflow's logical date rather than datetime.now(). If you followed the Parameterized Runs and Backfills snapshot, you already have this: _ds_from_context() returns the current run's date, and the ingest task slices it down to a year-month for the DELETE and the parquet URL.
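For orientation, a hedged sketch of that pattern follows. The helper mirrors the snapshot's idea, but the URL template, table name, and date column are illustrative placeholders rather than the snapshot's exact code.

```python
# Hedged sketch of logical-date-driven partitioning, in the spirit of the
# _ds_from_context() helper from the Parameterized Runs and Backfills snapshot.
def _ds_from_context(context: dict) -> str:
    # "ds" is the run's logical date as YYYY-MM-DD -- never datetime.now().
    return context["ds"]


def build_partition(context: dict, dataset: str = "green") -> tuple[str, str]:
    year_month = _ds_from_context(context)[:7]          # e.g. "2024-01"
    url = (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"{dataset}_tripdata_{year_month}.parquet"       # illustrative TLC URL pattern
    )
    # Delete-then-insert for this month keeps reruns and backfills idempotent.
    delete_sql = (
        f"DELETE FROM {dataset}_tripdata "               # placeholder table
        f"WHERE to_char(pickup_datetime, 'YYYY-MM') = '{year_month}'"  # placeholder column
    )
    return url, delete_sql
```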
Further requirements:

- Use {{ ds }} (via the _ds_from_context() helper, or as a templated task argument for scheduled runs) in at least one task and use it to pick the partition.
- Add a params entry (for example {"dataset": "green"} to pick between green_tripdata and yellow_tripdata) and use it in your ingest task (see the sketch after this list).
- Configure retries (retries, retry_delay) for at least one critical task.
- Run a backfill. Airflow 3 replaces airflow dags backfill with backfill create. The date range you pass should be large enough to fire 7 runs for your schedule:
  - Daily (0 6 * * *): 7 days → --from-date 2024-01-01 --to-date 2024-01-07
  - Monthly (@monthly, matching the Ch5 snapshot): 7 months → --from-date 2024-01-01 --to-date 2024-07-31

```bash
astro dev run backfill create \
  --dag-id <your_dag_id> \
  --from-date 2024-01-01 \
  --to-date 2024-01-07 \
  --max-active-runs 1
```
--max-active-runs 1 is required if your DAG invokes dbt (concurrent dbt_run tasks collide on __dbt_backup relations; see Gotcha #6).

- Prove idempotency: after the backfill, query your airflow_<yourname> schema, record row counts, re-run the same backfill, and confirm the counts are identical.
- Create RUNBOOK.md with the operational detail another student would need to run, rerun, and backfill your DAG.
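As referenced above, here is a sketch of how the params entry and the templated logical date can meet inside the ingest callable. The names are placeholders, and the exact wiring depends on whether you use PythonOperator or the TaskFlow API.

```python
# Hedged sketch: reading a DAG-level params entry together with the templated
# logical date inside the ingest callable. All names are placeholders.
def _ingest(ds: str, params: dict, **_) -> None:
    dataset = params.get("dataset", "green")   # set params={"dataset": "green"} on the DAG
    year_month = ds[:7]                        # "2024-01" for a 2024-01-xx run
    print(f"ingesting {dataset}_tripdata partition {year_month}")
    # ...delete the existing partition, download the TLC parquet, load it...
```

Declaring params={"dataset": "green"} on the DAG constructor gives every run a default that can be overridden when you trigger a run manually.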
Once your DAG runs green locally, finish by deploying it to the class's shared Airflow (see Deploying to Shared Airflow for the full workflow). Minimum-tier students can skip this task; Target-tier students must complete it to score the 10 shared-deploy rubric points.
- Copy your DAG into dags/<yourname>/ of the class's shared Airflow repo.
- Rename the dag_id to <yourname>_<dag_name> and add "student:<yourname>" to tags.
- Trigger one manual run with a past logical date (for example 2024-01-01) so the TLC parquet actually exists. Confirm it writes to your airflow_<yourname> schema on the shared Azure PG.
- Filter the shared UI by your student: tag to confirm your DAG and its run show up.

<aside>
⚠️ Airflow 3's backfill create uses deterministic run-ids like backfill__2024-01-01T00:00:00+00:00. If a classmate already triggered a backfill for the same DAG + logical date, your second invocation is a no-op (no duplicate run is created). For the assignment, one manual trigger is enough; backfill on the shared VM is not required.
</aside>
This is what submitting real production work looks like: your code runs next to other people's code on shared infrastructure, you own your own schema, and you do not touch classmates' DAGs.
All tiers submit:
- RUNBOOK.md
- ASSIGNMENT_REPORT.md containing, at minimum, which dbt project you used (your Week 10 project or the class reference; see above)

Target tier adds:

- ASSIGNMENT_REPORT.md also documents the {{ ds }} parameter usage and the backfill command(s) you ran
- Evidence that your DAG ran on the shared Airflow under your student:<yourname> tag (ingest_taxi_month green expected; dbt_run / dbt_test orange-by-design on the stock VM is acceptable per Ch8)

(tests/test_dag_integrity.py is required for all tiers, listed under Minimum above and in the Definition-of-done checklist below; not a Target-only addition.)
Before submission, confirm all items:
All tiers:
- The DAG loads without import errors (astro dev pytest tests/test_dag_integrity.py --args "-v" passes); a minimal version of that test is sketched at the end of this page
- retries + retry_delay configured in default_args or on at least one critical task
- RUNBOOK.md is specific enough that another student can operate your DAG

Target tier and above:
- At least one task uses {{ ds }} (or the _ds_from_context() helper) to drive the partition
- Deployed dag_id prefixed with <yourname>_ and your schema-isolated writes landing in airflow_<yourname>
- Evidence from the shared Airflow UI, filtered by your student:<yourname> tag, showing the green run

Expected project layout:

```
week11-orchestration-assignment/
├── dags/
│ └── week_11_pipeline.py
├── include/
│ ├── dbt_project/ # your Week 10 project (or the reference repo)
│ └── pipeline_config.yaml
├── tests/
│ └── test_dag_integrity.py # required for ALL tiers, see Ch6
├── RUNBOOK.md
├── ASSIGNMENT_REPORT.md
└── requirements.txt
```
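For reference, the DAG integrity test can stay as small as the sketch below. This is a generic DagBag import check in the spirit of the Testing DAGs chapter, not necessarily its exact code; the dag_folder path and the asserted dag_id are placeholders.

```python
# tests/test_dag_integrity.py -- minimal sketch of a DagBag import check.
# Run it with: astro dev pytest tests/test_dag_integrity.py --args "-v"
from airflow.models import DagBag


def test_no_import_errors():
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_pipeline_dag_is_registered():
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert "week_11_pipeline" in dag_bag.dags  # placeholder dag_id
```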