Week 11 - Orchestration

Introduction to Orchestration

Airflow Fundamentals

Scheduling and Triggers

Sequential Pipeline Steps

Parameterized Runs and Backfills

Testing DAGs

Monitoring and Debugging

Deploying to Shared Airflow

Practice

Gotchas & Pitfalls

Assignment: Build an Orchestrated Data Pipeline

Week 11 Lesson Plan (Teachers)

🎒 Assignment: Build an Orchestrated Data Pipeline

Scenario

You already built ingestion and transformation logic in earlier weeks. Your next step is to operate that logic like a real data platform: automatic runs, clear dependencies, historical backfills, and failure visibility.

Your task is to build a production-style Airflow DAG that orchestrates a multi-step pipeline end-to-end.

Difficulty levels

Choose your level at the start. You can always move up later.

<aside> 💡 Start with Minimum, then upgrade to Target. This reduces overwhelm and improves completion quality.

</aside>

The Minimum tier stays completable even if the shared class Airflow is offline or your cohort has not set it up. The shared-deploy step is mandatory at Target and above; if your cohort does not have a shared VM running, ask your teacher to confirm Target tier is achievable, or aim for Minimum.

Project scope

Build one DAG that includes:

  1. A schedule (daily is the simplest; monthly matches the Ch5 snapshot).
  2. At least three sequential tasks.
  3. Retry behavior for transient failure.

At Target tier and above, also:

  1. {{ ds }}-parameterized execution so reruns are idempotent.
  2. A 7-day (or 7-month, depending on your cadence) historical backfill.
  3. One green run deployed to the shared class Airflow.
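
The sketch below pulls these scope items together into one file, assuming a daily cadence and the TaskFlow API; the dag id, dates, and task bodies are placeholders to adapt, not the graded solution.

# dags/week_11_pipeline.py -- minimal scope sketch (placeholders throughout).
from datetime import datetime

from airflow.decorators import dag, task  # classic path; Airflow 3 also accepts airflow.sdk


@dag(
    dag_id="week_11_pipeline",
    schedule="@daily",                # or "@monthly" to match the Ch5 snapshot
    start_date=datetime(2024, 1, 1),
    catchup=False,                    # backfills are triggered explicitly (Task 5)
    default_args={"retries": 2},      # transient-failure retries, detailed in Task 4
)
def week_11_pipeline():
    @task
    def ingest():
        """Produce or fetch the raw input for this run."""

    @task
    def dbt_run():
        """Replaced by a BashOperator dbt run in Task 2."""

    @task
    def dbt_test():
        """Replaced by a BashOperator dbt test in Task 2."""

    # Strict dependency chain: ingest -> dbt_run -> dbt_test
    ingest() >> dbt_run() >> dbt_test()


week_11_pipeline()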

Requirements

Task 1: DAG setup

Task 2: Sequential workflow with the Week 10 dbt project

Create at least three tasks in a strict dependency chain:

  1. ingest: produce or fetch the raw input your DAG will process
  2. dbt_run: run dbt run --project-dir <path> --profiles-dir <path> against your Week 10 project using BashOperator
  3. dbt_test: run dbt test with the same flags

The dbt integration is required; it is the realistic use case for Week 11 and the mechanics were covered in Sequential Pipelines. Mount your Week 10 project under include/dbt_project/ (Astro's convention) and use --project-dir / --profiles-dir flags rather than cd.
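
A hedged sketch of the two dbt tasks follows. It assumes an Astro project mounted at /usr/local/airflow with profiles.yml sitting inside the dbt project directory; swap in your own paths and profile location. These operators replace the dbt_run / dbt_test stubs from the earlier sketch.

# Airflow 3 import path (standard provider); on Airflow 2.x use:
#   from airflow.operators.bash import BashOperator
from airflow.providers.standard.operators.bash import BashOperator

DBT_DIR = "/usr/local/airflow/include/dbt_project"  # assumed Astro mount point

dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
)

dbt_test = BashOperator(
    task_id="dbt_test",
    bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
)

# Chain after your ingest task inside the DAG body, e.g.:
#   ingest_task >> dbt_run >> dbt_test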

If your Week 10 project is not runnable, the class's shared Airflow repo ships a working dbt project at lassebenni/class-airflow-reference under include/dbt_project/. Copy that directory into your assignment project:

git clone https://github.com/lassebenni/class-airflow-reference /tmp/class-ref
cp -r /tmp/class-ref/include/dbt_project include/dbt_project

Document which project you used (your Week 10 project or the class reference) in ASSIGNMENT_REPORT.md.

Task 3: Parameterized runs

Your DAG should process one month (or day, depending on your cadence choice) of data per run, with the partition derived from Airflow's logical date rather than datetime.now(). If you followed the Parameterized Runs and Backfills snapshot, you already have this: _ds_from_context() returns the current run's date, and the ingest task slices it down to a year-month for the DELETE and the parquet URL.
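
If you are rebuilding that helper from scratch, one minimal way to get the run's logical date with the TaskFlow API is to declare ds in the task signature and let Airflow inject it. The year-month slice and the DELETE/URL wiring below are placeholders, not the snapshot's exact code.

from airflow.decorators import task


@task
def ingest(ds=None):
    """Process exactly one partition, derived from the run's logical date."""
    # Airflow injects ds ("YYYY-MM-DD") because it is named in the signature;
    # never use datetime.now() here, or every backfilled run processes "today".
    year_month = ds[:7]  # e.g. "2024-01" for a monthly cadence
    # ...DELETE existing rows for year_month, then load the matching parquet file...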

Task 4: Retry and failure handling
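
The brief leaves the exact values to you; as a hedged sketch, the usual retry knobs are all standard operator arguments (the values below are illustrative):

from datetime import timedelta

default_args = {
    "retries": 3,                              # rerun a failed task up to 3 times
    "retry_delay": timedelta(minutes=2),       # wait before the first retry
    "retry_exponential_backoff": True,         # then roughly 4 min, 8 min, ...
    "max_retry_delay": timedelta(minutes=30),  # cap the backoff
}
# Pass default_args=default_args to the DAG so every task inherits this behavior.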

Task 5: Backfill

  astro dev run backfill create \
    --dag-id <your_dag_id> \
    --from-date 2024-01-01 \
    --to-date 2024-01-07 \
    --max-active-runs 1

Task 6: Operational notes

Create RUNBOOK.md with:

Task 7: Deploy to the shared Airflow (Target tier and above)

Once your DAG runs green locally, finish by deploying it to the class's shared Airflow (see Deploying to Shared Airflow for the full workflow). Minimum-tier students can skip this task; Target-tier students must complete it to score the 10 shared-deploy rubric points.

<aside> ⚠️ Airflow 3's backfill create uses deterministic run-ids like backfill__2024-01-01T00:00:00+00:00. If a classmate already triggered a backfill for the same DAG + logical date, your second invocation is a no-op (no duplicate run is created). For the assignment, one manual trigger is enough; backfill on the shared VM is not required.

</aside>

This is what submitting real production work looks like: your code runs next to other people's code on shared infrastructure, you own your own schema, and you do not touch classmates' DAGs.

Deliverables

All tiers submit:

Target tier adds:

(tests/test_dag_integrity.py is required at every tier; it appears under Minimum above and in the Definition of done checklist below, not as a Target-only addition.)

Definition of done checklist

Before submission, confirm all items:

All tiers:

Target tier and above:

Suggested project structure

week11-orchestration-assignment/
├── dags/
│   └── week_11_pipeline.py
├── include/
│   ├── dbt_project/          # your Week 10 project (or the reference repo)
│   └── pipeline_config.yaml
├── tests/
│   └── test_dag_integrity.py # required for ALL tiers, see Ch6
├── RUNBOOK.md
├── ASSIGNMENT_REPORT.md
└── requirements.txt
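
For reference, a minimal version of the required integrity test might look like the sketch below; Ch6 has the canonical assertions, and the dag_id is assumed to match dags/week_11_pipeline.py.

# tests/test_dag_integrity.py -- minimal sketch of the import check.
from airflow.models import DagBag


def test_dag_integrity():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    # Any import error (syntax error, missing dependency) fails the suite.
    assert dag_bag.import_errors == {}
    # The assignment DAG loads and has at least the three required tasks.
    dag = dag_bag.get_dag("week_11_pipeline")
    assert dag is not None
    assert len(dag.tasks) >= 3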