Assignment: Build an Orchestrated Data Pipeline
You already built ingestion and transformation logic in earlier weeks. Your next step is to operate that logic like a real data platform: automatic runs, clear dependencies, historical backfills, and failure visibility.
Your task is to build a production-style Airflow DAG that orchestrates a multi-step pipeline end-to-end.
Choose your level at the start. You can always move up later.
- Minimum tier: one DAG with a three-task chain (ingest, dbt_run and dbt_test), a schedule, retries, one manual rerun, and one successful run on your local Astro stack, plus the DAG integrity test from Testing DAGs (it is 20 lines, catches the "visible in UI without import errors" check for free, and deserves to be in every Week 11 project).
- Target tier: everything in Minimum, plus {{ ds }}-parameterized runs, a 7-day backfill, a clear runbook, and one successful deploy to the shared class Airflow (ingest_taxi_month green; dbt_run / dbt_test orange-by-design on the stock VM is acceptable, see Ch8).

<aside> 💡 Start with Minimum, then upgrade to Target. This reduces overwhelm and improves completion quality.
</aside>
The Minimum tier stays completable even if the shared class Airflow is offline or your cohort has not set it up. The shared-deploy step is mandatory at Target and above; if your cohort does not have a shared VM running, ask your teacher to confirm Target tier is achievable, or aim for Minimum.
Build one DAG that includes:

- A DAG file placed in dags/.
- schedule="0 6 * * *" (or another justified daily time)
- An explicit start_date
- catchup=False for normal operation

At Target tier and above, also:

- {{ ds }}-parameterized execution so reruns are idempotent.

Create at least three tasks in a strict dependency chain:

- ingest: produce or fetch the raw input your DAG will process
- dbt_run: run dbt run --project-dir <path> --profiles-dir <path> against your Week 10 project using BashOperator
- dbt_test: run dbt test with the same flags

The dbt integration is required; it is the realistic use case for Week 11 and the mechanics were covered in Sequential Pipelines. Mount your Week 10 project under include/dbt_project/ (Astro's convention) and use --project-dir / --profiles-dir flags rather than cd.
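To make the expected shape concrete, here is a minimal sketch of such a DAG. The dag_id, the student tag, and the include/dbt_project path are placeholders, and the imports are the classic Airflow 2 paths (on Airflow 3 images, BashOperator and PythonOperator live under the standard provider); adapt all of them to your stack.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Astro mounts include/ into the scheduler and worker containers at this path.
DBT = (
    "--project-dir /usr/local/airflow/include/dbt_project "
    "--profiles-dir /usr/local/airflow/include/dbt_project"
)


def _ingest(ds: str, **_) -> None:
    # Placeholder: fetch or produce the partition for this run's logical date.
    print(f"ingesting partition for {ds}")


with DAG(
    dag_id="week_11_pipeline",          # placeholder dag_id
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    tags=["student:yourname"],          # placeholder student tag
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=_ingest)
    dbt_run = BashOperator(task_id="dbt_run", bash_command=f"dbt run {DBT}")
    dbt_test = BashOperator(task_id="dbt_test", bash_command=f"dbt test {DBT}")

    ingest >> dbt_run >> dbt_test
```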
If your Week 10 project is not runnable, the class's shared Airflow repo ships a working dbt project at lassebenni/class-airflow-reference under include/dbt_project/. Copy that directory into your assignment project:
```bash
git clone https://github.com/lassebenni/class-airflow-reference /tmp/class-ref
cp -r /tmp/class-ref/include/dbt_project include/dbt_project
```
Document which project you used (your Week 10 project or the class reference) in ASSIGNMENT_REPORT.md.
Your DAG should process one month (or day, depending on your cadence choice) of data per run, with the partition derived from Airflow's logical date rather than datetime.now(). If you followed the Parameterized Runs and Backfills snapshot, you already have this: _ds_from_context() returns the current run's date, and the ingest task slices it down to a year-month for the DELETE and the parquet URL.
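For orientation, a hedged sketch of that pattern follows. The helper mirrors the snapshot's idea, but the URL template, table name, and date column are illustrative placeholders rather than the snapshot's exact code.

```python
# Hedged sketch of logical-date-driven partitioning, in the spirit of the
# _ds_from_context() helper from the Parameterized Runs and Backfills snapshot.
def _ds_from_context(context: dict) -> str:
    # "ds" is the run's logical date as YYYY-MM-DD -- never datetime.now().
    return context["ds"]


def build_partition(context: dict, dataset: str = "green") -> tuple[str, str]:
    year_month = _ds_from_context(context)[:7]          # e.g. "2024-01"
    url = (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"{dataset}_tripdata_{year_month}.parquet"       # illustrative TLC URL pattern
    )
    # Delete-then-insert for this month keeps reruns and backfills idempotent.
    delete_sql = (
        f"DELETE FROM {dataset}_tripdata "               # placeholder table
        f"WHERE to_char(pickup_datetime, 'YYYY-MM') = '{year_month}'"  # placeholder column
    )
    return url, delete_sql
```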
Further requirements:

- Use {{ ds }} (via the _ds_from_context() helper, or as a templated task argument for scheduled runs) in at least one task and use it to pick the partition.
- Add a params entry (for example {"dataset": "green"} to pick between green_tripdata and yellow_tripdata) and use it in your ingest task (see the sketch after this list).
- Configure retries (retries, retry_delay) for at least one critical task.
- Run a backfill. Airflow 3 replaces airflow dags backfill with backfill create. The date range you pass should be large enough to fire 7 runs for your schedule:
  - Daily (0 6 * * *): 7 days → --from-date 2024-01-01 --to-date 2024-01-07
  - Monthly (@monthly, matching the Ch5 snapshot): 7 months → --from-date 2024-01-01 --to-date 2024-07-31

```bash
astro dev run backfill create \
  --dag-id <your_dag_id> \
  --from-date 2024-01-01 \
  --to-date 2024-01-07 \
  --max-active-runs 1
```
--max-active-runs 1 is required if your DAG invokes dbt (concurrent dbt_run tasks collide on __dbt_backup relations; see Gotcha #6).

- Prove idempotency: after the backfill, query your airflow_<yourname> schema, record row counts, re-run the same backfill, and confirm the counts are identical.
- Create RUNBOOK.md with the operational detail another student would need to run, rerun, and backfill your DAG.
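As referenced above, here is a sketch of how the params entry and the templated logical date can meet inside the ingest callable. The names are placeholders, and the exact wiring depends on whether you use PythonOperator or the TaskFlow API.

```python
# Hedged sketch: reading a DAG-level params entry together with the templated
# logical date inside the ingest callable. All names are placeholders.
def _ingest(ds: str, params: dict, **_) -> None:
    dataset = params.get("dataset", "green")   # set params={"dataset": "green"} on the DAG
    year_month = ds[:7]                        # "2024-01" for a 2024-01-xx run
    print(f"ingesting {dataset}_tripdata partition {year_month}")
    # ...delete the existing partition, download the TLC parquet, load it...
```

Declaring params={"dataset": "green"} on the DAG constructor gives every run a default that can be overridden when you trigger a run manually.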
Once your DAG runs green locally, finish by deploying it to the class's shared Airflow (see Deploying to Shared Airflow for the full workflow). Minimum-tier students can skip this task; Target-tier students must complete it to score the 10 shared-deploy rubric points.
- Copy your DAG into dags/<yourname>/ of the class's shared Airflow repo.
- Rename the dag_id to <yourname>_<dag_name> and add "student:<yourname>" to tags.
- Trigger one manual run with a past logical date (for example 2024-01-01) so the TLC parquet actually exists. Confirm it writes to your airflow_<yourname> schema on the shared Azure PG.
- Filter the shared UI by your student: tag to confirm your DAG and its run show up.

<aside>
⚠️ Airflow 3's backfill create uses deterministic run-ids like backfill__2024-01-01T00:00:00+00:00. If a classmate already triggered a backfill for the same DAG + logical date, your second invocation is a no-op (no duplicate run is created). For the assignment, one manual trigger is enough; backfill on the shared VM is not required.
</aside>
This is what submitting real production work looks like: your code runs next to other people's code on shared infrastructure, you own your own schema, and you do not touch classmates' DAGs.
All tiers submit:
- RUNBOOK.md
- ASSIGNMENT_REPORT.md containing, at minimum, which dbt project you used (your Week 10 project or the class reference; see above)

Target tier adds:

- ASSIGNMENT_REPORT.md also documents the {{ ds }} parameter usage and the backfill command(s) you ran
- Evidence that your DAG ran on the shared Airflow under your student:<yourname> tag (ingest_taxi_month green expected; dbt_run / dbt_test orange-by-design on the stock VM is acceptable per Ch8)

(tests/test_dag_integrity.py is required for all tiers, listed under Minimum above and in the Definition-of-done checklist below; not a Target-only addition.)
Before submission, confirm all items:
All tiers:
- The DAG loads without import errors (astro dev pytest tests/test_dag_integrity.py --args "-v" passes); a minimal version of that test is sketched at the end of this page
- retries + retry_delay configured in default_args or on at least one critical task
- RUNBOOK.md is specific enough that another student can operate your DAG

Target tier and above:
- At least one task uses {{ ds }} (or the _ds_from_context() helper) to drive the partition
- Deployed dag_id prefixed with <yourname>_ and your schema-isolated writes landing in airflow_<yourname>
- Evidence from the shared Airflow UI, filtered by your student:<yourname> tag, showing the green run

Expected project layout:

```
week11-orchestration-assignment/
├── dags/
│ └── week_11_pipeline.py
├── include/
│ ├── dbt_project/ # your Week 10 project (or the reference repo)
│ └── pipeline_config.yaml
├── tests/
│ └── test_dag_integrity.py # required for ALL tiers, see Ch6
├── RUNBOOK.md
├── ASSIGNMENT_REPORT.md
└── requirements.txt
```
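For reference, the DAG integrity test can stay as small as the sketch below. This is a generic DagBag import check in the spirit of the Testing DAGs chapter, not necessarily its exact code; the dag_folder path and the asserted dag_id are placeholders.

```python
# tests/test_dag_integrity.py -- minimal sketch of a DagBag import check.
# Run it with: astro dev pytest tests/test_dag_integrity.py --args "-v"
from airflow.models import DagBag


def test_no_import_errors():
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"


def test_pipeline_dag_is_registered():
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert "week_11_pipeline" in dag_bag.dags  # placeholder dag_id
```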