Parameterized Runs and Backfills
Assignment: Build an Orchestrated Data Pipeline
Week 11 Lesson Plan (Teachers)
These exercises help you turn Week 11 concepts into working Airflow behavior. Complete them in order. Each exercise builds toward the assignment.
Every exercise reuses the taxi_pipeline you built across Sequential Pipelines, Parameterized Runs and Backfills, Testing DAGs, and Monitoring and Debugging. You are not building a new DAG here; you are exercising the one you already have. If yours is broken, diff against assets/dag_snapshots/taxi_pipeline.py and start from the reference.
<aside> 📝 Practice is optional, but strongly recommended before starting the assignment.
</aside>
Before you run any exercise, set AIRFLOW_STUDENT=yourname in your Astro project's .env and restart the stack (astro dev restart). Without it, your DAG writes to airflow_default on the shared Azure Postgres and collides with every other student who also forgot. Sequential Pipelines covers the why; Practice just needs you to verify the env var is live before Exercise 1.
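For reference, here is a minimal sketch of how a DAG can turn that variable into its target schema. The helper name and the airflow_default fallback are assumptions based on the collision warning above; the reference taxi_pipeline may wire this up differently.

```python
import os

def target_schema() -> str:
    """Resolve the per-student Postgres schema from AIRFLOW_STUDENT.

    Falls back to the shared airflow_default schema when the variable is unset,
    which is exactly the collision described above.
    """
    student = os.environ.get("AIRFLOW_STUDENT")
    return f"airflow_{student}" if student else "airflow_default"
```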
<aside> 💭 If you get blocked for more than 20 minutes, write what you tried, ask for help, and continue with the next exercise. This is professional behavior, not failure.
</aside>
Exercise 1
Concepts: Astro CLI, DAG discovery, manual trigger.
Instructions:
1. Start the local stack with astro dev start.
2. Open the Airflow UI (use the http://<project-dirname>.localhost:<random-port> address, not http://localhost:8080: see Airflow Fundamentals) and confirm taxi_pipeline appears in the DAG list.
3. Trigger a manual run for logical date 2024-01-01. Airflow 3's plain Trigger button leaves logical_date unset (see Airflow macros and runtime context), so use one of these instead: in the UI, click the dropdown next to Trigger → Trigger DAG w/ config and paste {"logical_date": "2024-01-01T00:00:00+00:00"}; or from the terminal, run astro dev run dags trigger taxi_pipeline -e 2024-01-01.
4. Confirm ingest_taxi_month, dbt_run, and dbt_test are all green. Inspect the dbt_test log and verify the Done. PASS=9 WARN=2 summary from Week 10.

Success criteria: One green run for 2024-01-01 in your airflow_<yourname> schema, with raw_trips populated and fct_trips rebuilt.
Exercise 2
Concepts: cron schedule, retry configuration.
Instructions:
1. taxi_pipeline uses schedule="@monthly" and default_args={"retries": 2}. Override these on your own DAG so the scheduler fires at 06:00 UTC on the first of each month, not midnight: change schedule to the cron expression "0 6 1 * *".
2. Add retry_delay=timedelta(minutes=2) next to retries in default_args (a sketch of the resulting DAG header follows the success criteria).
3. Reserialize the DAG (astro dev run dags reserialize) and open the DAG's Details tab in the UI to confirm both settings are applied.

Success criteria: DAG metadata shows schedule 0 6 1 * *, retries=2, retry_delay=0:02:00.
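A minimal sketch of the updated DAG header, assuming the TaskFlow-style @dag decorator; start_date and catchup here are placeholders, so keep whatever your existing taxi_pipeline already declares.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag

default_args = {
    "retries": 2,                          # two retries, as in the reference DAG
    "retry_delay": timedelta(minutes=2),   # wait 2 minutes between attempts
}

@dag(
    dag_id="taxi_pipeline",
    schedule="0 6 1 * *",                  # 06:00 UTC on the first of every month
    start_date=datetime(2024, 1, 1),       # placeholder: keep your DAG's real start_date
    catchup=False,                         # placeholder: keep your DAG's existing setting
    default_args=default_args,
)
def taxi_pipeline():
    ...

taxi_pipeline()
```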
Exercise 3
Concepts: task dependencies, TaskFlow return values, >> chaining.
Instructions:
1. Add a check_row_count task between ingest_taxi_month and dbt_run. It takes the row count returned by ingest_taxi_month (an int, not a file path: XCom handles ints cleanly) and raises ValueError if the count is below 1000.
2. Wire the dependencies so the chain reads ingest_taxi_month() >> check_row_count() >> dbt_run >> dbt_test. The TaskFlow form is check_row_count(ingest_taxi_month()) followed by >> dbt_run (see the sketch after the success criteria).

Success criteria: Four-node chain visible in Graph view; happy-path run is green; all dependencies fire in order.
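A sketch of the quality gate, assuming ingest_taxi_month is a TaskFlow task returning an int and dbt_run / dbt_test are the operator tasks already defined in your DAG body.

```python
from airflow.decorators import task

@task
def check_row_count(row_count: int) -> int:
    """Quality gate: fail fast if the month's ingest looks implausibly small."""
    if row_count < 1000:
        raise ValueError(f"Expected at least 1000 rows, got {row_count}")
    return row_count

# Inside the DAG body:
# checked = check_row_count(ingest_taxi_month())  # the returned int travels via XCom into the gate
# checked >> dbt_run >> dbt_test                  # dbt only runs once the gate has passed
```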
Exercise 4
Concepts: {{ ds }} templating, parameterized runs.
Instructions:
1. Trigger taxi_pipeline for 2024-01-01, then 2024-02-01, then 2024-03-01 (use the "Trigger DAG w/ config" option if your Airflow version requires an explicit logical date for manual triggers).
2. Open each run's ingest_taxi_month task and find the ds value in the task log (it will show up in the URL built by parquet_url_for(ds); a sketch of that helper follows below).
3. Query airflow_<yourname>.raw_trips grouped by to_char(lpep_pickup_datetime, 'YYYY-MM'). You should see three rows: 2024-01, 2024-02, 2024-03, each with the correct row count for that month's TLC parquet.

Success criteria: Three distinct ds values across three runs; three distinct year-month buckets in raw_trips.
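For orientation, a sketch of what parquet_url_for does with ds. The TLC_BASE value here is an assumption; keep the constant already defined in your DAG.

```python
# Assumption: the NYC TLC CloudFront bucket used in earlier weeks.
TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def parquet_url_for(ds: str) -> str:
    """Map Airflow's {{ ds }} (e.g. '2024-02-01') to that month's green-taxi parquet URL."""
    year_month = ds[:7]   # '2024-02-01' -> '2024-02'
    return f"{TLC_BASE}/green_tripdata_{year_month}.parquet"
```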
Exercise 5
Concepts: airflow backfill create, idempotent DELETE-then-append.
Instructions:
1. Run astro dev run backfill create --dag-id taxi_pipeline --from-date 2023-11-01 --to-date 2024-01-31 --max-active-runs 1. Wait for all three runs to go green.
2. Query raw_trips grouped by year-month: you should see Nov 2023, Dec 2023, Jan 2024.
3. Run the same backfill a second time and repeat the query: the counts should not change (the DELETE-then-append pattern that makes this safe is sketched below).

Success criteria: Three year-month buckets in raw_trips after the first backfill, identical counts after the second. DELETE-then-append keeps the pipeline idempotent.
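A sketch of the DELETE-then-append pattern that keeps the second backfill harmless. The function name, the SQLAlchemy plumbing, and the exact DELETE predicate are illustrative assumptions, not the reference implementation; parquet_url_for is the helper from Exercise 4.

```python
import io

import pandas as pd
import requests
from sqlalchemy import create_engine, text

def load_month(engine_url: str, schema: str, ds: str) -> int:
    """Idempotent monthly load: delete the target month, then append it, so reruns never duplicate rows."""
    year_month = ds[:7]
    resp = requests.get(parquet_url_for(ds), timeout=120)
    resp.raise_for_status()
    df = pd.read_parquet(io.BytesIO(resp.content))

    engine = create_engine(engine_url)
    with engine.begin() as conn:   # one transaction: the delete and the append succeed or fail together
        conn.execute(
            text(f"DELETE FROM {schema}.raw_trips "
                 "WHERE to_char(lpep_pickup_datetime, 'YYYY-MM') = :ym"),
            {"ym": year_month},
        )
        df.to_sql("raw_trips", conn, schema=schema, if_exists="append", index=False)
    return len(df)
```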
Exercise 6
Concepts: log investigation, raise_for_status, 403 vs transient failures.
Instructions:
1. Break parquet_url_for in your DAG: append _typo to the filename (as in Monitoring and Debugging): f"{TLC_BASE}/green_tripdata_{year_month}_typo.parquet".
2. Trigger a run and wait until ingest_taxi_month goes red.
3. Open the task log and locate the line HTTPError: 403 Client Error: Forbidden for url: ... (a sketch of the code path that produces this line follows below).
4. Write one entry in your RUNBOOK.md: symptom, check, fix.
5. Revert the typo and rerun the DAG to confirm it goes green again.

Success criteria: You read the 403 from the log without scrolling all over the UI, you wrote one runbook entry, and the revert-and-rerun round-trip completed cleanly.
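For context, a sketch of the kind of download code that produces that log line, plus the triage rule worth capturing in the runbook entry. fetch_parquet is a hypothetical name; your ingest task may structure this differently.

```python
import requests

def fetch_parquet(url: str) -> bytes:
    """Download a month's parquet, turning HTTP errors into loud task failures."""
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()   # a typo'd filename -> 403 Forbidden from the CDN -> HTTPError in the task log
    return resp.content

# Rough triage rule for the runbook:
#   403 Forbidden        -> the URL itself is wrong (typo, or month not published yet): fix the code, don't just retry
#   5xx / timeout errors -> transient: Airflow's retries (retries=2, retry_delay=2 min) are the right tool
```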
<aside> 🤓 Curious Geek: Blameless post-mortems
The runbook entry you just wrote is the seed of a post-mortem: a written account of what broke, when, why, and what changes prevent a repeat. Mature data teams write one for every customer-impacting incident, and they write them blamelessly: the document names systems and decisions, never people. The reasoning, popularised by Google's SRE practice, is that engineers who fear being named hide context, and hidden context is what causes the next incident. A weekly review of post-mortems is how a team's collective knowledge grows faster than any single member's experience.
</aside>
Exercise 7
Concepts: shared-infra workflow, deploy-on-push, multi-student etiquette (from Deploying to Shared Airflow).
Instructions:
1. Copy your DAG into your own folder (dags/<yourname>/) of the class's shared Airflow repo.
2. Rename the dag_id to <yourname>_<dag_name> and add "student:<yourname>" to the tags list (a sketch follows the success criteria).
3. Run astro dev pytest tests/test_dag_integrity.py --args "-v" locally before pushing.
4. Push your branch; deploy-on-push ships the change, then wait for the dag-processor to re-parse.
5. In the shared UI, filter by your student:<yourname> tag, unpause your DAG, and trigger one manual run. Verify ingest_taxi_month goes green; dbt_run and dbt_test are expected orange (Up For Retry) on the shared VM unless you inline-install dbt or the teacher has rebuilt the image: see Deploying to Shared Airflow for the rationale.
6. Find one classmate's DAG (via their student: tag). Note that you can read it but must not pause or trigger it. Write one sentence on what you observed about their DAG's shape.

Success criteria: Your DAG is parsed by the shared scheduler, ingest_taxi_month runs green and writes to your own airflow_<yourname> schema, the DAG is discoverable via tag filter, and you have not touched any classmate's DAG. (dbt_run / dbt_test orange-by-design on the stock shared VM is acceptable per Ch8.)
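One way to satisfy the shared-instance naming rules, sketched under the assumption that you derive your name from AIRFLOW_STUDENT; hardcoding it in your dags/<yourname>/ copy works just as well.

```python
import os
from datetime import datetime

from airflow.decorators import dag

STUDENT = os.environ["AIRFLOW_STUDENT"]    # e.g. "maria"

@dag(
    dag_id=f"{STUDENT}_taxi_pipeline",     # satisfies the <yourname>_<dag_name> rule
    schedule="0 6 1 * *",
    start_date=datetime(2024, 1, 1),       # placeholder: keep your DAG's real start_date
    catchup=False,
    tags=[f"student:{STUDENT}"],           # makes the DAG findable via the UI tag filter
)
def taxi_pipeline():
    ...

taxi_pipeline()
```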
After finishing practice, write a short reflection of 5 to 8 lines and add it to your assignment report draft.
<aside>
💡 Using AI to help: Before Exercise 3, paste your taxi_pipeline (redacted: ⚠️ no credentials, no PII, no real database host) into an LLM and ask it to suggest where a quality-gate task could sit without breaking the idempotency contract. Review the suggestion before coding: LLMs occasionally recommend placing the gate after dbt_run, which defeats the purpose of failing fast on bad data.
</aside>
In the next chapter, you will scan the Week 11 gotchas before starting your assignment. Read them fresh: you have just practiced enough that most of them will jump out as "oh, I nearly did that" rather than "I have no idea what this warning is about."