Orchestration failures can be hard to notice. Your DAG may look correct but still produce wrong results because of schedule assumptions, rerun behavior, or environment differences. These are the Week 11 traps you are most likely to hit, grouped by when they show up in the pipeline lifecycle.
Each gotcha links back to the chapter that teaches the correct pattern; use this page as an index when your DAG misbehaves.
datetime.now() instead of the logical date
The misconception: the run date in Airflow is the same as "current time now."
The reality: Airflow's logical date represents the data interval, not the clock time when the run starts. If a run is delayed by retries or queueing, the logical date still points at the scheduled interval. Using datetime.now() inside a task means each retry of the same logical run reads a different boundary, which breaks idempotency.
The fix: use {{ ds }} (or the get_current_context()["dag_run"].logical_date helper from Parameterized Runs and Backfills) for partition logic. Never call datetime.now() to build a data boundary.
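As a concrete contrast, a minimal sketch assuming the TaskFlow API; the task name echoes ingest_taxi_month from Monitoring and Debugging, and the TLC URL pattern is illustrative:

```python
from airflow.decorators import task

@task
def ingest_taxi_month(ds=None):
    # Airflow injects the rendered {{ ds }} template into the `ds` kwarg,
    # so every retry of the same logical run sees the same boundary.
    year_month = ds[:7]  # e.g. "2024-01"
    # WRONG alternative: datetime.now().strftime("%Y-%m") points at "today"
    # during a backfill and can shift between retries that cross midnight.
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"green_tripdata_{year_month}.parquet"
    )
```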
catchup=True accidentally on a past start_date
The misconception: turning on a new DAG with catchup=True is harmless.
The reality: if start_date is far in the past and catchup=True, Airflow creates one scheduled run per past interval the moment you unpause. For a @monthly DAG with start_date=datetime(2024, 1, 1) that could be 16 runs firing at once. Half of those runs probably target TLC months that are not published yet, so your scheduler log fills with 403s before you even realize what happened.
The fix: keep catchup=False during development (as Scheduling and Triggers and the Parameterized Runs and Backfills snapshot do). Load history deliberately via airflow backfill create --from-date … --to-date … instead.
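A sketch of the guard, in the @dag decorator style the snapshot uses; dag_id and schedule mirror the example above, the rest is illustrative:

```python
from datetime import datetime
from airflow.decorators import dag

@dag(
    dag_id="taxi_pipeline",
    schedule="@monthly",
    start_date=datetime(2024, 1, 1),
    catchup=False,        # no run-per-past-interval surprise on unpause
    max_active_runs=1,    # see the dbt gotcha below
)
def taxi_pipeline():
    ...

taxi_pipeline()
```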
Appends without a DELETE scoped to the logical date
The misconception: rerunning tasks is always safe.
The reality: if a task always appends without scoping a DELETE to the logical date, retries and backfills double-count. A Feb re-run that appends another 53,000 rows on top of the existing 53,000 Feb rows silently corrupts every downstream count.
The fix: use DELETE-then-append scoped to the logical date, as Parameterized Runs and Backfills teaches. Validate by running the same logical date twice and confirming the row count does not change.
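A minimal sketch of the pattern, assuming a PostgresHook against the azure_pg Connection and the raw_trips/year_month names used above; the staging table is illustrative:

```python
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

@task
def load_month(ds=None):
    year_month = ds[:7]
    hook = PostgresHook(postgres_conn_id="azure_pg")
    conn = hook.get_conn()
    with conn, conn.cursor() as cur:  # commits on clean exit
        # Delete exactly the partition this logical run owns...
        cur.execute("DELETE FROM raw_trips WHERE year_month = %s", (year_month,))
        # ...then append it fresh. Running the same logical date twice now
        # leaves the row count unchanged, which is the validation the
        # chapter asks for.
        cur.execute(
            "INSERT INTO raw_trips SELECT * FROM staging_trips WHERE year_month = %s",
            (year_month,),
        )
```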
A month-bucket DELETE that misses boundary spillover
The misconception: DELETE FROM raw_trips WHERE year_month = '2024-01' followed by INSERT from the 2024-01 parquet makes the Jan load fully idempotent.
The reality: the TLC parquet for month N contains a handful of rows whose lpep_pickup_datetime actually falls in month N-1 or N+1 (drivers reporting trips across the month boundary, device-clock drift). The DELETE only wipes January's bucket, but the re-append re-adds Jan's parquet entirely, including its late-December spillover. The Jan bucket is stable across re-runs; the Dec bucket grows by 2-10 rows each time you re-run Jan. Measured during Week 11 verification: Dec=64213 on pass 1, Dec=64215 on pass 2.
The fix: at Week 11's scale, know the limitation and accept the drift. For production, scope the DELETE to an explicit half-open datetime range, e.g. WHERE lpep_pickup_datetime >= '2024-01-01' AND lpep_pickup_datetime < '2024-02-01' (BETWEEN is inclusive at both ends, so it would also claim the midnight row that belongs to February). More robust still: stamp each row with the parquet it came from and scope the DELETE to that stamp, so the spillover rows are replaced along with everything else; a sketch follows.
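A hedged sketch of the stamped variant: source_month is an illustrative column, not part of the Week 11 schema, but the DELETE then removes exactly what a previous run of the same parquet inserted, spillover included.

```python
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

@task
def load_month(ds=None):
    source_month = ds[:7]
    hook = PostgresHook(postgres_conn_id="azure_pg")
    # Scope the delete to the parquet we are about to re-insert, not to the
    # pickup-time bucket, so late-December spillover rows are replaced too.
    hook.run(
        "DELETE FROM raw_trips WHERE source_month = %s",
        parameters=(source_month,),
    )
    # ...then INSERT the parquet rows, stamping source_month on each one.
```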
Secrets hardcoded in the DAG file
The misconception: a quick PG_PASSWORD = "hunter2" in the DAG header is fine for a classroom project.
The reality: DAG files are checked into Git. Inlined secrets leak into the repo history, into Notion exports of the curriculum, and into any LLM a student pastes the DAG into for debugging. Sequential Pipelines established the correct pattern: an azure_pg Airflow Connection referenced via {{ conn.azure_pg.password }} (and never printed, never returned from a task, never logged).
The fix: any secret accessed by the DAG goes through azure_pg (or a Variable, for non-credential runtime config). Week 12 will replace even the Connection with Key Vault + Managed Identity so the password stops living on disk entirely.
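A sketch of run-time resolution through the hook, assuming the azure_pg Connection already exists; nothing secret is ever a module-level constant:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def pg_engine():
    # The hook resolves host/user/password from the azure_pg Connection at
    # run time. Nothing to commit, nothing to log; XCom never sees it
    # because no task returns the credential. In templated fields you can
    # also reference {{ conn.azure_pg.password }} (and never print it).
    return PostgresHook(postgres_conn_id="azure_pg").get_sqlalchemy_engine()
```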
max_active_runs=1 from the snapshot
The misconception: parallel backfill runs are faster, so a higher max_active_runs is an obvious optimization.
The reality: dbt stages model materializations by renaming the target table to <model_name>__dbt_backup. Two concurrent dbt_run tasks against the same schema both try to create stg_trips__dbt_backup and the second fails with relation already exists. Parameterized Runs and Backfills sets max_active_runs=1 explicitly for this reason. A 3-month backfill runs serially in ~3x the time of one run, which is still faster than chasing a half-corrupted dbt backup relation the morning after.
The fix: leave max_active_runs=1 in the @dag(...) decorator for any DAG that invokes dbt. If you need parallelism at scale, the proper fix is per-run schemas (Jinja the schema name with {{ ds_nodash }}), not removing the constraint.
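A hedged sketch of the per-run-schema alternative, assuming your profiles.yml reads the target schema from an environment variable (e.g. schema: "{{ env_var('DBT_SCHEMA', 'analytics') }}"); only with that isolation in place is raising max_active_runs safe:

```python
from airflow.operators.bash import BashOperator

dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command="dbt run --profiles-dir /usr/local/airflow/include/dbt",
    # `env` is a templated field: each logical run gets its own schema,
    # e.g. analytics_20240101, so no two runs fight over one
    # __dbt_backup relation.
    env={"DBT_SCHEMA": "analytics_{{ ds_nodash }}"},
    append_env=True,  # keep the rest of the worker environment
)
```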
A DAG that never parses
The misconception: if the DAG file exists in dags/, Airflow is running it.
The reality: a syntax error or missing import silently drops the DAG from the scheduler's list. You will not see a failed run because no run was ever created. The UI shows 0 DAGs with a red "1" import-error badge next to the count, which is easy to miss if you do not know to look for it.
The fix: add the DagBag integrity test from Testing DAGs to every project. Run astro dev pytest tests/test_dag_integrity.py --args "-v" before every push. For the shared class VM, the same test runs as a CI gate on each PR.
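A minimal version of what such a test can look like; the DagBag check is the standard Airflow pattern, and the expected dag_id matches taxi_pipeline.py from the practice exercises:

```python
from airflow.models import DagBag

def test_dag_integrity():
    bag = DagBag(include_examples=False)
    # A syntax error or missing import in dags/ surfaces here as an import
    # error instead of a DAG silently missing from the scheduler.
    assert bag.import_errors == {}, f"DAG import failures: {bag.import_errors}"
    assert "taxi_pipeline" in bag.dags
```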
Laptop paths and env vars inside the worker
The misconception: paths and environment variables that work on your laptop also work inside the Airflow worker.
The reality: Airflow tasks run inside a container with a different filesystem view from your shell. /tmp on your Mac is not /tmp inside the Astro scheduler container (which is what runs the task). $HOME is /home/airflow, not /Users/you. Network paths, mounted volumes, and writable directories all differ. Deploying to Shared Airflow documents this split explicitly.
The fix: use the mounted include/ path. Both Astro and the class VM mount it (at /usr/local/airflow/include and /opt/airflow/include respectively); the reference snapshot's find_dbt_dir() detects which one applies. Read configuration from Airflow Variables or Connections, not from host env vars.
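A hedged sketch of what a find_dbt_dir() helper can look like; the actual snapshot implementation may differ, but the two mount points are the ones named above:

```python
from pathlib import Path

def find_dbt_dir() -> Path:
    # Astro mounts include/ at /usr/local/airflow, the class VM at
    # /opt/airflow; probe both instead of assuming a laptop path.
    for base in ("/usr/local/airflow/include", "/opt/airflow/include"):
        candidate = Path(base) / "dbt"
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError("include/dbt not found in any known mount")
```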
Retries as a fix for deterministic bugs
The misconception: more retries always improve reliability.
The reality: retries help transient failures (network blip, momentary database unavailability, short API rate-limit). They do not help a 403 from a typo'd URL, a missing column, or bad SQL; those fail identically every retry. Monitoring and Debugging walks through the exact 403 case on ingest_taxi_month: two retries cost you ten minutes and recover nothing.
The fix: keep retries=2 in default_args for transient-failure catching. When a task has failed three tries with the same error, treat it as a deterministic bug; fix the code rather than bumping retries to 5.
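For reference, the shape of the setting in default_args; the retry_delay value here is an assumption, keep whatever the snapshot sets:

```python
from datetime import timedelta

default_args = {
    "retries": 2,                         # absorbs transient failures
    "retry_delay": timedelta(minutes=5),  # back off instead of hammering
}
```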
No runbook next to the DAG
The misconception: if the code is clear, a runbook is optional.
The reality: incidents happen at 03:00 when the person who wrote the DAG is asleep. The runbook is what lets the on-call engineer (often future-you) answer "what failed, how do I fix it, what happens if I do nothing?" without reading 400 lines of Python first. Monitoring and Debugging establishes the minimum shape: symptom, probable cause, immediate check, safe recovery, escalation.
The fix: write a one-page RUNBOOK.md next to your DAG listing the two most likely failure modes (Ch7 covers the TLC 403 and the dbt-test FAIL), with "what to do first" under each. Update it the next time you hit a failure not in the runbook.
Open your taxi_pipeline.py from the Week 11 Practice exercises. Walk through each of the 10 gotchas above and, for each one, answer in one line: does your DAG guard against it, and if not, what is the fix?
Paste the 10-line checklist into your assignment report draft. Not graded, but the exercise is what makes the gotchas stick. It is the difference between "I read the gotchas page" and "I used the gotchas page to find three bugs in my own code."
Check yourself:
- Why does datetime.now() inside a task break backfill correctness? Name the specific contract it violates.
- What risk does catchup=True introduce when combined with a past start_date? Walk through what happens in the 60 seconds after you unpause such a DAG.
- Why is max_active_runs=1 load-bearing for any DAG that invokes dbt, and what would you change if you wanted to run two months in parallel anyway?
<aside>
📚 For the "what comes next" list (dbt incremental for partition-scoped loads at scale, Airflow Variables vs Connections, custom XCom backends), see the optional Going Further page.
</aside>
In the next chapter, you apply Week 11 end-to-end: build a scheduled, parameterized, backfillable DAG, write the runbook, and deploy it to the shared Airflow. The gotchas above are the checklist to walk through before you submit.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.