A pipeline is not production-ready until you can handle failures. On a green run you ignore the Airflow UI; on a red run it is your only lever. This chapter teaches you how to pull that lever: find the failing task, read its logs, decide whether retries will help, and write down what you learned so the next person (future you, at 03:00) does not have to rediscover it.
By the end of this chapter, you should be able to investigate failed Airflow runs, configure retries, and build a one-page runbook for the taxi pipeline.
When a DAG fails, inspect in this order:

1. The Grid view: which task went red, and which downstream tasks show upstream_failed.
2. The failing task's logs.
3. The Rendered Templates tab: {{ ds }}, {{ conn.azure_pg.host }}, and the other Jinja placeholders you wrote in Parameterized Runs and Backfills. When a task fails with "host is None" or "file not found for 2024-13", the rendered tab usually tells you why in one glance.

This order usually shows whether the issue lives in the data, configuration, dependency, or infrastructure layer.
Task logs often show the failing line. Focus on:

- the exception and traceback, read from the bottom upward;
- the attempt counter (try 1 of N), which tells you whether retries are already in play.

Do not read only the last line. The main clue is often earlier.
Retries help with transient failures: a DNS blip, a connection reset, a briefly unavailable API.
Example (see Airflow's default_args reference for the full list of retry-related keys):
from datetime import timedelta

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}
Retries do not fix deterministic code bugs (bad SQL, missing column, wrong path). Those need code changes.
<aside> ⚠️ Too many retries can hide real failures and increase system load.
</aside>
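Retry semantics are easier to reason about in plain Python than in scheduler config. Below is a hypothetical sketch (not Airflow's actual implementation) of what a retry loop amounts to; `run_with_retries` and `flaky` are illustrative names:

```python
import time

def run_with_retries(task, retries=2, retry_delay=0.0):
    """Re-run `task` up to `retries` extra times, sleeping between attempts.
    A transient failure (e.g. a connection reset) can succeed on attempt 2;
    a deterministic bug (bad SQL, wrong URL) fails identically every time."""
    for attempt in range(1, retries + 2):   # try 1 of N, try 2 of N, ...
        try:
            return task()
        except Exception:
            if attempt == retries + 1:
                raise                       # retries exhausted: mark the task failed
            time.sleep(retry_delay)

# A flaky task that succeeds on its second attempt -- retries rescue it.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionResetError("transient blip")
    return "ok"
```

Wrapping a deterministic bug such as `lambda: 1 / 0` in the same loop still raises after every attempt; only the transient failure is rescued.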
When tasks do not start at all (the UI shows them stuck in queued, or the DAG is missing), the problem is below the DAG code: a scheduler or dag-processor container is down or crashing. For the Astro local stack introduced in Airflow Fundamentals:
astro dev ps # container health
astro dev logs scheduler # recent scheduler activity
astro dev logs dag-processor # parsing errors
On the shared class VM (raw Docker Compose, no Astro wrapper):
docker compose ps
docker compose logs -f scheduler
Abstract failure taxonomies are easy to nod along to and forget. Here are the two that will bite you first on the Week 11 taxi_pipeline DAG from Sequential Pipelines, with the exact log pattern you will see and the fix.
What it looks like in the UI. ingest_taxi_month goes red in the grid view; dbt_run and dbt_test go grey (upstream_failed). The DAG overview surfaces the failure counts, and the sidebar's per-task rail lights up the downstream cascade:

Airflow DAG view: ingest task red, dbt_run and dbt_test marked upstream_failed, Failed Task and Failed Run counters both at 1
Click the red ingest cell and open the Logs tab. The real log from a typo'd URL looks like this (captured against the live TLC endpoint):

Airflow task log panel: ERROR line reads HTTPError: 403 Client Error: Forbidden for url: <https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2026-04_typo.parquet> with the full traceback below
Diagnose by reading upward from the requests.exceptions.HTTPError frame:

- A 403 means CloudFront refused to serve the object: with a _typo suffix or a 2026-04 that TLC has not yet published, CloudFront refuses to sign the asset and you get HTTPError: 403 Client Error: Forbidden.
- A 404 means a path segment is mistyped (e.g. /trip_data/ instead of /trip-data/).
- The useful line is the HTTPError: line, not the generic "Task failed with exception" on the line above.

Fix. For 403: audit the ds[:7] slice; confirm catchup is not creating runs for future dates; narrow the end_date of the backfill to last month if the current month is not yet published. For 404: copy the URL from the log into curl -I and verify which segment of the path is wrong.
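Both fixes can also be encoded as guards in the URL helper itself, so a bad month fails fast inside Python instead of surfacing as an HTTP 403. A sketch under assumptions: the base URL is the one visible in the failing log, and parquet_url_for mirrors the helper named later in this chapter (the guard clauses are ours, not the reference DAG's):

```python
from datetime import date, datetime

TLC_BASE = "https://d37ci6vzurychx.cloudfront.net/trip-data"  # base URL seen in the failing log

def parquet_url_for(ds: str) -> str:
    """Build the TLC parquet URL for a logical date, failing fast on bad months."""
    year_month = ds[:7]                                    # "2026-04-22" -> "2026-04"
    month = datetime.strptime(year_month, "%Y-%m").date()  # rejects "2024-13" with ValueError
    if month >= date.today().replace(day=1):
        # TLC publishes with a lag, so the current month is never available yet
        raise ValueError(f"{year_month} not published yet; narrow the backfill end_date")
    return f"{TLC_BASE}/green_tripdata_{year_month}.parquet"
```

A ValueError raised here points straight at the bad logical date, instead of at CloudFront.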
Why retry alone will not rescue this. retries=2 re-runs the task, which re-issues the same GET to the same URL. The HTTP error is deterministic against the URL, so retries only waste time. Retries help with transient network failures (DNS blip, connection reset), not 4xx responses.
<aside>
💡 Notice the log line File ".../requests/models.py", line 1028 in raise_for_status. This tells you the reference DAG calls response.raise_for_status() on every TLC download, which converts a silent 403/404 HTML body into a real Python exception. Without it, the HTML error page would be fed straight to pd.read_parquet inside the same task and you would see a cryptic ArrowInvalid: Parquet magic bytes not found instead: harder to diagnose because it points at pandas rather than at the real culprit (the HTTP response).
</aside>
dbt_test red-lines after an ingestion loaded bad rows

What it looks like in the UI. ingest_taxi_month and dbt_run are both green; dbt_test is red. Downstream consumers (any dashboard DAG that depends on fct_trips) are blocked, which is exactly what the failure gate exists for.
Log pattern:
[2026-04-22 06:03:41] INFO - Running: dbt test --project-dir /usr/local/airflow/include/dbt_project
...
3 of 11 WARN 4 dbt_utils_unique_combination_of_columns_stg_trips_... [WARN 4 in 0.68s]
5 of 11 FAIL 47 accepted_values_stg_trips_payment_type__1__2__3__4__5__6 [FAIL 47 in 0.52s]
Done. PASS=9 WARN=1 ERROR=1 SKIP=0 NO-OP=0 TOTAL=11
[2026-04-22 06:03:42] ERROR - Task failed: dbt test exited with code 1
Diagnose from the dbt-test summary:
- WARN 4 on the unique-combination test is the expected 4 duplicates in TLC data from Week 10; it is not the cause (warn does not fail the task).
- FAIL 47 on accepted_values_stg_trips_payment_type is the cause: 47 rows have a payment_type outside the documented [1,2,3,4,5,6]. Either TLC added a new code this month or the ingest mangled the type. Open the compiled query from target/compiled/... and run it against stg_trips in psql to see the offending values.

Fix. If TLC truly added a new code, update accepted_values in _stg_trips.yml to include it. If the ingest is the culprit (e.g. an integer parsed as a string), fix the type coercion in the load task and rerun the affected month.
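dbt's accepted_values test is conceptually tiny; a hypothetical Python equivalent makes the FAIL 47 concrete (the real check runs as compiled SQL against stg_trips, not in Python):

```python
from collections import Counter

ACCEPTED_PAYMENT_TYPES = {1, 2, 3, 4, 5, 6}  # the documented list from _stg_trips.yml

def accepted_values_failures(payment_types):
    """Tally rows whose payment_type falls outside the accepted set.
    dbt reports the total as FAIL <n>; the per-value breakdown is what
    running the compiled query in psql would show you."""
    return dict(Counter(v for v in payment_types if v not in ACCEPTED_PAYMENT_TYPES))
```

Note that a stringly-typed "3" counts as a failure even though the integer 3 would pass; that is exactly the "integer parsed as string" ingest bug described above.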
Why this is the ideal orchestration win. The four-task split you built in Sequential Pipelines means the broken dbt_test does not contaminate downstream dashboards: they are blocked by the failure gate until a human resolves the data-quality issue. A single-script version of this pipeline would have quietly shipped bad data.
Three topics matter in production but are out of scope for a Week 11 project and live in the optional Going Further page:
execution_timeout tuning (a per-task deadline Airflow enforces by killing the task), resource pressure, and upstream-data delay. Worth skimming when you plan a real on-call rotation.

Import errors are the one category you do already have a guard for: the integrity test from Testing DAGs catches them before they reach the scheduler.
A short runbook should include, for each known failure mode:

- Symptom: what the UI and the logs show.
- Check: the exact places to look and commands to run.
- Fix: the remediation steps, or who to escalate to.
This reduces panic and helps teams respond in a consistent way.
<aside> 💭 Treat runbooks as living documents. Update them after important failures.
</aside>
The best runbooks are short enough to use under pressure.
Work through the full incident-handling loop: break → observe → diagnose → fix → verify. Each step maps to a real action an on-call engineer takes.
Break. Open dags/taxi_pipeline.py, find the parquet_url_for helper and add a _typo suffix to the filename:

# before
return f"{TLC_BASE}/green_tripdata_{year_month}.parquet"
# after (break it)
return f"{TLC_BASE}/green_tripdata_{year_month}_typo.parquet"
Observe. Trigger the DAG:

astro dev run dags trigger taxi_pipeline

The grid shows one red task (ingest_taxi_month) and two orange upstream_failed tasks (dbt_run, dbt_test), matching the first screenshot above. The Failed Task and Failed Run counters on the overview both read 1.

Diagnose. Click ingest_taxi_month and open Logs. Scroll to the bottom and then read upward from the last frame: the useful line is the HTTPError: 403 Client Error: Forbidden for url: ... line, not the generic "Task failed with exception" above it. Note which URL got requested. For dbt_run / dbt_test, the Rendered Templates tab shows what Airflow actually substituted for {{ conn.azure_pg.host }} and the other DBT_ENV placeholders. This is where you verify "is Airflow seeing the connection I think it is?" when a DB-connection task fails:
Airflow Rendered Templates tab for dbt_run: env dict shows PG_HOST substituted to hyf-data-pg.***.database.azure.com, PG_DBNAME to team1, PG_SCHEMA to airflow_<student>, PG_PASSWORD redacted as ***
The password is redacted; the host, database, and schema appear substituted verbatim. The tab is empty for tasks that never ran (upstream_failed state) and for pure-Python @task tasks that read context directly instead of relying on Jinja, so do not be alarmed when a failed task's rendered tab looks blank.
Fix. Revert the _typo suffix, then create RUNBOOK.md next to your DAG and add:

## ingest_taxi_month fails with HTTPError 4xx
**Symptom:** ingest_taxi_month goes red; dbt_run / dbt_test go upstream_failed.
**Check:** Task log, read upward from the last traceback frame. The first useful line is `HTTPError: <code> ... for url: <url>`.
**Fix:** If 403, the TLC path is wrong or the month is not yet published. If 404, a path segment is mistyped. Copy the URL into `curl -I` to confirm.
Verify. Re-trigger the DAG and confirm every task goes green. The rerun is safe because ingest_taxi_month starts with DELETE FROM ... WHERE year_month = %s and re-inserts the same rows. No drift, no duplicates.

Completing this loop once gives you the muscle memory the runbook is trying to encode.
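That delete-then-insert idempotency is worth seeing in miniature. A sketch against an in-memory SQLite table (hypothetical schema and names; the real task targets Postgres, where the placeholder is %s rather than ?):

```python
import sqlite3

def load_month(conn, year_month, fares):
    """Idempotent month load: wipe the partition, then re-insert it.
    The `with conn` block is one transaction, so a failed rerun
    rolls back instead of leaving a half-deleted month behind."""
    with conn:
        conn.execute("DELETE FROM trips WHERE year_month = ?", (year_month,))
        conn.executemany(
            "INSERT INTO trips (year_month, fare) VALUES (?, ?)",
            [(year_month, f) for f in fares],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (year_month TEXT, fare REAL)")
load_month(conn, "2026-04", [12.5, 7.0])
load_month(conn, "2026-04", [12.5, 7.0])  # rerun: same rows, no duplicates
```

Running the load twice leaves exactly the same rows in the table, which is what makes "just re-trigger the DAG" a safe verification step.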
Operational maturity grows through repeated, structured incident handling.
<aside> 🤓 Curious Geek: "Toil" in SRE. Google's SRE book defines toil as manual, repetitive, automatable operational work that scales with service growth. Runbooks are a first step toward automating it away.
</aside>
You can speed up triage notes with AI, but keep human judgment for final actions.
<aside> 💡 Using AI to help: Paste a sanitized log excerpt (⚠️ Ensure no PII or sensitive company data is included!) and ask for likely root-cause categories before you decide the final fix.
</aside>
1. Break ingest_taxi_month by forcing a month that does not exist (e.g. hardcode ds = "2024-13-01" at the top of the function). Trigger the DAG, find the first meaningful line in the task logs (hint: search upward from requests.exceptions), and note the fix. Revert.
2. taxi_pipeline already sets default_args={"retries": 2} at the DAG level, so every task inherits two retries. Override the retry count per task instead: add retries=4, retry_delay=timedelta(minutes=1) to ingest_taxi_month only (keep the DAG-level default as-is, and do not add per-task retries to dbt_test). Explain in one sentence why generous retries belong on the network-bound ingest task but are pointless on the test task.
3. Write a one-page runbook for taxi_pipeline that covers at least the two failure modes documented in this chapter (TLC 404, accepted_values FAIL) plus a connection-to-azure_pg failure.

<aside> 📝 Practice: The week's Practice chapter has an exercise on debugging a deliberately broken DAG run. Do it after the exercises above; it reinforces the log-reading + runbook pattern against a fresh failure you haven't seen before.
</aside>
Production teams go further than the Grid view + log-reading you practised here.
<aside>
💡 In the wild: Production Airflow teams instrument failures with OpenLineage, which captures lineage and run metadata across DAGs and downstream consumers. Browse the apache-airflow-providers-openlineage provider to see what observability looks like beyond the Grid view: every task emits structured events that drive dashboards, alerts, and post-mortem timelines.
</aside>