Week 11 - Orchestration

Introduction to Orchestration

Airflow Fundamentals

Scheduling and Triggers

Sequential Pipeline Steps

Parameterized Runs and Backfills

Testing DAGs

Monitoring and Debugging

Deploying to Shared Airflow

Practice

Gotchas & Pitfalls

Assignment: Build an Orchestrated Data Pipeline

Week 11 Lesson Plan (Teachers)

Monitoring and Debugging

A pipeline is not production-ready until you can handle failures. On a green run you ignore the Airflow UI; on a red run it is your only lever. This chapter teaches you how to pull that lever: find the failing task, read its logs, decide whether retries will help, and write down what you learned so the next person (future you, at 03:00) does not have to rediscover it.

By the end of this chapter, you should be able to investigate failed Airflow runs, configure retries, and build a one-page runbook for the taxi pipeline.

Concepts

Start in the Airflow UI

When a DAG fails, first inspect:

  1. DAG run status in Grid view
  2. Failed task instance
  3. Task logs
  4. Try number and duration
  5. Rendered templates for parameter values: the Rendered Template tab in the task-instance view shows what Airflow actually substituted for {{ ds }}, {{ conn.azure_pg.host }}, and the other Jinja placeholders you wrote in Parameterized Runs and Backfills. When a task fails with "host is None" or "file not found for 2024-13", the rendered tab usually tells you why in one glance.

This order usually shows whether the issue lies in the data, configuration, dependency, or infrastructure layer.
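The UI is the fastest route during an incident, but the same facts are scriptable. Below is a minimal sketch that asks Airflow's stable REST API which recent runs of taxi_pipeline failed and which task instances failed inside them. The base URL, credentials, and API version are assumptions for a local dev stack (depending on your Airflow configuration you may need to enable the basic-auth API backend), so treat this as a sketch rather than the supported workflow.

import requests

BASE = "http://localhost:8080/api/v1"   # assumption: local webserver, Airflow 2 stable API
AUTH = ("admin", "admin")               # assumption: default local dev credentials

# Ask for the most recent failed runs of the DAG.
runs = requests.get(
    f"{BASE}/dags/taxi_pipeline/dagRuns",
    params={"state": "failed", "limit": 5},
    auth=AUTH,
    timeout=10,
)
runs.raise_for_status()

for run in runs.json()["dag_runs"]:
    # For each failed run, list the task instances that actually failed.
    tis = requests.get(
        f"{BASE}/dags/taxi_pipeline/dagRuns/{run['dag_run_id']}/taskInstances",
        auth=AUTH,
        timeout=10,
    )
    tis.raise_for_status()
    failed = [ti["task_id"] for ti in tis.json()["task_instances"] if ti["state"] == "failed"]
    print(run["dag_run_id"], "->", failed)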

Reading logs effectively

Task logs usually contain the failing line. Do not read only the last line: the main clue is often a few frames earlier in the traceback, in the first line that mentions your own code or the request that actually failed.

Retry strategy

Retries help with transient failures such as DNS blips, connection resets, or a briefly unavailable database.

Example (see Airflow's default_args reference for the full list of retry-related keys):

from datetime import timedelta

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

Retries do not fix deterministic code bugs (bad SQL, missing column, wrong path). Those need code changes.
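If the transient failures you see are bursty (an endpoint that recovers only after several minutes), exponential backoff between attempts is a common refinement. A minimal sketch using standard operator arguments; the numbers are illustrative, not a recommendation:

from datetime import timedelta

default_args = {
    "retries": 3,                               # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=2),        # wait before the first retry
    "retry_exponential_backoff": True,          # roughly double the wait on each further retry
    "max_retry_delay": timedelta(minutes=30),   # cap the backoff so runs do not stall for hours
}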

<aside> ⚠️ Too many retries can hide real failures and increase system load.

</aside>

Infra-level diagnostics

When tasks do not start at all (the UI shows them stuck in queued, or the DAG is missing), the problem is below the DAG code: a scheduler or dag-processor container is down or crashing. For the Astro local stack introduced in Airflow Fundamentals:

astro dev ps                      # container health
astro dev logs scheduler          # recent scheduler activity
astro dev logs dag-processor      # parsing errors

On the shared class VM (raw Docker Compose, no Astro wrapper):

docker compose ps
docker compose logs -f scheduler

Two failure modes you will actually hit on the taxi pipeline

Abstract failure taxonomies are easy to nod along to and forget. Here are the two that will bite you first on the Week 11 taxi_pipeline DAG from Sequential Pipelines, with the exact log pattern you will see and the fix.

Failure 1: TLC download task returns HTTP 403 or 404

What it looks like in the UI. ingest_taxi_month goes red in the grid view; dbt_run and dbt_test go grey (upstream_failed). The DAG overview surfaces the failure counts, and the sidebar's per-task rail lights up the downstream cascade:

Airflow DAG view: ingest task red, dbt_run and dbt_test marked upstream_failed, Failed Task and Failed Run counters both at 1

Click the red ingest cell and open the Logs tab. The real log from a typo'd URL looks like this (captured against the live TLC endpoint):

Airflow task log panel: ERROR line reads HTTPError: 403 Client Error: Forbidden for url: <https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2026-04_typo.parquet> with the full traceback below

Diagnose by reading upward from the requests.exceptions traceback: the HTTPError line names the status code and the exact URL that was requested, which tells you whether a path segment, the month, or the whole URL pattern is wrong.

Fix. For 403: audit the ds[:7] slice; confirm catchup is not creating runs for future dates; narrow the end_date of the backfill to last month if the current month is not yet published. For 404: copy the URL from the log into curl -I and verify which segment of the path is wrong.

Why retry alone will not rescue this. retries=2 re-runs the task, which re-issues the same GET to the same URL. The HTTP error is deterministic against the URL, so retries only waste time. Retries help with transient network failures (DNS blip, connection reset), not 4xx responses.

<aside> 💡 Notice the log line File ".../requests/models.py", line 1028 in raise_for_status. This tells you the reference DAG calls response.raise_for_status() on every TLC download, which converts a silent 403/404 HTML body into a real Python exception. Without it, the HTML error page would be fed straight to pd.read_parquet inside the same task and you would see a cryptic ArrowInvalid: Parquet magic bytes not found instead: harder to diagnose because it points at pandas rather than at the real culprit (the HTTP response).

</aside>
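To make the distinction concrete, here is one way the download step can separate deterministic from transient errors. This is a sketch of the pattern, not the reference DAG's code; the function name and URL argument are illustrative.

import requests
from airflow.exceptions import AirflowFailException

def download_month(url: str) -> bytes:
    """Fetch one monthly parquet file, separating retryable from non-retryable errors."""
    try:
        response = requests.get(url, timeout=60)
        response.raise_for_status()   # turn a 403/404 HTML body into a real exception
    except requests.exceptions.HTTPError as err:
        if err.response is not None and 400 <= err.response.status_code < 500:
            # A 4xx against a fixed URL is deterministic: retries would repeat the same GET,
            # so fail the task immediately instead of burning retry attempts.
            raise AirflowFailException(f"Non-retryable HTTP error for {url}: {err}") from err
        raise   # 5xx may be transient: re-raise and let Airflow's retries take a shot
    # ConnectionError and Timeout are deliberately not caught: they propagate, the task
    # fails, and the configured retries get a chance to succeed on the next attempt.
    return response.content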

Failure 2: dbt_test red-lines after an ingestion loaded bad rows

What it looks like in the UI. ingest_taxi_month and dbt_run both green. dbt_test red. Downstream consumers (any dashboard DAG that depends on fct_trips) are blocked, which is exactly what the failure gate exists for.

Log pattern:

[2026-04-22 06:03:41] INFO - Running: dbt test --project-dir /usr/local/airflow/include/dbt_project
...
 3 of 11 WARN 4   dbt_utils_unique_combination_of_columns_stg_trips_...  [WARN 4 in 0.68s]
 5 of 11 FAIL 47  accepted_values_stg_trips_payment_type__1__2__3__4__5__6 [FAIL 47 in 0.52s]
Done. PASS=9 WARN=1 ERROR=1 SKIP=0 NO-OP=0 TOTAL=11
[2026-04-22 06:03:42] ERROR - Task failed: dbt test exited with code 1

Diagnose from the dbt test summary: the FAIL line names the failing test and its target (accepted_values on stg_trips.payment_type) plus the number of offending rows (47), while the WARN on the unique-combination test is non-blocking. Only the FAIL drives the exit code 1 that turns the task red.

Fix. If the TLC truly added a new code, update accepted_values in _stg_trips.yml to include it. If the ingest is the culprit (e.g. integer parsed as string), fix the type coercion in the load task and rerun the affected month.

Why this is the ideal orchestration win. The task-by-task split you built in Sequential Pipelines means the broken dbt_test does not contaminate downstream dashboards: they are blocked by the failure gate until a human resolves the data-quality issue. A single-script version of this pipeline would have quietly shipped bad data.
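As a reminder of what that gate looks like in DAG code, here is a minimal dependency sketch with placeholder operators. The first three task ids mirror this chapter; publish_dashboard is illustrative and stands in for any downstream consumer, not a task in the reference DAG.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="failure_gate_sketch",
    start_date=datetime(2026, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    ingest_taxi_month = EmptyOperator(task_id="ingest_taxi_month")
    dbt_run = EmptyOperator(task_id="dbt_run")
    dbt_test = EmptyOperator(task_id="dbt_test")
    publish_dashboard = EmptyOperator(task_id="publish_dashboard")

    # If dbt_test fails, publish_dashboard is marked upstream_failed and never runs:
    # that is the failure gate doing its job.
    ingest_taxi_month >> dbt_run >> dbt_test >> publish_dashboard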

What this chapter does not cover

Three further topics matter in production but are out of scope for a Week 11 project; they are covered in the optional Going Further page.

Import errors are the one category you do already have a guard for: the integrity test from Testing DAGs catches them before they reach the scheduler.
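For reference, a minimal sketch of such an integrity test, assuming pytest and a dags/ folder at the project root; your version from Testing DAGs may differ in the details:

from airflow.models import DagBag

def test_no_import_errors():
    # Parse every file in dags/ the same way the scheduler would, without running tasks.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    # Any syntax error or missing import shows up here before it ever reaches the scheduler.
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"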

Build a runbook

A short runbook should cover, for each known failure mode: the symptom as it appears in the UI, where to look first (which task log or tab), the most likely causes, and the fix or escalation path.

This reduces panic and helps teams respond in a consistent way.

<aside> 💭 Treat runbooks as living documents. Update them after important failures.

</aside>

The best runbooks are short enough to use under pressure.

⌨️ Hands on: Reproduce failure 1 end-to-end

Work through the full incident-handling loop: break → observe → diagnose → fix → verify. Each step maps to a real action an on-call engineer takes.

  1. Break the download URL. In dags/taxi_pipeline.py, find the parquet_url_for helper and add a _typo suffix to the filename:
   # before
   return f"{TLC_BASE}/green_tripdata_{year_month}.parquet"
   # after (break it)
   return f"{TLC_BASE}/green_tripdata_{year_month}_typo.parquet"
  2. Trigger a run. Either click Trigger in the UI or from your project root run:
   astro dev run dags trigger taxi_pipeline
  3. Observe the failure in the UI. Open the DAG page. You should see one red task (ingest_taxi_month) and two upstream_failed tasks (dbt_run, dbt_test), matching the first screenshot above. The Failed Task and Failed Run counters on the overview both read 1.
  4. Read the log upward. Click ingest_taxi_month, open Logs. Scroll to the bottom and then read upward from the last frame: the useful line is the HTTPError: 403 Client Error: Forbidden for url: ... line, not the generic "Task failed with exception" above it. Note which URL got requested.
  5. Check the Rendered Template tab (for any failing task with Jinja templating). For dbt_run / dbt_test, the Rendered Templates tab shows what Airflow actually substituted for {{ conn.azure_pg.host }} and the other DBT_ENV placeholders. This is where you verify "is Airflow seeing the connection I think it is?" when a DB-connection task fails (a sketch of the matching DAG-side declaration appears after this list):

Airflow Rendered Templates tab for dbt_run: env dict shows PG_HOST substituted to hyf-data-pg.***.database.azure.com, PG_DBNAME to team1, PG_SCHEMA to airflow_<student>, PG_PASSWORD redacted as ***

The password is redacted; the host, database, and schema appear substituted verbatim. The tab is empty for tasks that never ran (upstream_failed state) and for pure-Python @task tasks that read context directly instead of relying on Jinja, so do not be alarmed when a failed task's rendered tab looks blank.

  6. Write a runbook entry. Create RUNBOOK.md next to your DAG and add:
   ## ingest_taxi_month fails with HTTPError 4xx

   **Symptom:** ingest_taxi_month goes red; dbt_run / dbt_test go upstream_failed.
   **Check:** Task log, read upward from the last traceback frame. The first useful line is `HTTPError: <code> ... for url: <url>`.
   **Fix:** If 403, the TLC path is wrong or the month is not yet published. If 404, a path segment is mistyped. Copy the URL into `curl -I` to confirm.
  7. Revert the typo and trigger once more. Confirm the DAG goes fully green (all three tasks success). Idempotency check: re-trigger the same logical date a second time without changing anything. You should get the same green DAG, because ingest_taxi_month starts with DELETE FROM ... WHERE year_month = %s and re-inserts the same rows. No drift, no duplicates (see the sketches after this list).
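Two sketches to close the loop. First, the DAG-side declaration that the Rendered Templates tab in step 5 is showing you. Operator choice, connection id, and variable names follow the chapter's screenshots, but the reference DAG's exact code may differ:

from airflow.operators.bash import BashOperator

dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command="dbt run --project-dir /usr/local/airflow/include/dbt_project",
    env={
        # env is a templated field on BashOperator: the Jinja below is resolved at runtime,
        # and the resolved values are what the Rendered Templates tab displays.
        "PG_HOST": "{{ conn.azure_pg.host }}",
        "PG_DBNAME": "{{ conn.azure_pg.schema }}",      # for Postgres connections the database name lives in .schema
        "PG_SCHEMA": "{{ var.value.pg_schema }}",       # illustrative: your per-student schema may come from elsewhere
        "PG_PASSWORD": "{{ conn.azure_pg.password }}",  # shown redacted as *** in the UI
    },
)

Second, the delete-then-insert shape that makes the re-trigger in step 7 safe. Table and column names are illustrative and the reference load code may differ; the point is that the month you are about to load is wiped first, inside the same transaction as the insert:

import psycopg2

def load_month(rows: list[tuple], year_month: str, dsn: str) -> None:
    # One transaction: psycopg2 commits on a clean exit and rolls back on an exception,
    # so a re-run of the same logical date replaces the month instead of duplicating it.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "DELETE FROM raw_green_trips WHERE year_month = %s",
                (year_month,),
            )
            cur.executemany(
                "INSERT INTO raw_green_trips (year_month, pickup_ts, fare_amount) VALUES (%s, %s, %s)",
                [(year_month, pickup, fare) for pickup, fare in rows],
            )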

Completing this loop once gives you the muscle memory the runbook is trying to encode.

Operational maturity grows through repeated, structured incident handling.

<aside> 🤓 Curious Geek: "Toil" in SRE. In Google's SRE vocabulary, toil is the manual, repetitive, automatable operational work that grows with a service. Every incident you re-diagnose by hand from memory is toil; runbooks and sensible retry settings are the cheapest first steps toward reducing it.

</aside>

You can speed up triage notes with AI, but keep human judgment for final actions.

<aside> 💡 Using AI to help: Paste a sanitized log excerpt (⚠️ Ensure no PII or sensitive company data is included!) and ask for likely root-cause categories before you decide the final fix.

</aside>

Exercises

  1. Break ingest_taxi_month by forcing a month that does not exist (e.g. hardcode ds = "2024-13-01" at the top of the function). Trigger the DAG, find the first meaningful line in the task logs (hint: search upward from requests.exceptions), and note the fix. Revert.
  2. The reference taxi_pipeline already sets default_args={"retries": 2} at the DAG level, so every task inherits two retries. Override the retry count per task instead: add retries=4, retry_delay=timedelta(minutes=1) to ingest_taxi_month only (keep the DAG-level default as-is, and do not add per-task retries to dbt_test). Explain in one sentence why generous retries belong on the network-bound ingest task but are pointless on the test task.
  3. Draft a one-page runbook for taxi_pipeline that covers at least the two failure modes documented in this chapter (TLC 404, accepted_values FAIL) plus connection-to-azure_pg failure.

<aside> 📝 Practice: The week's Practice chapter has an exercise on debugging a deliberately broken DAG run. Do it after the exercises above; it reinforces the log-reading + runbook pattern against a fresh failure you haven't seen before.

</aside>

Production teams go further than the Grid view + log-reading you practised here.

<aside> 💡 In the wild: Production Airflow teams instrument failures with OpenLineage, which captures lineage and run metadata across DAGs and downstream consumers. Browse the apache-airflow-providers-openlineage provider to see what observability looks like beyond the Grid view: every task emits structured events that drive dashboards, alerts, and post-mortem timelines.

</aside>

Knowledge Check

  1. Why is the Airflow UI your first debugging stop?