In Weeks 3 through 10, you built important pipeline pieces: ingestion, transformations, tests, and cloud deployment patterns. In production, those pieces must run in the right order, at the right time, and with clear failure visibility. That coordination layer is called orchestration.
By the end of this chapter, you should be able to:

- Explain what orchestration is and why manual or cron-based execution breaks in production.
- Distinguish a scheduler ("when to start") from an orchestrator ("what to run, in what order, with what recovery behavior").
- Describe why pipelines are modeled as DAGs and why idempotency matters for retries and backfills.
Orchestration coordinates pipeline tasks, dependencies, retries, and run history across your full workflow.
Think about a daily flow:

1. Ingest raw data from an API or files.
2. Transform it into analytics-ready models.
3. Run data quality tests.
4. Notify stakeholders or refresh a dashboard.
If one step fails, the next step should not run. If a transient API failure occurs, a retry should happen automatically. If a run misses its window, someone should know. Orchestration handles all of this.
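Those recovery rules can be captured in a few lines of plain Python. The sketch below is a toy, not a real orchestrator (Airflow, Prefect, and Dagster all do far more); the task names and the `flaky_ingest` failure are made up for illustration.

```python
from typing import Callable

def run_with_retries(task: Callable[[], None], retries: int = 2) -> bool:
    """Run a task, retrying on any exception; True if it eventually succeeds."""
    for _ in range(retries + 1):
        try:
            task()
            return True
        except Exception:
            continue
    return False

def run_pipeline(tasks: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run tasks in order; skip everything downstream of the first hard failure."""
    log, failed = [], False
    for name, task in tasks:
        if failed:
            log.append(f"{name}: skipped")
        elif run_with_retries(task):
            log.append(f"{name}: success")
        else:
            log.append(f"{name}: failed")
            failed = True
    return log

# Demo tasks (hypothetical): ingest fails once with a transient error,
# the test task fails permanently, so notify must be skipped.
calls = {"n": 0}
def flaky_ingest():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient API failure")
def transform(): pass
def broken_test(): raise AssertionError("data check failed")
def notify(): pass

result = run_pipeline([("ingest", flaky_ingest), ("transform", transform),
                       ("test", broken_test), ("notify", notify)])
print(result)
# → ['ingest: success', 'transform: success', 'test: failed', 'notify: skipped']
```

Notice that the retry absorbed the transient ingest failure, while the hard test failure gated the notify step, exactly the behavior an orchestrator gives you declaratively.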
<aside> 💡 Scheduling answers "when to start." Orchestration answers "what to run, in what order, with what recovery behavior."
</aside>
Manual execution works for experiments, but it breaks in production:

- Someone forgets a step, or runs steps out of order.
- Transient failures go unnoticed because nobody retries or reads the logs.
- There is no run history to tell you when the pipeline last succeeded.
In data work, a "small" miss can become a business incident: stale dashboards, wrong KPIs, or incomplete model outputs.
A scheduler can trigger one command at a time, similar to cron.
An orchestrator adds:

- explicit task dependencies, so a failure stops downstream work
- automatic retries for transient failures
- run history, logs, and alerting
```mermaid
flowchart LR
    subgraph Cron["06:00 · Cron"]
        direction LR
        C1["run script.py"]
    end
    subgraph Orchestrator["06:00 · Orchestrator"]
        direction LR
        O1["ingest"] --> O2["transform"]
        O2 --> O3["test"]
        O3 --> O4["notify"]
        O1 -. retry on<br/>transient failure .-> O1
        O3 -. skip on<br/>upstream failure .-> O4
    end
```
The cron side is a single black box: one command, no visibility, no recovery. The orchestrator side is a graph of tasks with explicit edges for success, failure, and retry.
<aside> 🤓 Curious Geek: Why DAGs are "acyclic"
In graph theory, "acyclic" means "no loops." Airflow DAGs cannot contain cycles because cyclic dependencies have no valid execution order. This design forces pipelines to be explicit and executable.
</aside>
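"No valid execution order" is not hand-waving: it is exactly what topological sorting proves. The sketch below uses Kahn's algorithm to derive a run order from a dependency map, and raises when a cycle makes that impossible. This is the general idea behind any DAG scheduler, not Airflow's actual internals; the task names are illustrative.

```python
from collections import deque

def topo_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order for a task graph, or raise on a cycle.

    `deps` maps each task to the set of tasks it depends on.
    """
    indegree = {t: len(d) for t, d in deps.items()}
    downstream: dict[str, list[str]] = {t: [] for t in deps}
    for task, upstreams in deps.items():
        for up in upstreams:
            downstream[up].append(task)

    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(deps):  # some tasks never became ready: a cycle
        raise ValueError("cycle detected: no valid execution order")
    return order

print(topo_order({"ingest": set(), "transform": {"ingest"}, "test": {"transform"}}))
# → ['ingest', 'transform', 'test']
```

A cyclic graph such as `{"a": {"b"}, "b": {"a"}}` leaves every task waiting on another, so the function raises instead of returning an order.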
This is a preview. Later chapters unpack idempotency and backfills in detail; treat the terms introduced here as breadcrumbs, not the full story.
<aside> ⚠️ If tasks are not idempotent, retries and backfills can create duplicate or corrupt outputs.
</aside>
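A tiny illustration of that warning, with a dict standing in for a database table and a made-up date partition key. The append-style task duplicates data on retry; the overwrite-a-partition task can be re-run safely any number of times.

```python
def append_rows(table: list, date: str, rows: list) -> None:
    """Non-idempotent: every re-run adds the rows again."""
    table.extend((date, r) for r in rows)

def overwrite_partition(table: dict, date: str, rows: list) -> None:
    """Idempotent: every re-run replaces the same day's partition."""
    table[date] = list(rows)

naive: list = []
append_rows(naive, "2024-01-01", [1, 2])
append_rows(naive, "2024-01-01", [1, 2])   # a retry duplicates: 4 rows now

safe: dict = {}
overwrite_partition(safe, "2024-01-01", [1, 2])
overwrite_partition(safe, "2024-01-01", [1, 2])  # a retry is harmless: still 2 rows
```

In SQL terms, the same contrast is plain `INSERT` versus delete-and-reload (or upsert) scoped to the run's partition.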
By Week 11, your stack looks like this:
```mermaid
flowchart TB
    subgraph pipeline["Pipeline"]
        direction LR
        src[("Source<br/>API / files")] --> ing["Ingestion"]
        ing --> sto[("Storage<br/>Postgres / Blob")]
        sto --> dbt["dbt models<br/>+ dbt tests"]
        dbt --> bi["BI dashboard"]
    end
    orch["<b>Orchestration layer</b><br/>schedule · retries · logs · alerts"]
    orch -.-> ing
    orch -.-> sto
    orch -.-> dbt
    orch -.-> bi
    classDef layer fill:#fff4e6,stroke:#e08e45,stroke-width:2px,color:#333;
    class orch layer;
```
Without orchestration, each step might be correct in isolation but unreliable as a system.
Common orchestration tools include:

- Apache Airflow
- Prefect
- Dagster
This week uses Airflow because it is a practical standard in many teams and maps well to your Python and dbt workflow.
Your pipeline should run every morning at 06:00 UTC:

1. Ingest fresh data into storage.
2. Build models with `dbt run`.
3. Validate them with `dbt test`.
4. Alert the team if any step fails.

This chapter gives the "why." The next chapters show the "how."
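One concept worth previewing for that 06:00 schedule is the logical date (it appears again in this week's quiz): the run that starts at 06:00 processes the day that just ended, not the day it starts on. The sketch below is a simplification of Airflow's data-interval semantics, written in plain `datetime` so you can reason about it without Airflow installed.

```python
from datetime import datetime, timedelta, timezone

def logical_date(run_time: datetime) -> datetime:
    """Simplified daily-schedule semantics: the 06:00 run is "for" yesterday.

    Real Airflow exposes this as the run's logical date / data_interval_start.
    """
    return (run_time - timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )

# The run triggered at 2024-03-02 06:00 UTC processes data for 2024-03-01.
run = datetime(2024, 3, 2, 6, 0, tzinfo=timezone.utc)
print(logical_date(run).date())  # → 2024-03-01
```

This "run date minus one interval" behavior is a frequent source of confusion when you first schedule daily jobs, which is why later chapters return to it for backfills.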
You work in two environments:
| Environment | Purpose | Tooling |
|---|---|---|
| Local machine | Build and test DAG code | Astro CLI (`astro dev start`) |
| Shared class VM | Demo and teacher review | Airflow + Docker Compose |
<aside> 💭 Develop locally first. Use the shared VM for class demos and validation, not as your primary development environment.
</aside>
Before moving on to the Airflow-specific setup in the next chapter, take five minutes to put the concepts above onto paper against a pipeline you already know.
<aside> ⌨️ Hands on: Draw your current pipeline from memory and mark where orchestration should control order, retries, and alerts. Keep it to one flow with 3 to 5 steps.
</aside>
An LLM can give you fast first-pass feedback on whether the failure gates you drew are in the right places.
<aside> 💡 Using AI to help: Paste your draft task list and dependency chain (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "Which tasks are unclear or missing failure gates?"
</aside>
The same problem shows up at much larger scale in production codebases.
<aside>
💡 In the wild: Airflow itself was open-sourced by Airbnb in 2015 to solve exactly this problem at scale. Browse `apache/airflow/airflow/example_dags` to see how the project's own maintainers wire up dependencies, retries, and trigger rules. The example DAGs are the canonical reference for "what good looks like": they ship with every install.
</aside>
<aside>
🚀 Try the Week 11 quiz once you have finished the week: assets/week_11_quiz.yaml (7 questions: multiple-choice + open-ended, covering orchestration, logical date, idempotency, retries, and debugging flow).
</aside>
In the next chapter, you will set up Airflow locally and run your first DAG.