Week 11 - Orchestration

Introduction to Orchestration

Airflow Fundamentals

Scheduling and Triggers

Sequential Pipeline Steps

Parameterized Runs and Backfills

Testing DAGs

Monitoring and Debugging

Deploying to Shared Airflow

Practice

Gotchas & Pitfalls

Assignment: Build an Orchestrated Data Pipeline

Week 11 Lesson Plan (Teachers)

Introduction to Orchestration

In Weeks 3 through 10, you built important pipeline pieces: ingestion, transformations, tests, and cloud deployment patterns. In production, those pieces must run in the right order, at the right time, and with clear failure visibility. That coordination layer is called orchestration.

By the end of this chapter, you should be able to:

- Explain what orchestration is and why production pipelines need it.
- Distinguish orchestration from plain scheduling.
- Describe the core concepts of DAGs, retries, idempotency, backfills, and logical dates.
- Place the orchestration layer in the stack you built in Weeks 3 through 10.

Concepts

What orchestration means

Orchestration coordinates pipeline tasks, dependencies, retries, and run history across your full workflow.

Think about a daily flow:

  1. Ingest raw data.
  2. Validate and store raw data.
  3. Run dbt models.
  4. Run dbt tests.
  5. Notify on success or failure.

If one step fails, the next step should not run. If a transient API failure occurs, a retry should happen automatically. If a run misses its window, someone should know. Orchestration handles all of this.
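To make this concrete, here is a minimal sketch of that daily flow as an Airflow DAG. It assumes a recent Airflow 2.x; every task body is a placeholder, and the dag_id, schedule, and retry settings are illustrative, not a finished pipeline.

```python
# Minimal sketch of the daily flow above (recent Airflow 2.x assumed;
# all names and settings here are illustrative placeholders).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_flow_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",  # every day at 06:00 UTC
    catchup=False,
    default_args={
        "retries": 2,                         # retry transient failures...
        "retry_delay": timedelta(minutes=5),  # ...after a short pause
    },
) as dag:
    ingest = EmptyOperator(task_id="ingest_raw")
    validate = EmptyOperator(task_id="validate_and_store")
    dbt_run = EmptyOperator(task_id="dbt_run")
    dbt_test = EmptyOperator(task_id="dbt_test")
    notify = EmptyOperator(task_id="notify")

    # A failed task blocks everything downstream of it by default.
    ingest >> validate >> dbt_run >> dbt_test >> notify
```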

<aside> 💡 Scheduling answers "when to start." Orchestration answers "what to run, in what order, with what recovery behavior."

</aside>

Why manual runs fail at scale

Manual execution works for experiments, but it breaks in production:

- Runs get forgotten, started late, or executed out of order.
- Transient failures are not retried, so one flaky API call ends the run.
- There is no shared run history, so failures stay invisible until something downstream breaks.

In data work, a "small" miss can become a business incident: stale dashboards, wrong KPIs, or incomplete model outputs.

Orchestration vs scheduling

A scheduler, like cron, fires a command at a fixed time and knows nothing about what happens afterward.

An orchestrator adds:

- Dependencies: tasks run in a defined order, and a failure blocks downstream work.
- Retries: transient failures are retried automatically.
- Run history: every run leaves logs and per-task status you can inspect.
- Alerts: failed or missed runs notify someone instead of failing silently.

```mermaid
flowchart LR
    subgraph Cron["06:00 · Cron"]
        direction LR
        C1["run script.py"]
    end
    subgraph Orchestrator["06:00 · Orchestrator"]
        direction LR
        O1["ingest"] --> O2["transform"]
        O2 --> O3["test"]
        O3 --> O4["notify"]
        O1 -. retry on<br/>transient failure .-> O1
        O3 -. skip on<br/>upstream failure .-> O4
    end
```

The cron side is a single black box: one command, no visibility, no recovery. The orchestrator side is a graph of tasks with explicit edges for success, failure, and retry.
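Airflow gives you the "skip on upstream failure" edge by default: a failed task blocks its downstream tasks. When you want the opposite for a notify task, so that a broken run still raises an alert, that edge becomes a trigger rule. A minimal sketch (the task itself is a placeholder):

```python
# Sketch: a notify task that fires once all upstream tasks have finished,
# whether they succeeded or failed, so a broken run still produces an alert.
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

notify = EmptyOperator(
    task_id="notify",
    trigger_rule=TriggerRule.ALL_DONE,  # default is ALL_SUCCESS
)
```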

Key concepts you will use all week

- DAG (directed acyclic graph): your pipeline expressed as tasks plus dependencies, with no cycles.
- Task: a single unit of work in the DAG, such as one ingest step or one dbt command.
- Logical date: the data interval a run covers, which is not the wall-clock time it executes.
- Retry: an automatic re-execution of a task after a transient failure.
- Idempotency: running the same task twice for the same interval produces the same result.
- Backfill: re-running past intervals to fill gaps or repair history.

<aside> 🤓 Curious Geek: Why DAGs are "acyclic"

In graph theory, "acyclic" means "no loops." Airflow DAGs cannot contain cycles because cyclic dependencies have no valid execution order. This design forces pipelines to be explicit and executable.

</aside>

This is a preview list. Later chapters unpack idempotency and backfills in detail; treat the definitions above as breadcrumbs, not the full story.

<aside> ⚠️ If tasks are not idempotent, retries and backfills can create duplicate or corrupt outputs.

</aside>
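One pattern that makes retries and backfills safe is delete-then-insert per interval: rerunning the load for a day replaces that day's rows instead of appending duplicates. A minimal sketch, using SQLite only for brevity; the table and function names are hypothetical:

```python
import sqlite3


def load_day(conn: sqlite3.Connection, day: str, rows: list[str]) -> None:
    """Idempotently load one day's rows: rerunning for the same day
    replaces that day's partition instead of appending duplicates."""
    with conn:  # one transaction: delete + insert commit (or roll back) together
        conn.execute("DELETE FROM raw_events WHERE event_date = ?", (day,))
        conn.executemany(
            "INSERT INTO raw_events (event_date, payload) VALUES (?, ?)",
            [(day, payload) for payload in rows],
        )
```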

Where orchestration fits in your stack

By Week 11, your stack looks like this:

```mermaid
flowchart TB
    subgraph pipeline["Pipeline"]
        direction LR
        src[("Source<br/>API / files")] --> ing["Ingestion"]
        ing --> sto[("Storage<br/>Postgres / Blob")]
        sto --> dbt["dbt models<br/>+ dbt tests"]
        dbt --> bi["BI dashboard"]
    end
    orch["<b>Orchestration layer</b><br/>schedule · retries · logs · alerts"]
    orch -.-> ing
    orch -.-> sto
    orch -.-> dbt
    orch -.-> bi

    classDef layer fill:#fff4e6,stroke:#e08e45,stroke-width:2px,color:#333;
    class orch layer;
```

Without orchestration, each step might be correct in isolation but unreliable as a system.

Tools you should know

Common orchestration tools include:

- Apache Airflow: the most widely adopted open-source orchestrator; DAGs are defined in Python.
- Prefect and Dagster: newer Python-native orchestrators with a similar task-graph model.
- Managed Airflow services, such as Google Cloud Composer and Amazon MWAA.

This week uses Airflow because it is a practical standard in many teams and maps well to your Python and dbt workflow.

A concrete Week 11 scenario

Your pipeline should run every morning at 06:00 UTC:

  1. Pull yesterday's data from an external source (see the sketch after this list).
  2. Load it into your raw tables.
  3. Run dbt run to rebuild models.
  4. Run dbt test to validate them.
  5. Report failures quickly.
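Step 1 is where the logical date matters: "yesterday's data" should come from the run's data interval, not from the wall clock, or backfills will re-pull the wrong day. A hedged sketch; ingest.py and its --date flag are hypothetical, while {{ ds }} is Airflow's built-in template variable for the run's logical date as YYYY-MM-DD:

```python
# Sketch: pass the run's logical date to the ingest step so a backfill for
# 2024-05-01 pulls 2024-05-01's data, not "today minus one".
from airflow.operators.bash import BashOperator

ingest = BashOperator(
    task_id="ingest",
    # {{ ds }} renders as the run's logical date (YYYY-MM-DD);
    # ingest.py and its --date flag are hypothetical.
    bash_command="python ingest.py --date {{ ds }}",
)
```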

This chapter gives the "why." The next chapters show the "how."

Two environments for Week 11

You work in two environments:

| Environment | Purpose | Tooling |
| --- | --- | --- |
| Local machine | Build and test DAG code | Astro CLI (astro dev start) |
| Shared class VM | Demo and teacher review | Airflow + Docker Compose |

<aside> 💭 Develop locally first. Use the shared VM for class demos and validation, not as your primary development environment.

</aside>

Before moving on to the Airflow-specific setup in the next chapter, take five minutes to test the concepts above on paper against a pipeline you already know.

<aside> ⌨️ Hands on: Draw your current pipeline from memory and mark where orchestration should control order, retries, and alerts. Keep it to one flow with 3 to 5 steps.

</aside>

An LLM can give you fast first-pass feedback on whether the failure gates you drew are in the right places.

<aside> 💡 Using AI to help: Paste your draft task list and dependency chain (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "Which tasks are unclear or missing failure gates?"

</aside>

The same problem shows up at much larger scale in production codebases.

<aside> 💡 In the wild: Airflow itself was open-sourced by Airbnb in 2015 to solve exactly this problem at scale. Browse apache/airflow/airflow/example_dags to see how the project's own maintainers wire up dependencies, retries, and trigger rules. The example DAGs are the canonical reference for "what good looks like": they ship with every install.

</aside>

Exercises

  1. Describe the difference between scheduling and orchestration in your own words.
  2. List three risks of running a multi-step pipeline manually.
  3. Draw a simple DAG for: ingest -> transform -> test -> notify.

Knowledge Check

  1. Why is orchestration more than a cron schedule?
  2. What does idempotency protect you from during retries or backfills?
  3. Why should a failed upstream task block downstream tasks?
  4. When would manual scripts still be enough?

<aside> 🚀 Try the Week 11 quiz once you have finished the week: assets/week_11_quiz.yaml (7 questions: multiple-choice + open-ended, covering orchestration, logical date, idempotency, retries, and debugging flow).

</aside>

In the next chapter, you will set up Airflow locally and run your first DAG.

Extra reading