Week 10 - SQL Transformations with dbt

SQL and Jinja Templating

In the setup chapter you wrote your first model:

select
    pickup_datetime,
    ...
from {{ source('nyc_taxi', 'raw_trips') }}

You used {{ source(...) }} without ever defining what those double curly braces mean. That syntax is Jinja, the templating language dbt layers on top of SQL. This chapter explains what Jinja gives you, when it earns its keep, and when it makes a mess.

By the end of this chapter, you should be able to:

- explain how dbt renders Jinja into plain SQL before anything reaches Postgres
- use source(), ref(), and var() instead of hardcoded table names and literals
- generate repetitive SQL with {% for %} and switch behavior by environment with {% if %}
- extract repeated logic into a macro and call it from multiple columns or models
- inspect compiled SQL in target/compiled/ to debug templates
- recognize when Jinja makes a model harder to read than plain SQL

Why pure SQL is not enough

The Why use dbt chapter covered why raw SQL views in the database lack version control, build order, and reuse. Jinja solves a narrower, related problem: SQL itself has no abstractions. No variables, no functions, no conditionals that survive past a single query. Once you try to share a fragment of SQL across models (a filter, a cast, a column-mapping case), you hit a wall.

Suppose your project has 30 models, all reading from raw_trips, all needing to filter out trips with negative fares.

Choice 1: Copy-paste the filter into every model. The where fare_amount > 0 line lives in 30 files. When the rule changes to fare_amount > 0 and fare_amount < 1000, you change 30 files. The next analyst forgets one, and a dashboard quietly includes trips the rule was meant to exclude.

Choice 2: Build a view in the database. You create a clean_trips view once and have all 30 models read from it. Better, but the view's definition lives in the database, not in git. The definition in one environment can silently drift from the definition in another. There is no PR diff, no review, no history.

Choice 3: Generate the SQL with a Python script. You write a script that produces 30 .sql files from a template, then commit the generated files. Now you have two layers to debug: the template logic and the SQL output. Every change is two diffs, and the generated files still need to be kept in sync.

dbt's answer is to put the templating layer inside the SQL files themselves and run it at compile time. You write SQL with placeholders, dbt resolves them against the project, and the rendered SQL goes to the database. Templates and output stay in lockstep because dbt regenerates the output on every run.

<aside> 💭 The same problem recurs in every codebase that produces text. Web frameworks have Jinja for HTML, infrastructure tools have Jinja for YAML, and dbt has Jinja for SQL. Once you have used one, the others are familiar.

</aside>

How Jinja runs inside dbt

A dbt model file is two languages stacked on top of each other: Jinja is processed first, SQL second.

flowchart TB
    source["**stg_trips.sql**<br/>(your file: SQL + Jinja)"]:::source
    compiled["**target/compiled/.../stg_trips.sql**<br/>(pure SQL, no curly braces)"]:::compiled
    db["**Azure PostgreSQL**<br/>(executes the rendered SQL)"]:::db

    source -->|"step 1: dbt renders Jinja"| compiled
    compiled -->|"step 2: adapter sends SQL to Postgres"| db

    classDef source fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef compiled fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
    classDef db fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000

There are only two Jinja markers you need to recognize:

- {{ ... }} is an expression: it renders a value into the SQL. {{ source(...) }}, {{ ref(...) }}, and {{ var(...) }} all produce text that lands in the compiled query.
- {% ... %} is a statement: control flow such as {% if %}, {% for %}, and {% macro %} that shapes the output but renders nothing itself.

Postgres never sees a curly brace. By the time the SQL reaches the database, dbt has already replaced every {{ }} and {% %} with plain text.

<aside> 💡 If you make a Jinja mistake (typo in a variable name, unclosed {% if %}), dbt fails at the compile step and never sends anything to the database. You will not see a "syntax error near {{" from Postgres.

</aside>

The two functions you have already used

source() and ref() do two things at once: they resolve a table name for the SQL, and they register a dependency in the DAG. The diagram below shows how these calls wire up the NYC taxi project you are building:

graph LR
    src_trips[("raw_trips<br/><code>source('nyc_taxi', 'raw_trips')</code>")]:::source
    src_zones[("raw_zones<br/><code>source('nyc_taxi', 'raw_zones')</code>")]:::source

    stg_trips["stg_trips.sql<br/><code>{{ source(...) }}</code>"]:::model
    stg_zones["stg_zones.sql<br/><code>{{ source(...) }}</code>"]:::model

    int_enriched["int_trips_enriched.sql<br/><code>{{ ref('stg_trips') }}</code><br/><code>{{ ref('stg_zones') }}</code>"]:::model

    fct_trips["fct_trips.sql<br/><code>{{ ref('int_trips_enriched') }}</code>"]:::model

    src_trips --> stg_trips
    src_zones --> stg_zones
    stg_trips --> int_enriched
    stg_zones --> int_enriched
    int_enriched --> fct_trips

    classDef source fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
    classDef model fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000

Every arrow exists because a model file contains a source() or ref() call. dbt reads those calls, builds this graph, and runs models in dependency order. If you hardcode a table name instead (from "public"."raw_trips"), the arrow disappears and dbt no longer knows about the relationship.
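The build ordering that falls out of those arrows is just a topological sort over the dependency graph. As a rough illustration (not dbt's actual internals), here is a stdlib-only Python sketch using the edges from the diagram above:

```python
from graphlib import TopologicalSorter

# Dependencies read off the source()/ref() calls in the diagram above:
# each model maps to the set of things it selects from.
deps = {
    "stg_trips": {"raw_trips"},
    "stg_zones": {"raw_zones"},
    "int_trips_enriched": {"stg_trips", "stg_zones"},
    "fct_trips": {"int_trips_enriched"},
}

# static_order() yields every node after all of its dependencies:
# raw sources first, fct_trips last.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

Delete one ref() call from the deps dict and the corresponding edge vanishes, which is exactly what happens in dbt when you hardcode a table name.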

source(): refer to a raw table

You met {{ source('nyc_taxi', 'raw_trips') }} in the setup chapter. The first argument is the source group name from _sources.yml; the second is the table name. dbt resolves it to the fully qualified "public"."raw_trips" at compile time, and registers the dependency shown as the purple-to-blue arrows above.
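For reference, a minimal _sources.yml backing that call might look like the following sketch (names match the chapter's project; descriptions, column lists, and freshness checks are omitted):

```yaml
# models/staging/_sources.yml (minimal sketch)
version: 2

sources:
  - name: nyc_taxi        # source group name, first argument to source()
    schema: public        # where the raw tables live in Postgres
    tables:
      - name: raw_trips   # second argument to source()
      - name: raw_zones
```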

ref(): refer to another model

When one model selects from another, you use {{ ref('model_name') }} instead of hardcoding the table name:

-- models/staging/stg_clean_trips.sql
select *
from {{ ref('stg_trips') }}
where fare_amount > 0

dbt resolves {{ ref('stg_trips') }} to the schema-qualified name of the stg_trips model in the current target. In your dev target, that is "dev_<your_name>"."stg_trips". In a CI job running against the prod target, it would be "analytics"."stg_trips". Same source file, different output schemas, no code change.
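To make that target-dependent resolution concrete, here is a tiny illustrative Python sketch of the mapping (the schema names are the chapter's examples; "dev_jane" is a hypothetical stand-in for dev_<your_name>):

```python
# Illustrative sketch of how ref() resolves to a schema-qualified name
# depending on the active target. "dev_jane" is a hypothetical stand-in
# for the per-developer dev schema; "analytics" is the prod schema from
# the chapter's example.
TARGET_SCHEMAS = {"dev": "dev_jane", "prod": "analytics"}

def resolve_ref(model_name: str, target_name: str) -> str:
    schema = TARGET_SCHEMAS[target_name]
    return f'"{schema}"."{model_name}"'

print(resolve_ref("stg_trips", "prod"))  # "analytics"."stg_trips"
```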

ref() also tells dbt that this model depends on stg_trips (the blue-to-blue arrows in the diagram), so the DAG knows to build stg_trips first. This is the mechanism behind everything you saw in the introduction: the dependency graph, the build ordering, the lineage docs.

<aside> ⚠️ Never hardcode table names in a model. Even if the schema name is correct today, hardcoding it breaks dependency tracking. The arrow disappears from the DAG; the lineage view will show your model as having no parents, and dbt will no longer know to build the upstream models first.

</aside>

var(): configurable values from the project

When a value should change between projects or environments (a date cutoff, a country filter, a minimum fare), put it in dbt_project.yml and read it with {{ var() }}.

# dbt_project.yml
vars:
  min_fare_amount: 0.01
  start_date: '2024-01-01'

-- models/staging/stg_trips_filtered.sql
select *
from {{ ref('stg_trips') }}
where fare_amount >= {{ var('min_fare_amount') }}
  and pickup_datetime >= '{{ var("start_date") }}'

You can also override a var on the command line for one-off runs:

dbt run --select stg_trips_filtered --vars '{"min_fare_amount": 1.00}'

Control flow: {% if %} and {% for %}

Jinja lets you generate repetitive SQL from a list and switch behavior based on environment. Two examples follow, both using the NYC taxi data.

{% for %}: generate a CASE WHEN from a list

The payment_type column in raw_trips is an integer code. The TLC publishes the meaning of each code:

Code  Label
1     Credit card
2     Cash
3     No charge
4     Dispute
5     Unknown
6     Voided trip

You could write the case by hand:

case payment_type
    when 1 then 'Credit card'
    when 2 then 'Cash'
    when 3 then 'No charge'
    when 4 then 'Dispute'
    when 5 then 'Unknown'
    when 6 then 'Voided trip'
end as payment_type_label

That works for six values. It does not scale to 200 zone IDs, and it is silently wrong if the list changes. With Jinja, you describe the data once and let dbt expand the SQL:

case payment_type
{% set payment_types = {
    1: 'Credit card',
    2: 'Cash',
    3: 'No charge',
    4: 'Dispute',
    5: 'Unknown',
    6: 'Voided trip'
} %}
{% for code, label in payment_types.items() %}
    when {{ code }} then '{{ label }}'
{% endfor %}
end as payment_type_label

After compilation, both versions produce identical SQL. The Jinja version stays correct when the list grows, and the source of truth is one dictionary.
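If you want to see the expansion without running dbt at all, here is a stdlib-only Python sketch approximating what the renderer does with that {% for %} loop (dbt uses the real Jinja engine internally; this just mimics the effect):

```python
# Approximate what dbt's Jinja renderer does with the {% for %} loop above:
# expand one dictionary into CASE WHEN branches.
payment_types = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

lines = ["case payment_type"]
for code, label in payment_types.items():
    lines.append(f"    when {code} then '{label}'")
lines.append("end as payment_type_label")

compiled_sql = "\n".join(lines)
print(compiled_sql)
```

Adding a seventh payment code means adding one dictionary entry; the CASE expression regenerates itself on the next compile.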

{% if %}: switch behavior by environment

target.name exposes the active target ("dev", "prod", etc.) inside Jinja. A common pattern is to sample a small slice of the data in dev and the full set in prod:

select *
from {{ source('nyc_taxi', 'raw_trips') }}
{% if target.name == 'dev' %}
    where pickup_datetime >= '2024-01-15'
{% endif %}

With dbt run --target dev you build only the trimmed slice in seconds. With dbt run --target prod you build the full month. The model file does not need a "dev" copy and a "prod" copy.
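The key point is that the branch runs at compile time, while the SQL text is being produced, not while the query executes. A stdlib-only Python sketch of that behavior:

```python
# Sketch of compile-time branching: the condition is evaluated while the
# SQL string is built, so the prod output contains no trace of the dev filter.
def render(target_name: str) -> str:
    sql = 'select *\nfrom "public"."raw_trips"'
    if target_name == "dev":  # plays the role of {% if target.name == 'dev' %}
        sql += "\nwhere pickup_datetime >= '2024-01-15'"
    return sql

print(render("prod"))  # compiled prod SQL: no where clause at all
```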

Macros: reusable Jinja functions

A macro is a named Jinja function. You define it once in macros/, and call it from any model. Macros are how you stop copy-pasting small SQL fragments.

A real example from the NYC taxi data: any analyst looking at the trips wants useful per-trip metrics like tip percentage (tip_amount / fare_amount) and fare per mile (fare_amount / trip_distance). Both are simple divisions. Both have the same trap: a trip with fare_amount = 0 or trip_distance = 0 will throw a division by zero error and abort the run. The fix is a nullif guard on the denominator, written the same way every time:

tip_amount / nullif(fare_amount, 0) as tip_pct,
fare_amount / nullif(trip_distance, 0) as fare_per_mile

That pattern is going to repeat: per-passenger fare, fare per minute, surcharge ratio, and so on. Extract it once into a macro:

-- macros/safe_divide.sql
{% macro safe_divide(numerator, denominator) %}
    {{ numerator }} / nullif({{ denominator }}, 0)
{% endmacro %}

Then use it in stg_trips.sql to add two analytics columns to the model you built in the setup chapter:

-- models/staging/stg_trips.sql
select
    pickup_datetime,
    dropoff_datetime,
    pickup_location_id,
    dropoff_location_id,
    fare_amount,
    tip_amount,
    trip_distance,
    {{ safe_divide('tip_amount', 'fare_amount') }} as tip_pct,
    {{ safe_divide('fare_amount', 'trip_distance') }} as fare_per_mile
from {{ source('nyc_taxi', 'raw_trips') }}
where pickup_location_id is not null

Two callers of the same macro in a single model already make the case for extracting it: the rule for handling zero denominators lives in one place. When the team decides "actually, return 0 instead of NULL for zero-distance trips," you edit the macro and both columns update.
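The guard's semantics are easy to sanity-check outside the warehouse. In Postgres, nullif(x, 0) returns NULL when x is 0, and dividing by NULL yields NULL rather than an error; here is a stdlib-Python sketch mirroring that contract:

```python
from typing import Optional

def safe_divide(numerator: float, denominator: float) -> Optional[float]:
    """Mirror the SQL pattern `numerator / nullif(denominator, 0)`:
    a zero denominator yields NULL (None) instead of raising."""
    if denominator == 0:
        return None  # nullif(denominator, 0) -> NULL; NULL propagates
    return numerator / denominator

assert safe_divide(2.0, 10.0) == 0.2   # tip_pct: $2 tip on a $10 fare
assert safe_divide(5.0, 0.0) is None   # zero-distance trip: NULL, not an error
```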

<aside> 💡 You can verify both columns immediately. Run dbt run --select stg_trips, then select tip_pct, fare_per_mile from dev_<your_name>.stg_trips limit 10; in psql. The average across the January 2024 dataset is roughly 15% tip and around $12 per mile (NYC's mix of long airport trips and short Manhattan rides skews the per-mile number high).

</aside>

Inspecting compiled SQL

Whenever you write Jinja, your debugging move is to look at the rendered SQL.

dbt compile --select stg_trips

dbt compile runs steps 1 and 2 of the parse/compile/wrap pipeline from the setup chapter (parse the project, render Jinja) and stops without sending anything to the database. The output goes to target/compiled/nyc_taxi/models/staging/stg_trips.sql. Open it side-by-side with your source file. If the Jinja did what you expected, the output looks like clean SQL. If it did not, the difference jumps out immediately.

⌨️ Hands on: extract safe_divide and watch dbt compile it

This is your first time editing a model and a macro together. The point is to see with your own eyes, in target/compiled/, that {{ safe_divide(...) }} is expanded into literal SQL before anything reaches the database.

  1. Add the safe_divide macro to macros/safe_divide.sql.
  2. Update stg_trips.sql to use it for both tip_pct and fare_per_mile.
  3. Compile without running:
   dbt compile --select stg_trips
  4. Open target/compiled/nyc_taxi/models/staging/stg_trips.sql and confirm both {{ safe_divide(...) }} calls have been replaced by literal tip_amount / nullif(fare_amount, 0) (and the equivalent for fare_per_mile).
  5. Run the model and verify it works against real data:
   dbt run --select stg_trips
   SELECT AVG(tip_pct) FROM dev_<your_name>.stg_trips;

When Jinja hurts more than it helps

Jinja is a power tool. The same features that let you DRY up 30 models can also turn a 20-line model into 200 lines of macro spaghetti that nobody can read. Three patterns to avoid (for a full set of annotated examples, see good patterns (gist) and bad patterns with fixes (gist)):

1. Macros for things that are not actually repeated

A macro you call once, in one model, is worse than the inline SQL it replaces. The reader now has to open a second file to understand what the model does. The rule of thumb: extract a macro when you have multiple real callers doing the exact same thing today, not when you imagine you might in the future.

2. {% if %} chains that hide business logic

This is unreadable:

select
    fare_amount,
    {% if target.name == 'prod' and var('include_tax', false) %}
        fare_amount + mta_tax + improvement_surcharge as total
    {% elif target.name == 'dev' %}
        fare_amount as total
    {% else %}
        null as total
    {% endif %}
from {{ ref('stg_trips') }}

Rule of thumb: {% if %} is for environment switching (target.name, debug flags), not for varying business definitions. If "what counts as total fare" depends on configuration, build two models and let the consumer pick.

3. Over-templated string concatenation

This compiles to working SQL, but a future you (or a teammate) has to mentally render the Jinja before they can read the query:

select
    {% for col in ['fare_amount', 'tip_amount', 'tolls_amount'] %}
    {{ col }}{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_trips') }}

The plain-SQL version is shorter and clearer:

select
    fare_amount,
    tip_amount,
    tolls_amount
from {{ ref('stg_trips') }}

<aside> 💡 The right reading of "DRY" in dbt is "every business rule is defined in one place." It is not "every typed character appears once." Repeating three column names in a select is fine. Repeating a case expression in three models is not.

</aside>

With that rule of thumb in hand, you have enough to exercise the patterns from this chapter in the practice file.

<aside> 💡 Using AI to help: When a compiled SQL file looks wrong and you cannot spot why, paste both the source (the .sql with Jinja tags) and the compiled output (target/compiled/...) into an LLM with the prompt "explain what dbt did to transform the first file into the second, line by line." This is faster than reading the Jinja docs and catches silent whitespace / loop-variable mistakes. ⚠️ Public TLC column names are safe; for real projects, scrub table and column names that might hint at customer PII before sending.

</aside>

The matching practice exercises live in the Practice chapter, one per concept introduced above.