Week 10 - SQL Transformations with dbt
In the setup chapter you wrote your first model:
select
pickup_datetime,
...
from {{ source('nyc_taxi', 'raw_trips') }}
You used {{ source(...) }} without ever defining what those double curly braces mean. That syntax is Jinja, the templating language dbt layers on top of SQL. This chapter explains what Jinja gives you, when it earns its keep, and when it makes a mess.
By the end of this chapter, you should be able to:
- Use {{ source() }}, {{ ref() }}, and {{ var() }} correctly.
- Use control-flow tags ({% if %}, {% for %}) inside a select statement.
- Define a macro with {% macro %} and call it from a model.
- Use dbt compile to inspect the SQL dbt actually sends to the database.

Why use dbt covered why raw SQL views in the database lack version control, build order, and reuse. Jinja solves a narrower, related problem: SQL itself has no abstractions. No variables, no functions, no conditionals that survive past a single query. Once you try to share a fragment of SQL across models (a filter, a cast, a column-mapping case), you hit a wall.
Suppose your project has 30 models, all reading from raw_trips, all needing to filter out trips with negative fares.
Choice 1: Copy-paste the filter into every model. The where fare_amount > 0 line lives in 30 files. When the rule changes to fare_amount > 0 and fare_amount < 1000, you change 30 files. The next analyst forgets one. Now your dashboard quietly includes a $0 trip nobody intended.
Choice 2: Build a view in the database. You create a clean_trips view once and have all 30 models read from it. Better, but the view's definition lives in the database, not in git. Two engineers running the same view definition can drift apart silently. There is no PR diff, no review, no history.
Choice 3: Generate the SQL with a Python script. You write a script that produces 30 .sql files from a template, then commit the generated files. Now you have two layers to debug: the template logic and the SQL output. Every change is two diffs, and the generated files still need to be kept in sync.
dbt's answer is to put the templating layer inside the SQL files themselves and run it at compile time. You write SQL with placeholders, dbt resolves them against the project, and the rendered SQL goes to the database. Templates and output stay in lockstep because dbt regenerates the output on every run.
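To make the compile-then-execute idea concrete, here is a minimal Python sketch. This is not dbt's real implementation, just the core move: placeholders are resolved into plain SQL before anything reaches the database.

```python
# Hypothetical mini-renderer -- NOT dbt's actual code, just the idea:
# a model file contains placeholders, and rendering replaces them with
# resolved table names before any SQL is sent to the database.

def render(template: str, lookups: dict) -> str:
    """Replace each {{ key }} placeholder with its resolved value."""
    out = template
    for key, value in lookups.items():
        out = out.replace("{{ " + key + " }}", value)
    return out

model_sql = "select * from {{ source_raw_trips }} where fare_amount > 0"
compiled = render(model_sql, {"source_raw_trips": '"public"."raw_trips"'})

print(compiled)
# select * from "public"."raw_trips" where fare_amount > 0
```

The database only ever sees the rendered string; no curly braces survive the rendering step.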
<aside> 💭 The same problem recurs in every codebase that produces text. Web frameworks have Jinja for HTML, infrastructure tools have Jinja for YAML, and dbt has Jinja for SQL. Once you have used one, the others are familiar.
</aside>
A dbt model file is two languages stacked on top of each other: Jinja is processed first, SQL second.
flowchart TB
source["**stg_trips.sql**<br/>(your file: SQL + Jinja)"]:::source
compiled["**target/compiled/.../stg_trips.sql**<br/>(pure SQL, no curly braces)"]:::compiled
db["**Azure PostgreSQL**<br/>(executes the rendered SQL)"]:::db
source -->|"step 1: dbt renders Jinja"| compiled
compiled -->|"step 2: adapter sends SQL to Postgres"| db
classDef source fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
classDef compiled fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
classDef db fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
There are only two Jinja markers you need to recognize:
- {{ ... }} prints a value. {{ ref('stg_trips') }} becomes "dev_<your_name>"."stg_trips" in the compiled SQL.
- {% ... %} runs logic but prints nothing by itself. {% if target.name == 'dev' %} decides which SQL lines appear in the output, but the {% if %} tag itself disappears.

Postgres never sees a curly brace. By the time the SQL reaches the database, dbt has already replaced every {{ }} and {% %} with plain text.
<aside>
💡 If you make a Jinja mistake (typo in a variable name, unclosed {% if %}), dbt fails at the compile step and never sends anything to the database. You will not see a "syntax error near {{" from Postgres.
</aside>
source() and ref() do two things at once: they resolve a table name for the SQL, and they register a dependency in the DAG. The diagram below shows how these calls wire up the NYC taxi project you are building:
graph LR
src_trips[("raw_trips<br/><code>source('nyc_taxi', 'raw_trips')</code>")]:::source
src_zones[("raw_zones<br/><code>source('nyc_taxi', 'raw_zones')</code>")]:::source
stg_trips["stg_trips.sql<br/><code>{{ source(...) }}</code>"]:::model
stg_zones["stg_zones.sql<br/><code>{{ source(...) }}</code>"]:::model
int_enriched["int_trips_enriched.sql<br/><code>{{ ref('stg_trips') }}</code><br/><code>{{ ref('stg_zones') }}</code>"]:::model
fct_trips["fct_trips.sql<br/><code>{{ ref('int_trips_enriched') }}</code>"]:::model
src_trips --> stg_trips
src_zones --> stg_zones
stg_trips --> int_enriched
stg_zones --> int_enriched
int_enriched --> fct_trips
classDef source fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
classDef model fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
Every arrow exists because a model file contains a source() or ref() call. dbt reads those calls, builds this graph, and runs models in dependency order. If you hardcode a table name instead (from "public"."raw_trips"), the arrow disappears and dbt no longer knows about the relationship.
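The mechanism behind those arrows can be sketched in a few lines of Python: scan each model's SQL for ref() calls, treat each call as an edge, and sort topologically. This is illustrative only (model names mirror the diagram above), not how dbt actually parses projects.

```python
# Sketch of deriving build order from ref() calls -- illustrative, not
# dbt's real parser. Model names mirror the lineage diagram above.
import re
from graphlib import TopologicalSorter

models = {
    "stg_trips": "select * from {{ source('nyc_taxi', 'raw_trips') }}",
    "stg_zones": "select * from {{ source('nyc_taxi', 'raw_zones') }}",
    "int_trips_enriched": (
        "select * from {{ ref('stg_trips') }} "
        "join {{ ref('stg_zones') }} using (location_id)"
    ),
    "fct_trips": "select * from {{ ref('int_trips_enriched') }}",
}

# Each ref('x') call registers x as a predecessor of the calling model.
deps = {
    name: set(re.findall(r"ref\('(\w+)'\)", sql))
    for name, sql in models.items()
}

build_order = list(TopologicalSorter(deps).static_order())
print(build_order)  # parents always appear before their children
```

Hardcoding a table name instead of calling ref() removes the match from the regex, and with it the edge, which is exactly why the arrow disappears from the DAG.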
source(): refer to a raw table

You met {{ source('nyc_taxi', 'raw_trips') }} in the setup chapter. The first argument is the source group name from _sources.yml; the second is the table name. dbt resolves it to the fully qualified "public"."raw_trips" at compile time, and registers the dependency shown as the purple-to-blue arrows above.
ref(): refer to another model

When one model selects from another, you use {{ ref('model_name') }} instead of hardcoding the table name:
-- models/staging/stg_clean_trips.sql
select *
from {{ ref('stg_trips') }}
where fare_amount > 0
dbt resolves {{ ref('stg_trips') }} to the schema-qualified name of the stg_trips model in the current target. In your dev target, that is "dev_<your_name>"."stg_trips". In a CI job running against the prod target, it would be "analytics"."stg_trips". Same source file, different output schemas, no code change.
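The target-dependent resolution can be pictured as a simple lookup. A hedged Python sketch (the schema names are the illustrative ones from the prose above, not a real dbt API):

```python
# Sketch of target-dependent ref() resolution. The schema names are
# illustrative, matching the prose above -- not dbt internals.
schemas = {"dev": "dev_<your_name>", "prod": "analytics"}

def ref(model_name: str, target_name: str) -> str:
    """Render a schema-qualified relation name for the active target."""
    return f'"{schemas[target_name]}"."{model_name}"'

print(ref("stg_trips", "dev"))   # "dev_<your_name>"."stg_trips"
print(ref("stg_trips", "prod"))  # "analytics"."stg_trips"
```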
ref() also tells dbt that this model depends on stg_trips (the blue-to-blue arrows in the diagram), so the DAG knows to build stg_trips first. This is the mechanism behind everything you saw in the introduction: the dependency graph, the build ordering, the lineage docs.
<aside> ⚠️ Never hardcode table names in a model. Even if the schema name is correct today, hardcoding it breaks dependency tracking. The arrow disappears from the DAG: the lineage view will show your model as having no parents, and dbt will not rebuild upstream models when your model is selected.
</aside>
var(): configurable values from the project

When a value should change between projects or environments (a date cutoff, a country filter, a minimum fare), put it in dbt_project.yml and read it with {{ var() }}.
# dbt_project.yml
vars:
min_fare_amount: 0.01
start_date: '2024-01-01'
-- models/staging/stg_trips_filtered.sql
select *
from {{ ref('stg_trips') }}
where fare_amount >= {{ var('min_fare_amount') }}
and pickup_datetime >= '{{ var("start_date") }}'
You can also override a var on the command line for one-off runs:
dbt run --select stg_trips_filtered --vars '{"min_fare_amount": 1.00}'
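The precedence rule (command-line values win over project defaults) can be sketched in Python. This mirrors the behavior described above; it is an analogy, not dbt's implementation.

```python
# Sketch of var() precedence: values passed via --vars (a JSON/YAML
# string) override the defaults in dbt_project.yml. Analogy only.
import json

project_vars = {"min_fare_amount": 0.01, "start_date": "2024-01-01"}
cli_vars = json.loads('{"min_fare_amount": 1.00}')  # parsed from --vars

def var(name):
    """Return the CLI override if present, else the project default."""
    if name in cli_vars:
        return cli_vars[name]
    return project_vars[name]

print(var("min_fare_amount"))  # 1.0 -- the CLI override wins
print(var("start_date"))       # 2024-01-01 -- falls back to the project
```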
{% if %} and {% for %}

Jinja lets you generate repetitive SQL from a list, and switch behavior based on environment. Two examples on the NYC taxi data follow.
{% for %}: generate a CASE WHEN from a list

The payment_type column in raw_trips is an integer code. The TLC publishes the meaning of each code:
| Code | Label |
|---|---|
| 1 | Credit card |
| 2 | Cash |
| 3 | No charge |
| 4 | Dispute |
| 5 | Unknown |
| 6 | Voided trip |
You could write the case by hand:
case payment_type
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end as payment_type_label
That works for six values. It does not scale to 200 zone IDs, and it is silently wrong if the list changes. With Jinja, you describe the data once and let dbt expand the SQL:
case payment_type
{% set payment_types = {
1: 'Credit card',
2: 'Cash',
3: 'No charge',
4: 'Dispute',
5: 'Unknown',
6: 'Voided trip'
} %}
{% for code, label in payment_types.items() %}
when {{ code }} then '{{ label }}'
{% endfor %}
end as payment_type_label
After compilation, both versions produce identical SQL. The Jinja version stays correct when the list grows, and the source of truth is one dictionary.
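If the Jinja loop feels opaque, here is the same expansion written as plain Python; this is exactly the kind of text generation the `{% for %}` loop performs at compile time.

```python
# Python analogue of the {% for %} loop above: expand one dictionary
# into the repeated WHEN ... THEN lines of a CASE expression.
payment_types = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

lines = ["case payment_type"]
for code, label in payment_types.items():
    lines.append(f"    when {code} then '{label}'")
lines.append("end as payment_type_label")
case_sql = "\n".join(lines)

print(case_sql)  # same shape as the hand-written version
```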
{% if %}: switch behavior by environment

target.name exposes the active target ("dev", "prod", etc.) inside Jinja. A common pattern is to sample a small slice of the data in dev and the full set in prod:
select *
from {{ source('nyc_taxi', 'raw_trips') }}
{% if target.name == 'dev' %}
where pickup_datetime >= '2024-01-15'
{% endif %}
With dbt run --target dev you build a small slice of the data in seconds; with dbt run --target prod you build the full month. The model file does not need a "dev" copy and a "prod" copy.
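The key property is that the `{% if %}` tag leaves no trace in the output. A small Python analogy of what rendering produces for each target (illustrative, not dbt code):

```python
# Analogy for the environment switch: the conditional block either
# contributes its lines to the compiled SQL or disappears entirely.
def render_model(target_name: str) -> str:
    sql = 'select * from "public"."raw_trips"'
    if target_name == "dev":  # mirrors {% if target.name == 'dev' %}
        sql += "\nwhere pickup_datetime >= '2024-01-15'"
    return sql

print(render_model("dev"))   # includes the where clause
print(render_model("prod"))  # no where clause, no leftover tag
```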
A macro is a named Jinja function. You define it once in macros/, and call it from any model. Macros are how you stop copy-pasting small SQL fragments.
A real example from the NYC taxi data: any analyst looking at the trips wants useful per-trip metrics like tip percentage (tip_amount / fare_amount) and fare per mile (fare_amount / trip_distance). Both are simple divisions. Both have the same trap: a trip with fare_amount = 0 or trip_distance = 0 will throw a division by zero error and abort the run. The fix is a nullif guard on the denominator, written the same way every time:
tip_amount / nullif(fare_amount, 0) as tip_pct,
fare_amount / nullif(trip_distance, 0) as fare_per_mile
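To see why the guard works, here is a Python analogue of nullif: a zero denominator becomes None, so the division yields None (SQL's NULL) instead of raising an error and aborting the run.

```python
# Python analogue of the SQL nullif guard -- illustrative only.
def nullif(value, match):
    """Return None when value equals match, else the value (like SQL nullif)."""
    return None if value == match else value

def guarded_divide(numerator, denominator):
    d = nullif(denominator, 0)
    return None if d is None else numerator / d

print(guarded_divide(2.5, 10.0))  # 0.25 -> a 25% tip
print(guarded_divide(2.5, 0.0))   # None -> no error, the row survives
```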
That pattern is going to repeat: per-passenger fare, fare per minute, surcharge ratio, and so on. Extract it once into a macro:
-- macros/safe_divide.sql
{% macro safe_divide(numerator, denominator) %}
{{ numerator }} / nullif({{ denominator }}, 0)
{% endmacro %}
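A macro is ultimately just a function that returns SQL text. This Python sketch mirrors what the macro above renders to, which is useful to keep in mind when reading compiled output:

```python
# Python mirror of the safe_divide macro: a function that returns a SQL
# fragment as a string. Illustrative of the expansion, not dbt internals.
def safe_divide(numerator: str, denominator: str) -> str:
    """Return the SQL fragment the macro renders to."""
    return f"{numerator} / nullif({denominator}, 0)"

print(safe_divide("tip_amount", "fare_amount"))
# tip_amount / nullif(fare_amount, 0)
```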
Then use it in stg_trips.sql to add two analytics columns to the model you built in the setup chapter:
-- models/staging/stg_trips.sql
select
pickup_datetime,
dropoff_datetime,
pickup_location_id,
dropoff_location_id,
fare_amount,
tip_amount,
trip_distance,
{{ safe_divide('tip_amount', 'fare_amount') }} as tip_pct,
{{ safe_divide('fare_amount', 'trip_distance') }} as fare_per_mile
from {{ source('nyc_taxi', 'raw_trips') }}
where pickup_location_id is not null
Two callers of the same macro in a single model already make the case for extracting it: the rule for handling zero denominators lives in one place. When the team decides "actually, return 0 instead of NULL for zero-distance trips," you edit the macro and both columns update.
<aside>
💡 You can verify both columns immediately. Run dbt run --select stg_trips, then select tip_pct, fare_per_mile from dev_<your_name>.stg_trips limit 10; in psql. The average across the January 2024 dataset is roughly 15% tip and around $12 per mile (NYC's mix of long airport trips and short Manhattan rides skews the per-mile number high).
</aside>
Whenever you write Jinja, your debugging move is to look at the rendered SQL.
dbt compile --select stg_trips
dbt compile runs steps 1 and 2 of the parse/compile/wrap pipeline from the setup chapter (parse the project, render Jinja) and stops without sending anything to the database. The output goes to target/compiled/nyc_taxi/models/staging/stg_trips.sql. Open it side-by-side with your source file. If the Jinja did what you expected, the output looks like clean SQL. If it did not, the difference jumps out immediately.
Add safe_divide and watch dbt compile it

This is your first time editing a model and a macro together. The point is to see, with your own eyes in target/compiled/, that {{ safe_divide(...) }} is expanded into literal SQL before anything reaches the database.
1. Add the safe_divide macro to macros/safe_divide.sql.
2. Edit stg_trips.sql to use it for both tip_pct and fare_per_mile.
3. Run dbt compile --select stg_trips.
4. Open target/compiled/nyc_taxi/models/staging/stg_trips.sql and confirm both {{ safe_divide(...) }} calls have been replaced by literal tip_amount / nullif(fare_amount, 0) (and the equivalent for fare_per_mile).
5. Run dbt run --select stg_trips.
6. Check the result in psql: SELECT AVG(tip_pct) FROM dev_<your_name>.stg_trips;
Jinja is a power tool. The same features that let you DRY up 30 models can also turn a 20-line model into a 200-line macro spaghetti that nobody can read. Three patterns to avoid (for a full set of annotated examples, see good patterns (gist) and bad patterns with fixes (gist)):
Single-use macros

A macro you call once, in one model, is worse than the inline SQL it replaces. The reader now has to open a second file to understand what the model does. The rule of thumb: extract a macro when you have three or more callers doing the exact same thing, not when you imagine you might in the future.
{% if %} chains that hide business logic

This is unreadable:
select
fare_amount,
{% if target.name == 'prod' and var('include_tax', false) %}
fare_amount + mta_tax + improvement_surcharge as total
{% elif target.name == 'dev' %}
fare_amount as total
{% else %}
null as total
{% endif %}
from {{ ref('stg_trips') }}
Rule of thumb: {% if %} is for environment switching (target.name, debug flags), not for varying business definitions. If "what counts as total fare" depends on configuration, build two models and let the consumer pick.
Loops that replace a plain column list

This compiles to working SQL, but a future you (or a teammate) has to mentally render the Jinja before they can read the query:
select
{% for col in ['fare_amount', 'tip_amount', 'tolls_amount'] %}
{{ col }}{% if not loop.last %},{% endif %}
{% endfor %}
from {{ ref('stg_trips') }}
The plain-SQL version is shorter and clearer:
select
fare_amount,
tip_amount,
tolls_amount
from {{ ref('stg_trips') }}
<aside>
💡 The right reading of "DRY" in dbt is "every business rule is defined in one place." It is not "every typed character appears once." Repeating three column names in a select is fine. Repeating a case expression in three models is not.
</aside>
With that rule of thumb in hand, you have enough to exercise the patterns from this chapter in the practice file.
<aside>
💡 Using AI to help: When a compiled SQL file looks wrong and you cannot spot why, paste both the source (the .sql with Jinja tags) and the compiled output (target/compiled/...) into an LLM with the prompt "explain what dbt did to transform the first file into the second, line by line." This is faster than reading the Jinja docs and catches silent whitespace / loop-variable mistakes. ⚠️ Public TLC column names are safe; for real projects, scrub table and column names that might hint at customer PII before sending.
</aside>
The matching practice exercises live in the Practice chapter, one per concept introduced above.