Week 10 - SQL Transformations with dbt
dbt Setup for Azure PostgreSQL
Week 10 Assignment: Build a Daily Borough Stats Mart with dbt
Week 10 Lesson Plan (Teachers)
In SQL and Jinja Templating you built stg_trips and put it under a models/staging/ folder. That folder name is not just convention: it is one of three layers that dbt projects use to organize transformations: staging, intermediate, and marts. Week 9 introduced these layers as a conceptual model; this chapter makes them real in your dbt project.
Once the layers are in place, you will learn the four materializations (view, table, ephemeral, incremental) that decide how dbt turns each model's select into a database object. By the end of the chapter you will have built fct_trips: the business-ready mart that downstream dashboards will query.
By the end of this chapter, you should be able to:
- Organize a dbt project into staging, intermediate, and mart layers, and explain what belongs in each.
- Choose between the view and table materializations and set them per folder in dbt_project.yml.
- Recognize when a table model has outgrown a full rebuild and needs a more advanced strategy (snapshot is covered in Docs & Extras; incremental is previewed on the Going Further page and practiced later in the track, on a dataset large enough for the pattern to matter).

Inside the dbt project, the layers are folders under models/:
models/
├── staging/ ← 1:1 with sources, light cleaning
│ ├── _sources.yml
│ ├── stg_trips.sql
│ └── stg_zones.sql
├── intermediate/ ← joins and business logic (optional)
│ └── int_trips_enriched.sql
└── marts/ ← business-ready, queried by consumers
└── fct_trips.sql
A staging model reads from exactly one source and does light work: rename columns to snake_case, cast types, filter out junk rows. It does not join, aggregate, or add business logic. The naming convention is stg_<source_table>.
The raw tables it reads from are declared in models/staging/_sources.yml (set up in the setup chapter), which is what lets staging models reference them through {{ source('nyc_taxi', ...) }} instead of hardcoded table names.
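For reference, a minimal _sources.yml consistent with the names used in this chapter might look like this (a sketch: match the schema you actually loaded the raw tables into during setup):

```yaml
# models/staging/_sources.yml (sketch)
version: 2

sources:
  - name: nyc_taxi
    schema: public        # assumption: raw tables live in the public schema
    tables:
      - name: raw_trips
      - name: raw_zones
```

The source() call takes the name: of the source and the name: of a table under it, and dbt resolves them to the schema-qualified table at compile time.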
You already have stg_trips from SQL and Jinja Templating. Next, you will build stg_zones:
-- models/staging/stg_zones.sql
select
location_id,
borough,
zone,
service_zone
from {{ source('nyc_taxi', 'raw_zones') }}
raw_zones is already clean (snake_case columns, no nulls to filter), so the staging model is a thin pass-through. That is normal. The point of staging is not to transform; it is to give every source table a stable ref()-able name that downstream models depend on. If the column names in raw_zones change tomorrow (because the ingestion script is updated), you fix stg_zones and nothing else breaks.
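To see what the Jinja resolves to, run dbt compile and open the generated file under target/compiled/. A sketch of the result, assuming _sources.yml points the nyc_taxi source at the public schema:

```sql
-- target/compiled/<project>/models/staging/stg_zones.sql (illustrative path)
select
    location_id,
    borough,
    zone,
    service_zone
from "public"."raw_zones"
```

The {{ source(...) }} call is gone; only plain SQL reaches the database.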
An intermediate model joins staging models or applies business logic that is too complex for a staging model but not yet ready for a mart. The naming convention is int_<description>.
For the NYC taxi project, an intermediate model could enrich trips with zone names (using the tip_pct, fare_per_mile, and payment_type_label columns added to stg_trips in Practice Exercise 1):
-- models/intermediate/int_trips_enriched.sql
select
t.pickup_datetime,
t.dropoff_datetime,
t.fare_amount,
t.tip_amount,
t.trip_distance,
t.tip_pct,
t.fare_per_mile,
t.payment_type_label,
pz.borough as pickup_borough,
pz.zone as pickup_zone,
dz.borough as dropoff_borough,
dz.zone as dropoff_zone
from {{ ref('stg_trips') }} t
left join {{ ref('stg_zones') }} pz
on t.pickup_location_id = pz.location_id
left join {{ ref('stg_zones') }} dz
on t.dropoff_location_id = dz.location_id
The intermediate layer is optional. For a project this small, you could skip it and put the join directly in the mart. The layer exists for when the join logic is reused by multiple marts (one mart for trip analysis, another for zone-level aggregations). If only one mart needs the join, putting it in the mart is simpler.
<aside> 💡 The decision is pragmatic, not dogmatic: if two or more marts would duplicate the same join, extract it into an intermediate model. If only one mart needs it, keep the join in the mart.
</aside>
A mart is the model that analysts, dashboards, and downstream tools query. It has a clear grain (one row = one trip), stable column names, and is materialized as a table so reads are fast.
In Week 9 you met two ways to shape a mart: the Kimball star schema (a narrow fact table plus separate dimension tables you join at query time) and One Big Table (OBT) (a wide, denormalized table with dimension attributes pre-joined into the fact row). This course builds OBT-style marts: the fct_trips you will write below already folds zone attributes (pickup_borough, pickup_zone, dropoff_borough, dropoff_zone) into the fact row instead of keeping them in a separate dim_zones model that consumers would have to join.
OBT is the practical default in dbt projects because it moves join cost out of the read path: joining fct_trips with dim_zones at query time is work the warehouse repeats on every dashboard refresh, while doing the join once at dbt run time and materializing a wide mart pays that cost once per build.

The naming convention is fct_<noun> for event-grained tables (taxi trips, orders, page views) and dim_<noun> if you ever do split a reference table off. fct_ in the wild signals "one row per event with numeric measures," regardless of whether the mart is pure Kimball or OBT.
<aside> 📘 See Week 9's data modeling chapter for the Kimball-vs-OBT comparison and the history behind the fact/dimension vocabulary.
</aside>
OBT is the default in modern dbt projects, but it has not always been. The shift is recent enough to be worth a short note:
<aside> 🤓 Curious Geek: how columnar storage flipped the modelling default
Star schemas dominated warehousing from the 1990s through the 2010s for a hardware reason: row-oriented databases (Postgres, MySQL, the old Oracle stacks) read entire rows from disk even when a query touches three columns out of forty. A wide OBT mart was wasteful; narrow dim_* tables joined at query time were cheap. Then columnar warehouses (Vertica 2005, Redshift 2012, BigQuery, Snowflake, DuckDB) flipped that. Reading three columns out of forty now costs a tenth of reading the full row; the join cost dropped at the same time as the wide-row cost. By 2020 most analytics-engineering teams had moved to OBT-by-default. Your fct_trips is still on Postgres for this course, but the modelling style points at where you will work in production.
</aside>
You can see the OBT-by-default pattern in production at company scale:
<aside>
💡 In the wild: GitLab's analytics team runs their entire company on a public dbt project: gitlab-data/analytics on GitLab. 1000+ models, full staging/intermediate/marts layering, mix of view / table / incremental materializations, all open source. Browse transform/snowflake-dbt/models/ to see what the same patterns you write this week look like at company scale.
</aside>
You now have three layers to organize models: staging, intermediate, marts. Every model inside those layers is still a plain select statement, and dbt has to turn each one into something the database can query. A materialization is the strategy dbt uses to do that.
The strategy matters. stg_trips from SQL and Jinja Templating is materialized as a view: every query re-executes the full select against raw_trips. For a staging model read only by downstream dbt models during a run, that is fine. For a mart a Grafana dashboard queries every 30 seconds, it is not.
You pick the strategy with one line of config; dbt handles the DDL.
| Materialization | What dbt creates | Rebuilds on dbt run | Data stored on disk? |
|---|---|---|---|
| view (default) | CREATE VIEW | Yes (instant, replaces the view definition) | No, query runs on read |
| table | CREATE TABLE AS SELECT | Yes (drops and recreates the full table) | Yes |
| ephemeral | Nothing (inlined as a CTE) | N/A | No |
| incremental | CREATE TABLE on first run, then INSERT/MERGE on subsequent runs | Only new/changed rows | Yes |
view: always-fresh, zero build cost: use for staging

A view stores only the select. The query re-runs against raw_trips every time a downstream model or dashboard reads from it. No data is copied, no disk is used, and the result always reflects the latest rows in the source. That is exactly what you want for a staging model: the renaming and type casts in stg_trips are cheap, and you never want them to return stale data.
Rule of thumb: use view when the transformation is cheap (no joins, no aggregations) and freshness matters. All your stg_* models in this project are views.
-- models/staging/stg_trips.sql
{{ config(materialized='view') }}
select
pickup_datetime,
dropoff_datetime,
pickup_location_id,
fare_amount,
tip_amount,
trip_distance
from {{ source('nyc_taxi', 'raw_trips') }}
where pickup_location_id is not null
The {{ config(...) }} block is optional here because view is the default; show it only when you want the materialization to be explicit in the file itself.
table: compute once, read cheaply: use for marts

A table materialization runs the full select and writes the result to disk. Subsequent queries read pre-computed rows at table-scan speed. The cost you pay is build time and storage: every dbt run drops the table and rebuilds it from scratch.
Rule of thumb: use table when the query is compute-heavy (multi-way joins, aggregations, window functions) and the result will be read many times between runs. Dashboards and BI tools querying your marts fit both conditions.
A sketch of the mart you will build in full in the Building fct_trips section below:
-- models/marts/fct_trips.sql
{{ config(materialized='table') }}
select
t.pickup_datetime,
t.fare_amount,
t.tip_amount,
pz.borough as pickup_borough,
pz.zone as pickup_zone
from {{ ref('stg_trips') }} t
left join {{ ref('stg_zones') }} pz
on t.pickup_location_id = pz.location_id
The join with stg_zones happens once per dbt run, not once per dashboard query. For a Grafana dashboard refreshing every 30 seconds, that is the difference between one join per run and nearly three thousand joins per day.
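The arithmetic behind that figure, runnable against any Postgres connection:

```sql
-- A dashboard polling every 30 seconds, around the clock:
select (24 * 60 * 60) / 30 as refreshes_per_day;  -- 2880
```

If fct_trips were a view, each of those ~2,880 refreshes would re-execute the two zone joins; as a table, the joins run once per dbt run.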
ephemeral and incremental: not practiced this week

Two further materializations exist. ephemeral is never created in the database: dbt inlines its SQL as a CTE inside every downstream model that references it. The view materialization covers the same ground with better debuggability, so most projects skip it; see the Going Further page if you want the full picture. incremental processes only new or changed rows on each run instead of rebuilding the whole table, but it only earns its complexity past tens of millions of rows: your 57K-row January 2024 dataset rebuilds as a table in under two seconds. The Going Further page has a preview of the is_incremental() / {{ this }} pattern for when you meet it later in the track.
Mapping the four materializations back to the three layers you already know:
| Layer | Materialization | Why |
|---|---|---|
| Staging (stg_*) | view | Read only by downstream models during dbt run, never queried directly |
| Intermediate (int_*) | view | Same reasoning as staging |
| Mart (fct_*) | table | Queried directly by dashboards and analysts; needs fast reads |

The snapshot strategy (SCD Type 2 row history) is covered in Docs & Extras.
This course builds OBT-style marts, so you do not produce a dim_* model. If you later split a dimension off (Kimball-style), materialize it as a table for the same reason as a fact mart.
You first met dbt_project.yml in the setup chapter where you configured name:, profile:, and model-paths:. Here you extend the same file with a models: block that sets defaults for every model in a folder. You can set the materialization per model (in the SQL file) or per folder (in dbt_project.yml); the folder-level config is cleaner because it applies a convention once instead of repeating it in every file.
# dbt_project.yml
models:
nyc_taxi:
staging:
+materialized: view
intermediate:
+materialized: view
marts:
+materialized: table
The + prefix means "apply this config to all models in this folder and its subfolders." Any {{ config() }} block inside a model overrides the folder-level setting.
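For example, a one-off inspection model inside marts/ could opt back out of the folder default (the model name here is hypothetical, not part of the project):

```sql
-- models/marts/mart_debug.sql (hypothetical)
-- marts/ defaults to table via dbt_project.yml, but this in-file config wins:
{{ config(materialized='view') }}

select * from {{ ref('fct_trips') }}
```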
<aside>
⌨️ Hands on: Add the models: block above to your dbt_project.yml. Run dbt run and check the output: staging models should show CREATE VIEW, and mart models should show CREATE TABLE.
</aside>
Building fct_trips

Time to build the mart. fct_trips joins trips with zones, selects the columns a dashboard needs, and is materialized as a table. The grain is one row per trip.
The mart assumes stg_trips already exposes tip_pct, fare_per_mile, and payment_type_label: the columns added in Practice Exercise 1. If you skipped that exercise, do it first, or drop those three columns from the select below.
Create the file. The {{ config(...) }} line is redundant once the marts/ folder default is set in dbt_project.yml above, but keeping it inline makes the materialization visible to anyone reading only this file:
-- models/marts/fct_trips.sql
{{ config(materialized='table') }}
select
t.pickup_datetime,
t.dropoff_datetime,
t.fare_amount,
t.tip_amount,
t.trip_distance,
t.tip_pct,
t.fare_per_mile,
t.payment_type_label,
pz.borough as pickup_borough,
pz.zone as pickup_zone,
dz.borough as dropoff_borough,
dz.zone as dropoff_zone
from {{ ref('stg_trips') }} t
left join {{ ref('stg_zones') }} pz
on t.pickup_location_id = pz.location_id
left join {{ ref('stg_zones') }} dz
on t.dropoff_location_id = dz.location_id
<aside>
💡 This mart skips the intermediate layer and puts the join directly in fct_trips. For a two-source project that is the right call. If you later add a second mart that needs the same enriched trip data, extract the join into int_trips_enriched and have both marts ref() it.
</aside>
Build everything from sources through the mart:
dbt run --select +fct_trips
Concurrency: 4 threads (target='dev')
1 of 3 START sql view model dev_<your_name>.stg_trips ........................ [RUN]
2 of 3 START sql view model dev_<your_name>.stg_zones ........................ [RUN]
1 of 3 OK created sql view model dev_<your_name>.stg_trips ................... [CREATE VIEW in 0.55s]
2 of 3 OK created sql view model dev_<your_name>.stg_zones ................... [CREATE VIEW in 0.56s]
3 of 3 START sql table model dev_<your_name>.fct_trips ....................... [RUN]
3 of 3 OK created sql table model dev_<your_name>.fct_trips .................. [SELECT 56551 in 0.73s]
Finished running 1 table model, 2 view models in 0 hours 0 minutes and 2.81 seconds (2.81s).
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=3
The status tokens tell the story of the chapter: CREATE VIEW for the two staging models, SELECT 56551 for the mart (the row count written as a CREATE TABLE AS SELECT). Notice dbt ran the two staging models in parallel (1 of 3 and 2 of 3 both START before either finishes) and only started the mart after both parents finished: that is the DAG respecting dependency order with four threads of concurrency. The + prefix tells dbt to include all upstream models in the run. Without it, dbt would try to build fct_trips alone and fail if stg_trips or stg_zones are not already built.
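The + is one of dbt's standard graph selectors; the direction it points determines which side of the DAG comes along:

```shell
dbt run --select fct_trips    # just the mart; fails if its parents are not built yet
dbt run --select +fct_trips   # the mart plus everything upstream of it
dbt run --select stg_trips+   # stg_trips plus everything downstream of it
```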
Verify the mart:
SELECT pickup_borough, COUNT(*) AS trips, ROUND(AVG(tip_pct)::numeric, 3) AS avg_tip_pct
FROM dev_<your_name>.fct_trips
GROUP BY pickup_borough
ORDER BY trips DESC;
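A quick grain check is also worth running: because fct_trips left-joins stg_zones on location_id, the mart should hold exactly one row per staged trip. A sketch, assuming location_id is unique in stg_zones (a duplicate there would fan the join out):

```sql
-- Fan-out check: both counts should be equal
SELECT
  (SELECT COUNT(*) FROM dev_<your_name>.stg_trips) AS staged_trips,
  (SELECT COUNT(*) FROM dev_<your_name>.fct_trips) AS mart_rows;
```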