Week 10 - SQL Transformations with dbt

Docs & Extras

In dbt Tests you attached tests to your models so that the producer of the data (you) can trust what leaves the pipeline. This chapter is about the other side of that contract: the consumers (analysts, dashboard builders, the next person on the team) who need to find your mart, understand what it means, and decide whether they can trust it.

Tests answer "are the numbers right?" Documentation answers "what does this column mean?" and "where did it come from?" Without either, a dashboard that says "revenue" is a guess. With both, it's a claim that can be audited.

This chapter reuses the _stg_trips.yml, _stg_zones.yml, and _fct_trips.yml files you wrote in dbt Tests: you already have the skeleton; you are adding descriptions to it. Then you generate and host a documentation site with lineage. After that, the chapter covers snapshots (SCD Type 2 row history, practiced against a personal mutable dimension) and the package ecosystem (you already used dbt_utils in the previous chapter; this adds context).

By the end of this chapter, you should be able to:

- add model and column descriptions to the schema YAML you already maintain, and move longer prose into reusable doc blocks
- generate and serve the dbt docs site, and use the lineage graph to see what depends on what
- capture row history for a mutable dimension with a snapshot (SCD Type 2) and query it for point-in-time state
- load small reference CSVs as seeds and install community packages with dbt deps

<aside> 📦 Reference repo: The end-state of this chapter (schema YAMLs with descriptions, a generated docs site, and an example snapshots/ folder) lives at tag ch6-docs-extras in the reference repo. As with the other chapters: type it yourself first, consult the tag when something breaks.

</aside>

Documentation: the YAML you already have, with more words

Every column in your _stg_trips.yml already has a description: field; you wrote those back in the four built-in generic tests, when you added the tests themselves. What you have not done yet is generate the HTML site that turns those descriptions into something a non-dbt user can browse. The site is free: dbt builds it from the same YAML files you already maintain.

Column and model descriptions

The previous chapter already set up the description workflow: the schema YAML files exist, and every column entry in them accepts a description: field alongside its tests.

Open _fct_trips.yml from the dbt Tests practice. Right now it probably has a one-line stub. Flesh it out:

# models/marts/_fct_trips.yml
version: 2

models:
  - name: fct_trips
    description: |
      One row per completed NYC green taxi trip in January 2024, with
      pickup/dropoff zone attributes folded in (OBT-style mart). Queried
      directly by dashboards and ad-hoc analysis.

      **Grain:** one row per trip.
      **Source:** `public.raw_trips` joined to `public.raw_zones` on
      `pickup_location_id` and `dropoff_location_id`.
      **Not included:** trips where `pickup_location_id` is NULL (dropped
      in `stg_trips`); duplicate rows from the TLC source are kept as-is
      and surfaced by `dbt_utils.unique_combination_of_columns`.
    columns:
      - name: pickup_datetime
        description: Wall-clock time the trip began (America/New_York, no timezone attached).
        tests: [not_null]
      - name: fare_amount
        description: Metered fare in USD, not including tip, tolls, or surcharges.
      - name: tip_pct
        description: |
          `tip_amount / fare_amount`, rounded to 4 decimals. NULL when
          `fare_amount` is 0 (voided trips, no-charge rides).
      - name: pickup_borough
        description: |
          NYC borough of the pickup zone, joined from `stg_zones.borough`.
          Values: Manhattan, Brooklyn, Queens, Bronx, Staten Island, EWR,
          Unknown, NaN, or NULL when `pickup_location_id` did not resolve.

The | (YAML literal block) lets a description span multiple lines and keeps newlines as written: useful for anything longer than a sentence.

<aside> 💡 What's mandatory vs. convention. The only parts dbt actually parses are the YAML keys: version, models, name, description, columns, tests. Everything inside a description: value is free-form Markdown: dbt dumps it straight into the docs site without interpreting it. The **Grain:**, **Source:**, **Not included:** labels above are a house convention for making mart docs scannable, not dbt syntax. Use them, replace them with your own headings, or skip them entirely; dbt does not care. What matters is that every mart column has some description that helps the next reader.

</aside>

Rule of thumb: document every column in every mart. Staging and intermediate models can be sparser (they are an implementation detail), but a mart is a contract, and each column is a public API.

<aside> 💡 Using AI to help: Writing good descriptions for 30 columns by hand is tedious. A practical use of an LLM here is: paste your model's select list and the underlying source's column semantics, ask the model to draft one-sentence descriptions, then edit for accuracy. ⚠️ Ensure no PII or sensitive company data is included: for this week's public NYC TLC dataset that is not a concern, but on a real project always scrub customer identifiers, business metrics, or internal URLs before sending columns to a model.

</aside>

Doc blocks: longer prose, reused in multiple places

When a description grows past a couple of sentences, or when the same text needs to live next to five columns, move it into a doc block: a named chunk of Markdown in a .md file under models/.

<!-- models/marts/fct_trips_docs.md -->
{% docs trip_grain %}

One row per completed taxi trip. "Completed" means the TLC submitted the
trip record to the public dataset; cancellations and trips in progress
are not included. Duplicates exist in the source data (roughly 4 rows
in January 2024) and are kept as-is; see `dbt_utils.unique_combination_of_columns`
test results.

{% enddocs %}

Then reference it from YAML with the {{ doc(...) }} function:

- name: fct_trips
  description: '{{ doc("trip_grain") }}'

Rule of thumb: reach for doc blocks when the description runs 3+ lines, contains lists or formatting, or would otherwise be duplicated across models.
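
The same {{ doc(...) }} call also works in column-level descriptions, which is where reuse really pays off. A hypothetical example, assuming a shared doc block named borough_values (not one of the files from this chapter):

columns:
  - name: pickup_borough
    description: '{{ doc("borough_values") }}'
  - name: dropoff_borough
    description: '{{ doc("borough_values") }}'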

dbt docs generate and dbt docs serve

Two commands turn the YAML into a browsable site:

dbt docs generate

This reads every schema YAML, introspects your warehouse for column types and row counts (via a SELECT ... FROM information_schema query per model), and writes three files into target/: index.html (the single-page site itself), manifest.json (the project graph: models, sources, tests, descriptions), and catalog.json (the column types and row counts it just pulled from the warehouse).

Then:

dbt docs serve --port 8001

Starts a local web server at http://localhost:8001. The page you land on lists every model, source, and test in the project. Click a model to see its description, columns, tests, and compiled SQL. In the bottom-right corner there is a blue "view lineage graph" button: click it.

dbt docs: fct_trips model page showing description, details, and column table

The fct_trips page above is what your _fct_trips.yml description and column descriptions render into: Markdown in the description, the column types pulled from Postgres, the attached tests surfaced inline. Every bit of prose a stakeholder reads here came from a YAML field you maintain.

<aside> 💡 If port 8001 is already busy (a previous dbt docs serve you forgot to stop), pick any other port (--port 8002). The default port 8080 collides with many other tools, which is why the dbt docs recommend choosing one explicitly.

</aside>

The lineage graph

The lineage graph shows every model as a node and every ref() / source() call as an edge. For the nyc_taxi project you will see five nodes and four edges: the two raw sources feeding two staging models, which both feed the mart:

dbt docs lineage graph showing raw_trips and raw_zones feeding stg_trips and stg_zones, which both feed fct_trips

Source nodes are green, staging models are cyan, marts are purple. Click fct_trips in the graph and its upstream dependencies stay visible while everything else fades; do the same on a stg_* model and you see which marts depend on it. On a real project with 50+ models, this is the single most useful navigation tool dbt ships: before you change a column in stg_trips, the lineage graph tells you which marts will need review.
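
The same "what depends on this?" question can be answered from the command line with dbt's graph selectors; a quick sketch (the trailing + means "and everything downstream"):

dbt ls --select stg_trips+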

<aside> 💡 The docs site as a team communication tool. In a real team the docs site is hosted somewhere anyone can reach it (often as static HTML in an S3 bucket, Cloudflare Pages, or GitHub Pages, regenerated on every merge to main). When a stakeholder asks "where does revenue come from?", you send them a link to the mart's page in the docs site instead of explaining it in Slack. The docs site is how dbt turns an internal engineering tool into a cross-functional one. Setting up hosted docs is outside this course's scope; see the dbt docs' Hosting and serving guide when you get to it on your own project.

</aside>

Snapshots: tracking rows that change over time

Most marts assume new rows are appended and old rows stay unchanged: good enough for event-shaped data. But some dimensional data changes in place: a customer's address updates, a taxi zone is redrawn, a product's price changes. Overwriting the old value loses history that queries asking "what did this row look like on date X?" still need. A snapshot keeps both.

A snapshot materializes a table with extra metadata columns, most importantly dbt_valid_from and dbt_valid_to, that record when each version of a row was current. This is the classic Kimball Slowly Changing Dimension Type 2 pattern: every time a row changes, the old version gets a dbt_valid_to timestamp and a new row is added with dbt_valid_to = NULL.

The pattern is older than dbt by two decades, and worth a brief detour:

<aside> 🤓 Curious Geek: SCD Type 2 is older than the warehouse

Slowly Changing Dimension Type 2 was named by Ralph Kimball in 1996 when the data warehouse was a mainframe-era idea. The "Type 2" in the name is because Kimball numbered the patterns: Type 1 overwrites, Type 2 keeps history with new rows, Type 3 keeps the previous value in a second column. Type 2 won because it answered the most common analytics question ("what did this look like on date X?") without losing the current state. dbt's contribution is not the pattern; it is making the pattern declarative (write a config block, not a 200-line MERGE statement) and version-controllable.

</aside>

Concrete taxi example. The TLC revises its published zone lookup every few years: zones get added, boundaries get redrawn, service_zone classifications flip. The TLC just overwrites the file; there is no version or changelog. Without a snapshot, fct_trips rebuilt after a revision silently reports different borough totals for the same historical trips. Snapshotting raw_zones captures each revision as a new dated row, so you can always answer "which zone did pickup_location_id belong to on the day the trip happened?":

-- snapshots/zones_snapshot.sql
{% snapshot zones_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='location_id',
    strategy='check',
    check_cols=['borough', 'zone', 'service_zone']
) }}

select * from {{ source('nyc_taxi', 'raw_zones') }}

{% endsnapshot %}

To use the historical zone for a given trip, join against the snapshot on location_id and on the trip's pickup_datetime falling between dbt_valid_from and coalesce(dbt_valid_to, now()). For the January 2024 dataset this changes no numbers (nothing has been revised since the trips landed), but the pattern matters the moment the next revision ships.
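
A sketch of that point-in-time join inside a model, assuming the stg_trips column names from earlier chapters:

-- which zone was pickup_location_id in at the moment the trip happened?
select
    t.pickup_datetime,
    t.pickup_location_id,
    z.borough as pickup_borough_at_trip_time
from {{ ref('stg_trips') }} t
left join {{ ref('zones_snapshot') }} z
    on z.location_id = t.pickup_location_id
   and t.pickup_datetime >= z.dbt_valid_from
   and t.pickup_datetime < coalesce(z.dbt_valid_to, now())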

A quick detour: seeds, small reference CSVs managed in the project

Before the snapshot exercise, one dbt concept this chapter has not yet named: a seed. A seed is a small CSV file in your project's seeds/ folder that dbt loads into the warehouse as a table. The command is dbt seed; dbt build also runs seeds alongside models.

The line between a seed and a source is a matter of ownership, not size: a seed is a file your project owns, version-controls in git, and loads itself; a source is a table some other process (an ingestion pipeline, another team) loads into the warehouse and your project only reads.

Keep seeds small (under a few thousand rows). Large CSVs belong in a real ingestion pipeline: not because dbt refuses, but because dbt seed re-inserts every row on every run.
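
One seed configuration worth knowing for when a CSV's inferred column types come out wrong: you can pin them in dbt_project.yml. A sketch, assuming the project is named nyc_taxi as elsewhere this week:

# dbt_project.yml (excerpt)
seeds:
  nyc_taxi:
    mutable_zones:
      +column_types:
        location_id: integer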

Exercising snapshots safely: mutable_zones in your own schema

There is an obvious pedagogical problem: to see a snapshot capture a change, something has to change. But public.raw_zones is shared: you cannot mutate it to generate history without breaking every classmate's project.

The fix is a per-student seed. Every student has a personal dev_<your_name> schema (from Personal dev schemas); that is a sandbox you own. A 10-row subset of the TLC zone lookup is published as a gist: mutable_zones_seed.csv. Download it into your project under seeds/mutable_zones.csv and load it:

curl -L https://gist.githubusercontent.com/lassebenni/7536be5cff0e237302f30551619288e1/raw/mutable_zones_seed.csv \
  -o seeds/mutable_zones.csv
dbt seed --select mutable_zones

The result is dev_<your_name>.mutable_zones: yours to mutate freely. Now you can point a snapshot at it:

-- snapshots/mutable_zones_snapshot.sql
{% snapshot mutable_zones_snapshot %}

{{ config(
    target_schema=target.schema ~ '_snapshots',
    unique_key='location_id',
    strategy='check',
    check_cols=['borough', 'zone', 'service_zone']
) }}

select * from {{ ref('mutable_zones') }}

{% endsnapshot %}

The target_schema=target.schema ~ '_snapshots' expression uses your profile's schema: value (your personal dev schema) and appends _snapshots, so the snapshot table lives at dev_<your_name>_snapshots.mutable_zones_snapshot. Per-student, same pattern as your model schema.

Run the snapshot to capture the initial state:

dbt snapshot
1 of 1 START snapshot dev_<your_name>_snapshots.mutable_zones_snapshot ......... [RUN]
1 of 1 OK snapshotted dev_<your_name>_snapshots.mutable_zones_snapshot ......... [SELECT 10 in 0.55s]
Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1

The SELECT 10 in the status line is dbt telling you it captured 10 rows: the initial state. Now mutate a row in psql and re-snapshot to watch the history grow:

-- imagine the TLC reclassifies zone 132 (JFK) from 'Airports' to 'Yellow Zone'
UPDATE dev_<your_name>.mutable_zones
SET service_zone = 'Yellow Zone'
WHERE location_id = 132;
dbt snapshot
1 of 1 OK snapshotted dev_<your_name>_snapshots.mutable_zones_snapshot ......... [INSERT 0 1 in 0.87s]
Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1

Note the status changed from SELECT 10 on the initial run to INSERT 0 1 on this one: dbt detected one row whose check_cols changed and inserted exactly one new history row (the snapshot table now has 11 rows, up from 10). Query the snapshot:

SELECT location_id, service_zone, dbt_valid_from, dbt_valid_to
FROM dev_<your_name>_snapshots.mutable_zones_snapshot
WHERE location_id = 132
ORDER BY dbt_valid_from;

You will see two rows: the original 'Airports' version with a dbt_valid_to timestamp, and the new 'Yellow Zone' version with dbt_valid_to IS NULL. That is the SCD Type 2 pattern working end to end. Revert the change (UPDATE ... SET service_zone = 'Airports' WHERE location_id = 132;) and run dbt snapshot again to add a third history row when you are done exploring.

Two strategies to know: timestamp, which detects changes by comparing an updated_at-style column and is the cheaper choice when the source has a reliable one, and check, which compares the columns listed in check_cols and is what the examples above use, because the zone lookup carries no update timestamp.

Running dbt snapshot on a schedule (hourly, daily) builds up the history. Querying the snapshot with where dbt_valid_to is null gives you the current rows; without that filter, you get every historical version.
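
A minimal sketch of that current-state query against the snapshot you just built:

-- exactly one row per location_id: the version that is current right now
SELECT location_id, borough, zone, service_zone
FROM dev_<your_name>_snapshots.mutable_zones_snapshot
WHERE dbt_valid_to IS NULL;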

Rule of thumb: reach for a snapshot when you need to answer "what did this row look like on date X?" Skip it for data that is purely additive (events, log lines): an append-only table handles those. Skip it for high-volume mutable data (user sessions, IoT readings): those want event sourcing in a different tool.

The package ecosystem

You already used a package in Putting it all together: dbt_utils.unique_combination_of_columns shipped the multi-column uniqueness test your project would have had to implement by hand. Packages are how the dbt community collects reusable macros, tests, and materializations without every project writing the same code.

The packages.yml file you wrote in the previous chapter already has the pattern:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]

dbt deps reads the file, resolves versions, and installs each package into dbt_packages/. The macros and tests inside become available under the package's namespace: dbt_utils.generate_surrogate_key(...), dbt_utils.unique_combination_of_columns, and so on.
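
For instance, generate_surrogate_key hashes a list of columns into a deterministic key. A hypothetical excerpt (the column choice is illustrative, not something this week's models require):

-- mint a stable key for rows that have no natural id
select
    {{ dbt_utils.generate_surrogate_key([
        'pickup_datetime',
        'dropoff_datetime',
        'pickup_location_id'
    ]) }} as trip_key,
    *
from {{ ref('stg_trips') }}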

Beyond dbt_utils, the dbt packages hub is the discovery surface for anything else. When you meet a dbt pattern that feels like it should already exist (pivot tables, JSON parsing, better assertion libraries, calendar-table macros), check the hub before writing it yourself. The Going Further page lists the two community packages most projects reach for after dbt_utils.

<aside> ⚠️ Every package you add is code you did not write running against your warehouse. Pin versions (as above), read the package's README for what permissions it needs, and keep an eye on release notes for breaking changes. dbt deps does not audit what it installs.

</aside>

⌨️ Hands on

Four steps that turn the schema YAML you already have into a browsable docs site.

1. Flesh out descriptions in _fct_trips.yml

Open models/marts/_fct_trips.yml. The Chapter 5 practice left it as a two-line stub with just a not_null test on pickup_datetime. Replace it with the fuller example from the Column and model descriptions section above: a multi-paragraph model description using the | literal block, and a description: on every column in the fct_trips select (pickup_datetime, dropoff_datetime, fare_amount, tip_pct, fare_per_mile, payment_type_label, pickup_borough, pickup_zone, dropoff_borough, dropoff_zone).

Keep each column description to one or two sentences. The goal is to give an analyst enough to decide whether the column is what they want, not to re-derive the SQL.

2. Generate the catalog

dbt docs generate
Found 3 models, 1 snapshot, 1 seed, 10 data tests, 2 sources, 581 macros, 1 unit test
Concurrency: 4 threads (target='dev')
Building catalog
Catalog written to target/catalog.json

The Building catalog step is dbt introspecting the warehouse (one information_schema query per model) to capture the column types and row counts the docs site displays. If you see a warning about "relations not found in the catalog," run dbt run --select +fct_trips first: dbt cannot introspect columns in tables that do not exist yet.

3. Serve the site and explore

dbt docs serve --port 8001