Week 10 - SQL Transformations with dbt

Introduction to dbt Core

dbt Setup for Azure PostgreSQL

SQL and Jinja Templating

dbt Tests

Incremental Models, Docs, and Snapshots

Building Your First Mart

Week 10 Gotchas & Pitfalls

Practice

Week 10 Assignment: Build a Daily Borough Stats Mart with dbt

Week 10 - Video Guide

Week 10 Lesson Plan (Teachers)

dbt Setup for Azure PostgreSQL

Setup is usually the least interesting part of learning a new tool, and also the part where most people get stuck. This chapter gets you from "nothing installed" to a working dbt project that connects to the shared Azure PostgreSQL database from Week 6, runs a trivial model into your personal schema, and passes dbt debug. Every later chapter in this week assumes you have finished the steps here.

By the end of this chapter, you should be able to:

- Install dbt Core and the dbt-postgres adapter with uv.
- Scaffold a dbt project and configure a project-local profiles.yml with secrets loaded from environment variables.
- Pass dbt debug against the shared Azure PostgreSQL database using your own personal dev schema.
- Declare the Week 9 raw tables as sources and run your first model against them.

<aside> 📦 Reference repo: A complete, CI-tested copy of the project this week builds lives at lassebenni/nyc-taxi-dbt-reference. The tags ch2-dbt-setup-azure through ch6-docs-extras mark the state at the end of each chapter. Do not clone it as a starting point: typing each file yourself is how the dbt mental model sticks. Use it to compare your work when something breaks: git checkout ch2-dbt-setup-azure shows what your project should look like at the end of this chapter.

</aside>

dbt Core, adapters, and why you install both

dbt Core is the engine: it parses your project, resolves Jinja templates, builds the DAG, and orchestrates the run. It does not know how to talk to any specific database.

Database communication lives in a separate package called an adapter. There is one adapter per warehouse:

| Adapter | Database | Maintained by |
| --- | --- | --- |
| dbt-postgres | PostgreSQL (including Azure PostgreSQL) | dbt Labs |
| dbt-snowflake | Snowflake | dbt Labs |
| dbt-bigquery | Google BigQuery | dbt Labs |
| dbt-redshift | AWS Redshift | dbt Labs |
| dbt-duckdb | DuckDB | community |

The adapter translates dbt's generic instructions ("create this view", "select these columns") into the SQL dialect and connection protocol the warehouse expects. This week you target Azure PostgreSQL, so you install dbt-postgres alongside dbt Core.

<aside> 💡 The split exists so dbt Core can stay neutral and the dbt Labs team (plus the community) can ship adapters on independent release cycles. You only ever install the adapters you actually use.

</aside>

Install dbt Core and dbt-postgres with uv

In Week 5 you switched to uv as the recommended Python package manager. dbt is a CLI tool you reuse across projects, so install it once with uv tool install rather than as a project dependency. The same command works identically on macOS, Linux, and Windows:

uv tool install --python 3.12 dbt-core --with dbt-postgres

Two parts to notice:

- --python 3.12 pins the Python version for the tool's own isolated environment, so the install does not depend on whatever Python happens to be on your system.
- --with dbt-postgres installs the adapter into that same environment alongside dbt-core, so the dbt CLI can find it.

Verify the install:

dbt --version

You should see something like:

Core:
  - installed: 1.11.x
  - latest:    1.11.x - Up to date!

Plugins:
  - postgres: 1.10.x - Up to date!

If the dbt command is not found at all after the install, the tool directory is probably missing from your PATH:

<aside> 💡 Run uv tool update-shell once and open a new terminal. This adds ~/.local/bin (macOS/Linux) or %USERPROFILE%\.local\bin (Windows) to your shell's PATH.

</aside>

If installation fails on macOS or Windows with a psycopg2 build error, install the binary wheel of the driver instead:

<aside> ⚠️ If uv tool install fails with a psycopg2 build error (most often on Apple Silicon or fresh Windows installs), retry with the binary wheel: uv tool install --python 3.12 dbt-core --with dbt-postgres --with psycopg2-binary. The -binary package bundles the Postgres client library so no system compiler is needed.

</aside>

The error is different on Linux because the system package, not the wheel, is missing:

<aside> ⚠️ Linux: if the install fails with pg_config not found, install the system Postgres dev headers first: sudo apt-get install libpq-dev (Debian/Ubuntu) or sudo dnf install libpq-devel (Fedora/RHEL). Then re-run the uv tool install command above.

</aside>

Create a new dbt project

A dbt project is a folder containing a dbt_project.yml config file and subfolders for models, tests, and macros. Every dbt command runs against a project, and the project is the unit of version control: one project, one git repo.

dbt ships with a project generator. Run it from a folder where you want the project to live (not inside an existing git repo of another project):

dbt init nyc_taxi --skip-profile-setup

The --skip-profile-setup flag tells dbt init to scaffold the project without prompting you for connection details. You will configure those manually in the next section using a project-local profiles.yml file. (If you forget the flag, dbt asks ~10 prompts for host, user, password, etc. Press Enter through all of them.)

If you have multiple adapters installed, dbt init first asks which one to use. With only dbt-postgres installed it auto-picks postgres without asking.

After dbt init finishes, you have a folder structure like this:

nyc_taxi/
├── dbt_project.yml       # Project config: name, version, model paths, materializations
├── models/
│   └── example/          # Scaffolded example models (delete these later)
├── macros/               # Your custom Jinja macros go here
├── tests/                # Custom singular tests
├── seeds/                # Small CSV files you want dbt to load into the database
├── snapshots/            # Slowly changing dimension tracking (advanced)
└── analyses/             # Ad-hoc SQL scripts dbt compiles but does not run

The two folders you will use this week are models/ and macros/. Everything else you can ignore for now.

The most important file is dbt_project.yml. Open it and note these fields:

# dbt_project.yml (key fields only)
name: 'nyc_taxi'
profile: 'nyc_taxi'          # Must match the entry name in profiles.yml
model-paths: ["models"]

The profile: field is how dbt knows which database connection to use: it looks up a matching entry in profiles.yml, which you create in the next section.
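
dbt_project.yml is also where folder-level defaults live, under a models: block; the scaffolded file usually ships with one for the example/ folder. A minimal sketch of what that looks like once you have a staging/ folder (you create it later in this chapter) — illustrative only, since view is already the default materialization:

# dbt_project.yml (illustrative folder-level default)
models:
  nyc_taxi:
    staging:
      +materialized: view    # every model under models/staging/ builds as a view unless overridden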

Configure a project-local profiles.yml

profiles.yml holds the connection details (host, user, password, schema). dbt looks for it in two places, in order:

  1. The current project folder (next to dbt_project.yml).
  2. ~/.dbt/profiles.yml (your home directory).

For this course you will keep profiles.yml inside the project folder so the connection config travels with the project and is visible in code review. To keep secrets out of git, every sensitive field is loaded from an environment variable using {{ env_var('VAR_NAME') }}.

Two terms to know before you read the file: an outputs block is a dictionary of named connections, and each entry inside it is a target. The top-level target: field selects which one is used by default. You can have a dev target pointing at your sandbox and a prod target pointing at the production schema, then switch between them at runtime with dbt run --target prod.

Create nyc_taxi/profiles.yml with the following content. The host, user, and database name match the shared instance from Week 6:

# nyc_taxi/profiles.yml
nyc_taxi:
  target: dev
  outputs:
    dev:
      type: postgres
      host: hyf-data-pg.postgres.database.azure.com
      port: 5432
      user: hyfadmin
      password: "{{ env_var('PG_PASSWORD') }}"
      dbname: team1
      schema: dev_<your_name>     # ⚠️ Replace with your own first name, lowercase
      sslmode: require
      threads: 4
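
The dev/prod switch described above is easiest to see with a second entry nested under the same outputs: block. The sketch below is illustrative only: the ci target name and schema are hypothetical, chosen to keep you away from the shared analytics schema (reserved for Week 11).

# profiles.yml (illustrative second target; this course only needs dev)
    ci:
      type: postgres
      host: hyf-data-pg.postgres.database.azure.com
      port: 5432
      user: hyfadmin
      password: "{{ env_var('PG_PASSWORD') }}"
      dbname: team1
      schema: dev_<your_name>_ci   # hypothetical throwaway schema, only to show --target switching
      sslmode: require
      threads: 4

With that entry in place, dbt run --target ci would build into the throwaway schema, while plain dbt run keeps using the default dev target.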

Add profiles.yml to .gitignore so credentials never end up in git, even if the env var fails to expand:

echo "profiles.yml" >> .gitignore
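
dbt init also scaffolds a .gitignore that excludes build artifacts, so after the command above the file should look roughly like this (the first three lines come from the scaffold):

# nyc_taxi/.gitignore
target/
dbt_packages/
logs/
profiles.yml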

<aside> ⚠️ Only the password needs to be a secret on this course. The host, user, and database name are not sensitive (they were already shared with everyone in Week 6). You still gitignore the file because mistakes happen: a teammate adds a real password: line on their machine, you pull, and now it lives in your local copy. Gitignore protects everyone.

</aside>

Set the password environment variable in the same shell where you will run dbt:

export PG_PASSWORD='your-password-from-week-6'    # macOS/Linux/WSL
# $env:PG_PASSWORD = 'your-password-from-week-6'  # Windows PowerShell

Then verify dbt can connect:

dbt debug
Connection:
  host: hyf-data-pg.postgres.database.azure.com
  port: 5432
  user: hyfadmin
  database: team1
  schema: dev_<your_name>
  sslmode: require
  ...
Registered adapter: postgres=1.10.0
  Connection test: [OK connection ok]
All checks passed!

The last two lines are what you want: Connection test: [OK connection ok] means dbt successfully opened a connection, and All checks passed! means every config file parsed cleanly. If you see connection errors, check:

  1. The password env var is set in the same terminal where you run dbt debug.
  2. sslmode: require is present (Azure PostgreSQL rejects non-TLS connections).
  3. Your IP is allow-listed on the Azure PostgreSQL firewall (check with your teacher if unsure).
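
If all three look right and dbt debug still fails, it helps to take dbt out of the picture and test the same connection values with plain psql (or whichever SQL client you used in Week 6):

psql "host=hyf-data-pg.postgres.database.azure.com port=5432 dbname=team1 user=hyfadmin sslmode=require"
# psql prompts for the password; \conninfo confirms the TLS connection, \q exits

If psql cannot connect either, the problem is the network or the firewall rule, not your dbt configuration.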

Personal dev schemas

This is the most important configuration choice in this chapter and it trips up everyone who skips it.

All students share the same team1 Azure PostgreSQL database. If you and a classmate both run dbt run with schema: analytics, your models write to the same analytics.stg_trips table and overwrite each other every few minutes. The fix is to give each student a personal schema.

A schema in PostgreSQL is a namespace inside a database. Multiple schemas mean that dev_anna.stg_trips, dev_marco.stg_trips, and analytics.stg_trips can all coexist in the same team1 database without collisions. The diagram below uses example names; in your own profile you use your own first name (the rest of the chapter writes this as dev_<your_name>).

graph TB
    subgraph db[" Azure PostgreSQL: team1 database "]
        subgraph dev_anna[" dev_anna schema "]
            l1[stg_trips]
            l2[stg_zones]
            l3[fct_trips]
        end
        subgraph dev_marco[" dev_marco schema "]
            a1[stg_trips]
            a2[stg_zones]
            a3[fct_trips]
        end
        subgraph analytics[" analytics schema (prod) "]
            p1[stg_trips]
            p2[stg_zones]
            p3[fct_trips]
        end
    end

    classDef dev fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef prod fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
    class l1,l2,l3,a1,a2,a3 dev
    class p1,p2,p3 prod

Each student has their own sandbox. You can break your dev schema as many times as you want without affecting anyone else. The shared analytics schema is written to only by the Airflow-orchestrated production run you set up in Week 11, never from a student's laptop.

<aside> 💭 If you have worked with traditional DTAP (Development, Test, Acceptance, Production) databases before, personal schemas inside a shared database probably look wrong: hand-written tables drift and pollute. dbt inverts that rule for non-prod because dbt models are code: every dbt run drops and recreates the objects from scratch, so personal schemas cannot drift. In industry, the typical layout is one shared non-prod database with one schema per engineer (dev_anna, dev_marco, ...) and a separate prod database written to only by CI.

</aside>

The schema: line in profiles.yml (dev_<your_name>) is what makes this work. Use lowercase, no spaces, no special characters, and stick with the same name for the whole week so you can find your objects in a SQL client. dbt creates the schema automatically the first time you run a model: no manual CREATE SCHEMA required.
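
Later in the week, when your schema holds more than one object, a standard information_schema query from psql or any SQL client lists everything dbt has built there (replace the placeholder as usual):

-- list every object dbt has created in your personal schema
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'dev_<your_name>'
ORDER BY table_name;

table_type tells you whether each object is a VIEW or a BASE TABLE, which becomes useful once you mix materializations.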

<aside> ⚠️ Do not use the analytics schema from your laptop. That schema is reserved for the Airflow-orchestrated production pipeline you build in Week 11. Writing to it from dbt run locally means your work-in-progress models become visible to everyone querying dashboards.

</aside>

Define sources for the Week 9 raw tables

A source in dbt is a raw table that already exists in the database, declared in a schema.yml file so you can reference it with {{ source('source_name', 'table_name') }}. Using sources instead of hardcoding table names gives you dependency tracking and lets dbt test raw data freshness later.

The Week 9 NYC taxi data lives in the public schema as raw_trips and raw_zones. In Week 9 you queried these tables with hardcoded FROM public.raw_trips. Declaring them as dbt sources gives you the same access plus dependency tracking, freshness checks, and a single place to update if the table name ever changes.

Create a new file at models/staging/_sources.yml:

# models/staging/_sources.yml
version: 2

sources:
  - name: nyc_taxi
    description: Raw NYC green taxi trip records and zone lookup, loaded in Week 9.
    database: team1
    schema: public
    tables:
      - name: raw_trips
        description: One row per green taxi trip for January 2024 (~57K rows).
      - name: raw_zones
        description: NYC taxi zone lookup (265 rows mapping location IDs to boroughs).

Now you can reference these tables from any model in the project:

select * from {{ source('nyc_taxi', 'raw_trips') }}

dbt will resolve the call to "public"."raw_trips" at compile time.
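
You can watch that resolution happen without creating anything in the database. Once a model references the source (you write stg_trips in the next section), dbt compile writes the fully resolved SQL under target/:

dbt compile --select stg_trips
cat target/compiled/nyc_taxi/models/staging/stg_trips.sql
# the source() call now reads as the literal "public"."raw_trips"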

Your first model

A dbt model is a .sql file containing a single select statement. When you run dbt run, dbt wraps your select in CREATE VIEW (or CREATE TABLE, depending on the materialization) and executes it against the database. The model's name is the filename without the .sql extension: stg_trips.sql becomes the table stg_trips in your dev schema.

Delete the scaffolded models/example/ folder from dbt init, and create a new file at models/staging/stg_trips.sql:

-- models/staging/stg_trips.sql
select
    pickup_datetime,
    dropoff_datetime,
    pickup_location_id,
    dropoff_location_id,
    fare_amount,
    trip_distance
from {{ source('nyc_taxi', 'raw_trips') }}
where pickup_location_id is not null

<aside> 💡 The Week 9 ingestion already cleaned the original NYC TLC column names (lpep_pickup_datetime → pickup_datetime, etc.) before writing to raw_trips. If you query the table in psql you will see the renamed columns directly.

</aside>

Run it:

dbt run --select stg_trips
1 of 1 START sql view model dev_<your_name>.stg_trips ........................ [RUN]
1 of 1 OK created sql view model dev_<your_name>.stg_trips ................... [CREATE VIEW in 0.55s]
Finished running 1 view model in 0 hours 0 minutes and 0.88 seconds (0.88s).
Completed successfully
Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1

The status token in brackets is what dbt sent to Postgres: CREATE VIEW here, because this is a view-materialized model. Verify it worked by querying the database from psql or your SQL client:

SELECT COUNT(*) FROM dev_<your_name>.stg_trips;

You should get a row count close to 56,000 (the ~57K rows from Week 9, minus any with NULL pickup_location_id).

<aside> 🎉 If you got a row count, your dbt project is working end-to-end. The rest of the week builds on this exact setup.

</aside>

What dbt run actually does

It is worth slowing down on this command, because every later chapter assumes you understand what happens between dbt run and a new view appearing in your schema.

When you run dbt run --select stg_trips, dbt executes these steps in order:

  1. Parse the project. dbt reads dbt_project.yml, every .sql file under models/, and every .yml file. It builds the DAG from the ref() and source() calls in your SQL.
  2. Compile Jinja. For each selected model, dbt resolves the templates against the warehouse catalog. {{ source('nyc_taxi', 'raw_trips') }} becomes "public"."raw_trips". {{ ref('stg_trips') }} becomes "dev_<your_name>"."stg_trips". The compiled SQL is written to target/compiled/nyc_taxi/models/staging/stg_trips.sql.
  3. Wrap the compiled SQL in a CREATE statement. For views: CREATE VIEW <schema>.<model>__dbt_tmp AS (...). For tables: CREATE TABLE AS SELECT (CTAS), which drops and recreates the table from scratch on every run; dbt does not append or insert by default (a config sketch follows this list). The wrapped SQL lands in target/run/nyc_taxi/models/staging/stg_trips.sql.
  4. Atomically swap into place. dbt creates the new object under a __dbt_tmp suffix, then renames it to the final name in a single transaction. This swap exists so that a failed build never leaves your schema in a half-written state: either the new view replaces the old one cleanly, or nothing changes.
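
Step 3 is where materializations come into play: stg_trips builds as a view only because view is dbt's default. The standard way to opt a model into table materialization is a config block at the top of the file — a sketch only, leave your stg_trips as a view for now:

-- illustrative: switching a model from the default view to a table
{{ config(materialized='table') }}

select
    pickup_datetime,
    fare_amount
from {{ source('nyc_taxi', 'raw_trips') }}

With that line in place, the next dbt run wraps the query in CREATE TABLE AS SELECT instead of CREATE VIEW and rebuilds the table from scratch on every run, exactly as described above.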