Week 10 - SQL Transformations with dbt
dbt Setup for Azure PostgreSQL
Setup is usually the least interesting part of learning a new tool, and also the part where most people get stuck. This chapter gets you from "nothing installed" to a working dbt project that connects to the shared Azure PostgreSQL database from Week 6, runs a trivial model into your personal schema, and passes dbt debug. Every later chapter in this week assumes you have finished the steps here.
By the end of this chapter, you should be able to:
- Install dbt Core and the dbt-postgres adapter with uv.
- Write a profiles.yml that loads secrets from environment variables.
- Declare the raw_trips and raw_zones tables as dbt sources.
- Run dbt debug and dbt run successfully, and find the SQL dbt actually executed.
<aside>
📦 Reference repo: A complete, CI-tested copy of the project this week builds lives at lassebenni/nyc-taxi-dbt-reference. The tags ch2-dbt-setup-azure through ch6-docs-extras mark the state at the end of each chapter. Do not clone it as a starting point: typing each file yourself is how the dbt mental model sticks. Use it to compare your work when something breaks: git checkout ch2-dbt-setup-azure shows what your project should look like at the end of this chapter.
</aside>
dbt Core is the engine: it parses your project, resolves Jinja templates, builds the DAG, and orchestrates the run. It does not know how to talk to any specific database.
Database communication lives in a separate package called an adapter. There is one adapter per warehouse:
| Adapter | Database | Maintained by |
|---|---|---|
| dbt-postgres | PostgreSQL (including Azure PostgreSQL) | dbt Labs |
| dbt-snowflake | Snowflake | dbt Labs |
| dbt-bigquery | Google BigQuery | dbt Labs |
| dbt-redshift | AWS Redshift | dbt Labs |
| dbt-duckdb | DuckDB | community |
The adapter translates dbt's generic instructions ("create this view", "select these columns") into the SQL dialect and connection protocol the warehouse expects. This week you target Azure PostgreSQL, so you install dbt-postgres alongside dbt Core.
<aside> 💡 The split exists so dbt Core can stay neutral and the dbt Labs team (plus the community) can ship adapters on independent release cycles. You only ever install the adapters you actually use.
</aside>
Installing dbt Core and dbt-postgres with uv
In Week 5 you switched to uv as the recommended Python package manager. dbt is a CLI tool you reuse across projects, so install it once with uv tool install rather than as a project dependency. The same command works identically on macOS, Linux, and Windows:
uv tool install --python 3.12 dbt-core --with dbt-postgres
Two parts to notice:
- --with dbt-postgres tells uv to install the adapter into the same isolated environment as dbt Core. Without it, dbt run would fail with Could not find adapter type postgres.
- --python 3.12 pins the Python interpreter dbt runs on. dbt's transitive dependencies (notably mashumaro) do not yet support Python 3.13 or 3.14. Without the pin, uv may pick the newest Python on your machine and dbt --version will crash on import with a mashumaro.exceptions.UnserializableField error.

Verify the install:
dbt --version
You should see something like:
Core:
- installed: 1.11.x
- latest: 1.11.x - Up to date!
Plugins:
- postgres: 1.10.x - Up to date!
A second install problem is that dbt may not be on your PATH at all, so dbt --version fails with "command not found":
<aside>
💡 Run uv tool update-shell once and open a new terminal. This adds ~/.local/bin (macOS/Linux) or %USERPROFILE%\.local\bin (Windows) to your shell.
</aside>
If installation fails on macOS or Windows with a psycopg2 build error, install the binary wheel of the driver instead:
<aside>
⚠️ If uv tool install fails with a psycopg2 build error (most often on Apple Silicon or fresh Windows installs), retry with the binary wheel: uv tool install --python 3.12 dbt-core --with dbt-postgres --with psycopg2-binary. The -binary package bundles the Postgres client library so no system compiler is needed.
</aside>
The error is different on Linux because the system package, not the wheel, is missing:
<aside>
⚠️ Linux: if the install fails with pg_config not found, install the system Postgres dev headers first: sudo apt-get install libpq-dev (Debian/Ubuntu) or sudo dnf install libpq-devel (Fedora/RHEL). Then re-run the uv tool install command above.
</aside>
A dbt project is a folder containing a dbt_project.yml config file and subfolders for models, tests, and macros. Every dbt command runs against a project, and the project is the unit of version control: one project, one git repo.
dbt ships with a project generator. Run it from a folder where you want the project to live (not inside an existing git repo of another project):
dbt init nyc_taxi --skip-profile-setup
The --skip-profile-setup flag tells dbt init to scaffold the project without prompting you for connection details. You will configure those manually in the next section using a project-local profiles.yml file. (If you forget the flag, dbt asks ~10 prompts for host, user, password, etc. Press Enter through all of them.)
If you have multiple adapters installed, dbt init first asks which one to use. With only dbt-postgres installed it auto-picks postgres without asking.
After dbt init finishes, you have a folder structure like this:
nyc_taxi/
├── dbt_project.yml # Project config: name, version, model paths, materializations
├── models/
│ └── example/ # Scaffolded example models (delete these later)
├── macros/ # Your custom Jinja macros go here
├── tests/ # Custom singular tests
├── seeds/ # Small CSV files you want dbt to load into the database
├── snapshots/ # Slowly changing dimension tracking (advanced)
└── analyses/ # Ad-hoc SQL scripts dbt compiles but does not run
The two folders you will use this week are models/ and macros/. Everything else you can ignore for now.
The most important file is dbt_project.yml. Open it and note these fields:
# dbt_project.yml (key fields only)
name: 'nyc_taxi'
profile: 'nyc_taxi' # Must match the entry name in profiles.yml
model-paths: ["models"]
The profile: field is how dbt knows which database connection to use: it looks up a matching entry in profiles.yml, which you create in the next section.
profiles.yml
profiles.yml holds the connection details (host, user, password, schema). dbt looks for it in two places, in order:

1. The project folder (next to dbt_project.yml).
2. ~/.dbt/profiles.yml (your home directory).

For this course you will keep profiles.yml inside the project folder so the connection config travels with the project and is visible in code review. To keep secrets out of git, every sensitive field is loaded from an environment variable using {{ env_var('VAR_NAME') }}.
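env_var also accepts an optional second argument as a default, which is handy for non-secret settings you want overridable per machine. A sketch (DBT_THREADS is a made-up variable name, not a dbt convention; as_number is dbt's built-in Jinja filter for casting env-var strings to numbers):

```yaml
# Sketch: env_var with and without a default.
dev:
  type: postgres
  # Non-secret setting: falls back to 4 if DBT_THREADS is unset.
  threads: "{{ env_var('DBT_THREADS', '4') | as_number }}"
  # Secret: no default on purpose, so dbt fails loudly when it is unset.
  password: "{{ env_var('PG_PASSWORD') }}"
```

Never give a password a default value: a loud failure when the variable is missing is exactly what you want.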
Two terms to know before you read the file: an outputs block is a dictionary of named connections, and each entry inside it is a target. The top-level target: field selects which one is used by default. You can have a dev target pointing at your sandbox and a prod target pointing at the production schema, then switch between them at runtime with dbt run --target prod.
Create nyc_taxi/profiles.yml with the following content. The host, user, and database name match the shared instance from Week 6:
# nyc_taxi/profiles.yml
nyc_taxi:
target: dev
outputs:
dev:
type: postgres
host: hyf-data-pg.postgres.database.azure.com
port: 5432
user: hyfadmin
password: "{{ env_var('PG_PASSWORD') }}"
dbname: team1
schema: dev_<your_name> # ⚠️ Replace with your own first name, lowercase
sslmode: require
threads: 4
Add profiles.yml to .gitignore so credentials never end up in git, even if the env var fails to expand:
echo "profiles.yml" >> .gitignore
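You can confirm the rule actually matches with git check-ignore, which exits 0 and echoes the path when some .gitignore rule covers it. A throwaway demo (the ignore-demo directory is illustrative, not part of your project):

```shell
# Self-contained demo of git check-ignore in a scratch repo.
mkdir -p ignore-demo && cd ignore-demo
git init -q
echo "profiles.yml" >> .gitignore
touch profiles.yml
# Exit status 0 means the file is ignored; it also prints the path.
git check-ignore profiles.yml && echo "ignored: good"
```

Run the same `git check-ignore profiles.yml` inside your real nyc_taxi repo before your first commit.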
<aside>
⚠️ Only the password needs to be a secret on this course. The host, user, and database name are not sensitive (they were already shared with everyone in Week 6). You still gitignore the file because mistakes happen: a teammate adds a real password: line on their machine, you pull, and now it lives in your local copy. Gitignore protects everyone.
</aside>
Set the password environment variable in the same shell where you will run dbt:
export PG_PASSWORD='your-password-from-week-6' # macOS/Linux/WSL
# $env:PG_PASSWORD = 'your-password-from-week-6' # Windows PowerShell
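Because a missing PG_PASSWORD only surfaces later as a Jinja or authentication error, a small guard in the shell can fail faster with a clearer message. A sketch (require_env is a hypothetical helper, not part of dbt or uv):

```shell
# Sketch: fail fast when a required environment variable is unset.
require_env() {
  eval "_val=\${$1:-}"
  if [ -z "$_val" ]; then
    echo "error: $1 is not set" >&2
    return 1
  fi
}

export PG_PASSWORD='your-password-from-week-6'
require_env PG_PASSWORD && echo "PG_PASSWORD present"
```

You could call `require_env PG_PASSWORD && dbt debug` in a wrapper script so dbt never runs against an unset password.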
Then verify dbt can connect:
dbt debug
Connection:
host: hyf-data-pg.postgres.database.azure.com
port: 5432
user: hyfadmin
database: team1
schema: dev_<your_name>
sslmode: require
...
Registered adapter: postgres=1.10.0
Connection test: [OK connection ok]
All checks passed!
The last two lines are what you want: Connection test: [OK connection ok] means dbt successfully opened a connection, and All checks passed! means every config file parsed cleanly. If you see connection errors, check:
- The PG_PASSWORD environment variable is set in the same shell where you run dbt debug.
- sslmode: require is present (Azure PostgreSQL rejects non-TLS connections).

This is the most important configuration choice in this chapter and it trips up everyone who skips it.
All students share the same team1 Azure PostgreSQL database. If you and a classmate both run dbt run with schema: analytics, your models write to the same analytics.stg_trips table and overwrite each other every few minutes. The fix is to give each student a personal schema.
A schema in PostgreSQL is a namespace inside a database. Multiple schemas like dev_anna.stg_trips, dev_marco.stg_trips, and analytics.stg_trips can coexist in the same team1 database without collisions. The diagram uses example names; in your own profile you replace the placeholder with your own first name (the rest of the chapter writes this as dev_<your_name>).
graph TB
subgraph db[" Azure PostgreSQL: team1 database "]
subgraph dev_anna[" dev_anna schema "]
l1[stg_trips]
l2[stg_zones]
l3[fct_trips]
end
subgraph dev_marco[" dev_marco schema "]
a1[stg_trips]
a2[stg_zones]
a3[fct_trips]
end
subgraph analytics[" analytics schema (prod) "]
p1[stg_trips]
p2[stg_zones]
p3[fct_trips]
end
end
classDef dev fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
classDef prod fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
class l1,l2,l3,a1,a2,a3 dev
class p1,p2,p3 prod
Each student has their own sandbox. You can break your dev schema as many times as you want without affecting anyone else. The shared analytics schema is written to only by the Airflow-orchestrated production run you set up in Week 11, never from a student's laptop.
<aside>
💭 If you have worked with traditional DTAP (Development, Test, Acceptance, Production) databases before, personal schemas inside a shared database probably look wrong: hand-written tables drift and pollute. dbt inverts that rule for non-prod because dbt models are code: every dbt run drops and recreates the objects from scratch, so personal schemas cannot drift. In industry, the typical layout is one shared non-prod database with one schema per engineer (dev_anna, dev_marco, ...) and a separate prod database written to only by CI.
</aside>
The schema: line in profiles.yml (dev_<your_name>) is what makes this work. Use lowercase, no spaces, no special characters, and stick with the same name for the whole week so you can find your objects in a SQL client. dbt creates the schema automatically the first time you run a model: no manual CREATE SCHEMA required.
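The naming rules can be checked mechanically before you paste the name into profiles.yml. A sketch (valid_schema is a hypothetical helper; the dev_ prefix and lowercase/digits/underscore rule follow the convention above):

```shell
# Sketch: validate a candidate dev schema name for profiles.yml.
# Accepts: dev_ prefix, then lowercase letters, digits, underscores.
valid_schema() {
  printf '%s' "$1" | grep -Eq '^dev_[a-z][a-z0-9_]*$'
}

valid_schema "dev_anna" && echo "dev_anna: ok"
valid_schema "Dev Anna" || echo "Dev Anna: rejected"
```

Uppercase letters and spaces are rejected because unquoted identifiers in PostgreSQL are folded to lowercase, which makes mixed-case schema names painful to query.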
<aside>
⚠️ Do not use the analytics schema from your laptop. That schema is reserved for the Airflow-orchestrated production pipeline you build in Week 11. Writing to it from dbt run locally means your work-in-progress models become visible to everyone querying dashboards.
</aside>
A source in dbt is a raw table that already exists in the database, declared in a schema.yml file so you can reference it with {{ source('schema_name', 'table_name') }}. Using sources instead of hardcoding table names gives you dependency tracking and lets dbt test raw data freshness later.
The Week 9 NYC taxi data lives in the public schema as raw_trips and raw_zones. In Week 9 you queried these tables with hardcoded FROM public.raw_trips. Declaring them as dbt sources gives you the same access plus dependency tracking, freshness checks, and a single place to update if the table name ever changes.
Create a new file at models/staging/_sources.yml:
# models/staging/_sources.yml
version: 2
sources:
- name: nyc_taxi
description: Raw NYC green taxi trip records and zone lookup, loaded in Week 9.
database: team1
schema: public
tables:
- name: raw_trips
description: One row per green taxi trip for January 2024 (~57K rows).
- name: raw_zones
description: NYC taxi zone lookup (265 rows mapping location IDs to boroughs).
Now you can reference these tables from any model in the project:
select * from {{ source('nyc_taxi', 'raw_trips') }}
dbt will resolve the call to "public"."raw_trips" at compile time.
A dbt model is a .sql file containing a single select statement. When you run dbt run, dbt wraps your select in CREATE VIEW (or CREATE TABLE, depending on the materialization) and executes it against the database. The model's name is the filename without the .sql extension: stg_trips.sql becomes the view stg_trips in your dev schema.
Delete the scaffolded models/example/ folder from dbt init, and create a new file at models/staging/stg_trips.sql:
-- models/staging/stg_trips.sql
select
pickup_datetime,
dropoff_datetime,
pickup_location_id,
dropoff_location_id,
fare_amount,
trip_distance
from {{ source('nyc_taxi', 'raw_trips') }}
where pickup_location_id is not null
<aside>
💡 The Week 9 ingestion already cleaned the original NYC TLC column names (lpep_pickup_datetime → pickup_datetime, etc.) before writing to raw_trips. If you query the table in psql you will see the renamed columns directly.
</aside>
Run it:
dbt run --select stg_trips
1 of 1 START sql view model dev_<your_name>.stg_trips ........................ [RUN]
1 of 1 OK created sql view model dev_<your_name>.stg_trips ................... [CREATE VIEW in 0.55s]
Finished running 1 view model in 0 hours 0 minutes and 0.88 seconds (0.88s).
Completed successfully
Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
The status token in brackets is what dbt sent to Postgres: CREATE VIEW here, because this is a view-materialized model. Verify it worked by querying the database from psql or your SQL client:
SELECT COUNT(*) FROM dev_<your_name>.stg_trips;
You should get a row count close to 56,000 (the ~57K rows from Week 9, minus any with NULL pickup_location_id).
<aside> 🎉 If you got a row count, your dbt project is working end-to-end. The rest of the week builds on this exact setup.
</aside>
What dbt run actually does
It is worth slowing down on this command, because every later chapter assumes you understand what happens between dbt run and a new view appearing in your schema.
When you run dbt run --select stg_trips, dbt executes these steps in order:
1. Parse. dbt reads dbt_project.yml, every .sql file under models/, and every .yml file. It builds the DAG from the ref() and source() calls in your SQL.
2. Compile. Jinja is rendered: {{ source('nyc_taxi', 'raw_trips') }} becomes "public"."raw_trips", and {{ ref('stg_trips') }} becomes "dev_<your_name>"."stg_trips". The compiled SQL is written to target/compiled/nyc_taxi/models/staging/stg_trips.sql.
3. Wrap. The compiled select is wrapped in a CREATE statement. For views: CREATE VIEW <schema>.<model>__dbt_tmp AS (...). For tables: CREATE TABLE AS SELECT (CTAS), which drops and recreates the table from scratch on every run. dbt does not append or insert by default. The wrapped SQL lands in target/run/nyc_taxi/models/staging/stg_trips.sql.
4. Execute and swap. dbt creates the object under the __dbt_tmp suffix, then renames it to the final name in a single transaction. This swap exists so that a failed build never leaves your schema in a half-written state: either the new view replaces the old one cleanly, or nothing changes.
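The __dbt_tmp swap, sketched in pseudocode-level SQL (illustrative only: the exact statements vary by adapter version, and the schema name dev_anna is an example; check target/run/ for what your dbt actually sent):

```sql
-- Roughly what a view-materialized run looks like on Postgres.
begin;
create view "dev_anna"."stg_trips__dbt_tmp" as (
  select ...  -- your compiled model SQL
);
alter table "dev_anna"."stg_trips" rename to "stg_trips__dbt_backup";
alter table "dev_anna"."stg_trips__dbt_tmp" rename to "stg_trips";
drop view if exists "dev_anna"."stg_trips__dbt_backup" cascade;
commit;
```

Because the renames happen inside one transaction, readers querying stg_trips either see the old view or the new one, never an intermediate state.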