Week 10 - SQL Transformations with dbt

Introduction to dbt Core

dbt Setup for Azure PostgreSQL

SQL and Jinja Templating

Materializations & Layers

dbt Tests

Docs & Extras

Practice

Week 10 Assignment: Build a Daily Borough Stats Mart with dbt

Week 10 Gotchas & Pitfalls

Week 10 Lesson Plan (Teachers)

Career relevance: Week 10 in the NL data job market

Glossary

Going Further: Optional Deep Dives

Glossary

Quick reference for the dbt, Jinja, and analytics engineering vocabulary used this week. Entries are ordered by the chapter that first introduces the term: each term lives in exactly one place, and each chapter back-links to its glossary entry on first use. Skim the section for the chapter you are reading; the section above it covers terms you have already met.

Chapter 1 - Introduction to dbt Core

dbt (data build tool)

Open-source CLI that runs SQL transformations as a versioned, tested, documented project. dbt only handles the T in ELT: you still need ingestion code to land raw data before dbt picks it up. See Introduction to dbt Core.

ELT vs ETL

ETL transforms data before loading it into the warehouse. ELT loads raw data first and transforms it inside the warehouse. dbt is the T in ELT.

Model

A .sql file containing a single select statement. dbt wraps it in CREATE VIEW or CREATE TABLE and runs it against the warehouse. One file, one model, one resulting database object.

DAG (Directed Acyclic Graph)

The graph of models dbt builds from the ref() and source() calls in your project. "Directed" means edges have a direction (parent → child); "acyclic" means a model cannot depend on itself, even indirectly. dbt uses the DAG to run models in dependency order.

Dependency order

The build order implied by the DAG. If A depends on B, dbt builds B first. You never declare the order by hand; ref() calls are all dbt needs.

Source

A raw table that already exists in the warehouse. In Chapter 1 this is the concept (the entry point of the DAG); in dbt Setup for Azure PostgreSQL you declare one in _sources.yml, and in SQL and Jinja Templating you reference it with source(). Sources are not created by dbt; they are inputs to the project.

Staging layer

models/staging/, stg_*. Models that read from exactly one source and do light work only: rename columns to snake*case, cast types, filter out obvious junk. No joins, no aggregations, no business logic. Usually materialized as views. Deeper coverage in Materializations & Layers.

Intermediate layer

models/intermediate/, int_*. Optional layer for joins and derived logic that is reused by more than one mart. If only one mart needs the join, skip the intermediate layer and put the join in the mart.

Mart layer

models/marts/, fct_*, dim_*. The business-ready models that dashboards and analysts query. Have a clear grain, stable column names, and are materialized as tables for fast reads. Once a mart is in use, its schema becomes a contract.

Target

A named connection profile in profiles.yml (for example dev, prod). dbt run --target dev writes models to your sandbox schema; dbt run --target prod writes to the shared production schema. The same .sql file builds into different schemas without any code change.

Data test

A query that asserts something about the data in a model. Passes when the query returns zero rows. dbt ships four generic tests out of the box (not_null, unique, accepted_values, relationships). Deeper coverage in dbt Tests.

Analytics engineering

The role that sits between data engineering (moving data) and data analysis (interpreting data). An analytics engineer turns raw warehouse tables into clean, tested, documented models that analysts and BI tools query. See also the optional history chapter.

Chapter 2 - dbt Setup for Azure PostgreSQL

dbt Core vs dbt Cloud

dbt Core is the open-source command-line tool you install with uv. dbt Cloud is the paid hosted service from dbt Labs that adds a web IDE, scheduler, and CI. This week uses dbt Core only.

Adapter

A separate package (e.g. dbt-postgres, dbt-snowflake) that translates dbt's generic instructions into the SQL dialect and connection protocol of a specific warehouse. dbt Core stays database-agnostic; the adapter does the database-specific work.

dbt project

A folder containing dbt_project.yml plus subfolders for models, tests, and macros. Every dbt command runs against a project, and one project is one git repo.

dbt_project.yml

The project-level config file. Declares the project name, model paths, folder-level materialization defaults, and variables. Lives at the root of the dbt project.

profiles.yml

Holds database connection details (host, database, user, password, schema) and target definitions. Secrets are read from environment variables, not hardcoded. Lives in ~/.dbt/ by default, or in the project folder if DBT_PROFILES_DIR points there.

Materialization

The strategy dbt uses to turn a model's select into a database object. Configured per model with {{ config(materialized='...') }} or per folder in dbt_project.yml. Built-in strategies (view, table, ephemeral, incremental) are detailed in Materializations & Layers.

Schema (personal dev schema)

The database namespace where dbt writes your models. Each student uses a personal dev_<your_name> schema so classmates do not collide on the shared Azure PostgreSQL instance. dbt creates the schema automatically on first run.

Chapter 3 - SQL and Jinja Templating

Jinja

A templating language that dbt layers on top of SQL. Jinja runs at compile time, before the SQL reaches the database. Postgres never sees a curly brace: by the time the SQL is executed, dbt has already replaced every Jinja tag with plain text.

Jinja expression

{{ ... }}. Prints a value into the compiled SQL. {{ ref('stg_trips') }} becomes "dev_<your_name>"."stg_trips". Any Jinja function call that produces text uses these braces.

Jinja statement

{% ... %}. Runs logic without printing anything by itself. {% if target.name == 'dev' %} decides which SQL lines appear in the output; the tag itself disappears. Used for control flow (if, for, set, macro).

ref()

Jinja function that refers to another dbt model by name. {{ ref('stg_trips') }} resolves to the fully qualified table name and registers a dependency in the DAG. Never hardcode a table name that ref() could resolve.

source() (Jinja function)

Jinja function that refers to a source declared in _sources.yml. {{ source('nyc_taxi', 'raw_trips') }} resolves to "public"."raw_trips" and registers the source as a DAG parent.

Macro

A named Jinja function defined in the macros/ folder. Call it from any model with {{ my_macro(arg1, arg2) }}. Extract a macro when three or more models need the same SQL snippet; one-off macros hurt readability more than they help.

Compile / compiled SQL

The step where dbt renders Jinja into pure SQL and writes it to target/compiled/.../<model>.sql. Running dbt compile stops after this step without touching the database. Your debugging move whenever Jinja behaves unexpectedly.

Chapter 4 - Materializations & Layers

View materialization

{{ config(materialized='view') }}. dbt issues CREATE VIEW. Fast to build, slow to query: Postgres re-executes the underlying select on every read. The default, and the right choice for staging models.

Table materialization

{{ config(materialized='table') }}. dbt issues CREATE TABLE AS SELECT. Slow to build, fast to query: rows are written to disk once per run and served directly. The right choice for marts that dashboards query repeatedly.

Ephemeral materialization

{{ config(materialized='ephemeral') }}. No database object at all. dbt inlines the model's SQL as a CTE inside every downstream model that references it. Keeps the schema uncluttered but makes debugging harder (you cannot select * from an ephemeral model). Use sparingly.

Incremental materialization

{{ config(materialized='incremental') }}. dbt builds the full table on the first run, then appends or merges only new rows on later runs. Pairs with is_incremental() and unique_key. Worth the extra complexity only when a full rebuild becomes too slow; previewed on the Going Further page and practiced later in the track.

unique_key

Config on incremental models that tells dbt how to identify existing rows. When an incoming row has the same unique_key as a row already in the table, dbt updates it instead of inserting a duplicate.

is_incremental()

Jinja macro that returns true on every run except the first and except when --full-refresh is passed. Wrap the incremental where clause in {% if is_incremental() %} so the first build loads everything and subsequent builds load only new rows.

this

Jinja variable that refers to the current model's own table in the database. Shows up in incremental models as from {{ this }} when you need to look up the maximum existing timestamp before loading new rows.

config()

{{ config(...) }}. Jinja function that sets model-level options inline in the .sql file (materialized='table', unique_key='id', severity='warn', etc.). The per-model equivalent of setting options in dbt_project.yml. Inline config wins over folder-level config on conflicts.

CTE (Common Table Expression)

A named subquery inside a WITH ... AS (...) block in SQL. Ephemeral models are inlined as CTEs into their consumers. Also used by hand inside a single model to break a long query into named steps.

Fact table

fct_*. A table where each row is an event or transaction (one trip, one order, one page view) with measurable numeric columns (fare, amount, duration). The fct_ prefix signals event grain regardless of whether the table is a narrow Kimball fact or a wide OBT mart.

Dimension table

dim_*. A table of descriptive attributes used to filter and group facts (one row per zone, per customer, per product). In a star schema, consumers join it to the fact table at query time. This course does not build a separate dim_zones; zone attributes are denormalized into fct_trips.

Grain

The meaning of "one row" in a table. The grain of fct_trips is "one row per completed taxi trip." Every mart should have a single, clearly stated grain; mixing grains in one table is a common source of wrong numbers.

One Big Table (OBT)

A wide, denormalized mart with dimension attributes pre-joined into the fact row. Our fct_trips with pickup_borough, pickup_zone, dropoff_borough, dropoff_zone is OBT: consumers query one table, no join required at read time. The practical default in most dbt projects on columnar warehouses.

Star schema

The Kimball alternative to OBT: one narrow fact table surrounded by separate dimension tables, joined at query time. Week 9 introduces the theory; Week 10 builds OBT instead because columnar storage and dbt-driven rebuilds have made wide marts the common default.

Plus selector

The + prefix or suffix on --select that includes upstream or downstream models. dbt run --select +fct_trips builds fct_trips and all its parents. dbt run --select stg_trips+ builds stg_trips and all its children.

Chapter 5 - dbt Tests

Generic test

A reusable test (like not_null) that you attach to a column in a schema YAML file. One definition, many applications across the project.

not_null

Generic test that fails if any row has a NULL in the specified column. Attach to every join key, group-by key, and identifier.

unique

Generic test that fails if any value appears more than once in the column. Combine with not_null for a primary-key assertion. For compound uniqueness, reach for dbt_utils.unique_combination_of_columns.

accepted_values

Generic test that fails if any value in the column is outside the provided list. Silently ignores NULLs: pair with not_null if NULLs are not allowed.

relationships

Generic test that fails if a column's values are not found in the referenced model's column. dbt's answer to foreign-key assertions.

Singular test