Week 10 - SQL Transformations with dbt

Introduction to dbt Core

dbt Setup for Azure PostgreSQL

SQL and Jinja Templating

dbt Tests

Incremental Models, Docs, and Snapshots

Building Your First Mart

Week 10 Gotchas & Pitfalls

Practice

Week 10 Assignment: Build a Daily Borough Stats Mart with dbt

Week 10 - Video Guide

Week 10 Lesson Plan (Teachers)

Introduction to dbt Core

In Week 9, you wrote analytical SQL by hand: queries, validation checks, and a fact and dimension view in Azure PostgreSQL. That worked for one week of exercises. It does not scale to a real project where you have dozens of models, tests, and dependencies between them.

This week introduces dbt (data build tool), the industry standard for managing SQL transformations as software. dbt turns your SQL into a versioned, tested, and documented project, using the same workflow discipline you already apply with git and pytest.

By the end of this chapter, you should be able to:

- explain what dbt is and which part of ELT it handles
- name the main reasons teams use dbt: version control, dependency management, testing, documentation, and environment switching
- describe what an analytics engineer does and how the role differs from data engineer and data analyst
- place dbt in the data pipeline you have been building across this course

<aside> 📘 Core program connection: Building on the SQL basics you learned in Core program Week 12 (and used in Week 9), dbt does not replace SQL; it extends it. Everything you write in dbt is still SQL that runs against the database.

</aside>

What is dbt?

dbt is an open-source command-line tool that lets you write SQL transformations as files in a project, run them in dependency order against your data warehouse, test the results, and generate documentation.

<aside> 💡 Key insight: dbt only handles the T in ELT (Extract, Load, Transform). It does not extract or load data. You still need ingestion code (Python, like Weeks 6-8) to land raw data in the warehouse first. dbt picks up from there.

</aside>

In one sentence: dbt is the framework that allows you to treat your raw SQL queries as a professional, version-controlled software project.

You write a model as a plain select statement in a .sql file:

-- models/staging/stg_trips.sql
select
    pickup_datetime,
    dropoff_datetime,
    pickup_location_id,
    dropoff_location_id,
    fare_amount,
    trip_distance
from {{ source('nyc_taxi', 'raw_trips') }}
where pickup_location_id is not null

When you run dbt run, dbt compiles this file (resolving the templated references and wrapping the query in CREATE VIEW or CREATE TABLE) and sends the final SQL to your database. You never write the CREATE VIEW boilerplate yourself, and you never hand-manage the order in which models are built.
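The compiled statement that dbt sends to Postgres looks roughly like this (a sketch: the dbt_dev_yourname target schema and the nyc_taxi source schema are placeholder names; the real ones come from your profile and source configuration):

-- compiled SQL for stg_trips (sketch; schema names are placeholders)
create view dbt_dev_yourname.stg_trips as (
    select
        pickup_datetime,
        dropoff_datetime,
        pickup_location_id,
        dropoff_location_id,
        fare_amount,
        trip_distance
    from nyc_taxi.raw_trips
    where pickup_location_id is not null
);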

You will see compilation in detail in SQL and Jinja Templating.

Why use dbt?

You can do everything dbt does with plain SQL files and a shell script. So why has dbt become the default for analytics engineering?

1. Version control and code review

dbt projects are folders of .sql and .yml files. They live in git. Pull requests show diffs. Code review works the same way it does for any other codebase. In Week 9, your SQL views existed only inside the database, where there was no history of who changed what.

  -- models/staging/stg_trips.sql
  select
      pickup_datetime,
      dropoff_datetime,
      pickup_location_id,
      dropoff_location_id,
      fare_amount,
-     trip_distance
+     trip_distance,
+     tip_amount
  from {{ source('nyc_taxi', 'raw_trips') }}
  where pickup_location_id is not null

A reviewer sees exactly what changed, comments on the line, and approves or requests changes. With a raw CREATE VIEW in the database, the same change is invisible to anyone who is not watching the database schema directly.

2. Dependency management

When you write {{ ref('stg_trips') }} instead of hardcoding the table name, dbt builds a directed acyclic graph (DAG) of every model and runs them in the correct order. If model A depends on model B, dbt builds B first. You never write the order yourself.
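For example, the intermediate model in the diagram below could combine both staging models like this (a sketch: the location_id and zone column names in stg_zones are assumptions; the real ones depend on how you build that model later this week):

-- models/intermediate/int_trips_enriched.sql (sketch)
select
    t.pickup_datetime,
    t.dropoff_datetime,
    t.fare_amount,
    t.trip_distance,
    pz.zone as pickup_zone,
    dz.zone as dropoff_zone
from {{ ref('stg_trips') }} as t
left join {{ ref('stg_zones') }} as pz
    on t.pickup_location_id = pz.location_id
left join {{ ref('stg_zones') }} as dz
    on t.dropoff_location_id = dz.location_id

Because this file calls ref('stg_trips') and ref('stg_zones'), dbt knows both staging models must be built before this one; that is all the information the DAG needs.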

Here is what a small DAG looks like for the NYC taxi project you will build this week:

graph LR
    src_trips[("source: raw_trips")]:::source
    src_zones[("source: raw_zones")]:::source

    stg_trips["stg_trips<br/>(rename + filter)"]:::staging
    stg_zones["stg_zones<br/>(rename)"]:::staging

    int_trips_enriched["int_trips_enriched<br/>(join zones)"]:::intermediate

    fct_trips["fct_trips<br/>(fact mart)"]:::mart

    src_trips --> stg_trips
    src_zones --> stg_zones
    stg_trips --> int_trips_enriched
    stg_zones --> int_trips_enriched
    int_trips_enriched --> fct_trips

    classDef source fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
    classDef staging fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef intermediate fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
    classDef mart fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000

Reading left to right:

- The sources (raw_trips and raw_zones) are the tables your ingestion code already loaded; dbt reads them but never creates them.
- The staging models (stg_trips, stg_zones) rename columns and apply light filtering, one model per source.
- The intermediate model (int_trips_enriched) joins the staging models together.
- The mart (fct_trips) is the business-ready fact table that analysts and dashboards query.

When you run dbt run, dbt reads the ref() and source() calls from every file, infers this graph, and builds each node in the correct dependency order (so that every parent is built before its children). stg_trips runs before int_trips_enriched, which runs before fct_trips. You never specify that order by hand.

<aside> ⌨️ Hands on: Look at the DAG diagram above. If the ingestion pipeline added a new column to raw_zones, which dbt models would need to be rebuilt? Trace the path from source to mart and write down the models in the order dbt would run them.

</aside>

3. Testing

dbt ships with built-in data tests that assert things like "this column is never null" or "this foreign key always matches a row in the parent table." You declare them next to the model, run dbt test, and dbt fails loudly if the data violates the assumption. This catches bad data before it reaches a dashboard.

graph LR
    model["stg_trips<br/>(built by dbt run)"]:::staging

    t1["not_null<br/>pickup_datetime"]:::testpass
    t2["unique<br/>trip_id"]:::testpass
    t3["accepted_values<br/>payment_type in 1..6"]:::testfail
    t4["relationships<br/>pickup_location_id → zones"]:::testpass

    model --> t1
    model --> t2
    model --> t3
    model --> t4

    t1 --> pass1["✅ PASS"]:::pass
    t2 --> pass2["✅ PASS"]:::pass
    t3 --> fail1["❌ FAIL<br/>blocks the pipeline"]:::fail
    t4 --> pass3["✅ PASS"]:::pass

    classDef staging fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef testpass fill:#fff2cc,stroke:#d6b656,stroke-width:1px,color:#000
    classDef testfail fill:#fff2cc,stroke:#d6b656,stroke-width:1px,color:#000
    classDef pass fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
    classDef fail fill:#f8cecc,stroke:#b85450,stroke-width:2px,color:#000

Each test runs a query against the model and checks whether the result is empty. If any row violates the assertion, the test fails, dbt test exits with a non-zero status, and your CI pipeline (or orchestrator in Week 11) stops before bad data reaches the next stage. You will meet each built-in test and learn how to write custom ones in dbt Tests.
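To make that concrete, the built-in not_null test on pickup_datetime compiles to a query along these lines (a simplified sketch: the exact SQL dbt generates varies between versions, and the schema name is a placeholder):

-- compiled form of a not_null test on stg_trips.pickup_datetime (sketch)
select pickup_datetime
from dbt_dev_yourname.stg_trips
where pickup_datetime is null

Zero rows back means the test passes; any row at all is a failure.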

4. Documentation

dbt docs generate produces a static website showing every model, its columns, descriptions, and a clickable lineage graph from raw sources to final marts. Anyone on the team can see how a number on a dashboard was calculated: column descriptions, tests, and upstream dependencies all come from the YAML files you already maintain alongside your models. You will generate your own docs site in Incremental Models, Docs, and Snapshots.

5. Environment switching

dbt separates what your SQL does from where it runs. The same .sql file can build into your personal sandbox schema during development and into a shared production schema in CI, without changing a line of code: you just switch the target on the command line. You will configure your personal dev target in dbt Setup for Azure PostgreSQL; production targets are handled by Airflow in Week 11.
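Concretely, the only thing that changes between targets is the schema the compiled SQL is written into. A sketch, assuming a personal dev schema and a shared production schema (both names are placeholders you choose when configuring your profile):

-- Same model file, two targets: only the schema in the compiled DDL changes.
-- dbt run --target dev   (personal sandbox schema; name is a placeholder)
create view dbt_dev_yourname.stg_trips as
select pickup_datetime, fare_amount, trip_distance
from nyc_taxi.raw_trips
where pickup_location_id is not null;

-- dbt run --target prod  (shared schema, built by Airflow in Week 11; name is a placeholder)
create view analytics.stg_trips as
select pickup_datetime, fare_amount, trip_distance
from nyc_taxi.raw_trips
where pickup_location_id is not null;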

<aside> 💭 None of these features are unique to dbt. You could build them yourself. dbt is valuable because it bundles all of them into one tool with a consistent workflow that the whole industry has standardized on.

</aside>

What is analytics engineering?

The workflow you just read about (version control, tests, documentation, modular SQL models) has a name: analytics engineering. It describes a role that sits between data engineers (who move and store data) and data analysts (who interpret it and build dashboards).

An analytics engineer takes raw data that a data engineer has landed in the warehouse and turns it into clean, well-tested, documented models that analysts and BI tools can query directly. The output is not a one-off chart or a CSV; it is a maintained set of SQL models that the whole company depends on.

| Role | Main question | Primary tools | Output |
| --- | --- | --- | --- |
| Data engineer | How does raw data get into the warehouse, reliably? | Python, Airflow, Kafka, cloud storage, Grafana | Pipelines, raw tables |
| Analytics engineer | How do I turn raw data into trusted models for analysis? | dbt, SQL, git, CI/CD | Staging models, marts, tests, docs |
| Data analyst | What do the numbers mean, and what should the business do? | BI tools (Tableau, Looker, Power BI), SQL | Dashboards, reports, insights |

In practice, small teams blur these boundaries: one person often does all three. The point is that the work described by the middle row exists at every company with a data warehouse, and dbt is the industry-standard tool for doing it, which is why this week exists.

How dbt fits into the data pipeline

Here is the full picture of where dbt sits in a modern data stack, and how it connects to the other weeks in this course:

flowchart LR
    sources["Sources<br/>(APIs, files)"]:::external
    ingestion["Ingestion<br/>(Python, Airflow)<br/>*Weeks 6-8*"]:::ingest
    dbt["**dbt**<br/>(transform, test, doc)<br/>*Week 10 (this week)*"]:::dbt
    dashboards["Dashboards<br/>(Grafana)<br/>*Week 12*"]:::output

    warehouse[("Azure PostgreSQL<br/>raw_trips, raw_zones<br/>*Week 9*")]:::storage

    sources --> ingestion --> warehouse
    warehouse --> dbt --> dashboards

    classDef external fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
    classDef ingest fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef dbt fill:#d5e8d4,stroke:#82b366,stroke-width:3px,color:#000
    classDef output fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
    classDef storage fill:#f5f5f5,stroke:#666,stroke-width:2px,color:#000

The NYC taxi data is already loaded in raw_trips and raw_zones from Week 9. This week, you build dbt models on top of those tables: stg_trips and stg_zones (cleaning and renaming), then fct_trips (joining them into a business-ready mart). In Week 12, that mart will become the data source for a dashboard.

<aside> 📘 Core Program Refresher: In the Core program, you used functions and modules to organize your code and hide complexity. The mental model is similar: a dbt model is a "public interface" that hides messy SQL joins and cleaning logic behind a simple, stable table name. Downstream users only need to know the name of the model, not the complex code inside.

</aside>

Every later chapter includes a short "Using AI to help" pointer next to the skill it applies to; here is the week-level version:

<aside> 💡 Using AI to help: Throughout the week you will see concrete "paste this kind of prompt" callouts in the chapters that practice a specific skill. The short version: LLMs are great at explaining compiled SQL, drafting YAML descriptions, and suggesting unit-test cases; they are not great at choosing materializations or writing tests that match your actual data. Use them for the first category; do the second yourself. ⚠️ Do not paste real passwords, connection strings, or customer data into any LLM, ever.

</aside>

Before moving on, a short detour on where dbt came from:

<aside> 🤓 Curious Geek: where dbt came from

dbt started inside a Philadelphia consultancy called Fishtown Analytics in 2016. Founder Tristan Handy was tired of writing the same CREATE VIEW boilerplate by hand for every client and built a Python wrapper that let him reference one model from another with {{ ref(...) }}. The wrapper was open-sourced as dbt-core; Fishtown later renamed itself dbt Labs and the company is now valued at $4.2B. The killer feature was not the Jinja templating (every templating engine can do that): it was making {{ ref() }} define the DAG. Once one tool owned the dependency graph, version control, testing, and docs all followed naturally.

</aside>

Knowledge Check

  1. dbt only handles one letter of ELT. Which one, and what does it mean for the rest of your pipeline?
  2. Why does dbt build a DAG from your models, and what would break if you hardcoded table names instead of letting dbt manage the references?
  3. You and a teammate both want to add a new column to the same dbt model. How does using dbt make this safer than directly editing a CREATE VIEW statement in the database?
  4. Name two things an analytics engineer does day to day that a data engineer typically does not.

Ready for the next chapter when

The next chapter moves from concepts to a working install. No state to check yet; just confirm the mental model is in place:

- You can say which letter of ELT dbt handles, and why ingestion still needs its own code.
- You can explain, in one sentence, why ref() matters for build order.
- You can describe how an analytics engineer's work differs from a data engineer's and a data analyst's.

Extra reading

<aside> 🎬 Second-angle video: Introduction to Analytics Engineering and What is dbt? from the DataTalks.Club DE Zoomcamp. Same concepts as this chapter; the zoomcamp uses BigQuery/DuckDB instead of Azure PostgreSQL, but the dbt layer is identical.

</aside>

If the video explanations leave you hungry for more context, the supplementary reading list has the long-form material.

<aside> 💡 For full courses, books, and community resources on dbt and analytics engineering, see the optional Going Further page. None of it is required for Week 10.

</aside>


In the next chapter you go from "nothing installed" to a working dbt project: install dbt Core with the PostgreSQL adapter, connect to the shared Azure PostgreSQL database from Week 6, set up a personal dev schema, and run your first model.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*


Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.