Week 10 - SQL Transformations with dbt
dbt Setup for Azure PostgreSQL
Incremental Models, Docs, and Snapshots
Week 10 Assignment: Build a Daily Borough Stats Mart with dbt
Week 10 Lesson Plan (Teachers)
In Week 9, you wrote analytical SQL by hand: queries, validation checks, and a fact and dimension view in Azure PostgreSQL. That worked for one week of exercises. It does not scale to a real project where you have dozens of models, tests, and dependencies between them.
This week introduces dbt (data build tool), the industry standard for managing SQL transformations as software. dbt turns your SQL into a versioned, tested, and documented project, using the same workflow discipline you already apply with git and pytest.
By the end of this chapter, you should be able to:
<aside> 🔁 Core program connection: Building on the SQL basics you learned in Core program Week 12 (and used in Week 9), dbt does not replace SQL; it extends it. Everything you write in dbt is still SQL that runs against the database.
</aside>
dbt is an open-source command-line tool that lets you write SQL transformations as files in a project, run them in dependency order against your data warehouse, test the results, and generate documentation.
<aside> 💡 Key insight: dbt only handles the T in ELT (Extract, Load, Transform). It does not extract or load data. You still need ingestion code (Python, like Weeks 6-8) to land raw data in the warehouse first. dbt picks up from there.
</aside>
In one sentence: dbt is the framework that allows you to treat your raw SQL queries as a professional, version-controlled software project.
You write a model as a plain select statement in a .sql file:

```sql
-- models/staging/stg_trips.sql
select
    pickup_datetime,
    dropoff_datetime,
    pickup_location_id,
    dropoff_location_id,
    fare_amount,
    trip_distance
from {{ source('nyc_taxi', 'raw_trips') }}
where pickup_location_id is not null
```
When you run dbt run, dbt compiles this file (resolving the templated references and wrapping the query in CREATE VIEW or CREATE TABLE) and sends the final SQL to your database. You never write the CREATE VIEW boilerplate yourself, and you never hand-manage the order in which models are built.
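For the model above, the statement that actually reaches PostgreSQL looks roughly like this (a sketch: the schema name `dev_yourname` is a placeholder, and dbt's real output adds quoting and transaction handling):

```sql
-- Approximate compiled output of models/staging/stg_trips.sql
-- when materialized as a view (schema name is a placeholder):
create view dev_yourname.stg_trips as (
    select
        pickup_datetime,
        dropoff_datetime,
        pickup_location_id,
        dropoff_location_id,
        fare_amount,
        trip_distance
    from nyc_taxi.raw_trips          -- source() resolved to the real table
    where pickup_location_id is not null
);
```

Notice that the `{{ source(...) }}` call has been replaced by the concrete table name, and the `create view` wrapper was added by dbt, not by you.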
You will see compilation in detail in SQL and Jinja Templating.
You can do everything dbt does with plain SQL files and a shell script. So why has dbt become the default for analytics engineering?
dbt projects are folders of .sql and .yml files. They live in git. Pull requests show diffs. Code review works the same way it does for any other codebase. In Week 9, your SQL views existed only inside the database, where there was no history of who changed what.
```diff
 -- models/staging/stg_trips.sql
 select
     pickup_datetime,
     dropoff_datetime,
     pickup_location_id,
     dropoff_location_id,
     fare_amount,
-    trip_distance
+    trip_distance,
+    tip_amount
 from {{ source('nyc_taxi', 'raw_trips') }}
 where pickup_location_id is not null
```
A reviewer sees exactly what changed, comments on the line, and approves or requests changes. With a raw CREATE VIEW in the database, the same change is invisible to anyone who is not watching the database schema directly.
When you write {{ ref('stg_trips') }} instead of hardcoding the table name, dbt builds a directed acyclic graph (DAG) of every model and runs them in the correct order. If model A depends on model B, dbt builds B first. You never write the order yourself.
Here is what a small DAG looks like for the NYC taxi project you will build this week:
```mermaid
graph LR
    src_trips[("source: raw_trips")]:::source
    src_zones[("source: raw_zones")]:::source
    stg_trips["stg_trips<br/>(rename + filter)"]:::staging
    stg_zones["stg_zones<br/>(rename)"]:::staging
    int_trips_enriched["int_trips_enriched<br/>(join zones)"]:::intermediate
    fct_trips["fct_trips<br/>(fact mart)"]:::mart
    src_trips --> stg_trips
    src_zones --> stg_zones
    stg_trips --> int_trips_enriched
    stg_zones --> int_trips_enriched
    int_trips_enriched --> fct_trips
    classDef source fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
    classDef staging fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef intermediate fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
    classDef mart fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
```
Reading left to right:

- Staging models read the raw tables through {{ source(...) }}.
- Intermediate models point with {{ ref(...) }} to the staging layer.
- Mart models point with {{ ref(...) }} to the intermediate (or staging) layer.

When you run dbt run, dbt reads the ref() and source() calls from every file, infers this graph, and builds each node in the correct dependency order (so that every parent is built before its children): stg_trips runs before int_trips_enriched, which runs before fct_trips. You never specify that order by hand.
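The core idea, ref() calls defining a build order, fits in a few lines of Python. This is a toy sketch using the chapter's model names (dbt's real compiler does far more, but the principle is the same: extract the refs, then topologically sort):

```python
import re
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Toy model files: model name -> SQL body (simplified from the chapter's DAG).
models = {
    "stg_trips": "select * from {{ source('nyc_taxi', 'raw_trips') }}",
    "stg_zones": "select * from {{ source('nyc_taxi', 'raw_zones') }}",
    "int_trips_enriched": (
        "select * from {{ ref('stg_trips') }} t "
        "join {{ ref('stg_zones') }} z on t.pickup_location_id = z.zone_id"
    ),
    "fct_trips": "select * from {{ ref('int_trips_enriched') }}",
}

def build_order(models: dict) -> list:
    """Infer each model's parents from its ref() calls, then topo-sort."""
    graph = {
        name: set(re.findall(r"\{\{\s*ref\('([^']+)'\)\s*\}\}", sql))
        for name, sql in models.items()
    }
    # TopologicalSorter guarantees every parent comes before its children.
    return list(TopologicalSorter(graph).static_order())

order = build_order(models)
print(order)
# Both staging models come before int_trips_enriched; fct_trips comes last.
```

Swap the join in int_trips_enriched for a different ref() and the order adapts automatically; that is exactly why you never hand-maintain a run script.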
<aside>
⌨️ Hands on: Look at the DAG diagram above. If the ingestion pipeline added a new column to raw_zones, which dbt models would need to be rebuilt? Trace the path from source to mart and write down the models in the order dbt would run them.
</aside>
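Once you know which models are affected, dbt's node selection syntax lets you rebuild exactly that slice of the graph. A couple of illustrative commands (model names from the chapter's DAG; run inside a configured dbt project):

```shell
# Rebuild stg_zones and everything downstream of it
dbt run --select stg_zones+

# Rebuild everything upstream of fct_trips, then fct_trips itself
dbt run --select +fct_trips
```

The trailing or leading `+` means "and its descendants" or "and its ancestors" respectively; without it, only the named model runs.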
dbt ships with built-in data tests that assert things like "this column is never null" or "this foreign key always matches a row in the parent table." You declare them next to the model, run dbt test, and dbt fails loudly if the data violates the assumption. This catches bad data before it reaches a dashboard.
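Declared next to the model, the four tests from the diagram below could look like this in a schema.yml file. This is a sketch: the column names follow the chapter, but zone_id is an assumed key column in stg_zones.

```yaml
# models/staging/schema.yml (sketch; zone_id is an assumed key column)
version: 2

models:
  - name: stg_trips
    columns:
      - name: trip_id
        tests:
          - unique
          - not_null
      - name: pickup_datetime
        tests:
          - not_null
      - name: payment_type
        tests:
          - accepted_values:
              values: [1, 2, 3, 4, 5, 6]
      - name: pickup_location_id
        tests:
          - relationships:
              to: ref('stg_zones')
              field: zone_id
```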
```mermaid
graph LR
    model["stg_trips<br/>(built by dbt run)"]:::staging
    t1["not_null<br/>pickup_datetime"]:::testpass
    t2["unique<br/>trip_id"]:::testpass
    t3["accepted_values<br/>payment_type in 1..6"]:::testfail
    t4["relationships<br/>pickup_location_id → zones"]:::testpass
    model --> t1
    model --> t2
    model --> t3
    model --> t4
    t1 --> pass1["✅ PASS"]:::pass
    t2 --> pass2["✅ PASS"]:::pass
    t3 --> fail1["❌ FAIL<br/>blocks the pipeline"]:::fail
    t4 --> pass3["✅ PASS"]:::pass
    classDef staging fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef testpass fill:#fff2cc,stroke:#d6b656,stroke-width:1px,color:#000
    classDef testfail fill:#fff2cc,stroke:#d6b656,stroke-width:1px,color:#000
    classDef pass fill:#d5e8d4,stroke:#82b366,stroke-width:2px,color:#000
    classDef fail fill:#f8cecc,stroke:#b85450,stroke-width:2px,color:#000
```
Each test runs a query against the model and checks whether the result is empty. If any row violates the assertion, the test fails, dbt test exits with a non-zero status, and your CI pipeline (or orchestrator in Week 11) stops before bad data reaches the next stage. You will meet each built-in test and learn how to write custom ones in dbt Tests.
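For instance, the built-in not_null test compiles to a query along these lines; dbt counts the rows it returns, and zero rows means the test passes (the schema name is a placeholder):

```sql
-- Approximate compiled form of the not_null test on pickup_datetime.
-- Every returned row is one failing record.
select pickup_datetime
from dev_yourname.stg_trips
where pickup_datetime is null
```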
dbt docs generate produces a static website showing every model, its columns, descriptions, and a clickable lineage graph from raw sources to final marts. Anyone on the team can see how a number on a dashboard was calculated: column descriptions, tests, and upstream dependencies all come from the YAML files you already maintain alongside your models. You will generate your own docs site in Docs & Extras.
dbt separates what your SQL does from where it runs. The same .sql file can build into your personal sandbox schema during development and into a shared production schema in CI, without changing a line of code: you just switch the target on the command line. You will configure your personal dev target in dbt Setup for Azure PostgreSQL; production targets are handled by Airflow in Week 11.
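Concretely, targets live in a profiles.yml file outside the git repository. Here is a sketch with placeholder values (profile name, host, user, and schema are assumptions; you will fill in the real Azure values next chapter):

```yaml
# ~/.dbt/profiles.yml (sketch; all names and hosts are placeholders)
nyc_taxi:
  target: dev              # default target for dbt run / dbt test
  outputs:
    dev:
      type: postgres
      host: your-server.postgres.database.azure.com
      port: 5432
      user: your_username
      password: "{{ env_var('DBT_PASSWORD') }}"   # never hardcode secrets
      dbname: nyc_taxi
      schema: dev_yourname   # your personal sandbox schema
      threads: 4
```

Switching to a hypothetical prod output would be `dbt run --target prod`: same models, different schema, zero SQL changes.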
<aside> 💭 None of these features are unique to dbt. You could build them yourself. dbt is valuable because it bundles all of them into one tool with a consistent workflow that the whole industry has standardized on.
</aside>
The workflow you just read about (version control, tests, documentation, modular SQL models) has a name: analytics engineering. It describes a role that sits between data engineers (who move and store data) and data analysts (who interpret it and build dashboards).
An analytics engineer takes raw data that a data engineer has landed in the warehouse and turns it into clean, well-tested, documented models that analysts and BI tools can query directly. The output is not a one-off chart or a CSV, it is a maintained set of SQL models that the whole company depends on.
| Role | Main question | Primary tools | Output |
|---|---|---|---|
| Data engineer | How does raw data get into the warehouse, reliably? | Python, Airflow, Kafka, cloud storage, Grafana | Pipelines, raw tables |
| Analytics engineer | How do I turn raw data into trusted models for analysis? | dbt, SQL, git, CI/CD | Staging models, marts, tests, docs |
| Data analyst | What do the numbers mean, and what should the business do? | BI tools (Tableau, Looker, Power BI), SQL | Dashboards, reports, insights |
In practice, small teams blur these boundaries: one person often does all three. The point is that the work described by the middle column exists at every company with a data warehouse, and dbt is the industry-standard tool for doing it, which is why this week exists.
Here is the full picture of where dbt sits in a modern data stack, and how it connects to the other weeks in this course:
```mermaid
flowchart LR
    sources["Sources<br/>(APIs, files)"]:::external
    ingestion["Ingestion<br/>(Python, Airflow)<br/>*Weeks 6-8*"]:::ingest
    dbt["**dbt**<br/>(transform, test, doc)<br/>*Week 10 (this week)*"]:::dbt
    dashboards["Dashboards<br/>(Grafana)<br/>*Week 12*"]:::output
    warehouse[("Azure PostgreSQL<br/>raw_trips, raw_zones<br/>*Week 9*")]:::storage
    sources --> ingestion --> warehouse
    warehouse --> dbt --> dashboards
    classDef external fill:#e1d5e7,stroke:#9673a6,stroke-width:2px,color:#000
    classDef ingest fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px,color:#000
    classDef dbt fill:#d5e8d4,stroke:#82b366,stroke-width:3px,color:#000
    classDef output fill:#fff2cc,stroke:#d6b656,stroke-width:2px,color:#000
    classDef storage fill:#f5f5f5,stroke:#666,stroke-width:2px,color:#000
```
The NYC taxi data is already loaded in raw_trips and raw_zones from Week 9. This week, you build dbt models on top of those tables: stg_trips and stg_zones (cleaning and renaming), then fct_trips (joining them into a business-ready mart). In Week 12, that mart will become the data source for a dashboard.
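On disk, the project you will build could be laid out like this. This is a sketch following dbt's standard folder conventions; the project name and exact file split are placeholders:

```text
nyc_taxi_dbt/
├── dbt_project.yml            # project name, model paths, materializations
└── models/
    ├── staging/
    │   ├── stg_trips.sql
    │   ├── stg_zones.sql
    │   └── schema.yml         # sources, column descriptions, tests
    └── marts/
        └── fct_trips.sql
```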
<aside> 🔁 Core Program Refresher: In the Core program, you used functions and modules to organize your code and hide complexity. The mental model is similar: a dbt model is a "public interface" that hides messy SQL joins and cleaning logic behind a simple, stable table name. Downstream users only need to know the name of the model, not the complex code inside.
</aside>
Every later chapter includes a short "Using AI to help" pointer next to the skill it applies to; here is the week-level version:
<aside> 💡 Using AI to help: Throughout the week you will see concrete "paste this kind of prompt" callouts in the chapters that practice a specific skill. The short version: LLMs are great at explaining compiled SQL, drafting YAML descriptions, and suggesting unit-test cases; they are not great at choosing materializations or writing tests that match your actual data. Use them for the first category; do the second yourself. ⚠️ Do not paste real passwords, connection strings, or customer data into any LLM, ever.
</aside>
Before moving on, a short detour on where dbt came from:
<aside> 🤓 Curious Geek: where dbt came from
dbt started inside a Philadelphia consultancy called Fishtown Analytics in 2016. Founder Tristan Handy was tired of writing the same CREATE VIEW boilerplate by hand for every client and built a Python wrapper that let him reference one model from another with {{ ref(...) }}. The wrapper was open-sourced as dbt-core; Fishtown later renamed itself dbt Labs and the company is now valued at $4.2B. The killer feature was not the Jinja templating (every templating engine can do that): it was making {{ ref() }} define the DAG. Once one tool owned the dependency graph, version control, testing, and docs all followed naturally.
</aside>
Before moving on, check yourself: could you explain to a teammate why a dbt model under version control beats a hand-written CREATE VIEW statement in the database? The next chapter moves from concepts to a working install. No state to check yet: just confirm the mental model is in place:
<aside> 🎬 Second-angle video: Introduction to Analytics Engineering and What is dbt? from the DataTalks.Club DE Zoomcamp. Same concepts as this chapter; the zoomcamp uses BigQuery/DuckDB instead of Azure PostgreSQL, but the dbt layer is identical.
</aside>
If the video explanations leave you hungry for more context, the supplementary reading list has the long-form material.
<aside> 💡 For full courses, books, and community resources on dbt and analytics engineering, see the optional Going Further page. None of it is required for Week 10.
</aside>
In the next chapter you go from "nothing installed" to a working dbt project: install dbt Core with the PostgreSQL adapter, connect to the shared Azure PostgreSQL database from Week 6, set up a personal dev schema, and run your first model.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.