Week 10 - SQL Transformations with dbt
Week 10 Assignment: Build a Daily Borough Stats Mart with dbt
The chapter hands-on walked you through building fct_trips, a one-row-per-trip mart, step by step. The practice exercises drilled specific skills (macros, singular tests, debugging refs, propagating column changes) against that same project.
This assignment asks you to build a second, independent dbt project end to end, against the same raw_trips and raw_zones source data, but producing a mart at a different grain: one row per pickup_borough per date. You will redo every step (sources → staging → mart → tests → docs) on your own, without the chapter's line-by-line guidance. The goal is to prove you can take a business question and deliver a versioned, tested, documented dbt project that answers it.
By the end of this assignment, you should be able to:
- deliver a versioned, tested, documented project that passes dbt build and that a reader can understand without asking you questions.

Your stakeholder (imagine a product manager for NYC Green Taxi) wants a dashboard showing, for each day in January 2024:
- Total fare_amount per borough per day.
- Average tip_pct per borough per day.
- Average trip_distance per borough per day.

Grain of the mart: one row per (pickup_borough, pickup_date).
<aside>
💡 Do not reuse your Week 10 project folder. Start a fresh nyc_taxi_borough_daily/ project. You will reuse the connection settings from Chapter 2, but the models, tests, and docs are new.
</aside>
Create a new folder outside your Week 10 project (e.g. ~/projects/nyc_taxi_borough_daily/). Run dbt init nyc_taxi_borough_daily --skip-profile-setup, then add a project-local profiles.yml pointing at the same shared Azure PostgreSQL instance and a personal dev_<your_name>_borough schema.
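A minimal project-local profiles.yml might look like the sketch below. The host, user, and database values are placeholders — substitute the shared connection details from Chapter 2:

```yaml
# profiles.yml -- keep this file git-ignored; commit a sanitized profiles.yml.example instead
nyc_taxi_borough_daily:
  target: dev
  outputs:
    dev:
      type: postgres
      host: <your-server>.postgres.database.azure.com   # placeholder
      port: 5432
      user: <your_user>                                  # placeholder
      password: "{{ env_var('PG_PASSWORD') }}"           # never hard-code the password
      dbname: <your_database>                            # placeholder
      schema: dev_<your_name>_borough
      threads: 4
      sslmode: require                                   # Azure PostgreSQL enforces SSL
```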
<aside>
⚠️ Delete the dbt init scaffolding. The generator creates models/example/my_first_dbt_model.sql + my_second_dbt_model.sql + a schema.yml with tests attached. If you leave these in place, dbt build will report FAIL 1 not_null_my_first_dbt_model_id and block your submission's ERROR=0 requirement. Run rm -rf models/example/ and remove the example: block from dbt_project.yml's models: config before you continue.
</aside>
<aside>
⚠️ Do not reuse the dev_<your_name> schema from Week 10 chapter work, so the assignment builds are isolated. A separate schema also makes the reference repo's checkpoint tags irrelevant to your submission.
</aside>
Run dbt debug and confirm All checks passed!.
Create models/staging/_sources.yml that declares raw_trips and raw_zones in public. This mirrors the "Define sources" step for the Week 9 raw tables: the point is that you write it again from scratch, not copy-paste.
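A sketch of the sources file, assuming the raw tables live in the public schema as in Week 9:

```yaml
# models/staging/_sources.yml
version: 2

sources:
  - name: public            # schema holding the Week 9 raw tables
    tables:
      - name: raw_trips
      - name: raw_zones
```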
Create two staging models:
- stg_trips.sql: one row per trip. Include the columns you will need for the mart: pickup_datetime, pickup_location_id, fare_amount, tip_amount, trip_distance, plus a derived tip_pct using your own safe_divide macro (copy the macro from your Week 10 project). Drop rows where pickup_location_id is null or fare_amount < 0.
- stg_zones.sql: one row per TLC zone. Expose at least location_id and borough.

Materialize both as view.
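One possible shape for stg_trips.sql, assuming your safe_divide macro takes a numerator and denominator expression:

```sql
-- models/staging/stg_trips.sql
{{ config(materialized='view') }}

select
    pickup_datetime,
    pickup_location_id,
    fare_amount,
    tip_amount,
    trip_distance,
    {{ safe_divide('tip_amount', 'fare_amount') }} as tip_pct
from {{ source('public', 'raw_trips') }}
where pickup_location_id is not null
  and fare_amount >= 0    -- drops the negative-fare rows
```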
Create models/marts/fct_daily_borough_stats.sql materialized as table, grain one row per (pickup_borough, pickup_date). Join stg_trips to stg_zones on pickup_location_id, then aggregate. Columns to expose:
| Column | Type | Definition |
|---|---|---|
| pickup_borough | text | Borough from stg_zones.borough |
| pickup_date | date | pickup_datetime::date |
| trip_count | bigint | count(*) |
| total_fare | numeric | sum(fare_amount) |
| avg_tip_pct | numeric | avg(tip_pct) |
| avg_trip_distance | numeric | avg(trip_distance) |
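Under the assumption that the staging models expose the columns listed above, the mart can be sketched as:

```sql
-- models/marts/fct_daily_borough_stats.sql
{{ config(materialized='table') }}

select
    z.borough               as pickup_borough,
    t.pickup_datetime::date as pickup_date,
    count(*)                as trip_count,
    sum(t.fare_amount)      as total_fare,
    avg(t.tip_pct)          as avg_tip_pct,
    avg(t.trip_distance)    as avg_trip_distance
from {{ ref('stg_trips') }} t
join {{ ref('stg_zones') }} z
  on t.pickup_location_id = z.location_id
group by 1, 2
```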
<aside>
⌨️ Sanity check: after dbt run --select +fct_daily_borough_stats, query your mart and confirm Manhattan has the highest trip_count on most days. If it does not, you likely joined on the wrong key or filtered too aggressively in staging.
</aside>
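One way to run that sanity check from psql against your dev schema (DISTINCT ON is Postgres-specific):

```sql
-- Top borough by trip_count for each day; expect Manhattan on most rows
select distinct on (pickup_date)
       pickup_date,
       pickup_borough,
       trip_count
from dev_<your_name>_borough.fct_daily_borough_stats
order by pickup_date, trip_count desc;
```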
Attach at least the following to your staging and mart models via YAML schema files:
- not_null on every column used as a join or group-by key.
- A uniqueness test on the pair (pickup_borough, pickup_date) using dbt_utils.unique_combination_of_columns. This is the mart's primary key; if it fails, your GROUP BY is wrong.
- One singular test in tests/ that expresses a business rule your generic tests cannot, e.g. avg_tip_pct is never below 0 or above 1.

Install dbt_utils via packages.yml + dbt deps before running the compound uniqueness test.
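The mart's YAML schema file might carry those tests like this (file name matches the deliverables tree):

```yaml
# models/marts/_fct_daily_borough_stats.yml
version: 2

models:
  - name: fct_daily_borough_stats
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - pickup_borough
            - pickup_date
    columns:
      - name: pickup_borough
        tests: [not_null]
      - name: pickup_date
        tests: [not_null]
```

packages.yml needs an entry such as `- package: dbt-labs/dbt_utils` (pin a version compatible with your dbt release), followed by dbt deps.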
Run dbt build (not dbt run + dbt test separately) and ensure the summary ends with ERROR=0. Warnings are acceptable.
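The singular test from the previous step can be sketched as follows, assuming tip_pct is stored as a 0–1 fraction:

```sql
-- tests/assert_avg_tip_pct_within_bounds.sql
-- A dbt singular test passes when it returns zero rows
select pickup_borough, pickup_date, avg_tip_pct
from {{ ref('fct_daily_borough_stats') }}
where avg_tip_pct < 0
   or avg_tip_pct > 1
```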
- Add a description: on every column of fct_daily_borough_stats in its YAML schema file. Descriptions must explain what the column means and what its units are, not restate the column name.
- Add a model-level description: with the mart's grain, source lineage, and known caveats (rows dropped in staging, any WARN-severity tests, etc.).
- Run dbt docs generate and dbt docs serve. Take a screenshot of the lineage graph showing raw_trips + raw_zones → stg_* → fct_daily_borough_stats, and save it as docs/lineage.png in your repo.

Query your mart to answer the stakeholder's questions. Save the queries and results in reports/answers.md:
- What is the overall total_fare across all of January 2024?
- Which day had the highest trip_count (sum across all boroughs)?
- What is the highest avg_tip_pct seen for any (borough, day) combination, and on which borough/date?
- How does trip_count compare for Manhattan vs Brooklyn?

Each answer must include: the SQL you ran, the result, and one sentence interpreting it.
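For example, the first entry in reports/answers.md might pair a query like this with its result and a one-line reading:

```sql
-- Q1: overall total_fare across all of January 2024
select sum(total_fare) as jan_total_fare
from dev_<your_name>_borough.fct_daily_borough_stats
where pickup_date >= '2024-01-01'
  and pickup_date <  '2024-02-01';
```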
Create AI_ASSIST.md describing your AI usage on this assignment.
<aside>
⚠️ Do not paste any real data into an LLM: the NYC TLC dataset is public so sample rows are safe here, but the discipline of scrubbing data is what you must practise.
</aside>
```
nyc_taxi_borough_daily/
├── .gitignore                 # must exclude profiles.yml
├── dbt_project.yml
├── packages.yml
├── profiles.yml.example       # commit a sanitized copy (no password)
├── macros/
│   └── safe_divide.sql
├── models/
│   ├── staging/
│   │   ├── _sources.yml
│   │   ├── _stg_trips.yml
│   │   ├── _stg_zones.yml
│   │   ├── stg_trips.sql
│   │   └── stg_zones.sql
│   └── marts/
│       ├── _fct_daily_borough_stats.yml
│       └── fct_daily_borough_stats.sql
├── tests/
│   └── assert_avg_tip_pct_within_bounds.sql
├── docs/
│   └── lineage.png
├── reports/
│   └── answers.md
└── AI_ASSIST.md
```
- dbt build must exit 0 (ERROR=0 in the summary line). Any WARN-severity failures must be called out in reports/answers.md with one sentence explaining whether they represent a real data issue or a known limitation.
- profiles.yml must be git-ignored; commit profiles.yml.example with password: "{{ env_var('PG_PASSWORD') }}" instead.
- docs/lineage.png
- reports/answers.md with the four business-question answers
- AI_ASSIST.md

Your mentor reviews against a detailed rubric covering correctness, modelling, testing, documentation, reproducibility, business-answer quality, and AI discipline. Ask them for specific feedback on any dimension you want to improve: they will share rubric-aligned notes in the PR review.
Submit your project under week10/your-name in your submission repo.