Week 10 - SQL Transformations with dbt
Week 10 Assignment: Build a Daily Borough Stats Mart with dbt
The chapter hands-on walked you through building fct_trips, a one-row-per-trip mart, step by step. The practice exercises drilled specific skills (macros, singular tests, debugging refs, propagating column changes) against that same project.
This assignment asks you to build a second, independent dbt project end to end, against the same raw_trips and raw_zones source data, but producing a mart at a different grain: one row per pickup_borough per date. You will redo every step (sources → staging → mart → tests → docs) on your own, without the chapter's line-by-line guidance. The goal is to prove you can take a business question and deliver a versioned, tested, documented dbt project that answers it.
By the end of this assignment, you should be able to:
- deliver a versioned, tested, documented project that passes dbt build and that a reader can understand without asking you questions.

Your stakeholder (imagine a product manager for NYC Green Taxi) wants a dashboard showing, for each day in January 2024:
- Total fare_amount per borough per day.
- Average tip_pct per borough per day.
- Average trip_distance per borough per day.

Grain of the mart: one row per (pickup_borough, pickup_date).
<aside>
💡 Do not reuse your Week 10 project folder. Start a fresh nyc_taxi_borough_daily/ project. You will reuse the connection settings from Chapter 2, but the models, tests, and docs are new.
</aside>
Create a new folder outside your Week 10 project (e.g. ~/projects/nyc_taxi_borough_daily/). Run dbt init nyc_taxi_borough_daily --skip-profile-setup, then add a project-local profiles.yml pointing at the same shared Azure PostgreSQL instance and a personal dev_<your_name>_borough schema.
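A minimal project-local profiles.yml might look like the sketch below. The host, user, and database values are placeholders — substitute the shared connection details from Chapter 2:

```yaml
# profiles.yml -- keep this file git-ignored; commit a sanitized profiles.yml.example instead
nyc_taxi_borough_daily:
  target: dev
  outputs:
    dev:
      type: postgres
      host: <your-server>.postgres.database.azure.com   # placeholder
      port: 5432
      user: <your_user>                                  # placeholder
      password: "{{ env_var('PG_PASSWORD') }}"           # never hard-code the password
      dbname: <your_database>                            # placeholder
      schema: dev_<your_name>_borough
      threads: 4
      sslmode: require                                   # Azure PostgreSQL enforces SSL
```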
<aside>
⚠️ Delete the dbt init scaffolding. The generator creates models/example/my_first_dbt_model.sql + my_second_dbt_model.sql + a schema.yml with tests attached. If you leave these in place, dbt build will report FAIL 1 not_null_my_first_dbt_model_id and block your submission's ERROR=0 requirement. Run rm -rf models/example/ and remove the example: block from dbt_project.yml's models: config before you continue.
</aside>
<aside>
⚠️ Do not reuse the dev_<your_name> schema from Week 10 chapter work, so the assignment builds are isolated. A separate schema also makes the reference repo's checkpoint tags irrelevant to your submission.
</aside>
Run dbt debug and confirm All checks passed!.
Create models/staging/_sources.yml that declares raw_trips and raw_zones in public. This mirrors the "Define sources" step for the Week 9 raw tables: the point is that you write it again from scratch, not copy-paste.
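A sketch of the sources file, assuming the raw tables live in the public schema as in Week 9:

```yaml
# models/staging/_sources.yml
version: 2

sources:
  - name: public            # schema holding the Week 9 raw tables
    tables:
      - name: raw_trips
      - name: raw_zones
```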
Create two staging models:
- stg_trips.sql: one row per trip. Include the columns you will need for the mart: pickup_datetime, pickup_location_id, fare_amount, tip_amount, trip_distance, plus a derived tip_pct using your own safe_divide macro (copy the macro from your Week 10 project). Drop rows where pickup_location_id is null or fare_amount < 0.
- stg_zones.sql: one row per TLC zone. Expose at least location_id and borough.

Materialize both as view.
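One possible shape for stg_trips.sql, assuming your safe_divide macro takes a numerator and denominator expression:

```sql
-- models/staging/stg_trips.sql
{{ config(materialized='view') }}

select
    pickup_datetime,
    pickup_location_id,
    fare_amount,
    tip_amount,
    trip_distance,
    {{ safe_divide('tip_amount', 'fare_amount') }} as tip_pct
from {{ source('public', 'raw_trips') }}
where pickup_location_id is not null
  and fare_amount >= 0    -- drops the negative-fare rows
```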
Create models/marts/fct_daily_borough_stats.sql materialized as table, grain one row per (pickup_borough, pickup_date). Join stg_trips to stg_zones on pickup_location_id, then aggregate. Columns to expose:
| Column | Type | Definition |
|---|---|---|
| pickup_borough | text | Borough from stg_zones.borough |
| pickup_date | date | pickup_datetime::date |
| trip_count | bigint | count(*) |
| total_fare | numeric | sum(fare_amount) |
| avg_tip_pct | numeric | avg(tip_pct) |
| avg_trip_distance | numeric | avg(trip_distance) |
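Under the assumption that the staging models expose the columns listed above, the mart can be sketched as:

```sql
-- models/marts/fct_daily_borough_stats.sql
{{ config(materialized='table') }}

select
    z.borough               as pickup_borough,
    t.pickup_datetime::date as pickup_date,
    count(*)                as trip_count,
    sum(t.fare_amount)      as total_fare,
    avg(t.tip_pct)          as avg_tip_pct,
    avg(t.trip_distance)    as avg_trip_distance
from {{ ref('stg_trips') }} t
join {{ ref('stg_zones') }} z
  on t.pickup_location_id = z.location_id
group by 1, 2
```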
<aside>
⌨️ Sanity check: after dbt run --select +fct_daily_borough_stats, query your mart and confirm Manhattan has the highest trip_count on most days. If it does not, you likely joined on the wrong key or filtered too aggressively in staging.
</aside>
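One way to run that sanity check from psql against your dev schema (DISTINCT ON is Postgres-specific):

```sql
-- Top borough by trip_count for each day; expect Manhattan on most rows
select distinct on (pickup_date)
       pickup_date,
       pickup_borough,
       trip_count
from dev_<your_name>_borough.fct_daily_borough_stats
order by pickup_date, trip_count desc;
```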
Attach at least the following to your staging and mart models via YAML schema files:
- not_null on every column used as a join or group-by key.
- A uniqueness test on the pair (pickup_borough, pickup_date) using dbt_utils.unique_combination_of_columns. This is the mart's primary key; if it fails, your GROUP BY is wrong.
- One singular test in tests/ that expresses a business rule your generic tests cannot, e.g. avg_tip_pct is never below 0 or above 1.

Install dbt_utils via packages.yml + dbt deps before running the compound uniqueness test.
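The mart's YAML schema file might carry those tests like this (file name matches the deliverables tree):

```yaml
# models/marts/_fct_daily_borough_stats.yml
version: 2

models:
  - name: fct_daily_borough_stats
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - pickup_borough
            - pickup_date
    columns:
      - name: pickup_borough
        tests: [not_null]
      - name: pickup_date
        tests: [not_null]
```

packages.yml needs an entry such as `- package: dbt-labs/dbt_utils` (pin a version compatible with your dbt release), followed by dbt deps.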
Run dbt build (not dbt run + dbt test separately) and ensure the summary ends with ERROR=0. Warnings are acceptable.
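The singular test from the previous step can be sketched as follows, assuming tip_pct is stored as a 0–1 fraction:

```sql
-- tests/assert_avg_tip_pct_within_bounds.sql
-- A dbt singular test passes when it returns zero rows
select pickup_borough, pickup_date, avg_tip_pct
from {{ ref('fct_daily_borough_stats') }}
where avg_tip_pct < 0
   or avg_tip_pct > 1
```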
- Add a description: on every column of fct_daily_borough_stats in its YAML schema file. Descriptions must explain what the column means and what its units are, not restate the column name.
- Add a model-level description: with the mart's grain, source lineage, and known caveats (rows dropped in staging, any WARN-severity tests, etc.).
- Run dbt docs generate and dbt docs serve. Take a screenshot of the lineage graph showing raw_trips + raw_zones → stg_* → fct_daily_borough_stats, and save it as docs/lineage.png in your repo.

Query your mart to answer the stakeholder's questions. Save the queries and results in reports/answers.md:
- What is the overall total_fare across all of January 2024?
- Which day had the highest trip_count (sum across all boroughs)?
- What is the highest avg_tip_pct seen for any (borough, day) combination, and on which borough/date?
- How does trip_count compare for Manhattan vs Brooklyn?

Each answer must include: the SQL you ran, the result, and one sentence interpreting it.
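For example, the first entry in reports/answers.md might pair a query like this with its result and a one-line reading:

```sql
-- Q1: overall total_fare across all of January 2024
select sum(total_fare) as jan_total_fare
from dev_<your_name>_borough.fct_daily_borough_stats
where pickup_date >= '2024-01-01'
  and pickup_date <  '2024-02-01';
```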
Create AI_ASSIST.md describing your AI usage on this assignment.
<aside>
⚠️ Do not paste any real data into an LLM: the NYC TLC dataset is public so sample rows are safe here, but the discipline of scrubbing data is what you must practise.
</aside>
```
nyc_taxi_borough_daily/
├── .gitignore                 # must exclude profiles.yml
├── dbt_project.yml
├── packages.yml
├── profiles.yml.example       # commit a sanitized copy (no password)
├── macros/
│   └── safe_divide.sql
├── models/
│   ├── staging/
│   │   ├── _sources.yml
│   │   ├── _stg_trips.yml
│   │   ├── _stg_zones.yml
│   │   ├── stg_trips.sql
│   │   └── stg_zones.sql
│   └── marts/
│       ├── _fct_daily_borough_stats.yml
│       └── fct_daily_borough_stats.sql
├── tests/
│   └── assert_avg_tip_pct_within_bounds.sql
├── docs/
│   └── lineage.png
├── reports/
│   └── answers.md
└── AI_ASSIST.md
```
- dbt build must exit 0 (ERROR=0 in the summary line). Any WARN-severity failures must be called out in reports/answers.md with one sentence explaining whether they represent a real data issue or a known limitation.
- profiles.yml must be git-ignored; commit profiles.yml.example with password: "{{ env_var('PG_PASSWORD') }}" instead.
- docs/lineage.png
- reports/answers.md with the four business-question answers
- AI_ASSIST.md

Your mentor reviews against a detailed rubric covering correctness, modelling, testing, documentation, reproducibility, business-answer quality, and AI discipline. Ask them for specific feedback on any dimension you want to improve: they will share rubric-aligned notes in the PR review.
Submit your project under week10/your-name in your submission repo.