9. Assignment

The Scenario

The NYC Taxi trips are loaded into PostgreSQL, but the analytics team can't use them yet. The raw table has negative fares, missing zone references, and column names that only make sense if you wrote the ingestion script. Your job: audit the raw data, clean up what you can through views, and hand the team a star schema they can query with confidence.

The Assignment

This assignment focuses on analytics engineering fundamentals: moving from raw data to the data structure used by analysts.

You will audit data quality, design a dimensional model, and implement that model using SQL Views. This layer serves as the "source of truth" for all downstream reporting.

By the end of this assignment, you should be able to:

Audit raw data: Write defensive SQL to identify dirty data (nulls, duplicates, outliers).
Dimensional modeling: Translate raw tables into a fact and dimension structure.
Implement SQL Views: Use CREATE VIEW to abstract complexity for end-users.
Data documentation: Define the grain and metadata of your analytical models.

<aside> ⚠️ Run all queries against raw_trips in your own schema (the one your teacher assigned you), not the shared public schema. Views you create must also live in your schema.

</aside>

Task 1: Data Quality Audit

Before modeling, you must understand the state of your raw_trips data: the same dataset used throughout the week. Write a single SQL script validation_queries.sql that answers the following:

Duplicate check: Are there any rows with the exact same vendor_id, pickup_datetime, and dropoff_datetime?
Null integrity: Count how many rows have a NULL pickup_location_id or dropoff_location_id.
Range validation: Find the minimum and maximum fare_amount. Are there negative values?
Relationship check: Are there any pickup_location_id values in the trips table that do not exist in the raw_zones table?

Task 2: Design the Star Schema

Organize the NYC Taxi data into a star schema to make it easier to query.

Fact table: vw_fact_trips (the quantitative events)
Dimension table: vw_dim_zones (the descriptive attributes)

Create a script schema_setup.sql with the following CREATE VIEW statements:

vw_dim_zones: Create a view based on raw_zones. Treat location_id as the primary key.
vw_fact_trips: Create a view based on raw_trips.
Filter: Exclude rows where fare_amount is less than 0.
Cast: Explicitly cast pickup_datetime as a TIMESTAMP.
Join-ready: The view must be joinable to vw_dim_zones on location_id. Verify with a test SELECT COUNT(*) FROM vw_fact_trips f JOIN vw_dim_zones d ON f.pickup_location_id = d.location_id after creating it.

<aside> 💡 Why Views? Views save storage and ensure that if the underlying raw_trips data is updated, your star schema reflects those changes instantly without a manual refresh.

</aside>

Task 3: Document the Data Dictionary

Create a file named data_dictionary.md. For both views, document:

The grain: What does a single row represent? (e.g., "One row per taxi trip".)
The keys: Identify the primary keys and foreign keys.
Measures: List the columns that can be aggregated (e.g., fare_amount, passenger_count).

<aside> 💡 Using AI to help: Ask an LLM to review your grain statement and check whether your key definitions are consistent with a proper star schema. Then verify its answer against your actual view columns. (⚠️ no real customer data, no PII)

</aside>

Task 4: Verification Queries

Query your new views to answer these specific questions. Save the queries in verification_results.sql with a comment above each one labeling the question it answers.

<aside> 💡 Borough names and zone names live in vw_dim_zones, not in vw_fact_trips. For borough- or zone-level breakdowns, join on pickup_location_id = location_id.

</aside>

Volume: How many total rows are in vw_fact_trips? How many rows per borough? What is the most common pickup/dropoff location combination?
Revenue: Which pickup zone (name, not ID) generated the highest total fare_amount? Which pickup zone had the highest total fare_amount in a single month?
Geospatial: What is the total number of trips and average trip_distance for each borough?
Seasonality: Which month in the dataset had the highest total tip_amount? What hour of the day has the highest average tip?

For question 1, take a screenshot of your query result showing the per-borough row counts. Save it as assets/borough_count.png.

Task 5: AI Assistance Log

Create a file named AI_ASSIST.md. While completing tasks 1-4, document one session where you used an LLM to help with a query or a design decision. Fill in these four sections:

The problem: What were you trying to solve? Paste the relevant SQL or schema fragment.
The prompt: What did you ask the AI? Include the context you provided.
The response: What did it suggest? Did it work first try?
Reflection: Did you understand why the suggestion worked, or did you accept it blindly?

<aside> ⚠️ Never paste real customer data or PII into an LLM. The NYC Taxi dataset used here is public data, safe to paste.

</aside>

Deliverables

Your submission should contain:

week9-attempt/
├── validation_queries.sql       (Task 1: validation queries)
├── schema_setup.sql             (Task 2: CREATE VIEW statements)
├── data_dictionary.md           (Task 3: grain, keys, measures)
├── verification_results.sql     (Task 4: verification queries)
├── assets/
│   └── borough_count.png        (Task 4: screenshot of borough query result)
└── AI_ASSIST.md                 (Task 5: LLM session log)

Technical requirements

Database: All views must be created in your assigned schema in the Azure PostgreSQL instance.
Syntax: Use standard PostgreSQL syntax.
Formatting: SQL keywords capitalized (SELECT, FROM, JOIN), consistent indentation.

How you will be evaluated

Task	Points	What the teacher checks
Task 1: Data Quality Audit	20	All four queries run without error; findings match actual data
Task 2: Star Schema Views	30	Both views exist in your schema; fact view filters negatives; join-readiness test passes
Task 3: Data Dictionary	20	Grain is precise; keys are correctly identified; measures list is complete
Task 4: Verification Queries	20	Queries run and return results; screenshot shows borough counts
Task 5: AI Assist Log	10	All four sections filled; reflection shows real understanding

Total: 100 points. Passing: 60.

Extra reading

PostgreSQL CREATE VIEW: official syntax reference
Kimball dimensional modeling techniques: the canonical source for star schema design
10 Data Quality Checks in SQL: practical null, duplicate, range, and referential-integrity checks: the same patterns you apply in Task 1

Submission

Once you have given the assignment your best attempt (or the deadline is approaching):

Commit and push your branch.
Open a Pull Request against the cohort repo's main.
Share the PR URL with your teacher.

Next up: Introduction to dbt Core, where you replace hand-written SQL views with version-controlled dbt models that run, test, and document themselves automatically.

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.