Week 9 - SQL for Analytics

Introduction to SQL for Analytics

Joins, CTEs, and Aggregations

3. Data Validation Queries

5. Data Modeling Concepts

6. Building Views in Azure PostgreSQL

7. Gotchas & Pitfalls

8. Practice

9. Assignment

Lesson Plan

9. Assignment

The Scenario

The NYC Taxi trips are loaded into PostgreSQL, but the analytics team can't use them yet. The raw table has negative fares, missing zone references, and column names that only make sense if you wrote the ingestion script. Your job: audit the raw data, clean up what you can through views, and hand the team a star schema they can query with confidence.

The Assignment

This assignment focuses on analytics engineering fundamentals: moving from raw data to the data structure used by analysts.

You will audit data quality, design a dimensional model, and implement that model using SQL Views. This layer serves as the "source of truth" for all downstream reporting.

By the end of this assignment, you should be able to:

<aside> โš ๏ธ Run all queries against raw_trips in your own schema (the one your teacher assigned you), not the shared public schema. Views you create must also live in your schema.

</aside>

Task 1: Data Quality Audit

Before modeling, you must understand the state of your raw_trips data: the same dataset used throughout the week. Write a single SQL script validation_queries.sql that answers the following:

  1. Duplicate check: Are there any rows with the exact same vendor_id, pickup_datetime, and dropoff_datetime?
  2. Null integrity: Count how many rows have a NULL pickup_location_id or dropoff_location_id.
  3. Range validation: Find the minimum and maximum fare_amount. Are there negative values?
  4. Relationship check: Are there any pickup_location_id values in the trips table that do not exist in the raw_zones table?

Task 2: Design the Star Schema

Organize the NYC Taxi data into a star schema to make it easier to query.

Create a script schema_setup.sql with the following CREATE VIEW statements:

<aside> ๐Ÿ’ก Why Views? Views save storage and ensure that if the underlying raw_trips data is updated, your star schema reflects those changes instantly without a manual refresh.

</aside>

Task 3: Document the Data Dictionary

Create a file named data_dictionary.md. For both views, document:

<aside> ๐Ÿ’ก Using AI to help: Ask an LLM to review your grain statement and check whether your key definitions are consistent with a proper star schema. Then verify its answer against your actual view columns. (โš ๏ธ no real customer data, no PII)

</aside>

Task 4: Verification Queries

Query your new views to answer these specific questions. Save the queries in verification_results.sql with a comment above each one labeling the question it answers.

<aside> ๐Ÿ’ก Borough names and zone names live in vw_dim_zones, not in vw_fact_trips. For borough- or zone-level breakdowns, join on pickup_location_id = location_id.

</aside>

  1. Volume: How many total rows are in vw_fact_trips? How many rows per borough? What is the most common pickup/dropoff location combination?
  2. Revenue: Which pickup zone (name, not ID) generated the highest total fare_amount? Which pickup zone had the highest total fare_amount in a single month?
  3. Geospatial: What is the total number of trips and average trip_distance for each borough?
  4. Seasonality: Which month in the dataset had the highest total tip_amount? What hour of the day has the highest average tip?

For question 1, take a screenshot of your query result showing the per-borough row counts. Save it as assets/borough_count.png.

Task 5: AI Assistance Log

Create a file named AI_ASSIST.md. While completing tasks 1-4, document one session where you used an LLM to help with a query or a design decision. Fill in these four sections:

<aside> โš ๏ธ Never paste real customer data or PII into an LLM. The NYC Taxi dataset used here is public data, safe to paste.

</aside>

Deliverables

Your submission should contain:

week9-attempt/
โ”œโ”€โ”€ validation_queries.sql       (Task 1: validation queries)
โ”œโ”€โ”€ schema_setup.sql             (Task 2: CREATE VIEW statements)
โ”œโ”€โ”€ data_dictionary.md           (Task 3: grain, keys, measures)
โ”œโ”€โ”€ verification_results.sql     (Task 4: verification queries)
โ”œโ”€โ”€ assets/
โ”‚   โ””โ”€โ”€ borough_count.png        (Task 4: screenshot of borough query result)
โ””โ”€โ”€ AI_ASSIST.md                 (Task 5: LLM session log)

Technical requirements

How you will be evaluated

Task Points What the teacher checks
Task 1: Data Quality Audit 20 All four queries run without error; findings match actual data
Task 2: Star Schema Views 30 Both views exist in your schema; fact view filters negatives; join-readiness test passes
Task 3: Data Dictionary 20 Grain is precise; keys are correctly identified; measures list is complete
Task 4: Verification Queries 20 Queries run and return results; screenshot shows borough counts
Task 5: AI Assist Log 10 All four sections filled; reflection shows real understanding

Total: 100 points. Passing: 60.

Extra reading

Submission

Once you have given the assignment your best attempt (or the deadline is approaching):

  1. Commit and push your branch.
  2. Open a Pull Request against the cohort repo's main.
  3. Share the PR URL with your teacher.

Next up: Introduction to dbt Core, where you replace hand-written SQL views with version-controlled dbt models that run, test, and document themselves automatically.


The HackYourFuture curriculum is licensed underย CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with โค๏ธ by the HackYourFuture community ยท Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.