Week 9 - SQL for Analytics

Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. Data Modeling Concepts

5. Building Views in Azure PostgreSQL

6. Gotchas & Pitfalls

7. Practice

8. Assignment

Lesson Plan

8. Practice

In this section, you'll apply SQL techniques to the raw_trips and raw_zonesdataset tables that we used in this chapter.

Exercise 1: Basic Exploration & Validation

Concepts: SELECT, COUNT, IS NULL, GROUP BY

Goal: Understand the "grain" of your data and ensure quality before analysis.

Instructions:

  1. Identify the Grain: Verify if each row is a unique trip. Count the total rows and compare it to the count of unique vendor_id and pickup_datetime combinations.
  2. Missing Data: Write a query to find all trips where pickup_location_id or dropoff_location_id is NULL.
  3. Data Integrity: Check for "orphaned" IDs. Find any pickup_location_id in the raw_trips table that does not exist in the zones table.

Exercise 2: Joining and Denormalizing

Concepts: INNER JOIN, LEFT JOIN, Table Aliasing

Goal: Bring human-readable context to your numeric ID columns.

Instructions:

  1. Zone Mapping: Write a query that joins raw_trips with the raw_zones table to show the vendor_id, dropff_datetime, and the Zone Name for each dropff location.
  2. Borough Revenue: Which Borough had the highest average fare_amount? Join the tables and aggregate the revenue by the borough column.

Exercise 3: Common Table Expressions (CTEs)

Concepts: WITH clause, Readability, Multistep Aggregation

Goal: Refactor complex logic into readable, modular steps.

Instructions:

  1. Daily Trends: Use a CTE named daily_stats to calculate the total number of trips and total fare_amount per day.
  2. The Busiest Day: Query that CTE to find the single date with the highest trip count.
  3. Refactor Challenge: Take a nested subquery (e.g., finding trips above the average trip distance) and rewrite it using a CTE for better readability.

Exercise 4: Building the Analytics Layer (Views)

Concepts: CREATE VIEW, Dimensional Modeling

Goal: Abstract the complexity of the raw data for end-users.

Instructions:

  1. Create vw_fact_trips: Create a view that includes all columns from raw_trips plus a calculated column total_surcharge (sum of extra, mta_tax, and improvement_surcharge).
  2. Create vw_dim_zones: Create a view for the zones table, ensuring column names are clean (e.g., zone_id, zone_name, borough).
  3. Analyze: Query your new views to find the top 5 pickup zones by trip count.

Exercise 5: Performance & Optimization

Concepts: EXPLAIN, Join Logic, Cartesian Products

Goal: Understand how the database engine executes your intent.

Instructions:

  1. The "Accidental" Cross Join: Run an EXPLAIN on a join between raw_trips and zones where you intentionally omit the ON condition. Observe the predicted row count.
  2. Filtered Join: Run EXPLAIN on a join that includes a specific filter (e.g., WHERE pickup_datetime > '2024-01-01').
  3. Comparison: Note the difference in "Cost" between the two plans in the output.

💡 Conceptual Check-In

Q: What is the grain of the trips table?

Q: If you join trips to zones without specifying a join condition, how many rows do you get and why?

Q: Why should you validate for orphaned location IDs before building a fact view?

Q: The raw_trips table uses pickup_datetime and dropoff_datetime. If you need to report "Trips per Hour," why is it better to use a dedicated Date Dimension table rather than just extracting the hour from the timestamp?


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.