Introduction to SQL for Analytics
2. Joins, CTEs, and Aggregations
5. Building Views in Azure PostgreSQL
In this section, you'll apply SQL techniques to the raw_trips and raw_zonesdataset tables that we used in this chapter.
Concepts: SELECT, COUNT, IS NULL, GROUP BY
Goal: Understand the "grain" of your data and ensure quality before analysis.
Instructions:
vendor_id and pickup_datetime combinations.pickup_location_id or dropoff_location_id is NULL.pickup_location_id in the raw_trips table that does not exist in the zones table.Concepts: INNER JOIN, LEFT JOIN, Table Aliasing
Goal: Bring human-readable context to your numeric ID columns.
Instructions:
raw_trips with the raw_zones table to show the vendor_id, dropff_datetime, and the Zone Name for each dropff location.fare_amount? Join the tables and aggregate the revenue by the borough column.Concepts: WITH clause, Readability, Multistep Aggregation
Goal: Refactor complex logic into readable, modular steps.
Instructions:
daily_stats to calculate the total number of trips and total fare_amount per day.Concepts: CREATE VIEW, Dimensional Modeling
Goal: Abstract the complexity of the raw data for end-users.
Instructions:
vw_fact_trips: Create a view that includes all columns from raw_trips plus a calculated column total_surcharge (sum of extra, mta_tax, and improvement_surcharge).vw_dim_zones: Create a view for the zones table, ensuring column names are clean (e.g., zone_id, zone_name, borough).Concepts: EXPLAIN, Join Logic, Cartesian Products
Goal: Understand how the database engine executes your intent.
Instructions:
EXPLAIN on a join between raw_trips and zones where you intentionally omit the ON condition. Observe the predicted row count.EXPLAIN on a join that includes a specific filter (e.g., WHERE pickup_datetime > '2024-01-01').Q: What is the grain of the trips table?
Q: If you join trips to zones without specifying a join condition, how many rows do you get and why?
Q: Why should you validate for orphaned location IDs before building a fact view?
Q: The
raw_tripstable usespickup_datetimeanddropoff_datetime. If you need to report "Trips per Hour," why is it better to use a dedicated Date Dimension table rather than just extracting the hour from the timestamp?
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors
Found a mistake or have a suggestion? Let us know in the feedback form.