Introduction to SQL for Analytics
2. Joins, CTEs, and Aggregations
5. Building Views in Azure PostgreSQL
Welcome to the "Architect" phase of the week. In the previous chapters, you learned how to write queries and validate data. But if you just start writing SQL without a plan, your database will quickly turn messy, confusing, and inefficient. Data Modeling is the process of deciding how to structure your tables so they are fast, reliable, and easy for humans to understand.
In a professional data warehouse, raw data is never exposed directly to end users. Instead, data is organized into three distinct layers, each with a clear purpose:
This is a representation of the usual flow:

The Grain is the level of detail represented by a single row. It is the most important decision you will make when designing a fact table.
The grain defines what a table means. If you get it wrong, every metric built on top of it becomes unreliable.
⚠️ The Danger of Mixed Grains: If you join a "Daily Weather" table (1 row per day) to a "Trips" table (1 row per trip) and then sum temperature, you will duplicate the daily temperature for every trip. This leads to inflated numbers that look mathematically correct but are completely wrong.
Always ensure that joins respect the grain of the fact table.
<aside> 💡
Rule of thumb: If you cannot describe the grain in one clean sentence, it’s not well defined yet.
</aside>
This is where things actually go wrong in real life. Joins are one of those things that feel easy but that can silently create issues without you realising. When they break, they don’t throw errors, but they mess up your numbers.
When everything is set up properly:
Each row in your fact table finds exactly one match. Nothing gets duplicated. You just add context and move on. Problems start the moment you stop thinking about the granalarity.
Let’s say:
Both are perfectly fine on their own. But if you join them as-is, you’re mixing levels of detail. And that’s where things get weird:
You can imagine how running aggregations like SUM(), COUNT()on this output causes results to be really wrong. That’s how you end up with inflated totals and incorrect averages.
Most mistakes come from two habits:
A simple way to think about it:
Grain is like a “unit of measurement” for your table.
If you try to combine things with different units without converting them first, the result doesn’t make sense.
So before joining, ask yourself:
If not, you probably need to aggregate first or rethink the join.
<aside> 💡
Big companies like Uber and Lyft use these exact same concepts, but their Fact tables have billions of rows. At that scale, a bad Join (the "Cartesian Explosion" we talked about in the previous chapter) could cost the company thousands of dollars in server fees for a single query. Modeling isn't just about "neatness", but also about financial efficiency!
</aside>
This is the foundation of dimensional modeling.
fare_amount, trip_distance, duration_seconds, tip_amountborough_name, payment_type, pickup_zone, day_of_weekA useful mental model: Facts are the events, dimensions are the story around the events.
The Star Schema is the standard structure for analytical models.
At the center is the Fact table, which contains measurable events at a defined grain. Around it are Dimension tables, which provide context. This is an example of how it can be visualised

Thinking of our dataset, the structure could be:
tripszones, date, payment_type, weatherThis creates a star-like structure:
Why it works:
To make everything join correctly, we rely on keys.
Natural Key: Comes from the real world
Example: location_id from a taxi dataset
Surrogate Key: Artificial ID created in your warehouse
Example: zone_key = 10392
Why it matters:
Primary Key: The column(s) that uniquely identify a row
In a Star Schema, your Primary Key (the unique ID for the row) should almost always be a Surrogate Key (an artificial ID) rather than a Natural Key (like an email).
<aside> 💡
</aside>
Good models don’t just work, they communicate.
Recommended habits:
fact_trips, dim_zone, dim_datetrip_distance_km)