Week 9 - SQL for Analytics

Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. Data Modeling Concepts

5. Building Views in Azure PostgreSQL

6. Gotchas & Pitfalls

7. Practice

8. Assignment

Lesson Plan

Data Modeling Concepts

Welcome to the "Architect" phase of the week. In the previous chapters, you learned how to write queries and validate data. But if you just start writing SQL without a plan, your database will quickly turn messy, confusing, and inefficient. Data Modeling is the process of deciding how to structure your tables so they are fast, reliable, and easy for humans to understand.

The Three-Layer Model (Medallion Architecture)

In a professional data warehouse, raw data is never exposed directly to end users. Instead, data is organized into three distinct layers, each with a clear purpose:

This is a representation of the usual flow:

image.png

The Grain: how granular do we go?

The Grain is the level of detail represented by a single row. It is the most important decision you will make when designing a fact table.

The grain defines what a table means. If you get it wrong, every metric built on top of it becomes unreliable.

⚠️ The Danger of Mixed Grains: If you join a "Daily Weather" table (1 row per day) to a "Trips" table (1 row per trip) and then sum temperature, you will duplicate the daily temperature for every trip. This leads to inflated numbers that look mathematically correct but are completely wrong.

Always ensure that joins respect the grain of the fact table.

<aside> 💡

Rule of thumb: If you cannot describe the grain in one clean sentence, it’s not well defined yet.

</aside>

The relationship between grain and joins

This is where things actually go wrong in real life. Joins are one of those things that feel easy but that can silently create issues without you realising. When they break, they don’t throw errors, but they mess up your numbers.

When everything is set up properly:

Each row in your fact table finds exactly one match. Nothing gets duplicated. You just add context and move on. Problems start the moment you stop thinking about the granalarity.

Let’s say:

Both are perfectly fine on their own. But if you join them as-is, you’re mixing levels of detail. And that’s where things get weird:

You can imagine how running aggregations like SUM(), COUNT()on this output causes results to be really wrong. That’s how you end up with inflated totals and incorrect averages.

Most mistakes come from two habits:

A simple way to think about it:

Grain is like a “unit of measurement” for your table.

If you try to combine things with different units without converting them first, the result doesn’t make sense.

So before joining, ask yourself:

If not, you probably need to aggregate first or rethink the join.

<aside> 💡

Big companies like Uber and Lyft use these exact same concepts, but their Fact tables have billions of rows. At that scale, a bad Join (the "Cartesian Explosion" we talked about in the previous chapter) could cost the company thousands of dollars in server fees for a single query. Modeling isn't just about "neatness", but also about financial efficiency!

</aside>

Facts vs. Dimensions

This is the foundation of dimensional modeling.

A useful mental model: Facts are the events, dimensions are the story around the events.

The Star Schema

The Star Schema is the standard structure for analytical models.

At the center is the Fact table, which contains measurable events at a defined grain. Around it are Dimension tables, which provide context. This is an example of how it can be visualised

image.png

Thinking of our dataset, the structure could be:

This creates a star-like structure:

Why it works:


Keys: how tables actually connect

To make everything join correctly, we rely on keys.

In a Star Schema, your Primary Key (the unique ID for the row) should almost always be a Surrogate Key (an artificial ID) rather than a Natural Key (like an email).

<aside> 💡

</aside>


Naming conventions and documentation

Good models don’t just work, they communicate.

Recommended habits: