Data Modeling Concepts

Welcome to the "Architect" phase of the week. In the previous chapters, you learned how to write queries and validate data. But if you just start writing SQL without a plan, your database will quickly turn messy, confusing, and inefficient. Data Modeling is the process of deciding how to structure your tables so they are fast, reliable, and easy for humans to understand.

The Three-Layer Model (Medallion Architecture)

In a professional data warehouse, raw data is never exposed directly to end users. Instead, data is organized into three distinct layers, each with a clear purpose:

Raw (Landing): This is the data in its original form, unchanged and unprocessed. It’s a direct copy of your source files (e.g. CSV or Parquet) and acts as a reliable backup.
Staging (Cleaned): This is where data gets prepared. You apply validation rules, standardize column names, cast data types (e.g. strings to dates), and remove invalid or corrupted records.
Marts (Business-Ready): Also known as the “Gold” layer, this is where data is structured for analysis. Tables are optimized for performance and clarity, making them ready for dashboards, reporting, and business use.

The usual flow is: Raw → Staging → Marts.
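To make the three layers concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (raw_trips, stg_trips, mart_zone_revenue), the columns, and the sample rows are all made up for illustration; a real warehouse would load actual source files instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw layer: an untouched copy of the source, everything stored as text.
conn.execute("CREATE TABLE raw_trips (pickup TEXT, fare TEXT, zone TEXT)")
conn.executemany(
    "INSERT INTO raw_trips VALUES (?, ?, ?)",
    [
        ("2024-01-01 08:15:00", "12.50", "Midtown"),
        ("2024-01-01 08:40:00", "-3.00", "Midtown"),  # invalid: negative fare
        ("2024-01-01 09:05:00", "20.00", "Harlem"),
    ],
)

# Staging layer: standardize column names, cast types, drop invalid records.
conn.execute("""
    CREATE TABLE stg_trips AS
    SELECT pickup AS pickup_at,
           CAST(fare AS REAL) AS fare_amount,
           zone AS pickup_zone
    FROM raw_trips
    WHERE CAST(fare AS REAL) >= 0
""")

# Mart layer: a business-ready summary built only from staging.
conn.execute("""
    CREATE TABLE mart_zone_revenue AS
    SELECT pickup_zone,
           COUNT(*) AS trip_count,
           SUM(fare_amount) AS total_fare
    FROM stg_trips
    GROUP BY pickup_zone
""")

mart_rows = conn.execute("""
    SELECT pickup_zone, trip_count, total_fare
    FROM mart_zone_revenue
    ORDER BY pickup_zone
""").fetchall()
print(mart_rows)
```

Notice that the mart never reads from raw_trips directly. Each layer only depends on the one before it, which keeps the cleaning rules in a single place.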

The Grain: how granular do we go?

The Grain is the level of detail represented by a single row. It is the most important decision you will make when designing a fact table.

Current Grain: One row = One individual taxi trip.
Aggregated Grain: One row = Total trips per hour per borough.

The grain defines what a table means. If you get it wrong, every metric built on top of it becomes unreliable.
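The two grains above can be compared side by side. This is a small sketch with invented rows and an assumed trips table (trip_id, pickup_at, borough), run through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, pickup_at TEXT, borough TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, "2024-01-01 08:15:00", "Manhattan"),
        (2, "2024-01-01 08:40:00", "Manhattan"),
        (3, "2024-01-01 08:55:00", "Queens"),
        (4, "2024-01-01 09:05:00", "Manhattan"),
    ],
)

# Trip grain: one row = one individual taxi trip.
trip_rows = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]

# Aggregated grain: one row = total trips per hour per borough.
hourly = conn.execute("""
    SELECT strftime('%Y-%m-%d %H:00', pickup_at) AS pickup_hour,
           borough,
           COUNT(*) AS trip_count
    FROM trips
    GROUP BY pickup_hour, borough
    ORDER BY pickup_hour, borough
""").fetchall()

print(trip_rows)
for row in hourly:
    print(row)
```

Same data, two different meanings: the first table answers questions about individual trips, the second can only answer questions about hours and boroughs.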

⚠️ The Danger of Mixed Grains: If you join a "Daily Weather" table (1 row per day) to a "Trips" table (1 row per trip) and then sum temperature, you will duplicate the daily temperature for every trip. This leads to inflated numbers that look mathematically correct but are completely wrong.

Always ensure that joins respect the grain of the fact table.
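Here is a small sketch of that danger, using Python's sqlite3 module. The daily_weather and trips tables and their values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_weather (weather_date TEXT PRIMARY KEY, temperature_c REAL);
CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, pickup_date TEXT);

INSERT INTO daily_weather VALUES ('2024-01-01', 5.0), ('2024-01-02', 7.0);
INSERT INTO trips VALUES
    (1, '2024-01-01'), (2, '2024-01-01'), (3, '2024-01-01'),
    (4, '2024-01-02');
""")

# Summing at the weather table's own grain: one value per day.
true_sum = conn.execute(
    "SELECT SUM(temperature_c) FROM daily_weather"
).fetchone()[0]

# Summing after joining to trips: each day's value repeats once per trip.
joined_sum = conn.execute("""
    SELECT SUM(w.temperature_c)
    FROM trips AS t
    JOIN daily_weather AS w ON w.weather_date = t.pickup_date
""").fetchone()[0]

print(true_sum, joined_sum)
```

The query runs without any error, and the second number is simply wrong: the temperature for January 1st was counted three times because there were three trips that day.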

<aside> 💡

Rule of thumb: If you cannot describe the grain in one clean sentence, it’s not well defined yet.

</aside>

The relationship between grain and joins

This is where things actually go wrong in real life. Joins are one of those things that feel easy but that can silently create issues without you realising. When they break, they don’t throw errors, but they mess up your numbers.

When everything is set up properly:

you have a fact table (lots of rows)
you join a dimension table (one row per key)

Each row in your fact table finds exactly one match. Nothing gets duplicated. You just add context and move on. Problems start the moment you stop thinking about the granularity.

Let’s say:

your trips table = 1 row per trip
another table = 1 row per day

Both are perfectly fine on their own. But if you join them as-is, you’re mixing levels of detail. And that’s where things get weird:

the same daily value gets repeated for every single trip
or rows multiply in ways you didn’t expect

Running aggregations like SUM() or COUNT() on this output then produces wrong results. That’s how you end up with inflated totals and incorrect averages.

Most mistakes come from two habits:

joining tables without checking if their grains match
aggregating after a join without realizing you’ve duplicated data
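One way to catch both habits early is to check the key on the "one row per key" side and to compare row counts before and after the join. The sketch below uses Python's sqlite3 module with a hypothetical zones table that accidentally contains a duplicated key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (trip_id INTEGER, zone_id INTEGER);
CREATE TABLE zones (zone_id INTEGER, zone_name TEXT);

INSERT INTO trips VALUES (1, 1), (2, 1), (3, 2);
-- zone_id 1 appears twice, so zones is NOT one row per key.
INSERT INTO zones VALUES (1, 'Midtown'), (1, 'Midtown Center'), (2, 'Harlem');
""")

# Check 1: does any key appear more than once on the dimension side?
duplicate_keys = conn.execute("""
    SELECT zone_id, COUNT(*) AS copies
    FROM zones
    GROUP BY zone_id
    HAVING COUNT(*) > 1
""").fetchall()

# Check 2: does the join return more rows than the fact table has?
rows_before = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
rows_after = conn.execute("""
    SELECT COUNT(*)
    FROM trips AS t
    JOIN zones AS z ON z.zone_id = t.zone_id
""").fetchone()[0]

print(duplicate_keys, rows_before, rows_after)
```

If the second check shows more rows after the join than before, some fact rows were duplicated, and any SUM() or COUNT() on the joined result will be inflated.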

A simple way to think about it:

Grain is like a “unit of measurement” for your table.

If you try to combine things with different units without converting them first, the result doesn’t make sense.

So before joining, ask yourself:

what does one row represent here?
and is that the same on both sides?

If not, you probably need to aggregate first or rethink the join.
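"Aggregate first" could look like the sketch below: the trips are rolled up to one row per day, and only then joined to the daily table, so both sides share the same grain. Table names, columns, and values are assumptions made for the example, run with Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_weather (weather_date TEXT PRIMARY KEY, temperature_c REAL);
CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, pickup_date TEXT, fare_amount REAL);

INSERT INTO daily_weather VALUES ('2024-01-01', 5.0), ('2024-01-02', 7.0);
INSERT INTO trips VALUES
    (1, '2024-01-01', 10.0), (2, '2024-01-01', 15.0), (3, '2024-01-01', 5.0),
    (4, '2024-01-02', 20.0);
""")

# Step 1 (inner query): bring trips to the daily grain.
# Step 2 (outer query): join day to day, so each row matches exactly once.
daily = conn.execute("""
    SELECT d.pickup_date, d.trip_count, d.total_fare, w.temperature_c
    FROM (
        SELECT pickup_date,
               COUNT(*) AS trip_count,
               SUM(fare_amount) AS total_fare
        FROM trips
        GROUP BY pickup_date
    ) AS d
    JOIN daily_weather AS w ON w.weather_date = d.pickup_date
    ORDER BY d.pickup_date
""").fetchall()

for row in daily:
    print(row)
```

The result has one row per day, so the temperature appears once per day and the trip totals are still correct.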

<aside> 💡

Big companies like Uber and Lyft use these exact same concepts, but their Fact tables have billions of rows. At that scale, a bad Join (the "Cartesian Explosion" we talked about in the previous chapter) could cost the company thousands of dollars in server fees for a single query. Modeling isn't just about "neatness", but also about financial efficiency!

</aside>

Facts vs. Dimensions

This is the foundation of dimensional modeling.

Facts: These are quantitative measurements of a business process. They represent what happened.
- Examples: fare_amount, trip_distance, duration_seconds, tip_amount
- Facts are usually numeric and additive (you can sum, average, etc.).
Dimensions: These provide context around the facts. They describe who, where, when, and how.
- Examples: borough_name, payment_type, pickup_zone, day_of_week

A useful mental model: Facts are the events, dimensions are the story around the events.

The Star Schema

The Star Schema is the standard structure for analytical models.

At the center is the Fact table, which contains measurable events at a defined grain. Around it are Dimension tables, which provide context.

[Figure: example star schema diagram]

Thinking of our dataset, the structure could be:

Fact: trips
Dimensions: zones, date, payment_type, weather

This creates a star-like structure:

One large fact table (millions of rows)
Several smaller dimension tables (hundreds or thousands of rows)

Why it works:

Joins are simple and predictable (always fact → dimension)
Queries are fast because dimensions are denormalized and small
Business users can understand it without knowing the underlying complexity
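As a rough sketch, a tiny star schema for the taxi data might look like this. Only two dimensions are shown to keep it short, and the exact columns and rows are assumptions for illustration (Python's sqlite3 module is used to run the SQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_zone (
    zone_key INTEGER PRIMARY KEY,
    borough_name TEXT
);
CREATE TABLE dim_payment_type (
    payment_type_key INTEGER PRIMARY KEY,
    payment_type TEXT
);
-- Fact table: one row per trip, with keys pointing at each dimension.
CREATE TABLE fact_trips (
    trip_id INTEGER PRIMARY KEY,
    zone_key INTEGER REFERENCES dim_zone (zone_key),
    payment_type_key INTEGER REFERENCES dim_payment_type (payment_type_key),
    fare_amount REAL,
    trip_distance REAL
);

INSERT INTO dim_zone VALUES (1, 'Manhattan'), (2, 'Queens');
INSERT INTO dim_payment_type VALUES (1, 'card'), (2, 'cash');
INSERT INTO fact_trips VALUES
    (1, 1, 1, 12.5, 2.0),
    (2, 1, 2, 8.0, 1.0),
    (3, 2, 1, 30.0, 9.5);
""")

# Every join goes fact -> dimension, and each fact row matches exactly once.
revenue = conn.execute("""
    SELECT z.borough_name, p.payment_type, SUM(f.fare_amount) AS total_fare
    FROM fact_trips AS f
    JOIN dim_zone AS z ON z.zone_key = f.zone_key
    JOIN dim_payment_type AS p ON p.payment_type_key = f.payment_type_key
    GROUP BY z.borough_name, p.payment_type
    ORDER BY z.borough_name, p.payment_type
""").fetchall()

for row in revenue:
    print(row)
```

The facts (fare_amount) are what gets summed, and the dimensions (borough_name, payment_type) are what the result is grouped by.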

Keys: how tables actually connect

To make everything join correctly, we rely on keys.

Natural Key: Comes from the real world.
- Example: location_id from a taxi dataset
Surrogate Key: An artificial ID created in your warehouse.
- Example: zone_key = 10392
- Why it matters: it never changes (unlike IDs in source systems), it avoids messy upstream logic changes, and it is safer for joins.
Primary Key: The column(s) that uniquely identify a row.

In a Star Schema, the Primary Key of a dimension table (the unique ID for the row) should almost always be a Surrogate Key (an artificial ID) rather than a Natural Key (like location_id or an email address).
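A minimal sketch of how this could work in practice: the warehouse generates zone_key itself, keeps the natural key (location_id) as a regular column, and the fact table stores only the surrogate key. The table layout, zone names, and values below are invented for the example, and letting SQLite number the rows is just one possible way to generate surrogate keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# zone_key is the surrogate key, assigned by the warehouse.
# location_id is the natural key that arrives from the source system.
conn.execute("""
    CREATE TABLE dim_zone (
        zone_key INTEGER PRIMARY KEY AUTOINCREMENT,
        location_id INTEGER UNIQUE,
        zone_name TEXT
    )
""")
conn.executemany(
    "INSERT INTO dim_zone (location_id, zone_name) VALUES (?, ?)",
    [(161, "Zone A"), (42, "Zone B")],
)

conn.execute(
    "CREATE TABLE stg_trips (trip_id INTEGER, location_id INTEGER, fare_amount REAL)"
)
conn.executemany(
    "INSERT INTO stg_trips VALUES (?, ?, ?)",
    [(1, 42, 10.0), (2, 161, 15.0), (3, 42, 7.5)],
)

# Loading the fact table: look up the surrogate key via the natural key once,
# so later queries join on zone_key only.
conn.execute("""
    CREATE TABLE fact_trips AS
    SELECT s.trip_id, z.zone_key, s.fare_amount
    FROM stg_trips AS s
    JOIN dim_zone AS z ON z.location_id = s.location_id
""")

fact_rows = conn.execute(
    "SELECT trip_id, zone_key, fare_amount FROM fact_trips ORDER BY trip_id"
).fetchall()
print(fact_rows)
```

If the source system ever renumbers its location_id values, only the lookup in the load step has to change. The fact table and every query that joins on zone_key stay the same.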

Naming conventions and documentation

Good models don’t just work, they communicate.

Recommended habits:

Prefix table names with their role: fact_trips, dim_zone, dim_date
Use snake_case everywhere, and include units where they help (trip_distance_km)
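Conventions are easier to keep when they are checked automatically. The helper below is a hypothetical sketch (naming_problems is not part of any library) that flags table names without a fact_ or dim_ prefix and any name that is not snake_case.

```python
import re

# Assumed convention: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
TABLE_PREFIXES = ("fact_", "dim_")


def naming_problems(table, columns):
    """Return a list of human-readable naming problems for one table."""
    problems = []
    if not table.startswith(TABLE_PREFIXES):
        problems.append(f"table '{table}' should start with fact_ or dim_")
    for name in [table, *columns]:
        if not SNAKE_CASE.match(name):
            problems.append(f"'{name}' is not snake_case")
    return problems


print(naming_problems("fact_trips", ["trip_distance_km", "fare_amount"]))
print(naming_problems("Trips", ["tripDistance"]))
```

The first call returns an empty list. The second reports the missing prefix and both names that break the snake_case rule.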