Database normalization is the process of structuring a database to reduce data redundancy and improve data integrity. It is done in stages, which are called “normal forms”. We will cover only the first three (there’s 9 or so in total), as they address the most common design challenges.
We start with an example: a music streaming startup called “Postify” which has collected all its data into a single, unnormalised table. Whenever someone listens to a song, they record it in this table.
streams| user_name | user_email | user_country | track_title | track_duration_s | artist_name | artist_country | album_title | album_year | genre | label_name | label_hq | streamed_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| alice | [email protected] | NL | Blinding Lights | 200 | The Weeknd | CA | After Hours | 2020 | synth-pop | Republic Records | New York | 2024-03-01 09:00:00 |
| alice | [email protected] | NL | Save Your Tears | 215 | The Weeknd | CA | After Hours | 2020 | synth-pop | Republic Records | New York | 2024-03-01 09:04:00 |
| bob | [email protected] | DE | Blinding Lights | 200 | The Weeknd | CA | After Hours | 2020 | synth-pop | Republic Records | New York | 2024-03-01 10:00:00 |
| alice | [email protected] | NL | good 4 u | 178 | Olivia Rodrigo | US | SOUR | 2021 | pop-punk | Geffen Records | Los Angeles | 2024-03-02 08:15:00 |
| carol | [email protected] | FR | drivers license | 242 | Olivia Rodrigo | US | SOUR | 2021 | pop-punk | Geffen Records | Los Angeles | 2024-03-02 11:00:00 |
Before normalising, it helps to first look at the problems explicitly:
alice's email and country appear in rows 1, 2, and 4. If she changes her email, every row must be updated.label_hq depends on label_name, not on the stream.Normalisation tries to solve these problems.
Rules:
The table already satisfies atomicity: each column holds only one value. There are for instance no comma-lists in any cells. This table already satisfies first two rules! Each column is consistent (same data type in every row), and each cell is atomic.
There is however no primary key yet. Instead, let’s assign a stream_id as the PK. The result:
| stream_id | user_name | user_email | user_country | track_title | track_duration_s | artist_name | artist_country | album_title | album_year | genre | label_name | label_hq | streamed_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | alice | [email protected] | NL | Blinding Lights | 200 | The Weeknd | CA | After Hours | 2020 | synth-pop | Republic Records | New York | 2024-03-01 09:00:00 |
| 2 | alice | [email protected] | NL | Save Your Tears | 215 | The Weeknd | CA | After Hours | 2020 | synth-pop | Republic Records | New York | 2024-03-01 09:04:00 |
| 3 | bob | [email protected] | DE | Blinding Lights | 200 | The Weeknd | CA | After Hours | 2020 | synth-pop | Republic Records | New York | 2024-03-01 10:00:00 |
| 4 | alice | [email protected] | NL | good 4 u | 178 | Olivia Rodrigo | US | SOUR | 2021 | pop-punk | Geffen Records | Los Angeles | 2024-03-02 08:15:00 |
| 5 | carol | [email protected] | FR | drivers license | 242 | Olivia Rodrigo | US | SOUR | 2021 | pop-punk | Geffen Records | Los Angeles | 2024-03-02 11:00:00 |
Rule:
This step contains arguably the bulk of the work. The added rule sounds complicated, but what it boils down to is this: if there are columns that do not directly depend on the primary key, but on something else, define separate tables for them.
Most of the non-key columns do not even describe the stream itself. They describe the user, the track, the album, the artist, or the label.
The formal 2NF rule talks about columns depending on the primary key. A useful way to test this: ask yourself "if I know the primary key value, does that tell me this column's value?"
Our primary key is stream_id . It identifies one streaming event. So ask the question for each column:
| Column | Does knowing stream_id determine this? |
|---|---|
streamed_at |
✅ Yes — each stream happened at one specific time |
user_email |
❌ No — the email belongs to the user, not the stream. Stream #1 and stream #2 both have [email protected] because it is Alice's email, not something specific to those streams |
track_duration_s |
❌ No — the duration belongs to the track. Every stream of "Blinding Lights" will always say 200, regardless of which stream it is |
artist_country |
❌ No — this belongs to the artist, not to any stream |
album_year |
❌ No — this belongs to the album |
label_hq |
❌ No — this belongs to the label |
Most columns in the table describe something other than the stream. That is the problem: the table is really six different things glued together. Each time a user streams a track, all of the user's details, the track's details, the album's details, the artist's details, and the label's details get duplicated into the new row.
The results:
users| user_id (PK) | user_name | user_email | user_country |
|---|---|---|---|
| 1 | alice | [email protected] | NL |
| 2 | bob | [email protected] | DE |
| 3 | carol | [email protected] | FR |
labels| label_id (PK) | label_name | label_hq |
|---|---|---|
| 1 | Republic Records | New York |
| 2 | Geffen Records | Los Angeles |
artists| artist_id (PK) | artist_name | artist_country |
|---|---|---|
| 1 | The Weeknd | CA |
| 2 | Olivia Rodrigo | US |
albums| album_id (PK) | album_title | album_year | artist_id (FK) | label_id (FK) |
|---|---|---|---|---|
| 1 | After Hours | 2020 | 1 | 1 |
| 2 | SOUR | 2021 | 2 | 2 |
tracks| track_id (PK) | track_title | track_duration_s | genre | album_id (FK) |
|---|---|---|---|---|
| 1 | Blinding Lights | 200 | synth-pop | 1 |
| 2 | Save Your Tears | 215 | synth-pop | 1 |
| 3 | good 4 u | 178 | pop-punk | 2 |
| 4 | drivers license | 242 | pop-punk | 2 |
streams (trimmed)| --- | --- | --- | --- |
The stream table now only describes the event itself: who streamed what and when.
Rule:
After the 2NF split, two transitive dependencies remain. That is to say, :
label_hq depends on label_name, not on label_idIn the labels table, label_hq is a fact about the label's name/brand, not directly about its surrogate key. More concretely: if Republic Records moves its headquarters, label_hq changes. However, it changes because of the label entity, so this one is actually fine as-is (it correctly depends on label_id).
✅
labelsis already in 3NF.label_hq→label_idis a direct dependency.
genre transitively depends on album_id via a style decision