Database normalization

Teachers

Database normalization is the process of structuring a database to reduce data redundancy and improve data integrity. It is done in stages, which are called “normal forms”. We will cover only the first three (there’s 9 or so in total), as they address the most common design challenges.

We start with an example: a music streaming startup called “Postify” which has collected all its data into a single, unnormalised table. Whenever someone listens to a song, they record it in this table.

The Unnormalised table: `streams`

user_name	user_email	user_country	track_title	track_duration_s	artist_name	artist_country	album_title	album_year	genre	label_name	label_hq	streamed_at
alice	[email protected]	NL	Blinding Lights	200	The Weeknd	CA	After Hours	2020	synth-pop	Republic Records	New York	2024-03-01 09:00:00
alice	[email protected]	NL	Save Your Tears	215	The Weeknd	CA	After Hours	2020	synth-pop	Republic Records	New York	2024-03-01 09:04:00
bob	[email protected]	DE	Blinding Lights	200	The Weeknd	CA	After Hours	2020	synth-pop	Republic Records	New York	2024-03-01 10:00:00
alice	[email protected]	NL	good 4 u	178	Olivia Rodrigo	US	SOUR	2021	pop-punk	Geffen Records	Los Angeles	2024-03-02 08:15:00
carol	[email protected]	FR	drivers license	242	Olivia Rodrigo	US	SOUR	2021	pop-punk	Geffen Records	Los Angeles	2024-03-02 11:00:00

Problems in the Unnormalised Table

Before normalising, it helps to first look at the problems explicitly:

Repeated user data; alice's email and country appear in rows 1, 2, and 4. If she changes her email, every row must be updated.
Repeated track/album/artist data; "Blinding Lights" appears in rows 1 and 3 with all its metadata duplicated. If the duration is corrected, every copy must change.
Repeated label data; Republic Records' headquarters city is stored once per stream, not once per label.
No unique identity for entities; two artists could share a name; there is no reliable way to distinguish them.
Mixed granularity; a single row conflates a streaming event, a user, a track, an album, an artist, and a label. These are six different things.
Transitive dependency — label_hq depends on label_name, not on the stream.

Normalisation tries to solve these problems.

Step 1: First Normal Form (1NF)

Rules:

Every cell holds a single atomic value.
Each column stores one consistent type.
Every row is uniquely identified by a primary key.

The table already satisfies atomicity: each column holds only one value. There are for instance no comma-lists in any cells. This table already satisfies first two rules! Each column is consistent (same data type in every row), and each cell is atomic.

There is however no primary key yet. Instead, let’s assign a stream_id as the PK. The result:

stream_id	user_name	user_email	user_country	track_title	track_duration_s	artist_name	artist_country	album_title	album_year	genre	label_name	label_hq	streamed_at
1	alice	[email protected]	NL	Blinding Lights	200	The Weeknd	CA	After Hours	2020	synth-pop	Republic Records	New York	2024-03-01 09:00:00
2	alice	[email protected]	NL	Save Your Tears	215	The Weeknd	CA	After Hours	2020	synth-pop	Republic Records	New York	2024-03-01 09:04:00
3	bob	[email protected]	DE	Blinding Lights	200	The Weeknd	CA	After Hours	2020	synth-pop	Republic Records	New York	2024-03-01 10:00:00
4	alice	[email protected]	NL	good 4 u	178	Olivia Rodrigo	US	SOUR	2021	pop-punk	Geffen Records	Los Angeles	2024-03-02 08:15:00
5	carol	[email protected]	FR	drivers license	242	Olivia Rodrigo	US	SOUR	2021	pop-punk	Geffen Records	Los Angeles	2024-03-02 11:00:00

Step 2: Second Normal Form (2NF)

Rule:

Must be in 1NF.
Every non-key column must depend on the whole primary key, not just part of it.

This step contains arguably the bulk of the work. The added rule sounds complicated, but what it boils down to is this: if there are columns that do not directly depend on the primary key, but on something else, define separate tables for them.

Most of the non-key columns do not even describe the stream itself. They describe the user, the track, the album, the artist, or the label.

The formal 2NF rule talks about columns depending on the primary key. A useful way to test this: ask yourself "if I know the primary key value, does that tell me this column's value?"

Our primary key is stream_id . It identifies one streaming event. So ask the question for each column:

Column	Does knowing `stream_id` determine this?
`streamed_at`	✅ Yes — each stream happened at one specific time
`user_email`	❌ No — the email belongs to the user, not the stream. Stream #1 and stream #2 both have `[email protected]` because it is Alice's email, not something specific to those streams
`track_duration_s`	❌ No — the duration belongs to the track. Every stream of "Blinding Lights" will always say 200, regardless of which stream it is
`artist_country`	❌ No — this belongs to the artist, not to any stream
`album_year`	❌ No — this belongs to the album
`label_hq`	❌ No — this belongs to the label

Most columns in the table describe something other than the stream. That is the problem: the table is really six different things glued together. Each time a user streams a track, all of the user's details, the track's details, the album's details, the artist's details, and the label's details get duplicated into the new row.

The results:

`users`

user_id (PK)	user_name	user_email	user_country
1	alice	[email protected]	NL
2	bob	[email protected]	DE
3	carol	[email protected]	FR

`labels`

label_id (PK)	label_name	label_hq
1	Republic Records	New York
2	Geffen Records	Los Angeles

`artists`

artist_id (PK)	artist_name	artist_country
1	The Weeknd	CA
2	Olivia Rodrigo	US

`albums`

album_id (PK)	album_title	album_year	artist_id (FK)	label_id (FK)
1	After Hours	2020	1	1
2	SOUR	2021	2	2

`tracks`

track_id (PK)	track_title	track_duration_s	genre	album_id (FK)
1	Blinding Lights	200	synth-pop	1
2	Save Your Tears	215	synth-pop	1
3	good 4 u	178	pop-punk	2
4	drivers license	242	pop-punk	2

`streams` (trimmed)

| --- | --- | --- | --- |

The stream table now only describes the event itself: who streamed what and when.

Step 3: Third Normal Form (3NF)

Rule:

Must be in 2NF.
No non-key column may depend on another non-key column (no transitive dependencies).

After the 2NF split, two transitive dependencies remain. That is to say, :

Dependency 1: `label_hq` depends on `label_name`, not on `label_id`

In the labels table, label_hq is a fact about the label's name/brand, not directly about its surrogate key. More concretely: if Republic Records moves its headquarters, label_hq changes. However, it changes because of the label entity, so this one is actually fine as-is (it correctly depends on label_id).

✅ labels is already in 3NF. label_hq → label_id is a direct dependency.

Dependency 2: `genre` transitively depends on `album_id` via a style decision

JASPER test

Week 5 - Database design

The Unnormalised table: streams

Problems in the Unnormalised Table

Step 1: First Normal Form (1NF)

Step 2: Second Normal Form (2NF)

users

labels

artists

albums

tracks

streams (trimmed)