Week 5 - Database design

Relational Databases

Primary Keys and Foreign Keys

Constraints

Database normalization

PostgreSQL setup

Joins

Java and databases

Practice

Assignment

Back end Track

Database normalization is the process of structuring a database to reduce data redundancy and improve data integrity. It is done in stages, which are called “normal forms”. We will cover only the first three (there’s 9 or so in total), as they address the most common design challenges.

We start with an example: a music streaming startup called “Postify” which has collected all its data into a single, unnormalised table. Whenever someone listens to a song, they record it in this table.

The Unnormalised table: streams

user_name user_email user_country track_title track_duration_s artist_name artist_country album_title album_year genre label_name label_hq streamed_at
alice [email protected] NL Blinding Lights 200 The Weeknd CA After Hours 2020 synth-pop Republic Records New York 2024-03-01 09:00:00
alice [email protected] NL Save Your Tears 215 The Weeknd CA After Hours 2020 synth-pop Republic Records New York 2024-03-01 09:04:00
bob [email protected] DE Blinding Lights 200 The Weeknd CA After Hours 2020 synth-pop Republic Records New York 2024-03-01 10:00:00
alice [email protected] NL good 4 u 178 Olivia Rodrigo US SOUR 2021 pop-punk Geffen Records Los Angeles 2024-03-02 08:15:00
carol [email protected] FR drivers license 242 Olivia Rodrigo US SOUR 2021 pop-punk Geffen Records Los Angeles 2024-03-02 11:00:00

Problems in the Unnormalised Table

Before normalising, it helps to first look at the problems explicitly:

  1. Repeated user data; alice's email and country appear in rows 1, 2, and 4. If she changes her email, every row must be updated.
  2. Repeated track/album/artist data; "Blinding Lights" appears in rows 1 and 3 with all its metadata duplicated. If the duration is corrected, every copy must change.
  3. Repeated label data; Republic Records' headquarters city is stored once per stream, not once per label.
  4. No unique identity for entities; two artists could share a name; there is no reliable way to distinguish them.
  5. Mixed granularity; a single row conflates a streaming event, a user, a track, an album, an artist, and a label. These are six different things.
  6. Transitive dependency — label_hq depends on label_name, not on the stream.

Normalisation tries to solve these problems.


Step 1: First Normal Form (1NF)

Rules:

The table already satisfies atomicity: each column holds only one value. There are for instance no comma-lists in any cells. This table already satisfies first two rules! Each column is consistent (same data type in every row), and each cell is atomic.

There is however no primary key yet. Instead, let’s assign a stream_id as the PK. The result:

stream_id user_name user_email user_country track_title track_duration_s artist_name artist_country album_title album_year genre label_name label_hq streamed_at
1 alice [email protected] NL Blinding Lights 200 The Weeknd CA After Hours 2020 synth-pop Republic Records New York 2024-03-01 09:00:00
2 alice [email protected] NL Save Your Tears 215 The Weeknd CA After Hours 2020 synth-pop Republic Records New York 2024-03-01 09:04:00
3 bob [email protected] DE Blinding Lights 200 The Weeknd CA After Hours 2020 synth-pop Republic Records New York 2024-03-01 10:00:00
4 alice [email protected] NL good 4 u 178 Olivia Rodrigo US SOUR 2021 pop-punk Geffen Records Los Angeles 2024-03-02 08:15:00
5 carol [email protected] FR drivers license 242 Olivia Rodrigo US SOUR 2021 pop-punk Geffen Records Los Angeles 2024-03-02 11:00:00

Step 2: Second Normal Form (2NF)

Rule:

This step contains arguably the bulk of the work. The added rule sounds complicated, but what it boils down to is this: if there are columns that do not directly depend on the primary key, but on something else, define separate tables for them.

Most of the non-key columns do not even describe the stream itself. They describe the user, the track, the album, the artist, or the label.

The formal 2NF rule talks about columns depending on the primary key. A useful way to test this: ask yourself "if I know the primary key value, does that tell me this column's value?"

Our primary key is stream_id . It identifies one streaming event. So ask the question for each column:

Column Does knowing stream_id determine this?
streamed_at ✅ Yes — each stream happened at one specific time
user_email ❌ No — the email belongs to the user, not the stream. Stream #1 and stream #2 both have [email protected] because it is Alice's email, not something specific to those streams
track_duration_s ❌ No — the duration belongs to the track. Every stream of "Blinding Lights" will always say 200, regardless of which stream it is
artist_country ❌ No — this belongs to the artist, not to any stream
album_year ❌ No — this belongs to the album
label_hq ❌ No — this belongs to the label

Most columns in the table describe something other than the stream. That is the problem: the table is really six different things glued together. Each time a user streams a track, all of the user's details, the track's details, the album's details, the artist's details, and the label's details get duplicated into the new row.

The results:

users

user_id (PK) user_name user_email user_country
1 alice [email protected] NL
2 bob [email protected] DE
3 carol [email protected] FR

labels

label_id (PK) label_name label_hq
1 Republic Records New York
2 Geffen Records Los Angeles

artists

artist_id (PK) artist_name artist_country
1 The Weeknd CA
2 Olivia Rodrigo US

albums

album_id (PK) album_title album_year artist_id (FK) label_id (FK)
1 After Hours 2020 1 1
2 SOUR 2021 2 2

tracks

track_id (PK) track_title track_duration_s genre album_id (FK)
1 Blinding Lights 200 synth-pop 1
2 Save Your Tears 215 synth-pop 1
3 good 4 u 178 pop-punk 2
4 drivers license 242 pop-punk 2

streams (trimmed)

| --- | --- | --- | --- |

The stream table now only describes the event itself: who streamed what and when.


Step 3: Third Normal Form (3NF)

Rule:

After the 2NF split, two transitive dependencies remain. That is to say, :

Dependency 1: label_hq depends on label_name, not on label_id

In the labels table, label_hq is a fact about the label's name/brand, not directly about its surrogate key. More concretely: if Republic Records moves its headquarters, label_hq changes. However, it changes because of the label entity, so this one is actually fine as-is (it correctly depends on label_id).

labels is already in 3NF. label_hqlabel_id is a direct dependency.

Dependency 2: genre transitively depends on album_id via a style decision

JASPER test