Teachers

Reading aggregations of data

When you join tables together, you combine lots of information. In our previous example we got all albums for all Dutch artists. There are often situations where you are not necessarily interested in the actual albums themselves, but maybe you are more interested in summaries and stats about the albums. For instance, you could ask how many albums Dutch artists have produced. You want to aggregate all the albums and extract singular facts from them.

SQL can help you do that. Let’s start with a simpler example.

Aggregate functions

Let’s start with the most basic aggregate function: COUNT . We can collapse query results into a single number using such an aggregate function. Say we want to know how often music has been streamed from the Postify app. All streams are recorded in the streams table, so we’d have to count how many rows the table has!

The syntax for the query is like this:

SELECT COUNT(*) as nr_of_streams FROM streams; 
--- Result:
 nr_of_streams 
---------------
           502
(1 row)

Notice the COUNT(*) ; this is doing the counting for us. It is a SQL function which has been passed an argument, namely *. This tells the COUNT function to count a row even if it’s all NULLs. You could also do COUNT(column_name), in which case the count is only increased if the value of the row’s column_name is not null. There’s also the DISTINCT option when counting:

-- How many streams have actually been tied to a user
-- (would differ from COUNT(*) if user_id could be NULL)
SELECT COUNT(user_id) FROM streams;

-- How many distinct users have streamed anything
SELECT COUNT(DISTINCT user_id) FROM streams;

There are other aggregate functions too! The most important ones:

SUM(column) : sum all column values
AVG(column) : take the average of the column values
MIN(column) : take the minimum of the column values
MAX(column) : take the maximum of the column values

See the postgres docs for a complete overview.

<aside>

🙋‍♀️ How would you write a query to return the total album duration in seconds for album_id = 1? </aside>

Joins and groupings

As we just learned, you can use aggregate functions to turn entire tables into a singular result. A whole-table aggregate answers "what is the count/sum/min/avg/max across everything?". It is also possible the answer the question “what is the count/sum/min/avg/max per category”, where the category is something you define yourself.

For our Postify example, let’s see if we can figure out how to extract the number of albums per country. As a simple attempt, we could JOIN the albums and artistst table, to come up with results per country like this:

SELECT COUNT(*) as nr_of_albums
FROM   artists
INNER JOIN albums
  ON artists.artist_id = albums.artist_id
WHERE artists.artist_country = 'NL';

--- Result:
 nr_of_albums 
--------------
            9
(1 row)

Repeating this for all countries would be a bit of a hassle! You’d have to create and run the query once for each country, and collect and combine the results yourself too. Instead, we’ll make use of SQL’s GROUP BY clause to count results per country:

SELECT artists.artist_country,  -- display the country too, not just the number of counts
       COUNT(*) as nr_of_albums
FROM   artists
INNER JOIN albums
  ON artists.artist_id = albums.artist_id
GROUP BY artists.artist_country; -- GROUP BY instead of the previous `WHERE` clause

--- Result:
 artist_country | nr_of_albums 
----------------+--------------
 US             |           14
 NL             |            9
 CA             |           10
(3 rows)

We’ve joined artists and albums to get the artists’ countries, and next we’ve grouped the results by country, and now the behaviour of count has changed! It no longer counts the rows in the entire result set; instead, it counts per group, where the group is defined by the GROUP BY statement.

Grouping by multiple columns

It’s also possible to group by multiple columns. The group is then defined by the unique combination of column values. For instance, let’s write a query that displays how often users have streamed specific tracks:

-- Stream count per user per track
SELECT   user_id,
         track_id,
         COUNT(*) AS times_played
FROM     streams
GROUP BY user_id, track_id  -- Group by both user and track
ORDER BY times_played DESC; -- Order results by most times played, descending to least times played.

--- Results:
 user_id | track_id | times_played 
---------+----------+--------------
       1 |        1 |           15    -- User 1 streamed track 1 15 times, the most played track in Postify!
       3 |       47 |           11
       2 |       55 |           10
       4 |       90 |           10
       5 |       68 |           10
 (...)

<aside>

🙋‍♀️ How would you rewrite this query to show the user’s name and track’s title? </aside>

Filtering groups

It is possible to add WHERE clauses to queries using GROUP BY and JOIN just fine. For instance, if we want to know the number of albums in North American countries, we can query like so:

SELECT artists.artist_country,
       COUNT(*) as nr_of_albums
FROM   artists
INNER JOIN albums
  ON artists.artist_id = albums.artist_id
WHERE artists.artist_country IN ('CA', 'US') -- WHERE clause comes before the group by!
GROUP BY artists.artist_country;

However, what if we are interested in the results only when there at least X albums released by artists from a country? Adding a WHERE nr_of_albums > 10 clause doesn’t work:

SELECT artists.artist_country,
       COUNT(*) as nr_of_albums
FROM   artists
INNER JOIN albums
  ON artists.artist_id = albums.artist_id
WHERE nr_of_albums > 10  -- Add a where clause on the count alias, or alternatively use COUNT(*) here directly (doesn't matter)
GROUP BY artists.artist_country;

--- Result:
ERROR:  column "nr_of_albums" does not exist
LINE 6: WHERE nr_of_albums > 10

This is because the nr_of_albums has not been computed yet. WHERE filters on single rows, not on aggregates. When you want to filter on aggregates, you can use the HAVING keyword:

SELECT artists.artist_country,
       COUNT(*) as nr_of_albums
FROM   artists
INNER JOIN albums
  ON artists.artist_id = albums.artist_id
GROUP BY artists.artist_country
HAVING COUNT(*) >= 10;  -- Add a 'HAVING' after the 'GROUP BY'. 
                        -- You cannot use the alias, it is not defined at this point yet.

--- Result:
 artist_country | nr_of_albums 
----------------+--------------
 US             |           14
 CA             |           10
(2 rows)

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.

Week 6

Reading aggregations of data

Aggregate functions

Joins and groupings

Grouping by multiple columns

Filtering groups