Teachers

Performance

We’ve now learned the most important parts of databases: how to design them, how to read from them, and how to write to them. The last bit we want to learn is how to keep them performant.

As your application keeps on being used, its production data will grow and grow. For the Postify app, we will keep recording more streams, artists, tracks, and users. You can imagine your queries will become slower and slower as the database grows in size. This performance problem can be tackled, by helping your database with understanding where and how to find certain data in your tables. The solution to our problem is: indexes.

What are indexes?

Imagine you are looking up the word "transaction" in a 600-page encyclopedia about SQL. You have two options. You could start at page one and read every page until you find it - that works, but it is slow and the time it takes grows with the size of the book. Alternatively, you could flip to the index at the back of the book, find "transaction → page 312", and go directly there.

A database index works exactly the same way. Without one, the database answers a query by scanning every row in the table from top to bottom, checking each one against your WHERE clause. This is called a sequential scan (or full table scan), and like reading a textbook page by page, the time it takes grows linearly with the number of rows.

An index is a separate data structure that the database maintains alongside the table. It stores a subset of the table's data — typically the values of one or more columns — in a way that makes specific lookups extremely fast. When the database can use an index to answer a query, it goes directly to the relevant rows instead of scanning everything.

In PostgreSQL, the default index type is a B-tree (balanced tree). Values are stored in sorted order in a tree structure, which allows the database to find any value (or range of values) in a number of steps proportional to the logarithm of the table size rather than the size itself. A table with one million rows needs at most around 20 comparisons to find a value with a B-tree index. Without an index, it would need up to one million.

When your DBMS runs queries, it will try its best to understand what the quickest way to run the query is. This is a query planner’s job. PostgreSQL's query planner decides on a per-query basis whether using the index is faster than a sequential scan, and chooses accordingly.

If you want a slightly deeper dive into how this is accomplished under the hood, take a look at this video:

https://www.youtube.com/watch?v=lYh6LrSIDvY

Creating indexes

Creating indexes is done on one or multiple columns. You want to give the database a speed increase for queries using those specific columns. This syntax is simple:

CREATE INDEX index_name ON table_name (column_name); -- Or multiple column names

By convention, indexes are often named idx_tablename_columnname. That is all. The database builds the index immediately and begins using it automatically; you do not need to change any queries.

<aside> ⚠️

When creating an index, the database will have to set up an entire data structure. If you create an index on a large table (millions of rows, at least), this may take some time. If you’re working with a live table that likely is still written to frequently while building your index, consider a concurrent creation:

CREATE INDEX CONCURRENTLY idx_tablename_columnname;

</aside>

Dropping an index is simple too:

DROP INDEX idx_tablename_columnname;

When to create indexes

Indexes may sound like magical “silver bullets”; you just add them to almost any columns and table in your database, and your performance magically increases! However, that’s not the case. While indexes speed up reads, they slow down writes. Every time you INSERT, DELETE, or UPDATE data, a database also has to make changes to all relevant indexes. Additionally, the index data structure takes up storage space, which might also incur additional costs. Understanding how your application accesses your data is important in knowing which indexes to make.

There’s a few indexes that are created automatically:

Primary Keys; each PK has a corresponding index.
Unique constraints; are enforced by an index, at least in PostgreSQL.

Other sensible places for indexes:

Foreign Keys: generally speaking, it’s considered best practice to create indices for your Foreign Keys. This is not something PostgreSQL does for you automatically.
Frequent where clauses: columns that appear frequently in where clauses. Queries using those columns as selection criteria are likely sped up.
Ordered query: columns that are used for sorting, in an ORDER BY clause. As mentioned, index data structures are sorted, so effectively the result set is already ordered while the database collects its data.

When not to index:

Low-cardinality columns: when a column has few different values, the index doesn’t optimize much. For instance the artist_country in the Postify app only had three values: 'NL', 'US' and 'CA' . The index would cut the rows to scan in thirds, which is still a large set to scan over. A full table scan might be as quick as first consulting the index, and then going back to the table. This case could be different if Postify had music from all over the world, leading to >200 country codes!
Rarely queried columns: When a column is not used often, the cost of keeping the index updated can be bigger than the boost in read performance.
Small tables: Sequential scans on small tables are still very fast.

<aside>

🙋‍♀️ Can you think of a few indexes for the Postify app? Here’s a couple of suggestions below! </aside>

Composite indexes

A composite index covers two or more columns, instead of just one. If your queries consistently filter on two columns together, a composite index on both columns can be significantly faster than two separate single-column indexes. Take for example the streams for a single user :

-- This query filters on both user_id and track_id together
SELECT * FROM streams WHERE user_id = 3 AND track_id = 42;

A single-column index on user_id helps here — the database narrows the rows down to user 3's streams, then scans those for track_id = 42. But a composite index on (user_id, track_id) lets the database find the exact intersection in one step:

CREATE INDEX idx_streams_user_track ON streams (user_id, track_id);

The order of columns in a composite index is important. A composite index on (user_id, track_id) can be used by:

Queries that filter on user_id alone
Queries that filter on both user_id and track_id

But it cannot efficiently serve queries that filter on track_id alone. The index is sorted first by user_id, so without a known user_id the database cannot use it to jump to specific track_id values; it might as well scan the table. A general rule: put the most selective column first.

Understanding your queries

<aside> 🤓

The following paragraphs are rather advanced. It’s more important to know of the existence of query plans, and how they can be used, than fully understanding all their details. Optimization is hard, so these paragraphs only tell you how to even begin.

</aside>

As Donald Knuth famously said: “premature optimization is the root of all evil”. Besides the “best practices”, you should only add indexes if your app has performance issues. When your app has performance issues, you should first identify whether it is your actual database queries that are the bottleneck. If they are indeed, you still want understand your queries before you throw indexes at them. Sometimes your query could make use indexes, but doesn’t. Sometimes you want to investigate which columns to index properly, instead of guessing. Sometimes, your query is slow despite all of your indexes. In that case, you want to better understand your query.

The `EXPLAIN` statement

The best way to do this, is by looking at HOW your databases fetches the data needed to answer a query. A database’s query planner creates a query plan, which you can ask for using the EXPLAIN command. Let’s try first without any index:

EXPLAIN
SELECT * FROM streams WHERE user_id = 3;

--- Result:
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on streams  (cost=0.00..10.40 rows=64 width=20)
   Filter: (user_id = 3)
(2 rows)

Now, after creating an index:

CREATE INDEX idx_streams_user_id  ON streams (user_id);
EXPLAIN
SELECT * FROM streams WHERE user_id = 3;

--- Result:
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on streams  (cost=4.64..9.44 rows=64 width=20)
   Recheck Cond: (user_id = 3)
   ->  Bitmap Index Scan on idx_streams_user_id  (cost=0.00..4.63 rows=64 width=0)
         Index Cond: (user_id = 3)
(4 rows)

Note: you should read the query plan from the bottom up. The plan changed from fetching data through a Seq Scan (sequential scan A.K.A. full table scan) to a few other types of scans. There are many strategies that a query planner might deploy, but the significant part is the Bitmap Index Scan on idx_streams_user_id . An index scan was used to only fetch the significant rows (with user_id = 3) from the index data structure, instead of scanning through the entire table. This is way more efficient.

Other than seeing the index was indeed used, we can look at the cost reports. Before the index, it read cost=0.00..10.40 rows=64 width=20 . This shows {startup costs}..{total costs} rows={rows} width={width}:

Startup costs; work done before even a single row is returned. 0.00 in this case, as there’s no ordering or grouping or computation required, for instance.
Total costs; the estimated work to return all rows. 10.40 here.
Rows; the planner’s estimate of how many rows will be returned.
Width; Estimated size per row, in bytes

Note that the costs are not actual seconds (or milliseconds or any unit of time). They are abstract units used internally by the planner. You can use them to compare within a single query plan, but not across other queries or other machines.

After the index, we can see the total costs have gone down from 10.40 to 9.44 . The query planner will choose to use the index for these type of queries because of it!

The `EXPLAIN ANALYZE` statement

When running an EXPLAIN statement, you ask for a query plan based on estimates (which are made using table sizes and other metadata your database keeps track of). If you actually want to see measurements alongside the plan, you have to use the EXPLAIN ANALYZE statement.

Here’s the output for the previous query, with index:

EXPLAIN ANALYZE                                        
SELECT * FROM streams WHERE user_id = 3;
--- Result:
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on streams  (cost=4.64..9.44 rows=64 width=20) (actual time=0.046..0.053 rows=68 loops=1)
   Recheck Cond: (user_id = 3)
   Heap Blocks: exact=3
   ->  Bitmap Index Scan on idx_streams_user_id  (cost=0.00..4.63 rows=64 width=0) (actual time=0.037..0.037 rows=68 loops=1)
         Index Cond: (user_id = 3)
 Planning Time: 0.054 ms
 Execution Time: 0.075 ms
(7 rows)

This result does show actual measures times. It took 0.054 ms to create a plan, and 0.075 ms to execute the plan and return 7 rows.

<aside> ⚠️

EXPLAIN ANALYZE actually executes your query! Be careful when analyzing INSERT, UPDATE, and especially DELETE statements. If you’re brave enough you can run these in a transaction, and roll them back.

</aside>

Note that running EXPLAIN ANALYZE on the same query multiple times might return different planning and execution time; the query is run by your database, which is a process on your computer, whose speed is still dependent on how much CPU time your OS allows it to use.

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

CC BY-NC-SA 4.0 Icons

Built with ❤️ by the HackYourFuture community · Thank you, contributors

Found a mistake or have a suggestion? Let us know in the feedback form.

Week 6

Performance

What are indexes?

Creating indexes

When to create indexes

Composite indexes

Understanding your queries

The EXPLAIN statement

The EXPLAIN ANALYZE statement

The `EXPLAIN` statement

The `EXPLAIN ANALYZE` statement