25th Dec 2020 10 minutes read What Is a Database Index? Ignacio L. Bisso sql learn sql data engineering Table of Contents What Is a Database Index? How Does an Index Work? How to Add an Index to a Table How An Index Works Internally Who Chooses Which Index to Use? Multi-Column Key Indexes The Cost of Having Indexes What’s Next with Indexes? Database indexes are critical to query speed and efficiency. Learn when, why, and how to use them – and when not to – in this post. Indexes are very important database objects; they optimize data access and improve database performance by helping the database execute SQL queries faster. Why not take this opportunity to learn how indexes work and how to create them? What Is a Database Index? Indexes are data structures that can increase a database’s efficiency in accessing tables. Indexes are not required; the database can function properly without them, but query response time can be slower. Every index is associated with a table and has a key, which is formed by one or more table columns. When a query needs to access a table that has an index, the database can decide to use the index to retrieve records faster. Question: What is a SQL Index? A SQL index is a database structure that speeds up data retrieval operations by providing quick access to table rows. It works like a book's index, allowing the database to find data without scanning the entire table. How Does an Index Work? If you’ve ever used a book index, you’ll understand immediately how a database index works. In a book index, we first search the index for the topic we want. We get the page number where that topic is found in the book, and we go to this page. It’s very simple. When the database searches in an index, the first step is to find the key value in the index. Every key value is stored with a pointer to the record in the table associated with this key value. Then, when the key value is found, the database follows that pointer and directly reads the record(s) from the table. Suppose we table with 40 million records and a schema like the following: ssnfirst_namelast_namezip_code 5436782JohnSmithZZ123459 5123456IgorChernoskyZZ123459 4325834MaryConneryAF432112 4748122SueLeclercAW543112 We run this query: SELECT ssn FROM persons WHERE last_name = ‘Chernosky’ To get the results, the database must do a full sequential scan of the table persons, which means that the database must read every record on the table and verify if the WHERE condition last_name = ‘Chernosky’ for this record is true or false. As this table has 40 million rows, the database is carrying out more than a few operations. On my local machine, this query takes 46 seconds! How to Add an Index to a Table At this point, we have a really slow query – it takes 46 seconds to return just one record. We can try to add an index to the table persons using the column last_name as the index key. The idea is to improve query performance by avoiding a full sequential scan of the entire table. Here’s how to create such an index: CREATE INDEX ix1_test ON persons ( last_name ) The syntax for index creation is simple. We need a CREATE INDEX statement specifying: The index name (here, ix1_test). The associated table (persons). The column to use as the index key (last_name). After creating the index and executing the previous query a second time, the database will do an index search for the last_name = ‘Chernosky’ condition instead of a full sequential scan. Let’s see how much time it takes. With the new index, the same query takes 15 milliseconds – 30,000 times faster! The reason behind this huge performance increase is using the index to look for last_name = ‘Chernosky’ instead of doing a full sequential scan. Remember, an index’s process is similar to the process of using a book index. Imagine how much time you’d need to find a specific topic if you decided to search for it by reading the entire book, page by page! Before closing this section, I would like to suggest the course Understanding Indexes, which has hands-on exercises that show how indexes work. Also, the article What is a Primary Key in SQL? explains the role of primary keys in SQL databases; it’s a good read if you’re not already familiar with the concept. How An Index Works Internally In SQL databases, indexes are internally organized in the form of trees. Like actual trees, database indexes have many branch bifurcations, and individual records are represented (or pointed) by the leaves. A database index tree is composed of several nodes connected by pointers. The index tree is created when the CREATE INDEX statement is executed. The database software has an algorithm that builds the index tree. The first step of the index creation algorithm is to sort the records based on the index key. Next, it builds the structure by creating and populating every node of the tree index. The creation of an index tree can take some time, especially when the table has many records. If we use an index to search for a specific record in the table, we need to start searching from the root of the index tree. On every branch bifurcation, a decision must be made about which branch to choose. This means logically evaluating the search condition and the range of values in each branch. As a very simple example, imagine you want to find the number 12 in a tree that has two branches. Branch X has numbers from 0-10 and Branch Y has numbers from 11-20. As 12 is greater than 11, you’d search for the number in Branch Y. One special kind of index tree is the B-tree. B-trees are Balanced trees, meaning that the quantity of nodes from the root to any leaf node in the B-tree is the same. We will show a graphical representation of a B-tree index. First, let’s look at the table we’ll associate with the index. It’s called part, and the index will be placed on the part_id column: part_idnamepricemanufacturer 1Touch screen m81$90Cbios Inc 2Processor quad core I4$92Xcreators Inc 5Keyboard 101K$13Xcreators Inc 7Ram Memory 265Mb$145Cbios Inc 8Ram Memory 512Mb$211Cbios Inc 11Processor octo core I8$178Xcreators Inc Next, let’s see the SQL that will create the index ix_part_id: CREATE INDEX ix_part_id ON part ( part_id ) At the time of CREATE INDEX execution, the complete index data structure is created and populated. In the following image, we can see the graphical representation of the B-tree index ix_part_id and the table part: The B-tree index is represented as an inverted tree, where the root is on top and the leaves are on the bottom. Notice that the red boxes representing leaf nodes have a pointer that indicates one record in the table. Every leaf node is a pair consisting of a key value and a pointer to the record. The intermediate and root-level nodes can store four key values (although some of them are not in use). When new records are inserted into the table, these empty places will be used to store new key values. This index can be used for searching records. Suppose the database has to execute a query to search for a specific part_id or a range of part_ids, e.g. WHERE part_id = 2. To find the record with part_id = 2, we start searching in the root of the index, and comparing 2 with the root index value 7. If the result of the comparison is “<”, we take the pointer on the left side. If the result of the comparison is “>=” we take the pointer on the right side. As 2 is less than 7, we take the left pointer to the next node and repeat the same process. When the node has more than one value, we’ll compare until we find a range where the searched value belongs and take the pointer for that range. In our example, the range for 2 is (2,5]; we take the pointer to the leaf node and see that the leaf node value is 2, so we take the pointer to that record in the table. In the image below, the green lines show how we traverse the B-tree to search part_id = 2: An index is an ordered data structure where the index keys are ordered by some criteria; thus, we can take advantage of that and use indexes to avoid sorting records. Then if we have a query with an ORDER BY or a GROUP BY using the same key as an existing index, the database can use the index to avoid sorting the records – a performance improvement! Who Chooses Which Index to Use? We can have more than one index per table as long as each index uses a different column as its key. Depending on the query, perhaps more than one index can be used to improve its performance. So, who decides the best index to use for a given SQL query ? This is selected by an internal database module called the “query planner”. The following example has a query on a table with two indexes. We will see how the database chooses the index that will execute the query faster. First, let’s add a new index to the table persons: CREATE INDEX ix3_test ON persons ( zip_code ) Here’s the query we will execute: SELECT ssn FROM persons WHERE last_name = ‘Smith’ AND zip_code = ‘12345’ In this query, the database has two indexes to choose from: ix1_test based on the key last_name ix3_test based on the key zip_code The execution of the query takes 22 milliseconds, as we can see below: We know the query has used an index, but we don’t know which index was chosen. SQL databases have an EXPLAIN statement that returns the query planner’s execution plan. The output includes several technical details which are outside the scope of this article, but it is simple to identify which index has been chosen. Let’s execute the same query, this time adding the EXPLAIN before the SELECT. We can see the index chosen was idx1_test. Multi-Column Key Indexes Index keys can be formed for one column or several columns, much like primary keys. The syntax for a multi-column key index is pretty simple: CREATE INDEX ix4_test ON person ( first_name, last_name ) We can put any combination of columns in the key of a multi-column index; however, best practices suggest putting columns that are frequently used in queries’ WHERE clauses. For example, the previous index would be very helpful on the following query: SELECT social_security_number FROM person WHERE first_name = ‘Kevin’ AND last_name = ‘Smith’ There are other WHERE conditions that can take advantage of the previous multi-column index, e.g. a query with WHERE first_name = ‘Kevin’ can also use the index, even when last_name is not in the WHERE. Another interesting facet of this topic is redundant indexes; however, due to its length and complexity, we can’t cover it in this article. For a complete review of SQL tables and the many different options available, see How to Create a Table in SQL. The Cost of Having Indexes Indexes – like most things in life – are not free. Every time a table with an index is modified by an INSERT, UPDATE, or DELETE, all the indexes associated with the table have to be refreshed. These changes take time, and the performance of a process that massively modifies tables can be affected when several indexes exist in the table. Before creating an index, we need to evaluate the cost-benefit of having it. Do we expect noticeable performance improvements? Is this table statistical or heavily modified? Can we accept a degradation in performance when we modify the table? After doing this analysis, we can decide if the index needs to be created. Another option is to create and drop indexes based on different needs; in fact, some databases do that automatically. What’s Next with Indexes? We’ve covered the basics of database indexes: The syntax for creating an index. When an index is used by a query. How to know if an index is being used (the EXPLAIN statement). When to put an index on a table. If you want to really get familiar with indexes, I suggest the course Creating Database Structure. It will teach you the secrets of creating and managing the structure of a relational database. Finally, I would also like to suggest the e-book SQL Performance Explained by Markus Winand, where you can find a lot of technical details about SQL performance. Tags: sql learn sql data engineering