Prefix index, finding a balance between performance and space

In real projects, we sometimes run into special fields, such as ID card numbers.

Brother Song once had a friend who worked on the government service network of Heilongjiang Province, where some scenarios involved storing users' ID card numbers. Most of the stored data is local. As you know, the first six digits of an ID number are the address code, so in this scenario the first six digits of the column distinguish very few rows, and even the first ten digits do little better (the range of birth years is limited, and with tens of millions of people in a province, birth years repeat heavily). Indexing this field in the usual way wastes storage space and still gives poor query performance.

So is there a way to solve this problem? Today let's talk about prefix indexes. By the end, I believe you will have the answer yourself.

1. What is a prefix index?

Sometimes, to improve index performance, we index only the first few characters of a field. This saves space and reduces string comparison time. Because the strings stored in the B+Tree are shorter, it can also reduce the height of the index tree to some extent and improve query efficiency.

A prefix index in MySQL is somewhat similar to creating a function-based index in Oracle on the leading part of a column (for example with SUBSTR). The difference is that MySQL matches against the prefix index internally during the query, so the query itself does not need to call a function such as LEFT.

However, a disadvantage of a prefix index is that it may reduce the selectivity of the index.

2. What is index selectivity?

Index selectivity is the ratio of the number of distinct index values (also called the cardinality) to the total number of records in the table. Its value lies in the range (0, 1]. The higher the selectivity of an index, the higher the query efficiency, because a highly selective index lets MySQL filter out more rows when searching.

Some friends may ask: is higher selectivity always better? Of course not! The highest possible selectivity is 1. An index with selectivity 1 is a unique index, and a search can locate one specific row directly from the search condition. Performance is best in that case, but it is also the most space-consuming, which goes against our original reason for creating a prefix index.

The reason we create a prefix index rather than a unique index in the first place is that we want a balance between the performance and the space cost of the index. We want a prefix long enough to keep selectivity high (so that a query does not have to scan many rows), but short enough that the index does not occupy too much storage space.

So how do we choose an appropriate prefix length? The prefix should be long enough that the selectivity of the prefix index is close to that of an index on the entire column, i.e. the cardinality of the prefix should be close to the cardinality of the complete column.

First, we can get the selectivity of the full column with the following SQL:

SELECT COUNT(DISTINCT column_name) / COUNT(*) FROM table_name;

Then we get the selectivity of a prefix of a given length with the following SQL:

SELECT COUNT(DISTINCT LEFT(column_name, prefix_length)) / COUNT(*) FROM table_name;

When running the SQL above, try increasing values of prefix_length until the result is approximately equal to the selectivity of the entire column. The shortest prefix_length that gets there is the best choice.
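The same calculation can be sketched outside the database. The following Python snippet is only an illustration: the function names `selectivity` and `shortest_prefix` and the sample values are made up here, not taken from the article's data set. It mirrors the two queries above and then searches for the shortest prefix whose selectivity matches the full column.

```python
def selectivity(values, prefix_length=None):
    """Distinct values (or distinct prefixes) divided by the total row count."""
    if not values:
        raise ValueError("values must not be empty")
    keys = values if prefix_length is None else [v[:prefix_length] for v in values]
    return len(set(keys)) / len(values)


def shortest_prefix(values, target=None):
    """Smallest prefix length whose selectivity reaches the target.

    The target defaults to the selectivity of the full column.
    """
    target = selectivity(values) if target is None else target
    longest = max(len(v) for v in values)
    for n in range(1, longest + 1):
        if selectivity(values, n) >= target:
            return n
    return longest


# Made-up sample column, purely for illustration.
rows = ["apple", "apricot", "banana", "band", "cherry"]

for n in range(1, 5):
    print(n, selectivity(rows, n))
print("shortest prefix:", shortest_prefix(rows))
```

On a real table you would run the SQL instead; the sketch only shows the shape of the search over prefix_length.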

3. Create a prefix index

3.1 A small case

As an example, let's create a prefix index and see how it behaves.

The sample data Brother Song uses here is a test script found online. It contains more than 3 million rows, which is enough for testing SQL optimization. Friends, reply mysql-data-samples to the official account to get the download link for the script.

Let's take a rough look at the table structure. The table has a user_uuid field, and that is the field we will focus on.

Everyone has used Git, right? Unlike SVN, a version number in Git is not a number but a hash string. In practice, though, when you want to roll back to a version, for example, you do not need to type the complete version number. The first few characters are enough, because the full version number can be determined from its leading part.

The user_uuid field in this table works the same way. If we want to index user_uuid, there is no need to index the complete string. Indexing part of the string is enough.

Some friends may still not follow, so here is an example. Suppose I want to query by the user_uuid field, but I do not write the complete user_uuid in the query condition. I only write the leading part, which is enough to pick out the record I want, for instance with a condition such as user_uuid like '39352f%'.

With only part of user_uuid, a single record is uniquely identified.

Of course, Brother Song tested this query. The condition '39352f%' cannot be made any shorter, otherwise two or more records will be found.

From this example we can see that when indexing the user_uuid field, we may not need to index the complete string, only a prefix of it.

So how many leading characters should we index? That is not something to decide off the top of your head. It has to be calculated, so let's continue.

3.2 Prefix index

First, let's look at the selectivity of a full-column index on user_uuid with the following SQL:

SELECT COUNT(DISTINCT user_uuid) / COUNT(*) FROM system_user;

The result is 1. A full-column selectivity of 1 means that the values in this column are unique, with no duplicates.

Next, let's try a few different prefix_length values to see how selective they are.

Brother Song tested five different prefix_length values here and compared their selectivities.

The selectivity for 8 and 9 is the same, because the ninth character of a UUID string is always "-". Since every UUID has the same ninth character, a 9-character prefix distinguishes no more rows than an 8-character prefix.

When prefix_length is 10, the selectivity is already 1. This means that among these 3 million+ rows, when I query by the user_uuid field, the first ten characters are enough to uniquely locate one specific record.
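The point about the ninth character can be checked with a small Python sketch. This uses randomly generated UUIDs from the standard library, not the article's data set, and the helper name `distinct_prefixes` is made up for illustration. Whatever values are generated, the counts for lengths 8 and 9 are equal, because the ninth character is the same hyphen in every UUID string.

```python
import uuid

# Randomly generated stand-ins for the user_uuid column (not the article's data).
sample = [str(uuid.uuid4()) for _ in range(1000)]


def distinct_prefixes(values, n):
    """Number of distinct n-character prefixes among the values."""
    return len({v[:n] for v in values})


# The ninth character (index 8) is the first hyphen of the 8-4-4-4-12 layout.
print(all(v[8] == "-" for v in sample))
print(distinct_prefixes(sample, 8), distinct_prefixes(sample, 9))
```

This is why stepping prefix_length from 8 to 9 buys nothing, while going on to 10 adds a character that actually varies.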

So what are we waiting for? Let's create the prefix index:

alter table system_user add index user_uuid_index(user_uuid(10));

View the prefix index just created:

show index from system_user;

The second row of the output is the prefix index we just created.

Next, let's look at the execution plan (EXPLAIN) of the following query to see whether the index is used:

select * from system_user where user_uuid = '39352f81-165e-4405-9715-75fcdf7f7068';

The plan shows that the prefix index is used.

The specific search process is as follows:

  1. Find the first entry whose value is 39352f81-1 (the first ten characters of user_uuid) in the user_uuid_index index.
  2. Since user_uuid_index is a secondary index, its leaf nodes store the primary key value, so we obtain the primary key id, which is 1 here.
  3. Go back to the table with that primary key id, find the complete row with id 1 in the primary key index, and return it to the server layer.
  4. The server layer checks whether the row's user_uuid is 39352f81-165e-4405-9715-75fcdf7f7068 (this is why the Extra column of the execution plan shows Using where).
    1. If not, the row is discarded.
    2. If so, the row is added to the result set.
  5. The entries in the index leaf nodes are linked together in key order, so starting from the entry found in step 1, read the next entry and repeat steps 2, 3, and 4 until the value read from user_uuid_index is no longer 39352f81-1, at which point the loop ends.

If the selectivity of the prefix index is 1, step 5 ends immediately, because no other entry shares the prefix. If the selectivity is less than 1, step 5 may have to go back to the table for additional rows.

From this case, friends can see that we save space while keeping the search efficient.
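The five steps above can be sketched in Python. This is a toy model under stated assumptions, not how InnoDB is implemented: the table is a dictionary keyed by primary key, the secondary index is a sorted list of (prefix, primary key) pairs, and the rows with id 2 and 3 are made up so that one of them shares the ten-character prefix.

```python
from bisect import bisect_left

PREFIX_LENGTH = 10

# Toy "clustered index": primary key id -> full row. Rows 2 and 3 are made up.
table = {
    1: {"id": 1, "user_uuid": "39352f81-165e-4405-9715-75fcdf7f7068"},
    2: {"id": 2, "user_uuid": "39352f81-1aaa-4405-9715-000000000000"},
    3: {"id": 3, "user_uuid": "7b1c2d3e-0000-4000-8000-000000000000"},
}

# Toy "prefix index": sorted (prefix, primary key) pairs. Only the prefix is stored.
index = sorted((row["user_uuid"][:PREFIX_LENGTH], pk) for pk, row in table.items())


def lookup(value):
    prefix = value[:PREFIX_LENGTH]
    results = []
    pos = bisect_left(index, (prefix,))            # step 1: first entry with this prefix
    while pos < len(index) and index[pos][0] == prefix:
        pk = index[pos][1]                         # step 2: primary key from the leaf entry
        row = table[pk]                            # step 3: go back to the table
        if row["user_uuid"] == value:              # step 4: server-layer check on the full value
            results.append(row)
        pos += 1                                   # step 5: move on to the next entry
    return results


print(lookup("39352f81-165e-4405-9715-75fcdf7f7068"))
```

Here the prefix 39352f81-1 matches two index entries, so the loop goes back to the table twice and the full-value check in step 4 discards the made-up row with id 2.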

3.3 A question

Now that we are using a prefix index, let's look at a problem. Consider the following query:

select user_uuid from system_user where user_uuid = '39352f81-165e-4405-9715-75fcdf7f7068';

This time it is not select * but select user_uuid. Friends, you might expect a covering index to be used here. Let's take a look at the execution plan:

Hey, where is the covering index? (Note that Extra is Using where, not Using index.)

Think about it: with a prefix index, the B+Tree does not store the complete user_uuid values, so MySQL must go back to the table to get the required data. Therefore, if you use a prefix index, you cannot benefit from a covering index.

4. Back to the original question

At the beginning of this article, Brother Song asked a question: how can ID card numbers be indexed more efficiently?

Since the first six digits of an ID number distinguish so few rows, we can consider storing the ID number in reverse order. After storing it reversed, create a prefix index on the first six or eight characters (you can calculate the selectivity yourself). Such an index has higher selectivity and takes up less space. When querying, just use reverse to reverse the ID number, like this:

explain select * from user where id_card = reverse('ID number in normal order');
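To see why reversing helps, here is a small Python sketch. The ID-like strings are entirely made up for illustration (a fictitious address code, and the check digits are not valid), and the `selectivity` helper is a hypothetical name defined only for this example. It compares a six-character prefix on the normal and on the reversed values.

```python
def selectivity(values, prefix_length):
    """Distinct prefixes of the given length divided by the total row count."""
    return len({v[:prefix_length] for v in values}) / len(values)


# Made-up 18-character strings sharing a fictitious address code "990101".
ids = [
    "990101199001011234",
    "990101199001025678",
    "990101199203154321",
    "990101198811309876",
]

# What would be stored in the id_card column under the reversed scheme.
reversed_ids = [v[::-1] for v in ids]

print(selectivity(ids, 6))           # every row shares the same first six characters
print(selectivity(reversed_ids, 6))  # the reversed prefix comes from the varying tail
```

With real data the gain depends on how the trailing digits are distributed, so the prefix length should still be chosen by measuring selectivity as described in section 2.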

5. Summary

Okay, this is the prefix index. If you are interested, please try it out~