I figured it would be fun to add a pure SQL solution for the problem. This is not too hard to implement with the
STATISTICS
system view found in the MySQL information_schema database. This solution is limited in that it will only work for MySQL 5.0 (and higher). On the other hand, it has the advantage that it does not require any external programming languages or runtime environments. The solution I'm presenting here also takes the uniqueness of indexes into account, something that - to the best of my knowledge - has not been discussed by the other authors (although I believe that Daniel Schneller's tool does take uniqueness into account).
Those that don't feel like reading the article, hey - you decide. The code is available on MySQLForge as a snippet. The article just explains a little bit of the background of the problem, and explores some techniques to solve it using a pure SQL solution. It finishes with a detailed explanation of the actual query, which you may or may not find interesting to read.
Redundant Indexes
Here's my short account of the problem addressed by Peter Zaitsev. Read the original article for more details.
A redundant index is an index that can be removed without significantly hurting query performance. MySQL allows creation of multiple indexes that are fully or partially similar in structure. This generally does not have any advantages, but does have disadvantages. Therefore, it is usually a good idea to remove these redundant indexes.
I just mentioned the term index structure. I use this term in a rather loose sense as a particular combination of the following three index characteristics:
- The index columns
- Each index defines the list of columns that are indexed by the index
- The index type
- The index type corresponds with the internal algorithm that is used to build and maintain the index
- The uniqueness of the index
- Whether the index is declared to reject duplicate entries
So how do we find out if there are any redundant indexes? Using these characteristics to compare indexes on one particular table to one another covers much of what we need to determine if we are dealing with redundant indexes.
Index Type
The index type corresponds to the particular kinds of internal algorithms and datastructures used to implement the index. In MySQL, support for a particular index type is dependent upon the storage engine:
BTREE
- General purpose indexes. Supported for the
MyISAM
,InnoDB
andMEMORY
storage engines. HASH
- General purpose indexes. Supported for the
MEMORY
andNDB
storage engines. FULLTEXT
- Special purpose indexes for text columns, optimized for advanced searching of combinations of words. Supported only for the
MyISAM
storage engine. SPATIAL
- Special purpose indexes for spatial (geometric) datatypes. Supported only for the
MyISAM
storage engine.
Not all index types are equally suitable for executing an arbitrary query. Rather, particular access patterns implied by the query statement can only be processed efficiently using an index of a certain type.
What does this mean when comparing indexes to on another? Well, it means that a comparison of two particular indexes only makes sense if they have the same type. Also, the index type determines what method should be used to compare the index columns.
Index Columns
As mentioned, columnlists should be compared in a manner that is dependent upon the index type:
BTREE
- A
BTREE
index is redundant if all of it's columns are also indexed in the same order by another index, and the first indexed column is the same for both these indexes. Another way of putting it is to say that the columnlist of the index is the leftmost prefix of the columnlist of the other index. HASH
- A
HASH
index is redundant if each of it's columns match all of the columns in the same order of anotherHASH
index. Another way of putting it is to say that the columnlists of these index are duplicates of oneanother. SPATIAL
- In MySQL,
SPATIAL
indexes can have at most one column. So By definition, aSPATIAL
index is redundant if there exists anotherSPATIAL
index for the indexed column, that is, if there exists a duplicate. FULLTEXT
- A
FULLTEXT
index is redundant if all of the indexed columns are already indexed by anotherFULLTEXT
index, regardless of the order of the columns.
A few notes:
- The criterion for comparing columnlists of
HASH
indexes is a special case of the criterion used for comparing the columnlists ofBTREE
indexes. That's because forHASH
indexes, both columnlists must have exactly the same number of members. - The criterion for comparing columnlists of
SPATIAL
indexes is a special case of the criterion for comparing the columnlists ofHASH
indexes. That's becauseSPATIAL
indexes happen to always have just one column. - The criterion for comparing columnlists of
FULLTEXT
indexes is the only case where the ordering of the columns in the columnlist is not at all important for the comparison. The other cases implicitly or explicitly require that both the names and the ordinal positions of the columns match, and that the indexes have the same column at the first ordinal position.
From the mentioned cases
BTREE
indexes are by far the most abundant. Peter's article is mainly focused on BTREE
indexes and redundancy caused by matching left prefixes.Uniqueness
Uniqueness is a property of an index that tells the database it should not allow duplicate entries. That is, each combination of values in the indexed columns can occur at most once in the entire table.
The uniqueness of an index does not influence the structure of the index as such. However, uniqueness is of major importance to the behaviour applications expect from such an index. Even if an index appears to be redundant according to a comparison based on index type and the columnlist, we should not conclude that we can always remove such seemingly redundant indexes. Doing so could cause a unique index to be dropped, changing the database's rules that validate input.
Because
FULLTEXT
and SPATIAL
indexes cannot be declared to be UNIQUE
, uniqueness needs tobe taken into account only when looking for redundant BTREE
or HASH
indexes.Index Metadata
MySQL Offers several ways to inspect the indexes in the database.
First of all, you can use the
SHOW CREATE TABLE
statement. This is just a general utility statement to quickly inspect the structure of a particular table. The output is a DDL statement that could be used to (re)create the table's structure, including index definitions.For example, take a look at this example from Daniel Schneller's blog:
mysql> show create table people
-> \G
*************************** 1. row ***************************
Table: people
Create Table: CREATE TABLE `people` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`firstname` varchar(32) DEFAULT NULL,
`lastname` varchar(32) DEFAULT NULL,
`birthday` date DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_reversename` (`lastname`,`firstname`),
KEY `idx_id_birth_first` (`id`,`birthday`,`firstname`),
KEY `idx_full` (`firstname`,`lastname`),
KEY `idx_first` (`firstname`),
KEY `idx_id_first` (`id`,`firstname`),
KEY `idx_id_birth` (`id`,`birthday`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
(By the way - I will be using this table definition in the following examples.)
The second way is to use the specialized
SHOW INDEX
statement, which is especially designed to inspect indexes. This statement does not return DDL, rather, it returns a row for each column for all of the indexes in a particular table, along with more detailed information about the index data such as the cardinality (the number of index entries).For this table, the result looks like this:
mysql> show index from people;
+--------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| people | 0 | PRIMARY | 1 | id | A | 0 | NULL | NULL | | BTREE | |
| people | 1 | idx_reversename | 1 | lastname | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_reversename | 2 | firstname | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_id_birth_first | 1 | id | A | 0 | NULL | NULL | | BTREE | |
| people | 1 | idx_id_birth_first | 2 | birthday | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_id_birth_first | 3 | firstname | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_full | 1 | firstname | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_full | 2 | lastname | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_first | 1 | firstname | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_id_first | 1 | id | A | 0 | NULL | NULL | | BTREE | |
| people | 1 | idx_id_first | 2 | firstname | A | 0 | NULL | NULL | YES | BTREE | |
| people | 1 | idx_id_birth | 1 | id | A | 0 | NULL | NULL | | BTREE | |
| people | 1 | idx_id_birth | 2 | birthday | A | 0 | NULL | NULL | YES | BTREE | |
+--------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
13 rows in set (0.00 sec)
The third way is to use the
STATISTICS
system view in the information_schema
database. This offers exactly the same information as the SHOW INDEX
statement. However, the STATISTICS
system view can be used just as any other table in an arbitrary query, offering possibilities to filter, group, sort, join and whatnot.For example, output equivalent to that returned by
SHOW INDEX
statement can be retrieved using the following statement:
SELECT non_unique
, index_name
, seq_in_index
, column_name
, collation
, cardinality
, sub_part
, packed
, nullable
, index_type
, comment
FROM information_schema.STATISTICS
WHERE table_schema = schema()
AND table_name = 'people'
ORDER BY index_name
, seq_in_index
(I think that -despite the explanation given in the manual- it is a bit strange that this system view should be called
STATISTICS
. A name like INDEX_COLUMNS
would better describe the data it exposes, as the system view contains one row for each index column per index.)Using the STATISTICS
system view
As mentioned before, the
STATISTICS
system view in the information_schema
database contains information about index columns rather than indexes. Now how can we use it to compare indexes to each other?Well, according to the reasoning regarding index comparisons, we know that we should compare indexes appearing in one table to each other, matching at least index type and columns. We could try an see what we can achieve with a self-join of the
STATISTICS
, matching allmost all identifying columns with the exception of the index name (as a particular index would surely match itself). Using the
GROUP BY
operation, we can avoid to report a row for each matched column. Also, this allows us to use GROUP_CONCAT
to make a list of the matching columns.
SELECT s1.table_schema idx_schema
, s1.table_name idx_table
, s1.index_type idx_type
, s1.index_name idx_name1
, s2.index_name idx_name2
, group_concat(
i.column_name
order by i.seq_in_index
) idx_columns -- matching index columns
FROM information_schema.STATISTICS s1
JOIN information_schema.STATISTICS s2
ON s1.table_schema = s2.table_schema
AND s1.table_name = s2.table_name
AND s1.index_type = s2.index_type
AND s1.index_name != s2.index_name
AND s1.seq_in_index = s2.seq_in_index
AND s1.column_name = s2.column_name
WHERE s1.table_schema = schema()
AND s1.table_name = 'people'
GROUP BY s1.table_schema
, s1.table_name
, s1.index_type
, s1.index_name
, s2.index_name
Altough it might seem this kind of query is a step into the right direction, the result is not quite what we need at all. Just create the
people
table if you haven't done so already, try this query, and see if the result makes sense to you.What is particularly lacking is the fact we do not ascertain that we are matching all the columns of one of the indexes:
JOIN
will match the columns if a match is possible, and those columns that do match will be in the result. The columns that don't match, but that are in fact part of the index are simply absent from the result. Effectively, the query reports all indexes for wich parts of the columnlists match without checking at which ordinal position they match and without checking if all the columns of one particular index are matched.
Now these setbacks are of course not impregnable: because we are using
GROUP BY
we can use the COUNT
function to count the number of matched columns. We can choose one of the STATISTICS
instances to play the role of the redundant index. The number of matching columns should equal the total number of columns in the instance that plays the 'redundant' role. By definition, the total number of columns in the other instance is than more than (or equal to) the number of matching columns.
SELECT i.table_schema idx_schema
, i.table_name idx_table
, i.index_name idx_name
, r.index_name redundant_idx_name
, group_concat(
i.column_name
order by i.seq_in_index
) idx_columns -- matching index columns
FROM information_schema.STATISTICS i -- i for index
JOIN information_schema.STATISTICS r -- r for redundant
ON i.table_schema = r.table_schema
AND i.table_name = r.table_name
AND i.index_type = r.index_type
AND i.index_name != r.index_name
AND i.seq_in_index = r.seq_in_index
AND i.column_name = r.column_name
WHERE i.table_schema = schema()
AND i.table_name = 'people'
GROUP BY i.table_schema
, i.table_name
, i.index_type
, i.index_name
, r.index_name
HAVING count(r.seq_in_index) = (
-- check if we matched all columns of the 'redundant' index
SELECT count(1)
FROM information_schema.STATISTICS
WHERE table_schema = idx_schema
AND table_name = idx_table
AND index_name = redundant_idx_name
)
Phew.... Now, let's take a look at the result:
+------------+-----------+--------------------+--------------------+-------------+
| idx_schema | idx_table | idx_name | redundant_idx_name | idx_columns |
+------------+-----------+--------------------+--------------------+-------------+
| test | people | idx_full | idx_first | firstname |
| test | people | idx_id_birth | PRIMARY | id |
| test | people | idx_id_birth_first | idx_id_birth | id,birthday |
| test | people | idx_id_birth_first | PRIMARY | id |
| test | people | idx_id_first | PRIMARY | id |
+------------+-----------+--------------------+--------------------+-------------+
Well, at least, we do get a list of the redundant indexes, but the same index is reported multiple times.
Now we could go on in this manner in order to get a list of unique redundant indexes. We can also try another
JOIN
solution. Joining on complete concatenated columnlists is for exmpple an approach that might seem a bit strange but that does actually have some advantages over this method. However, the solution that I tend to favour right now is one that uses no
JOIN
s at all.Finding redundant indexes
We just discussed how we could use a
JOIN
to compare the column lists of two indexes. Up to a certain extend, GROUP BY
can be used to drive comparison too. It seems strange to think of
GROUP BY
, but what we are trying to do is to compare indexes according columnlist prefixes. However, when you think of it as making groups of a part of the columnlist, it should begin to make sense. Without further ado, here's the solution I eventually came up with:
select table_name
, index_type
, min(column_names) column_names
, trim(',' FROM
case index_type
when 'BTREE' then
replace(
-- report all but the last one
-- (the last one is the longest one)
substring_index(
group_concat(
'`',index_name,'`'
order by column_count asc
, non_unique asc
, index_name desc
separator ','
)
, ','
, count(*) - 1
)
-- get the first one
-- (the first one is the smallest unique one)
, concat(
'`'
, substring_index(
group_concat(
if( non_unique = 0
, index_name
, ''
)
order by non_unique asc
, column_count asc
, index_name asc
separator ','
)
, ','
, 1
)
, '`'
)
, ''
)
when 'HASH' then
substring_index(
group_concat(
'`',index_name,'`'
order by non_unique asc
, index_name asc
separator ','
)
, ','
, 1 - count(*)
)
when 'SPATIAL' then
substring_index(
group_concat(
'`',index_name,'`'
order by index_name asc
separator ','
)
, ','
, 1 - count(*)
)
else 'unexpected type - not implemented'
end
) redundant_indexes
from (
select table_name
, index_name
, index_type
, non_unique
, count(seq_in_index) as column_count
, group_concat(
if(seq_in_index=1,column_name,'')
separator ''
) as column_name
, group_concat(
column_name
order by seq_in_index
separator ','
) as column_names
from information_schema.statistics s
where s.table_schema = schema()
and s.index_type != 'FULLTEXT'
group by table_name
, index_name
, index_type
, non_unique
) as s
group by table_name
, index_type
, if(index_type='HASH',column_names,column_name)
having redundant_indexes != ''
Explanation
Lets start with the straighforward bit. We already mentioned that the
STATISTICS
table returns index columns rather than indexes. Lets first make a query that returns a row for each index rather than for each index column:
select table_name
, index_name
, count(seq_in_index) as column_count
, group_concat(
if(seq_in_index=1,column_name,'')
separator ''
) as column_name
, group_concat(
column_name
order by seq_in_index
) as column_names
from information_schema.statistics s
where s.table_schema = schema()
and s.index_type != 'FULLTEXT'
group by table_name
, index_name
The
GROUP BY
is used to group identifying properties of an index: the table name and the index name. The table_schema
is also partially identifying, but the query requires that only indexes are queried that reside within the current schema, so that is why the schema_name
column appears in the WHERE
clause and thus need not be part of the GROUP BY
list:
where s.table_schema = schema()
We've also excluded
FULLTEXT
indexes, just for convenience.
and s.index_type != 'FULLTEXT'
In the
SELECT
-list, we calculate a few aggregates. The column_count
expression is pretty self-explanatory, it counts the number columns in the index:
count(seq_in_index) as column_count
We won't dwell on this now, but it will prove to be useful later on.
There is one special aggregate:
column_name
:
group_concat(
if(seq_in_index=1,column_name,'')
separator ''
) as column_name
The intention of this expression is to report only the first of the columns in the columnlist, meaning: the column with ordinal postion equal to one. So, we are aggregating over an
IF
expression that returns the columnname only if it appears as the first column seq_in_index=1
) in the index. Otherwise, it returns the empty string. This expression is evaluated for each of the columns for the index that is grouped by the
GROUP BY
, seperating the entries by again, the empty string. The condition will be true for just one of the columns that are grouped together. The entire expression this will give us that one column name, concatenated to lots of empty strings, wich of course amounts to just the one column name.The
column_names
expression concatenates all columns of the index ordered by ordinal position:
group_concat(
column_name
order by seq_in_index
) as column_names
In this case, no separator is specified so the default, a comma, will be used. The result of this expression allows an easy confirmation that the
column_name
indeed reports only the first column of the index.The
column_name
expression is important because it allows us to make groups of indexes according to a column prefix of their columnlists. By definition, all indexes that have the same columnlist prefix must have the same column as first column. So, we don't need to group using the largest common prefix, rather, we are using the smallest thinkable common prefix. This works even if the index with the smalles columnlist has more than just one column.We need the column prefix to compare
BTREE
indexes to each other. However, as mentioned in the introduction, a prefix comparison is not enough for HASH
indexes. For HASH
indexes, the entire columnlist should be used. So, somehow, we need to switch the grouping expression, depending upon whether or not the index type is HASH
or not. This leads us to another little issue we have to take care of. In the introduction we mentioned that it is only useful to compare indexes of the same index type. So, in addition to the previous query, we now also introduce the
index_type
column: only within the group of indexes of a particular type do we want to make groups of a common column prefix. Adding
index_type
to the GROUP BY
list of the previous query will not affect the number of returned rows. That's because the grouping in the previous query was per index, and since a particular index has only one index type, adding the column to the GROUP BY
cannot give rise to subgroups beyond the index level.However, it can (and does) once we start grouping indexes themselves:
select table_name
, index_type
, column_name
, group_conat(
index_name
) index_names
from (
select table_name
, index_name
, index_type
, count(seq_in_index) as column_count
, group_concat(
if(seq_in_index=1,column_name,'')
separator ''
) as column_name
, group_concat(
column_name
order by seq_in_index
separator ','
) as column_names
from information_schema.statistics s
where s.table_schema = schema()
and s.index_type != 'FULLTEXT'
group by table_name
, index_name
, index_type
) as s
group by table_name
, index_type
, if(index_type='HASH',column_names,column_name)
Note that we are not just grouping by the columnlist prefix in all cases. Because indexes of the
HASH
type are only equivalent when there is a match for the entire columnlist, we need to take that into account in the GROUP BY
expression:
if(index_type='HASH',column_names,column_name)
So, if we are dealing with a
HASH
index, then group by the entire columnlist, else group by the (smallest) prefix only.The
SELECT
list contains an expression to produce a comma-separated list of all the indexes that we consider in the comparison:
group_conat(
index_name
) index_names
Some of the indexes that appear in this list will be redundant, but some of them are essential: removing them will certainly affect the performance of certain queries. If we could somehow remove the essential indexes from the list, we would end up with a list of redundant indexes, which is what we set out to do.
How can we go about this? Well, let's stick to the case for
BTREE
's for a while. Remember that for this index type, we are primarily interested in the index with the longest columnlist: we consider that to be the most essential index. It is essential because it offers an access path that is not offered by indexes with shorter columnlists, and the access path offered by the other indexes is already contained within the one with the longest columnlist. (We really can safely forget about other index types for a while. In case of
HASH
and SPATIAL
indexes, we are dealing with identical columnlists, so by definition, all columnlists are 'the longest'. That means that whatever solution we come up with for BTREE's applies automatically to these index types although it may be possible that a simpler solution is available for these index types.)Now, only if we were able to extract the
index_name
with te largest number of columns from the list...We actually know already how many columns each index has, because we computed that already in the subquery as the
column_count
expression. But how do we use that information? Well, we can use the optional ORDER BY
clause of the GROUP_CONCAT()
function. Using that, we could sort our index names according to the number of columns in the index:
group_conat(
index_name
order by column_count
) index_names
This ensures that the indexes with the largest number of columns in the columnlist are queued up at the end of the list. Once it is sitting there, it is not too hard to extract it: we can use the
SUBSTRING_INDEX()
function for that, like so:
substring_index(
group_concat(
index_name
order by column_count
separator ','
)
, ','
, count(*) - 1
)
What happens here is that the comma separated list of index names, ordered according to the number of columns in the index, is generated first. (For clarity the
SEPARATOR
clause is included in the GROUP BY
.) Then, SUBSTRING_INDEX
takes only a part from it, namely the part up to the last comma in the list. The last comma is conveniently calculated by taking the count of indexes that are used to generate te list, and substracting one (because the indexes in the list are separated by a comma, there is exaclty on comma less than the number of indexes). So the last index, the essential index, is cut off, and what then remains is the list of redundant indexes. Having explained the technique, we can now discuss a few minor improvements. First of all, what happens when there is not just one 'essential index' but two, or even more? It is quite possible that there is not just one index with the maximum number of columns for the given prefix. Well, in that case, since we are dealing with duplicates, we will simply elect the index with the 'smallest' index name.
This is easy enough to implement. We simply have to add a new sort key within the
column_count
. Because we have arranged for SUBSTRING_INDEX()
to discard the last entry of the list, we must order by index_name
in descending order:
substring_index(
group_concat(
index_name
order by column_count
, index_name desc
separator ','
)
, ','
, count(*) - 1
)
Ok, so far so good. Now there is really one thing we've consistenly ignored all the time, and that is unicity. What do we do when our list of 'redundant' indexes contains on or more unique indexes? Well, if we were to remove this from the schema, we could be in big trouble! Input that would have been rejected because it violated a unique index would suddenly be allowed, and this is certainly not our intention. This simply means that our concept of a redundant index is incomplete. It doesn't take an important feauture of the indexes not into account, namely the unicity.
So what do we need to do in case we really do have
UNIQUE
indexes? Well, we should look for the UNIQUE
index with the least number of columns. Just as the index with the largest number of columns provides an essential acces path, the unique index with the smallest number of columns provides an essential unicity rule. Any other unique indexs with the same prefix that have more columns are redundant: they cannot make their indexed data 'more unique'.Luckily, this improvement is not more complicated than what we've seen sofar. What we can do is set up an entire new list of only the unique indexes:
substring_index(
group_concat(
if( non_unique = 0
, index_name
, ''
)
order by non_unique asc
, column_count asc
, index_name asc
separator ','
)
, ','
, 1
)
Using the same techniques we used before, we constructed an expression that generates a list of only
UNIQUE
indexes, sorted on the number of columns. If there is at least one UNIQUE
index, that will be in front of the list. Using SUBSTRING_INDEX()
we can cut the first UNIQUE
index of the list: this will be the one with the smallest number of columns, because we included the column_count
expression in the ORDER BY
clause.Now, all we need to do is remove the index name of the unique index with the smallest number of columns from our list of 'redundant indexes'. This is easily done by the
REPLACE()
function. Putting it alltogether, we get something like this:
replace(
substring_index(
group_concat(
'`',index_name,'`'
order by column_count asc
, non_unique asc
, index_name desc
separator ','
)
, ','
, count(*) - 1
)
, concat(
'`'
, substring_index(
group_concat(
if( non_unique = 0
, index_name
, ''
)
order by non_unique asc
, column_count asc
, index_name asc
separator ','
)
, ','
, 1
)
, '`'
)
, ''
)
That's about it really. We do need to do a little more effort to ascertain that
REPLACE()
will only remove complete index names, but that is not too hard. Also, we all this cutting and replacing might leave a few stray commas behind: those must be cleaned up as well. Limitations
This solution certainly has it's limitations. I've made a list of the most important ones:
- Relies on Information schema: this solution is not available in MySQL <= 5.0, and it might be slow
FULLTEXT
indexes are ignored- Column prefixes are ignored.
If you have any comments, suggestions, or whatsoever, don't hesitate to post a comment, please do. Remember, the code will remain available as a snippet at MySQLForge.
No comments:
Post a Comment