Tuesday, September 26, 2006

Finding Redundant Indexes using the MySQL Information Schema

Peter Zaitsev's blog entry on Duplicate indexes and redundant indexes certainly made a bit of a stir! It has already led to a new (mini) project by Baron Schwartz, the Duplicate index/foreign key finder which is publicly available on MySQLForge as a perl script. Daniel Schneller has entered the arena as well, devoting an entire blog to his own java implementation to tackle this and other index trouble.

I figured it would be fun to add a pure SQL solution for the problem. This is not too hard to implement with the STATISTICS system view found in the MySQL information_schema database. This solution is limited in that it will only work for MySQL 5.0 (and higher). On the other hand, it has the advantage that it does not require any external programming languages or runtime environments.

The solution I'm presenting here also takes the uniqueness of indexes into account, something that - to the best of my knowledge - has not been discussed by the other authors (although I believe that Daniel Schneller's tool does take uniqueness into account).

If you don't feel like reading the whole article, that's fine - the code is available on MySQLForge as a snippet. The article just explains a little bit of the background of the problem, and explores some techniques to solve it using a pure SQL solution. It finishes with a detailed explanation of the actual query, which you may or may not find interesting to read.

Redundant Indexes


Here's my short account of the problem addressed by Peter Zaitsev. Read the original article for more details.

A redundant index is an index that can be removed without significantly hurting query performance. MySQL allows creation of multiple indexes that are fully or partially similar in structure. This generally does not have any advantages, but does have disadvantages. Therefore, it is usually a good idea to remove these redundant indexes.

I just mentioned the term index structure. I use this term in a rather loose sense as a particular combination of the following three index characteristics:

The index columns

Each index defines the list of columns that are indexed by the index

The index type

The index type corresponds with the internal algorithm that is used to build and maintain the index

The uniqueness of the index

Whether the index is declared to reject duplicate entries


So how do we find out if there are any redundant indexes? Using these characteristics to compare indexes on one particular table to one another covers much of what we need to determine if we are dealing with redundant indexes.

Index Type


The index type corresponds to the particular kinds of internal algorithms and data structures used to implement the index. In MySQL, support for a particular index type depends upon the storage engine:

BTREE

General purpose indexes. Supported for the MyISAM, InnoDB and MEMORY storage engines.

HASH

General purpose indexes. Supported for the MEMORY and NDB storage engines.

FULLTEXT

Special purpose indexes for text columns, optimized for advanced searching of combinations of words. Supported only for the MyISAM storage engine.

SPATIAL

Special purpose indexes for spatial (geometric) datatypes. Supported only for the MyISAM storage engine.


Not all index types are equally suitable for executing an arbitrary query. Rather, particular access patterns implied by the query statement can only be processed efficiently using an index of a certain type.

What does this mean when comparing indexes to one another? Well, it means that a comparison of two particular indexes only makes sense if they have the same type. Also, the index type determines what method should be used to compare the index columns.

Index Columns


As mentioned, columnlists should be compared in a manner that is dependent upon the index type:

BTREE

A BTREE index is redundant if all of its columns are also indexed in the same order by another index, and the first indexed column is the same for both indexes. Another way of putting it is to say that the columnlist of the index is a leftmost prefix of the columnlist of the other index.

HASH

A HASH index is redundant if each of its columns matches, in the same order, the columns of another HASH index. Another way of putting it is to say that the columnlists of these indexes are duplicates of one another.

SPATIAL

In MySQL, SPATIAL indexes can have at most one column. So by definition, a SPATIAL index is redundant if there exists another SPATIAL index for the indexed column, that is, if there exists a duplicate.

FULLTEXT

A FULLTEXT index is redundant if all of the indexed columns are already indexed by another FULLTEXT index, regardless of the order of the columns.


A few notes:

  • The criterion for comparing columnlists of HASH indexes is a special case of the criterion used for comparing the columnlists of BTREE indexes. That's because for HASH indexes, both columnlists must have exactly the same number of members.

  • The criterion for comparing columnlists of SPATIAL indexes is a special case of the criterion for comparing the columnlists of HASH indexes. That's because SPATIAL indexes happen to always have just one column.

  • The criterion for comparing columnlists of FULLTEXT indexes is the only case where the ordering of the columns in the columnlist is not at all important for the comparison. The other cases implicitly or explicitly require that both the names and the ordinal positions of the columns match, and that the indexes have the same column at the first ordinal position.


Of the mentioned cases, BTREE indexes are by far the most abundant. Peter's article is mainly focused on BTREE indexes and redundancy caused by matching left prefixes.
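
For example, in the following made-up table definition, the columnlist of ix_last is the leftmost prefix of the columnlist of ix_last_first, so ix_last is redundant and can be dropped without losing an access path:

CREATE TABLE customer (
  lastname  VARCHAR(32)
, firstname VARCHAR(32)
, KEY ix_last       (lastname)
, KEY ix_last_first (lastname, firstname)
);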

Uniqueness


Uniqueness is a property of an index that tells the database it should not allow duplicate entries. That is, each combination of values in the indexed columns can occur at most once in the entire table.

The uniqueness of an index does not influence the structure of the index as such. However, uniqueness is of major importance to the behaviour applications expect from such an index. Even if an index appears to be redundant according to a comparison based on index type and the columnlist, we should not conclude that we can always remove such seemingly redundant indexes. Doing so could cause a unique index to be dropped, changing the database's rules that validate input.

Because FULLTEXT and SPATIAL indexes cannot be declared to be UNIQUE, uniqueness needs to be taken into account only when looking for redundant BTREE or HASH indexes.
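
A small, hypothetical example illustrates the point (the table and index names are made up):

CREATE TABLE t (
  a INT NOT NULL
, b INT NOT NULL
, UNIQUE KEY uq_a  (a)
, KEY        ix_ab (a, b)
);

-- Structurally, uq_a looks redundant: (a) is the leftmost prefix of (a, b).
-- But dropping uq_a would silently allow the duplicate below to be accepted:
INSERT INTO t (a, b) VALUES (1, 1), (1, 2);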

Index Metadata


MySQL offers several ways to inspect the indexes in the database.

First of all, you can use the SHOW CREATE TABLE statement. This is just a general utility statement to quickly inspect the structure of a particular table. The output is a DDL statement that could be used to (re)create the table's structure, including index definitions.

For example, take a look at this example from Daniel Schneller's blog:

mysql> show create table people
    -> \G
*************************** 1. row ***************************
       Table: people
Create Table: CREATE TABLE `people` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `firstname` varchar(32) DEFAULT NULL,
  `lastname` varchar(32) DEFAULT NULL,
  `birthday` date DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_reversename` (`lastname`,`firstname`),
  KEY `idx_id_birth_first` (`id`,`birthday`,`firstname`),
  KEY `idx_full` (`firstname`,`lastname`),
  KEY `idx_first` (`firstname`),
  KEY `idx_id_first` (`id`,`firstname`),
  KEY `idx_id_birth` (`id`,`birthday`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

(By the way - I will be using this table definition in the following examples.)

The second way is to use the specialized SHOW INDEX statement, which is especially designed to inspect indexes. This statement does not return DDL, rather, it returns a row for each column for all of the indexes in a particular table, along with more detailed information about the index data such as the cardinality (the number of index entries).

For this table, the result looks like this:

mysql> show index from people;
+--------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table  | Non_unique | Key_name           | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| people |          0 | PRIMARY            |            1 | id          | A         |           0 | NULL     | NULL   |      | BTREE      |         |
| people |          1 | idx_reversename    |            1 | lastname    | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_reversename    |            2 | firstname   | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_id_birth_first |            1 | id          | A         |           0 | NULL     | NULL   |      | BTREE      |         |
| people |          1 | idx_id_birth_first |            2 | birthday    | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_id_birth_first |            3 | firstname   | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_full           |            1 | firstname   | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_full           |            2 | lastname    | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_first          |            1 | firstname   | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_id_first       |            1 | id          | A         |           0 | NULL     | NULL   |      | BTREE      |         |
| people |          1 | idx_id_first       |            2 | firstname   | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
| people |          1 | idx_id_birth       |            1 | id          | A         |           0 | NULL     | NULL   |      | BTREE      |         |
| people |          1 | idx_id_birth       |            2 | birthday    | A         |           0 | NULL     | NULL   | YES  | BTREE      |         |
+--------+------------+--------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
13 rows in set (0.00 sec)

The third way is to use the STATISTICS system view in the information_schema database. This offers exactly the same information as the SHOW INDEX statement. However, the STATISTICS system view can be used just as any other table in an arbitrary query, offering possibilities to filter, group, sort, join and whatnot.

For example, output equivalent to that returned by SHOW INDEX statement can be retrieved using the following statement:

SELECT non_unique
,      index_name
,      seq_in_index
,      column_name
,      collation
,      cardinality
,      sub_part
,      packed
,      nullable
,      index_type
,      comment
FROM   information_schema.STATISTICS
WHERE  table_schema = schema()
AND    table_name = 'people'
ORDER BY index_name
,        seq_in_index

(I think that - despite the explanation given in the manual - it is a bit strange that this system view should be called STATISTICS. A name like INDEX_COLUMNS would better describe the data it exposes, as the system view contains one row for each index column per index.)

Using the STATISTICS system view


As mentioned before, the STATISTICS system view in the information_schema database contains information about index columns rather than indexes. Now how can we use it to compare indexes to each other?

Well, according to the reasoning regarding index comparisons, we know that we should compare indexes appearing in one table to each other, matching at least index type and columns. We could try and see what we can achieve with a self-join of the STATISTICS view, matching almost all identifying columns with the exception of the index name (as a particular index would surely match itself).

Using the GROUP BY operation, we can avoid reporting a row for each matched column. Also, this allows us to use GROUP_CONCAT to make a list of the matching columns.

SELECT s1.table_schema idx_schema
,      s1.table_name idx_table
,      s1.index_type idx_type
,      s1.index_name idx_name1
,      s2.index_name idx_name2
,      group_concat(
           s1.column_name
           order by s1.seq_in_index
       ) idx_columns -- matching index columns
FROM   information_schema.STATISTICS s1
JOIN   information_schema.STATISTICS s2
ON     s1.table_schema = s2.table_schema
AND    s1.table_name = s2.table_name
AND    s1.index_type = s2.index_type
AND    s1.index_name != s2.index_name
AND    s1.seq_in_index = s2.seq_in_index
AND    s1.column_name = s2.column_name
WHERE  s1.table_schema = schema()
AND    s1.table_name = 'people'
GROUP BY s1.table_schema
,        s1.table_name
,        s1.index_type
,        s1.index_name
,        s2.index_name

Although it might seem that this kind of query is a step in the right direction, the result is not quite what we need at all. Just create the people table if you haven't done so already, try this query, and see if the result makes sense to you.

What is particularly lacking is that we do not verify that we are matching all the columns of one of the indexes: the JOIN will match the columns if a match is possible, and those columns that do match will be in the result. The columns that don't match, but that are in fact part of the index, are simply absent from the result.

Effectively, the query reports all indexes for which parts of the columnlists match, without checking at which ordinal position they match and without checking whether all the columns of one particular index are matched.

Now these setbacks are of course not insurmountable: because we are using GROUP BY, we can use the COUNT function to count the number of matched columns. We can choose one of the STATISTICS instances to play the role of the redundant index. The number of matching columns should equal the total number of columns in the instance that plays the 'redundant' role. By definition, the total number of columns in the other instance is then more than (or equal to) the number of matching columns.

SELECT i.table_schema idx_schema
,      i.table_name idx_table
,      i.index_name idx_name
,      r.index_name redundant_idx_name
,      group_concat(
           i.column_name
           order by i.seq_in_index
       ) idx_columns -- matching index columns
FROM   information_schema.STATISTICS i -- i for index
JOIN   information_schema.STATISTICS r -- r for redundant
ON     i.table_schema = r.table_schema
AND    i.table_name = r.table_name
AND    i.index_type = r.index_type
AND    i.index_name != r.index_name
AND    i.seq_in_index = r.seq_in_index
AND    i.column_name = r.column_name
WHERE  i.table_schema = schema()
AND    i.table_name = 'people'
GROUP BY i.table_schema
,        i.table_name
,        i.index_type
,        i.index_name
,        r.index_name
HAVING count(r.seq_in_index) = (
    -- check if we matched all columns of the 'redundant' index
    SELECT count(1)
    FROM   information_schema.STATISTICS
    WHERE  table_schema = idx_schema
    AND    table_name = idx_table
    AND    index_name = redundant_idx_name
)

Phew.... Now, let's take a look at the result:

+------------+-----------+--------------------+--------------------+-------------+
| idx_schema | idx_table | idx_name           | redundant_idx_name | idx_columns |
+------------+-----------+--------------------+--------------------+-------------+
| test       | people    | idx_full           | idx_first          | firstname   |
| test       | people    | idx_id_birth       | PRIMARY            | id          |
| test       | people    | idx_id_birth_first | idx_id_birth       | id,birthday |
| test       | people    | idx_id_birth_first | PRIMARY            | id          |
| test       | people    | idx_id_first       | PRIMARY            | id          |
+------------+-----------+--------------------+--------------------+-------------+

Well, at least, we do get a list of the redundant indexes, but the same index is reported multiple times.

Now we could go on in this manner in order to get a list of unique redundant indexes. We can also try another JOIN solution. Joining on complete concatenated columnlists is for example an approach that might seem a bit strange, but that actually has some advantages over this method.
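
For the curious, here is a rough sketch of that concatenated-columnlist approach; it is my own illustration rather than a finished tool, and it glosses over the stricter rules for HASH indexes as well as wildcard characters that LIKE would interpret:

SELECT i.table_name
,      i.index_name
,      r.index_name redundant_idx_name
FROM (
    SELECT table_name, index_name, index_type
    ,      group_concat(column_name ORDER BY seq_in_index) column_names
    FROM   information_schema.STATISTICS
    WHERE  table_schema = schema()
    GROUP BY table_name, index_name, index_type
) i
JOIN (
    SELECT table_name, index_name, index_type
    ,      group_concat(column_name ORDER BY seq_in_index) column_names
    FROM   information_schema.STATISTICS
    WHERE  table_schema = schema()
    GROUP BY table_name, index_name, index_type
) r
ON     i.table_name  = r.table_name
AND    i.index_type  = r.index_type
AND    i.index_name != r.index_name
-- the trailing comma prevents a column like 'id2' from matching the prefix 'id'
AND    concat(i.column_names, ',') LIKE concat(r.column_names, ',%')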

However, the solution that I tend to favour right now is one that uses no JOINs at all.

Finding redundant indexes


We just discussed how we could use a JOIN to compare the column lists of two indexes. Up to a certain extent, GROUP BY can be used to drive a comparison too.

It may seem strange to think of GROUP BY, because what we are trying to do is to compare indexes according to columnlist prefixes. However, when you think of it as making groups based on a part of the columnlist, it should begin to make sense. Without further ado, here's the solution I eventually came up with:

select table_name
,      index_type
,      min(column_names) column_names
,      trim(',' FROM
           case index_type
               when 'BTREE' then
                   replace(
                       -- report all but the last one
                       -- (the last one is the longest one)
                       substring_index(
                           group_concat(
                               '`',index_name,'`'
                               order by column_count asc
                               ,        non_unique asc
                               ,        index_name desc
                               separator ','
                           )
                       ,   ','
                       ,   count(*) - 1
                       )
                       -- get the first one
                       -- (the first one is the smallest unique one)
                   ,   concat(
                           '`'
                       ,   substring_index(
                               group_concat(
                                   if( non_unique = 0
                                   ,   index_name
                                   ,   ''
                                   )
                                   order by non_unique asc
                                   ,        column_count asc
                                   ,        index_name asc
                                   separator ','
                               )
                           ,   ','
                           ,   1
                           )
                       ,   '`'
                       )
                   ,   ''
                   )
               when 'HASH' then
                   substring_index(
                       group_concat(
                           '`',index_name,'`'
                           order by non_unique asc
                           ,        index_name asc
                           separator ','
                       )
                   ,   ','
                   ,   1 - count(*)
                   )
               when 'SPATIAL' then
                   substring_index(
                       group_concat(
                           '`',index_name,'`'
                           order by index_name asc
                           separator ','
                       )
                   ,   ','
                   ,   1 - count(*)
                   )
               else 'unexpected type - not implemented'
           end
       ) redundant_indexes
from (
    select table_name
    ,      index_name
    ,      index_type
    ,      non_unique
    ,      count(seq_in_index) as column_count
    ,      group_concat(
               if(seq_in_index=1,column_name,'')
               separator ''
           ) as column_name
    ,      group_concat(
               column_name
               order by seq_in_index
               separator ','
           ) as column_names
    from   information_schema.statistics s
    where  s.table_schema = schema()
    and    s.index_type != 'FULLTEXT'
    group by table_name
    ,        index_name
    ,        index_type
    ,        non_unique
) as s
group by table_name
,        index_type
,        if(index_type='HASH',column_names,column_name)
having redundant_indexes != ''

Explanation


Let's start with the straightforward bit. We already mentioned that the STATISTICS table returns index columns rather than indexes. Let's first make a query that returns a row for each index rather than for each index column:

select table_name
,      index_name
,      count(seq_in_index) as column_count
,      group_concat(
           if(seq_in_index=1,column_name,'')
           separator ''
       ) as column_name
,      group_concat(
           column_name
           order by seq_in_index
       ) as column_names
from   information_schema.statistics s
where  s.table_schema = schema()
and    s.index_type != 'FULLTEXT'
group by table_name
,        index_name

The GROUP BY is used to group on the identifying properties of an index: the table name and the index name. The table_schema is also partially identifying, but the query requires that only indexes residing within the current schema are queried, which is why the table_schema column appears in the WHERE clause and thus need not be part of the GROUP BY list:

where s.table_schema = schema()

We've also excluded FULLTEXT indexes, just for convenience.

and s.index_type != 'FULLTEXT'

In the SELECT-list, we calculate a few aggregates. The column_count expression is pretty self-explanatory; it counts the number of columns in the index:

count(seq_in_index) as column_count

We won't dwell on this now, but it will prove to be useful later on.

There is one special aggregate: column_name:

group_concat(
    if(seq_in_index=1,column_name,'')
    separator ''
) as column_name

The intention of this expression is to report only the first of the columns in the columnlist, meaning: the column with ordinal position equal to one. So, we are aggregating over an IF expression that returns the column name only if it appears as the first column (seq_in_index=1) in the index. Otherwise, it returns the empty string.

This expression is evaluated for each of the columns of the index that is grouped by the GROUP BY, separating the entries by, again, the empty string. The condition will be true for just one of the columns that are grouped together. The entire expression thus gives us that one column name, concatenated to lots of empty strings, which of course amounts to just the one column name.

The column_names expression concatenates all columns of the index ordered by ordinal position:

group_concat(
    column_name
    order by seq_in_index
) as column_names

In this case, no separator is specified so the default, a comma, will be used. The result of this expression allows an easy confirmation that the column_name indeed reports only the first column of the index.
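
For the people table, this query returns one row per index, along these lines (row order may vary):

+------------+--------------------+--------------+-------------+-----------------------+
| table_name | index_name         | column_count | column_name | column_names          |
+------------+--------------------+--------------+-------------+-----------------------+
| people     | PRIMARY            |            1 | id          | id                    |
| people     | idx_first          |            1 | firstname   | firstname             |
| people     | idx_full           |            2 | firstname   | firstname,lastname    |
| people     | idx_id_birth       |            2 | id          | id,birthday           |
| people     | idx_id_birth_first |            3 | id          | id,birthday,firstname |
| people     | idx_id_first       |            2 | id          | id,firstname          |
| people     | idx_reversename    |            2 | lastname    | lastname,firstname    |
+------------+--------------------+--------------+-------------+-----------------------+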

The column_name expression is important because it allows us to make groups of indexes according to a column prefix of their columnlists. By definition, all indexes that have the same columnlist prefix must have the same column as first column. So, we don't need to group using the largest common prefix; rather, we are using the smallest thinkable common prefix. This works even if the index with the smallest columnlist has more than just one column.

We need the column prefix to compare BTREE indexes to each other. However, as mentioned in the introduction, a prefix comparison is not enough for HASH indexes. For HASH indexes, the entire columnlist should be used. So, somehow, we need to switch the grouping expression, depending upon whether or not the index type is HASH.

This leads us to another little issue we have to take care of. In the introduction we mentioned that it is only useful to compare indexes of the same index type. So, in addition to the previous query, we now also introduce the index_type column: only within the group of indexes of a particular type do we want to make groups of a common column prefix.

Adding index_type to the GROUP BY list of the previous query will not affect the number of returned rows. That's because the grouping in the previous query was per index, and since a particular index has only one index type, adding the column to the GROUP BY cannot give rise to subgroups beyond the index level.

However, it can (and does) once we start grouping indexes themselves:

select table_name
,      index_type
,      column_name
,      group_concat(
           index_name
       ) index_names
from (
    select table_name
    ,      index_name
    ,      index_type
    ,      count(seq_in_index) as column_count
    ,      group_concat(
               if(seq_in_index=1,column_name,'')
               separator ''
           ) as column_name
    ,      group_concat(
               column_name
               order by seq_in_index
               separator ','
           ) as column_names
    from   information_schema.statistics s
    where  s.table_schema = schema()
    and    s.index_type != 'FULLTEXT'
    group by table_name
    ,        index_name
    ,        index_type
) as s
group by table_name
,        index_type
,        if(index_type='HASH',column_names,column_name)

Note that we are not just grouping by the columnlist prefix in all cases. Because indexes of the HASH type are only equivalent when there is a match for the entire columnlist, we need to take that into account in the GROUP BY expression:

if(index_type='HASH',column_names,column_name)

So, if we are dealing with a HASH index, then group by the entire columnlist, else group by the (smallest) prefix only.

The SELECT list contains an expression to produce a comma-separated list of all the indexes that we consider in the comparison:

group_concat(
    index_name
) index_names

Some of the indexes that appear in this list will be redundant, but some of them are essential: removing them will certainly affect the performance of certain queries. If we could somehow remove the essential indexes from the list, we would end up with a list of redundant indexes, which is what we set out to do.

How can we go about this? Well, let's stick to the case of BTREEs for a while. Remember that for this index type, we are primarily interested in the index with the longest columnlist: we consider that to be the most essential index. It is essential because it offers an access path that is not offered by the indexes with shorter columnlists, whereas the access paths offered by those other indexes are already contained within the one with the longest columnlist.

(We can really safely forget about the other index types for a while. In the case of HASH and SPATIAL indexes, we are dealing with identical columnlists, so by definition, all columnlists are 'the longest'. That means that whatever solution we come up with for BTREEs applies automatically to these index types, although it may be possible that a simpler solution is available for them.)

Now, if only we were able to extract the index_name with the largest number of columns from the list...

We actually know how many columns each index has, because we already computed that in the subquery as the column_count expression. But how do we use that information? Well, we can use the optional ORDER BY clause of the GROUP_CONCAT() function. Using that, we can sort our index names according to the number of columns in the index:

group_concat(
    index_name
    order by column_count
) index_names

This ensures that the index with the largest number of columns in its columnlist is queued up at the end of the list. Once it is sitting there, it is not too hard to cut it off: we can use the SUBSTRING_INDEX() function for that, like so:

substring_index(
    group_concat(
        index_name
        order by column_count
        separator ','
    )
,   ','
,   count(*) - 1
)

What happens here is that the comma-separated list of index names, ordered according to the number of columns in the index, is generated first. (For clarity, the SEPARATOR clause is included in the GROUP_CONCAT.) Then, SUBSTRING_INDEX() takes only a part of it, namely the part up to the last comma in the list. The position of the last comma is conveniently calculated by taking the count of indexes that are used to generate the list, and subtracting one (because the indexes in the list are separated by commas, there is exactly one comma less than the number of indexes). So the last index, the essential index, is cut off, and what then remains is the list of redundant indexes.
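
As a standalone illustration of this cutting technique (my own example, not part of the solution):

SELECT substring_index('idx_a,idx_ab,idx_abc', ',', 3 - 1);
-- returns 'idx_a,idx_ab': the last entry, 'idx_abc', is cut off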

Having explained the technique, we can now discuss a few minor improvements. First of all, what happens when there is not just one 'essential index' but two, or even more? It is quite possible that there is more than one index with the maximum number of columns for the given prefix. Well, in that case, since we are dealing with duplicates, we will simply elect the index with the 'smallest' index name.

This is easy enough to implement. We simply have to add a new sort key after column_count. Because we have arranged for SUBSTRING_INDEX() to discard the last entry of the list, we must order by index_name in descending order:

substring_index(
    group_concat(
        index_name
        order by column_count
        ,        index_name desc
        separator ','
    )
,   ','
,   count(*) - 1
)

Ok, so far so good. Now there is one thing we've consistently ignored all this time, and that is uniqueness. What do we do when our list of 'redundant' indexes contains one or more unique indexes? Well, if we were to remove those from the schema, we could be in big trouble! Input that would have been rejected because it violated a unique index would suddenly be allowed, and this is certainly not our intention. This simply means that our concept of a redundant index is incomplete. It doesn't take an important feature of the indexes into account, namely uniqueness.

So what do we need to do in case we really do have UNIQUE indexes? Well, we should look for the UNIQUE index with the least number of columns. Just as the index with the largest number of columns provides an essential access path, the unique index with the smallest number of columns provides an essential uniqueness rule. Any other unique indexes with the same prefix that have more columns are redundant: they cannot make their indexed data 'more unique'.

Luckily, this improvement is not more complicated than what we've seen so far. What we can do is set up an entirely new list of only the unique indexes:

substring_index(
    group_concat(
        if( non_unique = 0
        ,   index_name
        ,   ''
        )
        order by non_unique asc
        ,        column_count asc
        ,        index_name asc
        separator ','
    )
,   ','
,   1
)

Using the same techniques we used before, we constructed an expression that generates a list of only the UNIQUE indexes, sorted on the number of columns. If there is at least one UNIQUE index, it will be in front of the list. Using SUBSTRING_INDEX() we can cut the first UNIQUE index off the list: this will be the one with the smallest number of columns, because we included the column_count expression in the ORDER BY clause.

Now, all we need to do is remove the index name of the unique index with the smallest number of columns from our list of 'redundant indexes'. This is easily done by the REPLACE() function. Putting it all together, we get something like this:

replace(
    substring_index(
        group_concat(
            '`',index_name,'`'
            order by column_count asc
            ,        non_unique asc
            ,        index_name desc
            separator ','
        )
    ,   ','
    ,   count(*) - 1
    )
,   concat(
        '`'
    ,   substring_index(
            group_concat(
                if( non_unique = 0
                ,   index_name
                ,   ''
                )
                order by non_unique asc
                ,        column_count asc
                ,        index_name asc
                separator ','
            )
        ,   ','
        ,   1
        )
    ,   '`'
    )
,   ''
)

That's about it, really. We do need to put in a little more effort to ascertain that REPLACE() will only remove complete index names, but that is not too hard. Also, all this cutting and replacing might leave a few stray commas behind: those must be cleaned up as well.
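
The backtick quoting and the TRIM() call in the final query take care of exactly that; a standalone illustration (my own example):

SELECT trim(',' FROM
           replace('`uq_a`,`ix_ab`,`ix_abc`', '`uq_a`', '')
       );
-- the backticks ensure that only the complete name `uq_a` is removed,
-- and TRIM() strips the stray leading comma: '`ix_ab`,`ix_abc`'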

Limitations


This solution certainly has its limitations. I've made a list of the most important ones:

  • Relies on the information schema: this solution is not available in MySQL versions prior to 5.0, and it might be slow

  • FULLTEXT indexes are ignored

  • Column prefixes are ignored.


If you have any comments or suggestions, whatever they may be, please don't hesitate to post a comment. Remember, the code will remain available as a snippet at MySQLForge.

Wednesday, September 06, 2006

Refactoring MySQL Cursors

Quite a while ago, I wrote a few articles on using MySQL cursors. This time, I will try and shed some light on why cursors are often unnecessary. To illustrate that, I will refactor a typical case of cursor usage.

The first part of this article explains why cursors are usually unnecessary. A few common problems with cursors are briefly discussed. Also, a typical stored procedure pattern that uses a cursor is described, and a demonstration is given that shows how it can be refactored into an equivalent procedure that uses a single SQL statement instead.

In the second part of this article, the negative performance implications of using cursors are illustrated with a few benchmarks, and the cases where a cursor might be useful after all are briefly discussed.

Cursor trouble


Some time ago, I used to be quite active in some of the MySQL Forums, especially the ones that cover the features introduced in MySQL 5.0 (views, stored procedures, triggers and cursors).

After a little while it became quite obvious to me that most of the problems that were posted in the cursor forum could be grouped into just a few classes.
The following lists them in ascending order according to the abundance (as perceived by me) of posts mentioning that particular type of problem.


Attempts to use unsupported features

In a minority of cases, people expected a feature that just isn't supported by MySQL. To name a few useful ones: cursor FOR loops, FOR UPDATE cursors and passing cursors to and from stored procedures as a parameter.

Nested Cursor loops

Nesting cursors is usually a combined problem of the following two problems. If you really need nested cursors, then you have to be extra careful with loop handling. However, usually, nesting cursors is entirely unnecessary and inappropriate. (Many, many nested cursor loops can and should be written as simple JOINs.)

Basic cursor loop handling problems

This really is quite a common problem. To be frank, there are quite a few things you need to think of if you want to do proper cursor looping in MySQL, and I think I can safely call it a fact that a lot of people run into trouble there.

Unnecessary usage of cursors

This is by far the most common problem with cursors. In maybe as many as 80% of the cases where a cursor is used, it isn't needed at all. In quite a lot of cases, the cursor can be written as a single SQL statement.



In the forum, I found myself explaining the same problem and the solution over and over again. So, I wrote little articles to illustrate basic loop handling (Want to write a CURSOR LOOP with MySQL?,
Why REPEAT and WHILE are usually not handy to handle MySQL CURSORs) and cursor nesting (Nesting MySQL Cursor Loops). These articles gave me a quick pointer to throw with in case I bumped into yet another thread requesting help regarding these types of problems.

On some occasions, I have briefly pointed out that cursors are often unnecessary.
This time, I'll try to elaborate on this a little further. My objective is to illustrate how a typical case of using cursors can be refactored into a single SQL statement that uses no cursor at all. We'll see how this allows us to discard about two thirds of the code and gain considerable performance at the same time.

Are you Sure you need a cursor?



In quite a few cases where I tried to help by explaining in what way the loop handling or the nesting was flawed, I pointed out that although the problem could be fixed, it was probably a better idea to do away with the cursor altogether. More often than not it was not too hard to think of a way to rewrite the entire cursor loop (including the body) as a single SQL statement.

However, it sometimes turned out to be quite hard to convince the person that originally wrote the cursor loop. In many cases, cursors are used by developers that are more familiar with a procedural or object-oriented language than with SQL.

I feel that a single SQL statement is nicer, more elegant. Usually, it involves less code, and often it is easier to understand. However, these are matters of aesthetics and preference, and thus do not count - not really. In my opinion, no matter how useful, creative, revolutionary or brilliant a piece of software is, it will never become a work of art (although it may demonstrate sublime craftsmanship).

Luckily, there is an extra reason to avoid cursors in favour of single SQL statements. It's a good one too, and it will often convince the most stubborn procedurally oriented programmer. The extra reason is speed. I think it's safe to say that a single SQL statement will always outperform the equivalent cursor loop.

A nearly Real World Example


My nearly Real World Example is some sort of reservation database. The particular nature of the reservation is not really important. We can pretend that this concerns a car-rental shop, or maybe a hotel. The important thing is that customers can make reservations and that the reservation lists the reserved products.

A product has a default value for the price (value_prod), vat percentage (vat_prod) and discount percentage (discount_prod). The items that appear on a reservation have a reference to the reservation, and a reference to the product. Like a product, an item too has a price (unit_value), vat percentage (vat_value) and a discount percentage (discount). In addition, the item also stores the number of rental days (no_days), and a flag (disc_bool) that determines whether a discount will be charged. The products are stored in the product table, and the items are stored in the line_reserve table.

At first it might seem as if the unit_value, vat_value and discount columns in the line_reserve table are redundant, because these values could be derived from the related product table. However, this is not the case. The idea is that a reservation may be made in advance. So, the vat and discount percentages, as well as the price of the product, could change after the reservation was made.

(Although it might be tempting to discuss the database design, that is not the goal of this article. We won't discuss the validity of the database design here - it just serves as the background for a stored procedure I want to describe.)

A procedure to update price, vat and discount for reserved items



The database also contains an UpdatePriceAndVatAndDiscountForReservationItems procedure. The procedure is meant to do the actual legwork of calculating and storing the vat, discount and price values for all items that make up a reservation. The reason why I'm highlighting this procedure is that it contains a pattern of improper cursor handling, and I hope to demonstrate how that pattern can be recognized and rewritten.

(I'm deliberately not discussing any issues concerning coding style, such as identifier conventions. Also, I won't go into detail concerning the chosen datatypes. Although improvements might be possible in this respect, this article tries to focus on the cursor loop.)

Here's the code:

CREATE PROCEDURE `UpdatePriceAndVatAndDiscountForReservationItems`(
    IN var_id_res INTEGER
)
BEGIN
    -- the product record
    DECLARE var_unit_val DOUBLE;
    DECLARE var_vat DOUBLE;
    DECLARE var_discount DOUBLE;

    -- the line_reserve record
    DECLARE var_id_line INTEGER;
    DECLARE var_id_prod INTEGER;
    DECLARE var_disc TINYINT;
    DECLARE var_no_days INTEGER;

    -- the calculated vat
    DECLARE var_val_vat DOUBLE;

    -- cursor loop book-keeping
    DECLARE no_more_rows BOOLEAN;
    DECLARE num INTEGER;

    -- the line reserve cursor
    DECLARE cur_res CURSOR FOR
        SELECT id_line
        ,      id_prod
        ,      disc_bool
        ,      no_days
        FROM   line_reserve
        WHERE  id_res = var_id_res;

    -- more cursor loop book-keeping
    DECLARE CONTINUE HANDLER FOR NOT FOUND
        SET no_more_rows = TRUE;

    -- loop control
    OPEN cur_res;

    loop_cur_res: LOOP
        FETCH cur_res
        INTO  var_id_line
        ,     var_id_prod
        ,     var_disc
        ,     var_no_days;

        IF no_more_rows THEN
            CLOSE cur_res;
            LEAVE loop_cur_res;
        END IF;

        -- get corresponding product
        SELECT value_prod
        ,      vat_prod
        ,      discount_prod
        INTO   var_unit_val
        ,      var_vat
        ,      var_discount
        FROM   product
        WHERE  id_prod = var_id_prod;

        -- calculate vat
        SET var_val_vat =
            var_unit_val
          * ROUND((var_vat/100),2);

        -- calculate discount
        IF var_disc = 1 THEN
            SET var_discount =
                ((var_unit_val + var_val_vat) * var_no_days)
              * ROUND((var_discount/100),2);
        END IF;

        -- update line_reserve with vat and discount
        UPDATE line_reserve
        SET    unit_value = var_unit_val
        ,      vat_value  = var_val_vat
        ,      discount   = var_discount
        WHERE  id_line = var_id_line;

        -- record the line number
        SET num = num + 1;
    END LOOP loop_cur_res;

    -- return line count
    SELECT num as result;
END;


(I admit I'm guilty of including some comments and indentation to aid the readability of the code. So, now you know why I call it a nearly real world example.)

The skeleton


The procedure uses the var_id_res parameter to specify a particular reservation:

CREATE PROCEDURE `UpdatePriceAndVatAndDiscountForReservationItems`(
IN var_id_res INTEGER
)
...

This procedure parameter is used to control a cursor to select only items that correspond to the reservation passed by the parameter:

-- the line reserve cursor
DECLARE cur_res CURSOR FOR
    SELECT id_line
    ,      id_prod
    ,      disc_bool
    ,      no_days
    FROM   line_reserve
    WHERE  id_res = var_id_res;

The procedure loops through the cursor with an ordinary unstructured LOOP, fetching a record from the cursor for each iteration:

-- loop control
OPEN cur_res;
loop_cur_res: LOOP
    FETCH cur_res
    INTO  var_id_line
    ,     var_id_prod
    ,     var_disc
    ,     var_no_days;
    IF no_more_rows THEN
        CLOSE cur_res;
        LEAVE loop_cur_res;
    END IF;
    ...
    -- record the line number
    SET num = num + 1;
END LOOP loop_cur_res;
-- return line count
SELECT num as result;

The loop is controlled with the no_more_rows flag. The value of the flag is switched inside a NOT FOUND handler:

DECLARE CONTINUE HANDLER FOR NOT FOUND
    SET no_more_rows = TRUE;

This ensures the loop is terminated properly when the cursor is exhausted.

A little side-note. The end of the loop code contains explicit code to count the number of lines that are processed by the cursor: the num variable is incremented for each iteration of the loop, and the value is selected and returned to the calling client right after the end of the loop. However, it is flawed: the procedure always returns NULL for the result column:

mysql> call UpdatePriceAndVatAndDiscountForReservationItems(76);
+--------+
| result |
+--------+
|   NULL |
+--------+
1 row in set (0.01 sec)

Query OK, 0 rows affected (0.01 sec)

That's because of an easily overlooked detail: the num variable was never initialized. Therefore, its value will remain NULL forever. The easiest way to prevent such trouble is to give such variables an explicit default value in their declaration:

DECLARE num INTEGER DEFAULT 0;

So far, we've seen nothing special.

The loop body


Now for the body of the loop. Each iteration essentially executes three separate things:

  1. Fetch data from the product that corresponds to the current reservation item

  2. Calculate the vat and discount using data from both the reservation item and the product

  3. Update the reservation item with the calculated values


Writing SELECT...INTO as a JOIN

For each record fetched from the cursor, the corresponding product record is retrieved with a SELECT...INTO statement. It uses the id_prod from the reservation item that was FETCH-ed at the top of the loop to identify exactly one corresponding record from the product table:

-- get corresponding product
SELECT value_prod
,      vat_prod
,      discount_prod
INTO   var_unit_val
,      var_vat
,      var_discount
FROM   product
WHERE  id_prod = var_id_prod;

This is a good moment to realize a couple of things.

A SELECT...INTO statement can retrieve at most one record. In this case, it's probably quite safe to assume that it does, because the reservation item has a mandatory foreign key that references the primary key of the product table.

But suppose we would use this type of statement to retrieve data for an optional relationship. Well, var_id_prod could be NULL, and the SELECT...INTO statement would not retrieve any record.

Ouch!

That would mean we're in big trouble: the loop is controlled using a NOT FOUND handler, and that's exactly the handler that will be fired when this SELECT...INTO statement matches no records. Our loop would finish unintentionally, and most probably prematurely.

So, what can we do about it? Well, a lot of things. To name a few:

  • If we can be sure that a record will always be matched as long as var_id_prod is not NULL, we can wrap an IF...END IF around it that skips the SELECT...INTO whenever var_id_prod is NULL.

  • We can wrap a BEGIN...END block around it, and declare a separate NOT FOUND handler there (see the sketch after this list). Because that handler is nearer than the one that controls our loop, it will capture the condition, and the loop will never be left. The nice thing about this approach is that it also offers an opportunity to handle cases where the statement matches more than one record.

  • We can explicitly reset the loop control variable right before the FETCH statement. You will find more information on that solution here
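
To illustrate the second option, here is a minimal sketch of such a nested block, using the variables from the example procedure (the inner handler body is my own illustration, not part of the original code):

-- get corresponding product, without disturbing the outer NOT FOUND handler
BEGIN
    -- this nearer handler captures the condition; no_more_rows stays untouched
    DECLARE CONTINUE HANDLER FOR NOT FOUND
        SET var_unit_val = NULL;

    SELECT value_prod
    ,      vat_prod
    ,      discount_prod
    INTO   var_unit_val
    ,      var_vat
    ,      var_discount
    FROM   product
    WHERE  id_prod = var_id_prod;
END;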


However, the real question is: what is a SELECT...INTO doing inside this loop? If we are sure it will match a record for each record fetched by the loop, why not rewrite the cursor to use a join?

DECLARE cur_res CURSOR FOR
    SELECT r.id_line
    ,      r.disc_bool
    ,      r.no_days
    ,      p.value_prod
    ,      p.vat_prod
    ,      p.discount_prod
    FROM   line_reserve r
    JOIN   product p
    ON     r.id_prod = p.id_prod
    WHERE  id_res = var_id_res;


So, this is a pattern we can recognize, and avoid. Any SELECT...INTO that is executed for each iteration of a cursor loop can be rewritten by joining it to the SELECT-statement of the cursor. The INTO fields are simply added to the SELECT-list of the cursor query. In this particular case, we can also remove the id_prod column from the SELECT-list. It was only used to retrieve the product data, and that is now already taken care of.

We should keep in the back of our minds that when we change the SELECT-list, we should also change the FETCH, or else there will be a mismatch between the number of expressions in the SELECT-list and the variable list used in the FETCH statement.
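
For our example, the FETCH that matches the rewritten cursor would look like this sketch (the variables are the ones already declared in the example procedure):

FETCH cur_res
INTO  var_id_line
,     var_disc
,     var_no_days
,     var_unit_val
,     var_vat
,     var_discount;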

It is important to realize that this will not result in extra iterations; at least, not if the SELECT...INTO was the right choice in the first place.

Writing SET assignments as part of SELECT

The loop body continues by calculating the vat and discount values:

-- calculate vat
SET var_val_vat =
    var_unit_val
  * ROUND((var_vat/100),2);

-- calculate discount
IF var_disc=1 THEN
    SET var_discount =
        ((var_unit_val+var_val_vat)*var_no_days)
      * ROUND((var_discount/100),2);
END IF;

If we settle for using a JOIN in the cursor query instead of a SELECT...INTO inside the loop body, we might just as well add the expressions that calculate the vat and the discount to the SELECT-list:

DECLARE cur_res CURSOR FOR
    SELECT r.id_line
    ,      p.value_prod
         * ROUND((p.vat_prod/100),2) var_val_vat
    ,      if( r.disc_bool
           ,   ( p.value_prod
               * round(p.vat_prod/100,2)
               + p.value_prod
               )
             * r.no_days
             * round(p.discount_prod/100,2)
           ,   NULL
           ) var_discount
    FROM   line_reserve r
    JOIN   product p
    ON     r.id_prod = p.id_prod
    WHERE  id_res = var_id_res;

Again, we can discard columns we don't need anymore, and we must modify the FETCH-statement accordingly.

It seems trivial, but for completeness, we should note that this is a pattern too: any expression that appears on the right-hand side of the assignment operator in a SET statement can be written as an expression in the SELECT-list, and the actual assignment to variables is then handled in the FETCH-statement.

In the original example, the var_discount variable is conditionally set using an IF..THEN statement. If the IF..THEN statement only contains SET statements, then all of the corresponding expressions on the right hand side of the assignment operator can be written using the IF function. This can be useful if the expression is expensive or if it performs some kind of side-effect.

Rewriting the single row UPDATE

The final action that is performed inside the cursor loop is an UPDATE of the line_reserve table. In order to store the calculated vat and discount data, only the record that corresponds to the record fetched by the cursor is updated:

UPDATE line_reserve
SET    unit_value = var_unit_val
,      vat_value  = var_val_vat
,      discount   = var_discount
WHERE  id_line = var_id_line;

(There are rdbms-products that support a FOR UPDATE clause in the cursor declaration. This lets you modify the values in the cursor record immediately. MySQL does not support this feature.)

We should realize that UPDATE is perfectly capable of updating multiple records. In fact, any UPDATE statement that does not include a WHERE-clause will update all the records in the table. To limit the update to only those line_reserve records that correspond to a reservation, we will need to add a proper WHERE-clause.
In fact, we know exactly which WHERE-clause we need, because we used it to write the cursor declaration:

UPDATE line_reserve
SET    unit_value = ?
,      vat_value  = ?
,      discount   = ?
WHERE  id_res = var_id_res;

Of course, we cannot just use the original variables for the column assignments, because the value of these variables potentially varies for each record touched by the UPDATE-statement. Thinking of an expression that calculates them directly as part of the UPDATE-statement poses a slight problem, because we have seen that the calculation of vat and discount depend on values from both the line_reserve record, as well as the corresponding product record.

So, somehow, we need to find a way for the corresponding product records to contribute their values to the UPDATE statement. One way of solving this problem is using subqueries:

UPDATE line_reserve r
SET    unit_value = (
           select p.value_prod
           from   product p
           where  p.id_prod = r.id_prod
       )
,      vat_value = (
           select p.value_prod
                * ROUND((p.vat_prod/100),2)
           from   product p
           where  p.id_prod = r.id_prod
       )
,      discount = (
           select if( r.disc_bool
                  ,   ( p.value_prod
                      * round(p.vat_prod/100,2)
                      + p.value_prod
                      )
                    * r.no_days
                    * round(p.discount_prod/100,2)
                  ,   r.discount
                  )
           from   product p
           where  p.id_prod = r.id_prod
       )
WHERE  id_res = var_id_res;

Before explaining these subqueries, please take a moment to realize that this single statement does all the work the cursor loop used to do. So, this proves that the cursor loop really was unnecessary in the sense that we can perform the same task using a single SQL statement.

The subqueries are the parenthesis-enclosed SELECT expressions appearing on the right-hand side of the assignment operator in the SET-clause. In order for a subquery to be used like this, the subquery needs to be scalar: it must select a single row, with a single column. Because of this special form, the 'resultset' can be interpreted as a scalar: a simple, singular value-expression.

Note that these three subqueries differ only as far as the SELECT-list is concerned. They all select one single row from the product table that corresponds to the current line_reserve record. To select only the corresponding product record, the subquery is bound to the main query using an appropriate WHERE-clause.

(It is no coincidence that the WHERE-clause used here is identical to the one used in the SELECT...INTO statement in the original code.)

Subqueries that use expressions from the surrounding query to limit the resultset are said to be correlated.

Some people might feel uncomfortable about the repetition of code. Three times, essentially similar queries are performed. Even if there were some kind of optimization that accounts for any performance issues that might arise from that, it is still objectionable that we must repeatedly write part of the code.

Some rdbms-products allow a variation of this kind of syntax to address this issue:

UPDATE line_reserve r
SET (
    unit_value
,   vat_value
,   discount
) = (
    select p.value_prod
    ,      p.value_prod
         * ROUND((p.vat_prod/100),2)
    ,      if( r.disc_bool
           ,   ( p.value_prod
               * round(p.vat_prod/100,2)
               + p.value_prod
               )
             * r.no_days
             * round(p.discount_prod/100,2)
           ,   r.discount
           )
    from   product p
    where  p.id_prod = r.id_prod
)
WHERE id_res = var_id_res;

So, the update is performed by assigning a complete record. This kind of syntax is not supported by MySQL. But as we shall see, there is a much better way to handle this. MySQL and a few other rdbms products support the JOIN syntax as part of an UPDATE statement.

We already discussed how a JOIN could be used to fetch product data corresponding to the line_reserve records in the cursor declaration. Now all we have to do is to modify that statement to an update statement:

UPDATE line_reserve r
INNER JOIN product p
ON     r.id_prod = p.id_prod
SET    r.unit_value = p.value_prod
,      r.vat_value  = p.value_prod * round(p.vat_prod/100,2)
,      r.discount   = if( r.disc_bool
                      ,   ( p.value_prod
                          * round(p.vat_prod/100,2)
                          + p.value_prod
                          )
                        * r.no_days
                        * round(p.discount_prod/100,2)
                      ,   r.discount
                      )
WHERE  r.id_res = var_id_res;

So in this case, the JOIN is performed, and any assignments that are applied in the SET-clause are somehow pushed through to the underlying table. At first, this might seem a bit awkward, but it is actually quite elegant once you get used to it. It is generally faster too than a solution with subqueries, and you can actually use this to modify multiple tables at once, something that can occasionally be quite useful.
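
For example, here is a hypothetical sketch of such a multi-table UPDATE; the order_line and order_header tables and their columns are made up for the sake of illustration:

UPDATE order_line l
INNER JOIN order_header h
ON     l.order_id = h.order_id
SET    l.status        = 'shipped'
,      h.shipped_lines = h.shipped_lines + 1
WHERE  l.line_id = 123;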

Returning the rowcount

There is only one thing in the original that we did not account for yet. The original procedure returned the number of records that were actually touched by the procedure. For the single statement solution, we can use the built-in ROW_COUNT() function in MySQL. This function returns the number of records that were affected by the last DELETE, INSERT or UPDATE statement.

Finally


We managed to rewrite the original 68-line stored procedure into these mere 23 lines of code:

CREATE PROCEDURE `sp_UpdateLineReserveVatAndDiscount`(
    IN p_id_res INTEGER
)
BEGIN
    UPDATE line_reserve l
    INNER JOIN product p
    ON     l.id_prod = p.id_prod
    SET    l.unit_value = p.value_prod
    ,      l.vat_value  = p.value_prod * round(p.vat_prod/100,2)
    ,      l.discount   = if( l.disc_bool
                          ,   ( p.value_prod
                              * round(p.vat_prod/100,2)
                              + p.value_prod
                              )
                            * l.no_days
                            * round(p.discount_prod/100,2)
                          ,   l.discount
                          )
    WHERE  l.id_res = p_id_res;

    SELECT ROW_COUNT();
END;

This clearly illustrates that the cursor was not necessary in this case. To me, it also illustrates how extraordinarily powerful pure SQL is, because there must be some kind of mechanism that actually does iterate through all the records to perform the join and update actions at a low level. The SQL language allows us to express the actual data manipulation
without requiring that we specify anything about the workflow that is needed to retrieve and wade through the individual records. This is certainly one of the reasons why I like this refactored solution better.

What about speed?


Is there anything we can say about how fast these different procedures run? Sure!

I created a simple benchmarking procedure to repeatedly execute a modified version of each of the two procedures. I modified the procedures by discarding the final SELECT statement that returns the number of affected records. For performing the benchmarks, I found it impractical to have the resultset returned, and because the purpose was primarily to show the difference between the cursor and the single statement, I gathered that it was safe to leave it out.
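
The harness boils down to something like this minimal sketch; the procedure name and parameters are mine, not the actual benchmark code:

CREATE PROCEDURE benchmark_update(
    IN p_id_res     INTEGER
,   IN p_iterations INTEGER
)
BEGIN
    DECLARE i INTEGER DEFAULT 0;
    WHILE i < p_iterations DO
        -- the procedure under test, with its final SELECT removed
        CALL sp_UpdateLineReserveVatAndDiscount(p_id_res);
        SET i = i + 1;
    END WHILE;
END;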

I also generated series of data: a series of reservations with 1, 10, 100, 250, 500, 750 and 1000 corresponding line_reserve records, all of them referencing just one product record. There was only a very small number (5) of product records in the table as a whole at this point.

As far as the rest of the environment is concerned: the benchmark was performed on a Lenovo 3000 N100 1.8 GHz / 1024 MB (Centrino Duo) notebook with Ubuntu Linux and MySQL 5.1.11-beta. Executing the procedures was done using the mysql command line utility. The tables are InnoDB tables, and there were proper indexes on the relevant columns. The MySQL configuration is completely standard, but autocommit was turned off during a single procedure execution.

I then had my benchmarking procedure execute the procedure 10000 times for each of the series. Running the benchmarking procedure was repeated 2 to 5 times. For each run of the benchmarking procedure, the time needed to complete it was recorded, and averaged.

Then, more data was generated for the product table as well so that it contained 1000 records. The line_reserve table was updated, randomizing the product references.

So, to summarize: there are four different sets of measurements:

  • the original procedure using the cursor

    • few product records, and only one used product (cursor1)

    • randomized usage across 1000 product records (cursor1000)



  • the refactored procedure using a single statement

    • few product records, and only one used product (statement1)

    • randomized usage across 1000 product records (statement1000)




For each of these, a series of reservations with 1, 10, 100, 250, 500, 750 and 1000 records was tested by executing the procedure 10000 times. The time required to do that was recorded. Elapsed time was then averaged over 2 to 5 runs of 10000 cycles.

The results are shown in this chart:

An important fact is that the single statement solution was faster than the cursor solution under all tested circumstances. Depending upon the exact situation, the single statement solution was at least 1.44 times as fast as the cursor solution, and at most 8.25 times as fast. The message is very clear if speed is all you care about: avoid cursors as much as you can in favour of single SQL statements.

In nearly all cases, the increase in the number of processed line_reserve records appears to have a linear effect on the increase in execution time. The one exception is the measurement of the cursor method for 500 line_reserve records with randomized references to the available 1000 products. In that particular case, the execution time drops: it is only a little higher than the corresponding figure for 100 line_reserve records, but considerably lower than the corresponding figure for 250 line_reserve records. At present, it is unclear what could be causing this peculiar effect.

The differences between the two methods became bigger when more line_reserve records were involved. This seems to indicate that the overhead of setting up and running a cursor loop is just a little bigger than executing a single statement, whereas the direct processing of the rows inside a single SQL statement is considerably faster than performing row-by-row operations explicitly with a cursor. This is not very surprising when we realize that opening the cursor loop requires the execution of an SQL statement in addition to processing the resultset row by row.

Surprisingly, the addition of extra product records sped up execution time for both the cursor as well as the single-statement solution. This was unexpected. It was expected that the execution time of the single statement solution would remain nearly the same, but that the execution time of the cursor solution would increase, possibly quite a great deal. This expectation was based on the idea that only a little increase in the time needed to find a single product record would have a quite large effect on the cursor loop execution time, because it fetches a single product record in each iteration of the loop. However, that assumption is clearly not supported by these results.

Possibly this is caused by the total number of records in the product table:
one can imagine that in the initial situation (5 product records) a full table scan was performed to find the correct record, whereas 1000 records in the product table would result in using an index for the data access, speeding up the query. However, this would mean that a full table scan, even for 5 records, is an extremely expensive operation - something I cannot believe to be true.

So, why use cursors at all?


The question remains why you would use cursors at all. In what cases do we really need one, and what benefit could a cursor have?

Well, I already pointed out a few reasons in one of my earlier articles. There are some things you just cannot do in a single SQL statement. For example, suppose you want to execute SQL statements dynamically, and the statements are generated by another SQL statement. There is no way to do that without a cursor, so in this case you really need one. (If this example sounds artificial, check out this snippet on MySQLForge.)
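
As a minimal sketch of that pattern, consider a procedure that generates ANALYZE TABLE statements from the information schema and executes them with PREPARE; the procedure name and the choice of ANALYZE TABLE are my own illustration:

CREATE PROCEDURE analyze_all_tables()
BEGIN
    DECLARE v_stmt TEXT;
    DECLARE no_more_rows BOOLEAN DEFAULT FALSE;
    DECLARE cur_tables CURSOR FOR
        SELECT concat('ANALYZE TABLE `', table_schema, '`.`', table_name, '`')
        FROM   information_schema.TABLES
        WHERE  table_schema = schema()
        AND    table_type = 'BASE TABLE';
    DECLARE CONTINUE HANDLER FOR NOT FOUND
        SET no_more_rows = TRUE;

    OPEN cur_tables;
    loop_tables: LOOP
        FETCH cur_tables INTO v_stmt;
        IF no_more_rows THEN
            CLOSE cur_tables;
            LEAVE loop_tables;
        END IF;
        -- PREPARE requires a user variable, not a local variable
        SET @sql := v_stmt;
        PREPARE stmt FROM @sql;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;
    END LOOP loop_tables;
END;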

How about your story


If you have an explanation for the exact figures seen in the benchmark, or if you have any ideas on how to improve the benchmark, please post a comment. Also, if you have a good, convincing example that illustrates why you really do need cursors sometimes, please let me know. Just leave a comment on the blog.
