Thursday, June 18, 2009

MySQL Stored Functions: Impact of DECLARE HANDLER on Performance

Hi again!

Just a moment ago, I read this post by Peter Gulutzan. In this post, Peter explains a little trick that allows you to capture the SQL state in a variable whenever an error occurs in your MySQL stored routine code.

MySQL CONDITIONs and HANDLERs

For the uninitiated: in MySQL stored routines, you can declare HANDLERs which are pieces of code that are executed only in case a particular CONDITION occurs. This device serves the same purpose as a try ... catch block which is supported in many popular programming languages like Java, JavaScript, C++, PHP5 and C#.

Now, one of the long-standing problems with MySQL HANDLERs - the inability to explicitly raise a CONDITION - has recently been solved by implementing the SIGNAL syntax. The most important issue that remains is the inability to discover exactly which error occurred. You need this especially when you are writing a generic 'catch all' HANDLER (for example, on that is triggered in response to SQLEXCEPTION) and you want to write information regarding the error to a log.

Peter's Trick to Capture SQLSTATE

To cut a long story short, Peter's solution is based on writing a HANDLER for all known CONDITIONs in advance. Here's a fragment of his code to explain:
BEGIN
...
DECLARE EXIT HANDLER FOR SQLSTATE '01000' BEGIN SET @e='01000'; RESIGNAL; END;
...
DECLARE EXIT HANDLER FOR SQLSTATE 'XAE09' BEGIN SET @e='XAE09'; RESIGNAL; END;

...remainder of code...

END;
As Peter points out, it's tedious, but it works. (That is to say, it works better than not having anything)

Performance?

Now, one particular paragraph in Peter's post caught my eye:
I added 38 DECLARE EXIT HANDLER statements at the start of my procedure, just after the variable declarations. These lines are always the same for any procedure. They’re not executed unless an error happens so I don’t worry about speed.
I respect Peter a great deal - if he's got something to say you do well to listen and take his word for it. However, this time I was curious to find out if I could measure the effect of the HANDLER code at all.

Method

The code I used is a simplification of Peter's code. I tested it on MySQL 5.1 because I was just interested in the impact of a DECLARE HANDLER statement. In my case, the BEGIN...END block of the handler does not contain the RESIGNAL statement, and my function does not drop a table but simply returns 1. This is important, as none of the HANDLERs is ever triggered by my code.

Last week I wrote how seemingly small changes in MySQL stored routine code can have a surprisingly large impact on performance. In that particular case, I already had a hunch about which things could be improved. In this case, I just didn't know so I created a series of functions with 2, 4, 8, 16, 32 and 38 DECLARE HANDLER statements, and again I used the BENCHMARK() function to measure the time it takes to execute it 100,000 times. I did warm-up runs, and repeated the measurement 5 times for each function variant.

Results

The graph below summarizes my observations:
handler
The squares in the graph are the measurements - each one represents a version of the function. Along the horizontal axis you see the number of DECLARE HANDLER statements in that particular version of the function. The number of seconds it took to execute the function 100,000 times on my laptop using BENCHMARK() is on the vertical axis.
As you can see, there seems to be a linear relationship between the number of DECLARE HANDLER statements and the time it takes to execute the function. This is in itself a surprise. I mean, I would expect a little overhead per DECLARE statement when the function is compiled initially. After that, it is cached at the session level, and beyond that point I would not expect statements that do not execute to have any impact.

So how badly do the HANDLER declarations slow our function down? Well, I measured an average of 0.38 seconds for 2, and an average of 0.55 seconds for 38 DECLARE HANDLER statements respectively. The difference, 0.18 seconds is a little less than 50% of the function variant with 2 DECLARE HANDLER statements, and a little more than 30% of the function having 38 DECLARE HANDLER statements.

Conclusion

To be fair, the function I tested doesn't actually do anything, and if your function or stored procedure does some real processing, the overhead may be neglectable. However, you can clearly see that even just declaring a handler has a measurable negative impact on performance. The essence of Peter's trick is to actually always write a DECLARE HANDLER for each possible condition and to do this for each stored routine. You will certainly suffer a peformance hit for small functions, esp. if they get called a lot.

Wednesday, June 10, 2009

MySQL: Refactoring a Stored Function

Hi All!

I was just reading PlanetMySQL and noticed Mike Hillyer's recent post on a user-friendly age function for MySQL. Basically, this function accepts two DATETIME values and returns an indication of the time between the two dates in the form of a human-readable string. For example:
mysql> select  TimeDiffUnits('2001-05-01', '2002-01-01')
+-------------------------------------------+
| TimeDiffUnits('2001-05-01', '2002-01-01') |
+-------------------------------------------+
| 8 Months |
+-------------------------------------------+
1 row in set (0.00 sec)


Just for fun, I decided to try and refactor it, and I'm writing this to share the results. Now for a little disclaimer. The purpose of this post is not to bash Mike or belittle his code: I consider him a friend, and I respect him and his skills a lot. The point is to show how seemingly small differences in MySQL stored function code can have quite an impact on performance. The good news is that there is a rationale behind all this - I did not refactor based on trial and error. I hope I can shed some light on that in the remainder of this post when discussing the individual improvements.

Summary


I changed three pieces of code in Mike's original function. Each of these changes help to increase performance just a little bit, and because none of the changes alter the overall structure of the original code, I'm inclined to consider them as improvements.I'm including a graph here:
refactor
The Y-axis of the graph represents the average (N=5) number of seconds it took to run
SELECT BENCHMARK(100000, TimeDiffUnits('2001-01-01', '2002-01-01'))
The original function is at the left, and each bar at the right represents one step of refactoring. All in all, at the far right, you can see that the final result is a function with exactly the same behavior which runs about 70% faster than the original :-).

About the Measurements


I should clarify a thing or two to explain the measurments I did.

First of all, BENCHMARK() is usually frowned upon as a general benchmarking tool. However, in this case the measured function is completely computational in nature, and I think that it isn't too bad to get a rough idea of relative performance.

Second, the actual code that is being measured, TimeDiffUnits('2001-01-01', '2002-01-01') is in fact a worst-case* scenario for this particular function, and it's quite likely that testing it with real data does not automatically result in 70% performance increase.

(*: It is quite literally a worst-case scenario because the input values are such that the CASE-statement used within the function has to mull through all branches before finding a matching branch)

Finally I should point out that the vast majority of database performance problems have to do with IO or resource contention, and are not computational in nature. So chances are that you can use none of the information in the remainder of this post to improve your performance problem. (You can always stop reading now of course ;-)

That said, if you use MySQL stored functions, it can't hurt to be aware of their performance issues, and it is not too hard to make a habit of writing the fastest possible code. In many cases, you will find that the fastest solution is also the cleanest, shortest, and most maintainable one.

Now, without further ado - how I refactored that function...

Original Function


First, meet the original function as it appeared on Mike's blog.
CREATE FUNCTION TimeDiffUnits (old DATETIME, new DATETIME)RETURNS CHAR(50) 
DETERMINISTIC NO SQL
BEGIN
DECLARE diff INTEGER;
SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
CASE
WHEN (diff < 3600) THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');
WHEN (diff < 86400) THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');
WHEN (diff < 604800) THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');
WHEN (diff < 2592000) THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');
WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
ELSE RETURN CONCAT(FLOOR(diff / 31536000), ' Years');
END CASE;
END;
This is pretty straight-forward:

  • First, the time between the input datetime values is computed as a number of seconds. This is done by converting the input values to a number of seconds using the UNIX_TIMESTAMP function. Then, the value for the old datetime is subtracted from the new datetime. The result is assigned to the diff variable where it is stored for later use.

  • Second, a searched CASE-statement is used to determine the order of magnitude: Minutes, Hours, and so on up to Years. For example, if the number of seconds is less than 3600 (which is an hour, 60 seconds times 60 minutes) the WHEN branch is entered to calculate a number of minutes

  • The WHEN branches of the CASE-statement calculate how many of the selected units (minutes, hours, etc.) fit into the elapsed time calculated in step 1. Using CONCAT(), this is used to create a nice human-readable string, which is immediately returned from the function.

On my laptop, BENCHMARK(1000000, TimeDiffUnits('2001-01-01', '2002-01-01')) takes between 2.91 and 3.05 seconds.

Step1: Using DEFAULT instead of SET

The first thing I did was getting rid of the assignment to diff:
SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);

Instead, I used the DECLARE statement of diff to assign the elapsed time right away using the DEFAULT clause:
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
For completeness here's the modified function:
CREATE FUNCTION TimeDiffUnits1 (old DATETIME, new DATETIME)RETURNS CHAR(50)
DETERMINISTIC NO SQL
BEGIN
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
CASE
WHEN (diff < 3600) THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');
WHEN (diff < 86400) THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');
WHEN (diff < 604800) THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');
WHEN (diff < 2592000) THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');
WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
ELSE RETURN CONCAT(FLOOR(diff / 31536000), ' Years');
END CASE;
END;
This is not a big deal right? Well it isn't. I mean, this is certainly not a big change and I think the code is still just as clear. Did this gain us some performance? Well, just a bit. For me, SELECT BENCHMARK(100000, TimeDiffUnits1('2001-01-01', '2002-01-01')); takes between 2.89 and 2.98 seconds to run. This is about 2% better than the original. Admittedly, next to nothing, but considering that we casually eliminated only one assignment, I think it is rather good!

Take-away: Don't assign when you don't have to. Each local variable declaration is an implicit assignment - use it if you can.

Step 2: Using DIV instead of float division and FLOOR()

The second change I introduced is a bit larger than the previous one. To compute the number of elapsed units, the original code uses the division operator (/). This uses floating point arithmetic, and to get a nice integer result, the division is wrapped inside the FLOOR() function. In fact, this is a pattern that I have observed earlier in other code (yes, I'm guilty too :(), and I wrote about it in the past.

As it turns out, we don't need the division operator to perform division. At least, not in this case. MySQL provides the DIV operator, which is designed to perform integer division. This is great news for two reasons:

  • No time is wasted converting the numbers to floating point values to perform the calculation

  • Because the result of DIV is also an integer, we don't need FLOOR to convert back to integer again.

So, for example, this:
WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
Can be rewritten to
WHEN (diff < 31536000) THEN RETURN CONCAT(diff DIV 2592000, ' Months');
. This should be faster, and its less code too. Here's the modified function for completeness:
CREATE FUNCTION TimeDiffUnits2 (old DATETIME, new DATETIME)RETURNS CHAR(50)
DETERMINISTIC NO SQL
BEGIN
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
CASE
WHEN (diff < 3600) THEN RETURN CONCAT(diff DIV 60 , ' Minutes');
WHEN (diff < 86400) THEN RETURN CONCAT(diff DIV 3600, ' Hours');
WHEN (diff < 604800) THEN RETURN CONCAT(diff DIV 86400, ' Days');
WHEN (diff < 2592000) THEN RETURN CONCAT(diff DIV 604800, ' Weeks');
WHEN (diff < 31536000) THEN RETURN CONCAT(diff DIV 2592000, ' Months');
ELSE RETURN CONCAT(diff DIV 31536000, ' Years');
END CASE;
END;
After the modification, BENCHMARK(100000, TimeDiffUnits2('2001-01-01', '2002-01-01')) takes between 2.61 and 2.72 seconds to run on my laptop. This is about 11% faster than the original, and about 9% faster than my first improvement.

Take-away: If you are doing division, think a minute about what data types you are using. Do you really need float arithmetic? If you don't, then don't use the division operator (/), simply use DIV instead. It may be less known, but it is more explicit and can give you some extra performance. If you are using FLOOR, ask yourself why you are throwing away fractional digits. There are a bunch of cases where you just need to format something that is intrinsically fractional, but if you can't care less about the fractional numbers, chances are you can chuck the FLOOR away and simply avodi the fractional numbers by using straight integer division.

Step 3: Using the CASE operator instead of the CASE statement

Finally, I changed the CASE-statement and replaced it with a CASE-operator instead.

Now, I have seen on numerous occasions that people are confused about CASE, so here's a quick overview, just to get it completely straight:

The CASE-statement

  • The CASE-statement is not the same as the CASE-operator.
  • The CASE-statement is a program-flow control statement, and is allowed in stored routines (procedures, functions, triggers and events), but not in regular SQL statements.
  • The CASE-statement is used to choose between a number of alternative code paths. Each WHEN branch must contain at least one statement, and may contain multiple statements. Note that statements that appear in the WHEN branch of a CASE-statement are always followed by a semi-colon statement terminator.
  • The CASE-statement is syntactically terminated using the keywords END CASE. Note that END CASE is typically followed by a semi-colon that acts as the statement terminator for the CASE-statement. (The only exception is when the CASE statement appears as top-level statement in a stored routine, in which case the semi-colon is allowed but not required.

The CASE-operator

  • The CASE-operator is not the same as the CASE-statement.
  • The CASE-operator (aka CASE-expression) is a value-expression. It is allowed in almost all places where you can use a value. So, you can use the CASE-operator in regular SQL statements as well as in stored routines.
  • The WHEN-branches of the CASE-operator must contain a single value-expression (which may itself be composite). The WHEN-branches of the CASE-operator can thus not contain statements, and cannot contain multiple expressions - it just wouldn't make sense because the CASE-operator evaluates to a single value, just like any other expression. Because statements are not allowed in the WHEN branches of CASE-operators, there are never terminating semi-colons inside these WHEN branches.
  • The CASE-expression is terminated with the END keyword. Note that this is different fromt the terminator of the CASE-statement which is END CASE
  • Also, note that a CASE-expression does not itself have a semi-colon as statement terminator for the simple reason that it is not a statement. Of course, it is possible for a CASE-expression to appear at the end of a statement. In this case, there will be a semi-colon statement terminator immediately after the END of the CASE-expression, but it is important to realize that that semi-colon terminates the statement that contains the CASE-expression - it does not terminate the CASE-expression itself.

Simple CASE and searched CASE

So, I just discussed two different occurrences of CASE: as statement, and as operator/expression. Now to complicate things even more, each of these can appear in two different forms, namely the simple and the searched CASE, making up a total of 4 different CASE variants. Fortunately, I wrote about the simple and searched CASE earlier in another article.

CASE-statement versus CASE-expression


As you can see in the original code, Mike uses a CASE-statement. Each of the WHEN-branches contains a single RETURN statement that passes the return value to the caller. With a small modification, we can rewrite this to a single RETURN statement that uses a CASE-expression to pick the right value. The result is shown below:

CREATE FUNCTION TimeDiffUnits3(old DATETIME, new DATETIME) RETURNS char(50)
NO SQL DETERMINISTIC
BEGIN
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
RETURN CASE
WHEN (diff < 3600) THEN CONCAT(diff div 60 , ' Minutes')
WHEN (diff < 86400) THEN CONCAT(diff div 3600, ' Hours')
WHEN (diff < 604800) THEN CONCAT(diff div 86400, ' Days')
WHEN (diff < 2592000) THEN CONCAT(diff div 604800, ' Weeks')
WHEN (diff < 31536000) THEN CONCAT(diff div 2592000, ' Months')
ELSE CONCAT(diff div 31536000, ' Years')
END;
END
While testing, I found that BENCHMARK(100000, TimeDiffUnits3('2001-01-01', '2002-01-01')) takes between 1.69 and 1.78 seconds to run. That is a 70% improvement over the original, and a 66% and 53% improvement with regard to the prior versions respectively.

Personally, I am curious why this is such a big improvement, and my guess is that this can be explained by assuming that a CASE statement is chopped up into many byte-code instructions which are each executed individually and sequentially, whereas the CASE-operator is written as a single C-function. At any rate, I think this was rather worth it, and personally I feel the final solution is a bit cleaner than the original.

Take-away: If you need some choice structure, think about what you are choosing. Are you deciding between different code paths, or are you picking between values? If you are picking values, simply write a CASE-expression. It's more explicit, and it is a lot faster than a CASE-statement. Another thing to consider:
do you need multiple statements in the branches, or can you get by with a single statement? If you can get by with a single statement, and it is always a RETURN or a SET-statement, then you can rewrite it to a CASE expresion.

Conclusion


Because the changes I made are really quite small, I think this mostly shows that MySQL stored function compilation is poorly optimized (if at all). I am not a compiler expert but my gut feeling is that most of the optimizations could have been done automatically.

Finally


If you are interested in refactoring MySQL stored functions and procedures, you might also like a few other articles I wrote on the subject:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...