Showing posts with label Stored function. Show all posts
Showing posts with label Stored function. Show all posts

Wednesday, June 10, 2009

MySQL: Refactoring a Stored Function

Hi All!

I was just reading PlanetMySQL and noticed Mike Hillyer's recent post on a user-friendly age function for MySQL. Basically, this function accepts two DATETIME values and returns an indication of the time between the two dates in the form of a human-readable string. For example:
mysql> select  TimeDiffUnits('2001-05-01', '2002-01-01')
+-------------------------------------------+
| TimeDiffUnits('2001-05-01', '2002-01-01') |
+-------------------------------------------+
| 8 Months |
+-------------------------------------------+
1 row in set (0.00 sec)


Just for fun, I decided to try and refactor it, and I'm writing this to share the results. Now for a little disclaimer. The purpose of this post is not to bash Mike or belittle his code: I consider him a friend, and I respect him and his skills a lot. The point is to show how seemingly small differences in MySQL stored function code can have quite an impact on performance. The good news is that there is a rationale behind all this - I did not refactor based on trial and error. I hope I can shed some light on that in the remainder of this post when discussing the individual improvements.

Summary


I changed three pieces of code in Mike's original function. Each of these changes help to increase performance just a little bit, and because none of the changes alter the overall structure of the original code, I'm inclined to consider them as improvements.I'm including a graph here:
refactor
The Y-axis of the graph represents the average (N=5) number of seconds it took to run
SELECT BENCHMARK(100000, TimeDiffUnits('2001-01-01', '2002-01-01'))
The original function is at the left, and each bar at the right represents one step of refactoring. All in all, at the far right, you can see that the final result is a function with exactly the same behavior which runs about 70% faster than the original :-).

About the Measurements


I should clarify a thing or two to explain the measurments I did.

First of all, BENCHMARK() is usually frowned upon as a general benchmarking tool. However, in this case the measured function is completely computational in nature, and I think that it isn't too bad to get a rough idea of relative performance.

Second, the actual code that is being measured, TimeDiffUnits('2001-01-01', '2002-01-01') is in fact a worst-case* scenario for this particular function, and it's quite likely that testing it with real data does not automatically result in 70% performance increase.

(*: It is quite literally a worst-case scenario because the input values are such that the CASE-statement used within the function has to mull through all branches before finding a matching branch)

Finally I should point out that the vast majority of database performance problems have to do with IO or resource contention, and are not computational in nature. So chances are that you can use none of the information in the remainder of this post to improve your performance problem. (You can always stop reading now of course ;-)

That said, if you use MySQL stored functions, it can't hurt to be aware of their performance issues, and it is not too hard to make a habit of writing the fastest possible code. In many cases, you will find that the fastest solution is also the cleanest, shortest, and most maintainable one.

Now, without further ado - how I refactored that function...

Original Function


First, meet the original function as it appeared on Mike's blog.
CREATE FUNCTION TimeDiffUnits (old DATETIME, new DATETIME)RETURNS CHAR(50) 
DETERMINISTIC NO SQL
BEGIN
DECLARE diff INTEGER;
SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
CASE
WHEN (diff < 3600) THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');
WHEN (diff < 86400) THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');
WHEN (diff < 604800) THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');
WHEN (diff < 2592000) THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');
WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
ELSE RETURN CONCAT(FLOOR(diff / 31536000), ' Years');
END CASE;
END;
This is pretty straight-forward:

  • First, the time between the input datetime values is computed as a number of seconds. This is done by converting the input values to a number of seconds using the UNIX_TIMESTAMP function. Then, the value for the old datetime is subtracted from the new datetime. The result is assigned to the diff variable where it is stored for later use.

  • Second, a searched CASE-statement is used to determine the order of magnitude: Minutes, Hours, and so on up to Years. For example, if the number of seconds is less than 3600 (which is an hour, 60 seconds times 60 minutes) the WHEN branch is entered to calculate a number of minutes

  • The WHEN branches of the CASE-statement calculate how many of the selected units (minutes, hours, etc.) fit into the elapsed time calculated in step 1. Using CONCAT(), this is used to create a nice human-readable string, which is immediately returned from the function.

On my laptop, BENCHMARK(1000000, TimeDiffUnits('2001-01-01', '2002-01-01')) takes between 2.91 and 3.05 seconds.

Step1: Using DEFAULT instead of SET

The first thing I did was getting rid of the assignment to diff:
SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);

Instead, I used the DECLARE statement of diff to assign the elapsed time right away using the DEFAULT clause:
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
For completeness here's the modified function:
CREATE FUNCTION TimeDiffUnits1 (old DATETIME, new DATETIME)RETURNS CHAR(50)
DETERMINISTIC NO SQL
BEGIN
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
CASE
WHEN (diff < 3600) THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');
WHEN (diff < 86400) THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');
WHEN (diff < 604800) THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');
WHEN (diff < 2592000) THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');
WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
ELSE RETURN CONCAT(FLOOR(diff / 31536000), ' Years');
END CASE;
END;
This is not a big deal right? Well it isn't. I mean, this is certainly not a big change and I think the code is still just as clear. Did this gain us some performance? Well, just a bit. For me, SELECT BENCHMARK(100000, TimeDiffUnits1('2001-01-01', '2002-01-01')); takes between 2.89 and 2.98 seconds to run. This is about 2% better than the original. Admittedly, next to nothing, but considering that we casually eliminated only one assignment, I think it is rather good!

Take-away: Don't assign when you don't have to. Each local variable declaration is an implicit assignment - use it if you can.

Step 2: Using DIV instead of float division and FLOOR()

The second change I introduced is a bit larger than the previous one. To compute the number of elapsed units, the original code uses the division operator (/). This uses floating point arithmetic, and to get a nice integer result, the division is wrapped inside the FLOOR() function. In fact, this is a pattern that I have observed earlier in other code (yes, I'm guilty too :(), and I wrote about it in the past.

As it turns out, we don't need the division operator to perform division. At least, not in this case. MySQL provides the DIV operator, which is designed to perform integer division. This is great news for two reasons:

  • No time is wasted converting the numbers to floating point values to perform the calculation

  • Because the result of DIV is also an integer, we don't need FLOOR to convert back to integer again.

So, for example, this:
WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
Can be rewritten to
WHEN (diff < 31536000) THEN RETURN CONCAT(diff DIV 2592000, ' Months');
. This should be faster, and its less code too. Here's the modified function for completeness:
CREATE FUNCTION TimeDiffUnits2 (old DATETIME, new DATETIME)RETURNS CHAR(50)
DETERMINISTIC NO SQL
BEGIN
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
CASE
WHEN (diff < 3600) THEN RETURN CONCAT(diff DIV 60 , ' Minutes');
WHEN (diff < 86400) THEN RETURN CONCAT(diff DIV 3600, ' Hours');
WHEN (diff < 604800) THEN RETURN CONCAT(diff DIV 86400, ' Days');
WHEN (diff < 2592000) THEN RETURN CONCAT(diff DIV 604800, ' Weeks');
WHEN (diff < 31536000) THEN RETURN CONCAT(diff DIV 2592000, ' Months');
ELSE RETURN CONCAT(diff DIV 31536000, ' Years');
END CASE;
END;
After the modification, BENCHMARK(100000, TimeDiffUnits2('2001-01-01', '2002-01-01')) takes between 2.61 and 2.72 seconds to run on my laptop. This is about 11% faster than the original, and about 9% faster than my first improvement.

Take-away: If you are doing division, think a minute about what data types you are using. Do you really need float arithmetic? If you don't, then don't use the division operator (/), simply use DIV instead. It may be less known, but it is more explicit and can give you some extra performance. If you are using FLOOR, ask yourself why you are throwing away fractional digits. There are a bunch of cases where you just need to format something that is intrinsically fractional, but if you can't care less about the fractional numbers, chances are you can chuck the FLOOR away and simply avodi the fractional numbers by using straight integer division.

Step 3: Using the CASE operator instead of the CASE statement

Finally, I changed the CASE-statement and replaced it with a CASE-operator instead.

Now, I have seen on numerous occasions that people are confused about CASE, so here's a quick overview, just to get it completely straight:

The CASE-statement

  • The CASE-statement is not the same as the CASE-operator.
  • The CASE-statement is a program-flow control statement, and is allowed in stored routines (procedures, functions, triggers and events), but not in regular SQL statements.
  • The CASE-statement is used to choose between a number of alternative code paths. Each WHEN branch must contain at least one statement, and may contain multiple statements. Note that statements that appear in the WHEN branch of a CASE-statement are always followed by a semi-colon statement terminator.
  • The CASE-statement is syntactically terminated using the keywords END CASE. Note that END CASE is typically followed by a semi-colon that acts as the statement terminator for the CASE-statement. (The only exception is when the CASE statement appears as top-level statement in a stored routine, in which case the semi-colon is allowed but not required.

The CASE-operator

  • The CASE-operator is not the same as the CASE-statement.
  • The CASE-operator (aka CASE-expression) is a value-expression. It is allowed in almost all places where you can use a value. So, you can use the CASE-operator in regular SQL statements as well as in stored routines.
  • The WHEN-branches of the CASE-operator must contain a single value-expression (which may itself be composite). The WHEN-branches of the CASE-operator can thus not contain statements, and cannot contain multiple expressions - it just wouldn't make sense because the CASE-operator evaluates to a single value, just like any other expression. Because statements are not allowed in the WHEN branches of CASE-operators, there are never terminating semi-colons inside these WHEN branches.
  • The CASE-expression is terminated with the END keyword. Note that this is different fromt the terminator of the CASE-statement which is END CASE
  • Also, note that a CASE-expression does not itself have a semi-colon as statement terminator for the simple reason that it is not a statement. Of course, it is possible for a CASE-expression to appear at the end of a statement. In this case, there will be a semi-colon statement terminator immediately after the END of the CASE-expression, but it is important to realize that that semi-colon terminates the statement that contains the CASE-expression - it does not terminate the CASE-expression itself.

Simple CASE and searched CASE

So, I just discussed two different occurrences of CASE: as statement, and as operator/expression. Now to complicate things even more, each of these can appear in two different forms, namely the simple and the searched CASE, making up a total of 4 different CASE variants. Fortunately, I wrote about the simple and searched CASE earlier in another article.

CASE-statement versus CASE-expression


As you can see in the original code, Mike uses a CASE-statement. Each of the WHEN-branches contains a single RETURN statement that passes the return value to the caller. With a small modification, we can rewrite this to a single RETURN statement that uses a CASE-expression to pick the right value. The result is shown below:

CREATE FUNCTION TimeDiffUnits3(old DATETIME, new DATETIME) RETURNS char(50)
NO SQL DETERMINISTIC
BEGIN
DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
RETURN CASE
WHEN (diff < 3600) THEN CONCAT(diff div 60 , ' Minutes')
WHEN (diff < 86400) THEN CONCAT(diff div 3600, ' Hours')
WHEN (diff < 604800) THEN CONCAT(diff div 86400, ' Days')
WHEN (diff < 2592000) THEN CONCAT(diff div 604800, ' Weeks')
WHEN (diff < 31536000) THEN CONCAT(diff div 2592000, ' Months')
ELSE CONCAT(diff div 31536000, ' Years')
END;
END
While testing, I found that BENCHMARK(100000, TimeDiffUnits3('2001-01-01', '2002-01-01')) takes between 1.69 and 1.78 seconds to run. That is a 70% improvement over the original, and a 66% and 53% improvement with regard to the prior versions respectively.

Personally, I am curious why this is such a big improvement, and my guess is that this can be explained by assuming that a CASE statement is chopped up into many byte-code instructions which are each executed individually and sequentially, whereas the CASE-operator is written as a single C-function. At any rate, I think this was rather worth it, and personally I feel the final solution is a bit cleaner than the original.

Take-away: If you need some choice structure, think about what you are choosing. Are you deciding between different code paths, or are you picking between values? If you are picking values, simply write a CASE-expression. It's more explicit, and it is a lot faster than a CASE-statement. Another thing to consider:
do you need multiple statements in the branches, or can you get by with a single statement? If you can get by with a single statement, and it is always a RETURN or a SET-statement, then you can rewrite it to a CASE expresion.

Conclusion


Because the changes I made are really quite small, I think this mostly shows that MySQL stored function compilation is poorly optimized (if at all). I am not a compiler expert but my gut feeling is that most of the optimizations could have been done automatically.

Finally


If you are interested in refactoring MySQL stored functions and procedures, you might also like a few other articles I wrote on the subject:

Wednesday, March 19, 2008

UDFs at the MySQL User's conference

The MySQL User's conference will be held in less than a month from now!!!

This year there is quite a good number of sessions on adding your own functions and procedures, such as:

I will be doing the 3 hour tutorial on writing user-defined functions, and I am currently adding the last few final touches to my slides.

My tutorial will be very much a hands-on experience. The ambition is to allow people with some programming skills in either C/C++, PHP or Java to leave the room with a bunch of UDFs they created themselves during one of labs. With that and the supporting materials (slides and a handout with detailed instructions) the attendees will be able to write UDFs themselves, and have the knowledge that allows them to make a sensible decision when they have to choose between stored SQL functions, UDFs or raw expressions built of built-in functions.

Allow me to tell you a bit about UDFs - you might decide you want to take my tutorial ;)

User-defined functions


User-defined functions or UDFs are often confused with Stored SQL functions but unrightly so. Other than the fact that stored functions and user-defined functions can be created by the user (as opposed to hard-wired, built-in functions), they have little in common. If you want to know exactly what the difference is and what the strengths and weaknesses are of either feature, my tutorial is for you.

Features


For now I don't want unveil too much but I think that it might be good to point out a number of key advantages of using UDF's over stored SQL functions:

  • Flexible argumentlists - UDFs allow a flexible number of dynamically typed arguments, and it is possible to identify arguments by name (rather than position).

  • The ability to leverage the functionality from external libraries - UDFs can be linked to existing libraries in order to expose their functionality to SQL statements

  • Aggregate functions - UDFs can compute a value for a collection of rows just like the built-in functions COUNT and GROUP_CONCAT


Neither of these features is available to stored SQL functions, and a generic workaround is simply impossible or far from trivial. So - UDFs might be for you if you need one of these things.

Performance


Another reason to use UDFs might be performance. UDFs are much, much better than stored SQL functions when it comes to computation. To prove this fact, I'd like to share a number of benchmarks I did.

For the benchmarks I use a stored procedure like this:

create procedure sp_<SOME-NAME>_benchmark(p_num int unsigned)
begin
declare v_num int unsigned default 0;
declare v_return <SOME-DATA-TYPE>;
declare v_begin int unsigned default unix_timestamp();
while v_num < p_num do
set v_return := <SOME-EXPRESSION>;
set v_num := v_num + 1;
end while;
call sp_store_function_benchmark(
'<SOME-NAME>'
, p_num
, unix_timestamp() - v_begin
);
end;

As you can see, the procedure repeats the evaluation of <SOME-EXPRESSION>, depending on the specified number of iterations passed to the parameter p_num. Depending on the benchmark, an expression is plugged in place of <SOME-EXPRESSION>. The variable v_return is used to capture the evaluation result, and a suitable data type is used in place of <SOME-DATA-TYPE>. Before running the loop, the time is recorded. After the loop, the elapsed time, the number of repetitions and the benchmark name are stored in a table for later analysis. This allows me to test a series of increasing repetitions of the same expression.

I also use a sp_noop_benchmark() procedure that is identical to the one described above, except that it omits the assignment of the expresion. So, it lacks this line:

set v_return := <SOME-EXPRESSION>;

This allows us to isolate the time spent on evaluating the expression and doing the assignment. I also use another control measurement that uses an assignment like this:

set v_return := <SOME-LITERAL-VALUE>;

This allows us to estimate the time spent on the value assignment, which can be used to get a bit closer to the actual time spent only on expression evaluation alone. (Of course, this assumes that the time spent on the assignment alone is comparable for expressions, function calls and literals).

All these measurements were made by comparing single, complete calls many times. All tests were done on Ubuntu Gutsy Gibbon using MySQL 5.1.23. Compilation of UDFs was performed using gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2). No optimizations were employed.

Stored functions and UDFs


To compare UDFs and Stored functions, I first used simple "Hello World" expressions that all evaluate to a VARCHAR(14). This was measured:

  • String literal sp_string_benchmark. The expression was:
    v_return := 'Hello, String!';

  • Stored SQL function sp_ssf_benchmark. The expresion was:
    v_return := ssf_hello_world();
    The ssf_hello_world() itself was:

    create function ssf_hello_world()
    returns varchar(14)
    return 'Hello, SSFs!!!';

  • User defined function sp_udf_benchmark. The expression was:
    v_return := udf_hello_world();
    The UDF itself was:

    my_bool udf_hello_world_init(
    UDF_INIT *initid, UDF_ARGS *args,
    char *message
    ){
    return 0;
    }
    char *udf_hello_world(
    UDF_INIT *initid, UDF_ARGS *args
    , char* result, unsigned long* length
    , char *is_null, char *error
    ){
    *length = 14;
    return "Hello, UDFs!!!";
    }

Here are the raw results:

+----------+------+--------+-----+-----+
| repeated | noop | string | UDF | SSF |
+----------+------+--------+-----+-----+
| 0 | 0 | 0 | 0 | 0 |
| 500000 | 3 | 4 | 6 | 8 |
| 1000000 | 7 | 9 | 10 | 16 |
| 1500000 | 10 | 13 | 16 | 25 |
| 2000000 | 13 | 17 | 21 | 33 |
| 2500000 | 16 | 22 | 25 | 42 |
| 3000000 | 19 | 27 | 31 | 50 |
| 3500000 | 23 | 31 | 36 | 58 |
| 4000000 | 26 | 36 | 41 | 67 |
| 4500000 | 30 | 40 | 45 | 75 |
| 5000000 | 33 | 44 | 51 | 83 |
| 5500000 | 36 | 49 | 55 | 90 |
| 6000000 | 39 | 53 | 61 | 99 |
| 6500000 | 42 | 57 | 66 | 107 |
| 7000000 | 45 | 62 | 71 | 115 |
| 7500000 | 49 | 66 | 76 | 123 |
| 8000000 | 52 | 71 | 81 | 131 |
| 8500000 | 56 | 75 | 86 | 140 |
| 9000000 | 58 | 80 | 90 | 147 |
| 9500000 | 62 | 83 | 96 | 156 |
| 10000000 | 64 | 88 | 101 | 164 |
+----------+------+--------+-----+-----+

And here is a graph:
udf-ssf-benchmark
At a glance we see that doing nothing is the fastest, followed by the assignment of a literal string. After that, comes the UDF and the stored SQL function is by far the slowest. Another thing that is immediately apparent is that the execution time increases proportionally as the number of repetitions is increased. This means that we can essentially forget the entire series and focus on the measurements with the highest number of repetitions (which is presumably more accurate than the earlier measurements).

But exactly how much faster is the UDF as compared to the SQL function? Well, it depends on how you look at it. If we just divide the raw execution time of the stored function by that of the UDF, we see that the ratio is:

164/101 = 1.60

This would tempt us into thinking that UDF's are about 60% faster.

However, when we calculate it like this, we are also counting the overhead of the benchmark procedures themselves. We get a better result if we focus on only the expression assignments. We can express the performance in terms of the time required to execute the no-operation benchmark. So:

noop / udf = 64 / 101 = .63
and

noop / ssf = 64 / 164 = .39

This correction indicates that the UDF performs at 63% of the no-operation benchmark, whereas the stored SQL function performs at 39% of the no-operation benchmark. We could say that relative to the no-operation benchmark, UDFs are 24% faster as compared to SQL functions

Now we could stretch it a bit more and try to isolate only the time spent on evaluating the functions. To do that, we'd have to assume that "assignment" costs the same, no matter if we are assigning from a stored SQL function, user-defined function or string literal. If we do that, we get this figure:

literal / udf = 88 / 101 = .87
and

literal / ssf = 88 / 164 = .54

This correction indicates that the UDF performs at 87% of the string-literal benchmark, whereas the stored SQL function performs at 54% of the string-literal benchmark. This would indicate that relative to the string literal assignment, UDFs are 33% faster than SQL functions.

Now, I can imagine that a lot of people will have some reservations against this method of correcting the measurements. Personally I think the method is valid, although the result itself is not that useful. I mean, in real life, the fact is that we will be assigning the value of the function, and from that perspective it doesn't help us much to know how fast it would have been if we didn't perform an assignment. Another reservation that we may have is that this benchmark still doesn't tell us much more than the relative differences in overhead. Basically, both the UDF and the stored function are empty - so this benchmark can tell us nothing about performance in a more realistic case where the function is actually doing some work.

Addition


In an attempt to measure at least some basic processing I decided to benchmark addition too. I took a INTEGER as data type and did the following benchmarks:

  • Integer literal: sp_num_benchmark with the expression:
    v_return := 3;

  • Addition operator: sp_opr_benchmark with the expression:
    v_return := 1+2;

  • UDF: sp_udf_benchmark with the expression:
    v_return := UDF_ADD(1,2);
    The code for the UDF is
    my_bool udf_add_init(
    UDF_INIT *initid
    , UDF_ARGS *args
    , char *message
    ){
    if(args->arg_count!=2){
    strcpy(message, "Require two arguments");
    return 1;
    } else {
    args->arg_type[0]= INT_RESULT;
    args->arg_type[1]= INT_RESULT;
    }
    initid->maybe_null= 1;
    return 0;
    }

    long long udf_add(
    UDF_INIT *initid
    , UDF_ARGS *args
    , char *is_null
    , char *error
    ){
    if((args->args[0]==NULL)||(args->args[1]==NULL)){
    *is_null= 1;
    return 0;
    } else {
    return (*((long long*)args->args[0])) + (*((long long*)args->args[1]));
    }
    }

  • Stored SQL function: sp_ssf_benchmark with the expression:
    v_return := SSF_ADD(1,2);
    The code for the function is

    create function ssf_add(l int, r int)
    returns int
    return l+r;
Here is the raw data:

+--------------+------+-----+-----+-----+-----+
| repeat_count | noop | num | opr | udf | ssf |
+--------------+----- +-----+-----+-----+-----+
| 0 | 0 | 0 | 0 | 0 | 0 |
| 500000 | 3 | 5 | 5 | 6 | 9 |
| 1000000 | 7 | 9 | 9 | 11 | 18 |
| 1500000 | 10 | 14 | 13 | 16 | 27 |
| 2000000 | 13 | 19 | 18 | 21 | 35 |
| 2500000 | 17 | 23 | 22 | 26 | 47 |
| 3000000 | 20 | 27 | 27 | 33 | 56 |
| 3500000 | 23 | 33 | 33 | 38 | 62 |
| 4000000 | 27 | 38 | 36 | 43 | 74 |
| 4500000 | 30 | 42 | 42 | 50 | 83 |
| 5000000 | 34 | 46 | 47 | 54 | 91 |
| 5500000 | 36 | 51 | 60 | 68 | 108 |
| 6000000 | 42 | 56 | 59 | 67 | 115 |
| 6500000 | 45 | 60 | 62 | 71 | 123 |
| 7000000 | 47 | 65 | 69 | 78 | 132 |
| 7500000 | 50 | 67 | 71 | 84 | 143 |
| 8000000 | 54 | 72 | 78 | 89 | 149 |
| 8000000 | 54 | 73 | 78 | 89 | 149 |
| 8500000 | 59 | 78 | 79 | 92 | 157 |
| 9000000 | 60 | 85 | 85 | 99 | 168 |
| 9500000 | 65 | 89 | 92 | 104 | 178 |
| 10000000 | 67 | 91 | 95 | 111 | 189 |
+--------------+------+-----+-----+-----+-----+
And here is the graph of the results:
addition-benchmark
We can immediately see that the result is quite similar to what we saw in the previous measurements. Let's look at the performance relative to assigning a literal integer:

int / opr = 91 / 95 = 0.96
and

int / udf = 91 / 111 = 0.82
and

int / ssf = 91 / 189 = 0.48
So, A direct addition operator is less than 5% slower than a literal assignment. The UDF 18% slower than the literal assignment and 14% slower than the addition operator. The stored SQL function is quite a good deal slower: 48% slower than the direct addition operator and 34% slower than the UDF. If we want to compare only UDFs vs stored SQL functions, we should compare relative to the addition operator, which gives a minutely different result:

opr / udf = 95 / 111 = 0.83
and

opr / ssf = 95 / 189 = 0.5

Built-in functions and UDFs


While I was busy doing benchmarks, I also figured that it'd be nice to figure out how UDFs and built-in functions compare performance-wise. To make things slightly more realistic than simply returning a value, I chose to implement my own version of CONCAT().

Because CONCAT can take on a number of arguments I decided to measure the effect of a varying number of arguments too. I ended up choosing VARCHAR(10) for the data type, and tested these different benchmarks:

  • Built-in function sp_bif<1..10>_benchmark - where <1..10> is a number from 1 to 10, and where the expression was:
    v_return := CONCAT(0[,1[,...,9]);
    That is, 10 different CONCAT measurements, from CONCAT(0) to CONCAT(0,1,2,3,4,5,6,7,8,9)

  • UDF sp_udf<1..10>_benchmark - where <1..10> is a number from 1 to 10, and where the expression was:
    v_return := UDF_CONCAT(0[,1[,...,9]);
    That is, 10 different measurements of my UDF implementation of concat, from UDF_CONCAT(0) to UDF_CONCAT(0,1,2,3,4,5,6,7,8,9).The code for my UDF_CONCAT is:

    my_bool udf_concat_init(
    UDF_INIT *initid, UDF_ARGS *args
    , char *message
    ){
    int i;
    size_t bytes = 0;
    for(i=0;iarg_count; i++){
    args->arg_type[i] = STRING_RESULT;
    bytes += args->lengths[i];
    }
    if(!(initid->ptr= malloc(bytes))){
    strcpy(message, "Error allocating memory.");
    return 1;
    } else {
    return 0;
    }
    }

    char *udf_concat(
    UDF_INIT *initid, UDF_ARGS *args
    , char* result, unsigned long* length
    , char *is_null, char *error
    ){
    int i;
    char *buff = initid->ptr;
    *length = 0;
    for(i=0;iarg_count; i++){
    if(args->args[i]==NULL){
    *is_null = 1;
    break;
    } else {
    memcpy(buff + *length, args->args[i], args->lengths[i]);
    *length += args->lengths[i];
    }
    }
    return buff;
    }

    void udf_concat_deinit(
    UDF_INIT *initid
    ){
    if(initid->ptr){
    free(initid->ptr);
    }
    }

I also measured the same string literal assignment and the noop benchmarks as control measurements. I won't give the raw results here - send me an email or post a comment if you'd like to have them. I do have a graph of the result here:
udf-builtin-benchmarkNote that for this graph I already subtracted the no-operation benchmark. This was done to make it easier to examine the data sets of the UDFs and Built-in functions. The graph does include the literal string benchmark, which is the orange line nearest to the X axis.

Now, what do we see here apart from the control benchmark? Basically, we see two bundles of series. The bundle nearest to the X axis corresponds with the built-in CONCAT() benchmarks. The bundle above that corresponds to the UDF benchmarks. So, this tells us that on every occasion, the built-in CONCAT() was faster than the UDF.

We can also see that there is a relatively small but measurable effect for passing more arguments. The calls with fewer arguments are consistently faster than the calls with more arguments within both the group of built-in and UDF calls.

Frankly, I didn't expect to see this. I had imagined that the effect of passing more arguments would be larger, so I expected to see pairs of UDF/Built-in function calls having the same number of parameters. Clearly, I was wrong.

So how much faster are UDFs? Well, if we look at the raw ratios of execution times for the largest number of repetitions, we see:

UDF_CONCAT(0) / CONCAT(0) = 91 / 107 = .85

Interestingly, if we compare that for the calls with 10 arguments we see the same ratio:

UDF_CONCAT(0,1,2,3,4,5,6,7,8,9) / CONCAT(0,1,2,3,4,5,6,7,8,9) = 91 / 107 = .85

So, UDF_CONCAT() would seem to be 15% slower than the built-in CONCAT(). Interestingly, there seems to be no difference between the call with the larger number of arguments, indicating that the performance impact of passing another argument is about the same for UDFs as it is for built-in functions.

What happens if we express the execution time relative to the no-operation benchmark? Well, we get:

noop / UDF_CONCAT(0) = 64 / 107 = .60
noop / CONCAT(0) = 64 / 91 = .70

and

noop / UDF_CONCAT(0,1,2,3,4,5,6,7,8,9) = 64 / 122 = .52
noop / CONCAT(0,1,2,3,4,5,6,7,8,9) = 64 / 104 = .62

Here we see that relative to the no-operation benchmark, the UDF is 10% slower than the built-in function. We could say that 10% is quite a lot, but it is more than half as large as the difference between UDFs and stored SQL functions.

We see something interesting if we look at the execution time relative to the execution time spent on the string literal assignment:

string / UDF_CONCAT(0) = 88 / 107 = .82
string / CONCAT(0) = 88 / 91 = .87

and

string / UDF_CONCAT(0,1,2,3,4,5,6,7,8,9) = 88 / 122 = .72
string / CONCAT(0,1,2,3,4,5,6,7,8,9) = 88 / 104 = .85

Here, the difference between the single argument calls is only 5%, wereas the difference becomes more 12% for the calls with 10 arguments. This is interesting because in the comparison with the no-operation benchmark we did not detect a difference caused by the the amount of arguments.

My current suspicion is that we are actually witnessing the effect of the length of the left hand expression in the assignment operation. Our string literal is a VARCHAR(14), and it should probably be a CHAR(1) for the comparison to the single argument calls and a CHAR(10) for the comparison with the 10 argument calls.

Conclusion


The UDFs I tested are

  • somewhere around and between 25% to 33% faster than stored SQL functions

  • somewhere around and between 5% to 20% slower than built-in functions


Another interesting finding is that argument passing seems to have about the same impact on UDFs as it has on built-in functions. A single argument seems to cost about 1% of the function performance in both cases. For stored SQL functions no such data is currently available.

Discussion


Currently, all functions call are single complete function calls. That means that for UDFs, the initialization function and the row-level function are both called exactly once per UDF instance. When using UDFs in multi-row SQL statements, the initialization function will be called only once per occurence of the UDF, and the row-level function will be called for each row. For functions where the initialization function is relatively expensive, this means that the performance observed in these benchmarks is probably poorer that it can be. A new set of benchmarks should be done to measure the performance of functions in multi-row SQL statements.

Finally


So, are you interested? If you are, go and register for the conference. You can contact me and get a 20% discount! Of course I'd love to see you attend my tutorial too!

See you at the conference!

DataZen winter meetup 2025

The DataZen winter meetup 2025 is nigh! Join us 18 - 20 February 2025 for 3 days of expert-led sessions on AI, LLM, ChatGPT, Big Data, M...