Tuesday, December 15, 2009

Validating MySQL data entry with triggers: A quick look at the SIGNAL syntax

The latest MySQL 5.5 milestone release offers support for an ANSI/ISO standard feature called the SIGNAL syntax. You can use this syntax inside stored routines (including triggers) to raise an error condition which can be used to invoke specific error handling, or otherwise abort the stored routine. In addition, you can use the SIGNAL syntax to convey information about what went wrong, which may be used by the caller to handle the error.

I have written about MySQL data entry validation procedures in the past. At the time, MySQL did not support any proper means to raise an error condition inside a stored routine, and one had to work around that by deliberatly causing a runtime error, for example by referring to a non-existent table, setting a non-nullable column to null, or ]Yting a specially crafted UDF. In this artcle, I'm taking a closer look at how to implement more robust data entry validation in MySQL using the SIGNAL syntax.

Triggers

For those of you that are unfamiliar with the subject, MySQL offers support for triggers as of version 5.0. Triggers are stored routines that are executed automatically right before or after data change events like a row being inserted, updated or deleted. Because triggers are executed as part of the SQL statement (and its containing transaction) causing the row change event, and because the trigger code has direct access to the changed row, you could in theory use them to correct or reject invalid data.

Example data validation problem

Let's take a quick look at the following example. Suppose you have a table called person to store data about persons:


CREATE TABLE person (
    id              INT         NOT NULL AUTO_INCREMENT PRIMARY KEY
,   first_name      VARCHAR(64) NOT NULL    
,   last_name       VARCHAR(64) NOT NULL
,   initials        VARCHAR(8)
)

Let's consider a simple validation method for the initials column:

We require the first letter of the value for the initials column to match the first letter of the value of the first_name column.

In addition, we require that the values for the initials column consists of uppercase letters separated by periods.

In order to implement this, we design the following algorithm:

If the value for first_name is NULL, we do nothing. The NOT NULL table constraint will prevent the data from being entered anyway, so further attempts at validation or correction are pointless.

If the value for initials is NULL, we correct it by automatically filling in the first character of the value for first_name.

If the values for first_name as well as initials are both not NULL, we require that the first character of the value for first_name equals the first character of the value for initials.

Finally, we use a regular expression to check if the value for initials matches the desired pattern of uppercase letters separated by periods.

A data validation stored procedure

Let's start by creating a stored procedure to perform this algorithm. Here's the code for the p_validate_initials procedure which validates and possibly corrects the initials value based on the value for first_name:


DELIMITER go

CREATE PROCEDURE p_validate_initials(
    IN      p_first_name    VARCHAR(64)
,   INOUT   p_initials      VARCHAR(64) 
)
DETERMINISTIC                                                                   -- same arguments yield same result, always
NO SQL                                                                          -- does not execute SQL statements, only procedural logic
_main: BEGIN

    DECLARE WARN_CORRECTED_INITIALS CONDITION FOR SQLSTATE '01000';
    DECLARE ERR_INITIALS_DONT_MATCH_FIRSTNAME CONDITION FOR SQLSTATE '45000';
    DECLARE ERR_INITIALS_ILLFORMATTED CONDITION FOR SQLSTATE '45000';

    IF      p_first_name IS NULL THEN
        LEAVE _main;                                                            -- nothing to validate
    ELSEIF   p_initials IS NULL THEN                                            -- initials are NULL, correct:
        SET p_initials := CONCAT(LEFT(p_first_name, 1), '.');                   -- take the first letter of first_name
        SIGNAL WARN_CORRECTED_INITIALS                                          -- warn about the corrective measure
        SET MESSAGE_TEXT = 'Corrected NULL value for initials to match value for first_name.';
    ELSEIF   BINARY LEFT(p_first_name, 1) != LEFT(p_initials, 1) THEN           -- initials don't match first_name
        SIGNAL ERR_INITIALS_DONT_MATCH_FIRSTNAME                                -- raise an error
        SET MESSAGE_TEXT = 'The first letter of the value for initials does not match the first letter of the value for first_name';
    END IF;
    IF NOT p_initials REGEXP '^([A-Z][.])+$' THEN                               -- if initials don't match the correct pattern
        SIGNAL ERR_INITIALS_ILLFORMATTED                                        -- raise an error
        SET MESSAGE_TEXT = 'The value for initials must consist of upper case letters separated by periods.';
    END IF;
END;
go

DELIMITER ;

Let's take a look at how this procedure works.

How to issue warnings

First, let's pass NULL for the initials to see if they are properly corrected:


mysql> set @initials := null;
Query OK, 0 rows affected (0.00 sec)

mysql> call p_validate_initials('Roland', @initials);
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> select @initials;
+-----------+
| @initials |
+-----------+
| R.        |
+-----------+
1 row in set (0.00 sec)

Note that executing the procedure placed the correct value into the @initials user-defined variable. Also note that a warning was issued.

In this case, the procedure ran through the following branch of the first IF statement:


    ...
    ELSEIF   p_initials IS NULL THEN
        SET p_initials := CONCAT(LEFT(p_first_name, 1), '.');
        SIGNAL WARN_CORRECTED_INITIALS
        SET MESSAGE_TEXT = 'Corrected NULL value for initials to match value for first_name.';
    ELSEIF ...

The warning is caused by the SIGNAL statement (which appears in bold text in the snippet above). The general syntax for the SIGNAL statement is:


SIGNAL condition_value
    [SET signal_information [, signal_information] ...]
    
condition_value:
    SQLSTATE [VALUE] sqlstate_value
  | condition_name

signal_information:
    condition_information_item = simple_value_specification

So, in this case, we used the condition name WARN_CORRECTED_INITIALS as condition_value. This condition is declared in the top of the procedure:


    DECLARE WARN_CORRECTED_INITIALS CONDITION FOR SQLSTATE '01000';

Basically, condition declarations like these serve simply to tie a human readable name to otherwise obscure SQLSTATE values. As per the SQL standard, SQLSTATE values are strings of 5 digits. The prefix 01 indicates a warning.

(Condition declarations like these are not only useful to clarify the meaning of your SIGNAL statements, you can also use them to declare error HANDLERs)

An alternative syntax for SIGNAL allows you to directly refer to the SQLSTATE without explicitly declaring a CONDITION. So, if you feel that declaring explicit conditions is too much trouble, you can also omit that and write:


        SIGNAL SQLSTATE '01000' 
            ...

(However, I like using explicit condition names better because it does a better job of explaining the intention of the code.)

Conveying `SIGNAL` context information

The SIGNAL statement also features a SET-clause which is used to convey signal information. In our example the set clause was:


        SET MESSAGE_TEXT = 'Corrected NULL value for initials to match value for first_name.';

As you can see, the SET-clause in the example contains an assignment. In the context of the SIGNAL statemntt, such an assignment is referred to as signal_information. The left hand side of the assignments must be one of the predefined condition_information_items. The SET-clause can have multiple of these signal_information items which can be used to capture and communicate program state to the client program.

In the case of the example we can demonstrate how this works using the MySQL command-line client. By issuing a SHOW WARNINGS statement, we can see the message text was conveyed with with the signal_information item:


mysql> show warnings;
+-------+------+---------------------------------------------------------------------------------+
| Level | Code | Message                                                                         |
+-------+------+---------------------------------------------------------------------------------+
| Error | 1644 | The value for initials must consist of upper case letters separated by periods. |
+-------+------+---------------------------------------------------------------------------------+
1 row in set (0.01 sec)

Unfortunately, MySQL does not currently support a way for any calling stored routines to capture the signal_information items. They are currently only available in the APIs you use to communicate with MySQL, so you can capture them in your application code.

(A bug has been filed to ask for the ability to refer to signal information items in stored routines. This should become available whenever MySQL implements a DIAGNOSTICS feature)

Predefined condition information items

I just mentioned that the left-hand side of the signal_information item assignment must be one of the predefined condition_information_items. These are dictated by the standard, and although MySQL allows all of the standard condition_information_items, only two of them are currently relevant: MESSAGE_TEXT and MYSQL_ERRNO. We already illustrated using MESSAGE_TEXT. The MYSQL_ERRNO is a non-standard condition information item that can be used to convey custom error codes.

This leaves currently three variables to convey information about the context of the SIGNAL statement:

SQLSTATE: available in the C API as mysql_sqlstate()

MYSQL_ERRNO: available in the C API as mysql_errno()

MESSAGE_TEXT: available in the C API as mysql_error()

(You should be able to obtain the information also in PHP through the corresponding mysqli_, pdo_ and mysql_ functions.)

How to issue errors

We just discussed how to cause your stored routine to issue warnings. Issuing errors is exactly the same process, it just relies on a different class of SQLSTATE values (as determined by the code prefix). Let's see the errors in action:


mysql> set @initials := 'r';
Query OK, 0 rows affected (0.00 sec)

mysql> call p_validate_initials('Roland', @initials);
ERROR 1644 (45000): The first letter of the value for initials does not match the first letter of the value for first_name

In this case, the stored routine ran through the last branch of the first IF statement:


    ELSEIF   BINARY LEFT(p_first_name, 1) != LEFT(p_initials, 1) THEN
        SIGNAL ERR_INITIALS_DONT_MATCH_FIRSTNAME
        SET MESSAGE_TEXT = 'The first letter of the value for initials does not match the first letter of the value for first_name';
    END IF;

As you can see, the syntax for the actual SIGNAL statement is exactly similar to what we saw in the example illustrating warnings. The most important difference is that in this case, the condition that is being signalled is declared with a SQLSTATE value of 45000:


    DECLARE ERR_INITIALS_DONT_MATCH_FIRSTNAME CONDITION FOR SQLSTATE '45000';

The SQLSTATE code 45000 is a special and indicates a general user-defined exception.

Using the procedure in a trigger

All we need to do now is create our triggers on the person table that call the procedure to perform the actual validation. We need to apply the validation when data is inserted into the table, but also when data is updated. If it turns out the data is invalid, we need to reject the change. For this reason, we want to create triggers that fire before the data change is applied to the table.

So, to enforce validation, we need two triggers: one that fires BEFORE INSERT events, and one that fires BEFORE UPDATE events. Because the validation process itself is the same regardless of the type of change event, both triggers can call the p_validate_initials procedure to perform the actual validation. This allows us to write (and maintain!) the validation logic only once, and reuse it whenever we need it.


DELIMITER go

CREATE TRIGGER bir_person 
BEFORE INSERT ON person
FOR EACH ROW
BEGIN
    CALL p_validate_initials(
        NEW.first_name
    ,   NEW.initials 
    );
END;
go

CREATE TRIGGER bur_person 
BEFORE UPDATE ON person
FOR EACH ROW
BEGIN
    CALL p_validate_initials(
        NEW.first_name
    ,   NEW.initials 
    );
END;
go

DELIMITER ;

A quick check indicates that data validation is now enforced as intended:


mysql> INSERT INTO person (id, first_name, last_name, initials)
    -> VALUES (2, 'Roland', 'Bouman', 'r');
ERROR 1644 (45000): The first letter of the value for initials does not match the first letter of the value for first_name

mysql> INSERT INTO person (id, first_name, last_name, initials)
    -> VALUES (2, 'Roland', 'Bouman', 'R');
ERROR 1644 (45000): The value for initials must consist of upper case letters separated by periods.

mysql> INSERT INTO person (id, first_name, last_name, initials)
    -> VALUES (2, 'Roland', 'Bouman', NULL);
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------+
| Level   | Code | Message                                                          |
+---------+------+------------------------------------------------------------------+
| Warning | 1642 | Corrected NULL value for initials to match value for first_name. |
+---------+------+------------------------------------------------------------------+
1 row in set (0.00 sec)

Final words

For more information on MySQL 5.5, check out Giuseppe Maxia's article on Getting Started with MySQL 5.5. Detailed information on the SIGNAL syntax is available in the reference manual here: http://dev.mysql.com/doc/refman/5.5/en/signal-resignal.html.

Monday, November 16, 2009

Pentaho Data Integration: Javascript Step Performance

I just read a post from Vincent Teyssier on cleaning strings using the javascript capabilities of Pentaho Data Integration (also known as Kettle) and Talend.

In this post, I am looking at a few details of Vincent's approach to using Javascript in his transformation. I will present a few modifications that considerably improve performance of the Javascript execution. Some of these improvements are generic: because they apply to the use of the javascript language, they are likely to improve performance in both Talend as well as Kettle. Other improvements have to do with the way the incoming and outgoing record streams are bound to the javascript step in Kettle.

Original Problem

The problem described by Vincent is simple enough: for each input string, return the string in lower case, except for the initial character, which should be in upper case. For example: vIncEnt should become Vincent.

Vincent illustrates his solution using Pentaho Data Integration's "Modified Javascript Value" step. He uses it to execute the following piece of code:


//First letter in uppercase, others in lowercase
var c = Input.getString().substr(0,1);
if (parseInt(Input.getString().length)==1)
{
    var cc = upper(c);
}
else
{
    var cc = upper(c) + lower(Input.getString().slice(1));
}

(The original post explains that one should be able to execute the code with minimal modification in Talend. While I don't have much experience with that tool, I think the proper step to use in that case is the tRhino step. Both tools use an embedded Rhino engine as javascript runtime, but I can imagine that there are slight differences with regard to binding the input and output fields and the support for built-in functions. Please feel free and leave a comment if you can provide more detailed information with regard to this matter.)

In the script, Input is the string field in the incoming stream that is to be modified, and cc is added to the output stream, containing the modified value. For some reason, the original example uses the javascript step in compatibility mode, necessitating expressions such as Input.getString() to obtain the value from the field.

I used the following transformation to test this script:

The transformation uses a Generate Rows step to generate 1 million rows having a single String type field with the default value vIncEnt. The rows are processed by the Modified Javascript Value step, using the original code and compatibility mode like described in Vincent's original post. Finally, I used a Dummy step. I am not entirely sure the dummy ste has any effect on the performance of the javascript step, but I figured it would be a good idea to ensure the output of the script is actually copied to an outgoing stream.

On my laptop, using Pentaho Data Integration 3.2, this transformation takes 21.6 seconds to complete, and the Javascript step processes the rows at a rate of 46210.7 rows/second.

Caching calls to getString()

Like I mentioned, the original transformation uses the Javascript step in compatibility mode. Compatibility mode affects the way the fields of the stream are bound to the javascript step. With compatibility mode enabled, the step behaves like it did in Kettle 2.5 (and earlier versions): fields from the input stream are considered to be objects, and a special getter method is required to obtain their value. This is why we need an expression like Input.getString() to obtain the actual value.

The first improvement I'd like to present is based on simply caching the return value from the getter method. So instead of writing Input.getString() all the time, we simply write a line like this:


var input = Input.getString();

Afterwards, we simply refer only to input instead of Input.toString(). With this modifcation, the script becomes:


//First letter in uppercase, others in lowercase
var input = Input.getString();
var c = input.substr(0,1);
if (parseInt(input.length)==1)
{
var cc = upper(c);
}
else
{
var cc = upper(c) + lower(input.slice(1));
}

(Note that input and Input are two different things here: Input refers to the field object from the incoming record stream, and input refers to a global javascript variable which we use to cache the return value from the getString() method of the Input field object.)

If you compare this code to the original, you will notice that although this modified example adds an assignment to cache the value, it saves at least one call to the getString() method in the generic case. However, because the input value used in the example is longer than one character, it also saves another call done in the else branch of the if statement. So all in all, we can avoid two calls to getString() in this example.

This may not seem like that big a deal, but still, this improvement allows the javascript step to process rows at a rate of 51200.6 rows per second, which is an improvement of about 11%. Scripts that would have more than two calls to the getter method would benefit even more from this simple improvement.

Disabling Compatibility mode

The compatibility mode is just that: a way to stay compatible with the old Kettle 2.5 behaviour. While this is useful to ensure your old transformations don't break, you really should consider not using it for new transformations.

When disabling compatibility mode, you will need to change the script. In compatibility mode, the names of the fields from the input stream behave like variables that point to the field objects. With compatibility mode disabled, fieldnames still behave like variables, but now they point to the actual value of the field, and not the field object. So we need to change the script like this:


var c = Input.substr(0,1);
var cc;
if (parseInt(Input.length)==1){
    cc = upper(c);
} 
else {
    cc = upper(c) + lower(Input.slice(1));
}

As you can see, we don't need to use the getSting() method anywhere anymore, and this also makes our first improvement obsolete. Personally, I feel this is an improvement code-wise. In addition, the transformation now performs considerably better: now it takes 14,8 seconds, and the javascript step is processing 67159,1 rows per second, which 30% better than the previous solution, and 45% better than the original.

Eliminating unncessary code

The fastest code is the code you don't execute. The original script contains a call to the javascript built-in parseInt() function which is applied to the length property of Input:


if (parseInt(Input.length)==1){
...snip...
}

The intended usage of parseInt() is to parse strings into integer values. Because the type of the length property of a string is already an integer, the call to parseInt() is simply redundant, and can be removed without any issue. This cuts down execution time to 12.8 seconds, and the Javascript step is now processing at a rate of 75204,9 rows per second: an improvement of 12% as compared to the previous improvement, and 63% as compared to the original.

Optimizing the flow

Although it may look like we optimized the original javascript as much as we could, there is still room for improvement. We can rewrite the if statements using the ternary operator, like so:


var cc = Input.length==0 
        ? ""
        : Input.length==1 
        ? Input.toUpperCase()
        : Input.substr(0,1).toUpperCase() 
        + Input.substr(1).toLowerCase() 
;

(Note that I am now using the toLowerCase() and toUpperCase() methods of the javascrpt String object in favor of the kettle built-in lower() and upper() functions.)
Not everybody may appreciate this code-wise, as it may appear a lot less explcit than the original if logic. In its defense, the approach of this solution has a more functional feel (as opposed to the procedural logic of the prior examples), which may feel more natural for the problem at hand. Regardless of any code-maintenance or aesthetic arguments, this code is actually slightly faster: It takes 12.3 seconds total, and the javascript step is processing 80301,9
rows per second, which is a 7% improvement as compared to the previous solution, and a 74% improvement as compared to the original.

Not using Javascript at all

The Javascript step can be very useful. But always keep in mind that it really is a general purpose scripting device. With the javascript step, you can do loops, open files, write to databases and whatnot. Do we really need all this power to solve the original problem? Especially if you are proficient in Javascript, it may be somewhat of a challenge to find better ways to solve the problem at hand, but really - it is often worth it.

First, let us realize that the original problem does not presume a particularly difficult transformation. We just need "something" that takes one input value, and returns one output value. We don't need any side effects, like writing to a file. We also don't need to change the grain: every input row is matched by exactly one output row, which is similar in layout to the output row, save for the addition of a field to hold the transformed value.

When discussing the previous solution, I already hinted that it was more "functional" as compared to the more "procedural" examples before that. We will now look at a few solutions that are also functional in nature:

The Formula step

So, basically, we need to write a function. The Formula step lets you combine several built-in functions in about the same manner as you can in spreadsheet programs like open office and Microsoft Excel. Using the formula step we can enter the following formula:


UPPER(LEFT([Input];1)) & LOWER(MID([Input];2;LEN([Input])))

If, like me, your eyes are bleading now, you might appreciate this formatted overview of this calculation:


    UPPER(
        LEFT(
            [Input]
        ;   1
        )
    ) 
&   LOWER(
        MID(
            [Input]
        ;   2
        ;   LEN([Input])
        )
    )

This solution takes 8.5 seconds to complete, and the formula step is processing rows at a rate of 117868.9 per second, which is 47% better than the previous solution, and 155% better than the original (!!!)

The Calculator step

While not as flexible as the Formula step, the Calculator step offers a reasonable range of often used functions, and has the advantage of often being faster than the formula step. In this case, we're lucky, and we can set up two calculations: one "LowerCase of a string A" to convert the input value entirely to lower case, and then a "First letter of each word in capital of a string A". By feeding the output of the former into the latter, we get the desired result:

(To be fair, because the calculation will actually add a capital to every word in the input, the result will actually be different as compared to any of the other transformations. However, in many cases, you might be able to guarantee that there is actually one word in the input, or otherwise, it may be considered desirable to capitalize all words.)

This transformation complets in just 6.5 seconds, and the calculator processes rows at a rate of 155327,7 per second. This is 32% better than the previous solution and 236% better than the original.

User-defined Java Expression

The final kicker is the user-defined java expression step. The user-defined java expression step allows you to write a java expression, which is compiled while the transformation is initialized. The expression I used is quite like the last javascript solution I discussed, except that we have to use methods of the Java String object (and not the JavaScript string object)


Input.length()==0?"":Input.length()==1?Input.toUpperCase():Input.substring(0,1).toUpperCase() + Input.substring(1).toLowerCase()

The result is truly amazing: The transformation completes in just 3.1 seconds, with the user-defined Java expression step processing at a rate of 324886,2 rows per second. This is 109% faster than the previous solution, and 603% faster than the original.

Conclusion

Javascript is a powerful device in data intergration transfomations, but it is quite slow. Consider replacing the javascript step with either the formula step, the calculator step or the user-defined Java expression step. Depending on your requirements, there may be other steps that deliver the fuunctionality you need.

If you really do need javascript, and you are using Pentaho Data Integration, consider disabling the compatibility mode. On the other hand, if you do need the compatibility mode, be sure to avoid repeated calls the getter methods of the field objects to obtain the value. Instead, call the getter methods just once, and use global script variables to cache the return value.

Summary

Here's a summary of the measurements:


    Transformation   |  Rows per second
    -----------------+-----------------
    Original         |   46201,7
    Cache getString()|   51200,6
    No Compatmode    |   67159,1
    no parseInt()    |   75204,9
    Optimize flow    |   80301,9
    Formula          |  117868,9
    Calculator       |  155327,7
    Java Expression  |  324886,2

...and here, a bar chart showing the results:

Final thoughts

One of the things I haven't looked at in detail is adding more parallelism. By simply modifying the number of copies of the transforming step, we can use more cores/processors, but this is an excellent subject for a separate blog post.

UPDATE
Daniel Einspanjer from Mozilla Coorp. created a 30 min. video demonstrating this hands-on! He adds a few very interesting approaches to squeeze out even more performance.

Tuesday, October 27, 2009

Calpont opens up: InfiniDB Open Source Analytical Database (based on MySQL)

Open source business intelligence and data warehousing are on the rise!

If you kept up with the MySQL Performance Blog, you might have noticed a number of posts comparing the open source analytical databases Infobright, LucidDB, and MonetDB. LucidDB got some more news last week when Nick Goodman announced that the Dynamo Business Intelligence Corporation will be offering services around LucidDB, branding it as DynamoDB.

Now, to top if off, Calpont has just released InfiniDB, a GPLv2 open source version of its analytical database offering, which is based on the MySQL server.

So, let's take a quick look at InfiniDB. I haven't yet played around with it, but the features sure look interesting:

Column-oriented architecture (like all other analytical database products mentioned)

Transparent compression

Vertical and horizontal partitioning: on top of being column-oriented, data is also partitioned, potentially allowing for less IO to access data.

MVCC and support for high concurrency. It would be interesting to see how much benefit this gives when loading data, because this is usually one of the bottle necks for column-oriented databases

Support for ACID/Transactions

High performance bulkloader

No specialized hardware - InfiniDB is a pure software solution that can run on commidity hardware

MySQL compatible

The website sums up a few more features and benefits, but I think this covers the most important ones.

Calpont also offers a closed source enterprise edition, which differs from the open source by offering support for multi-node scale-out support. By that, they do not mean regular MySQL replication scale-out. Instead, the enterprise edition features a true distributed database architecture which allows you to divide incoming requests across a layer of so-called "user modules" (MySQL front ends) and "performance modules" (the actual workhorses that partition, retrieve and cache data). In this scenario, the user modules break the queries they recieve from client applications into pieces, and send them to one or more performance modules in a parallel fashion. The performance modules then retrieve the actual data from either their cache, or from the disk, and sends those back to the user modules which re-assemble the partial and intermediate results to the final resultset which is sent back to the client. (see picture)

Given the MySQL compatibility and otherwise similar features, I think it is fair to compare the open source InfiniDB offering to the Infobright community edition. Interesting differences are that InfiniDB supports all usual DML statements (INSERT, DELETE, UPDATE), and that InfiniDB offers the same bulkloader in both the community edition as well as the enterprise edition: Infobright community edition does not support DML, and offers a bulk loader that is less performant than the one included in its enterprise edition. I have not heard of an InfoBright multi-node option, so when comparing the enterprise edition featuresets, that seems like an advantage too in Calpont's offering.

Please understand that I am not endorsing one of these products over the other: I'm just doing a checkbox feature list comparison here. What it mostly boils down to, is that users that need an affordable analytical database now have even more choice than before. In addition, it adds a bit more competition for the vendors, and I expect them all to improve as a result of that. These are interesting times for the BI and data warehousing market :)

Tuesday, September 15, 2009

MySQL: Another Ranking trick

I just read SQL: Ranking without self join, in which Shlomi Noach shares a nice MySQL-specific trick based on user-defined variables to compute rankings.

Shlomi's trick reminds me somewhat of the trick I came across little over a year ago to caclulate percentiles. At that time, several people pointed out to me too that using user-defined variables in this way can be unreliable.

The problem with user-defined variables

So what is the problem exaclty? Well, whenever a query assigns to a variable, and that same variable is read in another part of the query, you're on thin ice. That's because the result of the read is likely to differ depending on whether the assignment took place before or after the read. Not surprising when you think about it - the whole point of variable assignment is to change its value, which by definition causes a different result when subsequently reading the variable (unless you assigned the already assigned value of course, duh...).

Now watch that previous statement clearly - the word subsequently is all-important.

See, that's the problem. The semantics of a SQL SELECT statement is to obtain a (tabular) resultset - not specifying an algorithm to construct that resultset. It is the job of the RDBMS to figure out an algorithm and thus, you can't be sure in what order individual expressions (including variable evaluation and assignment) are executed.

The MySQL manual states it like this:

The order of evaluation for user variables is undefined and may change based on the elements contained within a given query. In SELECT @a, @a := @a+1 ..., you might think that MySQL will evaluate @a first and then do an assignment second, but changing the query (for example, by adding a GROUP BY, HAVING, or ORDER BY clause) may change the order of evaluation.

The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed.

So what good are these variables anyway?

On the one hand, this looks really lame: can't MySQL just figure out the correct order of doing the calulations? Well, that is one way of looking at it. But there is an equally valid reason not to do that. If the calculations would influence execution order, it would drastically lessen the number of ways that are available to optimize the statement.

This begs the question: Why is it possible at all to assign values to the user-defined variables? The answer is quite simple: you can use it to pass values between statetments. My hunch is the variables were created in the olden days to overcome some limitations resulting from the lack of support for subqueries. Having variables at least enables you to execute a query and assign the result temporarily for use in a subsequent statement. For example, to find the student with the highest score, you can do:


mysql> select @score:=max(score) from score;
+--------------------+
| @score:=max(score) |
+--------------------+
|                 97 |
+--------------------+
1 row in set (0.00 sec)

mysql> select * from score where score = @score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
|        2 | Gromit       |    97 |
+----------+--------------+-------+
1 row in set (0.03 sec)

There is nothing wrong with this approach - problems start arising only when reading and writing the same variable in one and the same statement.

Another way - serializing the set with `GROUP_CONCAT`

Anyway, the percentile post I just linked to contains another solution for that problem that relies on GROUP_CONCAT. It turns out we can use the same trick here.

(Some people may like to point out that using GROUP_CONCAT is not without issues either, because it may truncate the list in case the pre-assigned string buffer is not large enough. I wrote about dealing with that limitation in several places and I remain recommending to set the group_concat_max_len server variable to the value set for the max_packet_size server variable like so:

SET @@group_concat_max_len := @@max_allowed_packet;

)

The best way to understand how it works is to think of the problem in a few steps. First, we make an ordered list of all the values we want to rank. We can do this with GROUP_CONCAT like this:


mysql> SELECT  GROUP_CONCAT(
    ->             DISTINCT score
    ->             ORDER BY score  DESC
    ->         )                   AS scores
    -> FROM    score
    -> ;
+-------------+
| scores      |
+-------------+
| 97,95,92,85 |
+-------------+
1 row in set (0.00 sec)

Now that we have this list, we can use the FIND_IN_SET function to look up the position of any particlar value contained in the list. Because the list is ordered in descending order (due to the ORDER BY ... DESC), and contains only unique values (due to the DISTINCT), this position is in fact the rank number. For example, if we want to know the rank of all scores with the value 92, we can do:


mysql> SELECT FIND_IN_SET(92, '97,95,92,85')
+--------------------------------+
| FIND_IN_SET(92, '97,95,92,85') |
+--------------------------------+
|                              3 |
+--------------------------------+
1 row in set (0.00 sec)

So, the answer is 3 because 92 is the third entry in the list.

(If you're wondering how it's possible that we can pass the integer 92 as first argument for FIND_IN_SET: the function expects string arguments, and automatically converts whichever non-string typed value we pass to a string. In the case of the integer 92, it is silently converted to the string '92')

Of course, we are't really interested in looking up ranks for individual numbers one at a time; rather, we'd like to combine this with a query on the scores table that does it for us. Likewise, we don't really want to manually supply the list of values as a string constant, we want to substitute that with the query we wrote to generate that list.
So, we get:


mysql> SELECT score_id, student_name, score
    -> ,      FIND_IN_SET(
    ->            score
    ->        ,  (SELECT  GROUP_CONCAT(
    ->                        DISTINCT score
    ->                        ORDER BY score  DESC
    ->                    )
    ->            FROM    score)
    ->        ) as rank
    -> FROM   score;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        1 | Wallace      |    95 |    2 |
|        2 | Gromit       |    97 |    1 |
|        3 | Shaun        |    85 |    4 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
+----------+--------------+-------+------+
5 rows in set (0.00 sec)

Alternatively, if you think that subqueries are for the devil, you can rewrite this to a CROSS JOIN like so:


SELECT      score_id, student_name, score
,           FIND_IN_SET(
                score
            ,   scores
            ) AS rank
FROM        score
CROSS JOIN (SELECT  GROUP_CONCAT(
                       DISTINCT score
                       ORDER BY score  DESC
                    ) AS scores
            FROM    score) scores

Now that we have a solutions, lets see how it compares to Shlomi's original method. To do this, I am using the payment table from the sakila sample database.

First, Shlomi's method:


mysql> SELECT   payment_id
    -> ,        amount
    -> ,        @prev := @curr
    -> ,        @curr := amount
    -> ,        @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
    -> FROM     sakila.payment
    -> ,       (SELECT @curr := null, @prev := null, @rank := 0) sel1
    -> ORDER BY amount DESC;
+------------+--------+----------------+-----------------+------+
| payment_id | amount | @prev := @curr | @curr := amount | rank |
+------------+--------+----------------+-----------------+------+
|        342 |  11.99 |           NULL |           11.99 |    1 |
.        ... .  ..... .          ..... .           ..... .    . .
|      15456 |   0.00 |           0.00 |            0.00 |   19 |
+------------+--------+----------------+-----------------+------+
16049 rows in set (0.09 sec)

Wow! It sure is fast :) Now, the GROUP_CONCAT solution, using a subquery:


mysql> SELECT payment_id, amount
    -> ,      FIND_IN_SET(
    ->            amount
    ->        ,  (SELECT  GROUP_CONCAT(
    ->                        DISTINCT amount
    ->                        ORDER BY amount  DESC
    ->                    )
    ->            FROM    sakila.payment)
    ->        ) as rank
    -> FROM   sakila.payment
+------------+--------+------+
| payment_id | amount | rank |
+------------+--------+------+
|          1 |   2.99 |   15 |
.          . .   .... .   .. .
|      16049 |   2.99 |   15 |
+------------+--------+------+
16049 rows in set (0.14 sec)

(In case you're wondering why the results are different, this is because the result set for Shlomi's solution is necessarily ordered by ascending rank (or descending amount - same difference. To obtain the identical result, you need to add an ORDER BY clause to my query. But since the point was to calculate the ranks, I didn't bother. Of course, adding an ORDER BY could slow things down even more.)

Quite a bit slower, bummer. But at leastt we can't run into nasties with the user variables anymore. For this data set, I get about the same performance with the CROSS JOIN, but I should warn that I did not do a real benchmark.

Conclusion

Don't fall into the trap of reading and writing the same user-defined variable in the same statement. Although it seems like a great device and can give you very good performance, you cannot really control the order of reads and writes. Even if you can, you must check it again whenever you have reason to believe the query will be solved differently by the server. This is of course the case whenever you upgrade the server. But also seemingly harmless changes like adding an index to a table may change the order of execution.

Almost all cases where people want to read and write to the same user variables within the same query, they are dealing with a kind of serialization problem. They are trying to maintain state in a variable in order to use it across rows. In many cases, the right way to do that is to use a self-join. But this may not always be feasible, as pointed out in Shlomi's original post. For example, rewriting the payment rank query using a self join is not going to make you happy.

Often, there is a way out. You can use GROUP_CONCAT to serialize a set of rows. Granted, you need at least one pass for that, and another one to do something useful with the result, but this still a lot better than dealing with semi-cartesian self join issues.

Saturday, September 12, 2009

EU Should Protect MySQL-based Special Purpose Database Vendors

In my recent post on the EU antitrust regulators' probe into the Oracle Sun merger I did not mention an important class of stakeholders: the MySQL-based special purpose database startups. By these I mean:

Kickfire

Infobright

Calpont

ScaleDB

I think it's safe to say the first three are comparable in the sense that they are all analytical databases: they are designed for data warehousing and business intelligence applications. ScaleDB might be a good fit for those applications, but I think it's architecture is sufficiently different from the first three to not call it an analytical database.

For Kickfire and Infobright, the selling point is that they are offering a relatively cheap solution to build large data warehouses and responsive business intelligence applications. (I can't really find enough information on Calpoint pricing, although they do mention low total cost of ownership.) An extra selling point is that they are MySQL compatible, which may make some difference for some customers. But that compatibility is in my opinion not as important as the availability of a serious data warehousing solution at a really sharp price.

Now, in my previous post, I mentioned that the MySQL and Oracle RDBMS products are very different, and I do not perceive them as competing. Instead of trying to kill the plain MySQL database server product, Oracle should take advantage of a huge opportunity to help shape the web by being a good steward, leading ongoing MySQL development, and in addition, enable their current Oracle Enterprise customers to build cheap LAMP-based websites (with the possibility of adding value by offering Oracle to MySQL data integration).

For these analytical database solutions, things may be different though.

I think these MySQL based analytical databases really are competitive to Oracle's Exadata analytical appliance. Oracle could form a serious threat to these MySQL-based analytical database vendors. After the merger, Oracle would certainly be in a position to hamper these vendors by resticting the non-GPL licensed usage of MySQL.
In a recent ad, Oracle vouched to increase investments in developing Sun's hardware and operating system technology. And this would eventually put them in an even better position to create appliances like Exadata, allowing them to ditch an external hardware partner like HP (which is their Exadata hardware partner).

So, all in all, in my opinion the EU should definitely take a serious look at the dynamics of the analytical database market and decide how much impact the Oracle / Sun merger could have on this particular class of MySQL OEM customers. The rise of these relatvely cheap MySQL-based analytical databases is a very interesting development for the business intelligence and data warehousing space in general, and means a big win for customers that need affordable datawarhousing / business intelligence. It would be a shame if it would be curtailed by Oracle. After the merger, Oracle sure would have the means and the motive, so if someone needs protection, I think it would be these MySQL-based vendors of analytical databases.

As always, these are just my musing and opinions - speculation is free. Feel free to correct me, add applause or point out my ignorance :)

Thursday, September 03, 2009

MySQL a factor in EU's decision

I just read Björn Schotte's post on the activities of the European Union antitrust regulators concerning the intended takeover of Sun Microsystems by Oracle.

Björn mentions a news article that cites EU Competition Commissioner Neelie Kroes saying that the commission has the obligation to protect the customers from reduced choice, higher costs or both. But to me, this bit is not the most interesting. Later on the article reads:

The Commission said it was concerned that the open source nature of Sun's MySQL database might not eliminate fully the potential for anti-competitive effects.

With both Oracle's databases and MySQL competing directly in many sectors of the database market, MySQL is widely expected to represent a greater competitive constraint as it becomes increasingly functional, the EU executive said.

In other words, the commission is working to protect the MySQL users :)

Personally, I (and many other MySQL community members) don't fear for the future of MySQL as a product. But I do think it is justified to worry about customers that are now paying Sun for some licensed usage of MySQL, most notably OEM customers and a bunch of Enterprise users.

Ever since the news was disclosed concerning the intention of Oracle to acquire Sun, it has been speculated that Oracle my try to "upsell" the Oracle RDBMS to current MySQL enterprise users. However I don't think that that would be the brightest of moves. I did a bit of speculation myself back in April in response to questions put forward in the SSWUG newsletter.

I maintain the opinions I stated there:

MySQL / Oracle are completely different beasts and customers realize this, and most likely Oracle does so too. People running MySQL for web related applications won't move to Oracle. Period. Oracle may be able to grab some customers that use MySQL for data warehousing, but I think that even in these cases a choice for Infobright or Kickfire makes more sense.

Not all problems are database problems - if Oracle does a decent job of supporting and developing MySQL, they may become a respectable enough partner for current (larger) MySQL users to help them solve other problems such as systems integration.

Instead of looking at the benefits for MySQL customers of using Oracle, look at the benefits for Oracle customers using MySQL. Suddenly Oracle can offer support for the most popular webstack in the world - Now all these enterprise customers running expensive Oracle installations can finally build cheap websites based on MySQL and even get support from Oracle on connecting their backend Enterprise Oracle instances to the MySQL web front ends.

It's not all about the products. Open Source adds a whole new dynamic to the development process. I'm not just talking about outside developers that offer new features and code patches, as this does not happen too often. There's more to it than code though

In all successful open source projects I know there is a very lively culture of users engaging with developers and voicing their opinion on what is good and what is not so good. There is a very real chance for the user to influence the direction of the development process (although this does not mean everybody gets what they want in equal amounts). Conversely this provides a great opportunity for the development organization to learn about what the users really need and wish for.

In short, Oracle may want to use Sun/MySQL to learn how to do better business with more empowered users.

Of course, its all just my opinion - speculation is free. So you should feel free too to post your ideas on the matter. Go ahead and leave a comment ;)

Roland Bouman's blog goes i18n (Powered by Google Translate)

Now that Pentaho Solutions is in print, and the first few copies are finding its way towards the readers, I felt like doing something completely unrelated. So, I hacked up a little page translation widget, based on the Google Language API. You can see the result in the top of the left sidebar of my blog right now:

Using it is very simple: just pick the language of choice, and the page (text and some attributes like alt and title) will be translated. Pick the first entry in the list to see the original language again.

This all happens inline by dynamic DOM manipulation, without having to reload the page. I tested it on Chrome 2, Firefox 3.5, Opera 10, Safari 4 and Internet Explorer 6 and 8. So far, it seems to work for all these browsers.

Personally, I feel that the user experience you get with this widget is superior to what you would get with the google translation gadget. In addition, it is pretty easy to to configure the Translator class .

The code to add this to your page is in my opinion reasonably simple:


    <!-- add a placeholder for the user interface -->
    <div id="toolbar"><div>

    <!-- Include script that defines the Translator class -->
    <script type="text/javascript" src="Translator-min.js"></script>
    <!-- Instantiate a translator, have it create its gui and render to placeholder -->
    <script type="text/javascript">
        var translator = new Translator();
        var gui = translator.createGUI(null, "Language");
        document.getElementById("toolbar").appendChild(gui);
    </script>

This really is all the code you need - there are no dependencies on external Javascript frameworks. If you don't need or like the gui, you can of course skip the gui placeholder code as well as the second script and interract with the Translator object programmatically.

The minified javascript file is about 7k, which is not too bad in my opinion. I haven't worried too much about optimizations, and I think it should be possible to cut down on codesize.

Another thing I haven't focused on just now is integration with frameworks - on the contrary I made sure you can use it standalone. But in order to do that, I had to write a few methods to facilitate DOM manipulation and JSON parsing, and its almost certain you will find functions like that are already in your framework.

Anyway, readers, I'd like to hear from you...is this auseful feature on this blog? Would you like to use it on your own blog? If there's enough people that want it, I will make it available on google code or something like that.

Saturday, August 22, 2009

"Pentaho Solutions": copies hit the mail

Hi!

Just a few hours ago, I arrived home after a very quiet and peaceful two-week holiday with my family. It was great! I didn't bring a computer on purpose. I brought a mobile phone, but didn't answer that on purpose too :) Result: absolute relaxation, with lots of time to hike, cycle, and read, and occasional visits to musea and historic sites. Bliss :)

Anyway, now that the bags are unpacked, and the kids are asleep, it's time to face the dragon better known as my inbox. What I found brought a big smile to my face:

Pentaho corp. posing with a copy of "Pentaho Solutions"

Yes - it's true!! Copies of my and Jos' book Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL have hit the mail, and at least one copy has reached the Pentaho office.

Pentaho-ers, thanks for your kind email, and thanks for a great product!

I haven't received one myself (yet), and it will probably take some time still to ship the books to Europe. But it's certainly good to see a physical proof of our work.

Anyway - if you are expecting a copy of the book because you pre-ordered one, or if I or Jos promised you a copy, fear not, it should be heading your way. I hope you like it - Enjoy :)

Friday, July 24, 2009

OSCON 2009 Presentation "Taming your Data: Practical Data Integration Solutions with Kettle" now online

I just delivered my OSCON 2009 presentation "Taming your data: Practical Data Integration Solutions with Kettle". I think it went pretty well, and I got good responses from the audience. I did have much more material than time, and I probably should have proposed to do a tutorial instead. Maybe next year :) Anyway, you can find the presentation and the examples on the OSCON 2009 website. The slides are available in pdf format. There's also a zip file that contains the presentation as well as the kettle sample transformations and jobs.

Saturday, July 18, 2009

My OSCON 2009 Session: Taming your Data...

Yes!

Finally, it's there: In a few hours, I will be flying off to San Franscisco to attend OSCON 2009 in San Jose, California. This is the first time I'm attending, and I'm tremendously excited to be there! The sessions look very promising, and I'm looking forward to seeing some excellent speakers. I expect to learn a lot.

I'm also very proud and feel honoured to have the chance to deliver a session myself. It's called Taming Your Data: Practical Data Integration Solutions with Kettle.

Unsurprisingly, I will be talkig a lot about Kettle, a.k.a. Pentaho Data Integration. Recently, I talked about Kettle too at the MySQL user's conference, and more recently, at a MySQL university session. Those sessions were focused mainly on how Kettle can help you load a data warehouse.

But...there's much more to this tool than just data warehousing, and in this session, I will be exploring rougher grounds, like making sense of raw imdb text files, loading and generating XML, clustering and more. This session will also be much more hands on demonstration than the Sakila sessions. If you're interested and you are also attending, don't hesitate to drop by! I'm looking forward to meeting you :)

And...because the topic of the session kind of relates to my upcoming book, "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" (ISBN: 978-0-470-48432-6, 600+ pages, list price $50.00), my publisher Wiley decided to throw in a litte extra. Yup, that's right - I've got discount coupons for the book, so if you are interested in picking up a copy, or if you just want to give one away to a friend or colleague, come find me at my session (or somewhere else on OSCON) and I'll make sure you'll get one. Thanks Wiley!!

Anyway - I'm hoping to meet you there: see you soon!!!

Tuesday, July 14, 2009

open-msp-viewer: Free XSLT utilities to render MS Project files as HTML web pages

IMPORTANT NOTICE

The old xslt-based open-msp-viewer project is no longer updated. Please find a new and much improved pure HTML version here at github: https://github.com/rpbouman/open-msp-viewer. Thank you for your interest.

Original post below

For my day job, I've been working on a few things that allow you to render Microsoft Project 2003 projects on a web page.

The code I wrote for my work is proprietary, and probably not directly useful for most people. But I figured that at least some of the work might be useful for others, so I wrote an open source version from scratch and I published that as the open-msp-viewer project on google code. If you like, check out the code and give it a spin.

It works by first saving the project in the MS Project XML format using standard MS Project functionality (Menu \ Save As..., then pick .XML) and then applying an XSLT transformation to generate HTML.

Currently, the project includes an xslt stylesheet that renders MS Project XML files as a Gantt chart. To give you a quick idea, Take a look at these screenshots:

and

The web gantt chart is rendered in a HTML 4.01 variant, CSS 2.1 and uses javascript to allow the user to collapse and/or expand individual tasks in the work breakdown structure. Currently, the HTML does not validate due to a few custom attributes I introduced to support dynamic collapsing/expanding the chart with javascript. In addition, the xslt transform process introduces the msp namespace into the result document, which results in a validation error

You can either associate the xslt stylesheet directly with the MS Project XML file, or you can use an external tool like xsltproc.

In the trunk/xml subdirectory, you can find a couple of sample projects in xml format that already have the stylesheet association. I have tested these in IE8, Chrome 2, Safari 4 and Firefox 3.5, and it works well in all these browsers. In the trunk/html directory, you'll find HTML output as created by xsltproc.

In the future, more xslt stylesheets may be added to support alternative views. Things that I think I will add soon are a resources list and a calendar view.

Enjoy, and let me know if you find a bug or would like to contribute.

Sunday, July 12, 2009

WTF? Apple favicon on the Windows update site?

I just noticed the weirdest thing while visiting the windows update site:

As you can see, the shortcut icon is actually an Apple icon...WTF?!! I looked in the code, and this is the code that sets the favicon:

<link rel='shortcut icon' href='shared/images/banners/favicon.ico' type='image/x-icon'/>

When I navigate directly to http://update.microsoft.com/windowsupdate/v6/shared/images/banners/favicon.ico I get the normal icon, that is to say, this:

.When I look with IE6, I actually see that icon as favicon...so the question is, what's up with the Apple icon in IE8?! Do I have some virus or malware that has modified IE8? I googled for the problem, but can't find any references...am I the only one? Anybody else experiencing this?

Friday, July 03, 2009

Starring Sakila: MySQL university recording, slides and materials available onMySQLForge

Hi!

Yesterday I had the honour of presenting my mini-bi/datawarehousing tutorial "Starring Sakila" for MySQL University. I did a modified version of the presentation I did together with Matt Casters at the MySQL user's conference 2009 (video available here, thanks to Sheeri). The structure of the presentation is still largely the same, although I condensed various bits, and I added practical examples of setting up the ETL process and creating a Pentaho Analysis View (OLAP pivot table) on top of a Mondrian Cube.

The slides, session recording, and materials such as SQL script, pentaho data integration jobs and transformations, and Sakila Rentals Cube for Mondrian are all available here on MySQL Forge.

Copyright Notice

Presentation slides, and materials such as SQL script, pentaho data integration jobs and transformations, and Sakila Rentals Cube for Mondrian are all Copyright Roland Bouman. Feel free to download and learn from it. But please do not distribute the materials yourself - instead, point people to the wiki page to get their own copy of the materials. Personal use of the files is allowed. Use these materials for creating training materials of using these materials as training materials is explicitly not allowed without written prior consent. (Just mail me at roland dot bouman at gmail dot com if you would like to use the materials for such purposes, and we can work something out.)

Thursday, June 18, 2009

MySQL Stored Functions: Impact of DECLARE HANDLER on Performance

Hi again!

Just a moment ago, I read this post by Peter Gulutzan. In this post, Peter explains a little trick that allows you to capture the SQL state in a variable whenever an error occurs in your MySQL stored routine code.

MySQL `CONDITION`s and `HANDLER`s

For the uninitiated: in MySQL stored routines, you can declare HANDLERs which are pieces of code that are executed only in case a particular CONDITION occurs. This device serves the same purpose as a try ... catch block which is supported in many popular programming languages like Java, JavaScript, C++, PHP5 and C#.

Now, one of the long-standing problems with MySQL HANDLERs - the inability to explicitly raise a CONDITION - has recently been solved by implementing the SIGNAL syntax. The most important issue that remains is the inability to discover exactly which error occurred. You need this especially when you are writing a generic 'catch all' HANDLER (for example, on that is triggered in response to SQLEXCEPTION) and you want to write information regarding the error to a log.

Peter's Trick to Capture SQLSTATE

To cut a long story short, Peter's solution is based on writing a HANDLER for all known CONDITIONs in advance. Here's a fragment of his code to explain:

BEGIN
  ...
  DECLARE EXIT HANDLER FOR SQLSTATE '01000' BEGIN SET @e='01000'; RESIGNAL; END;
  ...
  DECLARE EXIT HANDLER FOR SQLSTATE 'XAE09' BEGIN SET @e='XAE09'; RESIGNAL; END;

  ...remainder of code...

END;

As Peter points out, it's tedious, but it works. (That is to say, it works better than not having anything)

Performance?

Now, one particular paragraph in Peter's post caught my eye:

I added 38 DECLARE EXIT HANDLER statements at the start of my procedure, just after the variable declarations. These lines are always the same for any procedure. They’re not executed unless an error happens so I don’t worry about speed.

I respect Peter a great deal - if he's got something to say you do well to listen and take his word for it. However, this time I was curious to find out if I could measure the effect of the HANDLER code at all.

Method

The code I used is a simplification of Peter's code. I tested it on MySQL 5.1 because I was just interested in the impact of a DECLARE HANDLER statement. In my case, the BEGIN...END block of the handler does not contain the RESIGNAL statement, and my function does not drop a table but simply returns 1. This is important, as none of the HANDLERs is ever triggered by my code.

Last week I wrote how seemingly small changes in MySQL stored routine code can have a surprisingly large impact on performance. In that particular case, I already had a hunch about which things could be improved. In this case, I just didn't know so I created a series of functions with 2, 4, 8, 16, 32 and 38 DECLARE HANDLER statements, and again I used the BENCHMARK() function to measure the time it takes to execute it 100,000 times. I did warm-up runs, and repeated the measurement 5 times for each function variant.

Results

The graph below summarizes my observations:

The squares in the graph are the measurements - each one represents a version of the function. Along the horizontal axis you see the number of DECLARE HANDLER statements in that particular version of the function. The number of seconds it took to execute the function 100,000 times on my laptop using BENCHMARK() is on the vertical axis.
As you can see, there seems to be a linear relationship between the number of DECLARE HANDLER statements and the time it takes to execute the function. This is in itself a surprise. I mean, I would expect a little overhead per DECLARE statement when the function is compiled initially. After that, it is cached at the session level, and beyond that point I would not expect statements that do not execute to have any impact.

So how badly do the HANDLER declarations slow our function down? Well, I measured an average of 0.38 seconds for 2, and an average of 0.55 seconds for 38 DECLARE HANDLER statements respectively. The difference, 0.18 seconds is a little less than 50% of the function variant with 2 DECLARE HANDLER statements, and a little more than 30% of the function having 38 DECLARE HANDLER statements.

Conclusion

To be fair, the function I tested doesn't actually do anything, and if your function or stored procedure does some real processing, the overhead may be neglectable. However, you can clearly see that even just declaring a handler has a measurable negative impact on performance. The essence of Peter's trick is to actually always write a DECLARE HANDLER for each possible condition and to do this for each stored routine. You will certainly suffer a peformance hit for small functions, esp. if they get called a lot.

Wednesday, June 10, 2009

MySQL: Refactoring a Stored Function

Hi All!

I was just reading PlanetMySQL and noticed Mike Hillyer's recent post on a user-friendly age function for MySQL. Basically, this function accepts two DATETIME values and returns an indication of the time between the two dates in the form of a human-readable string. For example:

mysql> select  TimeDiffUnits('2001-05-01', '2002-01-01')
+-------------------------------------------+
| TimeDiffUnits('2001-05-01', '2002-01-01') |
+-------------------------------------------+
| 8 Months                                  |
+-------------------------------------------+
1 row in set (0.00 sec)

Just for fun, I decided to try and refactor it, and I'm writing this to share the results. Now for a little disclaimer. The purpose of this post is not to bash Mike or belittle his code: I consider him a friend, and I respect him and his skills a lot. The point is to show how seemingly small differences in MySQL stored function code can have quite an impact on performance. The good news is that there is a rationale behind all this - I did not refactor based on trial and error. I hope I can shed some light on that in the remainder of this post when discussing the individual improvements.

Summary

I changed three pieces of code in Mike's original function. Each of these changes help to increase performance just a little bit, and because none of the changes alter the overall structure of the original code, I'm inclined to consider them as improvements.I'm including a graph here:

The Y-axis of the graph represents the average (N=5) number of seconds it took to run

SELECT BENCHMARK(100000, TimeDiffUnits('2001-01-01', '2002-01-01'))

The original function is at the left, and each bar at the right represents one step of refactoring. All in all, at the far right, you can see that the final result is a function with exactly the same behavior which runs about 70% faster than the original :-).

About the Measurements

I should clarify a thing or two to explain the measurments I did.

First of all, BENCHMARK() is usually frowned upon as a general benchmarking tool. However, in this case the measured function is completely computational in nature, and I think that it isn't too bad to get a rough idea of relative performance.

Second, the actual code that is being measured, TimeDiffUnits('2001-01-01', '2002-01-01') is in fact a worst-case^* scenario for this particular function, and it's quite likely that testing it with real data does not automatically result in 70% performance increase.

(^*: It is quite literally a worst-case scenario because the input values are such that the CASE-statement used within the function has to mull through all branches before finding a matching branch)

Finally I should point out that the vast majority of database performance problems have to do with IO or resource contention, and are not computational in nature. So chances are that you can use none of the information in the remainder of this post to improve your performance problem. (You can always stop reading now of course ;-)

That said, if you use MySQL stored functions, it can't hurt to be aware of their performance issues, and it is not too hard to make a habit of writing the fastest possible code. In many cases, you will find that the fastest solution is also the cleanest, shortest, and most maintainable one.

Now, without further ado - how I refactored that function...

Original Function

First, meet the original function as it appeared on Mike's blog.

CREATE FUNCTION TimeDiffUnits (old DATETIME, new DATETIME)RETURNS CHAR(50) 
DETERMINISTIC NO SQL
BEGIN
  DECLARE diff INTEGER;
  SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
  CASE
    WHEN (diff < 3600)     THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');
    WHEN (diff < 86400)    THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');
    WHEN (diff < 604800)   THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');
    WHEN (diff < 2592000)  THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');
    WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
    ELSE                        RETURN CONCAT(FLOOR(diff / 31536000), ' Years');
  END CASE;
END;

This is pretty straight-forward:

First, the time between the input datetime values is computed as a number of seconds. This is done by converting the input values to a number of seconds using the UNIX_TIMESTAMP function. Then, the value for the old datetime is subtracted from the new datetime. The result is assigned to the diff variable where it is stored for later use.

Second, a searched CASE-statement is used to determine the order of magnitude: Minutes, Hours, and so on up to Years. For example, if the number of seconds is less than 3600 (which is an hour, 60 seconds times 60 minutes) the WHEN branch is entered to calculate a number of minutes

The WHEN branches of the CASE-statement calculate how many of the selected units (minutes, hours, etc.) fit into the elapsed time calculated in step 1. Using CONCAT(), this is used to create a nice human-readable string, which is immediately returned from the function.

On my laptop, BENCHMARK(1000000, TimeDiffUnits('2001-01-01', '2002-01-01')) takes between 2.91 and 3.05 seconds.

Step1: Using `DEFAULT` instead of `SET`

The first thing I did was getting rid of the assignment to diff:

SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);

Instead, I used the DECLARE statement of diff to assign the elapsed time right away using the DEFAULT clause:

DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);

For completeness here's the modified function:

CREATE FUNCTION TimeDiffUnits1 (old DATETIME, new DATETIME)RETURNS CHAR(50)
DETERMINISTIC NO SQL
BEGIN
  DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
  CASE
    WHEN (diff < 3600)     THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');
    WHEN (diff < 86400)    THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');
    WHEN (diff < 604800)   THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');
    WHEN (diff < 2592000)  THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');
    WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');
    ELSE                        RETURN CONCAT(FLOOR(diff / 31536000), ' Years');
  END CASE;
END;

This is not a big deal right? Well it isn't. I mean, this is certainly not a big change and I think the code is still just as clear. Did this gain us some performance? Well, just a bit. For me, SELECT BENCHMARK(100000, TimeDiffUnits1('2001-01-01', '2002-01-01')); takes between 2.89 and 2.98 seconds to run. This is about 2% better than the original. Admittedly, next to nothing, but considering that we casually eliminated only one assignment, I think it is rather good!

Take-away: Don't assign when you don't have to. Each local variable declaration is an implicit assignment - use it if you can.

Step 2: Using `DIV` instead of float division and `FLOOR()`

The second change I introduced is a bit larger than the previous one. To compute the number of elapsed units, the original code uses the division operator (/). This uses floating point arithmetic, and to get a nice integer result, the division is wrapped inside the FLOOR() function. In fact, this is a pattern that I have observed earlier in other code (yes, I'm guilty too :(), and I wrote about it in the past.

As it turns out, we don't need the division operator to perform division. At least, not in this case. MySQL provides the DIV operator, which is designed to perform integer division. This is great news for two reasons:

No time is wasted converting the numbers to floating point values to perform the calculation

Because the result of DIV is also an integer, we don't need FLOOR to convert back to integer again.

So, for example, this:

WHEN (diff < 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');

Can be rewritten to

WHEN (diff < 31536000) THEN RETURN CONCAT(diff DIV 2592000, ' Months');

. This should be faster, and its less code too. Here's the modified function for completeness:

CREATE FUNCTION TimeDiffUnits2 (old DATETIME, new DATETIME)RETURNS CHAR(50)
DETERMINISTIC NO SQL
BEGIN
  DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
  CASE
    WHEN (diff < 3600)     THEN RETURN CONCAT(diff DIV 60 , ' Minutes');
    WHEN (diff < 86400)    THEN RETURN CONCAT(diff DIV 3600, ' Hours');
    WHEN (diff < 604800)   THEN RETURN CONCAT(diff DIV 86400, ' Days');
    WHEN (diff < 2592000)  THEN RETURN CONCAT(diff DIV 604800, ' Weeks');
    WHEN (diff < 31536000) THEN RETURN CONCAT(diff DIV 2592000, ' Months');
    ELSE                        RETURN CONCAT(diff DIV 31536000, ' Years');
  END CASE;
END;

After the modification, BENCHMARK(100000, TimeDiffUnits2('2001-01-01', '2002-01-01')) takes between 2.61 and 2.72 seconds to run on my laptop. This is about 11% faster than the original, and about 9% faster than my first improvement.

Take-away: If you are doing division, think a minute about what data types you are using. Do you really need float arithmetic? If you don't, then don't use the division operator (/), simply use DIV instead. It may be less known, but it is more explicit and can give you some extra performance. If you are using FLOOR, ask yourself why you are throwing away fractional digits. There are a bunch of cases where you just need to format something that is intrinsically fractional, but if you can't care less about the fractional numbers, chances are you can chuck the FLOOR away and simply avodi the fractional numbers by using straight integer division.

Step 3: Using the CASE operator instead of the CASE statement

Finally, I changed the CASE-statement and replaced it with a CASE-operator instead.

Now, I have seen on numerous occasions that people are confused about CASE, so here's a quick overview, just to get it completely straight:

The `CASE`-statement

The CASE-statement is not the same as the CASE-operator.
The CASE-statement is a program-flow control statement, and is allowed in stored routines (procedures, functions, triggers and events), but not in regular SQL statements.
The CASE-statement is used to choose between a number of alternative code paths. Each WHEN branch must contain at least one statement, and may contain multiple statements. Note that statements that appear in the WHEN branch of a CASE-statement are always followed by a semi-colon statement terminator.
The CASE-statement is syntactically terminated using the keywords END CASE. Note that END CASE is typically followed by a semi-colon that acts as the statement terminator for the CASE-statement. (The only exception is when the CASE statement appears as top-level statement in a stored routine, in which case the semi-colon is allowed but not required.

The `CASE`-operator

The CASE-operator is not the same as the CASE-statement.
The CASE-operator (aka CASE-expression) is a value-expression. It is allowed in almost all places where you can use a value. So, you can use the CASE-operator in regular SQL statements as well as in stored routines.
The WHEN-branches of the CASE-operator must contain a single value-expression (which may itself be composite). The WHEN-branches of the CASE-operator can thus not contain statements, and cannot contain multiple expressions - it just wouldn't make sense because the CASE-operator evaluates to a single value, just like any other expression. Because statements are not allowed in the WHEN branches of CASE-operators, there are never terminating semi-colons inside these WHEN branches.
The CASE-expression is terminated with the END keyword. Note that this is different fromt the terminator of the CASE-statement which is END CASE
Also, note that a CASE-expression does not itself have a semi-colon as statement terminator for the simple reason that it is not a statement. Of course, it is possible for a CASE-expression to appear at the end of a statement. In this case, there will be a semi-colon statement terminator immediately after the END of the CASE-expression, but it is important to realize that that semi-colon terminates the statement that contains the CASE-expression - it does not terminate the CASE-expression itself.

Simple `CASE` and searched `CASE`

So, I just discussed two different occurrences of CASE: as statement, and as operator/expression. Now to complicate things even more, each of these can appear in two different forms, namely the simple and the searched CASE, making up a total of 4 different CASE variants. Fortunately, I wrote about the simple and searched CASE earlier in another article.

`CASE`-statement versus `CASE`-expression

As you can see in the original code, Mike uses a CASE-statement. Each of the WHEN-branches contains a single RETURN statement that passes the return value to the caller. With a small modification, we can rewrite this to a single RETURN statement that uses a CASE-expression to pick the right value. The result is shown below:


CREATE FUNCTION TimeDiffUnits3(old DATETIME, new DATETIME) RETURNS char(50)
NO SQL DETERMINISTIC
BEGIN
  DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);
  RETURN CASE
    WHEN (diff < 3600)     THEN CONCAT(diff div 60 , ' Minutes')
    WHEN (diff < 86400)    THEN CONCAT(diff div 3600, ' Hours')
    WHEN (diff < 604800)   THEN CONCAT(diff div 86400, ' Days')
    WHEN (diff < 2592000)  THEN CONCAT(diff div 604800, ' Weeks')
    WHEN (diff < 31536000) THEN CONCAT(diff div 2592000, ' Months')
    ELSE                        CONCAT(diff div 31536000, ' Years')
  END;
END

While testing, I found that BENCHMARK(100000, TimeDiffUnits3('2001-01-01', '2002-01-01')) takes between 1.69 and 1.78 seconds to run. That is a 70% improvement over the original, and a 66% and 53% improvement with regard to the prior versions respectively.

Personally, I am curious why this is such a big improvement, and my guess is that this can be explained by assuming that a CASE statement is chopped up into many byte-code instructions which are each executed individually and sequentially, whereas the CASE-operator is written as a single C-function. At any rate, I think this was rather worth it, and personally I feel the final solution is a bit cleaner than the original.

Take-away: If you need some choice structure, think about what you are choosing. Are you deciding between different code paths, or are you picking between values? If you are picking values, simply write a CASE-expression. It's more explicit, and it is a lot faster than a CASE-statement. Another thing to consider:
do you need multiple statements in the branches, or can you get by with a single statement? If you can get by with a single statement, and it is always a RETURN or a SET-statement, then you can rewrite it to a CASE expresion.

Conclusion

Because the changes I made are really quite small, I think this mostly shows that MySQL stored function compilation is poorly optimized (if at all). I am not a compiler expert but my gut feeling is that most of the optimizations could have been done automatically.

Finally

If you are interested in refactoring MySQL stored functions and procedures, you might also like a few other articles I wrote on the subject:

Refactoring Cursors

Refactoring UNION

Refactoring information schema queries

Tuesday, December 15, 2009

Triggers

Example data validation problem

A data validation stored procedure

How to issue warnings

Conveying SIGNAL context information

Predefined condition information items

How to issue errors

Using the procedure in a trigger

Final words

Monday, November 16, 2009

Original Problem

Caching calls to getString()

Disabling Compatibility mode

Eliminating unncessary code

Optimizing the flow

Not using Javascript at all

The Formula step

The Calculator step

User-defined Java Expression

Conclusion

Summary

Final thoughts

Tuesday, October 27, 2009

Tuesday, September 15, 2009

The problem with user-defined variables

So what good are these variables anyway?

Another way - serializing the set with GROUP_CONCAT

Conclusion

Saturday, September 12, 2009

Thursday, September 03, 2009

Saturday, August 22, 2009

Friday, July 24, 2009

Saturday, July 18, 2009

Tuesday, July 14, 2009

IMPORTANT NOTICE

Original post below

Sunday, July 12, 2009

Friday, July 03, 2009

Copyright Notice

Thursday, June 18, 2009

MySQL CONDITIONs and HANDLERs

Peter's Trick to Capture SQLSTATE

Performance?

Method

Results

Conclusion

Wednesday, June 10, 2009

Summary

About the Measurements

Original Function

Step1: Using DEFAULT instead of SET

Step 2: Using DIV instead of float division and FLOOR()

Step 3: Using the CASE operator instead of the CASE statement

The CASE-statement

The CASE-operator

Simple CASE and searched CASE

CASE-statement versus CASE-expression

Conclusion

Finally

Conveying `SIGNAL` context information

Another way - serializing the set with `GROUP_CONCAT`

MySQL `CONDITION`s and `HANDLER`s

Step1: Using `DEFAULT` instead of `SET`

Step 2: Using `DIV` instead of float division and `FLOOR()`

The `CASE`-statement

The `CASE`-operator

Simple `CASE` and searched `CASE`

`CASE`-statement versus `CASE`-expression