Saturday, March 29, 2008

MySQL information_schema: Identifying rows from TABLE_CONSTRAINTS

Yesterday, I set out a little quiz about the TABLE_CONSTRAINTS table in the MySQL information_schema. The task was:
  • Specify a minimal set of columns of the information_schema.TABLE_CONSTRAINTS table that is sufficient to reliably identify a single row in the information_schema.TABLE_CONSTRAINTS table.
  • Argue why these columns are necessary and sufficient to identify a row, and why a smaller set of columns does not exist

Short Answer


For MySQL there are two such column sets:

  • CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, and CONSTRAINT_TYPE
  • TABLE_NAME, TABLE_SCHEMA, CONSTRAINT_NAME, and CONSTRAINT_TYPE

Explanation


According to the ISO SQL specification (ISO/IEC 9075-11:2003 (E) 6.48 TABLE_CONSTRAINTS base table), we can derive that the "conceptual primary key" of a standard implementation of the TABLE_CONSTRAINTS system view should be (CONSTRAINT_CATALOG, CONSTRAINT_SCHEMA, and CONSTRAINT_NAME). In the absence of support for catalogs, we can ignore the CONSTRAINT_CATALOG column, and this also applies to MySQL (all %_CATALOG columns in the MySQL information_schema are always NULL). That would leave us with (CONSTRAINT_SCHEMA and CONSTRAINT_NAME) - that is, table constraint names would be unique within a schema. However, in MySQL, constraint names need not be unique within a schema.

MySQL supports three types of table constraints:

  • PRIMARY KEY
  • UNIQUE
  • FOREIGN KEY (that is, they are supported dependent upon the storage engine)
.

Identifying FOREIGN KEY constraints


FOREIGN KEYs are in their own per schema namespace, so at a glance it may seem that the combination of (CONSTRAINT_SCHEMA and CONSTRAINT_NAME) is sufficient to identify a FOREIGN KEY constraint. However, because PRIMARY KEY and UNIQUE constraints are not within the same namespace as FOREIGN KEY constraints, a single schema may contain a FOREIGN KEY constraint and a non-FOREIGN KEY constraint that have the same CONSTRAINT_NAME.

From this it follows that we need to take the CONSTRAINT_TYPE into account too.

So in order to reliably identify a FOREIGN KEY constraint, we need to have at least CONSTRAINT_SCHEMA, CONSTRAINT_NAME *and* require that CONSTRAINT_TYPE equals 'FOREIGN KEY'.

(Among the answers was a suggestion that the CONSTRAINT_TYPE is not necessary, because the CONSTRAINT_NAME would provide the necessary differentiation. Although this is true for the identifiers generated by MySQL in the absence of explicit identifiers, it is not true in the general case)

Identifying PRIMARY KEY constraints


So, would (CONSTRAINT_SCHEMA, CONSTRAINT_NAME, and CONSTRAINT_TYPE) be sufficient to identify any of the other types of table constraints?

Unfortunately not. Almost everybody that is somewhat familiar with MySQL knows that in MySQL a PRIMARY KEY constraint doesn't really have its own identifier. Rather, the CONSTRAINT_NAME for a PRIMARY KEY constraint is always 'PRIMARY'. In fact, the identifier 'PRIMARY' is reserved exclusively for PRIMARY KEY constraints, and may not be used for any other type of constraint.

From this, it follows that there can be multiple rows in information_schema.TABLE_CONSTRAINTS that have the same combination of values in the columns (CONSTRAINT_SCHEMA, CONSTRAINT_NAME, and CONSTRAINT_TYPE). In fact, for one particular value of CONSTRAINT_SCHEMA there will be just as many occurrences of the combination ('PRIMARY', 'PRIMARY KEY') in the columns (CONSTRAINT_NAME, CONSTRAINT_TYPE) as there are PRIMARY KEY constraints in that schema.

Of course, we know that there can be only one PRIMARY KEY per table. From that, it follows that we can identify a PRIMARY KEY constraint if we know to which table it belongs.

Tables (and views) reside in their own per-schema namespace, so the combination of (TABLE_SCHEMA and TABLE_NAME) is sufficient to identify a table. That means that the combination of (TABLE_SCHEMA, TABLE_NAME, and CONSTRAINT_TYPE) is sufficient to identify a PRIMARY KEY constraint. Once we identified a particular table using (TABLE_SCHEMA, TABLE_NAME) we know that the table constraint with the CONSTRAINT_TYPE equal to 'PRIMARY KEY' is the PRIMARY KEY constraint.

Of course, because the name 'PRIMARY' is reserved exclusively for PRIMARY KEY constraints we could've equally well settled for (TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME) although personally I prefer to use CONSTRAINT_TYPE.

Identifying UNIQUE constraints


This leaves us with the UNIQUE constraints. UNIQUE constraints must have a unique name per table, and the name must not be equal to 'PRIMARY' (which is reserved for PRIMARY KEY constraints). From this we must conclude that we need no less than the combination of (TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_TYPE, and CONSTRAINT_NAME) to identify a UNIQUE constraint.

We must have (TABLE_SCHEMA, TABLE_NAME) to identify the table because the UNIQUE constraints have a unique name per table. Now we can take a shortcut: within one table we can distinguish between PRIMARY KEY and UNIQUE constraints by virtue of the fact that the CONSTRAINT_NAME for PRIMARY KEYs will always be 'PRIMARY'. However, if we need to distinguish between FOREIGN KEY and UNIQUE constraints, we still need to look at the CONSTRAINT_TYPE too.

Summary


To recapitulate:
  • we need CONSTRAINT_SCHEMA, CONSTRAINT_NAME, CONSTRAINT_TYPE to identify a FOREIGN KEY constraint
  • we need TABLE_SCHEMA, TABLE_NAME, and CONSTRAINT_TYPE to identify a PRIMARY KEY constraint.
  • Alternatively we could identify PRIMARY KEY constraints by looking at TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME by virtue of the fact that all PRIMARY KEYs always have a CONSTRAINT_NAME equal to 'PRIMARY'
  • we need TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_TYPE and CONSTRAINT_NAME to identify a UNIQUE constraint

From this we could conclude that the minimal set of columns required to identify an arbitrary table constraint consists of:
  • CONSTRAINT_SCHEMA
  • CONSTRAINT_NAME
  • CONSTRAINT_TYPE
  • TABLE_SCHEMA
  • TABLE_NAME

However, in MySQL, a table constraint is 'owned' by the table on which it is defined and a table constraint defined on particular table cannot reside in a schema different from the schema its table. This means that for any row in information_schema.TABLE_CONSTRAINTS, the columns TABLE_SCHEMA and CONSTRAINT_SCHEMA will always have the same value.

This brings us to the final conclusion that there are two minimal set of columns that may be used to identify an arbitrary constraint in the information_schema.TABLE_CONSTRAINTS table:
  • CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE
  • TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE

As CONSTRAINT_SCHEMA and TABLE_SCHEMA are completely interchangeable as far as the result is concerned, there is no specific reason to generically prefer either combination.

To learn more about this and other traps in the MySQL information schema, come meet me at the MySQL User's conference. I will be your guide on the Grand Tour of the Information Schema and its Applications, and explain more of these gotcha's.

Thursday, March 27, 2008

"Me Too" MySQL Information Schema popquiz

A few days ago, I wrote how I will be your guide to the Grand Tour of the Information Schema and its Applications which is one of the two talks I will be doing at the upcoming MySQL User's Conference.

In view of the popularity of "Pop Quiz" format so successfully used by Carsten, I feel compelled to imitation, and as a primer to my talk, I'd like to offer you my "Me Too" MySQL Information Schema popquiz. So, here goes...


The MySQL information_schema contains a number of tables. Among them, one is called TABLE_CONSTRAINTS. Unsurprisingly, each row in TABLE_CONSTRAINTS table represents a single table constraint. The TABLE_CONSTRAINTS table has the following columns:

+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| CONSTRAINT_CATALOG | varchar(512) | YES | | NULL | |
| CONSTRAINT_SCHEMA | varchar(64) | NO | | | |
| CONSTRAINT_NAME | varchar(64) | NO | | | |
| TABLE_SCHEMA | varchar(64) | NO | | | |
| TABLE_NAME | varchar(64) | NO | | | |
| CONSTRAINT_TYPE | varchar(64) | NO | | | |
+--------------------+--------------+------+-----+---------+-------+


  • Specify a minimal set of columns of the information_schema.TABLE_CONSTRAINTS table that is sufficient to reliably identify a single row in the information_schema.TABLE_CONSTRAINTS table.

  • Argue why these columns are necessary and sufficient to identify a row, and why a smaller set of columns does not exist



Feel free to post the answer as a comment. Every correct answer will earn a 20% discount to the admission fee. You can get bonus points for nice elegant argumentation, and also if you specify more than one minimal set of columns.

I will elect three to five of the best answers and make sure that whoever posts it gets a laminated printout of the diagram of the MySQL information schema.

(Please make sure you post me your email address if you think you are eligible to either the discount code, the laminated diagram or both)

Tuesday, March 25, 2008

MySQL Information Schema applications at the UC2008

Last week I blogged about the upcoming MySQL Users conference, in particular about the Writing MySQL UDFs tutorial that I will be delivering.

I will also be doing the Grand Tour of the Information Schema and its Applications.


I will discuss the elements in the MySQL information schema, and provide tips to write queries against it. Most of the talk will revolve around a number of scripts I have developed over the past few years:

  • Generating a history/audit database - generate all the code you need to keep track of all changes occurring in your on-line database. Use it to audit or implement 'flashback'

  • Checking foreign key violations - disabling foreign key checks may be useful, but is dangerous. This script helps you find problems with foreign key constraints

  • Creating federated tables - FEDERATED tables are useful, but tedious and error-prone to create. Let this script do the work instead

  • Checking for duplicated and redundant indexes - Redundant or duplicate indexes can slow down your database performance. Find them with this script (Note: the original script I blogged about earlier contains a critical flaw - this is a completely new version)

I also have a few new information schema tricks up my sleeve. Without giving away too much, take a look at this little conversation I just had with my MySQL command line client:

mysql> call qwz.qwz(null);
+-----+------------------------------------------------------------------+
| | Welcome to the command line Query Wizard for MySQL!!! |
+-----+------------------------------------------------------------------+
| 1 | Set the schema (current: world) |
| --- | ---------------------------------------------------------------- |
| 2 | Choose a table |
+-----+------------------------------------------------------------------+
3 rows in set (0.00 sec)

Yes! It's an interactive query wizard for the command line. I don't want to spoil too much, but I can unveil that I will be presenting a full-fledged interactive query wizard that operates completely from the MySQL command line. It's just a stored procedure - no proxy, no UDFs, no plugins or whatever.

To give you a taste of the possibilities, take a look at the continuation of my session with the query wizard. First let's choose the schema:

mysql> call qwz.qwz(1);
+----+------------------------+
| | Set the default schema |
+---+------------------------+
| 1 | information_schema |
| 2 | mysql |
| 3 | qwz |
| 4 | sakila |
| 5 | world |
+---+------------------------+

Let's settle for the sakila schema:

mysql> call qwz.qwz(4);

This will prompt us to choose a table from the sakila schema:

+-----+------------------------------------------------------------------+
| | The command line Query Wizard for MySQL |
+-----+------------------------------------------------------------------+
| 1 | Set the schema (current: sakila) |
| --- | ---------------------------------------------------------------- |
| 2 | actor |
| 3 | actor_info |
| 4 | address |
| 5 | category |
| 6 | city |
| 7 | country |
| 8 | customer |
| 9 | customer_list |
| 10 | film |
| 11 | film_actor |
| 12 | film_category |
| 13 | film_list |
| 14 | film_text |
| 15 | inventory |
| 16 | language |
| 17 | nicer_but_slower_film_list |
| 18 | payment |
| 19 | rental |
| 20 | sales_by_film_category |
| 21 | sales_by_store |
| 22 | staff |
| 23 | staff_list |
| 24 | store |
+-----+------------------------------------------------------------------+
26 rows in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Let's pick the rental table:

mysql> call qwz.qwz(19);

Once we selected the initial table, we can now build a join path, and we are prompted to pick any of the tables related to rental:

+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | + sakila.customer (fk_rental_customer) |
| 2 | + sakila.inventory (fk_rental_inventory) |
| 3 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 4 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 5 | New Query |
+-----+------------------------------------------------------------------+
8 rows in set (0.20 sec)

Query OK, 0 rows affected (0.20 sec)

Let's extend the join path with the customer table:

mysql> call qwz.qwz(1);
+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | - sakila.customer (fk_rental_customer) |
| 2 | + sakila.address (fk_customer_address) |
| 3 | + sakila.store (fk_customer_store) |
| 4 | + sakila.inventory (fk_rental_inventory) |
| 5 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 6 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 7 | New Query |
+-----+------------------------------------------------------------------+
10 rows in set (0.09 sec)

Query OK, 0 rows affected (0.09 sec)

Let's throw in the inventory table in the mix too:

mysql> call qwz.qwz(4);
+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | - sakila.customer (fk_rental_customer) |
| 2 | + sakila.address (fk_customer_address) |
| 3 | + sakila.store (fk_customer_store) |
| 4 | - sakila.inventory (fk_rental_inventory) |
| 5 | + sakila.film (fk_inventory_film) |
| 6 | + sakila.store (fk_inventory_store) |
| 7 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 8 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 9 | New Query |
+-----+------------------------------------------------------------------+
12 rows in set (0.10 sec)

Query OK, 0 rows affected (0.10 sec)

We can keep this up quite a long time, and we can remove tables from our join path in a similar way. We can then start specifying some columns:

mysql> call qwz.qwz(8);
+-----+------------------------------------------------------------------+
| | Choose columns for the SELECT list |
+-----+------------------------------------------------------------------+
| | sakila.rental.* |
| | --------------- |
| 1 | + rental_id |
| 2 | + rental_date |
| 3 | + inventory_id |
| 4 | + customer_id |
| 5 | + return_date |
| 6 | + staff_id |
| 7 | + last_update |
| | --------------- |
| | sakila.customer.* (fk_rental_customer) |
| | -------------------------------------- |
| 8 | + customer_id |
| 9 | + store_id |
| 10 | + first_name |
| 11 | + last_name |
| 12 | + email |
| 13 | + address_id |
| 14 | + active |
| 15 | + create_date |
| 16 | + last_update |
| | -------------------------------------- |
| | sakila.inventory.* (fk_rental_inventory) |
| | ---------------------------------------- |
| 17 | + inventory_id |
| 18 | + film_id |
| 19 | + store_id |
| 20 | + last_update |
| --- | ---------------------------------------------------------------- |
| 21 | New Query |
+-----+------------------------------------------------------------------+
30 rows in set (0.02 sec)

Query OK, 0 rows affected (0.02 sec)

Currently I'm still working on a multiple select interface, but for now I'll snip out all but the last individual column selection:

mysql> call qwz.qwz(18);
+-----+------------------------------------------------------------------+
| | SELECT list: + columns are added / - columns are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental.* |
| | --------------- |
| 1 | + rental_id |
| 2 | - rental_date |
| 3 | + inventory_id |
| 4 | + customer_id |
| 5 | + return_date |
| 6 | + staff_id |
| 7 | + last_update |
| | --------------- |
| | sakila.customer.* (fk_rental_customer) |
| | -------------------------------------- |
| 8 | + customer_id |
| 9 | + store_id |
| 10 | - first_name |
| 11 | - last_name |
| 12 | + email |
| 13 | + address_id |
| 14 | + active |
| 15 | + create_date |
| 16 | + last_update |
| | -------------------------------------- |
| | sakila.inventory.* (fk_rental_inventory) |
| | ---------------------------------------- |
| 17 | + inventory_id |
| 18 | - film_id |
| 19 | + store_id |
| 20 | + last_update |
| --- | ---------------------------------------------------------------- |
| 21 | Show SQL |
| --- | ---------------------------------------------------------------- |
| 22 | Explain SQL |
| --- | ---------------------------------------------------------------- |
| 23 | Execute SQL |
| --- | ---------------------------------------------------------------- |
| 24 | New Query |
+-----+------------------------------------------------------------------+
36 rows in set (0.03 sec)

Query OK, 0 rows affected (0.03 sec)

Now is a good moment to inspect the SQL underlying our query:

mysql> call qwz.qwz(21);

+-----+----------------------------------------------------------------------------+
| | The command line Query Wizard for MySQL |
+-----+----------------------------------------------------------------------------+
| | SELECT t1.rental_date |
| | ,t2.first_name |
| | ,t2.last_name |
| | ,t8.film_id |
| | FROM sakila.rental AS t1 |
| | INNER JOIN sakila.customer AS t2 ON (t1.customer_id) = (t2.customer_id) |
| | INNER JOIN sakila.inventory AS t8 ON (t1.inventory_id) = (t8.inventory_id) |
| --- | ---------------------------------------------------------------- |
| 1 | Execute SQL |
| --- | ---------------------------------------------------------------- |
| 2 | Explain SQL |
| --- | ---------------------------------------------------------------- |
| 3 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 4 | New Query |
+-----+----------------------------------------------------------------------------+
15 rows in set (1.21 sec)

Query OK, 0 rows affected (1.21 sec)

And we can execute it, or run EXPLAIN on it:

mysql> call qwz.qwz(2);
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
| 1 | SIMPLE | t8 | index | PRIMARY | PRIMARY | 3 | NULL | 4673 | |
| 1 | SIMPLE | t1 | ref | idx_fk_inventory_id,idx_fk_customer_id | idx_fk_inventory_id | 3 | sakila.t8.inventory_id | 1 | |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 2 | sakila.t1.customer_id | 1 | |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
3 rows in set (0.00 sec)

As you can imagine, quite a good deal of information schema hacking going on here.

Code will be released in full immediately after my talk, and if there is sufficient interest I will discuss the internals of the qwz stored procedure in full.

Friday, March 21, 2008

Sun/MySQL

Today I spent the larger part of the afternoon at the Sun office in Amersfoort. And,
Yes, I signed ;-)
and joined the now largest open source company in the world.

Wednesday, March 19, 2008

UDFs at the MySQL User's conference

The MySQL User's conference will be held in less than a month from now!!!

This year there is quite a good number of sessions on adding your own functions and procedures, such as:

I will be doing the 3 hour tutorial on writing user-defined functions, and I am currently adding the last few final touches to my slides.

My tutorial will be very much a hands-on experience. The ambition is to allow people with some programming skills in either C/C++, PHP or Java to leave the room with a bunch of UDFs they created themselves during one of labs. With that and the supporting materials (slides and a handout with detailed instructions) the attendees will be able to write UDFs themselves, and have the knowledge that allows them to make a sensible decision when they have to choose between stored SQL functions, UDFs or raw expressions built of built-in functions.

Allow me to tell you a bit about UDFs - you might decide you want to take my tutorial ;)

User-defined functions


User-defined functions or UDFs are often confused with Stored SQL functions but unrightly so. Other than the fact that stored functions and user-defined functions can be created by the user (as opposed to hard-wired, built-in functions), they have little in common. If you want to know exactly what the difference is and what the strengths and weaknesses are of either feature, my tutorial is for you.

Features


For now I don't want unveil too much but I think that it might be good to point out a number of key advantages of using UDF's over stored SQL functions:

  • Flexible argumentlists - UDFs allow a flexible number of dynamically typed arguments, and it is possible to identify arguments by name (rather than position).

  • The ability to leverage the functionality from external libraries - UDFs can be linked to existing libraries in order to expose their functionality to SQL statements

  • Aggregate functions - UDFs can compute a value for a collection of rows just like the built-in functions COUNT and GROUP_CONCAT


Neither of these features is available to stored SQL functions, and a generic workaround is simply impossible or far from trivial. So - UDFs might be for you if you need one of these things.

Performance


Another reason to use UDFs might be performance. UDFs are much, much better than stored SQL functions when it comes to computation. To prove this fact, I'd like to share a number of benchmarks I did.

For the benchmarks I use a stored procedure like this:

create procedure sp_<SOME-NAME>_benchmark(p_num int unsigned)
begin
declare v_num int unsigned default 0;
declare v_return <SOME-DATA-TYPE>;
declare v_begin int unsigned default unix_timestamp();
while v_num < p_num do
set v_return := <SOME-EXPRESSION>;
set v_num := v_num + 1;
end while;
call sp_store_function_benchmark(
'<SOME-NAME>'
, p_num
, unix_timestamp() - v_begin
);
end;

As you can see, the procedure repeats the evaluation of <SOME-EXPRESSION>, depending on the specified number of iterations passed to the parameter p_num. Depending on the benchmark, an expression is plugged in place of <SOME-EXPRESSION>. The variable v_return is used to capture the evaluation result, and a suitable data type is used in place of <SOME-DATA-TYPE>. Before running the loop, the time is recorded. After the loop, the elapsed time, the number of repetitions and the benchmark name are stored in a table for later analysis. This allows me to test a series of increasing repetitions of the same expression.

I also use a sp_noop_benchmark() procedure that is identical to the one described above, except that it omits the assignment of the expresion. So, it lacks this line:

set v_return := <SOME-EXPRESSION>;

This allows us to isolate the time spent on evaluating the expression and doing the assignment. I also use another control measurement that uses an assignment like this:

set v_return := <SOME-LITERAL-VALUE>;

This allows us to estimate the time spent on the value assignment, which can be used to get a bit closer to the actual time spent only on expression evaluation alone. (Of course, this assumes that the time spent on the assignment alone is comparable for expressions, function calls and literals).

All these measurements were made by comparing single, complete calls many times. All tests were done on Ubuntu Gutsy Gibbon using MySQL 5.1.23. Compilation of UDFs was performed using gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2). No optimizations were employed.

Stored functions and UDFs


To compare UDFs and Stored functions, I first used simple "Hello World" expressions that all evaluate to a VARCHAR(14). This was measured:

  • String literal sp_string_benchmark. The expression was:
    v_return := 'Hello, String!';

  • Stored SQL function sp_ssf_benchmark. The expresion was:
    v_return := ssf_hello_world();
    The ssf_hello_world() itself was:

    create function ssf_hello_world()
    returns varchar(14)
    return 'Hello, SSFs!!!';

  • User defined function sp_udf_benchmark. The expression was:
    v_return := udf_hello_world();
    The UDF itself was:

    my_bool udf_hello_world_init(
    UDF_INIT *initid, UDF_ARGS *args,
    char *message
    ){
    return 0;
    }
    char *udf_hello_world(
    UDF_INIT *initid, UDF_ARGS *args
    , char* result, unsigned long* length
    , char *is_null, char *error
    ){
    *length = 14;
    return "Hello, UDFs!!!";
    }

Here are the raw results:

+----------+------+--------+-----+-----+
| repeated | noop | string | UDF | SSF |
+----------+------+--------+-----+-----+
| 0 | 0 | 0 | 0 | 0 |
| 500000 | 3 | 4 | 6 | 8 |
| 1000000 | 7 | 9 | 10 | 16 |
| 1500000 | 10 | 13 | 16 | 25 |
| 2000000 | 13 | 17 | 21 | 33 |
| 2500000 | 16 | 22 | 25 | 42 |
| 3000000 | 19 | 27 | 31 | 50 |
| 3500000 | 23 | 31 | 36 | 58 |
| 4000000 | 26 | 36 | 41 | 67 |
| 4500000 | 30 | 40 | 45 | 75 |
| 5000000 | 33 | 44 | 51 | 83 |
| 5500000 | 36 | 49 | 55 | 90 |
| 6000000 | 39 | 53 | 61 | 99 |
| 6500000 | 42 | 57 | 66 | 107 |
| 7000000 | 45 | 62 | 71 | 115 |
| 7500000 | 49 | 66 | 76 | 123 |
| 8000000 | 52 | 71 | 81 | 131 |
| 8500000 | 56 | 75 | 86 | 140 |
| 9000000 | 58 | 80 | 90 | 147 |
| 9500000 | 62 | 83 | 96 | 156 |
| 10000000 | 64 | 88 | 101 | 164 |
+----------+------+--------+-----+-----+

And here is a graph:
udf-ssf-benchmark
At a glance we see that doing nothing is the fastest, followed by the assignment of a literal string. After that, comes the UDF and the stored SQL function is by far the slowest. Another thing that is immediately apparent is that the execution time increases proportionally as the number of repetitions is increased. This means that we can essentially forget the entire series and focus on the measurements with the highest number of repetitions (which is presumably more accurate than the earlier measurements).

But exactly how much faster is the UDF as compared to the SQL function? Well, it depends on how you look at it. If we just divide the raw execution time of the stored function by that of the UDF, we see that the ratio is:

164/101 = 1.60

This would tempt us into thinking that UDF's are about 60% faster.

However, when we calculate it like this, we are also counting the overhead of the benchmark procedures themselves. We get a better result if we focus on only the expression assignments. We can express the performance in terms of the time required to execute the no-operation benchmark. So:

noop / udf = 64 / 101 = .63
and

noop / ssf = 64 / 164 = .39

This correction indicates that the UDF performs at 63% of the no-operation benchmark, whereas the stored SQL function performs at 39% of the no-operation benchmark. We could say that relative to the no-operation benchmark, UDFs are 24% faster as compared to SQL functions

Now we could stretch it a bit more and try to isolate only the time spent on evaluating the functions. To do that, we'd have to assume that "assignment" costs the same, no matter if we are assigning from a stored SQL function, user-defined function or string literal. If we do that, we get this figure:

literal / udf = 88 / 101 = .87
and

literal / ssf = 88 / 164 = .54

This correction indicates that the UDF performs at 87% of the string-literal benchmark, whereas the stored SQL function performs at 54% of the string-literal benchmark. This would indicate that relative to the string literal assignment, UDFs are 33% faster than SQL functions.

Now, I can imagine that a lot of people will have some reservations against this method of correcting the measurements. Personally I think the method is valid, although the result itself is not that useful. I mean, in real life, the fact is that we will be assigning the value of the function, and from that perspective it doesn't help us much to know how fast it would have been if we didn't perform an assignment. Another reservation that we may have is that this benchmark still doesn't tell us much more than the relative differences in overhead. Basically, both the UDF and the stored function are empty - so this benchmark can tell us nothing about performance in a more realistic case where the function is actually doing some work.

Addition


In an attempt to measure at least some basic processing I decided to benchmark addition too. I took a INTEGER as data type and did the following benchmarks:

  • Integer literal: sp_num_benchmark with the expression:
    v_return := 3;

  • Addition operator: sp_opr_benchmark with the expression:
    v_return := 1+2;

  • UDF: sp_udf_benchmark with the expression:
    v_return := UDF_ADD(1,2);
    The code for the UDF is
    my_bool udf_add_init(
    UDF_INIT *initid
    , UDF_ARGS *args
    , char *message
    ){
    if(args->arg_count!=2){
    strcpy(message, "Require two arguments");
    return 1;
    } else {
    args->arg_type[0]= INT_RESULT;
    args->arg_type[1]= INT_RESULT;
    }
    initid->maybe_null= 1;
    return 0;
    }

    long long udf_add(
    UDF_INIT *initid
    , UDF_ARGS *args
    , char *is_null
    , char *error
    ){
    if((args->args[0]==NULL)||(args->args[1]==NULL)){
    *is_null= 1;
    return 0;
    } else {
    return (*((long long*)args->args[0])) + (*((long long*)args->args[1]));
    }
    }

  • Stored SQL function: sp_ssf_benchmark with the expression:
    v_return := SSF_ADD(1,2);
    The code for the function is

    create function ssf_add(l int, r int)
    returns int
    return l+r;
Here is the raw data:

+--------------+------+-----+-----+-----+-----+
| repeat_count | noop | num | opr | udf | ssf |
+--------------+----- +-----+-----+-----+-----+
| 0 | 0 | 0 | 0 | 0 | 0 |
| 500000 | 3 | 5 | 5 | 6 | 9 |
| 1000000 | 7 | 9 | 9 | 11 | 18 |
| 1500000 | 10 | 14 | 13 | 16 | 27 |
| 2000000 | 13 | 19 | 18 | 21 | 35 |
| 2500000 | 17 | 23 | 22 | 26 | 47 |
| 3000000 | 20 | 27 | 27 | 33 | 56 |
| 3500000 | 23 | 33 | 33 | 38 | 62 |
| 4000000 | 27 | 38 | 36 | 43 | 74 |
| 4500000 | 30 | 42 | 42 | 50 | 83 |
| 5000000 | 34 | 46 | 47 | 54 | 91 |
| 5500000 | 36 | 51 | 60 | 68 | 108 |
| 6000000 | 42 | 56 | 59 | 67 | 115 |
| 6500000 | 45 | 60 | 62 | 71 | 123 |
| 7000000 | 47 | 65 | 69 | 78 | 132 |
| 7500000 | 50 | 67 | 71 | 84 | 143 |
| 8000000 | 54 | 72 | 78 | 89 | 149 |
| 8000000 | 54 | 73 | 78 | 89 | 149 |
| 8500000 | 59 | 78 | 79 | 92 | 157 |
| 9000000 | 60 | 85 | 85 | 99 | 168 |
| 9500000 | 65 | 89 | 92 | 104 | 178 |
| 10000000 | 67 | 91 | 95 | 111 | 189 |
+--------------+------+-----+-----+-----+-----+
And here is the graph of the results:
addition-benchmark
We can immediately see that the result is quite similar to what we saw in the previous measurements. Let's look at the performance relative to assigning a literal integer:

int / opr = 91 / 95 = 0.96
and

int / udf = 91 / 111 = 0.82
and

int / ssf = 91 / 189 = 0.48
So, A direct addition operator is less than 5% slower than a literal assignment. The UDF 18% slower than the literal assignment and 14% slower than the addition operator. The stored SQL function is quite a good deal slower: 48% slower than the direct addition operator and 34% slower than the UDF. If we want to compare only UDFs vs stored SQL functions, we should compare relative to the addition operator, which gives a minutely different result:

opr / udf = 95 / 111 = 0.83
and

opr / ssf = 95 / 189 = 0.5

Built-in functions and UDFs


While I was busy doing benchmarks, I also figured that it'd be nice to figure out how UDFs and built-in functions compare performance-wise. To make things slightly more realistic than simply returning a value, I chose to implement my own version of CONCAT().

Because CONCAT can take on a number of arguments I decided to measure the effect of a varying number of arguments too. I ended up choosing VARCHAR(10) for the data type, and tested these different benchmarks:

  • Built-in function sp_bif<1..10>_benchmark - where <1..10> is a number from 1 to 10, and where the expression was:
    v_return := CONCAT(0[,1[,...,9]);
    That is, 10 different CONCAT measurements, from CONCAT(0) to CONCAT(0,1,2,3,4,5,6,7,8,9)

  • UDF sp_udf<1..10>_benchmark - where <1..10> is a number from 1 to 10, and where the expression was:
    v_return := UDF_CONCAT(0[,1[,...,9]);
    That is, 10 different measurements of my UDF implementation of concat, from UDF_CONCAT(0) to UDF_CONCAT(0,1,2,3,4,5,6,7,8,9).The code for my UDF_CONCAT is:

    my_bool udf_concat_init(
    UDF_INIT *initid, UDF_ARGS *args
    , char *message
    ){
    int i;
    size_t bytes = 0;
    for(i=0;iarg_count; i++){
    args->arg_type[i] = STRING_RESULT;
    bytes += args->lengths[i];
    }
    if(!(initid->ptr= malloc(bytes))){
    strcpy(message, "Error allocating memory.");
    return 1;
    } else {
    return 0;
    }
    }

    char *udf_concat(
    UDF_INIT *initid, UDF_ARGS *args
    , char* result, unsigned long* length
    , char *is_null, char *error
    ){
    int i;
    char *buff = initid->ptr;
    *length = 0;
    for(i=0;iarg_count; i++){
    if(args->args[i]==NULL){
    *is_null = 1;
    break;
    } else {
    memcpy(buff + *length, args->args[i], args->lengths[i]);
    *length += args->lengths[i];
    }
    }
    return buff;
    }

    void udf_concat_deinit(
    UDF_INIT *initid
    ){
    if(initid->ptr){
    free(initid->ptr);
    }
    }

I also measured the same string literal assignment and the noop benchmarks as control measurements. I won't give the raw results here - send me an email or post a comment if you'd like to have them. I do have a graph of the result here:
udf-builtin-benchmarkNote that for this graph I already subtracted the no-operation benchmark. This was done to make it easier to examine the data sets of the UDFs and Built-in functions. The graph does include the literal string benchmark, which is the orange line nearest to the X axis.

Now, what do we see here apart from the control benchmark? Basically, we see two bundles of series. The bundle nearest to the X axis corresponds with the built-in CONCAT() benchmarks. The bundle above that corresponds to the UDF benchmarks. So, this tells us that on every occasion, the built-in CONCAT() was faster than the UDF.

We can also see that there is a relatively small but measurable effect for passing more arguments. The calls with fewer arguments are consistently faster than the calls with more arguments within both the group of built-in and UDF calls.

Frankly, I didn't expect to see this. I had imagined that the effect of passing more arguments would be larger, so I expected to see pairs of UDF/Built-in function calls having the same number of parameters. Clearly, I was wrong.

So how much faster are UDFs? Well, if we look at the raw ratios of execution times for the largest number of repetitions, we see:

UDF_CONCAT(0) / CONCAT(0) = 91 / 107 = .85

Interestingly, if we compare that for the calls with 10 arguments we see the same ratio:

UDF_CONCAT(0,1,2,3,4,5,6,7,8,9) / CONCAT(0,1,2,3,4,5,6,7,8,9) = 91 / 107 = .85

So, UDF_CONCAT() would seem to be 15% slower than the built-in CONCAT(). Interestingly, there seems to be no difference between the call with the larger number of arguments, indicating that the performance impact of passing another argument is about the same for UDFs as it is for built-in functions.

What happens if we express the execution time relative to the no-operation benchmark? Well, we get:

noop / UDF_CONCAT(0) = 64 / 107 = .60
noop / CONCAT(0) = 64 / 91 = .70

and

noop / UDF_CONCAT(0,1,2,3,4,5,6,7,8,9) = 64 / 122 = .52
noop / CONCAT(0,1,2,3,4,5,6,7,8,9) = 64 / 104 = .62

Here we see that relative to the no-operation benchmark, the UDF is 10% slower than the built-in function. We could say that 10% is quite a lot, but it is more than half as large as the difference between UDFs and stored SQL functions.

We see something interesting if we look at the execution time relative to the execution time spent on the string literal assignment:

string / UDF_CONCAT(0) = 88 / 107 = .82
string / CONCAT(0) = 88 / 91 = .87

and

string / UDF_CONCAT(0,1,2,3,4,5,6,7,8,9) = 88 / 122 = .72
string / CONCAT(0,1,2,3,4,5,6,7,8,9) = 88 / 104 = .85

Here, the difference between the single argument calls is only 5%, wereas the difference becomes more 12% for the calls with 10 arguments. This is interesting because in the comparison with the no-operation benchmark we did not detect a difference caused by the the amount of arguments.

My current suspicion is that we are actually witnessing the effect of the length of the left hand expression in the assignment operation. Our string literal is a VARCHAR(14), and it should probably be a CHAR(1) for the comparison to the single argument calls and a CHAR(10) for the comparison with the 10 argument calls.

Conclusion


The UDFs I tested are

  • somewhere around and between 25% to 33% faster than stored SQL functions

  • somewhere around and between 5% to 20% slower than built-in functions


Another interesting finding is that argument passing seems to have about the same impact on UDFs as it has on built-in functions. A single argument seems to cost about 1% of the function performance in both cases. For stored SQL functions no such data is currently available.

Discussion


Currently, all functions call are single complete function calls. That means that for UDFs, the initialization function and the row-level function are both called exactly once per UDF instance. When using UDFs in multi-row SQL statements, the initialization function will be called only once per occurence of the UDF, and the row-level function will be called for each row. For functions where the initialization function is relatively expensive, this means that the performance observed in these benchmarks is probably poorer that it can be. A new set of benchmarks should be done to measure the performance of functions in multi-row SQL statements.

Finally


So, are you interested? If you are, go and register for the conference. You can contact me and get a 20% discount! Of course I'd love to see you attend my tutorial too!

See you at the conference!

Thursday, March 13, 2008

MySQL Stored Procedures: CASE syntax

Thank you all for taking the time to respond to the little challenge I posted yesterday! I am pleasantly surprised to note that so many people took the time to post a solution. And most people provided the correct answer too: you are all entitled to a well deserved discount to register for the MySQL User's conference!!!

For those of you interested in the solution: there are two different forms of the CASE statement syntax: the so-called simple case and the searched case.

The simple case selects one WHEN...THEN branch by comparing the value of the expression that appears after the CASE keyword with the value of the expressions given in each of the WHEN...THEN branches. It enters the first branch found where the values are equal to one another:

CASE expr
WHEN expr1 THEN ...statements...
...
WHEN exprN THEN ...statements...
ELSE ...statements...
END CASE

The searched case syntax simply chooses the first WHEN...THEN branch for which the condition appearing after the WHEN keyword is TRUE:

CASE
WHEN cond1 THEN ...statements...
...
WHEN condN THEN ...statements...
ELSE ...statements...
END CASE

The p_find_slash procedure uses a simple case but accidentally used conditions the WHEN...THEN branches:

CASE v_char
WHEN v_char = '/' THEN ...
...
END CASE

What many people don't realize is that syntactically this is perfectly valid. That's because to MySQL, those conditions are simply particular types of expression. It's just that their value will be either 0 or 1 depending on whether the condition holds FALSE or TRUE respectively (consequently, the built-in constants FALSE and TRUE are in fact synonyms for 0 and 1 respectively).

I already emailed a few of you, so if you didn't yet receive an email from me, send me your email address and I'll make sure you get the discount code. I can be reached via email:

Roland dot Bouman at gmail dot com

See you at the UC!

Wednesday, March 12, 2008

MySQL stored procedurs: ...the CASE that they gave me...

Let's see if you can solve this little puzzle...

Consider this stored procedure:

-- finds the first slash and exits
create procedure p_find_slash(p_text text)
begin
declare v_index int default 1;
declare v_length int default character_length(p_text);
declare v_char char(1);

_main_loop: while v_index <= v_length do -- loop over all characters

set v_char := substring(p_text, v_index, 1); -- grab the current character
case v_char
when v_char = '/' then -- found a slash!
select concat('A slash at ', v_index) message; -- report it
leave _main_loop; -- and then stop
else
select concat(v_char, ' is not a slash.') message; -- not a slash...
end case;
set v_index := v_index + 1; -- next character pls
end while;
end;

Of course, it's a bogus stored procedure, but that's not the point right now. Most people can guess what the intention is of this procedure: it should examine each character in the argument text, check if it is a slash, and if so, report its position and then stop execution. If the character is not a slash, the procedure moves on to the next character.

So, what do you expect when we call it with only a literal slash as argument? Let's find out:

call p_find_slash('/');

Well? It may come as a surprise to some, but this is the result:

+-------------------+
| message |
+-------------------+
| / is not a slash. |
+-------------------+

So, can you explain this result? Can you fix it?

Just leave a comment to this post with your explanation and the solution. Results published later this week....and oh!! If you know how to fix this, maybe you're ready to move on to the next level and should attend the "Advanced Stored Procedures" tutorial by Mariella Di Giacomothe at the MySQL User's Conference. Like Giuseppe Maxia wrote earlier, you can earn a 20% discount code by asking a speaker!!

So that is the return: if you post your solution as a comment on this blog, I'll make sure you'll get that code for a 20% discount.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...