Roland Bouman's blog: 2006

Saturday, December 09, 2006

MySQL stored routines and the command line client: No comment...Or Maybe..?!

A popular myth holds that the MySQL Database Server strips the comments from stored procedures, functions, triggers, views and events. This is not true. Or at least, it is only somewhat true.

On an earlier occasion, I have written about MySQL views and how they lose all their comments and whitespace formatting. The article also describes a workaround for it. (Interestingly, a few of the bugs I ran into at the time are now resolved, which means you can do without the kludges I had to turn to in order to get it to work.)

Apart from views, there really is no limitation in the capabilities of the MySQL Server to keep the comments inside code associated with stored routines. (With stored routines, I mean: stored procedures, functions, triggers and events).

There's a really plain and simple explanation for the endurance of the "comment-stripping-myth". The myth remains alive because a lot (if not most) people exclusively use the mysql command line client command line tool for their MySQL work. The mysql interprets the user input, and strips the comments before it sends the command to the server. So any comments never even arrive at the server side.

MySQL Command Line Client Interpreter

The mysql command line client tool implements a simple state-machine to discover when the user input should be sent to the server. For example, the ; you normally type after a statement, and the delimiter command you use in order to create a stored procedure using the command line client: those are all elements that don't really belong to the SQL language itself. They exist solely for the benefit of the mysql command line client tool, because it has to have some way of knowing when the user is done typing whatever is to be sent to the server.

The mysql command line interpreter does not limit itself to interpreting the statement delimiter. It also recognizes that the user input initiates a comment. Let's take a closer look at that behaviour:


mysql>

This is what the mysql prompt normally looks like. When it looks like this, it is ready to accept input. Let's type some input, and then skip to the new line:


mysql> select version()
    ->

As you can see, the client tool notices that we started to type some input. The prompt changes from mysql> to ->, notifying the user that all the subsequent input will be seen as belonging to the current statement. When we now type the default statement delimiter ;, the command line tool will send the gathered input to the server, wait to receive the result and output that to the user:


mysql> select version() 
    -> ;
+-----------------+
| version()       |
+-----------------+
| 5.1.15-beta-log | 
+-----------------+
1 row in set (0.00 sec)

mysql>

It also displays a new fresh prompt, ready to receive new input.

So what happens when we initiate a comment? Well, the mysql command line client notices that too. Single line comments are initiated by a # character or the -- (dash-dash-space) sequence, which will effectively stop interpreting the subsequent input until a newline character is seen. This can be demonstrated by typing the statement delimiter right after the comment is initiated, and then skipping to the next line:


mysql> select version() # single line comment;
    -> -- dash dash space;
    -> ;

As you can see, the ; does not delimit the statement if it appears on the same line after the # or -- single line comment initiators. The statement delimiter is not the only character that ceases to be given a special meaning: all input appearing on the line after the comment was initiated will not be interpreted.

Comments can also span multiple lines using the C-style comment syntax:


mysql> select /*          
   /*> comments;
   /*> */ version();

So, the /* sequence initiates a comment, and all subsequent input, including new lines, will not be interpreted as having special meaning. That is, until the next */ sequence is encountered, delimiting the comment. You can see that the command client tool knows we're inside the multi-line comment, because the prompt changes accordingly: the second and third line have the /*> prompt rather than the usual -> prompt.

Comments are stripped before sending the command to the server

We've just seen that the mysql command line client interprets the user input until it sees a statement delimiter. When it sees one, it treats the gathered input as one command, and sends it to the server. So far, nothing strange is happening. To observe the effect of comments stripping, we need to create a little stored procedure.

Stored procedures, functions, triggers and events are also created with a single statement. However, these types of statements differ from the single SELECT statements we just observed, because these statements can themselves contain a statement sequence. So, if the mysql command line client is to be used to create them, there must be a mechanism to distinguish the delimiter that is used to separate the statements inside the procedure from the delimiter that is used to mark the end of the outer statement so it can be sent to the server.

This is achieved using a special mysql command line client command, the delimiter command. The delimiter command is used to define a word or a delimiting sequence as the statement delimiter. Let's use the delimiter command:


mysql> delimiter $$

This command tells the mysql command line client command to use the $$ sequence as a marker to gather the user input and send it to the server. It's not sent to the server - it only serves to modify the behaviour of the local command interpreter. The default statement delimiter ; may still appear at the end of a statement, but it will simply be ignored:


mysql> select version() 
    -> ;
    -> $$
+-----------------+
| version()       |
+-----------------+
| 5.1.15-beta-log | 
+-----------------+
1 row in set (0.00 sec)

mysql>

Now, we can create a stored routine that contains a statement sequence:


mysql> CREATE PROCEDURE p()                                                        
    -> BEGIN       
    -> /******************
   /*> * A simple procedure
   /*> *******************/
    ->     SELECT version();
    ->     SELECT schema();
    -> END;
    -> $$
Query OK, 0 rows affected (0.00 sec)

The actual statement sequence is contained in the BEGIN..END statement. The BEGIN..END statement is therefore also a compound statement.

The statement list itself is made up of only two SELECT statements. Each of these needs to be properly delimited using the ; statement delimiter. Here, the ; symbol is fixed, and cannot be changed. In this context, the ; is interpreted by the MySQL Server. Inside the BEGIN..END block is also a multi line comment. Technically, comments are not statements, however we did put it inside the block with the intention of keeping it there.

In order to be able to send the CREATE PROCEDURE statement as a whole to the server, the mysql command line client needs to stop interpreting the ; symbol as a statement delimiter. That's why we had to redefine the mysql command line client statement delimiter to something other than ;. In this example, the $$ was chosen. As soon as the mysql command line client sees the $$ sequence, it will gather the user input, and send it to the server as one complete statement, which tells the server to create the procedure.

To change the mysql command line client statement delimiter back to the original, simply use the DELIMITER command again:


mysql> delimiter ;

We can use the SHOW CREATE PROCEDURE statement or the ROUTINES view in the information_schema to see what the procedure looks like according to the server:


mysql> SELECT routine_definition
    -> FROM   information_schema.routines
    -> WHERE  routine_schema = schema()
    -> AND    routine_name = 'p'
    -> ;
+-------------------------------------------------------+
| routine_definition                                    |
+-------------------------------------------------------+
| BEGIN

    SELECT version();
    SELECT schema();
END | 
+-------------------------------------------------------+
1 row in set (0.01 sec)

mysql>

This result clearly shows that the comment is now gone, although we took no explicit action to remove the comment. So where did it go? How do we know for sure it is due to the mysql client tool that the comment is gone?

The simplest way to prove it, is to use the either the MySQL Query Browser or the MySQL Administrator GUI Tools. For example, using the MySQL Administrator, you can find the procedures by clicking the "Catalogs" item in the top sidebar, and then selecting the database that contains the procedure in the bottom sidebar. The procedures are all available in the "Stored Procedures" tab:

There, you can select the procedure. Hitting the "Edit" putting pops up a small editor:

By pasting the original code there, and then pushing the "Execute" button, we recreate the procedure. If we now use the mysql command line client again to see what the procedure looks like, we can see that the comment is exactly were we put it originally:


mysql> SELECT routine_definition
    -> FROM   information_schema.routines
    -> WHERE  routine_schema = schema()
    -> AND    routine_name = 'p'
    -> ;
+---------------------+
| routine_definition                                                                                                        |
+---------------------+
| BEGIN       
/******************
* A simple procedure
*******************/
    SELECT version();
    SELECT schema();
END | 
+---------------------+
1 row in set (0.00 sec)

You might argue that this solves the problem: if we simply switch and use these GUI tools, we will never have this problem again. Well, that's true, and maybe that works for you. However, in a lot of cases, I like the mysql command line tool better for a number of reasons (not discussed here).

Why I like code comments stored in the database

I like to write comments inside stored procedures. I use them to document the algorithm as well as it's usage. I picked up the habit when I worked as application developer writing tons of Oracle PL/SQL. Parts of the PL/SQL code of Oracle's own builtin packages (package headers, sometimes package bodies) can be queried using the data dictionary, and usually the comments inside the PL/SQL source code are enough to figure out how to use the builtin package.

Now I know a lot of people will say: "Nah, the database was not meant for all that, you should write proper documentation!" or "Why don't you use a version control system for that." Of course, I agree that proper documentation should be written, and I agree that version control systems should be used. However, the comments inside the stored procedure code serve another purpose, and I think it's a pity it's discarded by the mysql command line client tool.

Documentation is usually not written at the level of the implementation. Hacks, work-arounds and tricks used to implement a certain algorithm usually don't make it to the documentation. However, it is important to record their rationale, and in my opinion, there is no better place than the code itself.
A version control system should be in place in the development environment. However, when an application is distributed, the actual development code (including the comments) might not be readily available when it's most needed. Keeping the code and comments together inside the database server can be a life-saver if somebody suddenly runs into problems, because it enhances the possibilities to modify the code.
What about Licensing? Suppose I want to include, say a GPL license note with the code? I'd rather keep it as close to my stored procedure as possible.

Ok, have it your way, and try without comments

To illustrate the value of retaining the comments, consider the following function:


CREATE FUNCTION f_check_ccno(
    p_ccno BIGINT 
)
RETURNS TINYINT
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
SQL SECURITY INVOKER
BEGIN
    DECLARE v_ccno VARCHAR(20) 
        DEFAULT reverse(
            cast(p_ccno AS CHAR(20)) 
        );
    DECLARE v_len_ccno TINYINT
        DEFAULT length(v_ccno);
    DECLARE v_i TINYINT 
        DEFAULT 1;
    DECLARE v_num TINYINT 
        DEFAULT 1;
    DECLARE v_check BIGINT 
        DEFAULT 0;
    WHILE v_i <= v_len_ccno DO
        SET v_num := cast(
            substr(v_ccno,v_i,1)
        AS  SIGNED
        );
        IF v_i%2=0 THEN
            SET v_num := 2*v_num;
            IF v_num > 9 THEN
                SET v_num := v_num - 9;
            END IF;
        END IF;
        SET v_check := v_check + v_num;
        SET v_i := v_i + 1;
    END WHILE;
    RETURN (v_check%10)=0;
END;

So, what kind of function is this? Well, the function is called f_check_ccno, and it accepts a p_ccno parameter that is of the BIGINT type. Supposedly, ccno means something, assuming that f_ and p_ are prefixes for "function" and "parameter" respectively. The function returns a TINYINT which we assume encodes the validation result in some way.

It becomes more difficult if we want to find out what kind of validation we're dealing with. I invite you to try it: just peel it off, statement by statement and expression by expression Just guess what this thing is supposed to do.

Most people with some programming experience will have only little difficulties if at all to find out what the function does. That's hardly a suprise: the MySQL Stored Procedure language is a structured higher programming language, and it consists mostly of English keywords. However, the real problem with software program source code is to find out what it means or what was intended. Programming languages have no construct to code for purpose. All speculation on the meaning of this function so far has been derived from the identifiers. We just assume that the names have something to do with the role they play in the code.

(People: try and guess this. Post a comment here and describe what you think this function does. It is actually a useful function - I didn't just make it up. And, oh yeah...I put in a little snag - a bug. Can you find it? In one or two weeks, I'll post it as a snippet at MySQL Forge, with comments and a proper description.)

Now why do we even care? Suppose we inherit someones database and we happen to bump into this function. Of course, we assume the function is there for a reason, and it's might be used by another piece of code, say, a view, or a trigger, or maybe even a query issued by the application. But can't we just treat this function as a black box, and trust it to do whatever it's supposed to do?

Well, sure we can! And if everything is working fine, there's no reason to change anything. Of course we're just ignoring that some day, things might not work fine anymore. Bugs may suddenly occur, or we may find that the function does not scale well. Another likely scenario is a change of business, requiring that we re-evaluate or change the business rules and the software that enforces it. So one way or the other, we're going to have to deal with this function. In the worst case, we mightfind ourselves rebuilding part of the system, gathering and analyzing business requirements again. Once we're done, we might find that we've ended up with pretty much the same function we started with, the only difference being that we re-invented it ourselves.

Of course, you might get pretty fed up with the job by the time that's finished. You'll find a new opportunity and find a new occupation, leaving your successor with a new function called f_check_cc_number() that accepts a BIGINT UNSIGNED parameter called p_cc_number, and that returns a ENUM('PASS','FAIL') value. Needless to say, this is a vast improvement over the original, and has a really nifty and superior implementation. Of course, it still won't have comments because the mysql command line client has stripped them out again...

Yup. I guess you know where this is heading.

Great. Now what do we do?

Well, there are several options:

Switch to the GUI tools

Modify the mysql command line client

Use a little trick...

I already discussed the first option more or less. If it works for you, great. I like the GUI Tools too, but for a lot of work I just feel more comfortable with the mysql command line client.

The second option is something I'm seriously considering. It would be great to be able to turn the comment stripping on using a configuration option. The default behaviour would not be changed, and people that want to turn comment stripping off have a clean way of doing so. I explored the source code a little bit, and although I'm not a C/C++ programmer, I should be able to pull this off.

Of course, everybody can do this, and contribute the code. If you want to get started, get the source code, and hack away. To get you started, there is a very simple hack that will make mysql preserve at least the single line comments.

To do so, open the {mysql-bk}/client/mysql.cc source file and goto line 1108. There you will find something like this:


    if (!in_string && (line[0] == '#' ||
                       (line[0] == '-' && line[1] == '-') ||
                       line[0] == 0))
      continue;                                 // Skip comment lines

Removing these lines will make mysql preserve all single line comments that start a line. At line 1330 a similar snippet strips the single line comments that are start within a line:


    else if (!*ml_comment && (!*in_string && (inchar == '#' || 
                              inchar == '-' && pos[1] == '-' &&
                              my_isspace(charset_info,pos[2]))))
      break;                                    // comment to end of line

I tried it, and this seems to work just fine. Of course, making the patch really worth while will, put mildly, probably take a bit more overview of the source than I have now. For example, what keeps nagging me is the question "Why on earth are the comments stripped in the first place?" Anyway.

Finally: A little trick....

There is a dirty little trick that might just work fine for the time being. If you read the manual section that describes the comment syntax you might notice that there are two different forms of the multi-line comment. We already discussed the plain "C-style" comments. There is another syntax that is meant as a compatibility syntax rather than a comment syntax. It goes like this:


/*! MySQL-specific code */

The purpose of these "comments" is to be able to write portable queries. The "comment" is not really a comment; rather, it's a real command that might not be understood by all servers. The manual uses this example:


SELECT /*! STRAIGHT_JOIN */ col1 FROM table1,table2 WHERE ...

So, a MySQL server would understand the STRAIGHT JOIN modifier, and execute the statement accordingly, whereas other servers won't recognize this as anything else but a comment, and ignore it entirely. A special form of this syntax is frequently used in the output of mysqldump to create database dumps that are portable between MySQL versions. Here's an example from the script for the world database:


/*!40101 SET NAMES latin1 */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;

So, you see, a n.nn.nn scheme is used to target only a specific version of the MySQL Server. I tried a little bit, and I found out that this does the trick pretty well:


/*!999999
*
*/

So, note that I'm using six digits rather than five. Of course, it's ugly, but it works. And yes, I might have a problem as soon as MySQL 99.99.99 hits the shelves, but maybe that is just enough time for me to learn enough C/C++ to come up with a satisfactory patch for the mysql command line client. :)

Sunday, December 03, 2006

MySQL meets OpenOffice.org

A lot of MySQL software development articles focus on the server-side. This is true even end-user applications: most MySQL applications I read about, are web applications.

I decided it would be fun to write down a few simple tips on how to use MySQL in Office applications. Yup, you heard me: using MySQL as "Desktop Database", quite like MS Access is being used on many, many windows desktops.

Of course, I do not want to suggest that MySQL "...is a Desktop Database", because it's not. Rather, MySQL is a general-purpose database, so there's no reason at all to dismiss it for typical "Desktop Database" purposes.

Even though MySQL is a server product, MySQL resembles a typical Desktop Database in a number of ways. For example, MySQL is not extravagantly resource-hungry, and does not require much maintenance - (not for this purpose anyway). So, there's not much against using MySQL here.

In fact, it's probably a good idea to back your Desktop Database applications with MySQL. A lot of the typical problems (resource contention due to locking, file corruption due to crashing application programs, scale limits, and much much more) that you encounter as these applications grow larger, are either not-existent or easily solved using a server solution such as MySQL.

So - let's dive in, and see what it takes to use MySQL in the daily office. Oh yeah - I once did a write-up on connecting MS Access and MySQL. This time, we won't be using MS Office at all. We'll use OpenOffice.org instead.

Some Prerequisites

All products described here are available for most popular architectures and operating system. Both MS Windows and Linux users should be able to get this to work.

The client computer should be able to directly access a local or remote MySQL Server instance. You need to know the hostname (or IP address) of the MySQL Server host. If the MySQL Server is configured to use a non-default port, you must know the portnumber too. You also need the name of the database you want to access. Finally, you also need to know the user name and password of an existing user that has privileges to acces the database.

I'll be using the sakila sample database in the following examples. It's not absolutely necessary to install it if you already have another database to play with. If you want it, you can download it here.

Client Software

Before you start, a few things need to be in place on the client computer. It may already have these items, or you might have to download and install them:

A Java runtime environment (jre) or Java Development Kit (JDK, if you also want to develop Java programs). In most cases, your operating system ships with some pre-installed implementation. However, I found out that a pre-installed implementation might not be exactly what you need. Personally, I'm quite happy with Sun's own implementation.

The OpenOffice.org suite. You can download it here or look for a package tailored for your operating system yourself.

The MySQL JDBC driver, Connector/J. You can download it here.

I won't discuss the installation procedure for OpenOffice.org or the Java runtime environment. As much as I'd like to, there's just too many variety to elaborate on this in a useful way. However, I trust few if any problems will arise obtaintaining and installing these items.

As for the MySQL JDBC driver: there's not much to install. Just unpack the download in a location you're comfortable with. You will find a .jar (Java archive) in the unpacked directory, and you will need the full path and filenename of the archive.

I like to keep several versions around of JDBC drivers, and to make life easier, I created a symbolic link named mysql-connector-java.jar to the latest version of the .jar file. That way, I can easily upgrade or downgrade as needed by flipping the symbolic link to another location. You don't need to do it this way - the symbolic link is purely for my convenience.

(So what do you do with your JDBC drivers? Where do you store them? How do you upgrade?)

Configure OpenOffice.org to run Java

First, you need to enable OpenOffice.org to use Java. To do that, follow these instructions:

Open one of the OpenOffice.org applications, such as Calc or Writer.

Use Menu \ Tools \ Options to configure OpenOffice.org

In the Options dialog, find the "Java" Category. Check the "Use a Java runtime environment" checkbox. You might see a list there of the Java runtime environments available on you system. It seems to automatically pick up all the Java runtime environments I installed in /usr/lib/jvm but I don't dare say it will be the same on your system. However, if you don't see any Java runtime environments there, and you are sure you installed one, you can use the "Add" button and browse to the corresponding directory:
Choose your preferred Java Runtime environment by checking the radiobutton.

By completing these steps, OpenOffice.org is able to run Java applets and programs. This is a prerequisite to enable the OpenOffice.org Base application to connect to MySQL.

(Note that if you add or change a Java Runtime environment, you will be prompted to restart open office:

Don't restart just now - we still need to configure OpenOffice.org to be aware of the MySQL JDBC driver.

Configure OpenOffice.org to use MySQL Connector/J

Although OpenOffice.org is now able to run Java, it is still unaware of the MySQL JDBC driver. Follow these instructions to enable OpenOffice.org to use the MySQL JDBC driver:

Open the OpenOffice.org options dialog, and select the "Java" category as described in the previous section. If you followed the instructions in the previous section, the dialog should still be open.

Use the "Class Path..." button to open the Class Path dialog.

In the Class Path dialog, add a reference to the MySQL Connector/J .jar file using the "Add Archive..." button:

Close the OpenOffice.org options dialog, close the application if you are prompted to restart.

Ok, now we're all set. Everything discussed so far was just one-time preparation, and you won't have to go through it again (except perhaps after upgrading java or OpenOffice.org).

Registering your database with OpenOffice.org Base

OpenOffice.org 2 includes Base, a MS Access like "Desktop Database" application. OpenOffice.org Base acts as a front end for a variety of datasources commonly found in office environments. Typical examples include text documents, spreadsheets, and LDAP addressbooks.

Base can also be used to design and develop simple relational database applications using the HSQLDB database embedded in OpenOffice.org. Base also provides the means to connect to remote rdbms servers through an ODBC or JDBC connection. Right now, we will explore this last usage of Base.

Follow these instructions to register your MySQL database in OpenOffice.org Base:

Start OpenOffice.org Base to enter the "Database Wizard". Check the "Connect to an existing database" radiobutton, and choose "MySQL" from the dropdown listbox. Click the "Next" button.
In the "Set up a MySQL Connection", select the "Connect using JDBC" radiobutton: As you can see, OpenOffice.org Base also supports ODBC connectivity (not discussed in this article), which might be more natural to Windows users. Click the "Next" button.

A dialog appeart to specify the connection details for the MySQL server to which you want to connect: The name of the MySQL JDBC driver class and the port number are already provided by the wizard.

Before filling out the remaining details, we should test if we set up the MySQL JDBC driver properly. Hit the "Test class" button to do that. If the driver can be loaded, you should see the following messagebox: If we are convinced that the driver can be loaded, we can fill out the remaining details to establish a connection. Confirm the messagebox and return to the Database Wizard dialog.

In the "Name of the database" field, type the name of the schema (database) you want use. It is required to specify a database: you cannot access other databases that might be managed by the Server simultaneously from one "Base" database file.

In the "Server URL" field, type the hostname or IP address of the MySQL Server host. You can use the "port number" field to specify a non-default port if necessary, but if you did not configure the MySQL Server to use another port than the default one, it should remain 3306.Hit the "Next" button

Type the name of a user that can access the database. Check the "Password required" checkbox of the user is required to identify itself with a password.

Hit the "Test connection" button to verify that the connection can be established.Close the message box to return to the Database Wizard.

Hit the "next" button to finish off:Select the radiobutton in order to register the dabase file in OpenOffice.org. That enables you to access this database in all you OpenOffice.org documents. Check the checkbox to continue working with the database.

Finally, we can hit the "Finish" button. You are prompted for a name and a location to save the "Base" database file.

If you followed all the instructions, the "Base" main screen will open, and you'll see something like this:

On the left side, we have a sidebar that shows the four different kinds of things you kan work with in "Base":

Tables - here, you'll find the base tables and views defined in your database.

Queries - here, you can define database queries, and save them for repeated use.

Forms - you can use forms to create simple database applications.

Reports allow you to layout data for presentation purposes

The right, main area of the window shows a task pane (top) and a content pane (bottom). When you select an item in the sidebar, the task pane shows all the tasks you can perform with these types of objects, and the content pane shows all instances available for this type of object.
Great! Now what?

Working with Tables

One of the great strengths of "Desktop Database" applications is ease of use - even for non-technical users. For example, once your database is registered in Base, end-users can directly open a table and work with the data it contains. Let's try that right away.

In the sidebar, single click the "Tables" icon to select it. The "Tasks" pane shows a few taks to create new tables, and the content pane shows all tables in the "Sakila" database. In the content pane, double click the "actor" table. You should see something like this:

Ok, nice. This little grid gives us direct access to the table data. Navigation is quit straightforward: use the mouse to point and click an arbitrary cell in the grid. Use the "tab" key to move to the next cell. Use shift+tab to navigate to the previous cell. The "down" arrow key takes you to the next row, and the "up" arrow key to the previous row.

Changing data is easy too: just navigate to cell and type to edit the data. If you're not happy with the edits, use the "Undo" toolbar button to undo them. Navigating out of the row will permanently store the edit. If you are afraid to accidentally change the data and uncheck the "Edit Data" checkbox on the "Edit" menu.

How about adding a new row? Scrolldown to the last record in the grid, and find the empty row at the end. Navigate into it, and edit. A new empty row is automatically appended. Delete is also easy: click on the left before the first field of the row to select it, and then hit the "delete" key.

The grid also some functionality that might help the end user to work with the data in a more effective manner.

Filtering - only those records are retrieved from the underlying database that match a condition. Use the toolbar buttons with the funnel icon to filter data.

Searching - the local set of rows in the grid is searched for records that match a particular condition. Use the toolbar button with the looking glass to search for particular records.

Sorting - the local set of rows in the grid is reordered according to a sorting key (and order).

My impression of the "Base" data grid is that it is a very useful tool to quickly inspect or enter data. It is far from perfect, and it is also unsuitable for normal application development, but it sufficient for maintaining and entering address lists and the like. Most end-users that are used to maintaining such lists in a spreadsheet application will be able to effectively use this interface instead. Also, besides registering the database in OpenOffice.org, it requires no effort at all to obtain this interface: no coding required.

Yeah sure, nifty grid. Now what?

The "Base" application offers two more items that can be interesting for end-users in an office environment: Forms and Reports.

Forms can be developed to build small end-user applications. For a number of reasons, the data grid just is not suitable to build a genuine end-user application. A lot of the limitations you have with grids can be lifted by creating an suitable form. I will discuss this topic on another occasion more thoroughly, and demonstrate how this can be done.

Reports are useful if you want to present data to the end user in a particular layout or grouping. Reports are also useful if you want to export your data to a pdf format, or if you want to print it. Personally, I think that "Base" reports are less suitable for typical analytical BI applications.

Databases registered in "Base" can be used throughout OpenOffice.org. So, although "Base" offers some functionality which might be used in an office environment, it is really only the start. For example, if you want to perform some kind of analysis using the OpenOffice.org spreadsheet application "Calc", you can draw that from one of the registered databases.

Another typical example for the OpenOffice.org wordprocessor "Write" would be to combine data from the database with letters, or address labels to automate postal mail communication.

Right now, I won't discuss these applications in any detail, but I'll try and write up a tutorial on those subjects soon. Of course, if anybody else wants to share their story on integrating their database in open office, please do not hesitate and add a comment to this post.

Wednesday, November 22, 2006

MySQL: A Stored Procedure to Create FEDERATED Tables.

The FEDERATED storage engine is a MySQL feature that lets you access the tables (and views) on a remote MySQL server. In terms of purpose and functionality, it best compares to a "linked table" in MS Access, and to a lesser extent, to Oracle database links.

Creating a federated table is tedious, because in the local database, you need to define a table with exactly the same structure as the remote table. That includes the index definitions, because these are used by the local optimizer to create a query plan (although the statistics for the index will not be locally available)

I created a procedure to make life easier (well, at least as far as creating FEDERATED/ tables is concerned). You can grab if here:

http://forge.mysql.com/snippets/view.php?id=54 https://github.com/sordidfellow/MysqlSnippets/blob/master/p_create_federated_table.sql

Sample1:


call p_create_federated_table(
  '192.168.0.100'
, 3306
, 'my_remote_username'
, 'my_remote_password'
, 'my_remote_schema'
, 'my_remote_table'
, 'my_local_schema'
, 'my_local_table'
);

This sample will create a FEDERATED table named 'my_local_table' in the schema 'my_local_schema'. The created table will point to the table 'my_remote_table' in the schema 'my_remote_schema' on the remote MySQL Server on the host 192.168.0.100 that is listening on port 3306

Sample2:


call p_create_federated_table(
  'myhost.mydomain.org'
, NULL
, NULL
, NULL
, 'my_schema'
, 'my_table'
, NULL
, NULL
);

This sample will create a FEDERATED table with the same name as the specified remote table, and in the local schema with the same name as the specified remote schema. The remote mysql server is listening on the default mysql port (usually 3306) on the host myhost.mydomain.org'.

If you want to know how it works exactly: the procedure has got pretty good comments. Go get it at http://forge.mysql.com/snippets/view.php?id=54 github, and let me know what you think.

Code

Stephen kindly went into the wayback machine and retrieved the code that used to be on the now defunct MySQL Forge website. Thank you stephen! You can also view the code on github.

delimiter $$

drop  procedure if exists 
p_create_federated_table
$$
create procedure 
p_create_federated_table

(
    -- the ip address or name of the remote mysql server host (if NULL, 'localhost')
    p_remote_host     varchar(32)   
    -- the port where the remote  mysql server is listening (if NULL, 3306)
,   p_remote_port     int unsigned  
    -- the user on the remote server that accesses the table (if NULL, 'root') 
,   p_remote_user     varchar(16)   -- 
    -- the password for the remote user (if NULL, omitted)
,   p_remote_password varchar(32)
    -- the schema in which the remote table resides
,   p_remote_schema   varchar(64)
    -- the name of the remote table
,   p_remote_table    varchar(64)
    -- the local schema to create the local FEDERATED table (if NULL, p_remote_schema)
,   p_local_schema    varchar(64)
    -- the name of the local FEDERATED table (if NULL, p_remote_table)
,   p_local_table     varchar(64)
)
LANGUAGE SQL
NOT DETERMINISTIC
MODIFIES SQL DATA
SQL SECURITY INVOKER
COMMENT 'Creates a FEDERATED table.'
/*
    Changelog
    WHEN?      WHO? WHAT?
    -----------------------------------------------------------
    2007-02-20 RPB  added handler to ignore warning 1366. Not sure why this appears.
                    lowered the group_concat_max_len (bug #23856)
                    removed the ORDER BY clauses from the GROUP_CONCATs on COLUMN_TYPE (bug #23856)
                    added output so we can see what's taking so long
                    added changelog
    2006-11-20 RPB  Created Initial version  
*/ 
begin
    -- size used for the GROUP_CONCAT buffer if current is lower
    -- please see http://bugs.mysql.com/bug.php?id=23856
    declare v_group_concat_max_len smallint 
        default 16384;
    -- stores the original size of the GROUP_CONCAT buffer to restore it
    declare v_old_group_concat_max_len int unsigned
        default @@group_concat_max_len;
    -- stores the original sql_mode
    declare v_old_sql_mode varchar(255)
        default @@sql_mode;
    -- Used to drop temporary tables
    declare v_drop_table_name varchar(64);

    -- set the GROUP_CONCAT buffer sufficiently large
    set @@group_concat_max_len := greatest(
        v_group_concat_max_len
    ,   v_old_group_concat_max_len
    );
    -- set the sql_mode to default to prevent invalid column defaults
    set @@sql_mode := '';

    -- The following block contains all the meat
    -- We use a separate block to allow for proper error handling.
    -- All errors that might be expected are handled inside this block
    -- This should guarantee that the outer block is always completed.
    -- That is necessary, beause we need to do a little cleaning up 
    -- before exiting the procedure.
    begin
        -- Used for defaulting the specified host
        declare v_remote_host     varchar(32) 
            default coalesce(p_remote_host,'localhost');
        -- Used for defaulting the specified user
        declare v_remote_user     varchar(16) 
            default coalesce(p_remote_user,'root');
        -- Used for defaulting the local schema
        declare v_local_schema    varchar(64)
            default coalesce(p_local_schema,p_remote_schema);
        -- Used for defaulting the local schema
        declare v_local_table     varchar(64)
            default coalesce(p_local_table,p_remote_table);
        -- Holds the connectstring prefix for FEDERATED tables
        declare v_connectstring   varchar(255)
            default concat(
                'mysql://'
            ,   v_remote_user
            ,   if(p_remote_password is null
                ,   ''
                ,   concat(':',p_remote_password)
                )
            ,   '@'
            ,   v_remote_host
            ,   if(p_remote_port is null
                ,   ''
                ,   concat(':',p_remote_port)
                )
            ,   '/'
            );

        -- Various conditions we might encounter
        -- We rename them just for clarity
        declare TABLE_EXISTS_ERROR condition FOR 1050;
        declare UNKNOWN_COLUMN_ERROR condition FOR 1054;
        declare SYNTAX_ERROR condition FOR 1064;
        declare GROUP_CONCAT_TRUNCATION_ERROR condition FOR 1260;
        declare INCORRECT_VALUE_ERROR condition FOR 1366;
        declare TRUNCATION_ERROR condition FOR 1406;
        declare FEDERATION_ERROR condition FOR 1429;

        -- Various handlers. These will execute if one of the conditions occur.
        -- They all show a friendly error message and the exit the inner block.
        -- Execution is resumed at the clean up code, just before 
        -- the end of the proceudre
        declare exit handler for FEDERATION_ERROR 
            select  'Federation error' error_type
            ,       'Check the connectstring.' error_message
            ,       v_connectstring connectstring
            ;
        declare exit handler for GROUP_CONCAT_TRUNCATION_ERROR 
            select  'GROUP_CONCAT Truncation' error_type
            ,       'Increase GROUP_CONCAT_MAX_LEN.' error_message
            ,      @@group_concat_max_len group_concat_max_len
            ;
/*
        declare exit handler for SYNTAX_ERROR 
            select 'Syntax Error' error_type
            ,      'Check this statement.' error_message
            ,      @create_table_statment statement
            ;
*/
        declare exit handler for TABLE_EXISTS_ERROR 
            select 'Table Exists' error_type
            ,      'Drop the table first.' error_message
            ,      concat(v_local_schema,'.',v_local_table) table_identifier
            ;
        declare exit handler for UNKNOWN_COLUMN_ERROR 
            select 'Unknown Column' error_type
            ,      'Unexpected error, check the statement.' error_message
            ,      @create_table_statement table_identifier
            ;
        declare exit handler for NOT FOUND 
            select 'No such Table' error_type
            ,      'The requested table was not found on the remote host.' error_message
            ,      concat(p_remote_schema,'.',p_remote_table) table_identifier
            ;
 declare continue handler for INCORRECT_VALUE_ERROR 
            select 'We hit warning 1366. It''s probably nothing serious.'
            ;
        declare exit handler for SQLEXCEPTION
            select 'SQL Exception' error_type
            ,      'Unexpected generic error. Debug the procedure.' error_message
            ,      concat(
                           'CALL ',schema(),'.', p_create_federated_table,'(' 
                   ,       if( p_remote_host is null
                           ,   'NULL'
                           ,   concat('''',p_remote_host,'''')
                           )
                   ,',',   if( p_remote_port is null
                           ,   'NULL'
                           ,   p_remote_port
                           )
                   ,',',   if( p_remote_user is null
                           ,   'NULL'
                           ,   concat('''',p_remote_user,'''')
                           )
                   ,',',   if( p_remote_password is null
                           ,   'NULL'
                           ,   concat('''',p_remote_password,'''')
                           )
                   ,',',   if(
                               p_remote_schema is null
                           ,   'NULL'
                           ,   concat('''',p_remote_schema,'''')
                           )
                   ,',',   if( p_remote_table is null
                           ,   'NULL'
                           ,   concat('''',p_remote_table,'''')
                           )
                   ,',',   if( p_local_schema is null
                           ,   'NULL'
                           ,   concat('''',p_local_schema,'''')
                           )
                   ,',',   if( p_local_table is null
                           ,   'NULL'
                           ,   concat('''',p_local_table,'''')
                           )
                   ,   ')'
                   ) call_command
            ; 

        -- The following block creates temporary federated tables 
        -- on the remote information_schema. 
        -- These are needed to generate the structure 
        -- of the local federated table. 
 select 'Getting remote metadata...';
        begin
            -- Prefix used for the temporary tables
            declare v_temp_table_prefix char(25)
                default 'p_create_federated_table$';
            -- Used to fetch the generated DDL from the cursor
            declare v_create_table_statment text;
            -- Cursor loop control variable
            declare v_no_more_rows boolean
                default FALSE;
            -- Cursor generates DDL for createing temporary federated
            -- tables on the remote information_schema.
            -- We need this to generate the DDL to create the actual 
            -- federated table specified by the user.
            declare csr_metadata cursor for 
                select      table_name
                ,           concat(
                                'create temporary table'
                            ,'\n',schema(),'.'
                            ,v_temp_table_prefix,table_name,'('
                            ,'\n',group_concat(
                                    column_name
                                ,   ' '
                                ,   column_type
                                ,   if(
                                        character_set_name is null
                                    ,   ''
                                    ,   concat(
                                            ' character set '
                                        ,   character_set_name
                                        ,   ' collate '
                                        ,   collation_name
                                        )
                                    )
                                ,   if( is_nullable='NO'
                                    ,   ' NOT NULL'
                                    ,   ''
                                    )
                                    separator '\n,'
                                )
                            ,'\n',')'
                            ,'\n','engine = federated'
                            ,'\n','connection = '
                            ,'\n',''''
                            ,     v_connectstring
                            ,     table_schema
                            ,     '/'
                            ,     table_name
                            ,     '''' 
                            )
                from        information_schema.columns
                where       table_schema = 'information_schema'
                and         table_name IN (
                                'COLUMNS'
                            ,   'STATISTICS'
                            ,   'TABLE_CONSTRAINTS'
                            )
                group by    table_schema
                ,           table_name
                ;
            -- handler to control the cursor loop
            declare continue handler for NOT FOUND 
                set v_no_more_rows := TRUE;

            set @drop_temporary_tables_statement := null;

            -- loop through the cursor
            open csr_metadata;
            my_loop: loop
                -- get the DDL for the temporary federated 
                -- information_schema table
                fetch csr_metadata 
                into  v_drop_table_name
                ,     v_create_table_statment;
  
                -- basic cursor loop control exits if cursor is exhausted
                if v_no_more_rows then
                    close csr_metadata;
                    leave my_loop;
                end if;
                -- build a statement to drop all temporary tables
                set @drop_temporary_tables_statement := if (
                    @drop_temporary_tables_statement is null
                ,   concat(
                        'DROP TEMPORARY TABLE IF EXISTS '
                    ,   v_temp_table_prefix
                    ,   v_drop_table_name
                    )
                ,   concat(
                        @drop_temporary_tables_statement
                    ,   ','
                    ,   v_temp_table_prefix
                    ,   v_drop_table_name
                    )
                );

                -- kludge: need a user variable to execute the DDL string
                -- dynamically with the PREPARE syntax
                set @create_table_statment := v_create_table_statment;

                -- create the temporary federated information_schema table 
                prepare stmt from @create_table_statment;
                execute stmt;
                deallocate prepare stmt;

            end loop;
        end;
 -- Reset the variable. Mainly to simplify debugging
 select 'Generating CREATE TABLE statement for FEDERATED table...';
 set @create_table_statment:='...generating statement...';
        -- This creates the actual ddl for the requested local FEDERATED table.
        -- It selects the DDL directly into the user variable. 
        -- It does this by querying the remote information_schema.
        -- This DDL includes the index definitions of the remote table.
        select      concat(
                        'create table if not exists'
                    ,'\n','`',v_local_schema,'`'
                    ,'.' ,'`',v_local_table,'`'
                    ,     '('
                    ,'\n',column_definitions
                    ,     coalesce(index_definitions,'')
                    ,'\n',')'
                    ,'\n','engine = federated'
                    ,'\n','connection = '
                    ,'\n',''''
                    ,     v_connectstring
                    ,     column_definitions.table_schema
                    ,     '/'
                    ,     column_definitions.table_name
                    ,     '''' 
                    ) stmt
        into        @create_table_statement
        from        (
                    select      table_schema
                    ,           table_name 
                    ,           group_concat(
                                        '`',column_name,'` '
                                    ,   column_type
                                    ,   if(
                                            character_set_name is null
                                        ,   ''
                                        ,   concat(
                                                ' character set '
                                            ,   character_set_name
                                            ,   ' collate '
                                            ,   collation_name
                                            )
                                        )
                                    ,   if( is_nullable='NO'
                                        ,   ' not null'
                                        ,   ''
                                        )
                                    ,   if( column_default is null
                                        ,   ''
                                        ,   concat(
                                                ' default '
                                            ,   case
                                                    when data_type = 'TIMESTAMP' 
                                                    and  column_default = 'CURRENT_TIMESTAMP'
                                                        then column_default
                                                    when data_type like '%char' 
                                                    or   data_type like 'date%'
                                                    or   data_type like 'time%'
                                                    or   data_type in ('set','enum')
                                                        then concat('''',column_default,'''')
                                                    else column_default
                                                end
                                            )
                                        )
                                    ,   if(extra='','',concat(' ',extra))
                                    ,   ' comment '
                                    ,   '''',column_comment,''''                                        
                                        separator '\n,'
                                ) as column_definitions
                    from        p_create_federated_table$columns
                    where       table_schema = p_remote_schema
                    and         table_name   = p_remote_table
                    group by    table_schema
                    ,           table_name
                    ) column_definitions
        left join   (
                    select      index_definitions.table_schema
                    ,           index_definitions.table_name 
                    ,           concat(
                                    '\n,'
                                ,   group_concat(
                                        case c.constraint_type 
                                            when 'PRIMARY KEY' then 
                                                constraint_type
                                            when 'UNIQUE' then 
                                                concat(
                                                    'CONSTRAINT '
                                                ,   constraint_name
                                                ,   ' '
                                                ,   constraint_type
                                                )
                                            else
                                                concat(
                                                    if( index_type in (
                                                            'FULLTEXT'
                                                        ,   'SPATIAL'
                                                        )
                                                    ,   concat(
                                                            index_type
                                                        ,   ' '
                                                        )
                                                    ,   ''
                                                    ) 
                                                ,   if( non_unique
                                                    ,   ''
                                                    ,   'UNIQUE '
                                                    )
                                                ,   'INDEX '
                                                ,   '`',index_name,'`'
                                                )
                                        end
                                    ,   index_columns
                                    ,   if( index_type in (
                                                'BTREE'
                                            ,   'HASH'
                                            )
                                        ,   concat(
                                                ' USING '
                                            ,   index_type
                                            )
                                        ,   ''
                                        )
                                        order by c.constraint_type
                                        separator '\n,'
                                    )
                                )    as index_definitions
                    from        (
                                select      table_schema
                                ,           table_name
                                ,           index_name
                                ,           index_type
                                ,           non_unique
                                ,           concat(
                                                '('
                                            ,   group_concat(
                                                    '`',column_name,'`'
                                                ,   if( sub_part is null
                                                    ,   ''
                                                    ,   concat(
                                                            '(',sub_part,')'
                                                        )
                                                    )
                                                    order by seq_in_index
                                                )
                                            ,   ')'
                                            ) index_columns
                                from        p_create_federated_table$statistics
                                where       table_schema = p_remote_schema
                                and         table_name   = p_remote_table
                                and         index_type not in ('FULLTEXT')
                                group by    table_schema
                                ,           table_name
                                ,           index_name
                                ,           index_type
                                ,           non_unique
                                ) index_definitions
                    left join   (
                                select      table_schema
                                ,           table_name
                                ,           constraint_name
                                ,           constraint_type
                                from        p_create_federated_table$table_constraints c
                                where       table_schema    = p_remote_schema
                                and         table_name      = p_remote_table
                                and         constraint_type in (
                                                'PRIMARY KEY'
                                            ,   'UNIQUE'
                                            )
                                group by    table_schema
                                ,           table_name
                                ,           constraint_name
                                ,           constraint_type
                                ) c
                    on          index_definitions.table_schema = c.table_schema
                    and         index_definitions.table_name   = c.table_name
                    and         index_definitions.index_name   = c.constraint_name
                    group by    table_schema
                    ,           table_name
                    ) index_definitions
        on          column_definitions.table_schema = index_definitions.table_schema
        and         column_definitions.table_name = index_definitions.table_name
        ;

        -- Create the actual FEDERATED table by dynamically executing
        -- the generated DDL for the requested FFEDERATED table. 
 select 'Creating FEDERATED table...';

        prepare stmt from @create_table_statement;
        execute stmt;
        deallocate prepare stmt;

        -- Print a friendly message that we succeeded
        select   'Success' completion_type
        ,        concat(
                        'Created FEDERATED table '
                    ,   v_connectstring,'/',p_remote_schema,'/',p_remote_table
                    ) completion_message
        ,        concat(v_local_schema,'.',v_local_table) table_identifier
        ;
    end;

    -- Cleanup: restore the original sql mode 
    set @@sql_mode := v_old_sql_mode;
    -- Cleanup: restore the original GROUP_CONCAT buffer size
    set @@group_concat_max_len := v_old_group_concat_max_len;

    -- Cleanup: drop the temporary federated information_schema tables.
    prepare stmt from @drop_temporary_tables_statement;
    execute stmt;
    deallocate prepare stmt;

    -- Cleanup: reset the user defined variables.
    set @create_table_statment := null
    ,   @drop_temporary_tables_statement := null
    ;
end;
$$

delimiter ;

Wednesday, October 18, 2006

Refactoring: Derived table, UNION...WTF?

I greatly admire the works of the Dutch literary author Gerard Reve (1923 - 2006). On many occasions, he was asked whether his stories were real-life stories, and he always answered like this:

If you mean "did this sequence of events factually take place?" then I can be brief: No. And that's a good thing too, because if a writer would describe reality, the result would be very hard to believe, if not completely inconceivable. The course taken by real life is just too crazy. A true description is bound to be seem like a constructed mannerism and no reader can be expected to believe even one word of it.
However, as far as the single events are concerned, I exclusively use only those things bourne from reality.

If you ever read The Daily WTF you will probably agree that this point of view makes a lot of sense. Each time a really crazy or insensible piece of code pops up on the Daily's there are always people commenting, saying something like:

No...this just cannot be true. I refuse to believe that software developers from large professional organisations can deliver this kind of crap..and get away with it.

Well, the odds are that indeed the pieces of code that appear there should not be taken too literally. I mean, for one thing, the code appears with indentation, and sometimes even with comments: no way this is real code. Real Code is Worse than what you'll ever find on The Daily WTF.

My take on WTF

The reason I'm choosing this topic because I was recently asked to take a look at a database query which was giving "...trouble...". Usually this means: "...not performing as fast as we'd like...".

In this case, the people having trouble were actually consultants from a professional IT service provider. They were trying to migrate some enterprise application from a big, operational Oracle database to MySQL 5.0 Server. Their efforts were all experimental, and so ar, they were quite enthusiastic about MySQL. However, there were two troublesome queries performancewise. (I will discuss only one - the other one is also a story on it's own.)

They found that in this particular case, Oracle took 30ms whereas MySQL took a little over one minute. Of course this is a huge difference, and I was immediately challenged.

The query I got was somewhat like this:


select  *
from    (
        select  film_id
        ,       title
        ,       release_year
        ,       'PG'
        from    film
        where   rating in ('PG','PG-13')
        union
        select  film_id
        ,       title
        ,       release_year
        ,       'R'
        from    film
        where   rating in ('R','NC-17')
        union
        select  film_id
        ,       title
        ,       release_year
        ,       rating
        from    film
        where   rating not in ('PG','PG-13','R','NC-17')
        ) film
where   release_year > 2005
and     release_year <= 2006

Well, of course, their enterprise application was not built on the sakila sample database: I stylized the original query to make it tangible. But don't let that distract you - just look.

Now I'm not saying that I'm not guilty of writing bad code. I mean, we all started somewhere, that is: from nothing. We all had to learn, and alas, in a lot of cases, you can only learn by making mistakes. However, things like the Daily WTF should be valuable eye-openers for developers and software engineers to always remain focussed on monitoring the quality of code. Even, or especially if it's their own.

So, instead of taking the chuckly attitude of laughing at someone else's apparent WTF, I'll try and do a step by step refactoring of this query. This will probably not be interesting for advanced users, and they can still have their laugh. If you can't appreciate the WTF-edness, keep reading: this article is for you.

A Simple Query

What we see here is essentially a simple query. A subselect appears in the FROM-clause: this is sometimes referred to as a derived table or an inline-view. If we forget the complexity of the subquery for a while and pretend that it actually is a real view, we can see easily that it really is simple:


select  ...
from    (/*complexity goes here*/) film
where   ...

So, no JOIN, no GROUP BY, and not correlelated subselects. Normally, this query should be quite fast, right? So why is it so slow now?

Well, we could use the basic performance and tuning instincts, and start asking questions like:

What engine is used for the film table?

Is there an index on release_year?

What is the size of the supa_dupa_xyz buffer?

However, I don't think that's really necessary at this point. Right now, we need to know but one thing, and that is that the resultset of an inline view is computed before as a separate step.

Using `EXPLAIN`

We sure can witness this using EXPLAIN:


explain
select  *            
from    (                    -- 1               
        ...                  --    2
        UNION
        ...                  --    3
        UNION
        ...                  --    4
        ) film               --        union result 
where   ...
;
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type  | table        | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
|  1 | PRIMARY      | <derived2>   | ALL  | NULL          | NULL | NULL    | NULL | 1000 | Using where |
|  2 | DERIVED      | film         | ALL  | NULL          | NULL | NULL    | NULL |  952 | Using where |
|  3 | UNION        | film         | ALL  | NULL          | NULL | NULL    | NULL |  952 | Using where |
|  4 | UNION        | film         | ALL  | NULL          | NULL | NULL    | NULL |  952 | Using where |
|    | UNION RESULT | <union2,3,4> | ALL  | NULL          | NULL | NULL    | NULL | NULL |             |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+

The explain result is not too hard to interpret: The first row has the select_type equal to PRIMARY and this corresponds to the outer query. The table mentions that it queries <derived2>, which means it queries on the subquery (I don't know why it does not use the subquery alias - it just doesn't).

The second, third and fourth row in the explain result refer to the queries that are UNION-ed together inside the subselect. It's a bit puzzling that the select_type for the second row is marked DERIVED, but it just means it's the first thing that is evaluated as part of the subquery.

The third and fourth row correspond to the sets that are to be UNION-ed to the total result of the subquery, and that's why you see UNION there in the select_type column.

For those that do not know: UNION is a so-called set operator, it takes the resultsets produced by the queries on the left and right hand side, and constructs a new resultset out of it that contains the rows from the resultset operands. So, UNION 'adds up' rows in a vertical manner. (The much used and well known JOIN operator 'adds up' resultsets in the horizontal direction)

The final row represents the actual 'adding up' of the all the rows from the three queries inside the subselect, and that's why it has UNION RESULT in the select_type column. This step is not trivial, because UNION is required to filter out any duplicate rows in the final result after applying the UNION. More about that later.

Do we really need a derived table?

Now, we were about to think whether we really need the derived table. What would happen if we'd simply expand the contents of the subquery? It will require some work, because we would have to duplicate the WHERE clause of the outer query into the individual UNION-ed queries:


select  film_id
,       title
,       release_year
,       'PG'
from    film
where   rating in ('PG','PG-13')
and     release_year > 2005
and     release_year <= 2006
union
select  film_id
,       title
,       release_year
,       'R'
from    film
where   rating in ('R','NC-17')
and     release_year > 2005
and     release_year <= 2006
union
select  film_id
,       title
,       release_year
,       rating
from    film
where   rating not in ('PG','PG-13','R','NC-17')
and     release_year > 2005
and     release_year <= 2006

So, what would this do to the result from EXPLAIN? Well, we don't expect the subquery step again in oour result:


+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type  | table        | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
|  1 | PRIMARY      | film         | ALL  | NULL          | NULL | NULL    | NULL |  952 | Using where |
|  2 | UNION        | film         | ALL  | NULL          | NULL | NULL    | NULL |  952 | Using where |
|  3 | UNION        | film         | ALL  | NULL          | NULL | NULL    | NULL |  952 | Using where |
|    | UNION RESULT | <union1,2,3> | ALL  | NULL          | NULL | NULL    | NULL | NULL |             |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+

The result does not show a huge difference from what we had: first our PRIMARY step selected from the derived table - now it does so directly from the union result. However, in reality, it can make all the difference.

In our original query, the whole union result was first built, and after that, the WHERE-clause on release_year was applied. This means that the whole of the union result must be scanned row by row to filter for the right release_year.

The modification 'pushed down' the WHERE-clause and applied it directly on the individual sets. Apart from enabling us to cut out the subselect (and thus, the extra scan), a nice side effect is that this could potentially make the UNION result faster too, because each WHERE-clause could filter out rows, leaving less total rows to 'sum up' by the UNION operation.

So, when do I use a derived table?

Why would anybody ever use a inline view like this? What's the point, can't we always expand it into the outer query? Well, No. There is one case where it is literally impossible to expand it. That's when you're aggregating aggregates.

An example: suppose you have a soccer team database, and you want to take the average of the maximum number of goals per season per player. That's two levels of aggregation, first max(no_of_goals) per player per season, and then, the average of that (also per player per season).

You can find another example here on nesting repeated groups.

Why a `UNION`? Can't we use `UNION ALL`?

I just mentioned that the UNION operator discards duplicate rows from the final resultset. Although that can be a nice functionality now and then it really is rarely needed.

In most cases, the sets in the union don't have any overlap. Can we say something about this particular query? Well, first of all, the three queries all select the same columns from the film table, so potentially there could be overlap between he resultsets. However, the first two queries each select a different constant in the SELECT-list:


select  ..
,       'PG'
from    ..
where   ..
union
select  ..
,       'R'
from    ..
where   ..

So, by definition, these two queries can never yield duplicates, meaning that we don't need the UNION to discard any duplicate rows: on the contrary, we'd rather have it skip that step because that is most likely faster. We can do that using UNION ALL:


select  ..
,       'PG'
from    ..
where   ..
union all
select  ..
,       'R'
from    ..
where   ..

It's actually a good idea to always write UNION ALL in case you know for sure that there is no everlap. Even if performance is not an issue, it makes the code more explicit. This makes it easier for the database to devise a query plan, and it will also be easier to fellow developers to see the exact intention.

Concerning the third query in the UNION, what is happening with that one? Well, it also selects from the code>filmtable, and again, it uses the same columns. However this one does not select a constant, but the rating column.

Now if we only could somehow prove that the values in the rating column here will never contain either of the values 'PG' or 'R' (the constants selected by the previous two queries) we must conclude that this resultset can never overlap with the resultset from either of the other queries. In other words, we would be able to safely use the UNION ALL trick here too.

The WTF-edness

We can find out if the third query will ever select a 'PG' or 'R' in the rating column by loooking at the WHERE-clause:


where   rating not in ('PG','PG-13','R','NC-17')

Mmm, the WHERE-clause explicitly excludes the 'PG' and 'R' values. At least, that means that we can use the UNION ALL trick here too.

But wait a minute...let's look at the other WHERE-s too, ok?


select  ..
from    ..
where   rating in ('PG','PG-13')
and     ..
union all
select  ..
from    ..
where   rating in ('R','NC-17')
and     ..
union all
select  ..
from    ..
where   rating not in ('PG','PG-13','R','NC-17')
and     ..

Now that's beatiful: the third set selects exactly all the data not selected by the first and second. This means we can skip all the mumbo-jumbo and just write:


select  film_id
,       title
,       release_year
,       case rating
            when 'PG-13' then 'PG'
            when 'NC-17' then 'R'
            else rating
        end
from    film
where   release_year > 2005
and     release_year <= 2006

So the CASE expression now makes up for the rating column.

Needless to say that the explain output has shrunk quite a lot:


+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | film  | ALL  | NULL          | NULL | NULL    | NULL |  952 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+

Finally...

Just a short summary:

Real-life code can be very bad

Avoid derived tables

Use UNION ALL instead of UNION

Sadly I have not been able to test it myself. I'm quite confident that the reasoning behind this refactoring is sound, but you never know...

If you have any questions or suggestions, post a comment to the article.

Friday, October 06, 2006

Nedap/Groenendaal ES3B Voting Machines: Why Manufacturing Processes Should not be Closed

Here's a quick summary of the security analysis (8M pdf, in English) of the ES3B voting computer, manufactured by Nedap/Groendendaal. This type of apparatus is used to collect about 90% of all the votes for local and federal elections held in the Netherlands. (A slightly modified version of this type of voting computer is also used in Germany and France.)

The analysis is performed by a Dutch Citizens' Movement, whose name can be roughly translated into the We Don't Trust Voting Computers Foundation. Despite the arguably provocative name, their research project (various materials, in english) and the culminating report concerning the Nedap/Groenendaal voting computer seem fairly sound and objective.

To put it mildy, the report does not shine a favourable light on the quality of the voting machine for its intended purpose. The findings in the report appeared on Dutch national television on 5 october 2006. The publication of the report was slashdotted.

The whole affair would be quite entertaining (the report learns you how to turn the voting computer into a fairly lousy but fully functional chess-player) if not for the disturbing denial by both the Dutch authorities and the manufacturer concerning the obvious risks of fraudulent abuse of these voting machines if used in democratic elections. The total lack of convincing procedures to test the correct functioning of the machines are equally uncanny. The report describes fairly simple fixes for all of the flaws found during a months time research of only a few of these voting machines.

The final conclusion is that although the design of the machines includes data redundancy and independence of remote resources to remedy technical failures, no efforts seem to have been made at all to build safety measures against malignant use for fraudulent purposes. Instead, the manufacter openly advocates reliance on obscuring details about both the hardware and the sofware, claiming that openness would only benefit a small elite, which probably consists mainly of characters that seek to abuse that knowledge for malignent purposes.

Some quotes from the report:

Dutch election law requires physical keys to be used as part of an electronic voting system.

...

The key system chosen [...] for [the] locks [...] always comes with the same key [...] the same key is used [...] throughout The Netherlands. Spare keys can be ordered [...] for roughly a Euro [....]. We ordered [...] 100 of these keys without any problem. [...] typical applications for this lock include “copy machines and office furniture” [...] this [...] type of lock [can be opened] with a bent paperclip.

...

The Nedap ES3B system as it is in use by a typical Dutch municipality consists of multiple S3B voting computers, at least as many ballot memory modules, a reader unit to be attached to a PC via the serial port and an installed copy of the ISS (Integraal Stem Systeem) software running on a PC under Microsoft Windows.

(ISS or Integraal Stem Systeem translates to Integral Voting System)
For those that are more visually orientated, a slide of a typical setup is available from the manufacturers' site:

More quotes from the report:

The ISS software has a ‘maintenance mode’ that is supposed to be only accessible to members of the “verkiezingswacht”, the Nedap election-day helpdesk. You need a password to get the software in this mode. A quick look in the binary revealed this password to be “GEHEIM”, the Dutch word for “SECRET”. The maintenance mode, among other things, allows the helpdesk to read the binary contents of a ballot module plugged into the programming slot of a reader unit. By sniffing the serial commands between the ISS software and the reader unit, we figured out how to issue these commands ourselves and subsequently wrote a program in Tcl that we could use to read the entire contents of a ballot memory module.

...

We claimed [that] the Nedap was just another computer, and [that it could be programmed] to play chess or to lie about the election results. ...[the manufacturer spokesperson said that] "[with] regard to the claim that our machine can play chess: I’d like to see that demonstrated”. [...] our first goals now [...] was to make [the voting machine] play chess.

For the benefit of the non-Dutch readers, here are a few relevant paraphrases translated from the manufacturer's spokesperson (which appears to be the owner of the business himself, Jan Groenendaal):

It may be true that the technical knowledge concerning the operation of the system is known to only a limited group, but that does not have to be a problem.

...

I would very much like to see a demonstration of the statement that it is also possible to play a game of chess with our voting machines.

...

We understand the concern raised against personal computer based voting machines...However, our voting machine is a Dedicated Special Purpose Machine, meant for counting votes and nothing else. It merely records the pressing of buttons on a keyboard, but in very, very secure manner. [...] In addition, we have always taken care for the voting machines to work "stand alone", that is: without any network connections to exclude any external influences. Therefore, Hackers absolutely don't stand a chance.

...

One of the objections we repeatedly face is that the source code of the embedded software is not publicly available. There is some merit in that argument. However this is caused by the government policy which is directed at providing municipalities with voting machines from commercial vendors that operate in a free market. Because of competition, vendors shield their intellectual properties. [...] This should not be blown out of proportion.

...

Opening the sourcecode would allow only a small circle of people to judge it, which when then form just a very small new elite. For sure, this is no guarantee that individuals can understand the machine's operation. [...] "Open Source Software" advocates claim an increase in software quality when there is freedom to propose or contribute enhancements. This may be true for certain wide-spread applications. Elections are however not such an application. Opening up the source increases the possibilities for an attack by malignant forces. The fact that only a small group of people have inside knowledge might also be regarded as a positive thing.

Well, after reading this I think most people can probe the satisfaction the hackers must've felt after all the efforts they must've endured before they could finally write this:

After having learned roughly how the hardware worked we used a gcc 68000 crosscompiler to create a Nedap IO-library containing functions to initialize the system, write data to the display, read the keyboard, and write debug messages to the UART. [...] we then managed to compile and run Tom Kerrigan's Simple Chess Program (TSCP). This was non-trivial only because we had to squeeze out quite a few tables to make it run using only the available 16 kBytes of RAM.

...

It knows all the rules and every now and then it can be surprisingly clever for what it is. But in all honesty we have to admit that it does not play chess all that well.

Then, playtime is definitevely over:

When we started to think about demonstration software that would lie about election results (called “Nedap PowerFraud”), we kept in mind that the system should not lie after an election that was obviously a test of the system. We decided we needed to store the votes and only decide whether or not to perform the fraud at the moment the election was closed, so our program would have as much information as possible to make that decision.

...

The ES3B’s EEPROM [...stores...] a few system configuration parameters [...] and some settings [...] most of the space is used for two circular buffers holding [logs]. In these logs, the device keeps the system time [...] these times are not as helpful as one might think [...] We updated the circular buffer routines that deal with the error log [...] making space for our stolen votes [...] we steal votes only from the number one on each list. Since the majority of voters pick the first candidate on a given party’s list, this is quite acceptable.

...

We then built “hooks” into the regular ES3B code. Every time a voter casts a ballot, our code generates a random number [...] If the number is below the programmed percentage of votes we want to steal, that vote is not written to the ballot module [but to the corrupted log]. At the end of the election, our software determines whether this was a real election or not. It then [writes] these votes into [...] the ballot module, just like the real software does.

Now it might seem far fetched to reprogam the EEPROM's. However:

To determine the recipient of the stolen votes, PowerFraud does a case-insensitive match of all party names with a programmed string. If it finds a match, that party becomes the recipient of he stolen votes. This allows for the fraudulent EPROMs to be inserted long before the candidate lists are known, and it allows a fraudulent ROM to perform the same fraud year after year, even though the relative position of the party on the keyboard changes. It is significant to note that the Dutch interior ministry assumes this to be impossible. A recent statement 13 says: “Fraud during the production of voting machines does not make sense because the lists of candidates are not known then.

The authors then show that in the present implementation, there is not much chance of detecting the fraudulent behaviour of the voting machine. Also, there are no physical means that allow anybody to check whether or not the machine has been tampered with. So basically, all it takes once you have the reporgrammed EEPROM, is fairly short amount of time to replace it. It can then sit there for years and years in the machine to steal votes.

Apart from this, to me, quite intimidating attack a number of other, simple hacks are mentioned:

It would appear that if a special character is displayed, the controller has to do extra work every time the display is updated. This causes the display refresh frequency to drop from 72Hz to 58 Hz. The difference between these two frequencies can be determined by ear. In The Netherlands, the name of the major political party CDA is written in full on the display when the voter chooses any CDA candidate: “Christen Democratisch Appèl”. So using only a simple scanner or short-wave receiver, we can tell whether or not a voter is currently voting for a party or candidate with an accent in the name.

...

In all cases we could receive the signal at a few meters. In one case we could receive the signal up to 25 meters away. [...] We also noticed energy present at 3845 Hz when the vote-button is pressed.

Shortly after the television broadcast, a little reponse appeared on the manufacturer's website:

It occurred to us that our machine works remarkably well. The voting machine does exactly what it is told to do. This was completely expected behaviour, as was confirmed by all involved. We can only conclude that the name of the citizens' movement that calls itself the "We Don't Trust Voting Computers Foundation" depicts our machine in an unjust manner. We feel that the "We Don't Trust People Foundation" would have been a more suitable name.

Personally, I would not vouch that opening up the manufacturing process of these voting machines are a guarantee to twart these kinds of attacks. However, what frightens me is the trust placed by the manufacturer by the 'security' gained from obscurity. Clearly, their statement that the voting machine does exaclty what people tell it to do is not true. Nobody intended the machine to emit the radiation that allows one to see when the "vote" button is pushed.

Also, given that fact that these undoubtedly devoted hackers still took only a month to know probably more about this machine than most of the employees of the manufacturer, does not really convince when the manufacturer claims that it needs to keep everything closed in order to guard their intellectual property.

Frankly, it seems that striving towards a free market for voting machines (which, according to the Nedap spokesman, is a policy of the Dutch government) is probably better served when the manufacturers would be obliged to open up every single bit of the manufacturing process. This kind of "quality through openness" sure is one of the main reasons why I joined MySQL. I strongly and firmly believe in openness as a drive that is especially suitable to drive commercial product development.

Anyway - would my vote count? Next month's elections could be particularly interesting.