Saturday, December 09, 2006

MySQL stored routines and the command line client: No comment...Or Maybe..?!

A popular myth holds that the MySQL Database Server strips the comments from stored procedures, functions, triggers, views and events. This is not true. Or at least, it is only somewhat true.

On an earlier occasion, I have written about MySQL views and how they lose all their comments and whitespace formatting. The article also describes a workaround for it. (Interestingly, a few of the bugs I ran into at the time are now resolved, which means you can do without the kludges I had to turn to in order to get it to work.)

Apart from views, there really is no limitation in the capabilities of the MySQL Server to keep the comments inside code associated with stored routines. (With stored routines, I mean: stored procedures, functions, triggers and events).

There's a really plain and simple explanation for the endurance of the "comment-stripping-myth". The myth remains alive because a lot (if not most) people exclusively use the mysql command line client command line tool for their MySQL work. The mysql interprets the user input, and strips the comments before it sends the command to the server. So any comments never even arrive at the server side.

MySQL Command Line Client Interpreter


The mysql command line client tool implements a simple state-machine to discover when the user input should be sent to the server. For example, the ; you normally type after a statement, and the delimiter command you use in order to create a stored procedure using the command line client: those are all elements that don't really belong to the SQL language itself. They exist solely for the benefit of the mysql command line client tool, because it has to have some way of knowing when the user is done typing whatever is to be sent to the server.

The mysql command line interpreter does not limit itself to interpreting the statement delimiter. It also recognizes that the user input initiates a comment. Let's take a closer look at that behaviour:

mysql>

This is what the mysql prompt normally looks like. When it looks like this, it is ready to accept input. Let's type some input, and then skip to the new line:

mysql> select version()
->

As you can see, the client tool notices that we started to type some input. The prompt changes from mysql> to ->, notifying the user that all the subsequent input will be seen as belonging to the current statement. When we now type the default statement delimiter ;, the command line tool will send the gathered input to the server, wait to receive the result and output that to the user:

mysql> select version()
-> ;
+-----------------+
| version() |
+-----------------+
| 5.1.15-beta-log |
+-----------------+
1 row in set (0.00 sec)

mysql>

It also displays a new fresh prompt, ready to receive new input.

So what happens when we initiate a comment? Well, the mysql command line client notices that too. Single line comments are initiated by a # character or the -- (dash-dash-space) sequence, which will effectively stop interpreting the subsequent input until a newline character is seen. This can be demonstrated by typing the statement delimiter right after the comment is initiated, and then skipping to the next line:

mysql> select version() # single line comment;
-> -- dash dash space;
-> ;

As you can see, the ; does not delimit the statement if it appears on the same line after the # or -- single line comment initiators. The statement delimiter is not the only character that ceases to be given a special meaning: all input appearing on the line after the comment was initiated will not be interpreted.

Comments can also span multiple lines using the C-style comment syntax:

mysql> select /*
/*> comments;
/*> */ version();

So, the /* sequence initiates a comment, and all subsequent input, including new lines, will not be interpreted as having special meaning. That is, until the next */ sequence is encountered, delimiting the comment. You can see that the command client tool knows we're inside the multi-line comment, because the prompt changes accordingly: the second and third line have the /*> prompt rather than the usual -> prompt.

Comments are stripped before sending the command to the server


We've just seen that the mysql command line client interprets the user input until it sees a statement delimiter. When it sees one, it treats the gathered input as one command, and sends it to the server. So far, nothing strange is happening. To observe the effect of comments stripping, we need to create a little stored procedure.

Stored procedures, functions, triggers and events are also created with a single statement. However, these types of statements differ from the single SELECT statements we just observed, because these statements can themselves contain a statement sequence. So, if the mysql command line client is to be used to create them, there must be a mechanism to distinguish the delimiter that is used to separate the statements inside the procedure from the delimiter that is used to mark the end of the outer statement so it can be sent to the server.

This is achieved using a special mysql command line client command, the delimiter command. The delimiter command is used to define a word or a delimiting sequence as the statement delimiter. Let's use the delimiter command:

mysql> delimiter $$

This command tells the mysql command line client command to use the $$ sequence as a marker to gather the user input and send it to the server. It's not sent to the server - it only serves to modify the behaviour of the local command interpreter. The default statement delimiter ; may still appear at the end of a statement, but it will simply be ignored:

mysql> select version()
-> ;
-> $$
+-----------------+
| version() |
+-----------------+
| 5.1.15-beta-log |
+-----------------+
1 row in set (0.00 sec)

mysql>

Now, we can create a stored routine that contains a statement sequence:

mysql> CREATE PROCEDURE p()
-> BEGIN
-> /******************
/*> * A simple procedure
/*> *******************/
-> SELECT version();
-> SELECT schema();
-> END;
-> $$
Query OK, 0 rows affected (0.00 sec)

The actual statement sequence is contained in the BEGIN..END statement. The BEGIN..END statement is therefore also a compound statement.

The statement list itself is made up of only two SELECT statements. Each of these needs to be properly delimited using the ; statement delimiter. Here, the ; symbol is fixed, and cannot be changed. In this context, the ; is interpreted by the MySQL Server. Inside the BEGIN..END block is also a multi line comment. Technically, comments are not statements, however we did put it inside the block with the intention of keeping it there.

In order to be able to send the CREATE PROCEDURE statement as a whole to the server, the mysql command line client needs to stop interpreting the ; symbol as a statement delimiter. That's why we had to redefine the mysql command line client statement delimiter to something other than ;. In this example, the $$ was chosen. As soon as the mysql command line client sees the $$ sequence, it will gather the user input, and send it to the server as one complete statement, which tells the server to create the procedure.

To change the mysql command line client statement delimiter back to the original, simply use the DELIMITER command again:

mysql> delimiter ;

We can use the SHOW CREATE PROCEDURE statement or the ROUTINES view in the information_schema to see what the procedure looks like according to the server:

mysql> SELECT routine_definition
-> FROM information_schema.routines
-> WHERE routine_schema = schema()
-> AND routine_name = 'p'
-> ;
+-------------------------------------------------------+
| routine_definition |
+-------------------------------------------------------+
| BEGIN

SELECT version();
SELECT schema();
END |
+-------------------------------------------------------+
1 row in set (0.01 sec)

mysql>

This result clearly shows that the comment is now gone, although we took no explicit action to remove the comment. So where did it go? How do we know for sure it is due to the mysql client tool that the comment is gone?

The simplest way to prove it, is to use the either the MySQL Query Browser or the MySQL Administrator GUI Tools. For example, using the MySQL Administrator, you can find the procedures by clicking the "Catalogs" item in the top sidebar, and then selecting the database that contains the procedure in the bottom sidebar. The procedures are all available in the "Stored Procedures" tab:
MySQLAdministratorStoredProcedures
There, you can select the procedure. Hitting the "Edit" putting pops up a small editor:
MySQLAdministratorEditStoredRoutine
By pasting the original code there, and then pushing the "Execute" button, we recreate the procedure. If we now use the mysql command line client again to see what the procedure looks like, we can see that the comment is exactly were we put it originally:

mysql> SELECT routine_definition
-> FROM information_schema.routines
-> WHERE routine_schema = schema()
-> AND routine_name = 'p'
-> ;
+---------------------+
| routine_definition |
+---------------------+
| BEGIN
/******************
* A simple procedure
*******************/
SELECT version();
SELECT schema();
END |
+---------------------+
1 row in set (0.00 sec)

You might argue that this solves the problem: if we simply switch and use these GUI tools, we will never have this problem again. Well, that's true, and maybe that works for you. However, in a lot of cases, I like the mysql command line tool better for a number of reasons (not discussed here).

Why I like code comments stored in the database


I like to write comments inside stored procedures. I use them to document the algorithm as well as it's usage. I picked up the habit when I worked as application developer writing tons of Oracle PL/SQL. Parts of the PL/SQL code of Oracle's own builtin packages (package headers, sometimes package bodies) can be queried using the data dictionary, and usually the comments inside the PL/SQL source code are enough to figure out how to use the builtin package.

Now I know a lot of people will say: "Nah, the database was not meant for all that, you should write proper documentation!" or "Why don't you use a version control system for that." Of course, I agree that proper documentation should be written, and I agree that version control systems should be used. However, the comments inside the stored procedure code serve another purpose, and I think it's a pity it's discarded by the mysql command line client tool.

  • Documentation is usually not written at the level of the implementation. Hacks, work-arounds and tricks used to implement a certain algorithm usually don't make it to the documentation. However, it is important to record their rationale, and in my opinion, there is no better place than the code itself.
  • A version control system should be in place in the development environment. However, when an application is distributed, the actual development code (including the comments) might not be readily available when it's most needed. Keeping the code and comments together inside the database server can be a life-saver if somebody suddenly runs into problems, because it enhances the possibilities to modify the code.
  • What about Licensing? Suppose I want to include, say a GPL license note with the code? I'd rather keep it as close to my stored procedure as possible.

Ok, have it your way, and try without comments


To illustrate the value of retaining the comments, consider the following function:

CREATE FUNCTION f_check_ccno(
p_ccno BIGINT
)
RETURNS TINYINT
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
SQL SECURITY INVOKER
BEGIN
DECLARE v_ccno VARCHAR(20)
DEFAULT reverse(
cast(p_ccno AS CHAR(20))
);
DECLARE v_len_ccno TINYINT
DEFAULT length(v_ccno);
DECLARE v_i TINYINT
DEFAULT 1;
DECLARE v_num TINYINT
DEFAULT 1;
DECLARE v_check BIGINT
DEFAULT 0;
WHILE v_i <= v_len_ccno DO
SET v_num := cast(
substr(v_ccno,v_i,1)
AS SIGNED
);
IF v_i%2=0 THEN
SET v_num := 2*v_num;
IF v_num > 9 THEN
SET v_num := v_num - 9;
END IF;
END IF;
SET v_check := v_check + v_num;
SET v_i := v_i + 1;
END WHILE;
RETURN (v_check%10)=0;
END;

So, what kind of function is this? Well, the function is called f_check_ccno, and it accepts a p_ccno parameter that is of the BIGINT type. Supposedly, ccno means something, assuming that f_ and p_ are prefixes for "function" and "parameter" respectively. The function returns a TINYINT which we assume encodes the validation result in some way.

It becomes more difficult if we want to find out what kind of validation we're dealing with. I invite you to try it: just peel it off, statement by statement and expression by expression Just guess what this thing is supposed to do.

Most people with some programming experience will have only little difficulties if at all to find out what the function does. That's hardly a suprise: the MySQL Stored Procedure language is a structured higher programming language, and it consists mostly of English keywords. However, the real problem with software program source code is to find out what it means or what was intended. Programming languages have no construct to code for purpose. All speculation on the meaning of this function so far has been derived from the identifiers. We just assume that the names have something to do with the role they play in the code.

(People: try and guess this. Post a comment here and describe what you think this function does. It is actually a useful function - I didn't just make it up. And, oh yeah...I put in a little snag - a bug. Can you find it? In one or two weeks, I'll post it as a snippet at MySQL Forge, with comments and a proper description.)

Now why do we even care? Suppose we inherit someones database and we happen to bump into this function. Of course, we assume the function is there for a reason, and it's might be used by another piece of code, say, a view, or a trigger, or maybe even a query issued by the application. But can't we just treat this function as a black box, and trust it to do whatever it's supposed to do?

Well, sure we can! And if everything is working fine, there's no reason to change anything. Of course we're just ignoring that some day, things might not work fine anymore. Bugs may suddenly occur, or we may find that the function does not scale well. Another likely scenario is a change of business, requiring that we re-evaluate or change the business rules and the software that enforces it. So one way or the other, we're going to have to deal with this function. In the worst case, we mightfind ourselves rebuilding part of the system, gathering and analyzing business requirements again. Once we're done, we might find that we've ended up with pretty much the same function we started with, the only difference being that we re-invented it ourselves.

Of course, you might get pretty fed up with the job by the time that's finished. You'll find a new opportunity and find a new occupation, leaving your successor with a new function called f_check_cc_number() that accepts a BIGINT UNSIGNED parameter called p_cc_number, and that returns a ENUM('PASS','FAIL') value. Needless to say, this is a vast improvement over the original, and has a really nifty and superior implementation. Of course, it still won't have comments because the mysql command line client has stripped them out again...

Yup. I guess you know where this is heading.

Great. Now what do we do?


Well, there are several options:

  • Switch to the GUI tools

  • Modify the mysql command line client

  • Use a little trick...


I already discussed the first option more or less. If it works for you, great. I like the GUI Tools too, but for a lot of work I just feel more comfortable with the mysql command line client.

The second option is something I'm seriously considering. It would be great to be able to turn the comment stripping on using a configuration option. The default behaviour would not be changed, and people that want to turn comment stripping off have a clean way of doing so. I explored the source code a little bit, and although I'm not a C/C++ programmer, I should be able to pull this off.

Of course, everybody can do this, and contribute the code. If you want to get started, get the source code, and hack away. To get you started, there is a very simple hack that will make mysql preserve at least the single line comments.

To do so, open the {mysql-bk}/client/mysql.cc source file and goto line 1108. There you will find something like this:

if (!in_string && (line[0] == '#' ||
(line[0] == '-' && line[1] == '-') ||
line[0] == 0))
continue; // Skip comment lines

Removing these lines will make mysql preserve all single line comments that start a line. At line 1330 a similar snippet strips the single line comments that are start within a line:

else if (!*ml_comment && (!*in_string && (inchar == '#' ||
inchar == '-' && pos[1] == '-' &&
my_isspace(charset_info,pos[2]))))
break; // comment to end of line

I tried it, and this seems to work just fine. Of course, making the patch really worth while will, put mildly, probably take a bit more overview of the source than I have now. For example, what keeps nagging me is the question "Why on earth are the comments stripped in the first place?" Anyway.

Finally: A little trick....


There is a dirty little trick that might just work fine for the time being. If you read the manual section that describes the comment syntax you might notice that there are two different forms of the multi-line comment. We already discussed the plain "C-style" comments. There is another syntax that is meant as a compatibility syntax rather than a comment syntax. It goes like this:

/*! MySQL-specific code */

The purpose of these "comments" is to be able to write portable queries. The "comment" is not really a comment; rather, it's a real command that might not be understood by all servers. The manual uses this example:

SELECT /*! STRAIGHT_JOIN */ col1 FROM table1,table2 WHERE ...

So, a MySQL server would understand the STRAIGHT JOIN modifier, and execute the statement accordingly, whereas other servers won't recognize this as anything else but a comment, and ignore it entirely. A special form of this syntax is frequently used in the output of mysqldump to create database dumps that are portable between MySQL versions. Here's an example from the script for the world database:

/*!40101 SET NAMES latin1 */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;

So, you see, a n.nn.nn scheme is used to target only a specific version of the MySQL Server. I tried a little bit, and I found out that this does the trick pretty well:

/*!999999
*
*/

So, note that I'm using six digits rather than five. Of course, it's ugly, but it works. And yes, I might have a problem as soon as MySQL 99.99.99 hits the shelves, but maybe that is just enough time for me to learn enough C/C++ to come up with a satisfactory patch for the mysql command line client. :)

15 comments:

Anonymous said...

Excelent article. I never used comments at stored routines but I'll use after this. Good thing that we don't have so many running yet, but soon we will have. :)

Anonymous said...

Hi Roland.

I like your solution.

Sorry I had not much time to look into. But at the first glance: cc number length could be up to 19 digits, so v_check TINYINT UNSIGNED will be more appropriate to use. If that is not your bug - I'll try once more :)

ps I think it is a good idea to post sometimes such puzzles.

rpbouman said...

Hi Andrew!

you know what - I learn something new every day! I did not know that the cc numbers could be 19 digits, thanks for pointing it out.

I think this means that the p_ccno needs to be a BIGINT UNSIGNED.

Indeed, the proposed data type for v_check would work, because in a worst case scenario, the loop will run 19 iterations, which means v_check can add up to as much as 19*9 = 171. However, the data type is currently a BIGINT which seems like overkill, but can't really hurt.

So No: the -intended- bug is somewhere else. Beer's on me if you find it ;)

Anonymous said...

Hi Roland.

I mean unsigned byte should be enough for v_check.
About the p_ccno - I am not sure but as far as I know there are no valid creditcard numbers staring with '9', so your function should work fine.

And about bugs... the only thing I can see is that your function supposes NULL is a valid cc number.

Btw mine too :) That is why a good idea to return by default FALSE until proven otherwise.

rpbouman said...

Hi Andrew!

Wow - I'm impressed, I really am. You found yet another thing I didn't think about. I guess that's what code reviews are for ;)

So, I think we can best solve this problem by using proper initialization of v_check. So,

DECLARE v_check TINYINT UNSIGNED DEFAULT 1;

should solve it.

Now, the bug I had in mind is this. The problem is concentrated in the first 2 declarations:

DECLARE v_ccno VARCHAR(20)
DEFAULT reverse(
cast(p_ccno AS CHAR(20))
);

DECLARE v_len_ccno TINYINT
DEFAULT length(v_ccno);

The problem is is not what is there - the problem is what is not there!

The v_ccno holds the character representation of p_ccno, and v_len_ccno uses the length of the character string to figure out how much iterations to do (because in the loop, we are peeling off the number digit by digit to do the magic). The problem is with LENGTH - length may not return the intended value, because length does *not* return the number of characters. Rater, LENGTH returns the number of *bytes*.

Here, the declaration of v_ccno is such that it omits an explicit character set attribute, so the database default character set will be used. If the database default character set uses a single byte per character such as latin1 or ascii, the function will behave as expected. If the default character set uses multiple bytes per character, such as utf8, then the result of LENGTH will be more than the number of characters.

The solution is simple: you should *never* use LENGTH unless you are absolutely sure you need it. Instead, use CHARACTER_LENGTH - that will always return the number of characters.

Of course, normally it can't hurt to explicitly include the characters set in the declaration of v_ccno. However, not all character sets you expect to be present might be compiled into the server, so that actually makes the code less portable (although it is very unlikely to have a server without latin1 and/or ascii support - I wonder if that is actually even possible)

Regardless, it is plain bad practice to use LENGTH when you mean CHARACTER_LENGTH - that is the intended message behind my deliberate bug. I chose this one, because I see this problem very often.

Anyway - thanks again Andrew ;)

Anonymous said...

You know, I had thoughts about LENGTH and was already trying to change charset to UTF when I found NULL.
Anyway, it was a good experience for me. And I'll be waitng for new brainbreaking things ;)

ps What's a pity, I wanted beer so much :))

Anonymous said...

There is a better way...export you mysql.proc table using mysqldump ...that will keep you comments in as it uses inserts to recreate the functions.

D.

rpbouman said...

Hey D.,

that sounds interesting...I guess what I do not understand is: how do you prevent the comments from getting lost in the first place? I mean, I would think they are inserted into the mysql.proc table after the comments are stripped by the command line client?

Anonymous said...

Unfortunately this method is not working on the version of MySQL that I have (5.0.38-Ubuntu_0ubuntu1-log ). I do not know if Ubuntu changed it or someone upstream did. Also putting comments in via the query browser or the other GUI tools simply is not working either.

So I've used this solution, not elegant I know:

SET @comment = "my comments";

SET @license = "GPL stuff....";

Again it is not elegant but it will work.

rpbouman said...

Hi anonymous,

What do you mean, 'does not work'? Can you be more specific? If it does not work it is a major bug, you will for example have trouble loading mysqldump dumps.

Please take a little time to explain exactly what you are doing so I can try and reproduce it.

thanks in advance,

Roland

Anonymous said...

Hi Mr Bouman

I need your help.
Im sorry if my question is out of topic.

Im using xampp for MySQL database server in MS Windows

MySQL have a problem if selecting query from joined table with record more than 10 thousand records. It will take more than 5 minute.

I just use with joining standard. But with many WHERE statement.
for example:

SELECT a.record1, b.record2, c.record3, d.record4
FROM table1 a JOIN table2 b
ON a.record_id = b.record_id
JOIN table3 c
ON a.record_id = c.record_id
JOIN table4 d
ON a.record_id = d.record_id;

Is there any settings in MySQL for increase performance?
Or there is any problem if i use XAMPP for database server in large data?

Thanks for the advice

Anonymous said...

I came into the same issue and found this as very useful
Thanks

Giuseppe Maxia said...

Too bad the trick doesn't work anymore in MySQL 5.1. Even the special /*! ... */ comment is stripped away

Raido Valgeväli said...

Think about a smaller trick, should be waterproof:

declare read_this_comment varchar(99) default 'Anything you want to say about the following code...';

rpbouman said...

Hi Raido,

well of course this would work, but it also has some obvious problems: since your comments are regular variables they will actally cost something. This may not seem like a big deal, but with the current imeplementation of MySQL stored routines this really is more expensive than most people realize. Especially inside functions, that need to be as fast as possible, a few variable declarations will easily cause a noticeable slowdown. Another issue that may be more of a practical problem is that code editors won't recognize this as a regular comment, so you can't use things like folding.

That said, this may be a useful addition to the repetoire, esp. since the version specific comment trick doesn't seem to work anymore like Giuseppe mentioned.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...