Roland Bouman's blog: October 2005

Monday, October 31, 2005

Creating Crosstabs in MySQL

In some cases, it would be very nice to be able to get a crosstab result from a query. You can't really do that in MySQL, at least not directly; as matter of fact, you can't do it in most rdbms-es (Interestingly, MS Access does have a language construct for crosstabs: TRANSFORM...PIVOT..)
If you really need a crosstab, here are some techniques that might be of use. I'll illustrate them using the updated version of the sakila database. I really think the following stuff will be a lot more fun if you download and install this database first. If you think that's too much of a hassle, then just open the diagram of this database in a new window to help you understand queries.

Step 1: decide what objects you want to know something about

Let's say we want to know something about films. We will need something to recognize a film, and we'll be choosing the title to do that:


select  f.title
from    films   f

This will just return a list of all films.

Step 2: decide what you want to know about those objects

Well, suppose we want to know how well our films are being rented, we could do:


select     f.title
,          count(r.inventory_id) rentals
from       film             f
inner join inventory        i
on         f.film_id      = i.film_id
left join  rental           r
on         i.inventory_id = r.inventory_id
group by   f.title

This query takes all films, and finds the stores that have the film in their inventory. Because we do this using an inner join, we exclude the films that aren't in any store's inventory. This is good: by definition, there can't be any rentals for those films, so it does not make sense to include them in the result.

The inventories are left joined to the rentals. This ensures we won't exclude films that have never been rented. In this case, we definitely want to include these films. If you were a DVD rental store manager, wouldn't you want to know about those films that are for rent but have never been rented before?

Step 3: decide how you want to categorize your data

Now, we do know how many times a particular film has been rented, wich is good. But there's more to a rental than a film ending up to be rented. For starters, a film isn't "just" rented: it is being rented by a particular customer. Second, the customer doesn't "just" rent the film: it's rented from a particular store. The different customers that can rent films, and the different stores where they rent film from offer a perspective that let's us 'slice' out a subset of all the rentals.

For this example, let's stick to the stores to categorize our data.

Given our previous query, it's pretty easy to get our hands on the actual information. We just need to add the store_id to the select and group by list from the previous query:


select     f.title
,          i.store_id
,          count(r.inventory_id) rentals
from       film             f
inner join inventory        i
on         f.film_id      = i.film_id
left join  rental           r
on         i.inventory_id = r.inventory_id
group by   f.title
,          i.store_id

This will give us the number of times a film was rented per store.

Hey, you promised us a crosstab, now pay up!

But wait! This is not a crosstab, this is just a tablular resultset! That's right of course. But we know all the information is there, so what's the matter then?

Actually, a crosstab is just a particular presentation of data. To present a crosstab of this resultset, we could have the films along it's vertical axis (in the direction of the rows), the stores along it's horizontal axis (in the direction of the columns). (We could just as well turn it the other way around, but because we have so much movies, it's probably more practical to have them placed along the vertical axis.) The number of rentals would then be at the row-column intersections or the cells.

Now, as for having the films along the vertical axis - that's pretty much what our queries have been doing all along. A SELECT expression are all about dynamically generating resultset rows according to data in database tables. We can be quite confident that this will not pose a problemn.

Having the stores along the horizontal axis is an entirely different story. Columns in the resultset of a SELECT expression correspond to the individual select list expressions. We sure do control generation of resultset columns ourselves: the expression that we type is evaluated to data appearing in the column, and if we give that expression an alias, that will be used as the column heading. However, we can only exert this control in a static manner, by typing them as part of the statement. What we would like to have is a kind of device that would generate columns dynamically, driven by the data (in our case: the rows from the store table).
Actually, we need to generate columns quite like the way a SELECT expression generates rows.

Let's accept this situation - for the moment.

Step 4: Coding the horizontal axis

To find a way out, let's just pretend our collection of stores is not dynamic, (it doesn't change that often anyway) and try to achieve only the layout of a crosstab. This would involve us typing a separate SELECT list expression for each store, thus statically creating a column definition for each store.
So let's inspect the store table and see for what stores we would have to type an extra column expression:


select  *
from    store

gives us:


+----------+---------+------------+
| store_id | manager | address_id |
+----------+---------+------------+
|        1 |       1 |          1 |
|        2 |       2 |          2 |
+----------+---------+------------+

That's not too bad. Only two stores, wich means we'll only have to type two expressions. Let's concentrate on these expressions.
In our previous queries, we used:


count(r.inventory_id)

to calculate the number of rentals. Now, we still want to calculate the number of rentals, but for a particular column, we only want to count those rentals that occurred in the store corresponding to the column. We need something like a conditional version of count. So, suppose we would be writing the expression for the store with store_id = 1, we could try something like this:


count(
    if(i.store_id=1
    ,   r.inventory_id
    ,   null
    )
)

But wait! Something's wrong here. What if a particular film would be in the Inventory of the store with store_id = 2, but not in the inventory of Store with store_id = 1? Well, the if expression would always return nnull, because no records will be found for the store with store_id = 1. Count will return the count of non-null values, but zero if there are only null values. So, this expression will return zero for the store with store_id = 1.

This is bad news! It means we cannot distinguish between a film that does not exist in the inventory of a particular store and a film that does exist in the inventory but isn't rented at all. Lucky for us, we can emulate count using sum, like this:


sum(
    if(i.store_id=1
    ,   1
    ,   null
    )
)

The if will return 1 for each record where the store_id = 1, and null in other cases.
SUM will faithfully sum these 1's while ignoring the null's, effectively yielding the count.
If sum is fed with only null arguments, sum will return null. This is good, because the result will now let us distinguish between having a film that happens not to exists in the inventory of a particular store as opposed to having the film in the inventroy without ever renting it.

So, the query looks like this:


select     f.title
,          sum(
               if(i.store_id=1
               ,   1
               ,   null
               )
           )                "store_id=1"
,          sum(
               if(i.store_id=2
               ,   1
               ,   null
               )
           )                "store_id=2"
from       film             f
inner join inventory        i
on         f.film_id      = i.film_id
left join  rental           r
on         i.inventory_id = r.inventory_id
group by   f.title

So, there's our crosstab. It wasn't hard at all, just a bit tedious.

Step 4 - continued: Generating the statement

Up untill now, we've put up with the nuisance of 'generating' columns by typing them. Now it's time to stop putting up with that. We've hand-coded a working Example, and it has taken us enough time. We do not want to waste more time on the same problem in the future. And really, we're bound to encounter this problem again. Sooner or later, the data in the store table will change, probably invalidating our crosstab query.

But hey, it's not so bad, is it? We know exactly wich parts of the code correspond to occurrences in the store table, right? And we know what parts of the code are independant of all that. So, let's try and devise something that will generate all that code for us.


    select   concat(
                 'select   f.title','\n'
             ,   group_concat(
                     concat(
                         ',          sum(','\n'
                     ,   '               if(i.store_id=',s.store_id,'\n'
                     ,   '               ,   1','\n'
                     ,   '               ,   null','\n'
                     ,   '               )\n'
                     ,   '           )'
                     ,   '                "store_id=',s.store_id,'"\n'
                     )
                     order by s.store_id
                     separator ''
                 )
             ,   'from       film             f','\n'
             ,   'inner join inventory        i','\n'
             ,   'on         f.film_id      = i.film_id','\n'
             ,   'left join  rental           r','\n'
             ,   'on         i.inventory_id = r.inventory_id','\n'
             ,   'group by   f.title','\n'
             ) statement
    from     store s

This query returns just one row. The one column will contain a string value that is exactly equal to the statement text of our previous hand-coded crosstab query. Compare the hand-coded crosstab-query SELECT expression with the SELECT list of the generator. As you can see, the generator does not much more than select the concatenated string literals that make up the crosstab query. The only variable element is the GROUP_CONCAT call wich aggregates all the rows from the store table in the FROM list of the generator, yielding the collection of SELECT list expressions that makes up the horizontal axis of the crosstab, and of course the data appearing in the cells.

(By the way, GROUP_CONCAT is a commonly recurring element in these generators. I just love this feature, wich really seems to be unique to MySQL. So much for open source products not being innovative)

Now that we can generate the crosstab query SELECT expression, we will never have to worry about data changing in the store table. We can simply run this query to obtain an updated query for the crosstab.

Step 4 - continued: wrapping generation and execution in a stored procedure

Generating the query is certainly an improvement. By generating the statement, we created a single point of definition that handles the duplication of code (the Column expression) in a structured, data driven way. That saves us time, and more important, it will always be correct. It will never contain syntax errors, and it will always cost the same amount of time to create an updated version of the Statement should the data in the store stable be modified.

It's still a bit of a nuisance that we can't just execute the crosstab query with a single statement though.
Right now, we always need at least two statements: one to generate the statement, and one to execute the generated statement.
We would really like to do that in one statement.

Lucky for us, we can actually do that.
We can create a stored procedure that performs these steps in the right order:

CREATE PROCEDURE `sakilar2`.`p_film_rentals_per_store`()
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
    select   concat(
                 'select   f.title','\n'
             ,   group_concat(
                     concat(
                         ',          sum(','\n'
                     ,   '               if(i.store_id=',s.store_id,'\n'
                     ,   '               ,   1','\n'
                     ,   '               ,   null','\n'
                     ,   '               )\n'
                     ,   '           )'
                     ,   '                "store_id=',s.store_id,'"\n'
                     )
                     separator ''
                 )
             ,   'from       film             f','\n'
             ,   'inner join inventory        i','\n'
             ,   'on         f.film_id      = i.film_id','\n'
             ,   'left join  rental           r','\n'
             ,   'on         i.inventory_id = r.inventory_id','\n'
             ,   'group by   f.title','\n'
             ) statement
    into     @film_rentals_per_store_sql
    from     store s
    order by s.store_id
    ;
    prepare  film_rentals_per_store
    from     @film_rentals_per_store_sql
    ;
    execute  film_rentals_per_store
    ;
    deallocate 
    prepare  film_rentals_per_store;
end

Inside the procedure, we first generate the sql that defines the query, and then use the prepared statement syntax to execute the generated statement. It's not as if we really need statement preparation; we just need some dynamic SQL. The prepared statement syntax happens to be the only construct that supports this in MySQL right now.

By now, we've achieved pretty much all we wanted. We can get a crosstab query result by simply calling the procedure. I'ts a bit of a pity we cannot setup a more generalized approach, but I'll settle with this for a while.

Friday, October 28, 2005

Internationalization overdone

...I could've guessed it. A couple of posts ago, I wrote on offline webpages for some webapps. Yesterday, I encountered a really annoying bug in my solution

It's like this, once the pages are offline, you need a way to prevent broken links and references to other pages. Because I always use relative URI's for links in the web app, this is not too big of a problem. What is a problemn though is that my URI's contain queries, as in:


   {path}/{resource}?name1=value1&name2=value2

I use the URI to generate a filename, in order for the links to keep on working once the page is taken offline, and stored as a file. Unfortunately, the ? is not a valid character for a filename on Windows (wich is the target platform). So, I have substitute that for another character. I have to do that everywhere I link to a page, and of course when generating the filename.

Fortunately, the only cases of references to pages are through the javascript window.open() method, wich takes, among others, an URI parameter. And I already wrapped that call in my own function to provide the appropriate options to the opened window. So, there already was a single point receiveing all the URI's, making it easy enough to substitute the ? character for the tilde ~.

The only thing I needed now was a way of determining wheter the user was browsing offline. Because this is an Internet Explorers Solution, I used the document.protocol property for that. It the protocol is "Hypertext Transfer Protocol" we must use the ?; else, we must use the ~. Easy enough right?

As it turned out, the value returned by the document.protocol property depends upon the language of the IE installation. On my developer machine, that's English (US); on the users machines, it's Dutch. In wich case the value returned by document.protocol is "Hypertext Overdrachts protocol" or something like that.

It's of course all solvable, you just have to check for document.protocol.indexOf("HyperText")==0, but I really don't see why, and obviously would not expect, such a property to be Language dependent.

Anyone?

Tuesday, October 25, 2005

Just want to say thanks

Hey,

I just found out I've been awarded an iPOD for the beta challenge! Amazing, when I received that June (yep, June already, time just flies when you're having fun) issue of the MySQL newsletter, I decided to try and do something.

As I recall, people were being teased to participate by promising them stuff like coffee mugs, T-shirts....Wich immediatly deepened my breath, and made my heart go BOOM,BOOM,BOOM as you might reckon!

Seriously, what convinced me to download the 5.0.7 release, was Arjen Lentz's "call to arms" to evaluate MySQL 5: a be-real, no-nonsense and to-the-point kind of appeal to test the new release of that well known open source rdbms, and this time, it was harbouring all kinds of enterprise grade features: stored procedures, views, triggers. At least, those were the features that I recall being mentioned repeatedly, let's say, the stuff that lets developers and programmers write more code.

Now, those that read some of the entries in blog probably wouldn't 've guessed it, but I hate writing code. I like to think about code, I just don't like writing it. Being an IT-consultant, database developer and information analyst, this attitude poses a problem, as people sure as hell expect me to deliver code, at least some of the time. Some while ago, I found ways of partially solving the paradox, by having as much of my code being generated for me, giving me more time to just think of what kind of could should be generated in the first place.

Now, to generate code, you need a solid source of information that you can use to drive the generation process. From the practice of generating code, I was already knowledgable about the INFORMATION_SCHEMA standard. So, when I noticed that MySQL was going to go along that road as well, I just had to check it out. Apart from offering a platform for wich you could code, this was also going to be a platform for wich I could go and build code generators too! So that's how I came about drawing that diagram, mainly to get myself acquainted with MySQL's implementation of the INFORMATION_SCHEMA standard.

And you bet, I want to get acquainted, because the way I see it, with this release, MySQL has taken a major step towards the direction of leading RDBMS vendors, at least feature-wise (at least One Of these vendOrs seems tO think sO tOO). I really think it's a mistake to, like I see too many of my colleagues do, keep ignoring MySQL as a product that you can do serious business with. I mean business as in: adding value for my customers, making Real Money with it etc.

Once I had the diagram, I thought, c'mon, let's share that stuff. Since then, I've had some valuable feedback on the diagram from Carsten Segieth and Mike Lischke, wich I think is great. I have already commited myself to maintaining it and keeping it up to date, and I'm going to recreate it in MySQL Workbench real soon. And Andrew Gilfrin's given me an opportunity to publish a code generator based on the mysql information_schema on his site. And now, there's this award. It may not be the coffee mug I hoped for, but I guess an iPOD's nice too ;-D. Terrific, I love it!

So, thanks Carsten and Mike, thanks Andrew and thank you very much Arjen. And of course, thanks to MySQL AB for developing this great product, and making it accessible to so many people. MySQL Rocks!

(walks of the stand, flowers in one hand, iPod in the other, almost bursting in to tears...whoops, I better quit this post)

Formatting MySQL code for the Web...in My:SPL

Among the more time consuming things maintaining this blog is formatting the code examples. I decided it was time for me to take that to the next level so I wrote a HTML formatter for mysql code. In case you're interested, it's available for download (zip 7kb). After downloading, just unzip it to your favourite folder, read the readme.txt, and everything should be fine after that.
I will spend the rest of this blog entry on some explanation and examples.

The Function `format_mysql_as_html()`

Here's the function that's responsible for actually creating html from mysql code. It is passed a string parameter via p_sql, and cuts that to bits: the tokens.
Dependant upon the type of the token maintained by v_state, it's wrapped into a <span> tag or left just as it is. The <span> is given a class attribute containing part of the name of the type of token, so we can mark it up through css later on.
There's one special type of token, the 'ACCEPT_OTHER' type. A token of this type is checked against the mysql.help_keyword table to see if it is a reserved word, in wich case it gets an extra distinguishing css class. (Alas, this table does not seem to contain all the words that are considered keywords; for example, it does not contain DECLARE or ELSEIF. But, for now, I'll settle for this.)
Finally, a token is produced of the type 'ACCEPT_END_OF_INPUT', wich terminates the function, returning a string of html.

CREATE FUNCTION `utils`.`format_mysql_as_html`(
    p_sql          longtext
)
RETURNS longtext
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
BLOCK_MAIN: begin
    declare v_line        int unsigned default 1;
    declare v_position    int unsigned default 1;
    declare v_token       varchar(65535) ;
    declare v_html_token  varchar(65535); 
    declare v_html        longtext default '';

    declare v_from_index int unsigned default 1;
    declare v_to_index   int unsigned default 1;
    declare v_state      enum(
        'ACCEPT_BACKTICK'
    ,   'ACCEPT_APOS'
    ,   'ACCEPT_QUOT'
    ,   'ACCEPT_WHITESPACE'
    ,   'ACCEPT_SINGLE_LINE_COMMENT'
    ,   'ACCEPT_MULTI_LINE_COMMENT'
    ,   'ACCEPT_OTHER'
    ,   'ACCEPT_END_OF_INPUT'
    ,   'ACCEPT_ENDLINE'
    ,   'ACCEPT_DELIMITER'
    ,   'ACCEPT_AT'
    ,   'ACCEPT_ATAT'
    );

    set v_html := concat(v_html,'\n','<pre>');
    LOOP_TOKENS: loop
        call next_mysql_token(
            p_sql
        ,   v_to_index
        ,   v_state
        );
        set v_token := substring(
            p_sql
        ,   v_from_index
        ,   v_to_index - v_from_index
        );
        set v_html_token := html_entities(v_token);
        if v_state = 'ACCEPT_END_OF_INPUT' then
            leave LOOP_TOKENS;
        elseif v_state in (
            'ACCEPT_WHITESPACE'
        ,   'ACCEPT_ENDLINE'
        ,   'ACCEPT_DELIMITER'
        ) then
            set v_html := concat(v_html,v_html_token);
        else 
            set v_html := concat(v_html,'<span class="',substring(v_state,8));
            if v_state = 'ACCEPT_OTHER' then        
                if exists (
                    select  null
                    from    mysql.help_keyword
                    where   name = v_token
                ) then
                    set v_html := concat(v_html,' RESERVED');                
                end if;
            end if;            
            set v_html := concat(v_html,'">',v_html_token,'</span>');
        end if;
        set v_from_index := v_to_index;
    end loop LOOP_TOKENS;
    set v_html := concat(v_html,'\n','</pre>');

    return v_html;
end BLOCK_MAIN

Most of the work is of course done by the next_mysql_token() (Im just printing the signature of that one, not the body. Look it up in the dist. if you're curious):

CREATE PROCEDURE `utils`.`next_mysql_token`(
    IN    p_text          longtext
,   INOUT p_index         int unsigned 
,   OUT   p_current_state enum(
        'ACCEPT_BACKTICK'
    ,   'ACCEPT_APOS'
    ,   'ACCEPT_QUOT'
    ,   'ACCEPT_WHITESPACE'
    ,   'ACCEPT_SINGLE_LINE_COMMENT'
    ,   'ACCEPT_MULTI_LINE_COMMENT'
    ,   'ACCEPT_OTHER'
    ,   'ACCEPT_END_OF_INPUT'
    ,   'ACCEPT_ENDLINE'
    ,   'ACCEPT_DELIMITER'
    ,   'ACCEPT_AT'
    ,   'ACCEPT_ATAT'
    )
)
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER

This function implements a deterministic state machine. It starts off reading p_text from position p_current_index. Then, it reads characters from p_text, one by one. For each character, the internal state is updated untill an accepting state is reached. Then, it returns.
When it returns, it will have updated the p_current_state as well as p_current_index, wich is advanced to the point right after the previous token. The token itself is easliy excised from the text by taking the substring between the positions indicated by p_current_index before and after the call.

Formatting entire routines

The information_schema database contains a ROUTINES table, providing us with lots of code to feed into format_mysql_as_html(). However, there's a small problem with that. It does not contain information about the parameterlist of the routine. Lucky for us, there's the mysql.proc table (Thanks Pro MySQL, didn't know that one before) that does contain all the information. So, I came up with the get_routine_ddl() function:

CREATE FUNCTION `utils`.`get_routine_ddl`(
    p_schema varchar(64)
,   p_name   varchar(64)
)
RETURNS longtext
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
    declare v_routine_ddl longtext;
    select concat(
               'CREATE ',p.type
           ,   ' `',p.db,'`.`',p.name,'`'
           ,   '(',convert(p.param_list using utf8),')','\n'
           ,   if( p.type='FUNCTION'
               ,   concat('RETURNS ',p.returns,'\n')
               ,   ''
               )       
           ,   'LANGUAGE ',p.language,'\n'
           ,   if( p.is_deterministic='NO'
               ,   'NOT '
               ,   ''
               ),  'DETERMINISTIC','\n'
           ,   replace(p.sql_data_access,'_',' '),'\n'
           ,   'SQL SECURITY ',p.security_type,'\n'
           ,   convert(p.body using utf8)
           )
    into   v_routine_ddl
    from   mysql.proc p
    where  p.db   = p_schema
    and    p.name = p_name
    ;
    return v_routine_ddl
    ;
end

This function takes the name of a schema and the name of a function or stored procedure, and returns the complete DDL for it. Because it needs to read from the mysql.proc table, I decided it would be best to create it with SQL SECURITY DEFINER. Then it can be created by a user that has access to that table, but can be executed by anyone that is granted execution of the function.
To top it off, I decided it would be fun to be able to write the formatted ddl directly to file. This can be done with the SELECT...INTO OUTFILE syntax, but sadly that does not accept a variable to specify the filename.
No sweat, since MySQL 5.0.13, the prepared statement syntax is available inside stored procedures! This comes in handy in the next function:

CREATE PROCEDURE `utils`.`write_routine_html_file`(
    p_schema  varchar(64)
,   p_name    varchar(64)
,   p_outfile varchar(255)
)
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
    declare v_quote char(1) default '''';

    set @statement_text := concat(
        'select '
    ,   'utils.format_mysql_as_html('
    ,   'utils.get_routine_ddl(?,?)'
    ,   ')'
    ,   ' into outfile '
    ,   v_quote
    ,   p_outfile
    ,   v_quote
    ,   ' fields escaped by '
    ,   v_quote,v_quote
    ,   ' lines terminated by '
    ,   v_quote,v_quote
    );

    set @p_schema := p_schema;
    set @p_name   := p_name;

    prepare stmt
    from    @statement_text
    ;
    execute stmt 
    using   @p_schema
    ,       @p_name
    ;
    deallocate prepare stmt
    ;
end

This function just wraps the get_routine_ddl() function inside a prepared SELECT...INTO OUTFILE allowing for a variable filename.

Now, using a statement like:

call utils.write_routine_html_file(
    'utils'
,   'format_mysql_as_html'
,   'D:/temp/format_mysql_as_html.txt'
);

generates the listing found on top of this blog entry...effortlessly.

Saturday, October 22, 2005

Nesting MySQL Cursor Loops

Handling cursor loops in MySQL keeps puzzling people.

Single Cursor Loops

The common case is to have a simple cursor loop. Each record is fetched from the cursor until there are now more records, and for each record some work is performed.
Let's take a look at this snippet:


01    begin
02        declare v_col1 int;                     -- define the data to grab from the cursor
03        declare no_more_rows boolean := FALSE;  -- define the loop control variable
04        declare cursor1 cursor for              -- define the set to iterate through
05            select col1
06            from   MyTable;
07        declare continue handler for not found  -- Capture cursor exhaustion event
08            set no_more_rows := TRUE;           -- modify the loop control variable, then return
09
10        open cursor1;                           -- execute query, and save the result so we can iterate it
11        LOOP1: loop                             -- start repeatable region of code
12            fetch cursor1                       -- grab data from current cursor record
13            into  v_col1                        -- this will raise NOT FOUND if the cursor is exhausted
14            ;
15            if no_more_rows then                -- check the loop control variable, the handler might've modified it
16                close cursor1;                  -- free resources
17                leave LOOP1;                    -- stop iterations 
18            end if;
19            -- 
20            -- Process data here
21            -- 
22        end loop LOOP1;
23    end;

This is a fragment of a MySPL Procedure.
To summarize, the CURSOR is processed by iteration through an ordinary LOOP that happens to FETCH rows from the cursor. When the cursor's exhausted, the NOT FOUND condition is raised, transferring control to a HANDLER that sets a flag. This flag is checked inside the loop, right after the fetch, and when it is set, iteration is terminated.

Handling cursor exhaustion

An important point to make right away is that, unlike the OPEN (line 10), FETCH (line 12) and CLOSE (line 16) statements, there is no formal relationship between the HANDLER declaration (lines 07..08) and the CURSOR declaration (lines 04..06). For instance, the HANDLER declaration never refers to cursor1, the name of the cursor. It just 'happens' to be the case that FETCH-ing eventually leads to a situation in wich the cursor will have reached it's last record. Such an exhausted cursor will raise the NOT FOUND condition on attempt to FETCH from it again. This condition can then be captured by the, scopewise, nearest appropriate HANDLER.

Nested Cursor Loops

Sometimes you need to nest two cursors. I don't want to discuss what problems are appropriately solved with this device right now. Let me just say that most of the time someone is using this device, they are doing that because they don't know how to write a join properly.
The problem with nesting curors is this: because we can't associate a HANDLER with a particular CURSOR we have to think of a way to know wich one of the cursors is exhausted once the NOT FOUND condition is raised. We really need to know that, because our loop controls depend on it. We definitely want to terminate the corrsponding loop when one of the cursors will be exhausted, don't we?

Using a separate block

Perhaps one of cleanest ways to do this is to rely on the scopewise propagation of the condition. By putting the inner LOOP into it's own BEGIN..END block, we have created the possibility to give that block it's own NOT FOUND handler. A NOT FOUND condition raised by a FETCH in the inner LOOP will thus be captured by the appropriate HANDLER declared in the inner BEGIN..END block. That's because that inner block's handler is nearer than the handler declared in the outer block. From there, the inner loop can easily be terminated:


01    BLOCK1: begin
02        declare v_col1 int;                     
03        declare no_more_rows boolean1 := FALSE;  
04        declare cursor1 cursor for              
05            select col1
06            from   MyTable;
07        declare continue handler for not found  
08            set no_more_rows1 := TRUE;           
09        open cursor1;
10        LOOP1: loop
11            fetch cursor1
12            into  v_col1;
13            if no_more_rows1 then
14                close cursor1;
15                leave LOOP1;
16            end if;
17            BLOCK2: begin
18                declare v_col2 int;
19                declare no_more_rows2 boolean := FALSE;
20                declare cursor2 cursor for
21                    select col2
22                    from   MyOtherTable
23                    where  ref_id = v_col1;
24                declare continue handler for not found
25                    set no_more_rows2 := TRUE;
26                open cursor2;
27                LOOP2: loop
28                    fetch cursor2
29                    into  v_col2;
30                    if no_more_rows then
31                        close cursor2;
32                        leave LOOP2;
33                    end if;
34                end loop LOOP2;
35            end BLOCK2;
36        end loop LOOP1;
37    end BLOCK1;

Note that the inner BEGIN..END block's handler can never capture the NOT FOUND condition raised by the FETCH in the outer block, because conditions are propagated from the inside to the outside.
Another advantage of this approach is that the inner loop is almost entirely self-contained. The only thing that binds the inner BEGIN..END block to the outer one is a reference to the variable v_col1 in the decalaration of cursor2. Everything that has to do with the inner loop is defined in just one place, BLOCK2.

Resetting the loop control variable

The disadvantage of the previous method is of course the duplication of code. Apart from the cursor itself, we have to declare the loop control variable and the handler twice too. Sometimes, this is way too much trouble. Lucky for us, there's a way to evade all that, by resetting the loop control variable just before the inner loop terminates. Check it out:


00    begin
01        declare no_more_rows boolean default false;
02        declare v_col1       int;
03        declare v_col2       int;
04        declare cursor1 cursor for
05            select col1
06            from   MyTable
07            ;
08        declare cursor2 cursor for
09            select col2
10            from   MyOtherTable
11            where  ref_id = v_col1;
12        declare continue handler for not found 
13            set no_more_rows := true;
14    
15        open cursor1;
16        LOOP1: loop
17            fetch cursor1 into v_col1;
18            if no_more_rows then
19                close cursor1;
20                leave LOOP1;
21            end if;
22            open cursor2;
23            LOOP2: loop
24                fetch cursor2 into v_col2;
25                if no_more_rows then
26                    set no_more_rows := false;
27                    close cursor2;
28                    leave LOOP2;
29                end if;
30            end loop LOOP2;
31        end loop LOOP1;
32    end;

Observe te difference between the handling of the loop termination. For the outer loop, we see the usual sequence of statements to close the cursor and to leave the loop (lines 18..21). For the inner loop, we use an additional assignment to reset the loop control variable (line 26). This ensures the outer loop gets a chance to roll off untill it's exhausted too.

Packing cursor loops into stored procedures

Another approach would be to pack all the code for the dependant, inner cursor loop into a separate stored procedure and have the outer loop CALL that. This approach resembles the first one where we put the inner loop in it's own BEGIN..END block:


01    create procedure p_inner_loop(
02        IN p_col1 int
03    )
04    begin
05        declare v_col int;
06        declare no_more_rows boolean := FALSE;
07        declare cursor cursor1 for
08            select col2
09            from   MyOtherTable
10            where  ref_id = p_col1;
11        declare continue handler for not found
12            set no_more_rows := TRUE;
13        open cursor1;
14        LOOP1: loop
15            fetch cursor1
16            into  v_col;
17            if no_more_rows then
18                close cursor1;
19                leave LOOP1;
20            end if;
21        end loop LOOP1;
22    end;

And


01    BLOCK1: begin
02        declare v_col1 int;                     
03        declare no_more_rows boolean1 := FALSE;  
04        declare cursor1 cursor for              
05            select col1
06            from   MyTable;
07        declare continue handler for not found  
08            set no_more_rows1 := TRUE;           
09        open cursor1;
10        LOOP1: loop
11            fetch cursor1
12            into  v_col1;
13            if no_more_rows1 then
14                close cursor1;
15                leave LOOP1;
16            end if;
17            call p_inner_loop(v_col1);
18        end loop LOOP1;
19    end BLOCK1;

This approach is especially useful when you want to be able to use that inner loop from other contexts as well. When cursor processing is becoming complex, it's probably a good idea to use this technique as a "divide and conquer" strategy to keep development managable. For example, this technique allows you test the different cursor loops separate from each other, wich can be a great aid in debugging.

Monday, October 17, 2005

Know Thy Schema...without typing "information_schema.%" so much

The MySQL information_schema (wich is available as of version 5.0.2) is a very powerful feature that can help you maintain or administrate your databases. You can use it to investigate the existence and status of database objects (such as tables, constraints and indexes) and their components (such as columns).

I like to use the MySQL command line tool for lots of common tasks. From the command line tool, I frequently feel the need to query the information_schema. In most cases, I just need to have a quick overview of a database's assets. In fact, I took up the habit in my job as an Oracle developer. In the Oracle command line tool, SQL*Plus, I'd do something like:


select   object_name
,        object_type
,        created
from     all_objects o
where    owner = 'SCOTT'
order by object_type
,        object_name

and this would give me a list of all the objects accessible to the current user that reside in the schema owned by the user SCOTT. For each object, the schema in wich the object resides would be listed, along the object's name and the date (actually the datetime, but the formatting leaves out the time component) the object was created:


OBJECT_NAME                    OBJECT_TYPE        CREATED
------------------------------ ------------------ --------
PK_DEPT                        INDEX              12-05-02
PK_EMP                         INDEX              12-05-02
PK_NODE                        INDEX              18-08-05
SYS_C003897                    INDEX              09-08-05
UK_NODE1                       INDEX              18-08-05
UK_NODE2                       INDEX              18-08-05
EMPREPORT                      PACKAGE            20-03-05
EMPREPORT                      PACKAGE BODY       20-03-05
BONUS                          TABLE              12-05-02
DEPT                           TABLE              12-05-02
EMP                            TABLE              12-05-02
IMAGES                         TABLE              20-03-05
NODE                           TABLE              18-08-05
PK                             TABLE              09-08-05
PRODUCT                        TABLE              18-08-05
SALGRADE                       TABLE              12-05-02

With the mysql information_schema database, we can achieve the same sort of list. However, there's no table there that contains all these different types of objects.

Another thing that can be a bit frustrating is that such a query can be quite lengthy. The identifiers used throughout the information_schema database are quit long and verbose.


select  avg(
            character_length(
                identifiers.identifier
            )
        )       avg_character_length
,       min(
            character_length(
                identifiers.identifier
            )
        )       min_character_length
,       max(
            character_length(
                identifiers.identifier
            )
        )       max_character_length
from    (
        select  table_name identifier
        from    information_schema.tables
        where   table_schema = 'information_schema'
        union all
        select  column_name
        from    information_schema.columns
        where   table_schema = 'information_schema'
        )       identifiers
;

+----------------------+----------------------+----------------------+
| avg_character_length | min_character_length | max_character_length |
+----------------------+----------------------+----------------------+
|              13.1279 |                    2 |                   37 |
+----------------------+----------------------+----------------------+

On the one hand, these elaborate identifiers make it easy to understand the internal structure of the information_schema and the queries that are built upon it, wich is good. On the other hand, it easily becomes a nuisance when you have to type them over and over again.

On top of that, I usually query the information_schema database in a situation where it is not the current database. Lazy me just wants to perform an ad hoc query to get some information on the current database, and not switch databases first with a use command.

All in all, in it's raw form, the information_schema is not that suitable for quickly inspecting a database. I decided to make life easier for myself, and some time ago, I created a database to store all kinds of general purpose utilities.


create database utils
;

Among my utilities are a couple of views built on the information_schema. In these views, I tried to find a balance between ease of use and comprehensibility. I'm printing them all here, hoping someone else will benefit.

The `objs` View

First of all, there's the objs view:


create or replace
view 
utils.objs (
    db
,   namespace
,   name
,   type
,   extra
,   created
,   comments
)
as
select   trigger_schema   
,        _utf8'TRIGGERS'
,        trigger_name
,        _utf8'TRIGGER'
,        concat(
             action_timing
         ,   _utf8' '
         ,   event_manipulation
         ,   _utf8' ON '
         ,   table_name
         ,   _utf8' FOR EACH '
         ,   action_orientation
         ) as extra 
,        created       
,        _utf8''
from     information_schema.triggers
union all
select   routine_schema
,        _utf8'ROUTINES'
,        routine_name
,        routine_type
,        concat(
             if(is_deterministic='NO','NOT','')
         ,   _utf8' DETERMINISTIC '
         ,   ' SQL SECURITY '
         ,   security_type
         ,   sql_data_access
         ) as extra
,        created
,        routine_comment
from     information_schema.routines
union all
select   table_schema     
,        _utf8'TABLES'
,        table_name       
,        table_type
,        concat(
             'ENGINE='
         ,   engine
         ,   ' ROW_FORMAT='
         ,   row_format
         ) as extra
,        create_time      
,        table_comment
from     information_schema.tables

The objs view yields one row for each database object. I decided to include the following objects: base tables, (system) views, procedures, functions and triggers. I decided to leave out indexes, because you can't identify an index by name in a given database: you'll also need the table name to be absolutely sure.

The columns are:

db: The database (schema) in wich the object resides.
namespace: The namespace. Currently, there are three namespaces: 'TABLES' (base tables, views, system views); 'ROUTINES' (functions, stored procedures) and 'TRIGGERS'
name: The name of the object, wich is unique within the namespace given a particular database
type: The actual object type. Values include: 'BASE TABLE', 'VIEW' and 'SYSTEM VIEW' for objects in the 'TABLES' namespace; 'FUNCTION' or 'BASE TABLE', 'VIEW' and 'SYSTEM VIEW'for the 'ROUTINES'namespace; and 'TRIGGER' for the 'TRIGGERS'namespace
extra: This column shows some extra information, dependant upon the object type.
created: The datetime corresponding to the that the object was created
comments: The comments describing the object

Now, to display a list of all the objects in the current database, just do:


select *
from   utils.objs
where  db = schema()
;

The `refs` View

The refs view lists information about foreign keys. Here's the code:


create 
or replace 
view 
utils.refs (
    db
,   cons
,   tab
,   ref_db
,   ref_cons
,   ref_type
,   ref_tab
,   nulls
,   idpart
)
as
select     c.constraint_schema                
,          c.constraint_name                  
,          c.table_name                       
,          p.constraint_schema                
,          p.constraint_name                  
,          p.constraint_type                  
,          p.table_name                       
,          max(ctcol.is_nullable)             
,          case count(cidcon.constraint_name)       
               when 0 then 'NO'
               else        'YES'
           end                                
from       information_schema.table_constraints c
inner join information_schema.key_column_usage  kc
on         c.table_schema                   = kc.table_schema
and        c.table_name                     = kc.table_name
and        c.constraint_name                = kc.constraint_name
inner join information_schema.columns           ctcol
on         kc.table_schema                  = ctcol.table_schema
and        kc.table_name                    = ctcol.table_name
and        kc.column_name                   = ctcol.column_name
left join  information_schema.key_column_usage  cidcol
on         kc.table_schema                  = cidcol.table_schema
and        kc.table_name                    = cidcol.table_name
and        kc.column_name                   = cidcol.column_name
left join  information_schema.table_constraints cidcon
on         cidcol.constraint_schema         = cidcon.constraint_schema
and        cidcol.constraint_name           = cidcon.constraint_name
and        cidcol.table_name                = cidcon.table_name
and        cidcon.constraint_type          != 'FOREIGN KEY'
inner join information_schema.key_column_usage  kp
on         kc.referenced_table_schema       = kp.table_schema
and        kc.referenced_table_name         = kp.table_name
and        kc.referenced_column_name        = kp.column_name
and        kc.position_in_unique_constraint = kp.ordinal_position
inner join information_schema.table_constraints p
on         kp.table_schema                  = p.table_schema
and        kp.table_name                    = p.table_name
and        kp.constraint_name               = p.constraint_name
where      p.constraint_type               != 'FOREIGN KEY'
and        c.constraint_type                = 'FOREIGN KEY'
group by   c.constraint_schema
,          c.constraint_name
,          c.table_name
,          p.constraint_schema
,          p.constraint_name
,          p.constraint_type
,          p.table_name
;

Apart from the foreign key name and table, the table that is referred to is listed too, along with the primary key or unique constraint of wich the columns are referenced by the foreign key columns. As a bonus, some extra information is shown about the type of relationship implemented by the foreign key.

db: The database in wich the foreign key table (and thus, the foreign key constraint) resides.
cons: The name of the foreign key constraint
tab: The name of the table with the foreign key constraint a.k.a. the child or detail table
ref_db: The database that contains the referenced table
ref_cons: The unique or primary key constraint of the referenced table that contains the columns that are referred by the foreign key columns
ref_type: The type of the referenced constraint: either 'PRIMARY KEY' or 'UNIQUE'
ref_tab: The referenced (a.k.a master or lookup) table
nulls: Whether the foreign key columns accept NULL values: 'YES' (foreign key columns accept nulls, the relationship is not mandatory) or 'NO' (foreign key columns cannot have nulls, thus the relationship is mandatory)
idpart: Wheter the foreign key columns are also part of a 'PRIMARY KEY' or 'UNIQUE' constraint in the referencing table: 'YES' (foreign key columns are partially identifying) or 'NO' (foreign key columns are not part of an identifier)

The refs view does not take into account that MySQL/innodb supports foreign keys that do not refer to either a 'PRIMARY KEY' or a 'UNIQUE' constraint.

(Q: Wow, can MySQL do that? A: Yes, it can! Q: Why would you want to do that anyway? A: uhmm, don't really know, but you can post on this subject if you have a suggestion).

The `idxs` View

The idxs view lists information about indexes. Here's the code:


create
or replace 
view 
utils.idxs (
    db
,   tab
,   name
,   type
,   id   
,   nulls
,   cols
,   maxs
,   mins
)
as
select     t.table_schema
,          t.table_name
,          s.index_name
,          s.index_type
,          if(s.non_unique=0,_utf8'YES',_utf8'NO')
,          if(s.nullable='YES',_utf8'YES',_utf8'NO')
,          count(seq_in_index)  
,          round(max(s.cardinality)/t.table_rows,2)
,          round(min(s.cardinality)/t.table_rows,2) 
from       information_schema.tables       t
inner join information_schema.statistics   s
on         t.table_schema = s.table_schema
and        t.table_name   = s.table_name 
group by   t.table_schema
,          t.table_name
,          s.index_name
,          s.index_type
,          s.non_unique
,          t.table_rows
;

This view yields one row for each index.

db: The database in wich the index table (and thus, the index) resides.
name: The name of the index
tab: The name of the table that defines the index
id: Whether the index accepts duplicate values or value combinations: 'YES' when this is an unique index, else 'NO' if the index accepts duplicates
nulls: Whether the index columns accept NULL values: 'YES' (index columns accept nulls) or 'NO' (index columns are mandatory)
cols: The number of columns in de index
maxs: The maximum selectivity for all columns in the index
mins: The minimum selectivity for all columns in the index

So, good luck with this everyone. I hope this will be as useful to you as it was to me.

Sunday, October 09, 2005

Support for scheduled tasks in MySQL 5.1

I received a very interesting comment on my previous post, "Scheduling procedure execution in MySQL using SLEEP() and EXECUTE"

In 5.1 we have the following support:
SCHEDULE 
[ schedule-name ] 
{ START TIME start-time | BETWEEN start-time AND end-time } 
[ EVERY period { HOURS | MINUTES | SECONDS } ] 
[ ON { ( day-of-week, ... ) | ( day-of-month, ... ) } ] 
[ START DATE start-date ] 
So we are looking at scheduled event support directly in the database.

Cheers,

-Brian

(I assume Brian is actually Brian Aker from MySQL AB. If not, could the real Brian Aker correct me on that?)

I think this is very, vey interesting news. And I like the idea of defining scheduled tasks through a kind DDL statement too. Oracle jobs are different: you've got a package (basically, a container of types, variables, cursors, procedures, functions) called DBMS_JOB. You call procedures in this package to define the job, and this results in some data being written to a table, wich is used later on to run the jobs. Of course, the solution I presented was modelled after that.

I do hope that this syntax will allow us to have a job run, say, every 1st of the month. Oracle's DBMS_JOB can do this, because it allows you to specify the next date for job running in the form of a string that is evaluated as a date expression. So in Oracle,


    DBMS_JOB.SUBMIT(
        job       => v_job_id
    ,   what      => 'p_calculate_salaries'
    ,   next_date => to_date('2005-11-01','YYYY-MM-DD')
    ,   interval  => 'add_months(sysdate,1)'
    );

would run the job on the first of november for the first time, and each time the job runs, the next date to run the job would be calculated by evaluating the expression passed to the interval parameter. In this case, the current date plus one month.

Maybe the [ ON { ( day-of-week, ... ) | ( day-of-month, ... ) } ] could deal with a month?

But even the MySQL SCHEDULE syntax won't allow this, it still think it's a cool feature.

I also hope the scheduled tasks will be accessible through the data dictionary, or at least in the mysql database. I think that this willl probably be te case, I mean, MySQL will have to store the data associated with the schedule somewhere, so why not a table.

Knowing this functionality will become available in the near future, I probably won't be doing too much about my schedular. I'll probably build a small example, and make some additions available I made yesterday.

Anyway, I had fun building it. After all it is the first-not-totally-useless thing I built in MySPL, and it gave me to opportunity to work with SLEEP(), PREPARE and EXECUTE. And I learned to use GET_LOCK() and RELEASE_LOCK(). Maybe not too hot if you're a regular MySQL developer, but very useful when you aren't (yet) like me.

Friday, October 07, 2005

Scheduling procedure execution in MySQL using SLEEP() and EXECUTE

Both Oracle and MSSQL provide tools and utilities to run scheduled database jobs. A job is just a piece of work - a program, a script or report - that runs periodically without human interaction. Jobs are especially useful for doing work that can be automated and work that can or should be performed when there are no humans to do it. Typical examples include loading a datawarehouse or exporting large amounts of data in text formats.

MySQL does not yet provide builtin functionality to work with database jobs. But in most cases, this is not really a problem. At least, not one that can't be solved. In most cases, some generic external scheduling tool like Cron (Linux) or Scheduled Tasks (Windows) can be used to do this.

Although these tools are probably sufficient, I decided that it would be fun to build a scheduler in the MySQL stored procedure language (Yep, I'll be calling that MySPL again). For one thing, I was just curious whether it would be possible.
I mean, all the ingredients are there really:

we can build tables to hold the information that comprises a schedule and some definition of the actual work that needs to be done by the job

we can build procedures that access data from these tables, and have the flow of control be influenced by this data

as of the 5.0.12 version, we've got the SLEEP() function, wich is really important to implement recurrence. It's a bit tucked away inside the docs, but it's there allright!

as of the 5.0.13 version, we can use the prepared statement syntax from within MySPL

For those that are interested in all the details, there's an installation script available for download right here. It contains the bare schedular, and a little API as well. This little API makes it quite easy to define a job or alter it's characteristics. There's no documentation yet, but I hope this blog post can get you started. I am planning on writing some examples and documentation in the near future. I did include COMMENTs, so that should be of some help.

How do these pieces work together?

We can set up a table to hold the data that defines the actual work as well as the data associated with actually scheduling job execution. This information can be used by a particular procedure, the job runner. The job runner can be called, passing it a reference to a particular job. Then, the job runner takes care of scheduling and executing the actual work associated with the job.

Basically, the job runner implements a loop that uses the prepared statement syntax to perform the actual work. For each such individual run, the CURRENT_DATETIME() for start and finish are recorded, and this is used to have the SLEEP() function suspend the loop until the next execution is due. The scheduling information is retrieved anew for each cycle, so job execution can be influenced by changing the schedule data while the job runner is running the job.

Now there's something about this job runner mechanism that needs to be clarified right away. The job runner is just an ordinary procedure that uses the information in the job table. So, unlike the case with Oracle database jobs, there's no background process that automatically takes care of running jobs once we define them in the job table. We need to first define them and then call a job runner to actually run the job.

Because there's no way to call the job runner procedure asynchronously, the client that's executing a job runner will block until the entire job schedule has finished.
However, once a job runner is running, it will notice particular changes made to the job schedule. So, the timewindow wherein the job is supposed to run can be manipulated, as well as the interval between job cycles. We can also have the job runner inspect a status column. With these devices, we can instruct a running job gracefully (that is, without killing the thread running the job) from inside another thread, simply by updating data in a table.

The `job` base table

We'll start by The Most Important Thing First, and that's, yep you guessed it right: the data. We are going to need data to define the characteristics of a job, that is, when it should become active, what it's name is, what work it needs to do...that stuff. Once we're running the job, we will need status information: data to see if a particular job is done etc.
Here's what I came up with:


delimiter go

drop table if exists job
go

create table job (
    job_id               
        int unsigned  
        NOT NULL
        auto_increment 
        COMMENT 'Job identification for reference purposes'
,   job_owner            
        varchar(77)
        NOT NULL
        COMMENT 'The name of the user that owns this job'
,   job_name             
        varchar(129)
        NOT NULL
        COMMENT 'The per-owner unique name of this job.'
,   job_description
        varchar(1024) 
        COMMENT 'A description of this job.'
,   job_procedure_schema 
        varchar(64) 
        NOT NULL
        COMMENT 'The schema in which the job procedure resides.'
,   job_procedure_name
        varchar(64) 
        NOT NULL
        COMMENT 'The name of the job procedure.'
,   job_schedule_start 
        datetime         
        COMMENT 'If applicable, the start of the time window for which the job is scheduled.'
,   job_schedule_finish 
        datetime 
        COMMENT 'If applicable, the end of the time window for which the job is scheduled.'
,   job_schedule_interval 
        int unsigned 
        NOT NULL
        COMMENT 'The number of seconds between starts of job execution instances'
,   job_status 
        enum(
            'ABORTED'
        ,   'DISABLED'
        ,   'ENABLED'
        ,   'ERROR'
        ,   'FINISHED'
        ,   'PAUSED'
        ,   'RESUMED'
        ,   'RUNNING'
        ,   'SLEEPING'
        )
        NOT NULL
        default 'ENABLED'
        COMMENT 'The current status of the job'
,   job_runner_connection_id
        int unsigned
        COMMENT 'The CONNECTION_ID of the connection executing the runjob procedure for this job.'
,   job_runner_start
        datetime
        COMMENT 'The last time a jobrunner was started for this job'
,   job_runner_finish
        datetime
        COMMENT 'The last time a jobrunner finished this job'
,   job_runner_cycles
        int unsigned
        COMMENT 'The number of cycles the jobrunner successfully performed since jobrunner_start'
,   job_runner_last_cycle_start
        datetime 
        COMMENT 'The last time a job cycle started'
,   job_runner_last_cycle_finish
        datetime
        COMMENT 'The last time a job cycle finished'
,   constraint pk_job primary key (
        job_id
    )
,   constraint uk_job1 unique (
        job_owner
    ,   job_name
    )
) 
ENGINE=MyIsam 
DEFAULT CHARSET=latin1 
COMMENT='A table that schedules stored procedure execution'
go

Although you could put a lot more effort in defining table structures to make up a nice complete scheduler, I decided that a simple structure would do for most of the cases. The following brief explanation of the columns is meant to draw a general outline. The table contents are to be manipulated using a little API of MySPL procedures.

job definition

A job belongs to a particular database user, the job_owner. A job also has a job_name, wich uniquely identifies the job per owner. Optionally, a job_description can be entered so that an administrator or developer can read what the job is supposed to do. Also, an job_id was added to make it easy to reference a particular job.

The actual work that's performed by the job is represented as stored procedure. So, in order to define a job, you will first have to write yourself a (parameterless) procedure that does all the actual work. Then, you can enter the schema and the name of this procedure in the job_procedure_schema and job_procedure_name columns of the job table.

Note that there's really no technical reason to require a stored procedure for this.
In fact, any statement that can be executed from within the job runner procedure using the prepared statement syntax would do equally well. However, I chose to require a stored procedure because I think it's nicer that way. It ensures that there's a clear seperation between the job schedule definition and the actual work performed within it.

job scheduling

We still need some information to actually schedule job execution. Therefore, we have the job_schedule_start and job_schedule_end columns.
Together, these define a timeframe wherein the job is supposed to be automatically executed. The job_schedule_interval determines the frequency of execution.

Right now, scheduling is very primitive. Because you can only schedule using a fixed interval, it is impossible to schedule things at, say, every first of the month. That's because a month is not a fixed period of time. Oracle jobs use an expression that's evaluated at runtime to determine when the next execution will start. This is something I'm thinking of implementing, but for now, we're stuck with fixed intervals.

running the schedule

The other columns provide information about the current job runner executing the job.
Some are there for monitoring purposes, but some others are used to actully maintain the job runner's state. Most of these columns are updated each time a loop cycle done by the job runner starts or finishes the procedure associated with the job.

The job_status provides information about the state of the job: wheter it needs to be started, wheter is is currently executing the job procedure, and more.
Most of the available statuses are automatically provided by the job runner once it's running a job. The job_runner_connection_id identifies the thread currently running the job. This is used internally to prevent different job runner instances from running the same job. This can also be useful to kill the job.
The job_runner_start and job_runner_finish record the start and finish of the execution of the job runner procedure itself. The job_runner_cycles contains the number of cycles that the job runner has gone through since it was started. The job_runner_cycle_last_start and job_runner_cycle_last_finsished columns are used to record when the last cycle started or finished.

The `job$run` procedure

Once you scheduled jobs by entering data into the job table, you can invoke the job runner to actually execute the job according to schedule. I've implemented this in the procedure job$run. What happens in this procedure is this:

We try to get a lock for this job runner. This prevents having more than one job runner executing the same job simultaneously

If we succeed in acquiring the lock, we retrieve some data describing the job. We use this data decide wheter the job is eligable for running. For example, it could be the case that the job_schedule_finish is earlier than the current date and time, wich makes the job not eligable for running.

If we decide the job is eligable for running, we could have to wait in case the current date and time are earlier than the job_schedule_start. THis is achieved with the SLEEP() function

Once we reach the job_schedule_start date and time, we can start cycles executing the job procedure specified by the job_procedure_schema and job_procedure_name columns. Actual execution is performed by the PREPARE and EXECUTE syntax.

Cycling will occur at the frequency specified by job_schedule_interval. Again, the SLEEP() is used. This time, we need to SLEEP() through the period between the finish of the EXECUTE and start of EXECUTE plus job_schedule_interval seconds

For each cycle, some condition could arise that should terminate cycling. For example, job_schedule_finish could be reached, or job_status may be updated by another thread to actively initiate termination.


delimiter go

drop procedure if exists job$run
go

create procedure job$run(
    -- the name of the job to run
    -- this must correspond to an existing row in the job table 
    -- for which current_user() = job.job_owner
    -- and for wich the p_job_name = job.job_name
    in p_job_name             varchar(64)
)
comment 'Activates a previously defined stored procedure execution schedule.'
BLOCK_MAIN: begin

    declare JOBSTATUS_ABORTED  varchar(7) 
        default 'ABORTED';
    declare JOBSTATUS_ENABLED  varchar(7) 
        default 'ENABLED';
    declare JOBSTATUS_ERROR    varchar(5) 
        default 'ERROR';
    declare JOBSTATUS_FINISHED varchar(8) 
        default 'FINISHED';
    declare JOBSTATUS_PAUSED   varchar(6) 
        default 'PAUSED';
    declare JOBSTATUS_RUNNING  varchar(7) 
        default 'RUNNING';
    declare JOBSTATUS_SLEEPING varchar(8) 
        default 'SLEEPING';

    declare v_message                     varchar(1024);

    declare MESSAGE_JOB_LOCKED            varchar(1024) 
        default 'Job already locked by connection with id: ';
    declare MESSAGE_ERROR_LOCKING_JOB     varchar(1024) 
        default 'An error occurred while locking the job';
    declare MESSAGE_JOBSTATUS_NOT_ENABLED varchar(1024) 
        default 'Current job status not ''ENABLED''';
    declare MESSAGE_SCHEDULE_HAS_EXPIRED  varchar(1024) 
        default 'The schedule has expired';
    declare MESSAGE_INTERRUPTED           varchar(1024) 
        default 'Interrupted while sleeping';
    declare MESSAGE_NO_SUCH_JOB           varchar(1024) 
        default 'No such job';
    declare MESSAGE_PROCEDURE_FAILED      varchar(1024) 
        default 'The job procedure raised an exception';
    declare MESSAGE_JOB_ABORTED           varchar(1024) 
        default 'Job aborted at external request';
    declare MESSAGE_OK                    varchar(1024) 
        default 'Ok';

    declare v_lock_status                 tinyint unsigned;
    declare v_locked_by                   int unsigned;

    -- get a lock to prevent simultaneous execution of the same job
    call job$get_lock(
        p_job_name
    ,   v_lock_status
    ,   v_locked_by
    );
    if v_lock_status = 1 then
        -- Ok, we're now the one and only job runner for this job
        BLOCK_LOCK_ACQUIRED: begin                
            declare v_current_timestamp       timestamp
                default current_timestamp();
            declare v_job_id               int unsigned;
            declare v_job_schedule_start   datetime;
            declare v_job_status enum(
                'ABORTED'
            ,   'DISABLED'
            ,   'ENABLED'
            ,   'ERROR'
            ,   'FINISHED'
            ,   'PAUSED'
            ,   'RUNNING'
            ,   'SLEEPING'
            );
            declare exit handler for not found
                set v_message := MESSAGE_NO_SUCH_JOB;


            -- retrieve some general data and decide whether this job is eligble to run
            select   j.job_id
            ,        coalesce(
                         j.job_schedule_start
                     ,   v_current_timestamp
                     )
            ,        case
                         when j.job_status != JOBSTATUS_ENABLED 
                             then MESSAGE_JOBSTATUS_NOT_ENABLED
                         when j.job_schedule_finish < v_current_timestamp
                             then MESSAGE_SCHEDULE_HAS_EXPIRED
                         else MESSAGE_OK
                     end as message
            ,        concat(
                         'CALL'
                     ,   ' '
                     ,   '`',j.job_procedure_schema,'`'
                     ,   '.'
                     ,   '`',j.job_procedure_name,'`'
                     ,   '()'
                     )
            into     v_job_id
            ,        v_job_schedule_start             
            ,        v_message
            ,        @job$run
            from     job as j
            where    j.job_owner = current_user()
            and      j.job_name  = p_job_name
            ;
            if v_message = MESSAGE_OK then
                -- Yes, this job is eligble to run
                BLOCK_ENTER_RUNJOBS: begin

                    declare v_sleep                   int unsigned;
                    declare v_job_schedule_interval   int unsigned;
                    declare v_job_schedule_finish     datetime;
                    declare v_job_runner_cycles       int unsigned
                        default 0;

                    declare exit handler for not found
                        begin
                            set v_message := MESSAGE_NO_SUCH_JOB;
                            set v_job_status := JOBSTATUS_ERROR;
                        end;

                    -- preparing the job procedure 
                    prepare stmt_jobproc 
                    from    @job$run
                    ;
                    -- notify the outside we're running the job
                    -- it's just for monitoring purposes anyway
                    -- the lock is supposed the provide the actual safety
                    update job
                    set    job_status                   = JOBSTATUS_RUNNING
                    ,      job_runner_connection_id     = connection_id()
                    ,      job_runner_start             = v_current_timestamp
                    ,      job_runner_finish            = NULL
                    ,      job_runner_cycles            = NULL
                    ,      job_runner_last_cycle_start  = NULL
                    ,      job_runner_last_cycle_finish = NULL
                    where  job_id                       = v_job_id
                    ;
                    -- check if job is scheduled in the future in order to wait for it
                    if  v_job_schedule_start > v_current_timestamp then
                        update job
                        set    job_status  = JOBSTATUS_SLEEPING
                        where  id          = v_job_id
                        ;                               
                        -- sleep until job cycling is due
                        set v_sleep := timestampdiff(
                            SECOND
                        ,   v_current_timestamp
                        ,   v_job_schedule_start
                        );
                        if  sleep(v_sleep) = 1 then
                            set v_message := MESSAGE_INTERRUPTED;
                            set v_job_status := JOBSTATUS_ERROR;
                            leave BLOCK_ENTER_RUNJOBS;
                        end if;
                    end if
                    ;
                    -- this is the main loop
                    LOOP_RUNJOBS: loop
                        -- renew schedule data, it might've changed while we where asleep
                        select job_schedule_finish
                        ,      job_schedule_interval
                        ,      job_status
                        ,      current_timestamp()
                        into   v_job_schedule_finish
                        ,      v_job_schedule_interval
                        ,      v_job_status
                        ,      v_current_timestamp
                        from   job j
                        where  j.job_id = v_job_id
                        ;
                        -- decide wheter to finish or to continue cycling
                        if v_current_timestamp > v_job_schedule_finish then
                            set v_job_status := JOBSTATUS_FINISHED;
                            leave LOOP_RUNJOBS;
                        elseif v_job_status = JOBSTATUS_ABORTED then
                            leave LOOP_RUNJOBS;
                        else
                            -- skip the actual work when we're pausing
                            if v_job_status != JOBSTATUS_PAUSED then
                                update job
                                set    job_status                  = JOBSTATUS_RUNNING
                                ,      job_runner_last_cycle_start = v_current_timestamp
                                ,      job_runner_cycles           = v_job_runner_cycles
                                where  job_id                      = v_job_id
                                ;
                                -- calling the job procedure 
                                BLOCK_EXECUTE_JOBPROC: begin
                                    declare exit handler for sqlexception
                                        begin
                                            set v_job_status := JOBSTATUS_ERROR;
                                            set v_message := MESSAGE_PROCEDURE_FAILED;
                                            leave LOOP_RUNJOBS;
                                        end;
                                    execute stmt_jobproc;
                                end BLOCK_EXECUTE_JOBPROC
                                ;
                                set v_job_runner_cycles := v_job_runner_cycles + 1
                                ;
                                update job
                                set    job_status                   = JOBSTATUS_SLEEPING
                                ,      job_runner_last_cycle_finish = current_timestamp()
                                ,      job_runner_cycles            = v_job_runner_cycles
                                where  job_id                       = v_job_id
                                ;
                            end if;
                            -- calculate the amount of sleep until the next cycle's due
                            set v_sleep := timestampdiff(
                                SECOND
                            ,   current_timestamp()
                            ,   date_add(
                                    v_current_timestamp
                                ,   interval v_job_schedule_interval SECOND
                                )
                            );
                            if v_sleep > 0 then
                                if sleep(v_sleep) = 1 then
                                    set v_message := MESSAGE_INTERRUPTED;
                                    set v_job_status := JOBSTATUS_ERROR;
                                    leave BLOCK_ENTER_RUNJOBS;
                                end if;
                            end if;
                        end if;
                    end loop LOOP_RUNJOBS;
                end BLOCK_ENTER_RUNJOBS
                ;
                deallocate prepare stmt_jobproc
                ;
                update job
                set    job_status          = v_job_status
                ,      job_runner_finish   = current_timestamp()
                where  job_id              = v_job_id
                ;
            end if; 
        end BLOCK_LOCK_ACQUIRED
        ;
        -- release the lock so another job runner can run the job
        call job$release_lock(
            p_job_name
        ,   v_lock_status
        ,   v_locked_by
        );
    elseif v_lock_status = 0  then
        -- Rats! some other job runner was here first
        set v_message := concat(MESSAGE_JOB_LOCKED,v_locked_by);
    else
        -- Whoops! an error occurred aqcuiring the lock
        set v_message := MESSAGE_ERROR_LOCKING_JOB;
    end if
    ;
    -- Let the caller now what happened here
    select  p_job_name as job_name
    ,       v_message  as message
    ;
end BLOCK_MAIN
;
go

I used the bold typeface to highlight the usage of SLEEP() and the elements that are related to the prepared statement syntax.

Locking is implemented using job$get_lock and job$release_lock, wich use the builtin GET_LOCK() and RELEASE_LOCK() functions. Just download install_job.mysql to see how it works exactly.

The rest is pretty straight-forward: just nesting and looping.