Monday, October 31, 2005

Creating Crosstabs in MySQL

In some cases, it would be very nice to be able to get a crosstab result from a query. You can't really do that in MySQL, at least not directly; as matter of fact, you can't do it in most rdbms-es (Interestingly, MS Access does have a language construct for crosstabs: TRANSFORM...PIVOT..)
If you really need a crosstab, here are some techniques that might be of use. I'll illustrate them using the updated version of the sakila database. I really think the following stuff will be a lot more fun if you download and install this database first. If you think that's too much of a hassle, then just open the diagram of this database in a new window to help you understand queries.

Step 1: decide what objects you want to know something about


Let's say we want to know something about films. We will need something to recognize a film, and we'll be choosing the title to do that:

select f.title
from films f

This will just return a list of all films.

Step 2: decide what you want to know about those objects


Well, suppose we want to know how well our films are being rented, we could do:

select f.title
, count(r.inventory_id) rentals
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title

This query takes all films, and finds the stores that have the film in their inventory. Because we do this using an inner join, we exclude the films that aren't in any store's inventory. This is good: by definition, there can't be any rentals for those films, so it does not make sense to include them in the result.

The inventories are left joined to the rentals. This ensures we won't exclude films that have never been rented. In this case, we definitely want to include these films. If you were a DVD rental store manager, wouldn't you want to know about those films that are for rent but have never been rented before?

Step 3: decide how you want to categorize your data


Now, we do know how many times a particular film has been rented, wich is good. But there's more to a rental than a film ending up to be rented. For starters, a film isn't "just" rented: it is being rented by a particular customer. Second, the customer doesn't "just" rent the film: it's rented from a particular store. The different customers that can rent films, and the different stores where they rent film from offer a perspective that let's us 'slice' out a subset of all the rentals.

For this example, let's stick to the stores to categorize our data.

Given our previous query, it's pretty easy to get our hands on the actual information. We just need to add the store_id to the select and group by list from the previous query:

select f.title
, i.store_id
, count(r.inventory_id) rentals
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title
, i.store_id

This will give us the number of times a film was rented per store.

Hey, you promised us a crosstab, now pay up!


But wait! This is not a crosstab, this is just a tablular resultset! That's right of course. But we know all the information is there, so what's the matter then?

Actually, a crosstab is just a particular presentation of data. To present a crosstab of this resultset, we could have the films along it's vertical axis (in the direction of the rows), the stores along it's horizontal axis (in the direction of the columns). (We could just as well turn it the other way around, but because we have so much movies, it's probably more practical to have them placed along the vertical axis.) The number of rentals would then be at the row-column intersections or the cells.

Now, as for having the films along the vertical axis - that's pretty much what our queries have been doing all along. A SELECT expression are all about dynamically generating resultset rows according to data in database tables. We can be quite confident that this will not pose a problemn.

Having the stores along the horizontal axis is an entirely different story. Columns in the resultset of a SELECT expression correspond to the individual select list expressions. We sure do control generation of resultset columns ourselves: the expression that we type is evaluated to data appearing in the column, and if we give that expression an alias, that will be used as the column heading. However, we can only exert this control in a static manner, by typing them as part of the statement. What we would like to have is a kind of device that would generate columns dynamically, driven by the data (in our case: the rows from the store table).
Actually, we need to generate columns quite like the way a SELECT expression generates rows.

Let's accept this situation - for the moment.

Step 4: Coding the horizontal axis


To find a way out, let's just pretend our collection of stores is not dynamic, (it doesn't change that often anyway) and try to achieve only the layout of a crosstab. This would involve us typing a separate SELECT list expression for each store, thus statically creating a column definition for each store.
So let's inspect the store table and see for what stores we would have to type an extra column expression:

select *
from store

gives us:

+----------+---------+------------+
| store_id | manager | address_id |
+----------+---------+------------+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
+----------+---------+------------+

That's not too bad. Only two stores, wich means we'll only have to type two expressions. Let's concentrate on these expressions.
In our previous queries, we used:

count(r.inventory_id)

to calculate the number of rentals. Now, we still want to calculate the number of rentals, but for a particular column, we only want to count those rentals that occurred in the store corresponding to the column. We need something like a conditional version of count. So, suppose we would be writing the expression for the store with store_id = 1, we could try something like this:

count(
if(i.store_id=1
, r.inventory_id
, null
)
)

But wait! Something's wrong here. What if a particular film would be in the Inventory of the store with store_id = 2, but not in the inventory of Store with store_id = 1? Well, the if expression would always return nnull, because no records will be found for the store with store_id = 1. Count will return the count of non-null values, but zero if there are only null values. So, this expression will return zero for the store with store_id = 1.

This is bad news! It means we cannot distinguish between a film that does not exist in the inventory of a particular store and a film that does exist in the inventory but isn't rented at all. Lucky for us, we can emulate count using sum, like this:

sum(
if(i.store_id=1
, 1
, null
)
)


The if will return 1 for each record where the store_id = 1, and null in other cases.
SUM will faithfully sum these 1's while ignoring the null's, effectively yielding the count.
If sum is fed with only null arguments, sum will return null. This is good, because the result will now let us distinguish between having a film that happens not to exists in the inventory of a particular store as opposed to having the film in the inventroy without ever renting it.

So, the query looks like this:

select f.title
, sum(
if(i.store_id=1
, 1
, null
)
) "store_id=1"
, sum(
if(i.store_id=2
, 1
, null
)
) "store_id=2"
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title

So, there's our crosstab. It wasn't hard at all, just a bit tedious.

Step 4 - continued: Generating the statement


Up untill now, we've put up with the nuisance of 'generating' columns by typing them. Now it's time to stop putting up with that. We've hand-coded a working Example, and it has taken us enough time. We do not want to waste more time on the same problem in the future. And really, we're bound to encounter this problem again. Sooner or later, the data in the store table will change, probably invalidating our crosstab query.

But hey, it's not so bad, is it? We know exactly wich parts of the code correspond to occurrences in the store table, right? And we know what parts of the code are independant of all that. So, let's try and devise something that will generate all that code for us.


select concat(
'select f.title','\n'
, group_concat(
concat(
', sum(','\n'
, ' if(i.store_id=',s.store_id,'\n'
, ' , 1','\n'
, ' , null','\n'
, ' )\n'
, ' )'
, ' "store_id=',s.store_id,'"\n'
)
order by s.store_id
separator ''
)
, 'from film f','\n'
, 'inner join inventory i','\n'
, 'on f.film_id = i.film_id','\n'
, 'left join rental r','\n'
, 'on i.inventory_id = r.inventory_id','\n'
, 'group by f.title','\n'
) statement
from store s

This query returns just one row. The one column will contain a string value that is exactly equal to the statement text of our previous hand-coded crosstab query. Compare the hand-coded crosstab-query SELECT expression with the SELECT list of the generator. As you can see, the generator does not much more than select the concatenated string literals that make up the crosstab query. The only variable element is the GROUP_CONCAT call wich aggregates all the rows from the store table in the FROM list of the generator, yielding the collection of SELECT list expressions that makes up the horizontal axis of the crosstab, and of course the data appearing in the cells.

(By the way, GROUP_CONCAT is a commonly recurring element in these generators. I just love this feature, wich really seems to be unique to MySQL. So much for open source products not being innovative)

Now that we can generate the crosstab query SELECT expression, we will never have to worry about data changing in the store table. We can simply run this query to obtain an updated query for the crosstab.

Step 4 - continued: wrapping generation and execution in a stored procedure


Generating the query is certainly an improvement. By generating the statement, we created a single point of definition that handles the duplication of code (the Column expression) in a structured, data driven way. That saves us time, and more important, it will always be correct. It will never contain syntax errors, and it will always cost the same amount of time to create an updated version of the Statement should the data in the store stable be modified.

It's still a bit of a nuisance that we can't just execute the crosstab query with a single statement though.
Right now, we always need at least two statements: one to generate the statement, and one to execute the generated statement.
We would really like to do that in one statement.

Lucky for us, we can actually do that.
We can create a stored procedure that performs these steps in the right order:

CREATE PROCEDURE `sakilar2`.`p_film_rentals_per_store`()
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
select concat(
'select f.title','\n'
, group_concat(
concat(
', sum(','\n'
, ' if(i.store_id=',s.store_id,'\n'
, ' , 1','\n'
, ' , null','\n'
, ' )\n'
, ' )'
, ' "store_id=',s.store_id,'"\n'
)
separator ''
)
, 'from film f','\n'
, 'inner join inventory i','\n'
, 'on f.film_id = i.film_id','\n'
, 'left join rental r','\n'
, 'on i.inventory_id = r.inventory_id','\n'
, 'group by f.title','\n'
) statement
into @film_rentals_per_store_sql
from store s
order by s.store_id
;
prepare film_rentals_per_store
from @film_rentals_per_store_sql
;
execute film_rentals_per_store
;
deallocate
prepare film_rentals_per_store;
end

Inside the procedure, we first generate the sql that defines the query, and then use the prepared statement syntax to execute the generated statement. It's not as if we really need statement preparation; we just need some dynamic SQL. The prepared statement syntax happens to be the only construct that supports this in MySQL right now.

By now, we've achieved pretty much all we wanted. We can get a crosstab query result by simply calling the procedure. I'ts a bit of a pity we cannot setup a more generalized approach, but I'll settle with this for a while.

Friday, October 28, 2005

Internationalization overdone

...I could've guessed it. A couple of posts ago, I wrote on offline webpages for some webapps. Yesterday, I encountered a really annoying bug in my solution

It's like this, once the pages are offline, you need a way to prevent broken links and references to other pages. Because I always use relative URI's for links in the web app, this is not too big of a problem. What is a problemn though is that my URI's contain queries, as in:


{path}/{resource}?name1=value1&name2=value2


I use the URI to generate a filename, in order for the links to keep on working once the page is taken offline, and stored as a file. Unfortunately, the ? is not a valid character for a filename on Windows (wich is the target platform). So, I have substitute that for another character. I have to do that everywhere I link to a page, and of course when generating the filename.

Fortunately, the only cases of references to pages are through the javascript window.open() method, wich takes, among others, an URI parameter. And I already wrapped that call in my own function to provide the appropriate options to the opened window. So, there already was a single point receiveing all the URI's, making it easy enough to substitute the ? character for the tilde ~.

The only thing I needed now was a way of determining wheter the user was browsing offline. Because this is an Internet Explorers Solution, I used the document.protocol property for that. It the protocol is "Hypertext Transfer Protocol" we must use the ?; else, we must use the ~. Easy enough right?

As it turned out, the value returned by the document.protocol property depends upon the language of the IE installation. On my developer machine, that's English (US); on the users machines, it's Dutch. In wich case the value returned by document.protocol is "Hypertext Overdrachts protocol" or something like that.

It's of course all solvable, you just have to check for document.protocol.indexOf("HyperText")==0, but I really don't see why, and obviously would not expect, such a property to be Language dependent.

Anyone?

Tuesday, October 25, 2005

Just want to say thanks

Hey,

I just found out I've been awarded an iPOD for the beta challenge! Amazing, when I received that June (yep, June already, time just flies when you're having fun) issue of the MySQL newsletter, I decided to try and do something.

As I recall, people were being teased to participate by promising them stuff like coffee mugs, T-shirts....Wich immediatly deepened my breath, and made my heart go BOOM,BOOM,BOOM as you might reckon!

Seriously, what convinced me to download the 5.0.7 release, was Arjen Lentz's "call to arms" to evaluate MySQL 5: a be-real, no-nonsense and to-the-point kind of appeal to test the new release of that well known open source rdbms, and this time, it was harbouring all kinds of enterprise grade features: stored procedures, views, triggers. At least, those were the features that I recall being mentioned repeatedly, let's say, the stuff that lets developers and programmers write more code.

Now, those that read some of the entries in blog probably wouldn't 've guessed it, but I hate writing code. I like to think about code, I just don't like writing it. Being an IT-consultant, database developer and information analyst, this attitude poses a problem, as people sure as hell expect me to deliver code, at least some of the time. Some while ago, I found ways of partially solving the paradox, by having as much of my code being generated for me, giving me more time to just think of what kind of could should be generated in the first place.

Now, to generate code, you need a solid source of information that you can use to drive the generation process. From the practice of generating code, I was already knowledgable about the INFORMATION_SCHEMA standard. So, when I noticed that MySQL was going to go along that road as well, I just had to check it out. Apart from offering a platform for wich you could code, this was also going to be a platform for wich I could go and build code generators too! So that's how I came about drawing that diagram, mainly to get myself acquainted with MySQL's implementation of the INFORMATION_SCHEMA standard.

And you bet, I want to get acquainted, because the way I see it, with this release, MySQL has taken a major step towards the direction of leading RDBMS vendors, at least feature-wise (at least One Of these vendOrs seems tO think sO tOO). I really think it's a mistake to, like I see too many of my colleagues do, keep ignoring MySQL as a product that you can do serious business with. I mean business as in: adding value for my customers, making Real Money with it etc.

Once I had the diagram, I thought, c'mon, let's share that stuff. Since then, I've had some valuable feedback on the diagram from Carsten Segieth and Mike Lischke, wich I think is great. I have already commited myself to maintaining it and keeping it up to date, and I'm going to recreate it in MySQL Workbench real soon. And Andrew Gilfrin's given me an opportunity to publish a code generator based on the mysql information_schema on his site. And now, there's this award. It may not be the coffee mug I hoped for, but I guess an iPOD's nice too ;-D. Terrific, I love it!

So, thanks Carsten and Mike, thanks Andrew and thank you very much Arjen. And of course, thanks to MySQL AB for developing this great product, and making it accessible to so many people. MySQL Rocks!

(walks of the stand, flowers in one hand, iPod in the other, almost bursting in to tears...whoops, I better quit this post)

Formatting MySQL code for the Web...in My:SPL

Among the more time consuming things maintaining this blog is formatting the code examples. I decided it was time for me to take that to the next level so I wrote a HTML formatter for mysql code. In case you're interested, it's available for download (zip 7kb). After downloading, just unzip it to your favourite folder, read the readme.txt, and everything should be fine after that.
I will spend the rest of this blog entry on some explanation and examples.

The Function format_mysql_as_html()


Here's the function that's responsible for actually creating html from mysql code. It is passed a string parameter via p_sql, and cuts that to bits: the tokens.
Dependant upon the type of the token maintained by v_state, it's wrapped into a <span> tag or left just as it is. The <span> is given a class attribute containing part of the name of the type of token, so we can mark it up through css later on.
There's one special type of token, the 'ACCEPT_OTHER' type. A token of this type is checked against the mysql.help_keyword table to see if it is a reserved word, in wich case it gets an extra distinguishing css class. (Alas, this table does not seem to contain all the words that are considered keywords; for example, it does not contain DECLARE or ELSEIF. But, for now, I'll settle for this.)
Finally, a token is produced of the type 'ACCEPT_END_OF_INPUT', wich terminates the function, returning a string of html.

CREATE FUNCTION `utils`.`format_mysql_as_html`(
p_sql longtext
)
RETURNS longtext
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
BLOCK_MAIN: begin
declare v_line int unsigned default 1;
declare v_position int unsigned default 1;
declare v_token varchar(65535) ;
declare v_html_token varchar(65535);
declare v_html longtext default '';

declare v_from_index int unsigned default 1;
declare v_to_index int unsigned default 1;
declare v_state enum(
'ACCEPT_BACKTICK'
, 'ACCEPT_APOS'
, 'ACCEPT_QUOT'
, 'ACCEPT_WHITESPACE'
, 'ACCEPT_SINGLE_LINE_COMMENT'
, 'ACCEPT_MULTI_LINE_COMMENT'
, 'ACCEPT_OTHER'
, 'ACCEPT_END_OF_INPUT'
, 'ACCEPT_ENDLINE'
, 'ACCEPT_DELIMITER'
, 'ACCEPT_AT'
, 'ACCEPT_ATAT'
);

set v_html := concat(v_html,'\n','<pre>');
LOOP_TOKENS: loop
call next_mysql_token(
p_sql
, v_to_index
, v_state
);
set v_token := substring(
p_sql
, v_from_index
, v_to_index - v_from_index
);
set v_html_token := html_entities(v_token);
if v_state = 'ACCEPT_END_OF_INPUT' then
leave LOOP_TOKENS;
elseif v_state in (
'ACCEPT_WHITESPACE'
, 'ACCEPT_ENDLINE'
, 'ACCEPT_DELIMITER'
) then
set v_html := concat(v_html,v_html_token);
else
set v_html := concat(v_html,'<span class="',substring(v_state,8));
if v_state = 'ACCEPT_OTHER' then
if exists (
select null
from mysql.help_keyword
where name = v_token
) then
set v_html := concat(v_html,' RESERVED');
end if;
end if;
set v_html := concat(v_html,'">',v_html_token,'</span>');
end if;
set v_from_index := v_to_index;
end loop LOOP_TOKENS;
set v_html := concat(v_html,'\n','</pre>');

return v_html;
end BLOCK_MAIN


Most of the work is of course done by the next_mysql_token() (Im just printing the signature of that one, not the body. Look it up in the dist. if you're curious):

CREATE PROCEDURE `utils`.`next_mysql_token`(
IN p_text longtext
, INOUT p_index int unsigned
, OUT p_current_state enum(
'ACCEPT_BACKTICK'
, 'ACCEPT_APOS'
, 'ACCEPT_QUOT'
, 'ACCEPT_WHITESPACE'
, 'ACCEPT_SINGLE_LINE_COMMENT'
, 'ACCEPT_MULTI_LINE_COMMENT'
, 'ACCEPT_OTHER'
, 'ACCEPT_END_OF_INPUT'
, 'ACCEPT_ENDLINE'
, 'ACCEPT_DELIMITER'
, 'ACCEPT_AT'
, 'ACCEPT_ATAT'
)
)
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER

This function implements a deterministic state machine. It starts off reading p_text from position p_current_index. Then, it reads characters from p_text, one by one. For each character, the internal state is updated untill an accepting state is reached. Then, it returns.
When it returns, it will have updated the p_current_state as well as p_current_index, wich is advanced to the point right after the previous token. The token itself is easliy excised from the text by taking the substring between the positions indicated by p_current_index before and after the call.

Formatting entire routines


The information_schema database contains a ROUTINES table, providing us with lots of code to feed into format_mysql_as_html(). However, there's a small problem with that. It does not contain information about the parameterlist of the routine. Lucky for us, there's the mysql.proc table (Thanks Pro MySQL, didn't know that one before) that does contain all the information. So, I came up with the get_routine_ddl() function:

CREATE FUNCTION `utils`.`get_routine_ddl`(
p_schema varchar(64)
, p_name varchar(64)
)
RETURNS longtext
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
declare v_routine_ddl longtext;
select concat(
'CREATE ',p.type
, ' `',p.db,'`.`',p.name,'`'
, '(',convert(p.param_list using utf8),')','\n'
, if( p.type='FUNCTION'
, concat('RETURNS ',p.returns,'\n')
, ''
)
, 'LANGUAGE ',p.language,'\n'
, if( p.is_deterministic='NO'
, 'NOT '
, ''
), 'DETERMINISTIC','\n'
, replace(p.sql_data_access,'_',' '),'\n'
, 'SQL SECURITY ',p.security_type,'\n'
, convert(p.body using utf8)
)
into v_routine_ddl
from mysql.proc p
where p.db = p_schema
and p.name = p_name
;
return v_routine_ddl
;
end


This function takes the name of a schema and the name of a function or stored procedure, and returns the complete DDL for it. Because it needs to read from the mysql.proc table, I decided it would be best to create it with SQL SECURITY DEFINER. Then it can be created by a user that has access to that table, but can be executed by anyone that is granted execution of the function.
To top it off, I decided it would be fun to be able to write the formatted ddl directly to file. This can be done with the SELECT...INTO OUTFILE syntax, but sadly that does not accept a variable to specify the filename.
No sweat, since MySQL 5.0.13, the prepared statement syntax is available inside stored procedures! This comes in handy in the next function:

CREATE PROCEDURE `utils`.`write_routine_html_file`(
p_schema varchar(64)
, p_name varchar(64)
, p_outfile varchar(255)
)
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
declare v_quote char(1) default '''';

set @statement_text := concat(
'select '
, 'utils.format_mysql_as_html('
, 'utils.get_routine_ddl(?,?)'
, ')'
, ' into outfile '
, v_quote
, p_outfile
, v_quote
, ' fields escaped by '
, v_quote,v_quote
, ' lines terminated by '
, v_quote,v_quote
);

set @p_schema := p_schema;
set @p_name := p_name;

prepare stmt
from @statement_text
;
execute stmt
using @p_schema
, @p_name
;
deallocate prepare stmt
;
end


This function just wraps the get_routine_ddl() function inside a prepared SELECT...INTO OUTFILE allowing for a variable filename.

Now, using a statement like:

call utils.write_routine_html_file(
'utils'
, 'format_mysql_as_html'
, 'D:/temp/format_mysql_as_html.txt'
);


generates the listing found on top of this blog entry...effortlessly.

Saturday, October 22, 2005

Nesting MySQL Cursor Loops

Handling cursor loops in MySQL keeps puzzling people.

Single Cursor Loops


The common case is to have a simple cursor loop. Each record is fetched from the cursor until there are now more records, and for each record some work is performed.
Let's take a look at this snippet:

01 begin
02 declare v_col1 int; -- define the data to grab from the cursor
03 declare no_more_rows boolean := FALSE; -- define the loop control variable
04 declare cursor1 cursor for -- define the set to iterate through
05 select col1
06 from MyTable;
07 declare continue handler for not found -- Capture cursor exhaustion event
08 set no_more_rows := TRUE; -- modify the loop control variable, then return
09
10 open cursor1; -- execute query, and save the result so we can iterate it
11 LOOP1: loop -- start repeatable region of code
12 fetch cursor1 -- grab data from current cursor record
13 into v_col1 -- this will raise NOT FOUND if the cursor is exhausted
14 ;
15 if no_more_rows then -- check the loop control variable, the handler might've modified it
16 close cursor1; -- free resources
17 leave LOOP1; -- stop iterations
18 end if;
19 --
20 -- Process data here
21 --
22 end loop LOOP1;
23 end;


This is a fragment of a MySPL Procedure.
To summarize, the CURSOR is processed by iteration through an ordinary LOOP that happens to FETCH rows from the cursor. When the cursor's exhausted, the NOT FOUND condition is raised, transferring control to a HANDLER that sets a flag. This flag is checked inside the loop, right after the fetch, and when it is set, iteration is terminated.

Handling cursor exhaustion


An important point to make right away is that, unlike the OPEN (line 10), FETCH (line 12) and CLOSE (line 16) statements, there is no formal relationship between the HANDLER declaration (lines 07..08) and the CURSOR declaration (lines 04..06). For instance, the HANDLER declaration never refers to cursor1, the name of the cursor. It just 'happens' to be the case that FETCH-ing eventually leads to a situation in wich the cursor will have reached it's last record. Such an exhausted cursor will raise the NOT FOUND condition on attempt to FETCH from it again. This condition can then be captured by the, scopewise, nearest appropriate HANDLER.

Nested Cursor Loops


Sometimes you need to nest two cursors. I don't want to discuss what problems are appropriately solved with this device right now. Let me just say that most of the time someone is using this device, they are doing that because they don't know how to write a join properly.
The problem with nesting curors is this: because we can't associate a HANDLER with a particular CURSOR we have to think of a way to know wich one of the cursors is exhausted once the NOT FOUND condition is raised. We really need to know that, because our loop controls depend on it. We definitely want to terminate the corrsponding loop when one of the cursors will be exhausted, don't we?

Using a separate block


Perhaps one of cleanest ways to do this is to rely on the scopewise propagation of the condition. By putting the inner LOOP into it's own BEGIN..END block, we have created the possibility to give that block it's own NOT FOUND handler. A NOT FOUND condition raised by a FETCH in the inner LOOP will thus be captured by the appropriate HANDLER declared in the inner BEGIN..END block. That's because that inner block's handler is nearer than the handler declared in the outer block. From there, the inner loop can easily be terminated:

01 BLOCK1: begin
02 declare v_col1 int;
03 declare no_more_rows boolean1 := FALSE;
04 declare cursor1 cursor for
05 select col1
06 from MyTable;
07 declare continue handler for not found
08 set no_more_rows1 := TRUE;
09 open cursor1;
10 LOOP1: loop
11 fetch cursor1
12 into v_col1;
13 if no_more_rows1 then
14 close cursor1;
15 leave LOOP1;
16 end if;
17 BLOCK2: begin
18 declare v_col2 int;
19 declare no_more_rows2 boolean := FALSE;
20 declare cursor2 cursor for
21 select col2
22 from MyOtherTable
23 where ref_id = v_col1;
24 declare continue handler for not found
25 set no_more_rows2 := TRUE;
26 open cursor2;
27 LOOP2: loop
28 fetch cursor2
29 into v_col2;
30 if no_more_rows then
31 close cursor2;
32 leave LOOP2;
33 end if;
34 end loop LOOP2;
35 end BLOCK2;
36 end loop LOOP1;
37 end BLOCK1;

Note that the inner BEGIN..END block's handler can never capture the NOT FOUND condition raised by the FETCH in the outer block, because conditions are propagated from the inside to the outside.
Another advantage of this approach is that the inner loop is almost entirely self-contained. The only thing that binds the inner BEGIN..END block to the outer one is a reference to the variable v_col1 in the decalaration of cursor2. Everything that has to do with the inner loop is defined in just one place, BLOCK2.

Resetting the loop control variable


The disadvantage of the previous method is of course the duplication of code. Apart from the cursor itself, we have to declare the loop control variable and the handler twice too. Sometimes, this is way too much trouble. Lucky for us, there's a way to evade all that, by resetting the loop control variable just before the inner loop terminates. Check it out:


00 begin
01 declare no_more_rows boolean default false;
02 declare v_col1 int;
03 declare v_col2 int;
04 declare cursor1 cursor for
05 select col1
06 from MyTable
07 ;
08 declare cursor2 cursor for
09 select col2
10 from MyOtherTable
11 where ref_id = v_col1;
12 declare continue handler for not found
13 set no_more_rows := true;
14
15 open cursor1;
16 LOOP1: loop
17 fetch cursor1 into v_col1;
18 if no_more_rows then
19 close cursor1;
20 leave LOOP1;
21 end if;
22 open cursor2;
23 LOOP2: loop
24 fetch cursor2 into v_col2;
25 if no_more_rows then
26 set no_more_rows := false;
27 close cursor2;
28 leave LOOP2;
29 end if;
30 end loop LOOP2;
31 end loop LOOP1;
32 end;


Observe te difference between the handling of the loop termination. For the outer loop, we see the usual sequence of statements to close the cursor and to leave the loop (lines 18..21). For the inner loop, we use an additional assignment to reset the loop control variable (line 26). This ensures the outer loop gets a chance to roll off untill it's exhausted too.

Packing cursor loops into stored procedures


Another approach would be to pack all the code for the dependant, inner cursor loop into a separate stored procedure and have the outer loop CALL that. This approach resembles the first one where we put the inner loop in it's own BEGIN..END block:


01 create procedure p_inner_loop(
02 IN p_col1 int
03 )
04 begin
05 declare v_col int;
06 declare no_more_rows boolean := FALSE;
07 declare cursor cursor1 for
08 select col2
09 from MyOtherTable
10 where ref_id = p_col1;
11 declare continue handler for not found
12 set no_more_rows := TRUE;
13 open cursor1;
14 LOOP1: loop
15 fetch cursor1
16 into v_col;
17 if no_more_rows then
18 close cursor1;
19 leave LOOP1;
20 end if;
21 end loop LOOP1;
22 end;


And


01 BLOCK1: begin
02 declare v_col1 int;
03 declare no_more_rows boolean1 := FALSE;
04 declare cursor1 cursor for
05 select col1
06 from MyTable;
07 declare continue handler for not found
08 set no_more_rows1 := TRUE;
09 open cursor1;
10 LOOP1: loop
11 fetch cursor1
12 into v_col1;
13 if no_more_rows1 then
14 close cursor1;
15 leave LOOP1;
16 end if;
17 call p_inner_loop(v_col1);
18 end loop LOOP1;
19 end BLOCK1;


This approach is especially useful when you want to be able to use that inner loop from other contexts as well. When cursor processing is becoming complex, it's probably a good idea to use this technique as a "divide and conquer" strategy to keep development managable. For example, this technique allows you test the different cursor loops separate from each other, wich can be a great aid in debugging.

Monday, October 17, 2005

Know Thy Schema...without typing "information_schema.%" so much

The MySQL information_schema (wich is available as of version 5.0.2) is a very powerful feature that can help you maintain or administrate your databases. You can use it to investigate the existence and status of database objects (such as tables, constraints and indexes) and their components (such as columns).

I like to use the MySQL command line tool for lots of common tasks. From the command line tool, I frequently feel the need to query the information_schema. In most cases, I just need to have a quick overview of a database's assets. In fact, I took up the habit in my job as an Oracle developer. In the Oracle command line tool, SQL*Plus, I'd do something like:


select object_name
, object_type
, created
from all_objects o
where owner = 'SCOTT'
order by object_type
, object_name


and this would give me a list of all the objects accessible to the current user that reside in the schema owned by the user SCOTT. For each object, the schema in wich the object resides would be listed, along the object's name and the date (actually the datetime, but the formatting leaves out the time component) the object was created:


OBJECT_NAME OBJECT_TYPE CREATED
------------------------------ ------------------ --------
PK_DEPT INDEX 12-05-02
PK_EMP INDEX 12-05-02
PK_NODE INDEX 18-08-05
SYS_C003897 INDEX 09-08-05
UK_NODE1 INDEX 18-08-05
UK_NODE2 INDEX 18-08-05
EMPREPORT PACKAGE 20-03-05
EMPREPORT PACKAGE BODY 20-03-05
BONUS TABLE 12-05-02
DEPT TABLE 12-05-02
EMP TABLE 12-05-02
IMAGES TABLE 20-03-05
NODE TABLE 18-08-05
PK TABLE 09-08-05
PRODUCT TABLE 18-08-05
SALGRADE TABLE 12-05-02


With the mysql information_schema database, we can achieve the same sort of list. However, there's no table there that contains all these different types of objects.

Another thing that can be a bit frustrating is that such a query can be quite lengthy. The identifiers used throughout the information_schema database are quit long and verbose.

select avg(
character_length(
identifiers.identifier
)
) avg_character_length
, min(
character_length(
identifiers.identifier
)
) min_character_length
, max(
character_length(
identifiers.identifier
)
) max_character_length
from (
select table_name identifier
from information_schema.tables
where table_schema = 'information_schema'
union all
select column_name
from information_schema.columns
where table_schema = 'information_schema'
) identifiers
;

+----------------------+----------------------+----------------------+
| avg_character_length | min_character_length | max_character_length |
+----------------------+----------------------+----------------------+
| 13.1279 | 2 | 37 |
+----------------------+----------------------+----------------------+


On the one hand, these elaborate identifiers make it easy to understand the internal structure of the information_schema and the queries that are built upon it, wich is good. On the other hand, it easily becomes a nuisance when you have to type them over and over again.

On top of that, I usually query the information_schema database in a situation where it is not the current database. Lazy me just wants to perform an ad hoc query to get some information on the current database, and not switch databases first with a use command.

All in all, in it's raw form, the information_schema is not that suitable for quickly inspecting a database. I decided to make life easier for myself, and some time ago, I created a database to store all kinds of general purpose utilities.


create database utils
;


Among my utilities are a couple of views built on the information_schema. In these views, I tried to find a balance between ease of use and comprehensibility. I'm printing them all here, hoping someone else will benefit.

The objs View


First of all, there's the objs view:


create or replace
view
utils.objs (
db
, namespace
, name
, type
, extra
, created
, comments
)
as
select trigger_schema
, _utf8'TRIGGERS'
, trigger_name
, _utf8'TRIGGER'
, concat(
action_timing
, _utf8' '
, event_manipulation
, _utf8' ON '
, table_name
, _utf8' FOR EACH '
, action_orientation
) as extra
, created
, _utf8''
from information_schema.triggers
union all
select routine_schema
, _utf8'ROUTINES'
, routine_name
, routine_type
, concat(
if(is_deterministic='NO','NOT','')
, _utf8' DETERMINISTIC '
, ' SQL SECURITY '
, security_type
, sql_data_access
) as extra
, created
, routine_comment
from information_schema.routines
union all
select table_schema
, _utf8'TABLES'
, table_name
, table_type
, concat(
'ENGINE='
, engine
, ' ROW_FORMAT='
, row_format
) as extra
, create_time
, table_comment
from information_schema.tables


The objs view yields one row for each database object. I decided to include the following objects: base tables, (system) views, procedures, functions and triggers. I decided to leave out indexes, because you can't identify an index by name in a given database: you'll also need the table name to be absolutely sure.

The columns are:

db

The database (schema) in wich the object resides.

namespace

The namespace. Currently, there are three namespaces: 'TABLES' (base tables, views, system views); 'ROUTINES' (functions, stored procedures) and 'TRIGGERS'

name

The name of the object, wich is unique within the namespace given a particular database

type

The actual object type. Values include: 'BASE TABLE', 'VIEW' and 'SYSTEM VIEW' for objects in the 'TABLES' namespace; 'FUNCTION' or 'BASE TABLE', 'VIEW' and 'SYSTEM VIEW'for the 'ROUTINES'namespace; and 'TRIGGER' for the 'TRIGGERS'namespace

extra

This column shows some extra information, dependant upon the object type.

created

The datetime corresponding to the that the object was created

comments

The comments describing the object




Now, to display a list of all the objects in the current database, just do:


select *
from utils.objs
where db = schema()
;


The refs View


The refs view lists information about foreign keys. Here's the code:


create
or replace
view
utils.refs (
db
, cons
, tab
, ref_db
, ref_cons
, ref_type
, ref_tab
, nulls
, idpart
)
as
select c.constraint_schema
, c.constraint_name
, c.table_name
, p.constraint_schema
, p.constraint_name
, p.constraint_type
, p.table_name
, max(ctcol.is_nullable)
, case count(cidcon.constraint_name)
when 0 then 'NO'
else 'YES'
end
from information_schema.table_constraints c
inner join information_schema.key_column_usage kc
on c.table_schema = kc.table_schema
and c.table_name = kc.table_name
and c.constraint_name = kc.constraint_name
inner join information_schema.columns ctcol
on kc.table_schema = ctcol.table_schema
and kc.table_name = ctcol.table_name
and kc.column_name = ctcol.column_name
left join information_schema.key_column_usage cidcol
on kc.table_schema = cidcol.table_schema
and kc.table_name = cidcol.table_name
and kc.column_name = cidcol.column_name
left join information_schema.table_constraints cidcon
on cidcol.constraint_schema = cidcon.constraint_schema
and cidcol.constraint_name = cidcon.constraint_name
and cidcol.table_name = cidcon.table_name
and cidcon.constraint_type != 'FOREIGN KEY'
inner join information_schema.key_column_usage kp
on kc.referenced_table_schema = kp.table_schema
and kc.referenced_table_name = kp.table_name
and kc.referenced_column_name = kp.column_name
and kc.position_in_unique_constraint = kp.ordinal_position
inner join information_schema.table_constraints p
on kp.table_schema = p.table_schema
and kp.table_name = p.table_name
and kp.constraint_name = p.constraint_name
where p.constraint_type != 'FOREIGN KEY'
and c.constraint_type = 'FOREIGN KEY'
group by c.constraint_schema
, c.constraint_name
, c.table_name
, p.constraint_schema
, p.constraint_name
, p.constraint_type
, p.table_name
;


Apart from the foreign key name and table, the table that is referred to is listed too, along with the primary key or unique constraint of wich the columns are referenced by the foreign key columns. As a bonus, some extra information is shown about the type of relationship implemented by the foreign key.


db

The database in wich the foreign key table (and thus, the foreign key constraint) resides.

cons

The name of the foreign key constraint

tab

The name of the table with the foreign key constraint a.k.a. the child or detail table

ref_db

The database that contains the referenced table

ref_cons

The unique or primary key constraint of the referenced table that contains the columns that are referred by the foreign key columns

ref_type

The type of the referenced constraint: either 'PRIMARY KEY' or 'UNIQUE'

ref_tab

The referenced (a.k.a master or lookup) table

nulls

Whether the foreign key columns accept NULL values: 'YES' (foreign key columns accept nulls, the relationship is not mandatory) or 'NO' (foreign key columns cannot have nulls, thus the relationship is mandatory)

idpart

Wheter the foreign key columns are also part of a 'PRIMARY KEY' or 'UNIQUE' constraint in the referencing table: 'YES' (foreign key columns are partially identifying) or 'NO' (foreign key columns are not part of an identifier)



The refs view does not take into account that MySQL/innodb supports foreign keys that do not refer to either a 'PRIMARY KEY' or a 'UNIQUE' constraint.

(Q: Wow, can MySQL do that? A: Yes, it can! Q: Why would you want to do that anyway? A: uhmm, don't really know, but you can post on this subject if you have a suggestion).

The idxs View


The idxs view lists information about indexes. Here's the code:


create
or replace
view
utils.idxs (
db
, tab
, name
, type
, id
, nulls
, cols
, maxs
, mins
)
as
select t.table_schema
, t.table_name
, s.index_name
, s.index_type
, if(s.non_unique=0,_utf8'YES',_utf8'NO')
, if(s.nullable='YES',_utf8'YES',_utf8'NO')
, count(seq_in_index)
, round(max(s.cardinality)/t.table_rows,2)
, round(min(s.cardinality)/t.table_rows,2)
from information_schema.tables t
inner join information_schema.statistics s
on t.table_schema = s.table_schema
and t.table_name = s.table_name
group by t.table_schema
, t.table_name
, s.index_name
, s.index_type
, s.non_unique
, t.table_rows
;


This view yields one row for each index.


db

The database in wich the index table (and thus, the index) resides.

name

The name of the index

tab

The name of the table that defines the index

id

Whether the index accepts duplicate values or value combinations: 'YES' when this is an unique index, else 'NO' if the index accepts duplicates

nulls

Whether the index columns accept NULL values: 'YES' (index columns accept nulls) or 'NO' (index columns are mandatory)

cols

The number of columns in de index

maxs

The maximum selectivity for all columns in the index

mins

The minimum selectivity for all columns in the index



So, good luck with this everyone. I hope this will be as useful to you as it was to me.

Sunday, October 09, 2005

Support for scheduled tasks in MySQL 5.1

I received a very interesting comment on my previous post, "Scheduling procedure execution in MySQL using SLEEP() and EXECUTE"


In 5.1 we have the following support:


SCHEDULE
[ schedule-name ]
{ START TIME start-time | BETWEEN start-time AND end-time }
[ EVERY period { HOURS | MINUTES | SECONDS } ]
[ ON { ( day-of-week, ... ) | ( day-of-month, ... ) } ]
[ START DATE start-date ]


So we are looking at scheduled event support directly in the database.

Cheers,

-Brian


(I assume Brian is actually Brian Aker from MySQL AB. If not, could the real Brian Aker correct me on that?)

I think this is very, vey interesting news. And I like the idea of defining scheduled tasks through a kind DDL statement too. Oracle jobs are different: you've got a package (basically, a container of types, variables, cursors, procedures, functions) called DBMS_JOB. You call procedures in this package to define the job, and this results in some data being written to a table, wich is used later on to run the jobs. Of course, the solution I presented was modelled after that.

I do hope that this syntax will allow us to have a job run, say, every 1st of the month. Oracle's DBMS_JOB can do this, because it allows you to specify the next date for job running in the form of a string that is evaluated as a date expression. So in Oracle,


DBMS_JOB.SUBMIT(
job => v_job_id
, what => 'p_calculate_salaries'
, next_date => to_date('2005-11-01','YYYY-MM-DD')
, interval => 'add_months(sysdate,1)'
);


would run the job on the first of november for the first time, and each time the job runs, the next date to run the job would be calculated by evaluating the expression passed to the interval parameter. In this case, the current date plus one month.

Maybe the [ ON { ( day-of-week, ... ) | ( day-of-month, ... ) } ] could deal with a month?

But even the MySQL SCHEDULE syntax won't allow this, it still think it's a cool feature.

I also hope the scheduled tasks will be accessible through the data dictionary, or at least in the mysql database. I think that this willl probably be te case, I mean, MySQL will have to store the data associated with the schedule somewhere, so why not a table.

Knowing this functionality will become available in the near future, I probably won't be doing too much about my schedular. I'll probably build a small example, and make some additions available I made yesterday.

Anyway, I had fun building it. After all it is the first-not-totally-useless thing I built in MySPL, and it gave me to opportunity to work with SLEEP(), PREPARE and EXECUTE. And I learned to use GET_LOCK() and RELEASE_LOCK(). Maybe not too hot if you're a regular MySQL developer, but very useful when you aren't (yet) like me.

Friday, October 07, 2005

Scheduling procedure execution in MySQL using SLEEP() and EXECUTE

Both Oracle and MSSQL provide tools and utilities to run scheduled database jobs. A job is just a piece of work - a program, a script or report - that runs periodically without human interaction. Jobs are especially useful for doing work that can be automated and work that can or should be performed when there are no humans to do it. Typical examples include loading a datawarehouse or exporting large amounts of data in text formats.

MySQL does not yet provide builtin functionality to work with database jobs. But in most cases, this is not really a problem. At least, not one that can't be solved. In most cases, some generic external scheduling tool like Cron (Linux) or Scheduled Tasks (Windows) can be used to do this.

Although these tools are probably sufficient, I decided that it would be fun to build a scheduler in the MySQL stored procedure language (Yep, I'll be calling that MySPL again). For one thing, I was just curious whether it would be possible.
I mean, all the ingredients are there really:

  • we can build tables to hold the information that comprises a schedule and some definition of the actual work that needs to be done by the job

  • we can build procedures that access data from these tables, and have the flow of control be influenced by this data

  • as of the 5.0.12 version, we've got the SLEEP() function, wich is really important to implement recurrence. It's a bit tucked away inside the docs, but it's there allright!

  • as of the 5.0.13 version, we can use the prepared statement syntax from within MySPL



For those that are interested in all the details, there's an installation script available for download right here. It contains the bare schedular, and a little API as well. This little API makes it quite easy to define a job or alter it's characteristics. There's no documentation yet, but I hope this blog post can get you started. I am planning on writing some examples and documentation in the near future. I did include COMMENTs, so that should be of some help.

How do these pieces work together?


We can set up a table to hold the data that defines the actual work as well as the data associated with actually scheduling job execution. This information can be used by a particular procedure, the job runner. The job runner can be called, passing it a reference to a particular job. Then, the job runner takes care of scheduling and executing the actual work associated with the job.

Basically, the job runner implements a loop that uses the prepared statement syntax to perform the actual work. For each such individual run, the CURRENT_DATETIME() for start and finish are recorded, and this is used to have the SLEEP() function suspend the loop until the next execution is due. The scheduling information is retrieved anew for each cycle, so job execution can be influenced by changing the schedule data while the job runner is running the job.

Now there's something about this job runner mechanism that needs to be clarified right away. The job runner is just an ordinary procedure that uses the information in the job table. So, unlike the case with Oracle database jobs, there's no background process that automatically takes care of running jobs once we define them in the job table. We need to first define them and then call a job runner to actually run the job.

Because there's no way to call the job runner procedure asynchronously, the client that's executing a job runner will block until the entire job schedule has finished.
However, once a job runner is running, it will notice particular changes made to the job schedule. So, the timewindow wherein the job is supposed to run can be manipulated, as well as the interval between job cycles. We can also have the job runner inspect a status column. With these devices, we can instruct a running job gracefully (that is, without killing the thread running the job) from inside another thread, simply by updating data in a table.

The job base table


We'll start by The Most Important Thing First, and that's, yep you guessed it right: the data. We are going to need data to define the characteristics of a job, that is, when it should become active, what it's name is, what work it needs to do...that stuff. Once we're running the job, we will need status information: data to see if a particular job is done etc.
Here's what I came up with:


delimiter go

drop table if exists job
go

create table job (
job_id
int unsigned
NOT NULL
auto_increment
COMMENT 'Job identification for reference purposes'
, job_owner
varchar(77)
NOT NULL
COMMENT 'The name of the user that owns this job'
, job_name
varchar(129)
NOT NULL
COMMENT 'The per-owner unique name of this job.'
, job_description
varchar(1024)
COMMENT 'A description of this job.'
, job_procedure_schema
varchar(64)
NOT NULL
COMMENT 'The schema in which the job procedure resides.'
, job_procedure_name
varchar(64)
NOT NULL
COMMENT 'The name of the job procedure.'
, job_schedule_start
datetime
COMMENT 'If applicable, the start of the time window for which the job is scheduled.'
, job_schedule_finish
datetime
COMMENT 'If applicable, the end of the time window for which the job is scheduled.'
, job_schedule_interval
int unsigned
NOT NULL
COMMENT 'The number of seconds between starts of job execution instances'
, job_status
enum(
'ABORTED'
, 'DISABLED'
, 'ENABLED'
, 'ERROR'
, 'FINISHED'
, 'PAUSED'
, 'RESUMED'
, 'RUNNING'
, 'SLEEPING'
)
NOT NULL
default 'ENABLED'
COMMENT 'The current status of the job'
, job_runner_connection_id
int unsigned
COMMENT 'The CONNECTION_ID of the connection executing the runjob procedure for this job.'
, job_runner_start
datetime
COMMENT 'The last time a jobrunner was started for this job'
, job_runner_finish
datetime
COMMENT 'The last time a jobrunner finished this job'
, job_runner_cycles
int unsigned
COMMENT 'The number of cycles the jobrunner successfully performed since jobrunner_start'
, job_runner_last_cycle_start
datetime
COMMENT 'The last time a job cycle started'
, job_runner_last_cycle_finish
datetime
COMMENT 'The last time a job cycle finished'
, constraint pk_job primary key (
job_id
)
, constraint uk_job1 unique (
job_owner
, job_name
)
)
ENGINE=MyIsam
DEFAULT CHARSET=latin1
COMMENT='A table that schedules stored procedure execution'
go


Although you could put a lot more effort in defining table structures to make up a nice complete scheduler, I decided that a simple structure would do for most of the cases. The following brief explanation of the columns is meant to draw a general outline. The table contents are to be manipulated using a little API of MySPL procedures.

job definition


A job belongs to a particular database user, the job_owner. A job also has a job_name, wich uniquely identifies the job per owner. Optionally, a job_description can be entered so that an administrator or developer can read what the job is supposed to do. Also, an job_id was added to make it easy to reference a particular job.

The actual work that's performed by the job is represented as stored procedure. So, in order to define a job, you will first have to write yourself a (parameterless) procedure that does all the actual work. Then, you can enter the schema and the name of this procedure in the job_procedure_schema and job_procedure_name columns of the job table.

Note that there's really no technical reason to require a stored procedure for this.
In fact, any statement that can be executed from within the job runner procedure using the prepared statement syntax would do equally well. However, I chose to require a stored procedure because I think it's nicer that way. It ensures that there's a clear seperation between the job schedule definition and the actual work performed within it.

job scheduling


We still need some information to actually schedule job execution. Therefore, we have the job_schedule_start and job_schedule_end columns.
Together, these define a timeframe wherein the job is supposed to be automatically executed. The job_schedule_interval determines the frequency of execution.

Right now, scheduling is very primitive. Because you can only schedule using a fixed interval, it is impossible to schedule things at, say, every first of the month. That's because a month is not a fixed period of time. Oracle jobs use an expression that's evaluated at runtime to determine when the next execution will start. This is something I'm thinking of implementing, but for now, we're stuck with fixed intervals.

running the schedule


The other columns provide information about the current job runner executing the job.
Some are there for monitoring purposes, but some others are used to actully maintain the job runner's state. Most of these columns are updated each time a loop cycle done by the job runner starts or finishes the procedure associated with the job.

The job_status provides information about the state of the job: wheter it needs to be started, wheter is is currently executing the job procedure, and more.
Most of the available statuses are automatically provided by the job runner once it's running a job. The job_runner_connection_id identifies the thread currently running the job. This is used internally to prevent different job runner instances from running the same job. This can also be useful to kill the job.
The job_runner_start and job_runner_finish record the start and finish of the execution of the job runner procedure itself. The job_runner_cycles contains the number of cycles that the job runner has gone through since it was started. The job_runner_cycle_last_start and job_runner_cycle_last_finsished columns are used to record when the last cycle started or finished.

The job$run procedure


Once you scheduled jobs by entering data into the job table, you can invoke the job runner to actually execute the job according to schedule. I've implemented this in the procedure job$run. What happens in this procedure is this:

  1. We try to get a lock for this job runner. This prevents having more than one job runner executing the same job simultaneously

  2. If we succeed in acquiring the lock, we retrieve some data describing the job. We use this data decide wheter the job is eligable for running. For example, it could be the case that the job_schedule_finish is earlier than the current date and time, wich makes the job not eligable for running.

  3. If we decide the job is eligable for running, we could have to wait in case the current date and time are earlier than the job_schedule_start. THis is achieved with the SLEEP() function

  4. Once we reach the job_schedule_start date and time, we can start cycles executing the job procedure specified by the job_procedure_schema and job_procedure_name columns. Actual execution is performed by the PREPARE and EXECUTE syntax.

  5. Cycling will occur at the frequency specified by job_schedule_interval. Again, the SLEEP() is used. This time, we need to SLEEP() through the period between the finish of the EXECUTE and start of EXECUTE plus job_schedule_interval seconds

  6. For each cycle, some condition could arise that should terminate cycling. For example, job_schedule_finish could be reached, or job_status may be updated by another thread to actively initiate termination.




delimiter go

drop procedure if exists job$run
go

create procedure job$run(
-- the name of the job to run
-- this must correspond to an existing row in the job table
-- for which current_user() = job.job_owner
-- and for wich the p_job_name = job.job_name
in p_job_name varchar(64)
)
comment 'Activates a previously defined stored procedure execution schedule.'
BLOCK_MAIN: begin

declare JOBSTATUS_ABORTED varchar(7)
default 'ABORTED';
declare JOBSTATUS_ENABLED varchar(7)
default 'ENABLED';
declare JOBSTATUS_ERROR varchar(5)
default 'ERROR';
declare JOBSTATUS_FINISHED varchar(8)
default 'FINISHED';
declare JOBSTATUS_PAUSED varchar(6)
default 'PAUSED';
declare JOBSTATUS_RUNNING varchar(7)
default 'RUNNING';
declare JOBSTATUS_SLEEPING varchar(8)
default 'SLEEPING';

declare v_message varchar(1024);

declare MESSAGE_JOB_LOCKED varchar(1024)
default 'Job already locked by connection with id: ';
declare MESSAGE_ERROR_LOCKING_JOB varchar(1024)
default 'An error occurred while locking the job';
declare MESSAGE_JOBSTATUS_NOT_ENABLED varchar(1024)
default 'Current job status not ''ENABLED''';
declare MESSAGE_SCHEDULE_HAS_EXPIRED varchar(1024)
default 'The schedule has expired';
declare MESSAGE_INTERRUPTED varchar(1024)
default 'Interrupted while sleeping';
declare MESSAGE_NO_SUCH_JOB varchar(1024)
default 'No such job';
declare MESSAGE_PROCEDURE_FAILED varchar(1024)
default 'The job procedure raised an exception';
declare MESSAGE_JOB_ABORTED varchar(1024)
default 'Job aborted at external request';
declare MESSAGE_OK varchar(1024)
default 'Ok';

declare v_lock_status tinyint unsigned;
declare v_locked_by int unsigned;

-- get a lock to prevent simultaneous execution of the same job
call job$get_lock(
p_job_name
, v_lock_status
, v_locked_by
);
if v_lock_status = 1 then
-- Ok, we're now the one and only job runner for this job
BLOCK_LOCK_ACQUIRED: begin
declare v_current_timestamp timestamp
default current_timestamp();
declare v_job_id int unsigned;
declare v_job_schedule_start datetime;
declare v_job_status enum(
'ABORTED'
, 'DISABLED'
, 'ENABLED'
, 'ERROR'
, 'FINISHED'
, 'PAUSED'
, 'RUNNING'
, 'SLEEPING'
);
declare exit handler for not found
set v_message := MESSAGE_NO_SUCH_JOB;


-- retrieve some general data and decide whether this job is eligble to run
select j.job_id
, coalesce(
j.job_schedule_start
, v_current_timestamp
)
, case
when j.job_status != JOBSTATUS_ENABLED
then MESSAGE_JOBSTATUS_NOT_ENABLED
when j.job_schedule_finish < v_current_timestamp
then MESSAGE_SCHEDULE_HAS_EXPIRED
else MESSAGE_OK
end as message
, concat(
'CALL'
, ' '
, '`',j.job_procedure_schema,'`'
, '.'
, '`',j.job_procedure_name,'`'
, '()'
)
into v_job_id
, v_job_schedule_start
, v_message
, @job$run
from job as j
where j.job_owner = current_user()
and j.job_name = p_job_name
;
if v_message = MESSAGE_OK then
-- Yes, this job is eligble to run
BLOCK_ENTER_RUNJOBS: begin

declare v_sleep int unsigned;
declare v_job_schedule_interval int unsigned;
declare v_job_schedule_finish datetime;
declare v_job_runner_cycles int unsigned
default 0;

declare exit handler for not found
begin
set v_message := MESSAGE_NO_SUCH_JOB;
set v_job_status := JOBSTATUS_ERROR;
end;

-- preparing the job procedure
prepare stmt_jobproc
from @job$run
;
-- notify the outside we're running the job
-- it's just for monitoring purposes anyway
-- the lock is supposed the provide the actual safety
update job
set job_status = JOBSTATUS_RUNNING
, job_runner_connection_id = connection_id()
, job_runner_start = v_current_timestamp
, job_runner_finish = NULL
, job_runner_cycles = NULL
, job_runner_last_cycle_start = NULL
, job_runner_last_cycle_finish = NULL
where job_id = v_job_id
;
-- check if job is scheduled in the future in order to wait for it
if v_job_schedule_start > v_current_timestamp then
update job
set job_status = JOBSTATUS_SLEEPING
where id = v_job_id
;
-- sleep until job cycling is due
set v_sleep := timestampdiff(
SECOND
, v_current_timestamp
, v_job_schedule_start
);
if sleep(v_sleep) = 1 then
set v_message := MESSAGE_INTERRUPTED;
set v_job_status := JOBSTATUS_ERROR;
leave BLOCK_ENTER_RUNJOBS;
end if;
end if
;
-- this is the main loop
LOOP_RUNJOBS: loop
-- renew schedule data, it might've changed while we where asleep
select job_schedule_finish
, job_schedule_interval
, job_status
, current_timestamp()
into v_job_schedule_finish
, v_job_schedule_interval
, v_job_status
, v_current_timestamp
from job j
where j.job_id = v_job_id
;
-- decide wheter to finish or to continue cycling
if v_current_timestamp > v_job_schedule_finish then
set v_job_status := JOBSTATUS_FINISHED;
leave LOOP_RUNJOBS;
elseif v_job_status = JOBSTATUS_ABORTED then
leave LOOP_RUNJOBS;
else
-- skip the actual work when we're pausing
if v_job_status != JOBSTATUS_PAUSED then
update job
set job_status = JOBSTATUS_RUNNING
, job_runner_last_cycle_start = v_current_timestamp
, job_runner_cycles = v_job_runner_cycles
where job_id = v_job_id
;
-- calling the job procedure
BLOCK_EXECUTE_JOBPROC: begin
declare exit handler for sqlexception
begin
set v_job_status := JOBSTATUS_ERROR;
set v_message := MESSAGE_PROCEDURE_FAILED;
leave LOOP_RUNJOBS;
end;
execute stmt_jobproc;
end BLOCK_EXECUTE_JOBPROC
;
set v_job_runner_cycles := v_job_runner_cycles + 1
;
update job
set job_status = JOBSTATUS_SLEEPING
, job_runner_last_cycle_finish = current_timestamp()
, job_runner_cycles = v_job_runner_cycles
where job_id = v_job_id
;
end if;
-- calculate the amount of sleep until the next cycle's due
set v_sleep := timestampdiff(
SECOND
, current_timestamp()
, date_add(
v_current_timestamp
, interval v_job_schedule_interval SECOND
)
);
if v_sleep > 0 then
if sleep(v_sleep) = 1 then
set v_message := MESSAGE_INTERRUPTED;
set v_job_status := JOBSTATUS_ERROR;
leave BLOCK_ENTER_RUNJOBS;
end if;
end if;
end if;
end loop LOOP_RUNJOBS;
end BLOCK_ENTER_RUNJOBS
;
deallocate prepare stmt_jobproc
;
update job
set job_status = v_job_status
, job_runner_finish = current_timestamp()
where job_id = v_job_id
;
end if;
end BLOCK_LOCK_ACQUIRED
;
-- release the lock so another job runner can run the job
call job$release_lock(
p_job_name
, v_lock_status
, v_locked_by
);
elseif v_lock_status = 0 then
-- Rats! some other job runner was here first
set v_message := concat(MESSAGE_JOB_LOCKED,v_locked_by);
else
-- Whoops! an error occurred aqcuiring the lock
set v_message := MESSAGE_ERROR_LOCKING_JOB;
end if
;
-- Let the caller now what happened here
select p_job_name as job_name
, v_message as message
;
end BLOCK_MAIN
;
go


I used the bold typeface to highlight the usage of SLEEP() and the elements that are related to the prepared statement syntax.

Locking is implemented using job$get_lock and job$release_lock, wich use the builtin GET_LOCK() and RELEASE_LOCK() functions. Just download install_job.mysql to see how it works exactly.

The rest is pretty straight-forward: just nesting and looping.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...