Percona Live MySQL User's Conference, San Francisco, April 10-12th, 2012 Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Wednesday, February 17, 2010

MySQL - the best stored routine is the one you don't write

At Fosdem 2010, already two weeks ago, I had the pleasure of hearing Geert van der Kelen explain the work he has been doing on connecting MySQL and Python. I don't know anything about Python, but anybody that has the courage, perseverance and coding skills to create an implementation of the the MySQL wire protocol from scratch is a class-A programmer in my book. So, I encourage everyone that needs MySQL connectivity for Python programs to check out Geert's brainchild, MySQL Connector/Python.

In relation to MySQL Connector/Python, I just read a post from Geert about how he uses the MySQL information_schema to generate some Python code. In this particular case, he needs the data from the COLLATIONS table to maintain a data structure that describes all collations supported by MySQL.

For some reasons that I cannot fathom, Geert needed to generate a structure for each possible collation, not just the ones for which the COLLATIONS table contains a row. To do this, he wrote a stored procedure that uses a cursor to loop through the COLLATIONS table. In the loop, he detects it whenever there's a gap in the sequence of values from the ID column, and then starts a new loop to "fill the gaps". For each iteration of the outer cursor loop, a piece of text is emitted that conforms to the syntax of a Python tuple describing the collation, and each iteration of the inner loop generates the text None, a Python built-in constant.

The final result of the procedure is a snippet of Python code shown below (abbreviated):

..
("cp1251","cp1251_bulgarian_ci"), # 14
("latin1","latin1_danish_ci"), # 15
("hebrew","hebrew_general_ci"), # 16
None,
("tis620","tis620_thai_ci"), # 18
("euckr","euckr_korean_ci"), # 19
..


In the final code, these lines are themselves used to form yet another tuple:

desc = (
None,
("big5","big5_chinese_ci"), # 1
("latin2","latin2_czech_cs"), # 2
("dec8","dec8_swedish_ci"), # 3
("cp850","cp850_general_ci"), # 4
..


This is excellent use of the information schema! However, I am not too thrilled about using a stored routine for this. Enter my fosdem talk about refactoring stored routines.

In this case, performance is not really an issue, so I won't play that card. But many people that do need well-performing stored procedures might start out like Geert and write a cursor loop, and perhaps do some looping inside that loop. One of the big take-aways in my presentation is to become aware of the ways that you can avoid a stored procedure. Geerts procedure is an excellent candidate to illustrate the point. As a bonus, I'm adding the code that is necessary to generate the entire snippet, not just the collection of tuples inside the outer pair of parenthesis.

So, here goes:

set group_concat_max_len := @@max_allowed_packet;

select concat('desc = (',
group_concat('\n '
, if( collations.id is null, 'None',
concat('(', '"', character_set_name, '"',
',', '"', collation_name, '"', ')')
)
, if(ids.id=255, '', ','), ' #', ids.id
order by ids.id
separator ''
), '\n)'
)
from (select (t0.id<<0) + (t1.id<<1) + (t2.id<<2)
+ (t3.id<<3) + (t4.id<<4) + (t5.id<<5)
+ (t6.id<<6) + (t7.id<<7) id
from (select 0 id union all select 1) t0
, (select 0 id union all select 1) t1
, (select 0 id union all select 1) t2
, (select 0 id union all select 1) t3
, (select 0 id union all select 1) t4
, (select 0 id union all select 1) t5
, (select 0 id union all select 1) t6
, (select 0 id union all select 1) t7) ids
left join information_schema.collations on ids.id = collations.id;

This query works first by generating 256 rows having id's ranging from 0 to 255. (I think I recall Alexander Barkov mentioning that this is currently the maximum number of collations that MySQL supports - perhaps I am wronge there). This is done by cross-joining a simple derived table that generates two rows:

(select 0 id union all select 1)

So, one row that yields 0, and one that yields 1. By cross-joining 8 of these derived tables, we get 2 to the 8th power rows, which equals 256. In the SELECT-list, I use the left bitshift operator << to shift the original 0 and 1 0, 1, 2 and so on up to 7 positions. By then adding those values together, we fill up exactly one byte, and gain all possible values from 0 through 255:

(select (t0.id<<0) + (t1.id<<1) + (t2.id<<2)
+ (t3.id<<3) + (t4.id<<4) + (t5.id<<5)
+ (t6.id<<6) + (t7.id<<7) id
from (select 0 id union all select 1) t0
, ... t1
, ...
, (select 0 id union all select 1) t7) ids

Once we have this, the rest is straightforward - all we have to do now is use a LEFT JOIN to find any collations from the information_schema.COLLATIONS table in case the value of its ID column matches the value we computed with the bit-shifting jiggery-pokery. For the matching rows, we use CONCAT to generate a Python tuple describing the collation, and for the non-matching rows, we generate None:

if( collations.id is null, 'None',
concat('(', '"', character_set_name, '"',
',', '"', collation_name, '"', ')')
)

The final touch is a GROUP_CONCAT that we use to bunch these up into a comma separated list that is used as entries for the outer tuple. As always, you should set the value of the group_concat_max_len server variable to a sufficiently high value to hold the contents of the generated string, and if you want to be on the safe side and not run the risk of getting a truncated result, you should use max_allowed_packet.

I have the honour of speaking at the MySQL user conference, april 12-15 later this year. There, I will be doing a related talk called Optimizing MySQL Stored Routines. In this talk, I will explain how stored routines impact performance, and provide some tips on how you can avoid them, but also on how to improve your stored procedure code in case you really do need them.

9 comments:

rudy said...

don't you mean 0 through 255?

;o)

Geert JM Vanderkelen said...

What a query! :)

The thing is, the stored procedure solution might not be optimal, but I think anyone will easily understand that code. The above query, as cool as it is, is IMHO much harder to grasp. And for me, readability and simplicity comes first :)

Roland Bouman said...

@rudy: thanks, its fixed.

@geert: sure. well each to his own - i think the query is clearer :p

Shlomi Noach said...

Hi Roland,

I think the query is clear as well :D
Matter of taste though. A programmer may find it easier to follow a conditional algorithm.

I've developed the habit of having a readily "numbers" table for such occasions.
I also wrote about how to generate such tables on the fly: Generating numbers out of seemingly thin air.
I didn't test it - but did you ever check to see whether an 8-table join is as fast as using a single 256 rows table?

Roland Bouman said...

Hi Shlomi :)

Yeah - I read your posts with the number table - thanks for linking to it, it is appropriately related material.

I haven't tested the difference in performance between the table and the on-the-fly product I am using here - frankly, I don't think it matters with such a small data volume.

In this case, I wanted to not presume any existing database objects (other than the built-in COLLATIONS table) - I thought that in this case, the balance between code volume and no hassle of relying on a table was just right.

Of course, for larger sets, I would start looking at base table too.

Anonymous said...

Hi Roland,

Try following trick of generating a sequence number on the fly.
select concat('("',set_name,'","',col_name,'"), #',id)
from
(
SELECT count(*) as id
,a.CHARACTER_SET_NAME set_name
,a.COLLATION_NAME col_name
FROM INFORMATION_SCHEMA.COLLATIONS a
,INFORMATION_SCHEMA.COLLATIONS b
where a.id >= b.id
group by a.ID
,a.CHARACTER_SET_NAME
,a.COLLATION_NAME
) seq_gen


Mario

Anonymous said...

Ouch, this query is a mess.

I don't really understand the perpetual FUD about stored procedures and so ... ¿mysql fanboy ignorance perhaps?.

"i think the query is clearer"

Are you kidding us, right?.

Roland Bouman said...

Hi Anonymous,

thank you so much for posting anonymously.

Yes, I am totally convinced now: It is a mess, and I have been constantly spreading FUD about stored procedures. Boy was I wrong, I feel really ashamed now. How lucky we all are we have you so you can unmask me.

Really, you should post comments more often. I guess a person like you uis so busy taking the piss at others that it is just too much trouble to leave your name.

Now, bye bye, and happy trolling.

Log Buffer said...

[...]Pursuing a similar matter (collations), Roland Bouman opines, the best stored routine is the one you don’t write.[...]

Log Buffer #180