Friday, December 19, 2008

MySQL Conference Schedule Available

I just got word the Schedule for the MySQL Conference 2009 is now available!

In my previous post, and the one before that I already wrote about my approved proposals. I promised I would post more details on my sessions later on. Well, it's right here:

http://en.oreilly.com/mysql2009/public/schedule/speaker/198

While you're at it, check out the list of speakers. I think it is a pretty good line-up, and I am looking forward to seeing the talks, and meeting these people. I've seen quite a few interesting things and I'll compile a list later on of the talks that caught my special attention.

Wednesday, December 17, 2008

"Starring Sakila": a Data Warehousing mini-tutorial at the MySQL UC 2009

Hi!

Recently I wrote about how glad I was to see two of my proposals for the upcoming MySQL Conference approved.

I am absolutely thrilled to see my third proposal has been approved as well: "Starring Sakila: data warehousing explained, illustrated and subtitled". I'm very proud to announce that I'm doing this talk together with Matt Casters, chief data integration of Pentaho and creator of the popular ETL/Data Integration suite Kettle (a.k.a Pentaho data integration).

During this 45-minute seminar, Matt and I will take the Sakila sample database and explain how to design and load a data warehouse for it, and how to use that to create those reports and charts that makes the managers smile ;-). Along the way, we'll explain often uttered terms and concepts such as dimensional model, denormalization, star schema, OLAP, data staging, business intelligence, ETL and data integration.

Note that this session is targeted at SQL developers, DBAs and tech-savvy analysts ('suits' are most welcome) that are new to business intelligence and datawarehousing. We noticed that although data warehousing and business intelligence are hardly new topics, there is still a lot of confusion surrounding them. The goal of this session is mainly to explain what it is all about, and to immediately make the concepts we cover tangible by illustrating it using a database schema which is easy to grasp, and which already familiar to most of the MySQL crowd.

Of course - the MySQL User's conference will feature more advanced data warehousing and business intelligence session too. I am pretty sure we will see a lot of Infobright and Kickfire. One session that I know will be very interesting is Matt's talk on utilizing computing clouds to power highly performing, scalable ETL processes.

Anyway, I'm looking forward to it. I will post a more detailed overview of the content of this and the other sessions I'm involved in when the conference date is nearing.

Video

A video recording of this session can be found here, courtesy of Sheeri.

Thursday, December 11, 2008

Speaking at the MySQL Conference 2009

I am very proud to announce that so far, two proposal I submitted for the MySQL Conference 2009 have been approved! Wohoo!

(Did you submit a proposal too, and you still didn't receive a reply? No worries - there is a massive amount of proposals to go through. From past years experience I can say - Don't despair, just have some patience. It can take some time for the MySQL community guys to work though all the proposals.)

As every year, I am looking forward to this event a lot, and having the opportunity to speak there just brings a big smile to my face ;)

Anyway - These are the approved proposals:

  • Gearing up MySQL: Implementing MySQL synchronization for browser-based apps using an embedded database (powered by Google Gears)

  • Practical MySQL Plugin Development



The first proposal is for a 45 minute seminar on how to effectively use a client side embedded database in browser applications, and how to synchronize that local data to a remote mysql server. It is focussed on Google Gears, but the principles and approach is equally applicable to a HTML 5 compliant implementation of a browser embedded SQL database (such as Safari).

The second proposal is for a 3 hour tutorial. The aim is to do a hands-on tutorial on the basics of MySQL plugins that is suitable even for those that have no prior C/C++ skills. Yeah, it may sound ambitious, but last year I did a UDF tutorial along the same lines, and that seemed to work out pretty good. I am very lucky to be doing this one together with Sergei Golubchik, one of the MySQL Server devs that has seen every corner of the server, and who actively architects the plugin API. Really - if you want to learn about plugins, don't miss your chance to learn it at the conference.

I will most likely provide some more details on the sessions later on, but feel free to post a comment in case you want to know more.

Friday, October 31, 2008

The Dutch are making money with open source software

I just want to take a moment to congratulate Stani Michiels, the creator of Stani's Python IDE.

Stani has won a contest to create the Architecture Fiver, a commemorative Dutch 5 euro coin:



The front side shows the portrait of Queen Beatrice emerging from a spiral of the names of Dutch architects, ordered by the number of internet hits. The back side shows a number of books (authored by Dutch architects), rising up as if they are buildings, their roofs forming the contours of a map of the Netherlands.

I think this is a truly creative and beautiful design, and I think the price is well-deserved.

An extra tidbit that interests me is that the design was created exclusively by Open Source and Free software tools (predominantly Python). So literally, this is money made by open source software. Read all about it on his blog: SPE IDE - Stani's Python Editor: How to make money with free software....

Stani, Kudos to you - When I get my hands one of these I'll keep it for display. You can be sure the image of the coin will make its way to many presentations on open source business.

Read all about it on Stani's blog

Friday, October 24, 2008

What kind of life is that?!

Over a year ago, I wrote about a little bit of irony offered by the runtime error messages emitted from the Windows Vista ultimate testdrive site (which I just checked again, this time with IE6 and happens to choke on another bug).

Anyway, I just found another gem in this category. In search for MS Visual Studio Express, I bumped into an ad for the "Defy All Challenges" campaign on the Dutch Microsoft site. The site shows a picture of a pirate captain that is titled "A Developer's Life for me" and below is some text that invites other brave developers to "Defy all challenges" with Microsft Visual Studio 2008. I took a snapshot for posterity:

whoops

Unfortunately, as soon as you visit the site (I'm using IE 6.0 on Windows XP SP2 right now), you get a messagebox informing you that a javascript runtime error has occurred ("Object Expected"). So I'm afraid it's a case of "A bug's life" rather than "A Developer's life".

Wednesday, October 15, 2008

Maturity of Open Source ETL / Why do people fail to do their homework?

I just read this post on Matt Casters' blog. Here, Matt describes why Element 61's Jan Claes is dead wrong in the way he assesses the maturity of open source ETL tools.

Well, I've just read Jan Claes' article in the "research and insights" area of the Element61 website, and frankly, it is pretty easy to see how unsubstantiated it is. Some may be tempted to classify the article as FUD, although typically 'real' FUD articles are more sophisticated and are usually not so blatantly sloppy with facts. I could get all worked up over it, but mostly it leaves me with a genuine confusion as to why a seemingly serious IT consultancy company would want to have their professionals post this kind of stuff on the company website and let it pass as "research" or even "insight".

Anyway, let's take a look at the article and I'll try and explain why it won't wash...
(Disclaimer - I can easily debunk Jan's article when I look at Kettle, a.k.a. Pentaho data integration. I do not have enough experience with the other open source ETL tools mentioned in his article, Talend and CloverETL, so it is entirely possible that the article does these tools similar injustice).

The article starts off by explaining what ETL is, and correctly mentions that prerequisite tools and resources are generally considered expensive. Quite unsurprisingly, the article moves on to explore open source ETL tools in order to find a cost-reducing alternative. So far, so good - the notion that cost reduction is the primary motive for most businesses to start using open source software is far from controversial.

From here on the article wanders off, and starts talking about open source in general. Some sympathetic but unsubstantiated and unillustrated claims are made about the relation between "standards" and open source software.

The article then becomes more critical and downplays the viability of an ETL open source community as compared to the Linux community. According to Jan Claes, Linux is successful ...because the developers united in a “quest” against Windows.... According to him, a common motive lacks in the ETL community. He has it that ...[ETL] is a specialist's world..., and because of that, ...progress might be significantly slower than in the case of other –more successful- Open Source programs.

Just take a second to let it sink in. Anybody met a Linux dev lately that was out to beat windows? Last time I checked, Linux was meant to be a free (as in speech) UNIX system. Most Linux users I know couldn't care less about Windows, and they are certainly not 'united in a quest' to create something like it. I take it that by tagging ETL as belonging to the 'specialist's world', I am to understand that Linux kernel development is 'generalistic'.

If I am to believe Jan, everybody is so busy with their quest against windows, doing all that 'general' hacking in C/C++ on the Linux kernel that nobody is specialist enough to help that poor quailing open source ETL community develop say, a MySQL bulk loader step, or a Regular Expression Matching step, or a Normalizing step. (Those that have been following the development of Kettle know that these and more are all contributed by the community).

To make sure he gets the point across, the article then says that ...The most advanced suppliers of Open Source ETL actually admit today that they are not really community-driven, and that each company works with their own developers.... No link. No date. No nothing, we should probably just take Jan's word for it.

Look, don't get me wrong - I don't want to deny that ETL is a specialist area of expertise. But in my opinion, this is in part due to the fact that ETL (and BI tools in general) have been inaccessible to most people. Let me explain that.

My take on it is that now, open source BI is in the same boat where the open source databases were a few years ago. While mocked and ignored by the 'real' experts and specialists that do not deign themselves to touch the open source 'toy' databases, and can only be humbled to work with traditional 'enterprise' databases, the open source databases caused an emancipation among application developers and web developers.

Databases are now a commodity, and there are quite a lot of developers that know how to use one. Of course, apart from a large body of database users with basic skills, there are still 'real' experts, but some of them have become one with little or none of the typical DBA background thanks to the fact that open source databases disrupted the market and commoditized it. I see the same thing happening for BI: because the tools are now accessible and available, BI is popping up where it didn't before. In stealth mode, it is becoming a commodity.

The article continues, and now starts discussing a number of products: CloverETL, Pentaho, and Talend. Jan claims that these ETL tools provide ...just enough transformations to make it a viable option.

Huh?! Say what?! Just enough?... Of course, no comparison is made with a closed source product, so we can only guess what he was expecting. But really people, when I open Kettle 3.1, the latest GA version, I see:

  • Over 20 input steps, ranging from standard things like simple database tables, csv files and excel workbooks to slightly more advanced things like XML and LDAP to more exotic things like XBase and Web services

  • In addition to regular input steps, specialized bulk loaders for Oracle, MySQL, Postgres, Greenplum and MonetDB

  • Over 10 output steps (again ranging from standard stuff to more advanced things)

  • About 10 flavors of lookup, join and dimension maintenance steps

  • No less than 30 transformation steps, including normalization, denormalization, XSLT, XML injection

  • A bunch of flow control steps


And this is just for transformations! For kettle jobs, there's another bucket of steps, there's just too much to mention.

The article continues and does a run-down of "when to use/when not to use" open source ETL tools, and it is in this section where we see most of the inaccuracies.

Here are the "when to use" cases according to Jan Claes:


No complex transformations

The straightforward transformations can be implemented very quickly, ... When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.


This is like, "so what"?! I mean, seriously - basically this complaint can be simplified to "when it's hard, it's hard". I dare anybody to come up with a single case where any tool provides an "out-of-the-box-solution" for an arbitrarily complex problem. I suspect Jan has not tried, and if he did, why doesn't he mention what he tried? I mean, man, just one example of what was expected but not found, surely that is not too much to ask? Gosh...


Few data sources to connect to

Most reputed ETL-vendors provide an extensive amount of connections to all sorts of data sources. This is a problem with Open Source ETL tools: most of them are based on Java architecture and need JDBC to connect to a database.


Seriously, what crap is this?

First of all, it is completely untrue. I mean, I won't bore you with the big 5 databases, and I know you have all heard of the less well known ones like Greenplum, Intersystems Cache and Netezza, but have you ever heard of "ExtenDB"? Or "Kingbase ES"? Or "Gupta"? No?! Well, Kettle has, and it has all of this shipped for free. Yup, that's right, No less than 32 different RDBMS systems, and for most of them, there isn't just one driver, but multiple drivers available - no additional installation or fee required.

Second, how does it make sense to whine about the underlying protocol or API used to connect to the data source? I have never heard anybody complain about MS SQL only offering AdoDB or ODBC connectivity, have you? In practice there is a JDBC driver available for every database, and if not you can still use ODBC connectivity. You are also not confined to the predelivered drivers, you can simply drop your own drivers in the kettle jdbc directory and connect to other dbs. If anything, Kettle's choice for JDBC seems like it may have done everything but limit connectivity.

Do you homework, dude. All it takes is one download and one unzip command.


Small scaled projects

Small scaled projects with one or two, very common sources and a limited number of target tables don't need an expensive and complicated ETL tool. An Open Source ETL tool can bring a cheap and easy to install solution, which even could run on a local PC.


Well, it is certainly true that you can run Kettle on a single PC with modest resource usage. But you can also think big, and benefit from built-in threading to burn all the cores of your high-end server. Or you can cluster a bunch of machines and have them divide the work to perform a single transformation. You can run it in a cloud and use sophisticated features like parallel input, you can use it to load shards ("database partitioning"), in fact there are no obvious limitations to the Kettle's deployment scale.

So, as it should be obvious by now, Jan has managed to write his "when to use cases" as if they are "when not to use" cases. The real "when not to use" cases follow:


Java & XML knowledge required for complex transformations

As all the Open Source ETL tools are Java and XML based, you need at least a basic understanding and knowledge of Java and XML to get the most out of them...[bla, bla and more bla]...the user has to develop the transformation rules, and therefore needs an advanced knowledge of Java.


This is a blatant lie. I won't deny that it is possible to muck around coding java as a Kettle user, but technically you would be a plugin writer. And the XML claim is completely off track - there is no requirement whatsoever for the end user to edit raw XML, ever. Really, I have just gone down the road describing the wealth of steps available to build you transformation, and really, you can trust me when I say: this is enough for about 95% of the cases, if not 99%. Who can blame the product for allowing you to write a java plugin if you really want it?

Speaking of which, does Jan mean to imply that the "traditional" products he is writing about do not allow you to write plugins? If that is so, then it seems an easy choice for me. I'd rather have more options not less. That saidd, I know that for example MS SQL IS allows .NET plugins. How is this functionally different?


Connectivity to multiple & less common data sources required.



Heh ;) As i have argued before, this is nonsense....Elementary, Watson.


Lack of skills, knowledge & resources


Well, I guess this is one of the things I got to give them. Clearly, certain skills, knowledge and resources are missing in Element61's outfit. Otherwise I can't explain why they'd be advertising such an easy target article.

Anyway, maybe you have thoughts on Open Source ETL that you want to share. If so, be my guest.

Thank you for bearing with me, and please left a comment if necessary.

Saturday, October 04, 2008

Less is More - Happy Eee PC-ing

Ok - I admit - it was impulse more than utility that made me buy this Asus Eee PC 900. People that know me can confirm that I am not a gadgety type of guy, so I realize it may seem a bit strange that I would sucker into buying trendy hardware.

Well, the secret is: there is no secret. The Eee PC is not a gadget - it is a completely functional laptop that is extremely suitable for traveling. It's got 1G of RAM, and 16G of flash storage onboard. An 900MHz Intel may sound slow, but I haven't really noticed that it is, in fact performance feels pretty peppy. The machine boots *fast* and shuts down even faster. It's light, and small, but still has a keyboard and display that are large enough to work on. The battery lasts me about 2.5 to 3 hours - I will try and find another battery. It cost 300 euros - this price was exactly right for me. If it had been more expensive I probably would not've bought it - even if it'd be a beefier machine.

The Eee PC 900 I bought comes with Xandros linux. I feel they really did a remarkable job of creating a desktop linux that is attractive, clear, easy to use and unwreckable by the non-technical user. The same model is available with windows XP too, but in order to keep the same price they took out Ram and Disk. Please dudes and dudettes - do yourself a favor and don't buy that one. Really, the pre-installed linux is pretty darn good, and considering the clipped hardware specs, I frankly can't see how the same machine with Windows XP could still pass as anything else but a joke. Yeah, I'm biased - trust me, I'm right ;)

If you have some experience with linux, its still easy to set it up so you can use a full KDE desktop.

I used to work on my notebook in packed trains and in cramped planes, but I think I will use the Eee PC instead - it really makes my life easier.

Thursday, October 02, 2008

MySQL 5.1 Cluster Certification Guide online Preview!

Wow!

I just read Martin "MC" Brown's blog announcing that a number of chapters are now available online for preview.

Personally, I think this is a good move. I think the online chapters provide a nice insight what the book is like. I can ensure you that there is lots and lots more to see in the other chapters. I can honestly and without reservation recommend this book to everybody that wants to read a recent reference book on MySQL Cluster, and those that want to obtain the 5.1 Cluster DBA certification exam in particular.

Now it must be noted that I am a co-author for this book, and author of the MySQL 5.1 Cluster DBA exam so my recommendations are hardly objective. However, I don't receive any royalties for the book (or for the exams) - my contributions were just part of my time working with MySQL AB. So, when I say that MC Brown is right when he says that:
...there is only one guide that you need, the MySQL 5.1 Cluster DBA Certification Study Guide
then I mean that sincerely - this is the book you'll want. Pick up your copy here.

Friday, September 26, 2008

Changing Lanes: Leaving Sun/MySQL

After having worked for two years and one quarter of a year with MySQL AB (now Sun), I have decided to resign and pursue another career.

Before I joined MySQL AB I was working as a database consultant and application developer. I knew that joining MySQL would mean giving that up, but I was happy to do so because I really liked the MySQL product. I was also attracted by the company itself, the flexible working hours, working from home, travelling abroad, and, very important, being one of the few companies that is capable creating FOSS software on a commercial basis.

I have had a splendid time at MySQL AB, and later Sun. I met lots of smart people and travelled to a bunch of places I would probably not've had the chance to visit otherwise. I learned a lot about the MySQL server, and a lot about the challenges involved in making an open source software company profitable. I've been given the chance to co-author a book on MySQL Cluster, and authored the cluster and associate certification exams. Finally, I created a training course that learns you how to write MySQL UDFs.

Looking back, I feel quite satisfied, and I'm thankful for having had this opportunity. However, slowly but surely I have noticed that I have been less involved in technical stuff, at least professionally. I have been spending so much time on hobby hacking in my spare time, that I reconsidered my position, and decided that I would really like to pick up application development again. At some point, a good opportunity came by, and I decided to seize it. So, that's what I did.

As of the 1st of October, next Wednesday, I will be working for Strukton Rail. I will be working for a department that's responsible for business process improvement and re-engineering. I will be doing a bit of business process analysis and consultancy, but the larger part of my job will consist of developing a Web application to model business processes and to support project management. These applications will have a distinctly graphical nature, and I have a pretty clear idea where I want that to end up. In addition I will be adding business intelligence to these applications.

I will be in the lucky position to be able to decide for a large part which technologies and products we will be using, and I think I will be able to build all of it on open source/free software. I am pretty certain that I will be using MySQL, Pentaho, PHP, ExtJs and Google Gears.

My intention is to keep blogging about MySQL. Most likely I will write more about all these other technologies too.

I'd like to finish by saying again what a wonderful time I had with MySQL/Sun! I wish MySQL/Sun the best of luck, and I want to thank all of my colleagues there. Thank you! I have learned a lot from you, and I'm looking forward to meeting again some place, some time.

Roland

Tuesday, August 26, 2008

Webinar: Introducing MySQL Workbench

September 16, 10:00 am PST, 1:00 pm EST, 18:00 GMT



I'm proud to inform you that I will be hosting a MySQL Workbench Webinar, September 16. You can find the announcement right here.

I have spent quite a deal of time testing MySQL Workbench. Thanks to the hard work of the MySQL Developer Tools Team, the tool just becomes better and better each new build. Personally, I think it is high time to present its features and benefits in a web seminar, and when Giuseppe asked me to do it, I was more than happy to oblige.

I'll shortly post an outline of some of the key features I will be presenting. Most likely, I'll address a number of issues I've seen recurring on the MySQL Workbench Forums, but I also intend to demonstrate a number more advanced features that I hope will benefit DBAs as well as Developers.

Stay tuned ;)

Tuesday, July 29, 2008

MySQL aggregate UDF Performance

At the MySQL User's conference 2008 I did a tutorial on MySQL User-defined functions. At the same conference, Brian Miezejewski and Mark Atwood did a talk on aggregate UDFs.

In relation their upcoming talk, Mark wrote a blog post reporting that MySQL Aggregate UDFs are three times slower than equivalent native functions:
I just proved to myself via testing and benchmarking that user defined aggregate functions are about three times slower than the call interface to the built-in functions.
Later on he posted an update, explaining that it looks like this result is probably "a MacOS thing".
It turns out to be a MacOS thing. On Linux, the performance penalty is less than 10%. Further research will surely be done.


I just tested it on Linux and Windows. For a simple aggregate UDF that is equivalent to the built-in COUNT() function, I found that the UDF is about 5% slower than the built-in function. Personally I think that's not too bad.

If you are interested in trying this on your own system, download these instructions. Would be cool if you could leave a comment here of your results, TIA

Thursday, July 24, 2008

Inspect the Query Cache using MySQL Information Schema Plug-ins

A while ago I wrote about MySQL 5.1 information schema plug-ins.

At the time, I wrote a plug-in to report the contents of the query cache, but for all kinds of reasons, I never found the time to write a decent article about it, nor to release the code.

I am not sure if I'll ever find the time to write that article, but I just tidied up the code, and installed it in a new MySQL 5.1.26-rc server. It seems to work there, so I put the code up on the web.

Inside the source file, there's instructions to build and deploy it. If all goes well, you can do things like:

mysql> select * from information_schema.mysql_cached_queries\G
*************************** 1. row ***************************
STATEMENT_ID: 1
SCHEMA_NAME: test
STATEMENT_TEXT: select count(*) from world.city
RESULT_BLOCKS_COUNT: 1
RESULT_BLOCKS_SIZE: 512
RESULT_BLOCKS_SIZE_USED: 106
1 row in set (0.00 sec)

Are you interested in this plugin-in? Do you have any suggestions? Download it, and put your comments here.

Building MySQL from Source: There's a fine line...


There's a fine line between love and hate, you see

Can't wait to late, but baby I'm on it


(From: "Liberation" by Outkast)

Suppose you need to build a MySQL server with the same options as a pre-built one, distributed through the MySQL download site.

Sounds strange?

Well, maybe, but that's what you need if you want to compile certain MySQL 5.1 plug-ins so that they'll will play nice with such a pre-built server.

Some plug-ins depend on nothing more but the plugin.h header file, but for example storage engine plug-ins require things way beyond that. If you want to compile those yourself, you are required to first build a MySQL server from source, even though you will install the plug-in in another, pre-built, server.

(If you want to see how this works in practice, check out the excellent documentation for building and installing the PBXT storage engine plug-in.)

Now, the snag is, the server you need to build must ideally be built in exactly the same way as the deployment server. So, the question is, how do you find out how your server was built?

As it turns out, there is a very simple method to obtain the original compile line. You can find it in the mysqlbug script. On line 28 you see the following line:

CONFIGURE_LINE="./configure '--prefix=/usr/local/mysql' '--localstatedir=/usr/local/mysql/data' '--libexecdir=/usr/local/mysql/bin' '--with-comment=MySQL Community Server (GPL)' '--with-server-suffix=' '--enable-thread-safe-client' '--enable-local-infile' '--enable-assembler' '--with-pic' '--with-fast-mutexes' '--with-client-ldflags=-static' '--with-mysqld-ldflags=-static' '--with-zlib-dir=bundled' '--with-big-tables' '--with-ssl' '--with-readline' '--with-embedded-server' '--with-partition' '--with-innodb' '--without-ndbcluster' '--with-archive-storage-engine' '--with-blackhole-storage-engine' '--with-csv-storage-engine' '--without-example-storage-engine' '--with-federated-storage-engine' '--with-extra-charsets=complex' 'CC=ccache gcc -static-libgcc' 'CFLAGS=-g -O3' 'CXXFLAGS=-g -O3' 'CXX=ccache gcc -static-libgcc'"


Are you curious what will happen if you compile your plug-in against a server that is not built the same as the deployment server? Well, the plug-in won't be "playing nice" with your server, and it is likely that something nasty will happen, such as a crash, or worse: the plugin may unintentionally change the behavior of the server, with unwelcome results such as data corruption.

So, find the mysqlbug script, (it's normally located in the MySQL bin directory) and find that fine line...

Friday, July 11, 2008

MySQL: DIVide and Conquer

Everybody that has had to do some numeric calculations in SQL will have encountered this simple problem: divide an integer by another integer and round down the outcome to the nearest integer. In many cases people write it like this:

FLOOR(a/b)

Simple enough, right? First we do the division a/b, then we round down using the function FLOOR().


Update: My claim that TRUNCATE(a/b, 0) is equivalent to FLOOR(a/b) is false! It maybe true when the outcome of the division is a positive number, but in case of a negative number, TRUNCATE() will only lose the decimals (resulting in a higher negative number) and FLOOR() will still round down.

Thanks Kai!


However, there is a better way.

We can use the integer division operator DIV instead of the ordinary division operator /. Because this is an integer division, there is no need for an extra function to lose the decimals, and the expression is simply:

a DIV b

This approach has a number of advantages:

  • It is explicit. By looking at the expression we know immediately that the result will be an integer, and that a and b are meant to be integers too.

  • It is easier to read. Because we don't need another function and parenthesis, this expression is easier on the eyes, something that you will appreciate if the expression is not simply FLOOR(a/b) but something like FLOOR(SUM(a)/SUM(IFNULL(b,0)))

  • It is fast! The DIV operation does not have to deal with complex floating point math, and will be much faster on most microprocessors


To prove the last point, take a look at the results of a simple benchmark. I simply used the BENCHMARK() function and executed:

mysql> SELECT BENCHMARK(10000000,1234567 DIV 7) ;
+-----------------------------------+
| BENCHMARK(10000000,1234567 DIV 7) |
+-----------------------------------+
| 0 |
+-----------------------------------+
1 row in set (0.83 sec)

mysql> SELECT BENCHMARK(10000000,1234567 / 7) ;
+---------------------------------+
| BENCHMARK(10000000,1234567 / 7) |
+---------------------------------+
| 0 |
+---------------------------------+
1 row in set (7.26 sec)

mysql> SELECT BENCHMARK(10000000,FLOOR(1234567 / 7)) ;
+----------------------------------------+
| BENCHMARK(10000000,FLOOR(1234567 / 7)) |
+----------------------------------------+
| 0 |
+----------------------------------------+
1 row in set (8.80 sec)

I repeated this two more times and averaged the time spent, and then made this little graph of the results:
DIVideAndConquerThe results show that DIV is about 9 to 10 times faster than the ordinary division operator, and that adding FLOOR() function makes the entire expression another 10% slower.

Now, I don't think the performance benefit is of much practical significance. You may see a slight improvement for large datasets using multiple division operations, but in many cases the ordinary query processing will probably have a much larger part in the total time spent. But still, DIV is faster, easier to read and more explicit if you want to solve this type of problem.

Wednesday, July 09, 2008

A fast, single pass method to calculate the median in MySQL

After stepping off of the GROUP_CONCAT() solution for calculating quantiles I figured it would be nice to find a better way to calculate the median too.

Solution


I previously wrote on how to calculate the median using GROUP_CONCAT(), but I think that this is a better way:

SELECT AVG(length) AS median -- take the average of left and right median
, MIN(length) AS left_median --
, MAX(length) AS right_median --
, @l AS left_median_position --
, @r AS right_median_position --
FROM (
SELECT @n, length -- @n is just here to facilitate debug
FROM film
CROSS JOIN (
SELECT @n:=0
, @r := COUNT(*) DIV 2 + 1 -- right median or true median
, @l := COUNT(*) DIV 2 -- left median, or true medain
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)
FROM film
) ``
WHERE (@n:=@n+1) -- row number
BETWEEN @l AND @r -- select two middle ones
ORDER BY length -- need to sort to get middle values
) ``

Apart from not relying on setting the buffer size for group_concat_max_len, this solution is also faster than the GROUP_CONCAT() solution. For example, calculating the median amount from the payment table in the sakila database takes some 0.12 seconds using the GROUP_CONCAT() solution and about 0.03 seconds with this method.

Explanation


At the heart of the solution are the user-defined variables:

  • @n: the row number

  • @l: the position of the 'left median' that is the row number with the highest value in the lower half of all rows

  • @r: the position of the 'right median', that is the row number with the lowest value in the higher half of all rows


These are initialized in the inmost subquery:

SELECT @n:=0
, @r := COUNT(*) DIV 2 + 1 -- right median or true median
, @l := COUNT(*) DIV 2 -- left median, or true medain
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)
FROM film

Note that this yields one row, initializing the row number @n to zero. We can calculate the position for the right median immediately by doing

COUNT(*) DIV 2 + 1

You see, first we divide the total number of rows by two. If there is an even number of rows, COUNT(*) DIV 2 will give us the left median, and adding 1 is then by definition the right median. If COUNT(*) is an odd number, it still holds: the integer division rounds down to the nearest integer value, and adding 1 then gives us the position of the true median in that case.

The calculation of the left median is along the same lines:

@l := COUNT(*) DIV 2 -- left median, or true medain
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)

We just have to take care that in case we do have an odd number of rows, we need to pick the true median position here too.

To do that we first find out if there is an odd number of rows using COUNT(*) % 2. This calculates the remainder of dividing the total number of rows by two. In case of an odd number of rows, the remainder is 1 which is considered to be TRUE by the IF function. In this case, 1 is added, effectively making @l equal to @r (that is, both hold the position of the true median). In case of an even number of rows, we add 0 as COUNT(*) % 2 is already the desired position of the left median.

Now that we have these values, we can use @n to generate a rownumber, which we can compare against the calculated @l and @r values:

WHERE (@n:=@n+1) -- row number
BETWEEN @l AND @r -- select two middle ones
ORDER BY length -- need to sort to get middle values

And this will give us at most two rows in case of an even number of rows, and one row for an odd number of rows.

The final step is to average the values from the rows, which is required to calculate the median in case of an even number of rows:

SELECT AVG(length) AS median -- take the average of left and right median
...
FROM (
...
) ``

Tuesday, July 08, 2008

MySQL Percentile aftermath: Calculating all quantiles

Are you getting fed up yet with my posts on calculating percentiles? Well, I'm sorry but I'm not quite finished.

Here's a simple, fast method to calculate the specified number of quantiles:

-- set the number of quantiles, for exmple:
-- quartiles: 4
-- deciles: 10
-- percentiles: 100

SET @quantiles:=4; -- select quartiles

-- calculate all quantiles
--
--
SELECT amount AS metric
, @n DIV (@c DIV @quantiles) AS quantile
, @n AS N
FROM sakila.payment
CROSS JOIN (
SELECT @n:=0 -- rownumber
, @c:=COUNT(*) -- need this to calculate quantile partitions
FROM sakila.payment
) c
WHERE NOT ( -- modulo zero (=false), we are at the quantile
(@n:=@n+1) % (@c DIV @quantiles) -- rownumber equal to the quantile partition?
)
ORDER BY amount; -- need ordered partitions

You can find this snippet on MySQL Forge.

Monday, July 07, 2008

Calculating Percentiles with MySQL, Round 2

My previous post on calculating percentiles with MySQL generated some comments and good discussion. In particular, I got some very interesting comments from Vladimir.

Basically, Vladimir was doubtful whether the GROUP_CONCAT() solution would be optimal in comparison to a JOIN. His proposal is to solve it like this:

SELECT SUM(g1.r) sr
, g2.length l
, SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON g1.length < g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1

First, this query sets up two identical subqueries in the FROM list using GROUP BY and COUNT() to calculate the number of occurrences of each distinct value. Then, these are joined and GROUP BY is again applied to calculate the total number of rows having a lower value. Finally, HAVING is used to find the groups in the upper percentiles, and LIMIT and ORDER BY are used to single out the one desired percentile value.

As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values on varying number of rows. Here are his slightly rounded numbers:

#rows: group_concat: groupby-join:
4M 1 min 6 sec 5.3 sec
1M 5 sec 2.5 sec
100K 0.5 sec 1.6 sec

Although GROUP_CONCAT() seems to break down pretty soon, he also writes:

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

He concluded by saying:

But what I think is the right solution is having something like this on the server side:

SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length
FROM film
GROUP BY length
HAVING p >= 0.9 ORDER BY p LIMIT 1

Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN, MAX, AVG, COUNT and maybe some other aggregate functions.

So, COUNT(INC length) would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:

-- allow statement batching
DELIMITER go

-- initialize
SELECT 0, COUNT(*)
INTO @cum_cnt, @cnt
FROM sakila.payment;

-- calculate percentiles
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount
amount
FROM sakila.payment
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;

go

and this gets us the result:

Query OK, 1 row affected (0.01 sec)

+-------------+--------+
| p | amount |
+-------------+--------+
| 0.904542334 | 6.99 |
+-------------+--------+
1 row in set (0.03 sec)

Here is the equivalent GROUP_CONCAT solution:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(
p.amount
ORDER BY p.amount
SEPARATOR ','
)
, ','
, 90/100 * COUNT(*) + 1
)
, ','
, -1
) AS `90th Percentile`
FROM sakila.payment AS p;

...and it is considerably slower:

+-----------------+
| 90th Percentile |
+-----------------+
| 6.99 |
+-----------------+
1 row in set (0.08 sec)

(sakila.payment has 16049 rows and 19 distinct values for amount)

So, the sane thing to do would be to forget about that GROUP_CONCAT idea, and use this method. It does not have the nasty drawbacks of having to mess with group_concat_max_len and I am pretty sure Vladimir's method will be faster across the board anyway.

The one thing you could object about is the extra query to initialize the user-defined variables. You can get around that by initializing those in a single row subquery in the FROM clause, a technique described by Baron Schwartz (see for example this post)

SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
amount
FROM sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
, @cnt:=COUNT(*)
FROM sakila.payment) p
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;

This has the advantage that the variables are initialized in one go with the entire query, ensuring you are not accidentally working with uninitialized or garbage variables.

(BTW: If you want to do more with these user-defined variables, I can highly recommend more from Baron's site, see for example his article on advanced user defined variable techniques.)

Writing to the MySQL error log

In almost all application development situations, one needs to log debug information now and then. In almost all production systems, one needs to log serious error events somewhere too.

So, what can you do? Create a log table? Sprinkle your code with SELECT 'Debug: ...' statements?

At the UDF Repository for MySQL, we now have a solution to log messages to the MySQL error log: a user-defined function called log_error().

Currently it is all very crude: the log_error function takes one argument and writes it to the mysql_error log, appending a line terminator.

Please try it out, and let us know if you have comments or suggestions to improve. Thanks in advance,

Roland

Friday, July 04, 2008

Calculating the Nth percentile in MySQL

Yesterday, I was on the freenode ##pentaho irc channel when Andres Chaves asked me how to calculate the Nth percentile in MySQL. He saw a solution somewhere using subqueries, but wasn't too happy about it.

A while ago I wrote about calulating the median in MySQL, and it turns out the Nth percentile can be calculated using a similar, single-pass approach, not relying on subqueries, UDFs, or user-defined variables.

The percentile....


So, what is a percentile exactly? Here's what the wikipedia says:

A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

....and the median


The wikipedia continues and hints at the relationship between the Nth percentile and the median:

The 25th percentile is also known as the first quartile; the 50th percentile as the median.

Sidenote: as I understand it, this latter remark concerning the median is not entirely correct. The median is "the middle value": the number of observations with a higher value is equal to the number of observations that has a lower value. For a series with an even number of observations, say {1,2,3,4}, there isn't one middle value, there are two, in this case: 2 and 3. Typically, the median is computed by taking the arithmic mean of the two middle values, which would be (2+3) / 2 = 2.5 in this particular case. But for this example, the 50th percentile would be 3 and not 2.5. However, in most practical cases the values are fairly evenly distributed and sufficiently large, which means the difference between the median and 50th percentile will be small or absent.

However, the median and percentile problems are quite similar: in both cases, a value is picked or computed that is higher than the value found in a particular portion of the total number of observations. For the median, the requirement is that that particular portion is equal to the number of observations that exceeds it; for the Nth percentile the requirement is that that portion constitutes N percent of the total number of observations.

The Solution


In the following example, we calculate the 90th percentile of film lengths:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT( -- 1) make a sorted list of values
f.length
ORDER BY f.length
SEPARATOR ','
)
, ',' -- 2) cut at the comma
, 90/100 * COUNT(*) + 1 -- at the position beyond the 90% portion
)
, ',' -- 3) cut at the comma
, -1 -- right after the desired list entry
) AS `90th Percentile`
FROM sakila.film AS f

Here, the literal 90 represents which percentile we want, that is, "the 90th percentile".

(If you like, you can leave out the SEPARATOR ',' bit, as the default separator is a comma anyway. I just wanted to have a clear indication for the source of the ',' arguments in the SUBSTRING_INDEX() calls)

Explanation


The median and Nth percentile problem can be solved using a similar apporoach: first, we use the string aggregate function GROUP_CONCAT() to create an ordered list of values. Then we use the substring variation SUBSTRING_INDEX() to find and excise a list entry at a particular desired position.

The differences between the solutions for the median and the Nth is mainly in which entry we have to pick from the list. (See my prior article for the full story on median). For the Nth percentile, we first calculate the desired portion of the total number of observations. Because N is defined as a percentage, we divide by 100 to get the actual fraction of the total number of observations:

N / 100

Then, we multiply by the total number of observations to find the number of observations that make up the actual portion of observations within the specified percentile:

N / 100 * COUNT(*)

Because we want to find the observation for which the specified portion has a lower value, we need to look at the next entry instead of the last entry within the portion, so we add 1:

N / 100 * COUNT(*) + 1

Caveats


There are a number of things to look out for:

Percentile value not unique


When we calculate the 90th percentile of the film lengths, we get 173:

+-----------------+
| 90th Percentile |
+-----------------+
| 173 |
+-----------------+

If we check the result by counting the portion of films with a length lower than 173 we see:

mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
-> FROM film AS f;
+------------+
| Percentage |
+------------+
| 89.4212 |
+------------+

The reason that we do not get 90% is that there are multiple occurrences of films with length equal to 173:

mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS |
| FIRE WOLVES |
| GLADIATOR WESTWARD |
| PIZZA JUMANJI |
| TALENTED HOMICIDE |
| VELVET TERMINATOR |
+----------------------+
7 rows in set (0.01 sec)

So, even though we may have picked the entry at the right position, this may still be a value within the specified portion instead of beyond. The definitions I found for the Nth percentile do not stipulate any mechanism to deal with this kind of ambiguity, whereas for the median, the correct value is always found by averaging the left and right median if necessary.

What about NULL


The second caveat are NULL values. Currently this method does not work when the column for which you want to caculate percentile values is nullable. It is possible to work around this though. If you can exclude the rows with the NULL value entirely, you can simply add a WHERE clause. This is a good idea also because it will cull the number of rows to process.

It may not be acceptable to throw away the rows with NULL values, for example if there is another expression in your SELECT list that needs to do something with all rows. You can then still work around it, with some extra hassle. It would involve tweaking the GROUP_CONCAT to ignore NULL values. This could be done like this:

GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)

This will ensure that GROUP_CONCAT() does not also return NULL when a NULL value is present in the specified column. If there are NULL values, these will end up as a list of comma's in the head of the result:

,,,,<non-null-value1>,...,<non-null-valueN> -- 4 NULL values

Assuming we know the number of NULL's (and we do) we can clean up our list easily with just SUBSTRING():

SUBSTRING(
GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)
, SUM(IF(<column> IS NULL, 1, 0)) + 1
)

Because we are ignoring the NULL values in our list, we must likewise ignore them in our calculation of the portion of rows. So, instead if COUNT(*) we should use COUNT(<column>) in order to not count the NULL values.

group_concat_max_len


When using GROUP_CONCAT(), an issue that should always be on your radar is the maximum length of the GROUP_CONCAT() result. If the result value exceeds the maximum length, the GROUP_CONCAT() result will be truncated, and a warning will be issued:

1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+

It is really important to be aware of any warnings, as a truncation of the GROUP_CONCAT() result messes up the entire calculation.

The maximum length of the GROUP_CONCAT() result is controlled through the group_concat_max_len system variable.

It can be set thus:

SET @@group_concat_max_len := <num-bytes>

The maximum practical value is the maximum packet size, which is available as the max_allowed_packet system variable. This means you can write:

SET @@group_concat_max_len := @@max_allowed_packet;
and you will never be bothered by this problem again. The GROUP_CONCAT() result can still be too large though (namely, larger than the maximum packet size) but in that case you will get a proper error instead of a truncated result.

You should realize that setting the group_concat_max_len to a high value may lead to memory problems, as each GROUP_CONCAT() invocation may individually reserve the specified amount of memory to deal with its result.

Finally...


I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the latest version.

Now, finally, I have a question for you. When I wrote about the median calculation, I mentioned that I thought it was an original method, and I asked whether someone could deny that claim. I did not get any reaction, so I'd like to repeat my question: Do you know of any book or text or blog that describes this technique? If so, let me know so I can provide proper accreditation.

TIA, Roland.

Tuesday, June 24, 2008

Why Schoenberg Rocks...

Personally, I have had a taste for the music of Arnold Schoenberg for a long time. But now, I the wikipedia explains why:

Ten features of Schoenberg's mature twelve-tone practice are characteristic, interdependent, and interactive[1]:

1. Hexachordal inversional combinatoriality
2. Aggregates
3. Linear set presentation
4. Partitioning
5. Isomorphic partitioning
6. Invariants
7. Hexachordal levels
8. Harmony, "consistent with and derived from the properties of the referential set"
9. Metre, established through "pitch-relational characteristics"
10. Multidimensional set presentations
So, aggregates, partitioning, metrics and referential sets, play us another tune...

Monday, June 02, 2008

MySQL 5.1: Measuring #queries/sec. using information_schema.GLOBAL_STATUS

MySQL 5.1 offers new information tables such as GLOBAL_STATUS. This can be used to report certain performance metrics, such as the number of queries processed per second:

SELECT MAX( -- use MAX to force aggregation
IF(variable_name='Questions' -- no. of queries sent to server
, CAST(variable_value AS unsigned) -- make integer value
, 0 -- ignore if not 'Questions'
)
)
/ -- divide by
MAX( -- use MAX to force aggregation
IF(variable_name='Uptime' -- no. of seconds the server is up
, CAST(variable_value AS unsigned) -- make integer value
, 0 -- ignore if not 'Uptime'
)
) AS queries_per_second
FROM information_schema.GLOBAL_STATUS
WHERE variable_name in ('Questions', 'Uptime');

(This snippet is maintained on MySQL Forge)

This query could have been solved in a number of other ways involving subqueries, for example like this:

SELECT (SELECT CAST(variable_value AS unsigned)
FROM information_schema.GLOBAL_STATUS
WHERE variable_name = 'Questions')
/ (SELECT CAST(variable_value AS unsigned)
FROM information_schema.GLOBAL_STATUS
WHERE variable_name = 'Uptime') as queries_per_second

For now I'm sticking to my original snippet. The disadvantage to this approach is that it requires a hack to force aggregation of all rows from GLOBAL_STATUS to one single row (using MAX), as well as some conditional expressions.

The disadvantage of a using subqueries is that it would be slighly more verbose, and it would also require more scanning of information_schema.GLOBAL_STATUS, although I am unsure if that would really matter that much.

What do you think?

Sunday, May 11, 2008

Getting terrorists to talk...

Another quote from slashdot that cracked me up...This time it's gbjbaanb who install MS silverlight to keep his machine from whining and moaning, as found in the recent Silverlight thread:
I installed it in the end just to shut the damn thing up, and even then it refused to install. I almost cried with the frustration! The CIA could use this technique to get their terrorist suspects to talk.

I must admit that I have not installed Silverlight, but out of curiosity, I just might...

Saturday, May 03, 2008

Random slahsdot comment gem

By vux984 in a thread about the article Twitter Said To Be Abandoning Ruby on Rails:
The rails framework is aptly named. Its like driving a train. You follow the rails. Its easy, simple, and those are its strengths. But if one day, you decide you want to cut across a field save a few hours of travel, well, you probably shouldn't have chosen 'train' as your mode of transportation.

Not trying to bash anyone here. This is just a well formulated comment. Maybe the guy took it from somewhere, I dunno. It just stuck on me.

Feel free to post your flames below, maybe I'll pass em on to vux984....

Monday, April 21, 2008

MySQL UC2008 presentation "Grand Tour of the information schema" now online

Yup, the presentation slides as well as the scripts are now available online on the conference website.

The stuff you will find in there:

Information_schema.pdf
Diagram of the information schema

I_S_VC_Slides.pdf
Slides for the UC presentation, "Grand Tour of the Information Schema"

I_S_INDEXES2.sql
script, returns one row for each index (rollup of information_schema.STATISTICS)

I_S_REDUNDANT_INDEXES2.sql
script, lists all redundant indexes. Redundancy rules:

  • two indexes with the same columns, type and uniqueness are interchangeable. The one with the largest index name is listed as redundant

  • if there is a unique index and a non unique index with the same columns and type exists, the non-unique one is considered redundant

  • A non-unique btree index is the leftmost prefix of another btree index (that may or may not be unique)
TODO: indexes created on the referencing columns for InnoDB foreign keys are often redundant. It would be nice if the script could list that the FK currently supports an FK

p_check_fk_constraint_violations.sql
stored procedure Detect probles after using @@foreign_key_checks=1. Takes a schema name LIKE pattern and a table name LIKE pattern and reports any foreign keys constraints for which violations exist. TODO: if the referenced table in the foreign key was dropped, the script will encounter a runtime error. Would be nice to properly report instead

qwz.sql
Stored procedure, The Query wizard for the mysql command line pronounced "cues". To use it, run the script, and it starts a new session with the query wizard. Alternatively, do: USE qwz; CALL qwz(NULL); and follow the on-screen instructions.
schema_doc.sql
script, Generates HTML documentation for the current schema. Run using a command like: mysql -uroot -Dsakila -Nrs -e"source schema_doc.sql" > schema_doc.html

schema_doc.sh
shell, Linux example for running schema_doc.sql

gen_fks.sql
script, Generates trigger code to emulate foreign keys. Set a schema containing foreign key constraints as the default database. Run the script. The output forms trigger code that may be used to recreate the foreign keys functionality in a database that has the same tables but no declarative foreign keys. For foreign keys with the RESTRICT or NO ACTION rules, a non-existent stored procedure will be called: p_child_row_exists(). For orphan INSERTs and UPDATEs, a call is made to p_no_parent_row(). Currently this is the best we can do - it will result in a runtime error and the action causing it will not complete but it is not ideal of course.Corrective rules (CASCADE etc) should just work.

p_create_federated_table.sql
stored procedure, Creates a federated table (residing on a 5.0 or better remote host)

Saturday, March 29, 2008

MySQL information_schema: Identifying rows from TABLE_CONSTRAINTS

Yesterday, I set out a little quiz about the TABLE_CONSTRAINTS table in the MySQL information_schema. The task was:
  • Specify a minimal set of columns of the information_schema.TABLE_CONSTRAINTS table that is sufficient to reliably identify a single row in the information_schema.TABLE_CONSTRAINTS table.
  • Argue why these columns are necessary and sufficient to identify a row, and why a smaller set of columns does not exist

Short Answer


For MySQL there are two such column sets:

  • CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, and CONSTRAINT_TYPE
  • TABLE_NAME, TABLE_SCHEMA, CONSTRAINT_NAME, and CONSTRAINT_TYPE

Explanation


According to the ISO SQL specification (ISO/IEC 9075-11:2003 (E) 6.48 TABLE_CONSTRAINTS base table), we can derive that the "conceptual primary key" of a standard implementation of the TABLE_CONSTRAINTS system view should be (CONSTRAINT_CATALOG, CONSTRAINT_SCHEMA, and CONSTRAINT_NAME). In the absence of support for catalogs, we can ignore the CONSTRAINT_CATALOG column, and this also applies to MySQL (all %_CATALOG columns in the MySQL information_schema are always NULL). That would leave us with (CONSTRAINT_SCHEMA and CONSTRAINT_NAME) - that is, table constraint names would be unique within a schema. However, in MySQL, constraint names need not be unique within a schema.

MySQL supports three types of table constraints:

  • PRIMARY KEY
  • UNIQUE
  • FOREIGN KEY (that is, they are supported dependent upon the storage engine)
.

Identifying FOREIGN KEY constraints


FOREIGN KEYs are in their own per schema namespace, so at a glance it may seem that the combination of (CONSTRAINT_SCHEMA and CONSTRAINT_NAME) is sufficient to identify a FOREIGN KEY constraint. However, because PRIMARY KEY and UNIQUE constraints are not within the same namespace as FOREIGN KEY constraints, a single schema may contain a FOREIGN KEY constraint and a non-FOREIGN KEY constraint that have the same CONSTRAINT_NAME.

From this it follows that we need to take the CONSTRAINT_TYPE into account too.

So in order to reliably identify a FOREIGN KEY constraint, we need to have at least CONSTRAINT_SCHEMA, CONSTRAINT_NAME *and* require that CONSTRAINT_TYPE equals 'FOREIGN KEY'.

(Among the answers was a suggestion that the CONSTRAINT_TYPE is not necessary, because the CONSTRAINT_NAME would provide the necessary differentiation. Although this is true for the identifiers generated by MySQL in the absence of explicit identifiers, it is not true in the general case)

Identifying PRIMARY KEY constraints


So, would (CONSTRAINT_SCHEMA, CONSTRAINT_NAME, and CONSTRAINT_TYPE) be sufficient to identify any of the other types of table constraints?

Unfortunately not. Almost everybody that is somewhat familiar with MySQL knows that in MySQL a PRIMARY KEY constraint doesn't really have its own identifier. Rather, the CONSTRAINT_NAME for a PRIMARY KEY constraint is always 'PRIMARY'. In fact, the identifier 'PRIMARY' is reserved exclusively for PRIMARY KEY constraints, and may not be used for any other type of constraint.

From this, it follows that there can be multiple rows in information_schema.TABLE_CONSTRAINTS that have the same combination of values in the columns (CONSTRAINT_SCHEMA, CONSTRAINT_NAME, and CONSTRAINT_TYPE). In fact, for one particular value of CONSTRAINT_SCHEMA there will be just as many occurrences of the combination ('PRIMARY', 'PRIMARY KEY') in the columns (CONSTRAINT_NAME, CONSTRAINT_TYPE) as there are PRIMARY KEY constraints in that schema.

Of course, we know that there can be only one PRIMARY KEY per table. From that, it follows that we can identify a PRIMARY KEY constraint if we know to which table it belongs.

Tables (and views) reside in their own per-schema namespace, so the combination of (TABLE_SCHEMA and TABLE_NAME) is sufficient to identify a table. That means that the combination of (TABLE_SCHEMA, TABLE_NAME, and CONSTRAINT_TYPE) is sufficient to identify a PRIMARY KEY constraint. Once we identified a particular table using (TABLE_SCHEMA, TABLE_NAME) we know that the table constraint with the CONSTRAINT_TYPE equal to 'PRIMARY KEY' is the PRIMARY KEY constraint.

Of course, because the name 'PRIMARY' is reserved exclusively for PRIMARY KEY constraints we could've equally well settled for (TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME) although personally I prefer to use CONSTRAINT_TYPE.

Identifying UNIQUE constraints


This leaves us with the UNIQUE constraints. UNIQUE constraints must have a unique name per table, and the name must not be equal to 'PRIMARY' (which is reserved for PRIMARY KEY constraints). From this we must conclude that we need no less than the combination of (TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_TYPE, and CONSTRAINT_NAME) to identify a UNIQUE constraint.

We must have (TABLE_SCHEMA, TABLE_NAME) to identify the table because the UNIQUE constraints have a unique name per table. Now we can take a shortcut: within one table we can distinguish between PRIMARY KEY and UNIQUE constraints by virtue of the fact that the CONSTRAINT_NAME for PRIMARY KEYs will always be 'PRIMARY'. However, if we need to distinguish between FOREIGN KEY and UNIQUE constraints, we still need to look at the CONSTRAINT_TYPE too.

Summary


To recapitulate:
  • we need CONSTRAINT_SCHEMA, CONSTRAINT_NAME, CONSTRAINT_TYPE to identify a FOREIGN KEY constraint
  • we need TABLE_SCHEMA, TABLE_NAME, and CONSTRAINT_TYPE to identify a PRIMARY KEY constraint.
  • Alternatively we could identify PRIMARY KEY constraints by looking at TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME by virtue of the fact that all PRIMARY KEYs always have a CONSTRAINT_NAME equal to 'PRIMARY'
  • we need TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_TYPE and CONSTRAINT_NAME to identify a UNIQUE constraint

From this we could conclude that the minimal set of columns required to identify an arbitrary table constraint consists of:
  • CONSTRAINT_SCHEMA
  • CONSTRAINT_NAME
  • CONSTRAINT_TYPE
  • TABLE_SCHEMA
  • TABLE_NAME

However, in MySQL, a table constraint is 'owned' by the table on which it is defined and a table constraint defined on particular table cannot reside in a schema different from the schema its table. This means that for any row in information_schema.TABLE_CONSTRAINTS, the columns TABLE_SCHEMA and CONSTRAINT_SCHEMA will always have the same value.

This brings us to the final conclusion that there are two minimal set of columns that may be used to identify an arbitrary constraint in the information_schema.TABLE_CONSTRAINTS table:
  • CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE
  • TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE

As CONSTRAINT_SCHEMA and TABLE_SCHEMA are completely interchangeable as far as the result is concerned, there is no specific reason to generically prefer either combination.

To learn more about this and other traps in the MySQL information schema, come meet me at the MySQL User's conference. I will be your guide on the Grand Tour of the Information Schema and its Applications, and explain more of these gotcha's.

Thursday, March 27, 2008

"Me Too" MySQL Information Schema popquiz

A few days ago, I wrote how I will be your guide to the Grand Tour of the Information Schema and its Applications which is one of the two talks I will be doing at the upcoming MySQL User's Conference.

In view of the popularity of "Pop Quiz" format so successfully used by Carsten, I feel compelled to imitation, and as a primer to my talk, I'd like to offer you my "Me Too" MySQL Information Schema popquiz. So, here goes...


The MySQL information_schema contains a number of tables. Among them, one is called TABLE_CONSTRAINTS. Unsurprisingly, each row in TABLE_CONSTRAINTS table represents a single table constraint. The TABLE_CONSTRAINTS table has the following columns:

+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| CONSTRAINT_CATALOG | varchar(512) | YES | | NULL | |
| CONSTRAINT_SCHEMA | varchar(64) | NO | | | |
| CONSTRAINT_NAME | varchar(64) | NO | | | |
| TABLE_SCHEMA | varchar(64) | NO | | | |
| TABLE_NAME | varchar(64) | NO | | | |
| CONSTRAINT_TYPE | varchar(64) | NO | | | |
+--------------------+--------------+------+-----+---------+-------+


  • Specify a minimal set of columns of the information_schema.TABLE_CONSTRAINTS table that is sufficient to reliably identify a single row in the information_schema.TABLE_CONSTRAINTS table.

  • Argue why these columns are necessary and sufficient to identify a row, and why a smaller set of columns does not exist



Feel free to post the answer as a comment. Every correct answer will earn a 20% discount to the admission fee. You can get bonus points for nice elegant argumentation, and also if you specify more than one minimal set of columns.

I will elect three to five of the best answers and make sure that whoever posts it gets a laminated printout of the diagram of the MySQL information schema.

(Please make sure you post me your email address if you think you are eligible to either the discount code, the laminated diagram or both)

Tuesday, March 25, 2008

MySQL Information Schema applications at the UC2008

Last week I blogged about the upcoming MySQL Users conference, in particular about the Writing MySQL UDFs tutorial that I will be delivering.

I will also be doing the Grand Tour of the Information Schema and its Applications.


I will discuss the elements in the MySQL information schema, and provide tips to write queries against it. Most of the talk will revolve around a number of scripts I have developed over the past few years:

  • Generating a history/audit database - generate all the code you need to keep track of all changes occurring in your on-line database. Use it to audit or implement 'flashback'

  • Checking foreign key violations - disabling foreign key checks may be useful, but is dangerous. This script helps you find problems with foreign key constraints

  • Creating federated tables - FEDERATED tables are useful, but tedious and error-prone to create. Let this script do the work instead

  • Checking for duplicated and redundant indexes - Redundant or duplicate indexes can slow down your database performance. Find them with this script (Note: the original script I blogged about earlier contains a critical flaw - this is a completely new version)

I also have a few new information schema tricks up my sleeve. Without giving away too much, take a look at this little conversation I just had with my MySQL command line client:

mysql> call qwz.qwz(null);
+-----+------------------------------------------------------------------+
| | Welcome to the command line Query Wizard for MySQL!!! |
+-----+------------------------------------------------------------------+
| 1 | Set the schema (current: world) |
| --- | ---------------------------------------------------------------- |
| 2 | Choose a table |
+-----+------------------------------------------------------------------+
3 rows in set (0.00 sec)

Yes! It's an interactive query wizard for the command line. I don't want to spoil too much, but I can unveil that I will be presenting a full-fledged interactive query wizard that operates completely from the MySQL command line. It's just a stored procedure - no proxy, no UDFs, no plugins or whatever.

To give you a taste of the possibilities, take a look at the continuation of my session with the query wizard. First let's choose the schema:

mysql> call qwz.qwz(1);
+----+------------------------+
| | Set the default schema |
+---+------------------------+
| 1 | information_schema |
| 2 | mysql |
| 3 | qwz |
| 4 | sakila |
| 5 | world |
+---+------------------------+

Let's settle for the sakila schema:

mysql> call qwz.qwz(4);

This will prompt us to choose a table from the sakila schema:

+-----+------------------------------------------------------------------+
| | The command line Query Wizard for MySQL |
+-----+------------------------------------------------------------------+
| 1 | Set the schema (current: sakila) |
| --- | ---------------------------------------------------------------- |
| 2 | actor |
| 3 | actor_info |
| 4 | address |
| 5 | category |
| 6 | city |
| 7 | country |
| 8 | customer |
| 9 | customer_list |
| 10 | film |
| 11 | film_actor |
| 12 | film_category |
| 13 | film_list |
| 14 | film_text |
| 15 | inventory |
| 16 | language |
| 17 | nicer_but_slower_film_list |
| 18 | payment |
| 19 | rental |
| 20 | sales_by_film_category |
| 21 | sales_by_store |
| 22 | staff |
| 23 | staff_list |
| 24 | store |
+-----+------------------------------------------------------------------+
26 rows in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Let's pick the rental table:

mysql> call qwz.qwz(19);

Once we selected the initial table, we can now build a join path, and we are prompted to pick any of the tables related to rental:

+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | + sakila.customer (fk_rental_customer) |
| 2 | + sakila.inventory (fk_rental_inventory) |
| 3 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 4 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 5 | New Query |
+-----+------------------------------------------------------------------+
8 rows in set (0.20 sec)

Query OK, 0 rows affected (0.20 sec)

Let's extend the join path with the customer table:

mysql> call qwz.qwz(1);
+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | - sakila.customer (fk_rental_customer) |
| 2 | + sakila.address (fk_customer_address) |
| 3 | + sakila.store (fk_customer_store) |
| 4 | + sakila.inventory (fk_rental_inventory) |
| 5 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 6 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 7 | New Query |
+-----+------------------------------------------------------------------+
10 rows in set (0.09 sec)

Query OK, 0 rows affected (0.09 sec)

Let's throw in the inventory table in the mix too:

mysql> call qwz.qwz(4);
+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | - sakila.customer (fk_rental_customer) |
| 2 | + sakila.address (fk_customer_address) |
| 3 | + sakila.store (fk_customer_store) |
| 4 | - sakila.inventory (fk_rental_inventory) |
| 5 | + sakila.film (fk_inventory_film) |
| 6 | + sakila.store (fk_inventory_store) |
| 7 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 8 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 9 | New Query |
+-----+------------------------------------------------------------------+
12 rows in set (0.10 sec)

Query OK, 0 rows affected (0.10 sec)

We can keep this up quite a long time, and we can remove tables from our join path in a similar way. We can then start specifying some columns:

mysql> call qwz.qwz(8);
+-----+------------------------------------------------------------------+
| | Choose columns for the SELECT list |
+-----+------------------------------------------------------------------+
| | sakila.rental.* |
| | --------------- |
| 1 | + rental_id |
| 2 | + rental_date |
| 3 | + inventory_id |
| 4 | + customer_id |
| 5 | + return_date |
| 6 | + staff_id |
| 7 | + last_update |
| | --------------- |
| | sakila.customer.* (fk_rental_customer) |
| | -------------------------------------- |
| 8 | + customer_id |
| 9 | + store_id |
| 10 | + first_name |
| 11 | + last_name |
| 12 | + email |
| 13 | + address_id |
| 14 | + active |
| 15 | + create_date |
| 16 | + last_update |
| | -------------------------------------- |
| | sakila.inventory.* (fk_rental_inventory) |
| | ---------------------------------------- |
| 17 | + inventory_id |
| 18 | + film_id |
| 19 | + store_id |
| 20 | + last_update |
| --- | ---------------------------------------------------------------- |
| 21 | New Query |
+-----+------------------------------------------------------------------+
30 rows in set (0.02 sec)

Query OK, 0 rows affected (0.02 sec)

Currently I'm still working on a multiple select interface, but for now I'll snip out all but the last individual column selection:

mysql> call qwz.qwz(18);
+-----+------------------------------------------------------------------+
| | SELECT list: + columns are added / - columns are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental.* |
| | --------------- |
| 1 | + rental_id |
| 2 | - rental_date |
| 3 | + inventory_id |
| 4 | + customer_id |
| 5 | + return_date |
| 6 | + staff_id |
| 7 | + last_update |
| | --------------- |
| | sakila.customer.* (fk_rental_customer) |
| | -------------------------------------- |
| 8 | + customer_id |
| 9 | + store_id |
| 10 | - first_name |
| 11 | - last_name |
| 12 | + email |
| 13 | + address_id |
| 14 | + active |
| 15 | + create_date |
| 16 | + last_update |
| | -------------------------------------- |
| | sakila.inventory.* (fk_rental_inventory) |
| | ---------------------------------------- |
| 17 | + inventory_id |
| 18 | - film_id |
| 19 | + store_id |
| 20 | + last_update |
| --- | ---------------------------------------------------------------- |
| 21 | Show SQL |
| --- | ---------------------------------------------------------------- |
| 22 | Explain SQL |
| --- | ---------------------------------------------------------------- |
| 23 | Execute SQL |
| --- | ---------------------------------------------------------------- |
| 24 | New Query |
+-----+------------------------------------------------------------------+
36 rows in set (0.03 sec)

Query OK, 0 rows affected (0.03 sec)

Now is a good moment to inspect the SQL underlying our query:

mysql> call qwz.qwz(21);

+-----+----------------------------------------------------------------------------+
| | The command line Query Wizard for MySQL |
+-----+----------------------------------------------------------------------------+
| | SELECT t1.rental_date |
| | ,t2.first_name |
| | ,t2.last_name |
| | ,t8.film_id |
| | FROM sakila.rental AS t1 |
| | INNER JOIN sakila.customer AS t2 ON (t1.customer_id) = (t2.customer_id) |
| | INNER JOIN sakila.inventory AS t8 ON (t1.inventory_id) = (t8.inventory_id) |
| --- | ---------------------------------------------------------------- |
| 1 | Execute SQL |
| --- | ---------------------------------------------------------------- |
| 2 | Explain SQL |
| --- | ---------------------------------------------------------------- |
| 3 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 4 | New Query |
+-----+----------------------------------------------------------------------------+
15 rows in set (1.21 sec)

Query OK, 0 rows affected (1.21 sec)

And we can execute it, or run EXPLAIN on it:

mysql> call qwz.qwz(2);
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
| 1 | SIMPLE | t8 | index | PRIMARY | PRIMARY | 3 | NULL | 4673 | |
| 1 | SIMPLE | t1 | ref | idx_fk_inventory_id,idx_fk_customer_id | idx_fk_inventory_id | 3 | sakila.t8.inventory_id | 1 | |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 2 | sakila.t1.customer_id | 1 | |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
3 rows in set (0.00 sec)

As you can imagine, quite a good deal of information schema hacking going on here.

Code will be released in full immediately after my talk, and if there is sufficient interest I will discuss the internals of the qwz stored procedure in full.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...