I just got word the Schedule for the MySQL Conference 2009 is now available!
In my previous post, and the one before that, I already wrote about my approved proposals. I promised I would post more details on my sessions later on. Well, it's right here:
http://en.oreilly.com/mysql2009/public/schedule/speaker/198
While you're at it, check out the list of speakers. I think it is a pretty good line-up, and I am looking forward to seeing the talks and meeting these people. I've already spotted quite a few interesting things, and I'll compile a list of the talks that caught my special attention later on.
Friday, December 19, 2008
Wednesday, December 17, 2008
"Starring Sakila": a Data Warehousing mini-tutorial at the MySQL UC 2009
Hi!
Recently I wrote about how glad I was to see two of my proposals for the upcoming MySQL Conference approved.
I am absolutely thrilled to see that my third proposal has been approved as well: "Starring Sakila: data warehousing explained, illustrated and subtitled". I'm very proud to announce that I'm doing this talk together with Matt Casters, Chief of Data Integration at Pentaho and creator of the popular ETL/data integration suite Kettle (a.k.a. Pentaho Data Integration).
During this 45-minute session, Matt and I will take the Sakila sample database and explain how to design and load a data warehouse for it, and how to use that warehouse to create the reports and charts that make managers smile ;-). Along the way, we'll explain often-heard terms and concepts such as dimensional model, denormalization, star schema, OLAP, data staging, business intelligence, ETL and data integration.
Note that this session is targeted at SQL developers, DBAs and tech-savvy analysts ('suits' are most welcome) who are new to business intelligence and data warehousing. We noticed that although data warehousing and business intelligence are hardly new topics, there is still a lot of confusion surrounding them. The main goal of this session is to explain what it is all about, and to immediately make the concepts we cover tangible by illustrating them with a database schema that is easy to grasp and already familiar to most of the MySQL crowd.
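To give a taste of the star schema idea right here: a dimensional model puts the measures in a central fact table and the descriptive attributes in denormalized dimension tables. Here is a minimal sketch, using Python's built-in sqlite3 and made-up table names inspired by Sakila's rental data (this is an illustration, not the actual schema we'll present):

```python
import sqlite3

# A tiny, hypothetical star schema: one fact table surrounded by
# denormalized dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_film (film_key INTEGER PRIMARY KEY, title TEXT, category TEXT);
CREATE TABLE fact_rental (
    date_key INT REFERENCES dim_date(date_key),
    film_key INT REFERENCES dim_film(film_key),
    rental_amount REAL
);
INSERT INTO dim_date VALUES (1, 2008, 11), (2, 2008, 12);
INSERT INTO dim_film VALUES (1, 'ACADEMY DINOSAUR', 'Documentary'),
                            (2, 'ACE GOLDFINGER', 'Horror');
INSERT INTO fact_rental VALUES (1, 1, 2.99), (2, 1, 2.99), (2, 2, 4.99);
""")

# The kind of query that makes managers smile: revenue per category per month.
report = conn.execute("""
    SELECT d.year, d.month, f.category, SUM(r.rental_amount) AS revenue
    FROM fact_rental r
    JOIN dim_date d ON d.date_key = r.date_key
    JOIN dim_film f ON f.film_key = r.film_key
    GROUP BY d.year, d.month, f.category
    ORDER BY d.year, d.month, f.category
""").fetchall()
for row in report:
    print(row)
```

A single GROUP BY over the fact table joined to its dimensions yields the "revenue per category per month" kind of report; a real Sakila warehouse would of course have more dimensions (store, staff, customer) and far more rows.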
Of course, the MySQL Users Conference will feature more advanced data warehousing and business intelligence sessions too. I am pretty sure we will see a lot of Infobright and Kickfire. One session that I know will be very interesting is Matt's talk on utilizing computing clouds to power high-performance, scalable ETL processes.
Anyway, I'm looking forward to it. I will post a more detailed overview of the content of this and the other sessions I'm involved in as the conference date nears.
Video
A video recording of this session can be found here, courtesy of Sheeri.
Thursday, December 11, 2008
Speaking at the MySQL Conference 2009
I am very proud to announce that so far, two proposals I submitted for the MySQL Conference 2009 have been approved! Woohoo!
(Did you submit a proposal too, and still didn't receive a reply? No worries - there is a massive amount of proposals to go through. From past years' experience I can say: don't despair, just have some patience. It can take some time for the MySQL community guys to work through all the proposals.)
As every year, I am looking forward to this event a lot, and having the opportunity to speak there just brings a big smile to my face ;)
Anyway - These are the approved proposals:
- Gearing up MySQL: Implementing MySQL synchronization for browser-based apps using an embedded database (powered by Google Gears)
- Practical MySQL Plugin Development
The first proposal is for a 45-minute session on how to effectively use a client-side embedded database in browser applications, and how to synchronize that local data to a remote MySQL server. It is focused on Google Gears, but the principles and approach are equally applicable to an HTML 5-compliant implementation of a browser-embedded SQL database (such as the one in Safari).
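The core of such a synchronization approach can be sketched independently of Gears. Assuming, purely for illustration (this is not the session's actual code), that each local row carries a monotonically increasing version number, a push of "everything newer than the last acknowledged version" looks like this, with sqlite3 standing in for both the browser-embedded database and the MySQL server:

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT, version INT)")
    return db

local, remote = make_db(), make_db()

# Local (in-browser) edits are stamped with a monotonically increasing version.
def local_write(db, id_, body, version):
    db.execute("INSERT OR REPLACE INTO note VALUES (?, ?, ?)", (id_, body, version))

local_write(local, 1, "draft", 1)
local_write(local, 2, "todo list", 2)

# Push: send every row newer than the last version the server acknowledged.
def push(local_db, remote_db, last_synced_version):
    rows = local_db.execute(
        "SELECT id, body, version FROM note WHERE version > ?",
        (last_synced_version,)).fetchall()
    remote_db.executemany("INSERT OR REPLACE INTO note VALUES (?, ?, ?)", rows)
    return max((v for _, _, v in rows), default=last_synced_version)

synced = push(local, remote, 0)          # first sync sends both rows
local_write(local, 1, "draft v2", 3)     # only this row changed since
synced = push(local, remote, synced)     # second sync sends just the changed row

print(remote.execute("SELECT id, body FROM note ORDER BY id").fetchall())
```

The point of the version column is that each sync only ships the delta, not the whole table; conflict resolution between concurrent local and remote edits is a separate problem that the sketch does not address.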
The second proposal is for a 3-hour tutorial. The aim is to do a hands-on tutorial on the basics of MySQL plugins that is suitable even for those who have no prior C/C++ skills. Yeah, it may sound ambitious, but last year I did a UDF tutorial along the same lines, and that seemed to work out pretty well. I am very lucky to be doing this one together with Sergei Golubchik, one of the MySQL Server developers who has seen every corner of the server, and who actively architects the plugin API. Really - if you want to learn about plugins, don't miss your chance to learn about them at the conference.
I will most likely provide some more details on the sessions later on, but feel free to post a comment in case you want to know more.
Friday, October 31, 2008
The Dutch are making money with open source software
I just want to take a moment to congratulate Stani Michiels, the creator of Stani's Python IDE.
Stani has won a contest to create the Architecture Fiver, a commemorative Dutch 5 euro coin:
The front side shows the portrait of Queen Beatrix emerging from a spiral of the names of Dutch architects, ordered by the number of internet hits. The back side shows a number of books (authored by Dutch architects) rising up as if they were buildings, their roofs forming the contours of a map of the Netherlands.
I think this is a truly creative and beautiful design, and I think the prize is well-deserved.
An extra tidbit that interests me is that the design was created exclusively with open source and free software tools (predominantly Python). So, quite literally, this is money made by open source software. Read all about it on his blog: SPE IDE - Stani's Python Editor: How to make money with free software....
Stani, kudos to you! When I get my hands on one of these, I'll keep it for display. You can be sure the image of the coin will make its way into many presentations on open source business.
Friday, October 24, 2008
What kind of life is that?!
Over a year ago, I wrote about a little bit of irony offered by the runtime error messages emitted by the Windows Vista Ultimate test-drive site (which I just checked again, this time with IE6, and which happens to choke on another bug).
Anyway, I just found another gem in this category. While searching for MS Visual Studio Express, I bumped into an ad for the "Defy All Challenges" campaign on the Dutch Microsoft site. The site shows a picture of a pirate captain titled "A Developer's Life for me", and below it some text that invites other brave developers to "Defy all challenges" with Microsoft Visual Studio 2008. I took a snapshot for posterity:
Unfortunately, as soon as you visit the site (I'm using IE 6.0 on Windows XP SP2 right now), you get a message box informing you that a JavaScript runtime error has occurred ("Object Expected"). So I'm afraid it's a case of "A Bug's Life" rather than "A Developer's Life".
Wednesday, October 15, 2008
Maturity of Open Source ETL / Why do people fail to do their homework?
I just read this post on Matt Casters' blog. Here, Matt describes why Element 61's Jan Claes is dead wrong in the way he assesses the maturity of open source ETL tools.
Well, I've just read Jan Claes' article in the "research and insights" area of the Element61 website, and frankly, it is pretty easy to see how unsubstantiated it is. Some may be tempted to classify the article as FUD, although typically 'real' FUD articles are more sophisticated and are usually not so blatantly sloppy with facts. I could get all worked up over it, but mostly it leaves me with a genuine confusion as to why a seemingly serious IT consultancy company would want to have their professionals post this kind of stuff on the company website and let it pass as "research" or even "insight".
Anyway, let's take a look at the article and I'll try and explain why it won't wash...
(Disclaimer - I can easily debunk Jan's article when I look at Kettle, a.k.a. Pentaho data integration. I do not have enough experience with the other open source ETL tools mentioned in his article, Talend and CloverETL, so it is entirely possible that the article does these tools similar injustice).
The article starts off by explaining what ETL is, and correctly mentions that prerequisite tools and resources are generally considered expensive. Quite unsurprisingly, the article moves on to explore open source ETL tools in order to find a cost-reducing alternative. So far, so good - the notion that cost reduction is the primary motive for most businesses to start using open source software is far from controversial.
From here on the article wanders off, and starts talking about open source in general. Some sympathetic but unsubstantiated and unillustrated claims are made about the relation between "standards" and open source software.
The article then becomes more critical and downplays the viability of an open source ETL community as compared to the Linux community. According to Jan Claes, Linux is successful ...because the developers united in a "quest" against Windows.... According to him, such a common motive is lacking in the ETL community. He has it that ...[ETL] is a specialist's world..., and that because of that, ...progress might be significantly slower than in the case of other, more successful, Open Source programs.
Just take a second to let it sink in. Anybody met a Linux dev lately who was out to beat Windows? Last time I checked, Linux was meant to be a free (as in speech) UNIX system. Most Linux users I know couldn't care less about Windows, and they are certainly not 'united in a quest' to create something like it. I take it that by tagging ETL as belonging to the 'specialist's world', I am to understand that Linux kernel development is 'generalistic'.
If I am to believe Jan, everybody is so busy with their quest against Windows, doing all that 'general' hacking in C/C++ on the Linux kernel, that nobody is specialist enough to help that poor ailing open source ETL community develop say, a MySQL bulk loader step, or a Regular Expression Matching step, or a Normalizing step. (Those who have been following the development of Kettle know that these and more are all contributed by the community.)
To make sure he gets the point across, the article then says that ...The most advanced suppliers of Open Source ETL actually admit today that they are not really community-driven, and that each company works with their own developers.... No link. No date. No nothing, we should probably just take Jan's word for it.
Look, don't get me wrong - I don't want to deny that ETL is a specialist area of expertise. But in my opinion, this is in part due to the fact that ETL (and BI tools in general) have been inaccessible to most people. Let me explain that.
My take on it is that open source BI is now in the same boat that the open source databases were in a few years ago. While mocked and ignored by the 'real' experts and specialists, who do not deign to touch the open source 'toy' databases and can only be bothered to work with traditional 'enterprise' databases, the open source databases caused an emancipation among application developers and web developers.
Databases are now a commodity, and there are quite a lot of developers that know how to use one. Of course, apart from a large body of database users with basic skills, there are still 'real' experts, but some of them got there with little or none of the typical DBA background, thanks to the fact that open source databases disrupted the market and commoditized it. I see the same thing happening for BI: because the tools are now accessible and available, BI is popping up where it didn't before. In stealth mode, it is becoming a commodity.
The article continues, and now starts discussing a number of products: CloverETL, Pentaho, and Talend. Jan claims that these ETL tools provide ...just enough transformations to make it a viable option.
Huh?! Say what?! Just enough?... Of course, no comparison is made with a closed source product, so we can only guess what he was expecting. But really, people: when I open Kettle 3.1, the latest GA version, I see:
- Over 20 input steps, ranging from standard things like simple database tables, csv files and excel workbooks to slightly more advanced things like XML and LDAP to more exotic things like XBase and Web services
- In addition to regular input steps, specialized bulk loaders for Oracle, MySQL, Postgres, Greenplum and MonetDB
- Over 10 output steps (again ranging from standard stuff to more advanced things)
- About 10 flavors of lookup, join and dimension maintenance steps
- No less than 30 transformation steps, including normalization, denormalization, XSLT, XML injection
- A bunch of flow control steps
And this is just for transformations! For Kettle jobs, there's another bucket of steps - there's just too much to mention.
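To give a flavor of what one of those steps does: a row normalizer turns repeating columns into key/value rows. A tiny Python sketch of the idea (not Kettle's actual implementation, and the field names are made up):

```python
def normalize_rows(rows, key_field, pivot_fields):
    """Turn repeating columns (e.g. quarterly sales columns) into
    (key, field, value) rows - the classic 'row normalizer' transformation."""
    for row in rows:
        for field in pivot_fields:
            yield (row[key_field], field, row[field])

# A 'wide' input stream with one column per quarter...
wide = [{"product": "DVD", "q1": 10, "q2": 12},
        {"product": "VHS", "q1": 3,  "q2": 1}]

# ...becomes a 'tall' stream with one row per (product, quarter) pair.
tall = list(normalize_rows(wide, "product", ["q1", "q2"]))
print(tall)  # [('DVD', 'q1', 10), ('DVD', 'q2', 12), ('VHS', 'q1', 3), ('VHS', 'q2', 1)]
```

The denormalizer step is simply the inverse operation: pivoting the tall stream back into one wide row per key.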
The article continues and does a run-down of "when to use/when not to use" open source ETL tools, and it is in this section where we see most of the inaccuracies.
Here are the "when to use" cases according to Jan Claes:
No complex transformations
The straightforward transformations can be implemented very quickly, ... When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.
This is like, "so what"?! I mean, seriously - basically this complaint can be simplified to "when it's hard, it's hard". I dare anybody to come up with a single case where any tool provides an "out-of-the-box solution" for an arbitrarily complex problem. I suspect Jan has not tried, and if he did, why doesn't he mention what he tried? I mean, man, just one example of what was expected but not found - surely that is not too much to ask? Gosh...
Few data sources to connect to
Most reputed ETL-vendors provide an extensive amount of connections to all sorts of data sources. This is a problem with Open Source ETL tools: most of them are based on Java architecture and need JDBC to connect to a database.
Seriously, what crap is this?
First of all, it is completely untrue. I mean, I won't bore you with the big 5 databases, and I know you have all heard of the less well known ones like Greenplum, Intersystems Cache and Netezza, but have you ever heard of "ExtenDB"? Or "Kingbase ES"? Or "Gupta"? No?! Well, Kettle has, and it ships with support for all of these for free. Yup, that's right: no less than 32 different RDBMS systems, and for most of them there isn't just one driver but multiple drivers available - no additional installation or fee required.
Second, how does it make sense to whine about the underlying protocol or API used to connect to the data source? I have never heard anybody complain about MS SQL Server only offering ADO or ODBC connectivity, have you? In practice there is a JDBC driver available for virtually every database, and if not, you can still use ODBC connectivity. You are also not confined to the pre-delivered drivers: you can simply drop your own drivers into the Kettle jdbc directory and connect to other databases. If anything, Kettle's choice for JDBC seems to have done anything but limit connectivity.
Do your homework, dude. All it takes is one download and one unzip command.
Small scaled projects
Small scaled projects with one or two, very common sources and a limited number of target tables don't need an expensive and complicated ETL tool. An Open Source ETL tool can bring a cheap and easy to install solution, which even could run on a local PC.
Well, it is certainly true that you can run Kettle on a single PC with modest resource usage. But you can also think big, and benefit from built-in threading to burn all the cores of your high-end server. Or you can cluster a bunch of machines and have them divide the work to perform a single transformation. You can run it in a cloud and use sophisticated features like parallel input, and you can use it to load shards ("database partitioning"). In fact, there are no obvious limitations to Kettle's deployment scale.
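The divide-and-conquer idea behind clustering a transformation is simple: partition the row stream and let each worker run the same transformation on its slice. A toy Python sketch of that idea (with a made-up "add VAT" step standing in for real transformation logic - this is the concept, not how Kettle itself is implemented):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(row):
    # A stand-in for a real transformation step (cleansing, lookups, ...):
    # add 19% VAT to the amount.
    return {"id": row["id"], "amount": round(row["amount"] * 1.19, 2)}

def run_partitioned(rows, n_partitions):
    # Split the stream into partitions and let each worker handle one -
    # the same divide-and-conquer idea behind clustering a transformation.
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(lambda part: [transform(r) for r in part], partitions)
    return [row for part in results for row in part]

rows = [{"id": i, "amount": float(i)} for i in range(1, 5)]
out = run_partitioned(rows, 2)
print(sorted(r["id"] for r in out))  # [1, 2, 3, 4]
```

This only pays off when the transformation itself is the bottleneck and the rows are independent; steps that need to see the whole stream (sorting, aggregating) require a merge phase that the sketch omits.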
So, as should be obvious by now, Jan has managed to write his "when to use" cases as if they were "when not to use" cases. The real "when not to use" cases follow:
Java & XML knowledge required for complex transformations
As all the Open Source ETL tools are Java and XML based, you need at least a basic understanding and knowledge of Java and XML to get the most out of them...[bla, bla and more bla]...the user has to develop the transformation rules, and therefore needs an advanced knowledge of Java.
This is a blatant lie. I won't deny that it is possible to muck around coding Java as a Kettle user, but technically you would be a plugin writer. And the XML claim is completely off track - there is no requirement whatsoever for the end user to edit raw XML, ever. Really, I have just gone down the road describing the wealth of steps available to build your transformations, and really, you can trust me when I say: this is enough for about 95% of the cases, if not 99%. Who can blame the product for allowing you to write a Java plugin if you really want to?
Speaking of which, does Jan mean to imply that the "traditional" products he is writing about do not allow you to write plugins? If that is so, then it seems an easy choice for me. I'd rather have more options, not less. That said, I know that, for example, MS SQL Server Integration Services allows .NET plugins. How is this functionally different?
Connectivity to multiple & less common data sources required.
Heh ;) As I have argued before, this is nonsense... Elementary, Watson.
Lack of skills, knowledge & resources
Well, I guess this is one thing I've got to give them. Clearly, certain skills, knowledge and resources are missing in Element61's outfit. Otherwise I can't explain why they'd be advertising such an easy-target article.
Anyway, maybe you have thoughts on Open Source ETL that you want to share. If so, be my guest.
Thank you for bearing with me, and please leave a comment if you like.
Saturday, October 04, 2008
Less is More - Happy Eee PC-ing
Ok - I admit - it was impulse more than utility that made me buy this Asus Eee PC 900. People who know me can confirm that I am not a gadgety type of guy, so I realize it may seem a bit strange that I would be suckered into buying trendy hardware.
Well, the secret is: there is no secret. The Eee PC is not a gadget - it is a completely functional laptop that is extremely suitable for traveling. It's got 1G of RAM and 16G of flash storage onboard. A 900MHz Intel CPU may sound slow, but I haven't really noticed that it is; in fact, performance feels pretty peppy. The machine boots *fast* and shuts down even faster. It's light and small, but still has a keyboard and display that are large enough to work on. The battery lasts me about 2.5 to 3 hours - I will try and find another battery. It cost 300 euros, and this price was exactly right for me. If it had been more expensive I probably would not have bought it - even if it had been a beefier machine.
The Eee PC 900 I bought comes with Xandros Linux. I feel they really did a remarkable job of creating a desktop Linux that is attractive, clear, easy to use and unwreckable by the non-technical user. The same model is available with Windows XP too, but in order to keep the same price they took out RAM and disk. Please, dudes and dudettes - do yourself a favor and don't buy that one. Really, the pre-installed Linux is pretty darn good, and considering the clipped hardware specs, I frankly can't see how the same machine with Windows XP could pass as anything but a joke. Yeah, I'm biased - trust me, I'm right ;)
If you have some experience with Linux, it's still easy to set it up so you can use a full KDE desktop.
I used to work on my notebook in packed trains and in cramped planes, but I think I will use the Eee PC instead - it really makes my life easier.
Thursday, October 02, 2008
MySQL 5.1 Cluster Certification Guide online Preview!
Wow!
I just read Martin "MC" Brown's blog announcing that a number of chapters are now available online for preview.
Personally, I think this is a good move. The online chapters provide a nice insight into what the book is like, and I can assure you that there is lots and lots more to see in the other chapters. I can honestly and without reservation recommend this book to everybody who wants a recent reference book on MySQL Cluster, and to those who want to take the 5.1 Cluster DBA certification exam in particular.
Now it must be noted that I am a co-author of this book, and the author of the MySQL 5.1 Cluster DBA exam, so my recommendations are hardly objective. However, I don't receive any royalties for the book (or for the exams) - my contributions were just part of my time working at MySQL AB. So, when I say that MC Brown is right when he says that:
...there is only one guide that you need, the MySQL 5.1 Cluster DBA Certification Study Guide, then I mean that sincerely - this is the book you'll want. Pick up your copy here.
Friday, September 26, 2008
Changing Lanes: Leaving Sun/MySQL
After having worked for two and a quarter years at MySQL AB (now Sun), I have decided to resign and pursue another career.
Before I joined MySQL AB I was working as a database consultant and application developer. I knew that joining MySQL would mean giving that up, but I was happy to do so because I really liked the MySQL product. I was also attracted by the company itself: the flexible working hours, working from home, travelling abroad, and, very importantly, it being one of the few companies capable of creating FOSS software on a commercial basis.
I have had a splendid time at MySQL AB, and later Sun. I met lots of smart people and travelled to a bunch of places I would probably not have had the chance to visit otherwise. I learned a lot about the MySQL server, and a lot about the challenges involved in making an open source software company profitable. I've been given the chance to co-author a book on MySQL Cluster, and authored the cluster and associate certification exams. Finally, I created a training course that teaches you how to write MySQL UDFs.
Looking back, I feel quite satisfied, and I'm thankful for having had this opportunity. However, slowly but surely I have noticed that I have been less involved in technical stuff, at least professionally. I have been spending so much time on hobby hacking in my spare time, that I reconsidered my position, and decided that I would really like to pick up application development again. At some point, a good opportunity came by, and I decided to seize it. So, that's what I did.
As of the 1st of October, next Wednesday, I will be working for Strukton Rail. I will be working for a department that's responsible for business process improvement and re-engineering. I will be doing a bit of business process analysis and consultancy, but the larger part of my job will consist of developing a Web application to model business processes and to support project management. These applications will have a distinctly graphical nature, and I have a pretty clear idea where I want that to end up. In addition I will be adding business intelligence to these applications.
I will be in the lucky position to be able to decide for a large part which technologies and products we will be using, and I think I will be able to build all of it on open source/free software. I am pretty certain that I will be using MySQL, Pentaho, PHP, ExtJs and Google Gears.
My intention is to keep blogging about MySQL. Most likely I will write more about all these other technologies too.
I'd like to finish by saying again what a wonderful time I had with MySQL/Sun! I wish MySQL/Sun the best of luck, and I want to thank all of my colleagues there. Thank you! I have learned a lot from you, and I'm looking forward to meeting again some place, some time.
Roland
Tuesday, August 26, 2008
Webinar: Introducing MySQL Workbench
September 16, 10:00 am PST, 1:00 pm EST, 18:00 GMT
I'm proud to inform you that I will be hosting a MySQL Workbench Webinar, September 16. You can find the announcement right here.
I have spent a great deal of time testing MySQL Workbench. Thanks to the hard work of the MySQL Developer Tools Team, the tool just gets better and better with each new build. Personally, I think it is high time to present its features and benefits in a web seminar, and when Giuseppe asked me to do it, I was more than happy to oblige.
I'll shortly post an outline of some of the key features I will be presenting. Most likely, I'll address a number of issues I've seen recurring on the MySQL Workbench Forums, but I also intend to demonstrate a number of more advanced features that I hope will benefit DBAs as well as developers.
Stay tuned ;)
Tuesday, July 29, 2008
MySQL aggregate UDF Performance
At the MySQL Users Conference 2008 I did a tutorial on MySQL user-defined functions. At the same conference, Brian Miezejewski and Mark Atwood did a talk on aggregate UDFs.
In relation to their upcoming talk, Mark wrote a blog post reporting that MySQL aggregate UDFs are three times slower than equivalent native functions:
I just proved to myself via testing and benchmarking that user defined aggregate functions are about three times slower than the call interface to the built-in functions.
Later on he posted an update, explaining that it looks like this result is probably "a MacOS thing":
It turns out to be a MacOS thing. On Linux, the performance penalty is less than 10%. Further research will surely be done.
I just tested it on Linux and Windows. For a simple aggregate UDF that is equivalent to the built-in COUNT() function, I found that the UDF is about 5% slower than the built-in function. Personally I think that's not too bad.
If you are interested in trying this on your own system, download these instructions. It would be cool if you could leave a comment here with your results. TIA!
Thursday, July 24, 2008
Inspect the Query Cache using MySQL Information Schema Plug-ins
A while ago I wrote about MySQL 5.1 information schema plug-ins.
At the time, I wrote a plug-in to report the contents of the query cache, but for all kinds of reasons, I never found the time to write a decent article about it, nor to release the code.
I am not sure if I'll ever find the time to write that article, but I just tidied up the code, and installed it in a new MySQL 5.1.26-rc server. It seems to work there, so I put the code up on the web.
Inside the source file, there are instructions to build and deploy it. If all goes well, you can do things like:
mysql> select * from information_schema.mysql_cached_queries\G
*************************** 1. row ***************************
STATEMENT_ID: 1
SCHEMA_NAME: test
STATEMENT_TEXT: select count(*) from world.city
RESULT_BLOCKS_COUNT: 1
RESULT_BLOCKS_SIZE: 512
RESULT_BLOCKS_SIZE_USED: 106
1 row in set (0.00 sec)
Are you interested in this plug-in? Do you have any suggestions? Download it, and put your comments here.
Building MySQL from Source: There's a fine line...
There's a fine line between love and hate, you see
Can't wait to late, but baby I'm on it
(From: "Liberation" by Outkast)
Suppose you need to build a MySQL server with the same options as a pre-built one, distributed through the MySQL download site.
Sounds strange?
Well, maybe, but that's what you need if you want to compile certain MySQL 5.1 plug-ins so that they'll play nice with such a pre-built server.
Some plug-ins depend on nothing more than the plugin.h header file, but storage engine plug-ins, for example, require things way beyond that. If you want to compile those yourself, you are required to first build a MySQL server from source, even though you will install the plug-in in another, pre-built, server. (If you want to see how this works in practice, check out the excellent documentation for building and installing the PBXT storage engine plug-in.)
Now, the snag is, the server you need to build must ideally be built in exactly the same way as the deployment server. So, the question is, how do you find out how your server was built?
As it turns out, there is a very simple method to obtain the original compile line: you can find it in the mysqlbug script. On line 28 you see the following line:
CONFIGURE_LINE="./configure '--prefix=/usr/local/mysql' '--localstatedir=/usr/local/mysql/data' '--libexecdir=/usr/local/mysql/bin' '--with-comment=MySQL Community Server (GPL)' '--with-server-suffix=' '--enable-thread-safe-client' '--enable-local-infile' '--enable-assembler' '--with-pic' '--with-fast-mutexes' '--with-client-ldflags=-static' '--with-mysqld-ldflags=-static' '--with-zlib-dir=bundled' '--with-big-tables' '--with-ssl' '--with-readline' '--with-embedded-server' '--with-partition' '--with-innodb' '--without-ndbcluster' '--with-archive-storage-engine' '--with-blackhole-storage-engine' '--with-csv-storage-engine' '--without-example-storage-engine' '--with-federated-storage-engine' '--with-extra-charsets=complex' 'CC=ccache gcc -static-libgcc' 'CFLAGS=-g -O3' 'CXXFLAGS=-g -O3' 'CXX=ccache gcc -static-libgcc'"
Are you curious what will happen if you compile your plug-in against a server that is not built the same as the deployment server? Well, the plug-in won't be "playing nice" with your server, and it is likely that something nasty will happen, such as a crash, or worse: the plugin may unintentionally change the behavior of the server, with unwelcome results such as data corruption.
So, find the mysqlbug script (it's normally located in the MySQL bin directory) and find that fine line...
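If you'd rather not hunt through the script by eye, here is a small Python sketch that fishes the line out for you. Note that this is just an illustration of mine, not part of the original post, and the default path is an assumption; adjust it to your installation.

```python
# Hypothetical helper: extract the original configure invocation from the
# mysqlbug script shipped with a binary MySQL distribution.
# The default path is an assumption; mysqlbug normally lives in MySQL's bin directory.
def configure_line(path="/usr/local/mysql/bin/mysqlbug"):
    with open(path) as script:
        for line in script:
            if line.startswith("CONFIGURE_LINE="):
                # drop the variable name and the surrounding double quotes
                return line.split("=", 1)[1].strip().strip('"')
    return None
```

Running the returned string through ./configure should then give you a build configured like the pre-built server.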
Friday, July 11, 2008
MySQL: DIVide and Conquer
Everybody that has had to do some numeric calculations in SQL will have encountered this simple problem: divide an integer by another integer and round down the outcome to the nearest integer. In many cases people write it like this:
FLOOR(a/b)
Simple enough, right? First we do the division a/b, then we round down using the function FLOOR().
Update: My claim that TRUNCATE(a/b, 0) is equivalent to FLOOR(a/b) is false! It may be true when the outcome of the division is a positive number, but in the case of a negative number, TRUNCATE() will only lose the decimals (resulting in a higher negative number), whereas FLOOR() will still round down.
Thanks, Kai!
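The difference is easy to verify outside MySQL too. Here is a quick Python sketch of mine, with math.floor and math.trunc standing in for FLOOR() and TRUNCATE(..., 0):

```python
import math

# FLOOR() rounds toward negative infinity; TRUNCATE(x, 0) just drops
# the decimals, i.e. it rounds toward zero. They only agree when the
# result is non-negative.
for a, b in [(7, 2), (-7, 2)]:
    q = a / b
    print(f"{a}/{b}: floor={math.floor(q)}, truncate={math.trunc(q)}")
# 7/2: both give 3; -7/2: floor gives -4, truncate gives -3
```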
However, there is a better way.
We can use the integer division operator DIV instead of the ordinary division operator /. Because this is an integer division, there is no need for an extra function to lose the decimals, and the expression is simply:
a DIV b
This approach has a number of advantages:
- It is explicit. By looking at the expression we know immediately that the result will be an integer, and that a and b are meant to be integers too.
- It is easier to read. Because we don't need another function and parentheses, this expression is easier on the eyes, something that you will appreciate if the expression is not simply FLOOR(a/b) but something like FLOOR(SUM(a)/SUM(IFNULL(b,0))).
- It is fast! The DIV operation does not have to deal with complex floating point math, and will be much faster on most microprocessors.
To prove the last point, take a look at the results of a simple benchmark. I simply used the BENCHMARK() function and executed:
mysql> SELECT BENCHMARK(10000000,1234567 DIV 7) ;
+-----------------------------------+
| BENCHMARK(10000000,1234567 DIV 7) |
+-----------------------------------+
| 0 |
+-----------------------------------+
1 row in set (0.83 sec)
mysql> SELECT BENCHMARK(10000000,1234567 / 7) ;
+---------------------------------+
| BENCHMARK(10000000,1234567 / 7) |
+---------------------------------+
| 0 |
+---------------------------------+
1 row in set (7.26 sec)
mysql> SELECT BENCHMARK(10000000,FLOOR(1234567 / 7)) ;
+----------------------------------------+
| BENCHMARK(10000000,FLOOR(1234567 / 7)) |
+----------------------------------------+
| 0 |
+----------------------------------------+
1 row in set (8.80 sec)
I repeated this two more times and averaged the time spent, and then made this little graph of the results:
The results show that DIV is about 9 to 10 times faster than the ordinary division operator, and that adding the FLOOR() function makes the entire expression another 10% slower.
Now, I don't think the performance benefit is of much practical significance. You may see a slight improvement for large datasets using multiple division operations, but in many cases the ordinary query processing will probably have a much larger part in the total time spent. But still, DIV is faster, easier to read and more explicit if you want to solve this type of problem.
Wednesday, July 09, 2008
A fast, single pass method to calculate the median in MySQL
After stepping off of the GROUP_CONCAT() solution for calculating quantiles, I figured it would be nice to find a better way to calculate the median too.
Solution
I previously wrote on how to calculate the median using GROUP_CONCAT(), but I think that this is a better way:
SELECT AVG(length) AS median -- take the average of left and right median
, MIN(length) AS left_median --
, MAX(length) AS right_median --
, @l AS left_median_position --
, @r AS right_median_position --
FROM (
SELECT @n, length -- @n is just here to facilitate debug
FROM film
CROSS JOIN (
SELECT @n:=0
, @r := COUNT(*) DIV 2 + 1 -- right median or true median
, @l := COUNT(*) DIV 2 -- left median, or true median
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)
FROM film
) ``
WHERE (@n:=@n+1) -- row number
BETWEEN @l AND @r -- select two middle ones
ORDER BY length -- need to sort to get middle values
) ``
Apart from not relying on setting the buffer size for group_concat_max_len, this solution is also faster than the GROUP_CONCAT() solution. For example, calculating the median amount from the payment table in the sakila database takes some 0.12 seconds using the GROUP_CONCAT() solution and about 0.03 seconds with this method.
Explanation
At the heart of the solution are the user-defined variables:
- @n: the row number
- @l: the position of the 'left median', that is, the row number with the highest value in the lower half of all rows
- @r: the position of the 'right median', that is, the row number with the lowest value in the higher half of all rows
These are initialized in the inmost subquery:
SELECT @n:=0
, @r := COUNT(*) DIV 2 + 1 -- right median or true median
, @l := COUNT(*) DIV 2 -- left median, or true median
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)
FROM film
Note that this yields one row, initializing the row number @n to zero. We can calculate the position of the right median immediately by doing COUNT(*) DIV 2 + 1.
You see, first we divide the total number of rows by two. If there is an even number of rows, COUNT(*) DIV 2 will give us the left median, and adding 1 then by definition gives the right median. If COUNT(*) is an odd number, it still holds: the integer division rounds down to the nearest integer value, and adding 1 then gives us the position of the true median in that case.
The calculation of the left median is along the same lines:
@l := COUNT(*) DIV 2 -- left median, or true median
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)
We just have to take care that in case we do have an odd number of rows, we need to pick the true median position here too.
To do that, we first find out if there is an odd number of rows using COUNT(*) % 2. This calculates the remainder of dividing the total number of rows by two. In case of an odd number of rows, the remainder is 1, which is considered to be TRUE by the IF function. In this case, 1 is added, effectively making @l equal to @r (that is, both hold the position of the true median). In case of an even number of rows, we add 0, as COUNT(*) DIV 2 is already the desired position of the left median.
Now that we have these values, we can use @n to generate a row number, which we can compare against the calculated @l and @r values:
WHERE (@n:=@n+1) -- row number
BETWEEN @l AND @r -- select two middle ones
ORDER BY length -- need to sort to get middle values
And this will give us at most two rows in case of an even number of rows, and one row for an odd number of rows.
The final step is to average the values from the rows, which is required to calculate the median in case of an even number of rows:
SELECT AVG(length) AS median -- take the average of left and right median
...
FROM (
...
) ``
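To make the position arithmetic concrete, here is the same @l/@r logic sketched in Python. This is just an illustration of mine, with a sorted list standing in for the ORDER BY length result set:

```python
def median(values):
    # sort, as the SQL does with ORDER BY
    values = sorted(values)
    n = len(values)
    r = n // 2 + 1                     # @r: right median (or true median), 1-based
    l = n // 2 + (1 if n % 2 else 0)   # @l: left median, or true median
    middle = values[l - 1:r]           # at most two rows; just one if n is odd
    return sum(middle) / len(middle)   # AVG() over the selected rows

print(median([1, 2, 3, 4]))  # 2.5: average of left (2) and right (3) median
print(median([1, 2, 3]))     # 2.0: l == r, the true median
```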
Tuesday, July 08, 2008
MySQL Percentile aftermath: Calculating all quantiles
Are you getting fed up yet with my posts on calculating percentiles? Well, I'm sorry but I'm not quite finished.
Here's a simple, fast method to calculate the specified number of quantiles:
-- set the number of quantiles, for example:
-- quartiles: 4
-- deciles: 10
-- percentiles: 100
SET @quantiles:=4; -- select quartiles
-- calculate all quantiles
--
--
SELECT amount AS metric
, @n DIV (@c DIV @quantiles) AS quantile
, @n AS N
FROM sakila.payment
CROSS JOIN (
SELECT @n:=0 -- rownumber
, @c:=COUNT(*) -- need this to calculate quantile partitions
FROM sakila.payment
) c
WHERE NOT ( -- modulo zero (=false), we are at the quantile
(@n:=@n+1) % (@c DIV @quantiles) -- rownumber equal to the quantile partition?
)
ORDER BY amount; -- need ordered partitions
You can find this snippet on MySQL Forge.
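For illustration, here is my Python sketch of the same partitioning trick (a sorted list replaces ORDER BY amount; the helper name is mine, not part of the snippet):

```python
def quantiles(values, q):
    # sort, as ORDER BY amount does in the query
    values = sorted(values)
    c = len(values)          # @c: total row count
    step = c // q            # @c DIV @quantiles: rows per quantile partition
    # keep only the rows whose 1-based row number is a multiple of step,
    # i.e. the rows where the modulo in the WHERE clause is zero
    return [values[n - 1] for n in range(step, c + 1, step)]

print(quantiles(range(1, 101), 4))  # quartile boundaries of 1..100: [25, 50, 75, 100]
```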
Monday, July 07, 2008
Calculating Percentiles with MySQL, Round 2
My previous post on calculating percentiles with MySQL generated some comments and good discussion. In particular, I got some very interesting comments from Vladimir.
Basically, Vladimir was doubtful whether the GROUP_CONCAT() solution would be optimal in comparison to a JOIN. His proposal is to solve it like this:
SELECT SUM(g1.r) sr
, g2.length l
, SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON g1.length < g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1
First, this query sets up two identical subqueries in the FROM list, using GROUP BY and COUNT() to calculate the number of occurrences of each distinct value. Then, these are joined, and GROUP BY is again applied to calculate the total number of rows having a lower value. Finally, HAVING is used to find the groups in the upper percentiles, and LIMIT and ORDER BY are used to single out the one desired percentile value.
As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values on varying numbers of rows. Here are his slightly rounded numbers:
#rows: group_concat: groupby-join:
4M 1 min 6 sec 5.3 sec
1M 5 sec 2.5 sec
100K 0.5 sec 1.6 sec
Although GROUP_CONCAT() seems to break down pretty soon, he also writes:
I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.
He concluded by saying:
But what I think is the right solution is having something like this on the server side:
SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length
FROM film
GROUP BY length
HAVING p >= 0.9 ORDER BY p LIMIT 1
Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN, MAX, AVG, COUNT and maybe some other aggregate functions.
So, COUNT(INC length) would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:
-- allow statement batching
DELIMITER go
-- initialize
SELECT 0, COUNT(*)
INTO @cum_cnt, @cnt
FROM sakila.payment;
-- calculate percentiles
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount
amount
FROM sakila.payment
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;
go
and this gets us the result:
Query OK, 1 row affected (0.01 sec)
+-------------+--------+
| p | amount |
+-------------+--------+
| 0.904542334 | 6.99 |
+-------------+--------+
1 row in set (0.03 sec)
Here is the equivalent GROUP_CONCAT solution:
SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(
p.amount
ORDER BY p.amount
SEPARATOR ','
)
, ','
, 90/100 * COUNT(*) + 1
)
, ','
, -1
) AS `90th Percentile`
FROM sakila.payment AS p;
...and it is considerably slower:
+-----------------+
| 90th Percentile |
+-----------------+
| 6.99 |
+-----------------+
1 row in set (0.08 sec)
(sakila.payment has 16049 rows and 19 distinct values for amount)
So, the sane thing to do would be to forget about that GROUP_CONCAT idea, and use this method. It does not have the nasty drawback of having to mess with group_concat_max_len, and I am pretty sure Vladimir's method will be faster across the board anyway.
and I am pretty sure Vladimir's method will be faster across the board anyway. The one thing you could object about is the extra query to initialize the user-defined variables. You can get around that by initializing those in a single row subquery in the
FROM
clause, a technique described by Baron Schwartz (see for example this post)
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
amount
FROM sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
, @cnt:=COUNT(*)
FROM sakila.payment) p
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;
This has the advantage that the variables are initialized in one go with the entire query, ensuring you are not accidentally working with uninitialized or garbage variables.
(BTW: If you want to do more with these user-defined variables, I can highly recommend more from Baron's site, see for example his article on advanced user defined variable techniques.)
Writing to the MySQL error log
In almost all application development situations, one needs to log debug information now and then. In almost all production systems, one needs to log serious error events somewhere too.
So, what can you do? Create a log table? Sprinkle your code with SELECT 'Debug: ...' statements?
At the UDF Repository for MySQL, we now have a solution to log messages to the MySQL error log: a user-defined function called log_error().
Currently it is all very crude: the log_error function takes one argument and writes it to the MySQL error log, appending a line terminator.
Please try it out, and let us know if you have comments or suggestions to improve. Thanks in advance,
Roland
Friday, July 04, 2008
Calculating the Nth percentile in MySQL
Yesterday, I was on the freenode ##pentaho irc channel when Andres Chaves asked me how to calculate the Nth percentile in MySQL. He saw a solution somewhere using subqueries, but wasn't too happy about it.
A while ago I wrote about calculating the median in MySQL, and it turns out the Nth percentile can be calculated using a similar, single-pass approach, not relying on subqueries, UDFs, or user-defined variables.
The percentile....
So, what is a percentile exactly? Here's what the wikipedia says:
A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
....and the median
The wikipedia continues and hints at the relationship between the Nth percentile and the median:
The 25th percentile is also known as the first quartile; the 50th percentile as the median.
Sidenote: as I understand it, this latter remark concerning the median is not entirely correct. The median is "the middle value": the number of observations with a higher value is equal to the number of observations that have a lower value. For a series with an even number of observations, say {1,2,3,4}, there isn't one middle value, there are two, in this case: 2 and 3. Typically, the median is computed by taking the arithmetic mean of the two middle values, which would be (2+3) / 2 = 2.5 in this particular case. But for this example, the 50th percentile would be 3 and not 2.5. However, in most practical cases the values are fairly evenly distributed and sufficiently large, which means the difference between the median and 50th percentile will be small or absent.
However, the median and percentile problems are quite similar: in both cases, a value is picked or computed that is higher than the value found in a particular portion of the total number of observations. For the median, the requirement is that that particular portion is equal to the number of observations that exceeds it; for the Nth percentile the requirement is that that portion constitutes N percent of the total number of observations.
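To make the sidenote concrete, here is a quick sketch (plain Python) that computes both numbers for the {1,2,3,4} example:

```python
def median(values):
    """The middle value; for an even number of observations, the
    arithmetic mean of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def percentile_by_definition(values, pct):
    """Smallest value below which at least pct percent of the
    observations fall (the Wikipedia wording quoted above)."""
    s = sorted(values)
    for v in s:
        below = sum(1 for x in s if x < v)
        if 100 * below / len(s) >= pct:
            return v

print(median([1, 2, 3, 4]))                        # -> 2.5
print(percentile_by_definition([1, 2, 3, 4], 50))  # -> 3
```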
The Solution
In the following example, we calculate the 90th percentile of film lengths:
SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT( -- 1) make a sorted list of values
f.length
ORDER BY f.length
SEPARATOR ','
)
, ',' -- 2) cut at the comma
, 90/100 * COUNT(*) + 1 -- at the position beyond the 90% portion
)
, ',' -- 3) cut at the comma
, -1 -- right after the desired list entry
) AS `90th Percentile`
FROM sakila.film AS f
Here, the literal 90 represents which percentile we want, that is, "the 90th percentile".
(If you like, you can leave out the SEPARATOR ',' bit, as the default separator is a comma anyway. I just wanted to have a clear indication of the source of the ',' arguments in the SUBSTRING_INDEX() calls.)
Explanation
The median and Nth percentile problem can be solved using a similar approach: first, we use the string aggregate function GROUP_CONCAT() to create an ordered list of values. Then we use the substring variation SUBSTRING_INDEX() to find and excise a list entry at a particular desired position.
The difference between the solutions for the median and the Nth percentile lies mainly in which entry we have to pick from the list. (See my prior article for the full story on the median.) For the Nth percentile, we first calculate the desired portion of the total number of observations. Because N is defined as a percentage, we divide by 100 to get the actual fraction of the total number of observations:
N / 100
Then, we multiply by the total number of observations to find the number of observations that make up the actual portion of observations within the specified percentile:
N / 100 * COUNT(*)
Because we want to find the observation for which the specified portion has a lower value, we need to look at the next entry instead of the last entry within the portion, so we add 1:
N / 100 * COUNT(*) + 1
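The three steps can be sketched in Python (toy data; the SQL performs the same cut with the two nested SUBSTRING_INDEX() calls):

```python
def nth_percentile(values, n):
    """Make the sorted comma-separated list, keep everything up to the
    position just past the n% portion, and take the last entry kept."""
    values = sorted(values)
    ordered = ",".join(str(v) for v in values)     # GROUP_CONCAT(... ORDER BY ...)
    position = int(n / 100 * len(values) + 1)      # N/100 * COUNT(*) + 1
    head = ordered.split(",")[:position]           # SUBSTRING_INDEX(list, ',', position)
    return int(head[-1])                           # SUBSTRING_INDEX(head, ',', -1)

print(nth_percentile(range(1, 101), 90))  # -> 91
```

With 100 observations, position 91 is returned; indeed 90 of the 100 values lie below 91.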
Caveats
There are a number of things to look out for:
Percentile value not unique
When we calculate the 90th percentile of the film lengths, we get 173:
+-----------------+
| 90th Percentile |
+-----------------+
| 173 |
+-----------------+
If we check the result by counting the portion of films with a length lower than
173
we see:
mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
-> FROM film AS f;
+------------+
| Percentage |
+------------+
| 89.4212 |
+------------+
The reason that we do not get 90% is that there are multiple occurrences of films with length equal to 173:
mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS |
| FIRE WOLVES |
| GLADIATOR WESTWARD |
| PIZZA JUMANJI |
| TALENTED HOMICIDE |
| VELVET TERMINATOR |
+----------------------+
7 rows in set (0.01 sec)
So, even though we may have picked the entry at the right position, this may still be a value within the specified portion instead of beyond. The definitions I found for the Nth percentile do not stipulate any mechanism to deal with this kind of ambiguity, whereas for the median, the correct value is always found by averaging the left and right median if necessary.
What about NULL
The second caveat is NULL values. Currently this method does not work when the column for which you want to calculate percentile values is nullable. It is possible to work around this though. If you can exclude the rows with the NULL value entirely, you can simply add a WHERE clause. This is a good idea also because it will cull the number of rows to process.
It may not be acceptable to throw away the rows with NULL values, for example if there is another expression in your SELECT list that needs to do something with all rows. You can then still work around it, with some extra hassle. It would involve tweaking the GROUP_CONCAT to ignore NULL values. This could be done like this:
GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)
This will ensure that GROUP_CONCAT() does not return NULL when a NULL value is present in the specified column. If there are NULL values, these will end up as a list of commas in the head of the result:
,,,,<non-null-value1>,...,<non-null-valueN> -- 4 NULL values
Assuming we know the number of NULLs (and we do) we can clean up our list easily with just SUBSTRING():
SUBSTRING(
GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)
, SUM(IF(<column> IS NULL, 1, 0)) + 1
)
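A sketch of this NULL cleanup in Python (hypothetical values):

```python
def percentile_ignoring_nulls(values, n):
    """NULLs (None) sort first as empty entries; stripping the leading
    commas mimics SUBSTRING(..., nulls + 1), and the position is based
    on the non-NULL count (COUNT(column), not COUNT(*))."""
    non_null = sorted(v for v in values if v is not None)
    nulls = len(values) - len(non_null)
    # GROUP_CONCAT(IF(col IS NULL, '', col) ORDER BY col) -> ',,...,v1,...,vN'
    concatenated = "," * nulls + ",".join(str(v) for v in non_null)
    cleaned = concatenated[nulls:]                 # SUBSTRING(..., nulls + 1)
    position = int(n / 100 * len(non_null) + 1)    # COUNT(column), not COUNT(*)
    return int(cleaned.split(",")[:position][-1])

print(percentile_ignoring_nulls([None, None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 90))  # -> 10
```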
Because we are ignoring the NULL values in our list, we must likewise ignore them in our calculation of the portion of rows. So, instead of COUNT(*) we should use COUNT(<column>) in order to not count the NULL values.
group_concat_max_len
When using GROUP_CONCAT(), an issue that should always be on your radar is the maximum length of the GROUP_CONCAT() result. If the result value exceeds the maximum length, the GROUP_CONCAT() result will be truncated, and a warning will be issued:
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+--------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+
It is really important to be aware of any warnings, as a truncation of the GROUP_CONCAT() result messes up the entire calculation.
The maximum length of the GROUP_CONCAT() result is controlled through the group_concat_max_len system variable. It can be set thus:
SET @@group_concat_max_len := <num-bytes>
The maximum practical value is the maximum packet size, which is available as the max_allowed_packet system variable. This means you can write:
SET @@group_concat_max_len := @@max_allowed_packet;
and you will never be bothered by this problem again. The GROUP_CONCAT() result can still be too large though (namely, larger than the maximum packet size), but in that case you will get a proper error instead of a truncated result.
You should realize that setting group_concat_max_len to a high value may lead to memory problems, as each GROUP_CONCAT() invocation may individually reserve the specified amount of memory to deal with its result.
Finally...
I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the latest version.
Now, finally, I have a question for you. When I wrote about the median calculation, I mentioned that I thought it was an original method, and I asked whether someone could deny that claim. I did not get any reaction, so I'd like to repeat my question: Do you know of any book or text or blog that describes this technique? If so, let me know so I can provide proper accreditation.
TIA, Roland.
Tuesday, June 24, 2008
Why Schoenberg Rocks...
Personally, I have had a taste for the music of Arnold Schoenberg for a long time. But now, the Wikipedia explains why:
Ten features of Schoenberg's mature twelve-tone practice are characteristic, interdependent, and interactive[1]:
1. Hexachordal inversional combinatoriality
2. Aggregates
3. Linear set presentation
4. Partitioning
5. Isomorphic partitioning
6. Invariants
7. Hexachordal levels
8. Harmony, "consistent with and derived from the properties of the referential set"
9. Metre, established through "pitch-relational characteristics"
10. Multidimensional set presentations
So, aggregates, partitioning, metrics and referential sets, play us another tune...
Monday, June 02, 2008
MySQL 5.1: Measuring #queries/sec. using information_schema.GLOBAL_STATUS
MySQL 5.1 offers new
information_schema tables, such as GLOBAL_STATUS. This can be used to report certain performance metrics, such as the number of queries processed per second:
SELECT MAX( -- use MAX to force aggregation
IF(variable_name='Questions' -- no. of queries sent to server
, CAST(variable_value AS unsigned) -- make integer value
, 0 -- ignore if not 'Questions'
)
)
/ -- divide by
MAX( -- use MAX to force aggregation
IF(variable_name='Uptime' -- no. of seconds the server is up
, CAST(variable_value AS unsigned) -- make integer value
, 0 -- ignore if not 'Uptime'
)
) AS queries_per_second
FROM information_schema.GLOBAL_STATUS
WHERE variable_name in ('Questions', 'Uptime');
(This snippet is maintained on MySQL Forge)
This query could have been solved in a number of other ways involving subqueries, for example like this:
SELECT (SELECT CAST(variable_value AS unsigned)
FROM information_schema.GLOBAL_STATUS
WHERE variable_name = 'Questions')
/ (SELECT CAST(variable_value AS unsigned)
FROM information_schema.GLOBAL_STATUS
WHERE variable_name = 'Uptime') as queries_per_second
For now I'm sticking to my original snippet. The disadvantage of this approach is that it requires a hack to force aggregation of all rows from GLOBAL_STATUS to one single row (using MAX), as well as some conditional expressions.
The disadvantage of using subqueries is that it would be slightly more verbose, and it would also require more scanning of information_schema.GLOBAL_STATUS, although I am unsure if that would really matter that much.
What do you think?
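The conditional-aggregation pivot that both variants revolve around can be sketched in Python (made-up name/value rows, not live server output):

```python
# name/value rows as they come back from information_schema.GLOBAL_STATUS
rows = [("Questions", "8640000"), ("Uptime", "86400")]

def pick(rows, name):
    """MAX(IF(variable_name = name, CAST(value AS UNSIGNED), 0)):
    every row contributes 0 except the one we want, so MAX() collapses
    the whole set to that row's value."""
    return max(int(value) if var == name else 0 for var, value in rows)

queries_per_second = pick(rows, "Questions") / pick(rows, "Uptime")
print(queries_per_second)  # -> 100.0
```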
Sunday, May 11, 2008
Getting terrorists to talk...
Another quote from slashdot that cracked me up... This time it's gbjbaanb, who installed MS Silverlight to keep his machine from whining and moaning, as found in the recent Silverlight thread:
I installed it in the end just to shut the damn thing up, and even then it refused to install. I almost cried with the frustration! The CIA could use this technique to get their terrorist suspects to talk.
I must admit that I have not installed Silverlight, but out of curiosity, I just might...
Saturday, May 03, 2008
Random slashdot comment gem
By vux984 in a thread about the article Twitter Said To Be Abandoning Ruby on Rails:
The rails framework is aptly named. Its like driving a train. You follow the rails. Its easy, simple, and those are its strengths. But if one day, you decide you want to cut across a field save a few hours of travel, well, you probably shouldn't have chosen 'train' as your mode of transportation.
Not trying to bash anyone here. This is just a well formulated comment. Maybe the guy took it from somewhere, I dunno. It just stuck on me.
Feel free to post your flames below, maybe I'll pass em on to vux984....
Monday, April 21, 2008
MySQL UC2008 presentation "Grand Tour of the information schema" now online
Yup, the presentation slides as well as the scripts are now available online on the conference website.
The stuff you will find in there:
- Information_schema.pdf
- Diagram of the information schema
- I_S_VC_Slides.pdf
- Slides for the UC presentation, "Grand Tour of the Information Schema"
- I_S_INDEXES2.sql
- script, returns one row for each index (rollup of information_schema.STATISTICS)
- I_S_REDUNDANT_INDEXES2.sql
- script, lists all redundant indexes. Redundancy rules:
- two indexes with the same columns, type and uniqueness are interchangeable. The one with the largest index name is listed as redundant
- if there is a unique index and a non unique index with the same columns and type exists, the non-unique one is considered redundant
- A non-unique btree index is the leftmost prefix of another btree index (that may or may not be unique)
- p_check_fk_constraint_violations.sql
- stored procedure, detects problems after using @@foreign_key_checks=1. Takes a schema name LIKE pattern and a table name LIKE pattern and reports any foreign key constraints for which violations exist. TODO: if the referenced table in the foreign key was dropped, the script will encounter a runtime error. It would be nice to report this properly instead.
- qwz.sql
- Stored procedure, the Query Wizard for the mysql command line, pronounced "cues". To use it, run the script, and it starts a new session with the query wizard. Alternatively, do:
USE qwz; CALL qwz(NULL);
and follow the on-screen instructions.
- schema_doc.sql
- script, Generates HTML documentation for the current schema. Run using a command like:
mysql -uroot -Dsakila -Nrs -e"source schema_doc.sql" > schema_doc.html
- schema_doc.sh
- shell, Linux example for running schema_doc.sql
- gen_fks.sql
- script, Generates trigger code to emulate foreign keys. Set a schema containing foreign key constraints as the default database, then run the script. The output forms trigger code that may be used to recreate the foreign key functionality in a database that has the same tables but no declarative foreign keys. For foreign keys with the RESTRICT or NO ACTION rules, a non-existent stored procedure will be called: p_child_row_exists(). For orphan INSERTs and UPDATEs, a call is made to p_no_parent_row(). Currently this is the best we can do - it will result in a runtime error and the action causing it will not complete, but it is not ideal of course. Corrective rules (CASCADE etc.) should just work.
- p_create_federated_table.sql
- stored procedure, Creates a federated table (residing on a 5.0 or better remote host)
Saturday, March 29, 2008
MySQL information_schema: Identifying rows from TABLE_CONSTRAINTS
Yesterday, I set out a little quiz about the
TABLE_CONSTRAINTS table in the MySQL information_schema. The task was:
- Specify a minimal set of columns of the information_schema.TABLE_CONSTRAINTS table that is sufficient to reliably identify a single row in the information_schema.TABLE_CONSTRAINTS table.
- Argue why these columns are necessary and sufficient to identify a row, and why a smaller set of columns does not exist
Short Answer
For MySQL there are two such column sets:
- CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, and CONSTRAINT_TYPE
- TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, and CONSTRAINT_TYPE
Explanation
According to the ISO SQL specification (ISO/IEC 9075-11:2003 (E), 6.48, the TABLE_CONSTRAINTS base table), we can derive that the "conceptual primary key" of a standard implementation of the TABLE_CONSTRAINTS system view should be (CONSTRAINT_CATALOG, CONSTRAINT_SCHEMA, CONSTRAINT_NAME). In the absence of support for catalogs, we can ignore the CONSTRAINT_CATALOG column, and this also applies to MySQL (all %_CATALOG columns in the MySQL information_schema are always NULL). That would leave us with (CONSTRAINT_SCHEMA, CONSTRAINT_NAME) - that is, table constraint names would be unique within a schema. However, in MySQL, constraint names need not be unique within a schema.
MySQL supports three types of table constraints:
- PRIMARY KEY
- UNIQUE
- FOREIGN KEY (supported dependent upon the storage engine)
Identifying FOREIGN KEY constraints
FOREIGN KEYs are in their own per-schema namespace, so at a glance it may seem that the combination of (CONSTRAINT_SCHEMA, CONSTRAINT_NAME) is sufficient to identify a FOREIGN KEY constraint. However, because PRIMARY KEY and UNIQUE constraints are not within the same namespace as FOREIGN KEY constraints, a single schema may contain a FOREIGN KEY constraint and a non-FOREIGN KEY constraint that have the same CONSTRAINT_NAME.
From this it follows that we need to take the CONSTRAINT_TYPE into account too. So in order to reliably identify a FOREIGN KEY constraint, we need to have at least CONSTRAINT_SCHEMA and CONSTRAINT_NAME, *and* require that CONSTRAINT_TYPE equals 'FOREIGN KEY'.
(Among the answers was a suggestion that the CONSTRAINT_TYPE is not necessary, because the CONSTRAINT_NAME would provide the necessary differentiation. Although this is true for the identifiers generated by MySQL in the absence of explicit identifiers, it is not true in the general case.)
Identifying PRIMARY KEY constraints
So, would (CONSTRAINT_SCHEMA, CONSTRAINT_NAME, CONSTRAINT_TYPE) be sufficient to identify any of the other types of table constraints? Unfortunately not.
Almost everybody that is somewhat familiar with MySQL knows that in MySQL a PRIMARY KEY constraint doesn't really have its own identifier. Rather, the CONSTRAINT_NAME for a PRIMARY KEY constraint is always 'PRIMARY'. In fact, the identifier 'PRIMARY' is reserved exclusively for PRIMARY KEY constraints, and may not be used for any other type of constraint.
From this, it follows that there can be multiple rows in information_schema.TABLE_CONSTRAINTS that have the same combination of values in the columns (CONSTRAINT_SCHEMA, CONSTRAINT_NAME, CONSTRAINT_TYPE). In fact, for one particular value of CONSTRAINT_SCHEMA there will be just as many occurrences of the combination ('PRIMARY', 'PRIMARY KEY') in the columns (CONSTRAINT_NAME, CONSTRAINT_TYPE) as there are PRIMARY KEY constraints in that schema.
Of course, we know that there can be only one PRIMARY KEY per table. From that, it follows that we can identify a PRIMARY KEY constraint if we know to which table it belongs. Tables (and views) reside in their own per-schema namespace, so the combination of (TABLE_SCHEMA, TABLE_NAME) is sufficient to identify a table. That means that the combination of (TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_TYPE) is sufficient to identify a PRIMARY KEY constraint: once we have identified a particular table using (TABLE_SCHEMA, TABLE_NAME), we know that the table constraint with CONSTRAINT_TYPE equal to 'PRIMARY KEY' is the PRIMARY KEY constraint.
Of course, because the name 'PRIMARY' is reserved exclusively for PRIMARY KEY constraints, we could've equally well settled for (TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME), although personally I prefer to use CONSTRAINT_TYPE.
Identifying UNIQUE constraints
This leaves us with the UNIQUE constraints. UNIQUE constraints must have a unique name per table, and the name must not be equal to 'PRIMARY' (which is reserved for PRIMARY KEY constraints). From this we must conclude that we need no less than the combination of (TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_TYPE, CONSTRAINT_NAME) to identify a UNIQUE constraint.
We must have (TABLE_SCHEMA, TABLE_NAME) to identify the table, because the UNIQUE constraints have a unique name per table. Now we can take a shortcut: within one table we can distinguish between PRIMARY KEY and UNIQUE constraints by virtue of the fact that the CONSTRAINT_NAME for PRIMARY KEYs will always be 'PRIMARY'. However, if we need to distinguish between FOREIGN KEY and UNIQUE constraints, we still need to look at the CONSTRAINT_TYPE too.
Summary
To recapitulate:
- we need CONSTRAINT_SCHEMA, CONSTRAINT_NAME, and CONSTRAINT_TYPE to identify a FOREIGN KEY constraint
- we need TABLE_SCHEMA, TABLE_NAME, and CONSTRAINT_TYPE to identify a PRIMARY KEY constraint. Alternatively, we could identify PRIMARY KEY constraints by looking at TABLE_SCHEMA, TABLE_NAME, and CONSTRAINT_NAME, by virtue of the fact that all PRIMARY KEYs have a CONSTRAINT_NAME equal to 'PRIMARY'
- we need TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_TYPE, and CONSTRAINT_NAME to identify a UNIQUE constraint
From this we could conclude that the minimal set of columns required to identify an arbitrary table constraint consists of:
- CONSTRAINT_SCHEMA
- CONSTRAINT_NAME
- CONSTRAINT_TYPE
- TABLE_SCHEMA
- TABLE_NAME
However, in MySQL, a table constraint is 'owned' by the table on which it is defined, and a table constraint defined on a particular table cannot reside in a schema different from the schema of its table. This means that for any row in information_schema.TABLE_CONSTRAINTS, the columns TABLE_SCHEMA and CONSTRAINT_SCHEMA will always have the same value.
This brings us to the final conclusion that there are two minimal sets of columns that may be used to identify an arbitrary constraint in the information_schema.TABLE_CONSTRAINTS table:
- CONSTRAINT_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE
- TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE
As CONSTRAINT_SCHEMA and TABLE_SCHEMA are completely interchangeable as far as the result is concerned, there is no specific reason to generically prefer either combination.
To learn more about this and other traps in the MySQL information schema, come meet me at the MySQL User's Conference. I will be your guide on the Grand Tour of the Information Schema and its Applications, and explain more of these gotchas.
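The argument can be checked mechanically with a small sketch (made-up constraint rows following the namespace rules described above; the names are illustrative, not actual catalog contents):

```python
# (CONSTRAINT_SCHEMA, TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE)
rows = [
    ("sakila", "sakila", "film",  "PRIMARY",  "PRIMARY KEY"),
    ("sakila", "sakila", "actor", "PRIMARY",  "PRIMARY KEY"),
    ("sakila", "sakila", "film",  "uk_title", "UNIQUE"),
    ("sakila", "sakila", "film",  "uk_title", "FOREIGN KEY"),  # same name, other namespace
]

# (CONSTRAINT_SCHEMA, CONSTRAINT_NAME) alone is NOT unique: the 'PRIMARY's collide
assert len({(r[0], r[3]) for r in rows}) < len(rows)

# even (schema, table, name) is not enough: the FOREIGN KEY and UNIQUE share a name
assert len({(r[0], r[2], r[3]) for r in rows}) < len(rows)

# but either minimal column set from the conclusion identifies every row
for schema_col in (0, 1):  # CONSTRAINT_SCHEMA or TABLE_SCHEMA
    keys = {(r[schema_col], r[2], r[3], r[4]) for r in rows}
    assert len(keys) == len(rows)
print("both column sets identify every row")
```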
Thursday, March 27, 2008
"Me Too" MySQL Information Schema popquiz
A few days ago, I wrote how I will be your guide to the Grand Tour of the Information Schema and its Applications which is one of the two talks I will be doing at the upcoming MySQL User's Conference.
In view of the popularity of "Pop Quiz" format so successfully used by Carsten, I feel compelled to imitation, and as a primer to my talk, I'd like to offer you my "Me Too" MySQL Information Schema popquiz. So, here goes...
The MySQL information_schema contains a number of tables. Among them, one is called TABLE_CONSTRAINTS. Unsurprisingly, each row in the TABLE_CONSTRAINTS table represents a single table constraint. The TABLE_CONSTRAINTS table has the following columns:
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| CONSTRAINT_CATALOG | varchar(512) | YES | | NULL | |
| CONSTRAINT_SCHEMA | varchar(64) | NO | | | |
| CONSTRAINT_NAME | varchar(64) | NO | | | |
| TABLE_SCHEMA | varchar(64) | NO | | | |
| TABLE_NAME | varchar(64) | NO | | | |
| CONSTRAINT_TYPE | varchar(64) | NO | | | |
+--------------------+--------------+------+-----+---------+-------+
- Specify a minimal set of columns of the information_schema.TABLE_CONSTRAINTS table that is sufficient to reliably identify a single row in the information_schema.TABLE_CONSTRAINTS table.
- Argue why these columns are necessary and sufficient to identify a row, and why a smaller set of columns does not exist.
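To get a feel for the data before answering, you could run a query like the following against any server that has a few schemas loaded (the 'sakila' filter is just an example; substitute a schema you actually have):

```sql
-- List all constraints defined in the sakila schema.
-- CONSTRAINT_TYPE is one of 'PRIMARY KEY', 'UNIQUE', or 'FOREIGN KEY'.
SELECT constraint_name
,      table_name
,      constraint_type
FROM   information_schema.TABLE_CONSTRAINTS
WHERE  table_schema = 'sakila'
ORDER BY table_name
,        constraint_type;
```

Pay close attention to which column combinations repeat across rows - that is exactly what the quiz question hinges on.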
Feel free to post your answer as a comment. Every correct answer will earn a 20% discount on the admission fee. You can earn bonus points for a nice, elegant argumentation, and also for specifying more than one minimal set of columns.
I will elect three to five of the best answers and make sure that whoever posts one gets a laminated printout of the diagram of the MySQL information schema.
(Please make sure you post me your email address if you think you are eligible for the discount code, the laminated diagram, or both.)
Tuesday, March 25, 2008
MySQL Information Schema applications at the UC2008
Last week I blogged about the upcoming MySQL Users conference, in particular about the Writing MySQL UDFs tutorial that I will be delivering.
I will also be doing the Grand Tour of the Information Schema and its Applications.
I will discuss the elements in the MySQL information schema, and provide tips to write queries against it. Most of the talk will revolve around a number of scripts I have developed over the past few years:
- Generating a history/audit database - generate all the code you need to keep track of all changes occurring in your on-line database. Use it to audit or implement 'flashback'
- Checking foreign key violations - disabling foreign key checks may be useful, but is dangerous. This script helps you find problems with foreign key constraints
- Creating federated tables - FEDERATED tables are useful, but tedious and error-prone to create. Let this script do the work instead
- Checking for duplicated and redundant indexes - redundant or duplicate indexes can slow down your database performance. Find them with this script (Note: the original script I blogged about earlier contains a critical flaw - this is a completely new version)
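The basic idea behind the duplicate-index check can be sketched with a query like the one below. This is a simplified illustration, not the actual conference script: it concatenates the columns of each index (from information_schema.STATISTICS) into an ordered list, and flags any index whose column list is a strict leading prefix of another index on the same table:

```sql
-- Simplified sketch: an index whose columns form a leading prefix
-- of another index on the same table is redundant.
SELECT s1.table_schema
,      s1.table_name
,      s1.index_name AS redundant_index
,      s2.index_name AS covering_index
FROM   (SELECT table_schema, table_name, index_name
        ,      GROUP_CONCAT(column_name ORDER BY seq_in_index) AS cols
        FROM   information_schema.STATISTICS
        GROUP BY table_schema, table_name, index_name) s1
INNER JOIN
       (SELECT table_schema, table_name, index_name
        ,      GROUP_CONCAT(column_name ORDER BY seq_in_index) AS cols
        FROM   information_schema.STATISTICS
        GROUP BY table_schema, table_name, index_name) s2
ON     s1.table_schema = s2.table_schema
AND    s1.table_name   = s2.table_name
AND    s1.index_name  <> s2.index_name
AND    s2.cols LIKE CONCAT(s1.cols, ',%');
```

Note that the real script also has to deal with exact duplicates and with unique versus non-unique indexes; this sketch only catches strict prefixes.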
mysql> call qwz.qwz(null);
+-----+------------------------------------------------------------------+
| | Welcome to the command line Query Wizard for MySQL!!! |
+-----+------------------------------------------------------------------+
| 1 | Set the schema (current: world) |
| --- | ---------------------------------------------------------------- |
| 2 | Choose a table |
+-----+------------------------------------------------------------------+
3 rows in set (0.00 sec)
Yes! It's an interactive query wizard for the command line. I don't want to spoil too much, but I can unveil that I will be presenting a full-fledged interactive query wizard that operates completely from the MySQL command line. It's just a stored procedure - no proxy, no UDFs, no plugins or whatever.
To give you a taste of the possibilities, take a look at the continuation of my session with the query wizard. First let's choose the schema:
mysql> call qwz.qwz(1);
+---+------------------------+
|   | Set the default schema |
+---+------------------------+
| 1 | information_schema |
| 2 | mysql |
| 3 | qwz |
| 4 | sakila |
| 5 | world |
+---+------------------------+
Let's settle for the sakila schema:
mysql> call qwz.qwz(4);
This will prompt us to choose a table from the sakila schema:
+-----+------------------------------------------------------------------+
| | The command line Query Wizard for MySQL |
+-----+------------------------------------------------------------------+
| 1 | Set the schema (current: sakila) |
| --- | ---------------------------------------------------------------- |
| 2 | actor |
| 3 | actor_info |
| 4 | address |
| 5 | category |
| 6 | city |
| 7 | country |
| 8 | customer |
| 9 | customer_list |
| 10 | film |
| 11 | film_actor |
| 12 | film_category |
| 13 | film_list |
| 14 | film_text |
| 15 | inventory |
| 16 | language |
| 17 | nicer_but_slower_film_list |
| 18 | payment |
| 19 | rental |
| 20 | sales_by_film_category |
| 21 | sales_by_store |
| 22 | staff |
| 23 | staff_list |
| 24 | store |
+-----+------------------------------------------------------------------+
26 rows in set (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
Let's pick the rental table:
mysql> call qwz.qwz(19);
Once we have selected the initial table, we can build a join path: we are prompted to pick any of the tables related to rental:
+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | + sakila.customer (fk_rental_customer) |
| 2 | + sakila.inventory (fk_rental_inventory) |
| 3 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 4 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 5 | New Query |
+-----+------------------------------------------------------------------+
8 rows in set (0.20 sec)
Query OK, 0 rows affected (0.20 sec)
Let's extend the join path with the customer table:
mysql> call qwz.qwz(1);
+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | - sakila.customer (fk_rental_customer) |
| 2 | + sakila.address (fk_customer_address) |
| 3 | + sakila.store (fk_customer_store) |
| 4 | + sakila.inventory (fk_rental_inventory) |
| 5 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 6 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 7 | New Query |
+-----+------------------------------------------------------------------+
10 rows in set (0.09 sec)
Query OK, 0 rows affected (0.09 sec)
Let's throw the inventory table into the mix too:
mysql> call qwz.qwz(4);
+-----+------------------------------------------------------------------+
| | Join path: + tables are added / - tables are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental |
| 1 | - sakila.customer (fk_rental_customer) |
| 2 | + sakila.address (fk_customer_address) |
| 3 | + sakila.store (fk_customer_store) |
| 4 | - sakila.inventory (fk_rental_inventory) |
| 5 | + sakila.film (fk_inventory_film) |
| 6 | + sakila.store (fk_inventory_store) |
| 7 | + sakila.staff (fk_rental_staff) |
| --- | ---------------------------------------------------------------- |
| 8 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 9 | New Query |
+-----+------------------------------------------------------------------+
12 rows in set (0.10 sec)
Query OK, 0 rows affected (0.10 sec)
We can keep this up for quite a long time, and we can remove tables from our join path in a similar way. We can then start specifying some columns:
mysql> call qwz.qwz(8);
+-----+------------------------------------------------------------------+
| | Choose columns for the SELECT list |
+-----+------------------------------------------------------------------+
| | sakila.rental.* |
| | --------------- |
| 1 | + rental_id |
| 2 | + rental_date |
| 3 | + inventory_id |
| 4 | + customer_id |
| 5 | + return_date |
| 6 | + staff_id |
| 7 | + last_update |
| | --------------- |
| | sakila.customer.* (fk_rental_customer) |
| | -------------------------------------- |
| 8 | + customer_id |
| 9 | + store_id |
| 10 | + first_name |
| 11 | + last_name |
| 12 | + email |
| 13 | + address_id |
| 14 | + active |
| 15 | + create_date |
| 16 | + last_update |
| | -------------------------------------- |
| | sakila.inventory.* (fk_rental_inventory) |
| | ---------------------------------------- |
| 17 | + inventory_id |
| 18 | + film_id |
| 19 | + store_id |
| 20 | + last_update |
| --- | ---------------------------------------------------------------- |
| 21 | New Query |
+-----+------------------------------------------------------------------+
30 rows in set (0.02 sec)
Query OK, 0 rows affected (0.02 sec)
Currently I'm still working on a multiple-select interface, but for now I'll snip out all but the last individual column selection:
mysql> call qwz.qwz(18);
+-----+------------------------------------------------------------------+
| | SELECT list: + columns are added / - columns are removed |
+-----+------------------------------------------------------------------+
| | sakila.rental.* |
| | --------------- |
| 1 | + rental_id |
| 2 | - rental_date |
| 3 | + inventory_id |
| 4 | + customer_id |
| 5 | + return_date |
| 6 | + staff_id |
| 7 | + last_update |
| | --------------- |
| | sakila.customer.* (fk_rental_customer) |
| | -------------------------------------- |
| 8 | + customer_id |
| 9 | + store_id |
| 10 | - first_name |
| 11 | - last_name |
| 12 | + email |
| 13 | + address_id |
| 14 | + active |
| 15 | + create_date |
| 16 | + last_update |
| | -------------------------------------- |
| | sakila.inventory.* (fk_rental_inventory) |
| | ---------------------------------------- |
| 17 | + inventory_id |
| 18 | - film_id |
| 19 | + store_id |
| 20 | + last_update |
| --- | ---------------------------------------------------------------- |
| 21 | Show SQL |
| --- | ---------------------------------------------------------------- |
| 22 | Explain SQL |
| --- | ---------------------------------------------------------------- |
| 23 | Execute SQL |
| --- | ---------------------------------------------------------------- |
| 24 | New Query |
+-----+------------------------------------------------------------------+
36 rows in set (0.03 sec)
Query OK, 0 rows affected (0.03 sec)
Now is a good moment to inspect the SQL underlying our query:
mysql> call qwz.qwz(21);
+-----+----------------------------------------------------------------------------+
| | The command line Query Wizard for MySQL |
+-----+----------------------------------------------------------------------------+
| | SELECT t1.rental_date |
| | ,t2.first_name |
| | ,t2.last_name |
| | ,t8.film_id |
| | FROM sakila.rental AS t1 |
| | INNER JOIN sakila.customer AS t2 ON (t1.customer_id) = (t2.customer_id) |
| | INNER JOIN sakila.inventory AS t8 ON (t1.inventory_id) = (t8.inventory_id) |
| --- | ---------------------------------------------------------------- |
| 1 | Execute SQL |
| --- | ---------------------------------------------------------------- |
| 2 | Explain SQL |
| --- | ---------------------------------------------------------------- |
| 3 | Choose columns |
| --- | ---------------------------------------------------------------- |
| 4 | New Query |
+-----+----------------------------------------------------------------------------+
15 rows in set (1.21 sec)
Query OK, 0 rows affected (1.21 sec)
And we can execute it, or run EXPLAIN on it:
mysql> call qwz.qwz(2);
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
| 1 | SIMPLE | t8 | index | PRIMARY | PRIMARY | 3 | NULL | 4673 | |
| 1 | SIMPLE | t1 | ref | idx_fk_inventory_id,idx_fk_customer_id | idx_fk_inventory_id | 3 | sakila.t8.inventory_id | 1 | |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 2 | sakila.t1.customer_id | 1 | |
+----+-------------+-------+--------+----------------------------------------+---------------------+---------+------------------------+------+-------+
3 rows in set (0.00 sec)
As you can imagine, there is quite a good deal of information schema hacking going on here.
Code will be released in full immediately after my talk, and if there is sufficient interest I will discuss the internals of the qwz stored procedure in full.
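The core mechanism a procedure like this relies on is server-side dynamic SQL: build the statement text in a user variable, then PREPARE and EXECUTE it. A minimal sketch of the pattern follows (the procedure name and logic here are illustrative only - the real qwz procedure is of course far more elaborate):

```sql
-- Minimal illustration of the PREPARE/EXECUTE pattern that lets a
-- stored procedure run SQL it generates at runtime.
DELIMITER //
CREATE PROCEDURE run_generated_query(IN p_schema VARCHAR(64))
BEGIN
    -- Build the statement text; QUOTE() safely embeds the parameter.
    SET @sql := CONCAT(
        'SELECT table_name FROM information_schema.TABLES'
    ,   ' WHERE table_schema = ', QUOTE(p_schema)
    ,   ' ORDER BY table_name'
    );
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END //
DELIMITER ;
```

Because a stored procedure can also produce arbitrary result sets, the same trick powers both the wizard's menus and the final query execution - no proxy, UDFs, or plugins required.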