<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-15319370</id><updated>2012-01-30T15:09:11.045+01:00</updated><category term='Aggregate Functions'/><category term='data integration'/><category term='subquery'/><category term='Windows XP'/><category term='Base'/><category term='SUBSTRING_INDEX'/><category term='Performance'/><category term='lib_mysqludf_xql'/><category term='bug'/><category term='lib_mysqludf_preg'/><category term='autocommit'/><category term='Exadata'/><category term='Windows'/><category term='SQLite'/><category term='Apple'/><category term='Customers'/><category term='depencencies'/><category term='mql-to-sql'/><category term='easter'/><category term='software development'/><category term='TeProFoKATeMUR'/><category term='RANK'/><category term='www'/><category term='Pentaho Kettle Solutions'/><category term='SIGNAL'/><category term='MySQL Cluster'/><category term='Freebase'/><category term='vmx'/><category term='MySQL command line client'/><category term='Kettle repository'/><category term='Kubuntu'/><category term='Customer Experience'/><category term='spam'/><category term='Savepoint'/><category term='Data Vault'/><category term='Case'/><category term='WTF'/><category term='unicode'/><category term='stored routine'/><category term='Bernardo'/><category term='stored routines'/><category term='Fosdem'/><category term='Celko'/><category term='XML Schema'/><category term='Anchor Modeling'/><category term='i18n'/><category term='query cache'/><category term='VMWare'/><category term='CSS'/><category term='java'/><category term='pentaho data integration'/><category term='Javascript'/><category term='error 
handling'/><category term='information_schema.TABLE_CONSTRAINTS'/><category term='MySQL Cluster Study Guide'/><category term='UDF'/><category term='Opera'/><category term='synchronization'/><category term='haha'/><category term='LOAD XML'/><category term='&quot;Building Pentaho Solutions&quot;'/><category term='antitrust'/><category term='/'/><category term='gui tool'/><category term='query wizard'/><category term='Connector/J'/><category term='patents'/><category term='GPL'/><category term='Xandros'/><category term='JTidy'/><category term='MySQL information schema plugin'/><category term='vmdk'/><category term='Firefox'/><category term='Matt Casters'/><category term='.ktr file'/><category term='innodb'/><category term='view'/><category term='Sqoop'/><category term='Curt Monash'/><category term='Alex Bolenok'/><category term=';'/><category term='perfomance'/><category term='Internet Explorer'/><category term='Explain Extended'/><category term='mash-up'/><category term='T-SQL'/><category term='FEDERATED'/><category term='DIV'/><category term='MySQL Administrator'/><category term='lib_mysqludf_udf'/><category term='json'/><category term='Writer'/><category term='Desktop Database'/><category term='Vista'/><category term='Data  warehousing'/><category term='Microsoft'/><category term='Infobright'/><category term='Calpont'/><category term='ETL'/><category term='HPCC Systems'/><category term='MS Visual Studio'/><category term='SQL_MODE'/><category term='Survey'/><category term='Atomicity'/><category term='hacking'/><category term='redundant indexes'/><category term='google chart api'/><category term='Eee PC'/><category term='Oracle'/><category term='Talks'/><category term='MySQL Conference'/><category term='Disgusting'/><category term='Web SQL Database API'/><category term='MySQL University'/><category term='ECL'/><category term='Chrome'/><category term='Roxie'/><category term='mysqldump'/><category term='self-join'/><category term='slander'/><category 
term='Money'/><category term='Virtualization'/><category term='WTF?'/><category term='Google Gears'/><category term='ScaleDB'/><category term='Monty'/><category term='comments'/><category term='common'/><category term='Kettle'/><category term='Median'/><category term='hack'/><category term='mysql connector/python'/><category term='KDE'/><category term='User defined function'/><category term='OLAP'/><category term='IE6'/><category term='character encoding'/><category term='Office'/><category term='ISO 9075'/><category term='Webinar'/><category term='Localization'/><category term='Calc'/><category term='udf_init_error'/><category term='mysql administration'/><category term='coding horror'/><category term='loss of service'/><category term='MySQL UC'/><category term='join'/><category term='mysql user conference'/><category term='UDF Repository for MySQL'/><category term='lib_mysqludf_sys'/><category term='information_schema.TABLES'/><category term='Kickfire'/><category term='Open Source'/><category term='JDBC'/><category term='MySQL Certification'/><category term='division'/><category term='Essent'/><category term='Pig'/><category term='wikipedia'/><category term='drizzle'/><category term='The Symbol'/><category term='Database Administration'/><category term='Google Gears Query Tool'/><category term='namespace'/><category term='Database'/><category term='Linux'/><category term='MS Project'/><category term='webdata'/><category term='Workbench'/><category term='BI'/><category term='Database Application Development'/><category term='log'/><category term='GROUP BY'/><category term='Pentaho Solutions'/><category term='Percentile'/><category term='column oriented databases'/><category term='Mondrian'/><category term='gcc'/><category term='Jaspersoft'/><category term='Thor'/><category term='mysqlconf'/><category term='Ubuntu'/><category term='information_schema'/><category term='libcurl'/><category term='Pentaho'/><category term='replication'/><category term='error 
log'/><category term='Dutch'/><category term='single pass'/><category term='MonetDB'/><category term='Tidy'/><category term='AGPL'/><category term='MySQL 5.1'/><category term='FIND_IN_SET'/><category term='UTF8'/><category term='Wordpress'/><category term='Offline'/><category term='MQL'/><category term='SQL'/><category term='Journalism'/><category term='MS Access'/><category term='Database Stored Procedures'/><category term='database size'/><category term='Stored Procedure'/><category term='PL/SQL'/><category term='cmmn'/><category term='plugin_dir'/><category term='l10n'/><category term='Semantics'/><category term='&quot;Pentaho Solutions&quot;'/><category term='Microsoft Visual Studio'/><category term='syntax'/><category term='Webbrowser'/><category term='Element61'/><category term='presentation'/><category term='OSCON 2009'/><category term='Giuseppe Maxia'/><category term='MongoDB'/><category term='MySQL Plugin'/><category term='IEs4Linux'/><category term='geert'/><category term='slowly changing dimension'/><category term='MySQL Carrier Grade'/><category term='Source'/><category term='cursor'/><category term='Asus'/><category term='Rhino'/><category term='Microsoft Sharepoint'/><category term='MySQLForge'/><category term='Comment'/><category term='Safari'/><category term='Wiley'/><category term='MySQL information schema'/><category term='André Simões'/><category term='integer'/><category term='backup'/><category term='sys_exec'/><category term='MySQL Forge'/><category term='MySQL'/><category term='XSLT'/><category term='GROUP_CONCAT'/><category term='MySQL performance monitoring'/><category term='XML'/><category term='MySQL monitor'/><category term='terminator'/><category term='MariaDB'/><category term='Neelie Kroes'/><category term='Big Data'/><category term='Application Development'/><category term='Refactoring'/><category term='Open Office'/><category term='CouchDB'/><category term='Webservices'/><category term='SET SESSION'/><category 
term='Enterprise'/><category term='Indexed Database API'/><category term='hacked'/><category term='PostgreSQL'/><category term='DELIMITER'/><category term='XPath'/><category term='SQL Developer'/><category term='IE8'/><category term='.kjb file'/><category term='HTML'/><category term='EU'/><category term='Internationalization'/><category term='DBA'/><category term='MySQL Plugin API'/><category term='Foreign keys'/><category term='MyISAM'/><category term='Open Community Camp'/><category term='Architecture'/><category term='debugging'/><category term='SQL Standard'/><category term='analytic databases'/><category term='Apache v2 license'/><category term='Windows Update'/><category term='SCD Type 2'/><category term='Write'/><category term='Falcon'/><category term='business intelligence'/><category term='http'/><category term='sys_eval'/><category term='quantile'/><category term='Ajax'/><category term='Programming'/><category term='partitioning'/><category term='Sakila'/><category term='Politics'/><category term='Migration toolkit'/><category term='Hive'/><category term='common_schema'/><category term='users conference'/><category term='delete'/><category term='unconference'/><category term='python'/><category term='Microsoft Visual C++ Express 2005'/><category term='command line client'/><category term='Virus'/><category term='User defined variable'/><category term='Hadoop'/><category term='Conference'/><category term='outage'/><category term='Oracle / Sun deal'/><category term='qwz'/><category term='O&apos;Reilly'/><category term='Writing'/><category term='Kettle Solutions'/><category term='SCD Type 1'/><category term='Ms SQL Server'/><category term='mysql stored routine'/><category term='ONLY_FULL_GROUP_BY'/><category term='kettle-cookbook'/><category term='Percona'/><category term='Isolation'/><category term='SHOW TABLES'/><category term='xsltproc'/><category term='Jan Claes'/><category term='LucidDB'/><category term='MySQL UDF Repository'/><category term='Jeff 
Prenevost'/><category term='translation'/><category term='namespace prefix'/><category term='php'/><category term='NULL'/><category term='C/C++'/><category term='trigger'/><category term='Xmla4Js'/><category term='Stored Procedures'/><category term='Modeling'/><category term='Stored function'/><category term='rtf'/><category term='blog'/><category term='book'/><category term='ascii'/><category term='NoSQL'/><category term='Stani Michiels'/><category term='mysql 5.5'/><category term='Transactions'/><category term='Quipu'/><category term='Decadence'/><category term='notbook'/><category term='Myths'/><category term='Malware'/><category term='hello world'/><category term='captcha'/><category term='Sun'/><category term='MySQL command line tool'/><category term='MS Sharepoint'/><category term='Pedantic'/><category term='OpenOffice.org'/><category term='GFYS'/><category term='Jos van Dongen'/><category term='compiling'/><category term='Rant'/><category term='strict mode'/><category term='database development'/><title type='text'>Roland Bouman's blog</title><subtitle type='html'>Tutorials, programming examples, and opinions on database products like Oracle and MySQL, Business Intelligence and Data Warehousing (in particular, about Pentaho), and web technologies like Javascript, XML and XSLT.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default?start-index=101&amp;max-results=100'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>221</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-15319370.post-9181837979266909749</id><published>2011-12-01T20:17:00.005+01:00</published><updated>2011-12-01T22:33:54.973+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql stored routine'/><category scheme='http://www.blogger.com/atom/ns#' term='database development'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category scheme='http://www.blogger.com/atom/ns#' term='depencencies'/><category scheme='http://www.blogger.com/atom/ns#' term='Database Administration'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><category scheme='http://www.blogger.com/atom/ns#' term='common_schema'/><title type='text'>Common Schema: dependencies routines</title><content type='html'>Are you a MySQL DBA? Check out the &lt;code&gt;&lt;a href="http://code.google.com/p/common-schema/"&gt;common_schema&lt;/a&gt;&lt;/code&gt; project by Oracle Ace &lt;a href="http://code.openark.org/"&gt;Shlomi Noach&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;common_schema&lt;/code&gt; is an open source MySQL schema that packs a number of utility views, functions and stored procedures. You can use these utilities to simplify MySQL database administration and development. 
Shlomi &lt;a href="http://code.openark.org/blog/mysql/common_schema-rev-178-foreach-repeat_exec-roland-bouman-query-analysis"&gt;just released revision 178&lt;/a&gt;, and I'm happy and proud to be working together with Shlomi on this project.&lt;br /&gt;&lt;br /&gt;Among the many cool features created by Shlomi, such as &lt;code&gt;foreach&lt;/code&gt;, &lt;code&gt;repeat_exec&lt;/code&gt; and &lt;code&gt;exec_file&lt;/code&gt;, there are a few &lt;code&gt;%_dependencies&lt;/code&gt; procedures I contributed:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;get_event_dependencies(schema_name, event_name)&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;get_routine_dependencies(schema_name, routine_name)&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;get_sql_dependencies(sql, default_schema)&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;get_view_dependencies(schema_name, view_name)&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;All these procedures return a resultset that indicates which schema objects are used by the object identified by the input parameters. 
Here are a few examples that should give you an idea:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; call common_schema.get_routine_dependencies('common_schema', 'get_routine_dependencies');&lt;br /&gt;+---------------+----------------------+-------------+--------+&lt;br /&gt;| schema_name   | object_name          | object_type | action |&lt;br /&gt;+---------------+----------------------+-------------+--------+&lt;br /&gt;| common_schema | get_sql_dependencies | procedure   | call   |&lt;br /&gt;| mysql         | proc                 | table       | select |&lt;br /&gt;+---------------+----------------------+-------------+--------+&lt;br /&gt;2 rows in set (0.19 sec)&lt;br /&gt;&lt;br /&gt;Query OK, 0 rows affected (0.19 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; call common_schema.get_routine_dependencies('common_schema', 'get_sql_dependencies');&lt;br /&gt;+---------------+-------------------+-------------+--------+&lt;br /&gt;| schema_name   | object_name       | object_type | action |&lt;br /&gt;+---------------+-------------------+-------------+--------+&lt;br /&gt;| common_schema | _get_sql_token    | procedure   | call   |&lt;br /&gt;| common_schema | _sql_dependencies | table       | create |&lt;br /&gt;| common_schema | _sql_dependencies | table       | drop   |&lt;br /&gt;| common_schema | _sql_dependencies | table       | insert |&lt;br /&gt;| common_schema | _sql_dependencies | table       | select |&lt;br /&gt;+---------------+-------------------+-------------+--------+&lt;br /&gt;5 rows in set (1.59 sec)&lt;br /&gt;&lt;/pre&gt;Of course, this still leaves a lot to be desired. The main shortcoming, as I see it now, is that the dependencies are listed only one level deep: that is, the dependencies are not recursively analyzed. Another problem is that there is currently nothing to calculate reverse dependencies (which would arguably be more useful). &lt;br /&gt;&lt;br /&gt;The good news is, this is all open source, and your contributions are welcome! 
If you're interested in the source code of these routines, &lt;a href="http://code.google.com/p/common-schema/source/checkout"&gt;check out the common_schema project&lt;/a&gt;, and look in the &lt;code&gt;common_schema/routines/dependencies&lt;/code&gt; directory.&lt;br /&gt;&lt;br /&gt;If you'd like to add recursive dependencies, or reverse dependencies, then don't hesitate to contribute. If you have a one-off contribution that relates directly to these dependencies routines, then it's probably easiest if you email me directly, and I'll see what I can do to get it in. If you are interested in a more long-term contribution, it's probably best if you write to Shlomi, as he is the owner of the common_schema project.&lt;br /&gt;&lt;br /&gt;You can even contribute without implementing new features or fixing bugs. You can simply contribute by using the software and finding bugs or offering suggestions to improve it. If you have found a bug, or have an idea for an improvement or an entirely new feature, please &lt;a href="http://code.google.com/p/common-schema/issues/list"&gt;use the issue tracker&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;For now, enjoy, and until next time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-9181837979266909749?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/9181837979266909749/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=9181837979266909749' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/9181837979266909749'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/9181837979266909749'/><link rel='alternate' type='text/html' 
href='http://rpbouman.blogspot.com/2011/12/common-schema-dependencies-routines.html' title='Common Schema: dependencies routines'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-6760649886277364297</id><published>2011-10-21T15:00:00.002+02:00</published><updated>2011-10-21T15:36:34.429+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='delete'/><category scheme='http://www.blogger.com/atom/ns#' term='FEDERATED'/><category scheme='http://www.blogger.com/atom/ns#' term='innodb'/><category scheme='http://www.blogger.com/atom/ns#' term='Ms SQL Server'/><category scheme='http://www.blogger.com/atom/ns#' term='hack'/><category scheme='http://www.blogger.com/atom/ns#' term='trigger'/><title type='text'>MySQL Hacks: Preventing deletion of specific rows</title><content type='html'>Recently, someone emailed me:&lt;blockquote&gt;I have a requirement in MYSQL as follows:&lt;br /&gt;we have a table &lt;code&gt;EMP&lt;/code&gt; and we have to restrict the users not delete employees with &lt;code&gt;DEPT_ID = 10&lt;/code&gt;. 
If user executes a &lt;code&gt;DELETE&lt;/code&gt; statement without giving any &lt;code&gt;WHERE&lt;/code&gt; condition all the rows should be deleted except those with &lt;code&gt;DEPT_ID = 10&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;We are trying to write a &lt;code&gt;BEFORE DELETE&lt;/code&gt; trigger but we are not able to get this functionality.&lt;br /&gt;&lt;br /&gt;I have seen your blog where you explained about &lt;a href="http://rpbouman.blogspot.com/2005/11/using-udf-to-raise-errors-from-inside.html"&gt;Using an UDF to Raise Errors from inside MySQL Procedures and/or Triggers&lt;/a&gt;. Will it helps me to get this functionality? Could you suggest if we have any other alternatives to do this as well?&lt;/blockquote&gt;Frankly, I usually refer people who write me these things to a public forum, but this time I felt like giving it a go. I figured it would be nice to share my solution, and I'm also curious whether others have found other solutions still. &lt;br /&gt;&lt;br /&gt;(Oh, I should point out that I haven't asked what the underlying reasons are for this somewhat extraordinary requirement. I normally would do that if I were confronted with such a requirement in a professional setting. 
In this case I'm only interested in finding a crazy hack.)&lt;h3&gt;Attempt 1: Re-insert deleted rows with a trigger&lt;/h3&gt;My first suggestion was:&lt;blockquote&gt;Raising the error won't help you achieve your goal: as soon as you raise the error, the statement will either abort (in case of a non-transactional table) or rollback all row changes made up to raising the error (in case of a transactional table)&lt;br /&gt;&lt;br /&gt;Although I find the requirement strange, here's a trick you could try:&lt;br /&gt;&lt;br /&gt;write a &lt;code&gt;AFTER DELETE FOR EACH ROW&lt;/code&gt; trigger that re-inserts the rows back into the table in case the condition &lt;code&gt;DEPT_ID = 10&lt;/code&gt; is met.&lt;br /&gt;&lt;br /&gt;Hope this helps...&lt;/blockquote&gt;&lt;br /&gt;Alas, I should've actually tried it myself before replying, because it doesn't work. If you do try it, a &lt;code&gt;DELETE&lt;/code&gt; results in this runtime error:&lt;pre&gt;Can't update table 'emp' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.&lt;/pre&gt;This is also known as "the mutating table problem".&lt;h3&gt;Attempt 2: Re-insert deleted rows into a FEDERATED table&lt;/h3&gt;As it turns out, there is a workaround that meets all of the original demands. The workaround relies on &lt;a href="http://dev.mysql.com/doc/refman/5.5/en/federated-storage-engine.html"&gt;the &lt;code&gt;FEDERATED&lt;/code&gt; storage engine&lt;/a&gt;, which we can use to trick MySQL into thinking we're manipulating a different table than the one that fires the trigger. 
My first attempt went something like this:&lt;pre&gt;&lt;br /&gt;CREATE TABLE t (&lt;br /&gt;    id INT AUTO_INCREMENT PRIMARY KEY,&lt;br /&gt;    dept_id INT,&lt;br /&gt;    INDEX(dept_id)&lt;br /&gt;);&lt;br /&gt;&lt;br /&gt;CREATE TABLE federated_t (&lt;br /&gt;    id INT AUTO_INCREMENT PRIMARY KEY,&lt;br /&gt;    dept_id INT,&lt;br /&gt;    INDEX(dept_id)&lt;br /&gt;)&lt;br /&gt;ENGINE FEDERATED&lt;br /&gt;CONNECTION = 'mysql://root@localhost:3306/test/t';&lt;br /&gt;&lt;br /&gt;DELIMITER //&lt;br /&gt;&lt;br /&gt;CREATE TRIGGER adr_t&lt;br /&gt;AFTER DELETE ON t&lt;br /&gt;FOR EACH ROW&lt;br /&gt;IF old.dept_id = 10 THEN&lt;br /&gt;    INSERT INTO federated_t&lt;br /&gt;    VALUES  (old.id, old.dept_id);&lt;br /&gt;END IF;&lt;br /&gt;//&lt;br /&gt;&lt;br /&gt;DELIMITER ;&lt;br /&gt;&lt;/pre&gt;So the idea is to let the trigger re-insert the deleted rows back into the federated table, which in turn points to the original table that fired the trigger to fool MySQL into thinking it isn't touching the mutating table. Although this does prevent one from deleting any rows that satisfy the &lt;code&gt;DEPT_ID = 10&lt;/code&gt; condition, it does not work as intended:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; INSERT INTO t VALUES (1,10), (2,20), (3,30);&lt;br /&gt;Query OK, 3 rows affected (0.11 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; DELETE FROM t;&lt;br /&gt;ERROR 1159 (08S01): Got timeout reading communication packets&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; SELECT * FROM t;&lt;br /&gt;+----+---------+&lt;br /&gt;| id | dept_id |&lt;br /&gt;+----+---------+&lt;br /&gt;|  1 |      10 |&lt;br /&gt;|  2 |      20 |&lt;br /&gt;|  3 |      30 |&lt;br /&gt;+----+---------+&lt;br /&gt;3 rows in set (0.00 sec)&lt;br /&gt;&lt;/pre&gt;At this point I can only make an educated guess about the actual underlying reason for this failure. It could be that the deletion is locking the rows or even the table, thereby blocking the insert into the federated table until we get a timeout. 
Or maybe MySQL enters into an infinite loop of deletions and insertions until we hit a timeout. I didn't investigate, so I don't know, but it seems clear this naive solution doesn't solve the problem.&lt;h3&gt;Attempt 3: Deleting from the FEDERATED table and re-inserting into the underlying table&lt;/h3&gt;It turns out that we can solve it with a &lt;code&gt;FEDERATED&lt;/code&gt; table by turning the problem around: Instead of manipulating the original table, we can &lt;code&gt;INSERT&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; from the &lt;code&gt;FEDERATED&lt;/code&gt; table, and have an &lt;code&gt;AFTER DELETE&lt;/code&gt; trigger on the &lt;code&gt;FEDERATED&lt;/code&gt; table re-insert the deleted rows back into the original table:&lt;pre&gt;&lt;br /&gt;DROP TRIGGER adr_t;&lt;br /&gt;&lt;br /&gt;DELIMITER //&lt;br /&gt;&lt;br /&gt;CREATE TRIGGER adr_federated_t&lt;br /&gt;AFTER DELETE ON federated_t&lt;br /&gt;FOR EACH ROW&lt;br /&gt;IF old.dept_id = 10 THEN&lt;br /&gt;    INSERT INTO t&lt;br /&gt;    VALUES  (old.id, old.dept_id);&lt;br /&gt;END IF;&lt;br /&gt;//&lt;br /&gt;&lt;br /&gt;DELIMITER ;&lt;br /&gt;&lt;/pre&gt;Now, the &lt;code&gt;DELETE&lt;/code&gt; does work as intended:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; DELETE FROM federated_t;&lt;br /&gt;Query OK, 3 rows affected (0.14 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; SELECT * FROM federated_t;&lt;br /&gt;+----+---------+&lt;br /&gt;| id | dept_id |&lt;br /&gt;+----+---------+&lt;br /&gt;|  1 |      10 |&lt;br /&gt;+----+---------+&lt;br /&gt;1 row in set (0.00 sec)&lt;/pre&gt;Of course, to actually use this solution, one would grant applications access only to the federated table, and "hide" the underlying table so they can't bypass the trigger by deleting rows directly from the underlying table.&lt;br /&gt;&lt;br /&gt;Now, even though this solution does seem to fit the original requirements, I would not recommend it for several reasons:&lt;ul&gt;&lt;li&gt;It uses the 
&lt;code&gt;FEDERATED&lt;/code&gt; storage engine, which hasn't been well supported. For that reason, it isn't enabled by default, and you need access to the MySQL configuration to enable it, limiting the applicability of this solution. Also, you could run into some nasty performance problems with the &lt;code&gt;FEDERATED&lt;/code&gt; storage engine&lt;/li&gt;&lt;li&gt;The solution relies on a trigger. In MySQL, triggers can really limit performance&lt;/li&gt;&lt;li&gt;Perhaps the most important reason is that this solution performs "magic" by altering the behaviour of SQL statements. Arguably, this is not so much the fault of the solution as it is of the original requirement.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;An Alternative without relying on magic: a foreign key constraint&lt;/h3&gt;If I were to encounter the original requirement in a professional situation, I would argue that we should not desire to alter the semantics of SQL commands. If we tell the RDBMS to delete all rows from a table, it should either succeed and result in all rows being deleted, or it should fail and fail completely, leaving the data unchanged.&lt;br /&gt;&lt;br /&gt;So how would we go about implementing a solution for this changed requirement? &lt;br /&gt;&lt;br /&gt;We certainly could try the approach that was suggested in the original request: create a trigger that raises an exception whenever we find the row should not be deleted. However, this would still rely on a trigger (which is slow). And if you're not on MySQL 5.5 (or higher), you would have to use one of the ugly hacks to raise an exception.&lt;br /&gt;&lt;br /&gt;As it turns out, there is a very simple solution that does not rely on triggers. 
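&lt;br /&gt;&lt;br /&gt;(For reference, the trigger-based approach just mentioned could look roughly like this on MySQL 5.5 or higher, where &lt;code&gt;SIGNAL&lt;/code&gt; is available. This is only a minimal sketch against the &lt;code&gt;t&lt;/code&gt; table from the earlier examples: the trigger name &lt;code&gt;bdr_t&lt;/code&gt;, the generic &lt;code&gt;45000&lt;/code&gt; SQLSTATE and the message text are illustrative assumptions, not part of the original request:&lt;pre&gt;&lt;br /&gt;DELIMITER //&lt;br /&gt;&lt;br /&gt;-- Sketch only: trigger name and message text are made up for illustration&lt;br /&gt;CREATE TRIGGER bdr_t&lt;br /&gt;BEFORE DELETE ON t&lt;br /&gt;FOR EACH ROW&lt;br /&gt;IF old.dept_id = 10 THEN&lt;br /&gt;    -- SQLSTATE 45000 is the generic "unhandled user-defined exception" state&lt;br /&gt;    SIGNAL SQLSTATE '45000'&lt;br /&gt;        SET MESSAGE_TEXT = 'Rows with dept_id = 10 cannot be deleted';&lt;br /&gt;END IF;&lt;br /&gt;//&lt;br /&gt;&lt;br /&gt;DELIMITER ;&lt;br /&gt;&lt;/pre&gt;With such a trigger, a &lt;code&gt;DELETE&lt;/code&gt; that reaches a protected row is aborted with an error instead of silently skipping it. The remainder of this section avoids triggers altogether.)&lt;br /&gt;&lt;br /&gt;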
We can create a "guard table" that references the table we want to protect using a foreign key constraint:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; CREATE TABLE t_guard (&lt;br /&gt;    -&amp;gt;     dept_id INT PRIMARY KEY,&lt;br /&gt;    -&amp;gt;     FOREIGN KEY (dept_id)&lt;br /&gt;    -&amp;gt;         REFERENCES t(dept_id)&lt;br /&gt;    -&amp;gt; );&lt;br /&gt;Query OK, 0 rows affected (0.11 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; INSERT INTO t_guard values (10);&lt;br /&gt;Query OK, 1 row affected (0.08 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; DELETE FROM t;&lt;br /&gt;ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`test`.`t_guard`, CONSTRAINT `t_guard_ibfk_1` FOREIGN KEY (`dept_id`) REFERENCES `t` (`dept_id`))&lt;br /&gt;mysql&amp;gt; DELETE FROM t WHERE dept_id != 10;&lt;br /&gt;Query OK, 2 rows affected (0.05 sec)&lt;br /&gt;&lt;/pre&gt;(As in the prior example with the federated table, the guard table would not be accessible to the application, and the "guard rows" would have to be inserted by a privileged user.)&lt;h3&gt;Finally: what a quirky foreign key constraint!&lt;/h3&gt;You might have noticed that there's something quite peculiar about the foreign key constraint: typically, foreign key constraints serve to relate "child" rows to their respective "parent" row. To do that, the foreign key would typically point to a column (or set of columns) that make up either the primary key or a unique constraint in the parent table. But in this case, the referenced column &lt;code&gt;dept_id&lt;/code&gt; in the &lt;code&gt;t&lt;/code&gt; table is contained only in an index which is not unique. Strange as it may seem, this is allowed by MySQL (or rather, InnoDB). In this particular case, this flexibility (or is it a bug?) 
serves us quite well, and it allows us to guard many rows in the &lt;code&gt;t&lt;/code&gt; table with &lt;code&gt;dept_id = 10&lt;/code&gt; with just one single row in the guard table.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-6760649886277364297?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/6760649886277364297/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=6760649886277364297' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6760649886277364297'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6760649886277364297'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/10/mysql-hacks-preventing-deletion-of.html' title='MySQL Hacks: Preventing deletion of specific rows'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-8500678325736759158</id><published>2011-10-07T08:50:00.002+02:00</published><updated>2011-10-07T08:59:08.918+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='captcha'/><category scheme='http://www.blogger.com/atom/ns#' term='blog'/><category scheme='http://www.blogger.com/atom/ns#' term='comments'/><category scheme='http://www.blogger.com/atom/ns#' term='spam'/><title type='text'>Fighting Spam: Word Verification</title><content type='html'>Hi All,&lt;br /&gt;&lt;br /&gt;this is a quick note to 
let you know that from now on, commenters on this blog will need to complete a word verification (captcha) step. &lt;br /&gt;&lt;br /&gt;Personally, I regret having to take this measure. Let me explain why I'm doing it anyway. &lt;br /&gt;&lt;br /&gt;For the past three months or so, moderating comments on this blog has become a real drag due to a surge in anonymous spam. While Blogger's spam detection is quite good, I still get notification mails prompting me to moderate. I feel this is consuming more of my time than it's worth.&lt;br /&gt;&lt;br /&gt;Except for requiring word verification, other policies (or lack thereof) are still in effect: all comments are moderated, but anyone can comment, even anonymously. In practice, all real comments get published - even negative or derogatory ones (should I receive them).&lt;br /&gt;&lt;br /&gt;Sorry for the inconvenience, but I hope you'll understand.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-8500678325736759158?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/8500678325736759158/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=8500678325736759158' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8500678325736759158'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8500678325736759158'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/10/fighting-spam-word-verification.html' title='Fighting Spam: Word Verification'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-7795034775481724057</id><published>2011-08-24T22:40:00.005+02:00</published><updated>2011-08-24T23:19:02.271+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql stored routine'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql 5.5'/><category scheme='http://www.blogger.com/atom/ns#' term='error handling'/><category scheme='http://www.blogger.com/atom/ns#' term='Database Stored Procedures'/><category scheme='http://www.blogger.com/atom/ns#' term='udf_init_error'/><category scheme='http://www.blogger.com/atom/ns#' term='SIGNAL'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL UDF Repository'/><title type='text'>Re-implementing udf_init_error in MySQL 5.5 and up</title><content type='html'>To whom it may concern - &lt;br /&gt;&lt;br /&gt;Today, I received an email from a user of the &lt;a href="http://www.mysqludf.org/lib_mysqludf_udf/index.php#udf_init_error"&gt;udf_init_error&lt;/a&gt; UDF (which resides in the &lt;a href="http://www.mysqludf.org/lib_mysqludf_udf/"&gt;lib_mysqludf_udf&lt;/a&gt; library). The purpose of this &lt;a href="http://dev.mysql.com/doc/refman/5.5/en/udf-features.html"&gt;UDF&lt;/a&gt; is to generate an error condition, which can be used to abruptly terminate a trigger or stored procedure. As such it is a workaround for &lt;a href="http://bugs.mysql.com/bug.php?id=11661"&gt;bug #11661&lt;/a&gt;.  This is all described extensively in my now ancient article &lt;a href="http://rpbouman.blogspot.com/2005/11/using-udf-to-raise-errors-from-inside.html"&gt;here&lt;/a&gt;.
&lt;br /&gt;&lt;br /&gt;The user wrote me because of a problem experienced in MySQL 5.5: &lt;blockquote&gt;...calling &lt;pre&gt;select udf_init_error('Transaction Cannot Be Done Because....');&lt;/pre&gt; will return user friendly error message: &lt;pre&gt;Transaction Cannot Be Done Because....&lt;/pre&gt;. But in MySQL 5.5, it returns &lt;pre&gt;Can't initialize function 'udf_init_error; Transaction Cannot Be Done Because....&lt;/pre&gt; The &lt;code&gt;Can't initialize function 'udf_init_error;&lt;/code&gt; bit is so annoying! How can I get rid of that?&lt;/blockquote&gt;I explained that the UDF still works like it should; it's just that at some point during the 5.0 lifecycle, the format of the error message was changed. (I can't recall exactly which version that was, but I did file &lt;a href="http://bugs.mysql.com/bug.php?id=38452"&gt;bug #38452&lt;/a&gt; that describes this issue).&lt;br /&gt;&lt;br /&gt;Anyway, I suggested to move away from using the &lt;code&gt;udf_init_error()&lt;/code&gt; UDF, and port all dependent code to use the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.5/en/signal.html"&gt;SIGNAL&lt;/a&gt;&lt;/code&gt; syntax instead, which was introduced in MySQL 5.5. (For a friendly introduction to using the &lt;code&gt;SIGNAL&lt;/code&gt; syntax, please check out &lt;a href="http://rpbouman.blogspot.com/2009/12/validating-mysql-data-entry-with_15.html"&gt;one of my prior articles&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Unfortunately, for this particular user this would not be an easy task:&lt;blockquote&gt; The use of SIGNAL did come to my mind, but the implementation is not easy. I have thousands of stored routines to modify. 
Besides, I'm already satisfied with what the UDF does.&lt;/blockquote&gt;On the one hand, it makes me happy to hear the &lt;code&gt;udf_init_error()&lt;/code&gt; UDF served him so well that he wrote so many routines that rely on it; on the other hand, I feel bad that this is holding him back from upgrading to MySQL 5.5.&lt;br /&gt;&lt;br /&gt;For everybody who is in this same position, I'd like to suggest the following solution: simply re-implement &lt;code&gt;udf_init_error()&lt;/code&gt; as a stored SQL function that uses the &lt;code&gt;SIGNAL&lt;/code&gt; functionality instead. The error message returned to the client will not be exactly the same as in the olden MySQL 5.0 days, but at least there will not be an annoying complaint about a UDF that cannot be initialized. &lt;br /&gt;&lt;br /&gt;Here's a very simple example that illustrates how to do it:&lt;pre&gt;CREATE FUNCTION udf_init_error(&lt;br /&gt;   p_message VARCHAR(80)&lt;br /&gt;)&lt;br /&gt;RETURNS INTEGER&lt;br /&gt;DETERMINISTIC&lt;br /&gt;NO SQL&lt;br /&gt;BEGIN&lt;br /&gt;   DECLARE err CONDITION FOR SQLSTATE '45000';&lt;br /&gt;   SIGNAL err SET MESSAGE_TEXT = p_message;&lt;br /&gt;   RETURN 1;&lt;br /&gt;END;&lt;/pre&gt;I hope this helps.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-7795034775481724057?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/7795034775481724057/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=7795034775481724057' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7795034775481724057'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7795034775481724057'/><link 
rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/08/re-implementing-udfiniterror-in-mysql.html' title='Re-implementing udf_init_error in MySQL 5.5 and up'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-8271811154218255754</id><published>2011-08-15T09:45:00.000+02:00</published><updated>2011-08-15T09:45:00.268+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Vault'/><category scheme='http://www.blogger.com/atom/ns#' term='Xmla4Js'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='mql-to-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='Anchor Modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='MQL'/><title type='text'>Proposals for Codebits.EU</title><content type='html'>&lt;a href="http://codebits.eu" border="0"&gt;&lt;img src="http://codebits.eu/imgs/logo_site2011_v1.png"/&gt;&lt;/a&gt; &lt;a href="http://codebits.eu"&gt;Codebits&lt;/a&gt; is an annual 3-day conference about software and, well, code. It's organized by SAPO and this year's edition is to be held on November 10 thru 12 at the &lt;a href="http://www.pavilhaoatlantico.pt/vEN/AboutPavilhaoAtlantico/LocationAndAcess/Pages/LocationAndAcess.aspx"&gt;Pavilhão Atlântico&lt;/a&gt;, Sala Tejo in Lisbon, Portugal. 
&lt;br /&gt;&lt;br /&gt;I've never attended SAPO Codebits before, but I heard good things about it from Datacharmer &lt;a href="http://datacharmer.blogspot.com/"&gt;Giuseppe Maxia&lt;/a&gt;. The interesting thing about the way this conference is organized is that all proposals are available to the public, which can also vote for the proposals. &lt;a href="http://codebits.eu/intra/s/talks"&gt;This year's proposals&lt;/a&gt; are looking very interesting already, with high quality proposals from &lt;a href="http://codebits.eu/intra/s/user/580"&gt;Giuseppe&lt;/a&gt; about database replication with Tungsten replicator, Pentaho's chief of data integration &lt;a href="http://codebits.eu/intra/s/user/1649"&gt;Matt Casters&lt;/a&gt; about Kettle (aka Pentaho data integration), and &lt;a href="http://codebits.eu/intra/s/user/1653"&gt;Pedro Alves&lt;/a&gt; from webdetails who will be talking about "Big Data" analysis and dashboarding work he did for the Mozilla team. &lt;br /&gt;&lt;br /&gt;There are many more interesting talks, and you should simply &lt;a href="http://codebits.eu/intra/s/talks"&gt;check out the proposals for yourself&lt;/a&gt; and give a thumbs up or a thumbs down according to whether you'd like see a particular proposal at the conference. I decided to send in a few proposals as well:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://codebits.eu/intra/s/proposal/183"&gt;MQL-to-SQL: a JSON-based query language for RDBMS access from AJAX applications&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://codebits.eu/intra/s/proposal/184"&gt;DataVault and Anchor Modeling: Methods for auditable and agile Data Warehousing&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://codebits.eu/intra/s/proposal/185"&gt;Xmla4js: Online Analytical Processing and Business Intelligence for web applications&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;So, if you like what you see here, take a minute to vote and shape this codebits conference. 
I'm hoping to meet you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-8271811154218255754?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/8271811154218255754/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=8271811154218255754' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8271811154218255754'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8271811154218255754'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/08/proposals-for-codebitseu.html' title='Proposals for Codebits.EU'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-1512738441360831140</id><published>2011-08-12T11:59:00.007+02:00</published><updated>2011-08-12T18:17:58.045+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='O&apos;Reilly'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql user conference'/><category scheme='http://www.blogger.com/atom/ns#' term='Percona'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL UC'/><title type='text'>Regarding the MySQL Conference and Expo 2012</title><content type='html'>Last week, &lt;a href="http://www.mysqlperformanceblog.com/2011/08/09/announcing-percona-live-mysql-conference-and-expo-2012/"&gt;Baron 
Schwartz&lt;/a&gt; announced the &lt;a href="http://www.percona.com/live/mysql-conference-2012/"&gt;Percona Live MySQL Conference and Expo 2012&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.percona.com/about-us/conferences/"&gt;Percona has organized MySQL-related conferences and seminars before&lt;/a&gt;, and from what I've heard, with considerable success and to the satisfaction of its attendees, and there's one coming up in &lt;a href="http://www.percona.com/live/london-2011/"&gt;London in October 2011&lt;/a&gt;. But arguably, last week's announcement is quite different from the prior Percona conferences. It's different, because it seeks to replace the annual O'Reilly MySQL Conference and Expo. &lt;br /&gt;&lt;br /&gt;Everyone who has read the announcement will have no trouble recognizing it as a replacement, since it reads:&lt;blockquote&gt;We all know that the entire MySQL community has been waiting to see if there will be a MySQL conference next year in the traditional date and location. To the best of our knowledge, no one else was planning one, so we decided to keep the tradition alive.&lt;/blockquote&gt;If you're still in doubt:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;The conference title contains the phrase: "MySQL Conference and Expo".&lt;/li&gt;&lt;br /&gt;&lt;li&gt;It's to be held in the Hyatt Regency Hotel in Santa Clara, which has been the venue for the O'Reilly MySQL Conference and Expo since at least 2005.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;It's scheduled to take place in mid-April, exactly like the O'Reilly conferences used to be.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The scope of the conference encompasses the entire "eco-system" - whatever that is: Developers and DBAs; tools and techniques; tutorials, talks and BOFs. 
It's about users, but also explicitly about companies and businesses.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;Immediately following the announcement, bloggers from the MySQL community - all of whom I respect, and consider friends of mine - started posting their opinions:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.skysql.com/company/management"&gt;Kaj Arnö&lt;/a&gt;, EVP Products of &lt;a href="http://www.skysql.com"&gt;SkySQL&lt;/a&gt;: &lt;a href="http://blogs.skysql.com/2011/08/santa-clara-mysql-conference-2012-unity.html"&gt;"Santa Clara MySQL Conference 2012: Unity or division?"&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://datacharmer.blogspot.com/"&gt;Giuseppe Maxia&lt;/a&gt;, member of the IOUG's (&lt;a href="http://www.ioug.org/"&gt;Independent Oracle User Group&lt;/a&gt;) &lt;a href="http://www.ioug.org/Events/IOUGWelcomesMySQL/tabid/164/Default.aspx"&gt;MySQL Council&lt;/a&gt;: &lt;a href="http://datacharmer.blogspot.com/2011/08/call-for-disclosure-on-mysql-conference.html"&gt;"Call for disclosure on MySQL Conference 2012"&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://palominodb.com/who-we-are"&gt;Sheeri Cabral&lt;/a&gt;, sr. 
DBA and Community Liaison at &lt;a href="http://palominodb.com/"&gt;PalominoDB&lt;/a&gt;: &lt;a href="http://palominodb.com/blog/2011/08/10/disclosure-truth-about-mysql-2012-conference-planning"&gt;"Disclosure: Truth About MySQL 2012 Conference Planning"&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://openlife.cc"&gt;Henrik Ingo&lt;/a&gt;: &lt;a href="http://openlife.cc/blogs/2011/august/mysql-conference-back-2012-courtesy-percona"&gt;"The MySQL conference is back for 2012, courtesy of Percona"&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.tusacentral.net/"&gt;Marco Tusa&lt;/a&gt;: &lt;a href="http://www.tusacentral.net/joomla/index.php/mysql-blogs/108-a-missed-opportunity"&gt;"A Missed Opportunity?"&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://monty-says.blogspot.com"&gt;Monty Widenius&lt;/a&gt; of &lt;a href="http://montyprogram.com/about/"&gt;Monty Program ab&lt;/a&gt;: &lt;a href="http://monty-says.blogspot.com/2011/08/what-is-happening-with-mysql-conference.html"&gt;"What is happening with the MySQL conference?"&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;(In this list, I tried to state the affiliation of each blogger as accurately and relevantly as possible. Please let me know if you feel I wrongly associated someone with a particular company or organizational body.)&lt;br /&gt;&lt;br /&gt;Except for Henrik's post, all of these express a negative attitude towards Percona's announcement. The critique focuses on a few themes:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Giuseppe and Sheeri express similar thoughts. 
They recall how Baron Schwartz and Peter Zaitsev (both from Percona, and now initiating the 2012 conference) criticized the O'Reilly MySQL Conference and Expo 2008 edition for increasingly becoming an event focused on business and vendors, rather than on users (see &lt;a href="http://www.xaprb.com/blog/2008/04/23/like-it-or-not-it-is-the-mysql-conference-and-expo/"&gt;here&lt;/a&gt; and &lt;a href="http://www.mysqlperformanceblog.com/2008/04/23/conference-for-mysql-users/"&gt;here&lt;/a&gt;). There seems to be a hidden accusation that now, only a few years later, Baron and Peter are "guilty" of organizing a business-oriented conference themselves.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Also both Sheeri and Giuseppe's posts express the concern that Oracle might not allow any of its MySQL engineers and architects to speak at the conference. This would arguably make it a less interesting conference as Oracle is a major -if not the main- contributor to both MySQL and InnoDB.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;All bloggers argue that organizing a conference of this scale should not be the effort of a single company. In particular Kaj Arnö and Monty Widenius allude to the possibility of O'Reilly organizing the conference again, just like the way things used to be. They both explicitly include a list of the major companies contributing to MySQL which they envision should help drive such a conference. &lt;br /&gt;&lt;br /&gt;The main concern here is that when a single company organizes this event, it will be their event. In other words, it will not be neutral. There is serious concern for unfair competition, as the organizing company gets to decide or exert greater influence on which talks are approved, and how talks are scheduled against each other.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;Although I understand the critique, I do not agree with it. 
I hope I'm not offending any of my friends, but I think none of these seemingly sensible arguments against Percona organizing the MySQL Conference have true merit. &lt;br /&gt;&lt;br /&gt;But before I explain, I think it's interesting to observe that nobody seems to take it as a given that O'Reilly would be organizing another MySQL Conference and Expo. Monty comes closest to saying something about it:&lt;blockquote&gt;The reason for my state of mind is that although there have been rumors about discontinuance of the O'Reilly arranged conference there hasn't been any announcement about this. &lt;br /&gt;&lt;br /&gt;In fact, I have been working with O'Reilly to try to setup next year's O'Reilly MySQL conference with the intention of having it 'exactly like before', even if Oracle would not participate. &lt;/blockquote&gt; So basically, because O'Reilly didn't say they weren't going to do one, it might be possible, right :) I tend to look at it differently: It means exactly nothing when someone, O'Reilly included, didn't announce something. The way I see it, O'Reilly has nothing to gain by announcing that they will not be organizing another MySQL Conference. Similarly, they've got nothing to lose by not announcing it. &lt;br /&gt;&lt;br /&gt;In the end, organizing conferences is one of O'Reilly's business activities. The mere fact that they've been organizing one during the previous years does not bestow any special responsibility upon them to inform potential attendees and sponsors that they are discontinuing such an activity.&lt;br /&gt;&lt;br /&gt;It's interesting that Monty mentions he was working together with O'Reilly on it. I have no reason to doubt it, but I do suspect that whatever was in the works, it was probably not going to happen at the traditional location and at the traditional time window. 
Silicon Valley Conference centers are busy places, and need to be reserved well in advance - starting to work on it less than three quarters in advance can probably not be considered "well in advance" for an event of this scale.&lt;br /&gt;&lt;br /&gt;Now, here are my arguments as to why I do not share the opinions I mentioned above:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;If you read Peter and Baron's posts from way back in 2008 (I included the links already, but &lt;a href="http://www.xaprb.com/blog/2008/04/23/like-it-or-not-it-is-the-mysql-conference-and-expo/"&gt;here&lt;/a&gt; and &lt;a href="http://www.mysqlperformanceblog.com/2008/04/23/conference-for-mysql-users/"&gt;here&lt;/a&gt; they are again), you will notice that they were not in fact criticizing the O'Reilly MySQL Conference and Expo at all. They simply recognized there was a gap and felt there should also be a community-driven conference. In fact, Baron initiated such an event, the &lt;a href="http://opensqlcamp.org/Main_Page"&gt;Open SQL Camp&lt;/a&gt;. That turned out to be such a great success that others started organizing OpenSQL Camps too. &lt;br /&gt;&lt;br /&gt;Now, if you read Baron's &lt;a href="http://www.mysqlperformanceblog.com/2011/08/09/announcing-percona-live-mysql-conference-and-expo-2012/"&gt;announcement&lt;/a&gt; for the Percona MySQL Conference and Expo, you'll notice that precisely because the tables are turned, they now feel the need to maintain a business-driven MySQL conference. They simply recognized that now there is the risk of a gap as far as a business-driven MySQL conference is concerned. In other words, there is no question of should this be a business-driven event or a community-driven event. Both kinds of events are needed, and the business one wasn't being taken care of, neither by O'Reilly, nor by Oracle.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The concern that Oracle might not allow its engineers to attend a conference that is organized by a competitor seems reasonable. 
But it assumes that they would allow it if it were a vendor-neutral conference, or at least a conference that could be perceived as such. To those that have been involved to some extent in the organization of the 2010 and 2011 editions of the MySQL Conference, it should be no secret that Oracle's participation hasn't been exactly eager. Just listen to &lt;a href="http://en.oreilly.com/mysql2010/public/schedule/detail/12441"&gt;Tim O'Reilly's own talk&lt;/a&gt; at the MySQL 2010 conference. If that doesn't convince you, look at the &lt;a href="http://en.oreilly.com/mysql2011/public/content/sponsors"&gt;sponsor list&lt;/a&gt; for the 2011 edition: no Oracle. And if that still doesn't convince you - Last year it was very unclear whether Oracle was sending any delegation at all. Only at a very late stage did we receive proposals from Oracle. &lt;br /&gt;&lt;br /&gt;To be clear, I am not blaming Oracle for not wishing to participate in a particular conference. They have their own strategy and they are entitled to execute that however they see fit, even if that includes not sponsoring or speaking at a major MySQL conference. I'm just arguing that whether or not such a conference is organized by a vendor neutral party does not seem to be a part of Oracle's consideration. There is in my opinion absolutely no guarantee that Oracle would participate if things really would be like they used to, and O'Reilly and not Percona would be organizing another edition.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The final matter I'd like to discuss is the idea of all major MySQL contributing companies working together to organize a conference. Although I think that's a very sympathetic idea, it doesn't seem very realistic to me. Or at least, it doesn't seem realistic that this would lead to a Santa Clara conference in April 2012. 
So maybe this is something all involved parties should discuss for the years to come.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;My final remarks are that in the end, I am mostly happy that at least one party is willing to take the up-front risk in securing the venue. Percona has announced that, just like O'Reilly, they want to set up a board of community members to drive the program. I think other companies have legitimate concerns over vendor neutrality, but I have no reason to doubt that Baron, and by extension, Percona are doing whatever they can to safeguard that.&lt;br /&gt;&lt;br /&gt;What is left is Percona's company name in the conference title. While Percona is likely to use that to drive their own business, this is not really different from all pre-2010 MySQL conferences, where MySQL AB and then Sun used it to drive theirs. That does not mean it excludes competitors using the conference to their advantage to pursue their business interests. There has always been room for competing vendors, and arguably that is what made the event not just *a* MySQL conference, but *the* MySQL conference. &lt;br /&gt;&lt;br /&gt;I understand that not everybody is happy about how things are going now, and it would be great if all companies that feel they have a stake here collaborate in the future. But for now, I'm really happy there will be a 2012 edition, and I thank Percona for organizing it. I will definitely send in proposals as soon as the call for papers is open, and I hope everybody that feels they have something to talk about or present will do the same. 
I really believe this can be a conference exactly like it was before, with the only difference that it's organized by Percona, and not O'Reilly.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-1512738441360831140?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/1512738441360831140/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=1512738441360831140' title='24 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/1512738441360831140'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/1512738441360831140'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/08/regarding-mysql-conference-and-expo.html' title='Regarding the MySQL Conference and Expo 2012'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>24</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-7006951792989834499</id><published>2011-06-22T00:13:00.004+02:00</published><updated>2011-06-22T02:20:35.726+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='XPath'/><category scheme='http://www.blogger.com/atom/ns#' term='XML Schema'/><category scheme='http://www.blogger.com/atom/ns#' term='namespace prefix'/><category scheme='http://www.blogger.com/atom/ns#' term='namespace'/><category scheme='http://www.blogger.com/atom/ns#' term='XML'/><category scheme='http://www.blogger.com/atom/ns#' term='XSLT'/><title type='text'>Working with 
namespaces and namespace prefixes in XSLT 1.0</title><content type='html'>To whom it may concern - &lt;br /&gt;&lt;br /&gt;I'm currently developing an &lt;a href="http://www.w3.org/TR/xslt"&gt;XSLT 1.0&lt;/a&gt; stylesheet for analysis of &lt;a href="http://en.wikipedia.org/wiki/XML_Schema_(W3C)"&gt;XML Schema&lt;/a&gt; documents. As part of this work, I developed a couple of templates for working with namespace names and prefixes, and I'd like to share them via this post. The code is not incredibly hard or advanced, but it gets the job done and it may save you some time if you need something similar.&lt;ul&gt;&lt;li&gt;&lt;code&gt;get-local-name&lt;/code&gt;: Return the local name part of a given QName.&lt;/li&gt;&lt;li&gt;&lt;code&gt;get-prefix&lt;/code&gt;: Return the prefix part of a given QName.&lt;/li&gt;&lt;li&gt;&lt;code&gt;get-ns-name&lt;/code&gt;: Return the namespace name associated to the given prefix.&lt;/li&gt;&lt;li&gt;&lt;code&gt;get-ns-prefix&lt;/code&gt;: Return a prefix that can be used to denote the given namespace name.&lt;/li&gt;&lt;li&gt;&lt;code&gt;resolve-ns-identifier&lt;/code&gt;: Return the namespace name for a given QName prefix.&lt;/li&gt;&lt;/ul&gt;Before I discuss the code, I want to make a few remarks: &lt;ol&gt;&lt;li&gt;This is all about XSLT 1.0 and its query language XPath 1.0. All these things can be solved much more conveniently in XPath 2.0, and hence in XSLT 2.0 because that builds on XPath 2.0 (and not XPath 1.0 like XSLT 1.0 does).&lt;/li&gt;&lt;li&gt;If you're planning to use this in a web-browser, and you want to target Firefox, you're out of luck. Sorry. 
Firefox is a great browser, but unlike Chrome, Opera and even Internet Explorer, it doesn't care enough about XSLT to fix &lt;a href="https://bugzilla.mozilla.org/show_bug.cgi?id=94270"&gt;bug #94270&lt;/a&gt;, which has been living in their bug tracker since August 2001 (nope - I didn't mistype 2011, that's 2001 as in almost a decade ago).&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;get-local-name&lt;/h3&gt;Return the local name part of a given QName. This is functionally equivalent to the &lt;code&gt;&lt;a href="http://www.w3.org/TR/2010/REC-xpath-functions-20101214/#func-local-name-from-QName"&gt;fn:local-name-from-QName&lt;/a&gt;&lt;/code&gt; function in XPath 2.0.&lt;pre&gt;&amp;lt;!-- get the substring after the *last* colon (or the argument if no colon) --&amp;gt;&lt;br /&gt;&amp;lt;xsl:template name="get-local-name"&amp;gt;&lt;br /&gt;    &amp;lt;xsl:param name="qname"/&amp;gt;&lt;br /&gt;    &amp;lt;xsl:choose&amp;gt;&lt;br /&gt;        &amp;lt;xsl:when test="contains($qname, ':')"&amp;gt;&lt;br /&gt;            &amp;lt;xsl:call-template name="get-local-name"&amp;gt;&lt;br /&gt;                &amp;lt;xsl:with-param name="qname" select="substring-after($qname, ':')"/&amp;gt;&lt;br /&gt;            &amp;lt;/xsl:call-template&amp;gt;&lt;br /&gt;        &amp;lt;/xsl:when&amp;gt;&lt;br /&gt;        &amp;lt;xsl:otherwise&amp;gt;&lt;br /&gt;            &amp;lt;xsl:value-of select="$qname"/&amp;gt;&lt;br /&gt;        &amp;lt;/xsl:otherwise&amp;gt;&lt;br /&gt;    &amp;lt;/xsl:choose&amp;gt;&lt;br /&gt;&amp;lt;/xsl:template&amp;gt;&lt;/pre&gt;&lt;h3&gt;get-prefix&lt;/h3&gt;Return the prefix part of a given QName. 
This is functionally equivalent to the &lt;code&gt;&lt;a href="http://www.w3.org/TR/2010/REC-xpath-functions-20101214/#func-prefix-from-QName"&gt;fn:prefix-from-QName&lt;/a&gt;&lt;/code&gt; function in XPath 2.0.&lt;pre&gt;&amp;lt;!-- get the substring before the *last* colon (or empty string if no colon) --&amp;gt;&lt;br /&gt;&amp;lt;xsl:template name="get-prefix"&amp;gt;&lt;br /&gt;    &amp;lt;xsl:param name="qname"/&amp;gt;&lt;br /&gt;    &amp;lt;!-- accumulator for the recursion; callers should not pass it --&amp;gt;&lt;br /&gt;    &amp;lt;xsl:param name="prefix" select="''"/&amp;gt;&lt;br /&gt;    &amp;lt;xsl:choose&amp;gt;&lt;br /&gt;        &amp;lt;xsl:when test="contains($qname, ':')"&amp;gt;&lt;br /&gt;            &amp;lt;xsl:call-template name="get-prefix"&amp;gt;&lt;br /&gt;                &amp;lt;xsl:with-param name="qname" select="substring-after($qname, ':')"/&amp;gt;&lt;br /&gt;                &amp;lt;!-- keep the colon so that multi-colon input (such as a namespace name) stays intact --&amp;gt;&lt;br /&gt;                &amp;lt;xsl:with-param name="prefix" select="concat($prefix, substring-before($qname, ':'), ':')"/&amp;gt;&lt;br /&gt;            &amp;lt;/xsl:call-template&amp;gt;&lt;br /&gt;        &amp;lt;/xsl:when&amp;gt;&lt;br /&gt;        &amp;lt;xsl:otherwise&amp;gt;&lt;br /&gt;            &amp;lt;!-- drop the trailing colon added during recursion --&amp;gt;&lt;br /&gt;            &amp;lt;xsl:value-of select="substring($prefix, 1, string-length($prefix) - 1)"/&amp;gt;&lt;br /&gt;        &amp;lt;/xsl:otherwise&amp;gt;&lt;br /&gt;    &amp;lt;/xsl:choose&amp;gt;&lt;br /&gt;&amp;lt;/xsl:template&amp;gt;&lt;/pre&gt;&lt;h3&gt;get-ns-name&lt;/h3&gt;Return the namespace name associated with the given prefix. This is functionally equivalent to the &lt;code&gt;&lt;a href="http://www.w3.org/TR/2010/REC-xpath-functions-20101214/#func-namespace-uri-for-prefix"&gt;fn:namespace-uri-for-prefix&lt;/a&gt;&lt;/code&gt; function in XPath 2.0. 
The main difference is that this template does the lookup against the namespace definitions that are in effect in the current context, whereas the XPath 2.0 function allows the element which is used as context to be passed in as argument.&lt;pre&gt;&amp;lt;!-- get the namespace uri for the namespace identified by the prefix in the parameter --&amp;gt;&lt;br /&gt;&amp;lt;xsl:template name="get-ns-name"&amp;gt;&lt;br /&gt;    &amp;lt;xsl:param name="ns-prefix"/&amp;gt;&lt;br /&gt;    &amp;lt;xsl:variable name="ns-node" select="namespace::node()[local-name()=$ns-prefix]"/&amp;gt;&lt;br /&gt;    &amp;lt;xsl:value-of select="$ns-node"/&amp;gt;&lt;br /&gt;&amp;lt;/xsl:template&amp;gt;&lt;/pre&gt;&lt;h3&gt;get-ns-prefix&lt;/h3&gt;Return a prefix that can be used to denote the given namespace name. This template is complementary to the &lt;code&gt;get-ns-name&lt;/code&gt; template. This template assumes only one prefix will be defined for each namespace. The namespace is resolved against the current context.&lt;pre&gt;&amp;lt;!-- get the namespace prefix for the namespace name parameter --&amp;gt;&lt;br /&gt;&amp;lt;xsl:template name="get-ns-prefix"&amp;gt;&lt;br /&gt;    &amp;lt;xsl:param name="ns-name"/&amp;gt;&lt;br /&gt;    &amp;lt;xsl:variable name="ns-node" select="namespace::node()[.=$ns-name]"/&amp;gt;&lt;br /&gt;    &amp;lt;xsl:value-of select="local-name($ns-node)"/&amp;gt;&lt;br /&gt;&amp;lt;/xsl:template&amp;gt;&lt;/pre&gt;&lt;h3&gt;resolve-ns-identifier&lt;/h3&gt;Return the namespace name for a given QName prefix (be it a namespace prefix or a namespace name). This template is useful to generically obtain a namespace name when feeding it the prefix part of a QName. If the prefix happens to be a namespace name, then that is returned, but if it happens to be a namespace prefix, then a lookup is performed to return the namespace name. 
This template also looks at the namespaces in effect in the current context.&lt;pre&gt;&amp;lt;!-- return the namespace name --&amp;gt;&lt;br /&gt;&amp;lt;xsl:template name="resolve-ns-identifier"&amp;gt;&lt;br /&gt;    &amp;lt;xsl:param name="ns-identifier"/&amp;gt;&lt;br /&gt;    &amp;lt;xsl:choose&amp;gt;&lt;br /&gt;        &amp;lt;xsl:when test="namespace::node()[.=$ns-identifier]"&amp;gt;&lt;br /&gt;            &amp;lt;xsl:value-of select="$ns-identifier"/&amp;gt;&lt;br /&gt;        &amp;lt;/xsl:when&amp;gt;&lt;br /&gt;        &amp;lt;xsl:when test="namespace::node()[local-name()=$ns-identifier]"&amp;gt;&lt;br /&gt;            &amp;lt;xsl:value-of select="namespace::node()[local-name()=$ns-identifier]"/&amp;gt;&lt;br /&gt;        &amp;lt;/xsl:when&amp;gt;&lt;br /&gt;        &amp;lt;xsl:otherwise&amp;gt;&lt;br /&gt;            &amp;lt;xsl:message terminate="yes"&amp;gt;&lt;br /&gt;                Error: "&amp;lt;xsl:value-of select="$ns-identifier"/&amp;gt;" is neither a valid namespace prefix nor a valid namespace name.&lt;br /&gt;            &amp;lt;/xsl:message&amp;gt;&lt;br /&gt;        &amp;lt;/xsl:otherwise&amp;gt;&lt;br /&gt;    &amp;lt;/xsl:choose&amp;gt;&lt;br /&gt;&amp;lt;/xsl:template&amp;gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-7006951792989834499?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/7006951792989834499/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=7006951792989834499' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7006951792989834499'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7006951792989834499'/><link 
rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/06/working-with-namespaces-and-namespace.html' title='Working with namespaces and namespace prefixes in XSLT 1.0'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-3322503370403668865</id><published>2011-06-18T10:22:00.004+02:00</published><updated>2011-06-18T11:22:30.823+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='AGPL'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Roxie'/><category scheme='http://www.blogger.com/atom/ns#' term='Thor'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache v2 license'/><category scheme='http://www.blogger.com/atom/ns#' term='ECL'/><category scheme='http://www.blogger.com/atom/ns#' term='Big Data'/><category scheme='http://www.blogger.com/atom/ns#' term='GPL'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='ETL'/><category scheme='http://www.blogger.com/atom/ns#' term='Sqoop'/><category scheme='http://www.blogger.com/atom/ns#' term='Pig'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='HPCC Systems'/><category scheme='http://www.blogger.com/atom/ns#' term='Hive'/><title type='text'>HPCC vs Hadoop at a glance</title><content type='html'>Yesterday I noticed &lt;a href="http://twitter.com/#!/andreisavu/status/81289141073096704"&gt;this 
tweet&lt;/a&gt; by &lt;a href="http://twitter.com/#!/andreisavu"&gt;Andrei Savu&lt;/a&gt;: &lt;a href="http://twitter.com/#!/andreisavu/status/81289141073096704"&gt;&lt;img border="0" src="http://farm6.static.flickr.com/5155/5844277649_04ea4fba67_z.jpg"/&gt;&lt;/a&gt;. This prompted me to read the related &lt;a href="http://gigaom.com/cloud/lexisnexis-open-sources-its-hadoop-killer/"&gt;GigaOM article&lt;/a&gt; and then check out the &lt;a href="http://hpccsystems.com/"&gt;HPCC Systems website&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If you're too lazy to read the article or visit that website:&lt;blockquote&gt;HPCC (High Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems. The platform is now Open Source!&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;HPCC Systems &lt;a href="http://hpccsystems.com/Why-HPCC/HPCC-vs-Hadoop"&gt;compares itself&lt;/a&gt; to &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt;, which I think is completely justified in terms of functionality. Its product originated as a homegrown solution of &lt;a href="https://www.lexisnexis.com/risk/"&gt;LexisNexis Risk Solutions&lt;/a&gt; allowing its customers (banks, insurance companies, law enforcement and federal government) to quickly analyze billions of records, and as such it has been in use for a decade or so. It is now open sourced, and I already heard &lt;a href="http://blog.pentaho.com/2011/06/17/spotlight-this-week-lexisnexis-infobright-openbi/"&gt;an announcement&lt;/a&gt; that &lt;a href="http://www.pentaho.org"&gt;Pentaho&lt;/a&gt; is its major &lt;a href="http://hpccsystems.com/partners/business-intelligence"&gt;Business Intelligence Partner&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Based on the limited information available, I made a quick analysis, which I emailed to the HPCC Systems CTO, &lt;a href="http://www.linkedin.com/in/armandoescalante"&gt;Armando Escalante&lt;/a&gt;. 
My friend &lt;a href="http://www.tholis.com/"&gt;Jos van Dongen&lt;/a&gt; said it was a good analysis and told me I should post it. Now, I don't really have time to make a nice blog post out of it, but I figured it can't hurt to just repeat what I said in my emails. So here goes:&lt;br /&gt;&lt;br /&gt;Just going by the documentation, I see two real unique selling points in HPCC Systems as compared to Hadoop:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Real-time query performance (as opposed to only analytic jobs). HPCC offers two different setups, labelled Thor and Roxie. In terms of functionality, Thor should be compared to a Map/Reduce cluster like Hadoop: it's good for doing fairly long running analyses on large volumes of data. Roxie is a different beast, designed to offer fast data access, supporting ad-hoc real-time queries&lt;/li&gt;&lt;li&gt;Integrated toolset (as opposed to hodgepodge of third party tools). We're talking about an IDE, job monitoring, code repository, scheduler, configuration manager, and whatnot. These really look like big productivity boosters, which may make Big Data processing a lot more accessible to companies that don't have the kind of development teams required to work with Hadoop.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;(there may be many more benefits, but these are just the ones I could clearly distill from the press release and the website)&lt;br /&gt;&lt;br /&gt;Especially for Business Intelligence, Roxie may be a big thing. If real-time Big Data queries could be integrated with Business Intelligence OLAP and reporting tools, then this is certainly a big thing. 
I can't disclose the details but I have trustworthy information that integration with Pentaho's Analysis Engine, the &lt;a href="http://mondrian.pentaho.com/"&gt;Mondrian ROLAP engine&lt;/a&gt;, is underway and will be available as an Enterprise feature.&lt;br /&gt;&lt;br /&gt;A few things that look different but which may not matter too much when looking at HPCC and Hadoop from a distance:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://hpccsystems.com/faq/ecl-language-and-developers"&gt;ECL&lt;/a&gt;, the "Enterprise Control Language", which is a declarative query language (as opposed to just Map/Reduce). This initially seems like a big difference but Hadoop has tools like &lt;a href="http://pig.apache.org/"&gt;pig&lt;/a&gt; and &lt;a href="http://incubator.apache.org/projects/sqoop.html"&gt;sqoop&lt;/a&gt; and &lt;a href="http://hive.apache.org/"&gt;hive&lt;/a&gt;. Now, it could be that ECL is vastly superior to these hadoop tools, but my hunch is you'd have to be careful in how you position that. If you choose a head-on strategy in promoting ECL as opposed to pig, then the chances are that people will just spend their energy in discovering the things that pig can do and ECL cannot (not sure if those features actually exist, but that is what hadoop fanboys will look for), and in addition, the pig developers might simply clone the unique ECL features and the leveling of that playing field will just be a matter of time. This does not mean you shouldn't promote ECL - on the contrary, if you feel it is a more productive language than pig or any other hadoop tool, then by all means let your customers and prospects know. Just be careful and avoid downplaying the hadoop equivalents because that strategy could backfire.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Windows support. It's really nice that HPCC Systems is available for Microsoft Windows; that makes adoption a lot easier for Microsoft shops (and there are a lot of them). 
That said, customers that really have a big-data problem will solve it no matter what their internal software policies are. So they'd happily start running hadoop on linux if that solves their problems.&lt;/li&gt;&lt;li&gt;Maturity. On paper HPCC looks more mature than hadoop. It's hard to tell how much that matters though because hadoop has all the momentum. People might choose hadoop because they anticipate that the maturity will come thanks to the sheer number of developers committing to that platform.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;The only areas I can think of where HPCC looks like it has a disadvantage as compared to Hadoop are adoption rate and licensing. I hope these will prove not to be significant hurdles for HPCC, but I think that these might be bigger problems than they seem. Especially the &lt;a href="http://www.gnu.org/licenses/agpl.html"&gt;AGPL licensing&lt;/a&gt; seems problematic to me.&lt;br /&gt;&lt;br /&gt;The AGPL is not well regarded by anyone I know - not in the open source world. The general idea seems to be that even more than plain GPL3 it restricts how the software may be used. If the goal of open sourcing HPCC is to gain mindshare and a developer community (something that hadoop has done and is doing extremely well) then a more permissive license is really the way to go.&lt;br /&gt;&lt;br /&gt;If you look at products like MySQL but also Pentaho - they are both very strongly corporately led products. They have a good number of users, but few contributions from outside the company, and this is probably due to a combination of GPL licensing and the additional requirement for handing over the copyright of any contributions to the company. Hence these products don't really benefit from an open source development model (or at least not as much as they could). 
For these companies, open source may help initially to gain a lot of users, but those are for the most part users that just want a free ride: conversion rates to enterprise edition customers are quite low. It might be enough to make a decent buck, but eventually you'll hit a cap on how far you can grow. I'm not saying this is bad - you only need to grow as much as you have to, but it is something to be aware of.&lt;br /&gt;&lt;br /&gt;Contrast this to Hadoop. They have an &lt;a href="http://www.apache.org/licenses/LICENSE-2.0.html"&gt;Apache 2.0 permissive license&lt;/a&gt;, and this results in many individuals but also companies contributing to the project. And there are still companies like &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt; that manage to make a good living off of the services around their distribution of Hadoop. You don't lose the ability to develop add-ons either with this model - Apache 2.0 allows all that. The difference with GPL (and AGPL) of course is that it also allows this for other users and companies. 
So the trick to stay on top in this model is to simply offer the best product (as opposed to being the sole holder of the copyright to the code).&lt;br /&gt;&lt;br /&gt;Anyway - that is it for now - I hope this is helpful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-3322503370403668865?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/3322503370403668865/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=3322503370403668865' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3322503370403668865'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3322503370403668865'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/06/hpcc-vs-hadoop-at-glance.html' title='HPCC vs Hadoop at a glance'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm6.static.flickr.com/5155/5844277649_04ea4fba67_t.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-669675250590460288</id><published>2011-05-30T17:50:00.001+02:00</published><updated>2011-05-31T10:42:58.628+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Webservices'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho Kettle Solutions'/><category scheme='http://www.blogger.com/atom/ns#' term='JTidy'/><category scheme='http://www.blogger.com/atom/ns#' term='XML'/><category scheme='http://www.blogger.com/atom/ns#' term='Tidy'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='data integration'/><title type='text'>Cleaning webpages with Pentaho Data Integration and JTidy</title><content type='html'>Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. &lt;a href="http://sourceforge.net/projects/pentaho/" target="pentaho"&gt;Pentaho&lt;/a&gt; Data Integration (&lt;a href="http://kettle.pentaho.com/" target="kettle"&gt;kettle&lt;/a&gt;) has lots of functionality on-board to make this an easy process, except one: it does not support reading data directly from HTML. &lt;br /&gt;&lt;br /&gt;In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract its data using the conventional "Get data from XML" step. 
The solution hinges on two ingredients: &lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://sourceforge.net/projects/jtidy/" target="tidy"&gt;JTidy&lt;/a&gt;, the Java port of the popular &lt;a href="http://tidy.sourceforge.net/" target="tidy"&gt;HTML Tidy&lt;/a&gt; program, which is an open source HTML cleaning program originally created by &lt;a href="http://www.w3.org/People/Raggett/" target="tidy"&gt;Dave Raggett&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The &lt;a href="http://wiki.pentaho.com/display/EAI/User+Defined+Java+Class" target="kettle"&gt;user-defined Java Class step&lt;/a&gt;, which lets us write a small snippet of Java that we need to interface with JTidy&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Standard Kettle tools for Webservices&lt;/h3&gt;Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in &lt;a href="http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177" target="amazon"&gt;Pentaho Kettle Solutions&lt;/a&gt;). But when you're dealing with plain old HTML, things can get pretty hairy. &lt;br /&gt;&lt;br /&gt;If you're lucky, the page may be in XHTML, and in that case it's worth trying the &lt;a href="http://wiki.pentaho.com/display/EAI/Get+Data+From+XML" target="kettle"&gt;Get Data from XML&lt;/a&gt; step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like &lt;code&gt;&amp;amp;nbsp;&lt;/code&gt; entities, which are valid in XHTML, but not in plain XML. And of course, more often than not, you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.&lt;h3&gt;Workaround: JavaScript string manipulation&lt;/h3&gt;In the past, I usually worked around these issues. 
In practice, some quick and dirty string manipulation using the &lt;a href="http://wiki.pentaho.com/display/EAI/Modified+Java+Script+Value" target="kettle"&gt;Modified Javascript Value&lt;/a&gt; step and the built-in &lt;code&gt;indexOf()&lt;/code&gt;, &lt;code&gt;substring()&lt;/code&gt; and &lt;code&gt;replace()&lt;/code&gt; functions goes a long way. &lt;br /&gt;&lt;br /&gt;In most cases I don't really need the entire web page, but only a &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;ul&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;ol&amp;gt;&lt;/code&gt; element in the &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt;. Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular &lt;code&gt;class&lt;/code&gt; attribute, we can use a JavaScript snippet like this:&lt;pre&gt;&lt;br /&gt;&lt;span style="color:silver"&gt;//table we're looking for&lt;/span&gt;&lt;br /&gt;&lt;span style="color:blue"&gt;var&lt;/span&gt; startHandle = &lt;span style="color:red"&gt;"&amp;lt;table class=\"lvw\" cellpadding=0 cellspacing=0&amp;gt;"&lt;/span&gt;;&lt;br /&gt;&lt;span style="color:blue"&gt;var&lt;/span&gt; startPosition= html.indexOf(startHandle);&lt;br /&gt;&lt;span style="color:silver"&gt;//look beyond the start tag to lose the invalid unquoted attributes&lt;/span&gt;&lt;br /&gt;startPosition += startHandle.length;&lt;br /&gt;&lt;br /&gt;&lt;span style="color:silver"&gt;//find where this table ends (lucky us, no nested table elements :)&lt;/span&gt;&lt;br /&gt;&lt;span style="color:blue"&gt;var&lt;/span&gt; endHandle = &lt;span style="color:red"&gt;"&amp;lt;/table&amp;gt;"&lt;/span&gt;;&lt;br /&gt;&lt;span style="color:blue"&gt;var&lt;/span&gt; endPosition = html.indexOf(endHandle, startPosition);&lt;br /&gt;&lt;br /&gt;&lt;span style="color:silver"&gt;//make a complete table 
fragment out of it again&lt;/span&gt;&lt;br /&gt;&lt;span style="color:blue"&gt;var&lt;/span&gt; table = &lt;span style="color:red"&gt;"&amp;lt;table&amp;gt;"&lt;/span&gt; + html.substring(startPosition, endPosition + endHandle.length);&lt;br /&gt;&lt;br /&gt;&lt;span style="color:silver"&gt;//replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes&lt;/span&gt;&lt;br /&gt;table = table.replace(&lt;span style="color:red"&gt;/&amp;amp;nbsp;|&amp;lt;img[^&amp;gt;]*&gt;|nowrap/ig&lt;/span&gt;, &lt;span style="color:red"&gt;""&lt;/span&gt;);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.&lt;h3&gt;A better way: using JTidy&lt;/h3&gt;While the JavaScript workaround may just work for your particular case, it certainly has disadvantages. Sometimes it is just not that simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow. &lt;br /&gt;&lt;br /&gt;Fortunately, there is a better way.&lt;br /&gt;&lt;br /&gt;Using a &lt;a href="http://wiki.pentaho.com/display/EAI/User+Defined+Java+Class" target="kettle"&gt;user-defined Java Class step&lt;/a&gt; we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's &lt;a href="http://wiki.pentaho.com/display/EAI/Get+Data+From+XML" target="kettle"&gt;Get Data from XML&lt;/a&gt; step. &lt;br /&gt;&lt;br /&gt;We need to do two things to make this work: first, you have to &lt;a href="http://sourceforge.net/projects/jtidy/" target="tidy"&gt;download JTidy&lt;/a&gt;, unzip it, and place the &lt;code&gt;jtidy-r938.jar&lt;/code&gt; in the &lt;code&gt;libext&lt;/code&gt; directory, which sits directly inside your Kettle installation directory. 
(Note that if Spoon was already running, you need to restart it before the jar is picked up.) Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the &lt;code&gt;Tidy&lt;/code&gt; class inside the jar. With some help from the &lt;a href="http://wiki.pentaho.com/display/EAI/User+Defined+Java+Class" target="kettle"&gt;pentaho wiki&lt;/a&gt; and the &lt;a href="http://jtidy.sourceforge.net/apidocs/org/w3c/tidy/ant/JTidyTask.html" target="tidy"&gt;JTidy JavaDoc&lt;/a&gt; documentation, I came up with the following Java snippet to make it work:&lt;pre&gt;&lt;br /&gt;&lt;span style="color:blue"&gt;import&lt;/span&gt; org.w3c.tidy.Tidy;&lt;br /&gt;&lt;span style="color:blue"&gt;import&lt;/span&gt; java.io.StringReader;&lt;br /&gt;&lt;span style="color:blue"&gt;import&lt;/span&gt; java.io.StringWriter;&lt;br /&gt;&lt;br /&gt;&lt;span style="color:blue"&gt;protected&lt;/span&gt; Tidy tidy;&lt;br /&gt;&lt;br /&gt;&lt;span style="color:blue"&gt;public boolean&lt;/span&gt; init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)&lt;br /&gt;{&lt;br /&gt;    &lt;span style="color:silver"&gt;//create and configure a Tidy instance&lt;/span&gt;&lt;br /&gt;    tidy = &lt;span style="color:blue"&gt;new&lt;/span&gt; Tidy();&lt;br /&gt;    tidy.setXmlOut(&lt;span style="color:blue"&gt;true&lt;/span&gt;);&lt;br /&gt;    &lt;span style="color:blue"&gt;return&lt;/span&gt; parent.initImpl(stepMetaInterface, stepDataInterface);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;span style="color:blue"&gt;public boolean&lt;/span&gt; processRow(StepMetaInterface smi, StepDataInterface sdi) &lt;span style="color:blue"&gt;throws&lt;/span&gt; KettleException {&lt;br /&gt;    Object[] r;&lt;br /&gt;    &lt;span style="color:silver"&gt;//Get row from incoming stream.&lt;/span&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//Bail out if it's not there.&lt;/span&gt;&lt;br /&gt;    &lt;span style="color:blue"&gt;if&lt;/span&gt; ((r = getRow()) 
== &lt;span style="color:blue"&gt;null&lt;/span&gt;) {&lt;br /&gt;       setOutputDone();&lt;br /&gt;       &lt;span style="color:blue"&gt;return false&lt;/span&gt;;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//read the value of the html input field&lt;/span&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//the html field happens to be the 5th field in the stream, &lt;/span&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//because java arrays start at 0, we use index 4 to reference it&lt;/span&gt;&lt;br /&gt;    StringReader html = &lt;span style="color:blue"&gt;new&lt;/span&gt; StringReader((String)r[&lt;span style="color:red"&gt;4&lt;/span&gt;]);&lt;br /&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//use tidy to parse html to xml&lt;/span&gt;&lt;br /&gt;    StringWriter xml = &lt;span style="color:blue"&gt;new&lt;/span&gt; StringWriter();&lt;br /&gt;    tidy.parse(html, xml);&lt;br /&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//assign the xml to the output row&lt;/span&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//note we simply overwrite the original html field from the input row.&lt;/span&gt;&lt;br /&gt;    r[4] = xml.toString();&lt;br /&gt;&lt;br /&gt;    &lt;span style="color:silver"&gt;//push the output row to the outgoing stream.&lt;/span&gt;&lt;br /&gt;    putRow(data.outputRowMeta, r);&lt;br /&gt;    &lt;span style="color:blue"&gt;return true&lt;/span&gt;;&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;(Tip: for more examples and background information on the user-defined java class step, check out the excellent blog posts by &lt;a href="http://type-exit.org/adventures-with-open-source-bi/2010/10/the-user-defined-java-class-step/" target="kettle"&gt;Slawomir Chodnicki&lt;/a&gt;, &lt;a target="kettle" href="http://www.ibridge.be/?p=180"&gt;Matt Casters&lt;/a&gt; and the &lt;a href="http://people.mozilla.com/~deinspanjer/KettleJSPerformance.mov" target="kettle"&gt;video walk-through&lt;/a&gt; by &lt;a 
href="http://daniele.livejournal.com/78409.html" target="kettle"&gt;Daniel Einspanjer&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-669675250590460288?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/669675250590460288/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=669675250590460288' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/669675250590460288'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/669675250590460288'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/05/using-tidy-to-clean-webpages-with.html' title='Cleaning webpages with Pentaho Data Integration and JTidy'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-8579546996780303965</id><published>2011-05-18T14:05:00.003+02:00</published><updated>2011-05-18T14:08:25.744+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><title type='text'>Check out this excellent article by Shlomi Noach!!</title><content type='html'>Check out &lt;a 
href="http://code.openark.org/blog/mysql/tales-of-the-trade-4-how-to-author-a-super-successful-mysql-blog"&gt;this excellent article&lt;/a&gt; by Shlomi Noach!!&lt;br /&gt;&lt;br /&gt;Really - my life is much happier now, and as a bonus I got a free set of steak knives and even lost 20 pounds. (and MySQL!)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-8579546996780303965?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/8579546996780303965/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=8579546996780303965' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8579546996780303965'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8579546996780303965'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/05/check-out-this-excellent-article-by.html' title='Check out this excellent article by Shlomi Noach!!'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-334651073332155541</id><published>2011-05-10T10:55:00.005+02:00</published><updated>2011-05-10T12:39:44.168+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category scheme='http://www.blogger.com/atom/ns#' term='ETL'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Building 
Pentaho Solutions&quot;'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='BI'/><title type='text'>Managing kettle job configuration</title><content type='html'>Over time I've developed a habit of making a configuration file for my kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done but against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, webservice URLs and file system paths.&lt;br /&gt;&lt;h3&gt;Variables&lt;/h3&gt;&lt;br /&gt;In the past, much has been written about using kettle variables, parameters and arguments. Variables are the basic feature that provides the mechanism to configure the transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm just focusing on how to manage the configuration once you have already used variable references inside your jobs and transformations.&lt;br /&gt;&lt;h3&gt;Managing configuration&lt;/h3&gt;&lt;br /&gt;To manage the configuration, I typically start the main job with a &lt;code&gt;set-variables.ktr&lt;/code&gt; transformation. This transformation reads configuration data from a &lt;code&gt;config.properties&lt;/code&gt; file and assigns it to the variables so any subsequent jobs and transformations can access the configuration data through variable references. 
The main job has one parameter called &lt;code&gt;${CONFIG_DIR}&lt;/code&gt; which has to be set by the caller so the &lt;code&gt;set-variables.ktr&lt;/code&gt; transformation knows where to look for its &lt;code&gt;config.properties&lt;/code&gt; file:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3448/5705931619_af204fa2b0_b.jpg"/&gt;&lt;br /&gt;&lt;h3&gt;Reading configuration properties&lt;/h3&gt;&lt;br /&gt;The &lt;code&gt;config.properties&lt;/code&gt; file is just a list of key/value pairs separated by an equals sign. Each key is a variable name, and each value is the value to assign to that variable. The following snippet should give you an idea:&lt;pre&gt;#staging database connection&lt;br /&gt;STAGING_DATABASE=staging&lt;br /&gt;STAGING_HOST=localhost&lt;br /&gt;STAGING_PORT=3351&lt;br /&gt;STAGING_USER=staging&lt;br /&gt;STAGING_PASSWORD=$74g!n9&lt;/pre&gt; The &lt;code&gt;set-variables.ktr&lt;/code&gt; transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3404/5705931621_c121745dcb_b.jpg"/&gt;&lt;br /&gt;&lt;h3&gt;Pivoting key/value pairs to use the "set variables" step&lt;/h3&gt;&lt;br /&gt;In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from selected fields in the incoming stream and assigning the field value to it. This means that you can't just feed the stream of key/value pairs from the property input step into the set variables step: the stream coming out of the property input step contains multiple rows with just two fields called "Key" and "Value". Feeding it directly into the "Set variables" step would just lead to creating two variables called Key and Value, and they would be assigned values multiple times for all key/value pairs in the stream. 
So in order to meaningfully assign variables, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream using the "Row Denormaliser" step:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm3.static.flickr.com/2668/5706742692_77efb3516c_z.jpg"/&gt;As you can see in the screenshot, "Key" is the key field: the value of this field is scanned to determine in which output fields to put the corresponding value. There are no fields that make up a grouping: rather, we want all key/value pairs to end up in one big row. Or put another way, there is just one group comprising all key/value pairs. Finally, the grid below specifies for each distinct value of the "Key" field to which output field name it should be mapped, and in all cases, we want the value of the "Value" field to be stored in those fields. &lt;br /&gt;&lt;h3&gt;Drawbacks&lt;/h3&gt;&lt;br /&gt;There are two important drawbacks to this approach:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;The "Row Denormaliser" uses the value of the keys to map the value to a new field. This means that we have to type the name of each and every variable appearing in the &lt;code&gt;config.properties&lt;/code&gt; file. 
So you need to keep the &lt;code&gt;config.properties&lt;/code&gt; file and the "Denormaliser" synchronized manually, and in practice it's very easy to make mistakes here.&lt;/li&gt;&lt;li&gt;Due to the fact that the "Row Denormaliser" step literally needs to know all variables, the &lt;code&gt;set-variables.ktr&lt;/code&gt; transformation becomes specific for just one particular project.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt; Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the &lt;code&gt;set-variables.ktr&lt;/code&gt; transformation has to know all variable names anyway, I was tempted to store the configuration values themselves also inside the transformation (using a "generate rows" or "data grid" step or something like that), and "simply" make a new &lt;code&gt;set-variables.ktr&lt;/code&gt; transformation for every environment. Of course, that didn't feel right either.&lt;br /&gt;&lt;h3&gt;Solution: Javascript&lt;/h3&gt;&lt;br /&gt;As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new &lt;code&gt;set-variables.ktr&lt;/code&gt; transformation now looks like this:&lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3411/5705931627_623955f9e9_b.jpg"/&gt;&lt;br /&gt;&lt;br /&gt;The actual variable assignment is done with Kettle's built-in &lt;code&gt;setVariable(key, value, scope)&lt;/code&gt;. The key and value from the incoming stream are passed as the key and value arguments of the &lt;code&gt;setVariable()&lt;/code&gt; function. 
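&lt;br /&gt;&lt;br /&gt;A minimal sketch of what the script in the JavaScript step boils down to is given here. The stub function and the sample rows are hypothetical stand-ins, added only so the snippet can run outside Kettle; the field names "Key" and "Value" are assumed to match the output of the "Property Input" step:

```javascript
// Stand-in for Kettle's built-in setVariable(). Inside a real JavaScript
// step Kettle provides this function, so the stub must be left out there.
var variables = {};
function setVariable(key, value, scope) {
  variables[key] = { value: value, scope: scope };
}

// Hypothetical rows, shaped like the key/value stream described above.
var rows = [
  { Key: "STAGING_HOST", Value: "localhost" },
  { Key: "STAGING_PORT", Value: "3351" }
];

// One call per key/value pair; the third argument is the variable scope.
rows.forEach(function (row) {
  setVariable(row.Key, row.Value, "r");
});
```

Inside Kettle the loop should not be needed either: as far as I know the step script runs once for every incoming row, which would leave just the single &lt;code&gt;setVariable(Key, Value, "r")&lt;/code&gt; call.&lt;br /&gt;&lt;br /&gt;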
The third argument of the &lt;code&gt;setVariable()&lt;/code&gt; function is a string that identifies the scope of the variable, and must have one of the following values: &lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;"s"&lt;/code&gt; - system-wide&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;"r"&lt;/code&gt; - up to the root&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;"p"&lt;/code&gt; - up to the parent job of this transformation&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;"g"&lt;/code&gt; - up to the grandparent job of this transformation&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;For my purpose, I settle for &lt;code&gt;"r"&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;The bonus is that this &lt;code&gt;set-variables.ktr&lt;/code&gt; is less complex than the previous one and is now even completely independent of the content of the configuration. It has become a reusable transformation that you can use over and over.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-334651073332155541?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/334651073332155541/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=334651073332155541' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/334651073332155541'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/334651073332155541'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/05/managing-kettle-job-configuration.html' title='Managing kettle job configuration'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3448/5705931619_af204fa2b0_t.jpg' height='72' width='72'/><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-3280511797730830280</id><published>2011-01-26T11:40:00.007+01:00</published><updated>2011-01-26T22:44:17.080+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Building Pentaho Solutions&quot;'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Jaspersoft'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>NoSQL support lands in JasperSoft</title><content type='html'>&lt;a target="jasper" href="http://www.jaspersoft.com/"&gt;JasperSoft&lt;/a&gt;, one of the leading open source BI suites, &lt;a target="jasper" href="http://www.jaspersoft.com/announcing-bigdata"&gt;just announced&lt;/a&gt; it is delivering connectors for a range of so-called NoSQL databases. The big names are all there: Cassandra, MongoDB, Riak, HBase, CouchDB, Neo4J, Infinispan, VoltDB and Redis.&lt;br /&gt;&lt;br /&gt;I used to explain to people that the lack of SQL support in NoSQL databases poses a challenge for traditional Business Intelligence tools, because those all talk either SQL or MDX (and maybe some XQuery/XPath). With this development, this is no longer true, and I want to congratulate JasperSoft on spearheading this innovation.&lt;br /&gt;&lt;br /&gt;I still have a number of reservations though. 
Although I personally value the ability to report on data in my NoSQL database, I think its usefulness will have a number of limitations that are worth considering. &lt;br /&gt;&lt;br /&gt;Admittedly I am not an expert in the NoSQL database field, but as far as my knowledge goes, both the &lt;a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html" target="nosql"&gt;dynamo&lt;/a&gt;-style key/value stores like &lt;a href="http://www.basho.com/Riak.html" target="nosql"&gt;Riak&lt;/a&gt;, and the &lt;a href="http://labs.google.com/papers/bigtable.html" target="nosql"&gt;Bigtable&lt;/a&gt;-style hashtable stores like &lt;a href="http://hbase.apache.org/" target="nosql"&gt;HBase&lt;/a&gt; and &lt;a href="http://cassandra.apache.org/" target="nosql"&gt;Cassandra&lt;/a&gt; can basically do two types of read operations: fetch a single object by key, or scan everything. The fetched object can be complex and contain a lot of data, and it would certainly be nice if you could run a report on that. The scan everything operation doesn't seem that useful at the report level: for all but trivial cases, you need considerable logic to make this scan useful, and I don't think a report is the right place for this. Apart from that, if the NoSQL solution was put in place because of the large data volume, then the report itself would probably need to be executed on a cluster just to achieve acceptable response time. I may be wrong but I don't think JasperReports supports that. &lt;br /&gt;&lt;br /&gt;So, for a full scan of those NoSQL databases, connectors at the data integration end seem more appropriate. 
I think the &lt;a href="http://www.pentaho.com/products/hadoop/" target="pentaho"&gt;integration&lt;/a&gt; of &lt;a href="http://hadoop.apache.org/" target="nosql"&gt;Hadoop&lt;/a&gt; with &lt;a href="http://www.pentaho.org" target="pentaho"&gt;Pentaho&lt;/a&gt; data integration (a.k.a. &lt;a href="http://kettle.pentaho.com/" target="pentaho"&gt;Kettle&lt;/a&gt;) is a step in the right direction, but of course only applicable if you're a Hadoop user.&lt;br /&gt;&lt;br /&gt;Another point is data quality. Typically reporting is done on a data warehouse or reporting environment where the data quality is kept in check by processing the raw data with data integration and quality tools. Directly reporting on any operational database can be problematic because you skip those checks. Because the NoSQL databases offer virtually no constraints, those checks are even more important. So to me this seems like another reason why NoSQL connectivity is more useful in the data integration tools.&lt;br /&gt;&lt;br /&gt;JasperSoft also offers connectivity for the &lt;a href="http://www.mongodb.org/" target="nosql"&gt;MongoDB&lt;/a&gt; and &lt;a href="http://couchdb.apache.org/" target="nosql"&gt;CouchDB&lt;/a&gt; document stores. I think that for raw reporting on the actual source documents, the same reservations apply as I mentioned in relation to the dynamo and Bigtable style solutions. But, there may be a few more possibilities here, at least for CouchDB.&lt;br /&gt;&lt;br /&gt;CouchDB has a feature called &lt;a href="http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views" target="nosql"&gt;views&lt;/a&gt;, which allows you to "query" the raw documents using a map/reduce job. I can certainly see why it'd be useful to build a report on top of that. 
Of course, you would still have to implement the logic to do a useful scan, and you would still have to deal with data quality issues, but you can do it in the map/reduce job, which seems a more appropriate place to handle this than a report.&lt;br /&gt;&lt;br /&gt;All in all, I think this is a promising development, and I should probably get my feet wet and try it out myself. But for now, I would recommend keeping it out of the wrecking tentacles of unaware business users :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-3280511797730830280?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/3280511797730830280/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=3280511797730830280' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3280511797730830280'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3280511797730830280'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/01/nosql-support-lands-in-jaspersoft.html' title='NoSQL support lands in JasperSoft'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-2085109314869449180</id><published>2011-01-07T02:30:00.004+01:00</published><updated>2011-01-08T11:46:18.532+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' 
term='mysql user conference'/><category scheme='http://www.blogger.com/atom/ns#' term='mql-to-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='Ajax'/><category scheme='http://www.blogger.com/atom/ns#' term='json'/><category scheme='http://www.blogger.com/atom/ns#' term='MQL'/><category scheme='http://www.blogger.com/atom/ns#' term='mash-up'/><title type='text'>MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part III</title><content type='html'>This is the third article in a series providing background information to &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17134" target="mysqlconf"&gt;my talk&lt;/a&gt; for the &lt;a href="http://en.oreilly.com/mysql2011/" target="mysqlconf"&gt;MySQL User's conference&lt;/a&gt;, entitled &lt;cite&gt;&lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17134" target="mysqlconf"&gt;MQL-to-SQL: a JSON-based Query Language for RDBMS Access from AJAX Applications&lt;/a&gt;&lt;/cite&gt;.&lt;br /&gt;&lt;br /&gt;In the &lt;a href="http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language.html"&gt;first installment&lt;/a&gt;, I introduced &lt;a href="http://www.freebase.com/" target="fb"&gt;freebase&lt;/a&gt;, an &lt;cite&gt;open shared database of the world's knowledge&lt;/cite&gt; and its JSON-based query language, the &lt;a href="http://wiki.freebase.com/wiki/MQL" target="fb"&gt;Metaweb Query Language&lt;/a&gt; (MQL, pronounced &lt;em&gt;Mickle&lt;/em&gt;). 
In addition, I discussed the &lt;a href="http://www.json.org/" target="json"&gt;JSON&lt;/a&gt; data format, its syntax, its relationship with the de-facto standard client side browser scripting language &lt;a href="http://en.wikipedia.org/wiki/JavaScript" target="json"&gt;JavaScript&lt;/a&gt;, and its increasing relevance for modern &lt;a href="http://en.wikipedia.org/wiki/Ajax_(programming)" target="json"&gt;AJAX&lt;/a&gt;-based web applications.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language_07.html"&gt;second installment&lt;/a&gt; provides a brief introduction to MQL as database query language, and how it compares to the de-facto standard query language for relational database systems (RDBMS), the &lt;a href="http://en.wikipedia.org/wiki/Sql" target="sql"&gt;Structured Query Language&lt;/a&gt; (SQL). I argued that MQL has some advantages over SQL, in particular for programming modern web applications. I mentioned the following reasons: &lt;ul&gt;&lt;li&gt;Since MQL is JSON, and JSON is JavaScript, it's a more natural fit for modern AJAX applications&lt;/li&gt;&lt;li&gt;MQL is almost trivial to parse, making it much easier to write tools such as editors, but also to implement advanced authorization policies&lt;/li&gt;&lt;li&gt;MQL is easy to generate: the structure of MQL queries is mirrored by their results. A fragment of the result can be easily augmented with a subquery making it easy to subsequently drill down into the retrieved dataset&lt;/li&gt;&lt;li&gt;MQL is more declarative than SQL. Both attributes and relationships are represented as JSON object properties and one need not and cannot specify join conditions to combine data from different types of objects. The "cannot" is actually A Good Thing because it means one cannot make any mistakes.&lt;/li&gt;&lt;li&gt;MQL is more focused on just the data. 
It largely lacks functions to transform data retrieved from the database, forcing application developers to do data processing in the application or middleware layer, not in the database.&lt;/li&gt;&lt;/ul&gt;In this article, I want to discuss common practices in realizing data access for applications, especially web applications, and how database query languages like SQL and MQL fit in there.&lt;h2&gt;Web Application Data Access Practices&lt;/h2&gt;After reading the introduction of this article, one might get the idea that I hate relational databases and SQL. I don't, I love them both! It's just that when developing database applications, especially for the web, SQL isn't helping much. Or rather, it's just one tiny cog in a massive clockwork that has to be set up again and again. Let me explain...&lt;h3&gt;The Data Access Problem&lt;/h3&gt;It just happens to be the case that I'm an application developer. Typically, I develop rich internet and intranet applications and somewhere along the line, a database is involved for storing and retrieving data. So, I need to put "something" in place that allows the user to interact with a database via the web browser.&lt;br /&gt;The way I write that I need "something" so the user can interact with the database, it seems like it's just one innocent little thing. In reality, "something" becomes a whole bunch of things that need to work together:&lt;ul&gt;&lt;li&gt;Browsers don't speak database wire protocols - they speak HTTP to back-end HTTP servers. No matter what, there is going to be some part that is accessible via HTTP that knows how to contact the database server. 
Examples of solutions to this part of the problem are &lt;a target="ws" href="http://en.wikipedia.org/wiki/Common_Gateway_Interface"&gt;Common Gateway Interface&lt;/a&gt; (CGI) programs, server-side scripting languages like PHP or Perl (which are often themselves implemented as a CGI program) or in the case of Java, specialized Servlet classes&lt;/li&gt;&lt;li&gt;The component at the HTTP server that mediates between web browser and database is going to require a protocol: a set of rules that determine how a URI, an HTTP method and parameters can lead to executing a database command. Examples of approaches to design such a protocol are &lt;a href="http://en.wikipedia.org/wiki/Remote_procedure_call" target="ws"&gt;Remote Procedure Calls&lt;/a&gt; (RPC) and &lt;a href="http://en.wikipedia.org/wiki/REST" target="ws"&gt;Representational State Transfer&lt;/a&gt; (REST)&lt;/li&gt;&lt;li&gt;There's the way back too: the application running in the web browser is going to have to understand the results coming back from the database. For data, a choice has to be made for a particular data exchange format, typically &lt;a href="http://en.wikipedia.org/wiki/XML" target="ws"&gt;eXtensible Markup Language&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/JSON" target="ws"&gt;JSON&lt;/a&gt;&lt;/li&gt;&lt;li&gt;The user interface at the browser end must not only understand the protocol we invented for the data exchange; ideally it should also guide the user and be able to validate whatever data the user is going to feed it. 
In other words, the user interface needs to have the metadata concerning the data exchange interface.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;The Webservice Solution&lt;/h3&gt;A typical way to tackle the data access problem is:&lt;ul&gt;&lt;li&gt;analyze the functionality of the application, categorizing it into a series of clear and isolated actions&lt;/li&gt;&lt;li&gt;identify which data flows from application to database and back for each action&lt;/li&gt;&lt;li&gt;decide on a data representation, and a scheme to identify actions and parameters&lt;/li&gt;&lt;li&gt;create one or more programs in Java, Perl, PHP, Python, Ruby or whatever fits your stack that can execute the appropriate tasks and process the associated data flows to implement the actions&lt;/li&gt;&lt;/ul&gt;For web applications, the program or programs developed in this way are typically &lt;em&gt;&lt;a href="http://en.wikipedia.org/wiki/Web_service" target="ws"&gt;webservices&lt;/a&gt;&lt;/em&gt; that run as part of the process of the HTTP server. The client gets to do an HTTP request to a particular URI, using the right HTTP method and the right parameters. The service gets to process the parameters, execute tasks such as accessing a database, and finally send back an HTTP response, which typically contains data requested by the application.&lt;h3&gt;Development Dynamics&lt;/h3&gt;The problem with the webservice approach is that it isn't very flexible. It presumes the application's functionality and hence the actions are quite well-defined. Although this looks reasonable on paper, in reality development tends to be quite evolutionary.&lt;br /&gt;&lt;br /&gt;Typically, the core functionality of applications is quite well defined, but often a lot of additional functionalities are required. Although we can pretend these could be known in advance if only we'd spend more time designing and planning in advance, in practice, they often aren't. 
It may seem sad, but in many cases, the best way to find out is simply to start developing, and find out along the way. Agile is the latest buzzword that captures some of these development dynamics, but there have been other buzzwords for it in the past, such as RAD (rapid application development) and DSDM (dynamic systems development method).&lt;br /&gt;&lt;br /&gt;The problem with this approach is that it requires a lot of going back-and-forth between front- and back-end development tasks: whenever the front-end wants to develop a new feature, it is often dependent upon the back-end offering a service for it. Front-end and back-end developers often are not the same people, so what we get is front-end development cycles having to wait on back-end development cycles to complete. Or in case front-end and back-end developers are the same person, they are constantly switching between tool sets and development environments. &lt;br /&gt;&lt;br /&gt;In part this is because front-end development is usually done in JavaScript, and although server-side JavaScript is gaining ground, the server-side is still dominated mainly by Java, PHP, C++ and ASP.NET. But it's not just a programming language problem - developing a client, and especially a front-end for end-users, presumes a very different mindset than developing a back-end process. Front-end development should focus on usability and quality user-experience; back-end development should focus on robustness, reliability, availability, scalability and performance. 
Although some of these aspects influence each other, in practice, front-end development is simply a different cup of tea than back-end development.&lt;h3&gt;A Simple Plan: Building a Query Service&lt;/h3&gt;There is a very simple solution that would largely solve the data access problem without dealing with the inefficiencies of the recurring development process: If we could build a single service that can accept any parameters, understand them, and somehow return an appropriate result, we would never have to add functionality to the service itself. Instead, the front end application would somehow have to construct the right parameters to tell the service what it wants whenever the need arises.&lt;br /&gt;&lt;br /&gt;This sounds almost like magic, right? So we must be kidding, right? Well, we're not kidding, and it's not magic either; it's more like a cheap parlour trick.&lt;br /&gt;As the title of this section suggests, a &lt;em&gt;query service&lt;/em&gt; fits this bill. &lt;br /&gt;&lt;br /&gt;It would be very easy to build a single service that accepts a query as a parameter, and returns its result as response. And seriously, it's not that strange an idea: many people use between one and perhaps ten or twenty different services exactly like this every day, multiple times... and it's called a search engine.&lt;br /&gt;&lt;br /&gt;Can't we use something like that to solve our database access problem? Well, we could. But actually, someone beat you to it already.&lt;h3&gt;DBSlayer, a Webservice for SQL Queries&lt;/h3&gt;A couple of years ago, The New York Times released &lt;a href="http://code.nytimes.com/projects/dbslayer" target="ws"&gt;DBSlayer&lt;/a&gt;. DBSlayer (DataBase accesS layer) is best described as an HTTP server that acts as a database proxy. It accepts regular SQL queries via a parameter in a regular HTTP &lt;code&gt;GET&lt;/code&gt; request, and sends an HTTP response that contains the resulting data as JSON. 
It currently supports only MySQL databases but announcements were made that support was planned for other database products too. DBSlayer is actually a bit more than just a database access layer, as it also supports simple failover and round-robin request distribution, which can be used to scale out database requests. But I mention it here, because it implements exactly the kind of query service that would appear to solve all the aforementioned problems.&lt;br /&gt;&lt;br /&gt;Or would it?&lt;br /&gt;&lt;br /&gt;Every web developer and every database administrator should realize immediately that it's not a good idea. At least, not for internet-facing applications anyway. The DBSlayer developers &lt;a href="http://ec2-75-101-154-248.compute-1.amazonaws.com/projects/dbslayer/wiki/AdvancedTopics#Security" target="ws"&gt;documented&lt;/a&gt; that themselves quite clearly:&lt;blockquote&gt;Access to the DBSlayer can be controlled via firewalls; the DBSlayer should never be exposed to the outside world.&lt;/blockquote&gt;... and ...&lt;blockquote&gt;The account DBSlayer uses to access the MySQL database should not be allowed to execute dangerous operations like dropping tables or deleting rows. Ideally, the account would only be able to run selects and/or certain stored procedures.&lt;/blockquote&gt;So there's the rub: it may be very easy and convenient from the application development point of view, but it is a horrendous idea when you think about security. A general purpose SQL query service is simply too powerful. &lt;br /&gt;&lt;br /&gt;If a web application accidentally allows arbitrary SQL to be executed, it would be called an &lt;a href="http://en.wikipedia.org/wiki/SQL_injection" target="ws"&gt;SQL injection&lt;/a&gt; vulnerability, and it would be (or at least, should be) treated as a major breach of security. 
Creating a service that offers exactly that behavior as a feature doesn't lessen the security concerns at all.&lt;h3&gt;What about a MQL query service?&lt;/h3&gt;In this article I tried to explain the problems that must be solved in order to arrange and manage data access for web applications. The key message is that we need to create a service that provides data access. But in doing so, we have to balance between security, functionality and flexibility. &lt;br /&gt;&lt;br /&gt;It is fairly easy to create a webservice that exactly fulfills a particular application requirement, thus ensuring security and manageability. However, this will usually be a very inflexible service, and it will need lots of maintenance to keep up with changes in application requirements. It's also easy to create a webservice that is at least as powerful as the underlying database: this would be a database proxy over HTTP, just like DBSlayer. Although it will likely never need to change since it simply passes requests on to the back-end database, it is very hard to secure it in a way that would allow external requests from possibly malignant users.&lt;br /&gt;&lt;br /&gt;I believe that a MQL webservice actually does offer the best of both worlds, without suffering from the disadvantages. A MQL query service will be flexible enough for most web applications - MQL queries are only limited by the underlying data model, not by the set of application-specific actions designed for one particular purpose. At the same time, it will be relatively easy to efficiently analyze MQL queries and apply policies to prevent malicious use. For example, checking that a MQL query doesn't join more than X tables is quite easy.&lt;br /&gt;&lt;br /&gt;In the forthcoming installment, I will explore the concept of a MQL webservice in more detail, and I will explain more about the MQL-to-SQL project. 
As always, I'm looking forward to your comments, suggestions and critique so don't hesitate to leave a comment.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-2085109314869449180?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/2085109314869449180/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=2085109314869449180' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/2085109314869449180'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/2085109314869449180'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language_4061.html' title='MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part III'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-7042325305385497085</id><published>2011-01-07T00:40:00.010+01:00</published><updated>2011-01-07T23:31:59.735+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql user conference'/><category scheme='http://www.blogger.com/atom/ns#' term='mql-to-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='Ajax'/><category scheme='http://www.blogger.com/atom/ns#' term='json'/><category scheme='http://www.blogger.com/atom/ns#' term='MQL'/><category 
scheme='http://www.blogger.com/atom/ns#' term='mash-up'/><title type='text'>MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part II</title><content type='html'>This is the second article in a series to provide some background to &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17134" target="mysqlconf"&gt;my talk&lt;/a&gt; for the &lt;a href="http://en.oreilly.com/mysql2011/" target="mysqlconf"&gt;MySQL User's conference&lt;/a&gt;.&lt;br /&gt;The conference will be held April 11-14 2011 in the Hyatt Regency hotel in Santa Clara, California.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Abstract:&lt;/b&gt; &lt;a href="http://www.freebase.com/docs/mql/ch03.html" target="fb"&gt;MQL&lt;/a&gt; is a &lt;a href="http://www.json.org/" target="json"&gt;JSON&lt;/a&gt;-based database query language that has some very interesting features as compared to SQL, especially for modern (&lt;a href="http://en.wikipedia.org/wiki/Ajax_(programming)" target="ajax"&gt;AJAX&lt;/a&gt;) web-applications. MQL is not a standard database query language, and currently only natively supported by &lt;a href="http://www.freebase.com/" target="fb"&gt;Freebase&lt;/a&gt;. However, with &lt;a href="http://code.google.com/p/mql-to-sql/"&gt;MQL-to-SQL&lt;/a&gt;, a project that provides a SQL adapter for MQL, you can run MQL queries against any RDBMS with a SQL interface.&lt;br /&gt;&lt;br /&gt;In &lt;a href="http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language.html"&gt;my previous post&lt;/a&gt;, I covered some background information on modern web applications, JavaScript and JSON. &lt;br /&gt;The topic of this installment is the MQL database query language and how it compares to SQL. 
In a third article, I will discuss the &lt;a href="http://code.google.com/p/mql-to-sql/"&gt;mql-to-sql project&lt;/a&gt;,&lt;br /&gt;which implements a MQL service for relational database systems.&lt;h3&gt;MQL Queries&lt;/h3&gt;A MQL query is either a JSON object, or a JSON array of objects. In fact, the sample object we discussed in the previous post's section about JSON is nearly a valid MQL query. For now, consider the following JSON representation of the chemical element Oxygen:&lt;pre&gt;{&lt;br /&gt;  "type": "/chemistry/chemical_element",&lt;br /&gt;  "name": "Oxygen",&lt;br /&gt;  "symbol": "O",&lt;br /&gt;  "atomic_number": 8,&lt;br /&gt;  "ionization_energy": 13.6181,&lt;br /&gt;  "melting_point": -2.1835e+2,&lt;br /&gt;  "isotopes": []&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;The difference with the example from &lt;a href="http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language.html"&gt;my previous post&lt;/a&gt; is that here, we added a key/value pair at the top, &lt;code&gt;"type": "/chemistry/chemical_element"&lt;/code&gt;, and we use an empty array (&lt;code&gt;[]&lt;/code&gt;) as the value for the &lt;code&gt;"isotopes"&lt;/code&gt; key.&lt;br /&gt;&lt;br /&gt;So how is this a query? Didn't we claim JSON is a data format? So isn't this by definition data? And if this is data, how can it be a query?&lt;br /&gt;&lt;br /&gt;Well, the answer is: yes, JSON is a data format, and so, yes: by definition, it must be data. But the paradox of how some data can also be a query is easily resolved once you realize that there are some pieces missing from the data. In this particular case, we left the &lt;code&gt;"isotopes"&lt;/code&gt; array empty, whereas Oxygen has a bunch of isotopes in the real world.&lt;br /&gt;&lt;br /&gt;This is how a JSON object can be a query:&lt;ul&gt;&lt;li&gt;By specifying values that describe known facts about an object, we define a filter or access path that can be used to find data in some database. 
For example, the JSON fragment above states that there exists an object of the type &lt;code&gt;"/chemistry/chemical_element"&lt;/code&gt; which has the name &lt;code&gt;"Oxygen"&lt;/code&gt;, and the symbol &lt;code&gt;"O"&lt;/code&gt;, among other characteristics.&lt;/li&gt;&lt;li&gt;By specifying special placeholders like &lt;code&gt;null&lt;/code&gt; or an empty array or object (like &lt;code&gt;[]&lt;/code&gt; and &lt;code&gt;{}&lt;/code&gt; respectively), we define what data should be retrieved from the database so it can be returned in the result.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Executing a query with the Freebase MQL query editor&lt;/h4&gt;&lt;br /&gt;The easiest way to execute the query above is by using the &lt;a href="http://www.freebase.com/queryeditor?query={%20%22query%22%3A%20{%20%22type%22%3A%20%22%2Fchemistry%2Fchemical_element%22%2C%20%22name%22%3A%20%22Oxygen%22%2C%20%22symbol%22%3A%20%22O%22%2C%20%22atomic_number%22%3A%208%2C%20%22ionization_energy%22%3A%2013.6181%2C%20%22melting_point%22%3A%20-218.35%2C%20%22isotopes%22%3A%20%5B%5D%20}%20}&amp;callback=cb1294354939787x764" target="fb"&gt;Freebase Query Editor&lt;/a&gt;. Here's a screenshot: &lt;a href="http://www.freebase.com/queryeditor?query={%20%22query%22%3A%20{%20%22type%22%3A%20%22%2Fchemistry%2Fchemical_element%22%2C%20%22name%22%3A%20%22Oxygen%22%2C%20%22symbol%22%3A%20%22O%22%2C%20%22atomic_number%22%3A%208%2C%20%22ionization_energy%22%3A%2013.6181%2C%20%22melting_point%22%3A%20-218.35%2C%20%22isotopes%22%3A%20%5B%5D%20}%20}&amp;callback=cb1294354939787x764"&gt;&lt;img border="0" src="http://farm6.static.flickr.com/5081/5331599292_e1fc995307_b.jpg"/&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In the screenshot, the upper left side is the area where you can type or paste your MQL query. You can then press the Run button, which is located below the query area on its right side. 
&lt;br /&gt;&lt;br /&gt;When you hit the Run button, the query is sent to a special webservice called &lt;a href="http://wiki.freebase.com/wiki/MQL_Manual/mqlread" target="fb"&gt;the mqlread service&lt;/a&gt;. The results appear in the results area on the right.&lt;h4&gt;MQL Query Results&lt;/h4&gt;In the Text tab of the results area of the MQL query editor, you can see the raw response. It should be something like this:&lt;pre&gt;{&lt;br /&gt;  "code": "/api/status/ok",&lt;br /&gt;  "result": {&lt;br /&gt;    "atomic_number": 8,&lt;br /&gt;    "ionization_energy": 13.6181,&lt;br /&gt;    "isotopes": [&lt;br /&gt;      "Oxygen-15",&lt;br /&gt;      "Oxygen-16",&lt;br /&gt;      "Oxygen-17",&lt;br /&gt;      "Oxygen-18",&lt;br /&gt;      "Oxygen-28",&lt;br /&gt;      "Oxygen-19",&lt;br /&gt;      "Oxygen-20",&lt;br /&gt;      "Oxygen-23",&lt;br /&gt;      "Oxygen-25",&lt;br /&gt;      "Oxygen-24",&lt;br /&gt;      "Oxygen-13",&lt;br /&gt;      "Oxygen-22",&lt;br /&gt;      "Oxygen-26",&lt;br /&gt;      "Oxygen-21",&lt;br /&gt;      "Oxygen-14",&lt;br /&gt;      "Oxygen-27",&lt;br /&gt;      "Oxygen-12"&lt;br /&gt;    ],&lt;br /&gt;    "melting_point": -218.35,&lt;br /&gt;    "name": "Oxygen",&lt;br /&gt;    "symbol": "O",&lt;br /&gt;    "type": "/chemistry/chemical_element"&lt;br /&gt;  },&lt;br /&gt;  "status": "200 OK",&lt;br /&gt;  "transaction_id": "cache;cache01.p01.sjc1:8101;2010-05-15T03:08:10Z;0004"&lt;br /&gt;}&lt;/pre&gt;As you can see, the response is again a JSON object. The outermost object is the &lt;em&gt;result envelope&lt;/em&gt;, which contains the actual query result assigned to the result property, as well as a few other fields with information about how the request was fulfilled. 
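&lt;br /&gt;&lt;br /&gt;In application code, unwrapping the envelope could look something like the following JavaScript sketch. The helper name &lt;code&gt;unwrapResult&lt;/code&gt; is hypothetical, and treating any &lt;code&gt;"code"&lt;/code&gt; value other than &lt;code&gt;"/api/status/ok"&lt;/code&gt; as a failure is an assumption based on the sample response above, not on the mqlread documentation:

```javascript
// Hypothetical helper: extracts the actual query result from the JSON text
// of a result envelope.
// Assumption: a successful response carries "/api/status/ok" in its "code"
// field, as in the sample response shown above.
function unwrapResult(responseText) {
  var envelope = JSON.parse(responseText);
  if (envelope.code !== "/api/status/ok") {
    throw new Error("mqlread request failed: " + envelope.code);
  }
  return envelope.result;
}

// A trimmed-down stand-in for the raw response text shown above.
var sampleResponse = JSON.stringify({
  code: "/api/status/ok",
  result: { name: "Oxygen", symbol: "O", atomic_number: 8 },
  status: "200 OK"
});
var oxygen = unwrapResult(sampleResponse);
```

After this call, &lt;code&gt;oxygen&lt;/code&gt; holds only the object that was assigned to the &lt;code&gt;"result"&lt;/code&gt; property, without the surrounding envelope fields.&lt;br /&gt;&lt;br /&gt;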
(Actually, when we hit the Run button, our MQL query was also first embedded into a &lt;em&gt;query envelope&lt;/em&gt; before it was sent to the mqlread service, but let's worry about those details later.)&lt;h4&gt;Query by Example, or filling in the blanks&lt;/h4&gt;If you compare the actual query result assigned to the &lt;code&gt;"result"&lt;/code&gt; property of the result envelope with the original query, you will notice that they are largely the same: at least, the keys of the query object and the result object all match up, and so do the values, except the one for the &lt;code&gt;"isotopes"&lt;/code&gt; key: whereas the &lt;code&gt;"isotopes"&lt;/code&gt; property was an empty array in the query, in the result it contains an array filled with elements, each representing a particular isotope of the Oxygen element. These array elements were retrieved from the Oxygen entry stored in Freebase, which was found by matching the properties like &lt;code&gt;"name": "Oxygen"&lt;/code&gt; and &lt;code&gt;"atomic_number": 8&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;This illustrates an important concept of MQL queries: &lt;em&gt;the query result mirrors the structure of the query itself&lt;/em&gt;. Wherever the query contains a special placeholder value, such as an empty array (&lt;code&gt;[]&lt;/code&gt;), a &lt;code&gt;null&lt;/code&gt; value or an empty object (&lt;code&gt;{}&lt;/code&gt;), the query engine will "fill in the blanks" with data that is retrieved from the database. &lt;br /&gt;&lt;br /&gt;Another way of putting it is to say a MQL query is a query by example. &lt;br /&gt;&lt;br /&gt;To drive this concept home, consider the following similar query. 
In this case, it contains even more special placeholder values (namely, &lt;code&gt;null&lt;/code&gt; values for all scalar properties except &lt;code&gt;"type"&lt;/code&gt; and &lt;code&gt;"name"&lt;/code&gt;):&lt;pre&gt;{&lt;br /&gt;  "type": "/chemistry/chemical_element",&lt;br /&gt;  "name": "Oxygen",&lt;br /&gt;  "symbol": null,&lt;br /&gt;  "atomic_number": null,&lt;br /&gt;  "ionization_energy": null,&lt;br /&gt;  "melting_point": null,&lt;br /&gt;  "isotopes": []&lt;br /&gt;}&lt;/pre&gt;If you execute this query, you get a result that is essentially the same as the previous one. Because chemical elements are identified by name (there are more properties that can identify a chemical element, name is just one of them), this query will match the same object in Freebase. This query specifies &lt;code&gt;null&lt;/code&gt; for almost all other properties, and because &lt;code&gt;null&lt;/code&gt; is a special placeholder for scalar values, the query engine responds by retrieving the values for those keys and returns them in the result.&lt;h3&gt;Differences between MQL and SQL&lt;/h3&gt;Even this simple example reveals a lot about the differences between MQL and SQL.&lt;h4&gt;Symmetry between Query and Result&lt;/h4&gt;In the previous example, we just saw that the query and its result have a similar structure. It's like the query is mirrored in the result. In SQL, this is very different. Let's see how different.&lt;br /&gt;&lt;br /&gt;Assume for a minute that we have a relational database with a schema called "chemistry" and a table called "chemical_element". 
We could write an SQL query like this:&lt;pre&gt;SELECT  name&lt;br /&gt;,       symbol&lt;br /&gt;,       atomic_number&lt;br /&gt;,       ionization_energy&lt;br /&gt;,       melting_point&lt;br /&gt;FROM    chemistry.chemical_element&lt;br /&gt;WHERE   name = 'Oxygen'&lt;/pre&gt;...and the result would look something like this:&lt;pre&gt;&lt;br /&gt;+--------+--------+---------------+-------------------+---------------+&lt;br /&gt;| name   | symbol | atomic_number | ionization_energy | melting_point |&lt;br /&gt;+--------+--------+---------------+-------------------+---------------+&lt;br /&gt;| Oxygen | O      |             8 |           13.6181 |       -218.35 | &lt;br /&gt;+--------+--------+---------------+-------------------+---------------+&lt;/pre&gt;Even if we forget for a moment that this particular SQL query doesn't handle retrieving the isotopes, the difference between the query and the result is striking - the structure of the SQL query, which is basically a piece of text, has very little to do with the structure of the result, which is unmistakably tabular.&lt;br /&gt; &lt;br /&gt;Everybody knows that SQL was designed for relational databases, so we probably shouldn't be too surprised that the result of the SQL query has a tabular form. But why does the query have to be text? Couldn't the query have been tabular too, leaving blanks where we'd expect to retrieve some data?&lt;br /&gt;&lt;br /&gt;Lest I lose the proponents of SQL and the relational model here: don't get me wrong - I realize that an SQL query isn't just "a piece of text"; rather SQL attempts to express facts about database state using a practical form of relational algebra, and the result represents the set that satisfies the facts stipulated by the query. And as we shall see later on, MQL actually has a few operator constructs that resemble the ones found in SQL. 
But for now the message is: MQL queries and their results look a lot like one another; SQL queries and their results do not.&lt;h4&gt;Relational focus&lt;/h4&gt;I already mentioned that MQL is expressed in JSON, and we saw how both query and result are structured as objects. SQL results are always tabular. Superficially, this seems like the right thing to do for a query language that is supposed to work on and for relational database systems. But is it really that natural?&lt;br /&gt;&lt;br /&gt;Many textbooks and courses about relational databases and SQL spend much time and effort on the topic of normalization, and rightly so! One of the great achievements of the relational model and normalization is that it helps to minimize or even eliminate data integrity problems that arise in non-relational storage structures and their inherent redundancy.&lt;br /&gt;&lt;br /&gt;A typical textbook or course may start with a modelling exercise with some real-world, unnormalized (sub 1NF) data as input. The first mental step is to normalize that data to the first normal form (1NF), by splitting off multivalued attributes (a single data item with a list of values) and their big brother, repeating groups (a "table-inside-a-table"), to separate tables, on and on until all repeating groups and multi-valued attributes are eliminated. In the next phases of normalization, different forms of data redundancy are removed, moving up to even higher normal forms, typically finishing at the third normal form (3NF) or the Boyce-Codd normal form (BCNF).&lt;br /&gt;&lt;br /&gt;The result of this normalization process is an overview of all separate independent sets of data, with their keys and relationships. From there, it's usually a small step to a physical database design. For example, in our Oxygen example, the list of isotopes is a multi-valued attribute. 
In a proper relational design, the isotopes would typically end up in a separate table of their own, with a foreign key pointing to a table containing the chemical elements.&lt;br /&gt;&lt;br /&gt;The textbook usually continues with a crash course in SQL, and chances are the &lt;code&gt;SELECT&lt;/code&gt; syntax is the first item on the menu. Within a chapter or two, you'll learn that although normalized storage is great for data maintenance, it isn't great for data presentation. Because applications and end users care a lot about data presentation, you'll learn how to use the &lt;code&gt;JOIN&lt;/code&gt; operator to combine the results from related database tables. Usually the goal of that exercise is to end up with a result set that is typically 1NF, or at least, has a lower normal form than the source tables.&lt;br /&gt;&lt;br /&gt;Don't get me wrong: this is great stuff! The relational model isn't fighting redundant data: it's fighting data integrity issues, and the lack of control that results from storing redundant data. By first decomposing the data into independent sets for storage, and then using things like &lt;code&gt;JOIN&lt;/code&gt; operations to re-combine them, SQL offers a good method to master data integrity issues, and to deliver consistent, reliable data to applications and end-users.&lt;br /&gt;&lt;br /&gt;It's just a pity SQL forgot to finish what it started. Recall that the text-book modeling exercise started by eliminating repeating groups and multi-valued attributes to achieve 1NF. What SQL queries should we use to transform the data back to that format?&lt;br /&gt;&lt;br /&gt;SQL turns out to be so single-mindedly focused on the relational model that it simply can't return sub-1NF, unnormalized data. 
Much like &lt;a href="http://en.wikipedia.org/wiki/Flatland" target="fl"&gt;the square from flatland can't comprehend the sphere from spaceland&lt;/a&gt;, SQL simply hasn't got a clue about nested data structures, like multi-valued attributes and repeating groups.&lt;br /&gt;&lt;br /&gt;Somewhere along the way, SQL forgot that the text-book course started with real-world, unnormalized data, full of repeating groups and multivalued attributes.&lt;br /&gt;&lt;br /&gt;This poses quite a few challenges for database application development. One class of software solutions that deals with this problem is the so-called object-relational mappers (ORMs). It would not do the ORMs justice to claim that their only purpose is to solve this problem, but it's definitely a major problem they take care of.&lt;h4&gt;Parsing Queries&lt;/h4&gt;Everyone who has tried it knows that it isn't exactly trivial to write a fast yet fully functional SQL parser. Being able to parse SQL is a requirement for tools like query editors and report builders, but also for proxies and monitoring tools.&lt;br /&gt;&lt;br /&gt;Parsing MQL on the other hand is almost trivially simple. At the application level, this would in theory make it quite easy to implement advanced access policies and limit the complexity of the queries on a per user or role basis.&lt;h4&gt;Generating Queries&lt;/h4&gt;One of the things I like about MQL is that applications have to do a lot less work to formulate a query that drills down into some detail of a previously returned data set. For example, from the result of our query about Oxygen, we just learned that there is an isotope called &lt;code&gt;"Oxygen-16"&lt;/code&gt;. 
Suppose we want to know more about that particular isotope, say, its relative abundance, and whether it's stable or not.&lt;br /&gt;&lt;br /&gt;With SQL, we would have to construct a new algebraic expression that somehow combines the set of chemical elements with the set of isotopes. Although we would certainly need some data from the tabular result obtained from the previous query, we have little hope of actually re-using the query itself - at the application level, the SQL query is most likely just a piece of text, and it's probably not worth it to use string manipulation to forge a new query out of it.&lt;br /&gt;&lt;br /&gt;Here's an example which shows the parts that should be added to accommodate this requirement, just to show that such a change isn't localized to just one spot in the original query:&lt;pre&gt;SELECT  e.name&lt;br /&gt;,       e.symbol&lt;br /&gt;,       e.atomic_number&lt;br /&gt;,       e.ionization_energy&lt;br /&gt;,       e.melting_point&lt;br /&gt;&lt;ins style="background-color: yellow; font-weight: bold"&gt;,       i.name&lt;br /&gt;,       i.natural_abundance&lt;br /&gt;,       i.stable&lt;/ins&gt;&lt;br /&gt;FROM    chemistry.chemical_element e&lt;br /&gt;&lt;ins style="background-color: yellow; font-weight: bold"&gt;INNER JOIN chemistry.isotope i ON e.name = i.element_name&lt;/ins&gt;&lt;br /&gt;WHERE   e.name = 'Oxygen'&lt;br /&gt;&lt;ins style="background-color: yellow; font-weight: bold"&gt;AND     i.name = 'Oxygen-16'&lt;/ins&gt;&lt;/pre&gt;I won't let my head explode over the string manipulation code required to change the original query into this one. If you like, post clever solutions as a comment to this post :)&lt;br /&gt;&lt;br /&gt;With MQL, this task is considerably easier. The query and the result have a high degree of correspondence. In our application, both would typically be represented as objects or structs. 
In most programming languages, it is trivial to go from the original Oxygen query to an augmented form that retrieves the details about the isotope "Oxygen-16". This is especially true for JavaScript, where we'd simply write something like:&lt;pre&gt;&lt;span style="color: rgb(200,200,200)"&gt;//execute the query and obtain the result.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//note: in a real application this would typically be an asynchronous request&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//to the mqlread service which would accept and return JSON strings.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//For simplicity's sake we pretend we have a mqlRead function that can accept&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//a regular JavaScript object literal, convert it into a JSON string, and&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//call the mqlread service synchronously, parse the JSON query result&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//into a JavaScript object and return that to the caller.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0,0,255)"&gt;var&lt;/span&gt; queryResult = mqlRead({&lt;br /&gt;  type: &lt;span style="color: rgb(255,0,0)"&gt;"/chemistry/chemical_element"&lt;/span&gt;,&lt;br /&gt;  name: &lt;span style="color: rgb(255,0,0)"&gt;"Oxygen"&lt;/span&gt;,&lt;br /&gt;  symbol: &lt;span style="color: rgb(0,0,255)"&gt;null&lt;/span&gt;,&lt;br /&gt;  atomic_number: &lt;span style="color: rgb(0,0,255)"&gt;null&lt;/span&gt;,&lt;br /&gt;  ionization_energy: &lt;span style="color: rgb(0,0,255)"&gt;null&lt;/span&gt;,&lt;br /&gt;  melting_point: &lt;span style="color: rgb(0,0,255)"&gt;null&lt;/span&gt;,&lt;br /&gt;  isotopes: []&lt;br /&gt;});&lt;br /&gt; &lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//assign a subquery for "Oxygen-16" to the isotopes property of the 
queryResult.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//Remember, the queryResult has the same structure as the original query, just&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//with the nulls and the empty arrays filled with data.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;queryResult.isotopes = {&lt;br /&gt;  name: &lt;span style="color: rgb(255,0,0)"&gt;"Oxygen-16"&lt;/span&gt;,&lt;br /&gt;  natural_abundance: &lt;span style="color: rgb(0,0,255)"&gt;null&lt;/span&gt;,&lt;br /&gt;  stable: &lt;span style="color: rgb(0,0,255)"&gt;null&lt;/span&gt;&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(200,200,200)"&gt;//execute the modified query:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;queryResult = mqlRead(queryResult);&lt;br /&gt;&lt;/pre&gt;(In the first example where we discussed the Oxygen query, you might've noticed that in the query result, the &lt;code&gt;isotopes&lt;/code&gt; member was an array of strings, each representing a particular isotope. So naturally, you might be tempted to think that the value of the &lt;code&gt;isotopes&lt;/code&gt; property is an array of strings. However, this is not quite the case. Rather, the &lt;code&gt;isotopes&lt;/code&gt; property stands for the relationship between an element and its isotopes. Due to the form of the original query (having the empty array for the &lt;code&gt;isotopes&lt;/code&gt; property), the MQL query engine responds by listing the default property of the related isotopes. In Freebase, &lt;code&gt;name&lt;/code&gt; is a special property and typically that's used as the default property. So in that previous query result, the isotopes were represented only by their name. In the example above however, we assign a single object literal to the &lt;code&gt;isotopes&lt;/code&gt; property which identifies one particular isotope. 
Because the &lt;code&gt;isotopes&lt;/code&gt; property represents a relationship, the query should be read: "find me the oxygen element and the related isotope with the name Oxygen-16", and not "find me the oxygen element that has Oxygen-16 as its isotope".)&lt;br /&gt;&lt;br /&gt;Of course, it's entirely possible to design data structures to hold all the data you need to generate your SQL queries (a query model). You can easily come up with something that would allow such a change to be made just as easily. But you need at least one extra step to generate the actual SQL string to send to the database. And of course, you need some extra steps again to extract data from the resultset for forging more queries.&lt;br /&gt;&lt;br /&gt;In MQL, the effort to go from query to result and back is about as minimal as it can get. This results in less application code, which tends to be easier to understand.&lt;h4&gt;SQL is declarative, but MQL is more so&lt;/h4&gt;When discussing its merits as a programming language, it is often mentioned that SQL is declarative rather than procedural. Often this is presented as an advantage: with a declarative language like SQL, we get to focus on the results we want, whereas a procedural language would force us to code all kinds of details about how these results should be obtained. Or so the story goes.&lt;br /&gt;&lt;br /&gt;I won't deny SQL is declarative. 
For example, I don't need to spell out any particular data access algorithm required to find the Oxygen element and its isotopes, I just write:&lt;pre&gt;SELECT      e.name&lt;br /&gt;,           e.symbol&lt;br /&gt;,           e.atomic_number&lt;br /&gt;,           e.ionization_energy&lt;br /&gt;,           e.melting_point&lt;br /&gt;,           i.name&lt;br /&gt;,           i.natural_abundance&lt;br /&gt;,           i.stable&lt;br /&gt;FROM        chemistry.chemical_element         e&lt;br /&gt;INNER JOIN  chemistry.chemical_element_isotope i&lt;br /&gt;ON          e.atomic_number = i.atomic_number&lt;br /&gt;WHERE       e.name = 'Oxygen'&lt;/pre&gt;But still: in order to successfully relate the &lt;code&gt;chemical_element&lt;/code&gt; and &lt;code&gt;chemical_element_isotope&lt;/code&gt; tables, I need to spell out that the values in &lt;code&gt;chemical_element_isotope&lt;/code&gt;'s &lt;code&gt;atomic_number&lt;/code&gt; column have to be equal to the value in the &lt;code&gt;atomic_number&lt;/code&gt; column of &lt;code&gt;chemical_element&lt;/code&gt;. Come to think of it, how can I know the relationship is built on &lt;code&gt;atomic_number&lt;/code&gt;, and not on &lt;code&gt;symbol&lt;/code&gt; or &lt;code&gt;name&lt;/code&gt;? And heaven forbid we accidentally compare the wrong columns, or forget one of the join conditions...&lt;br /&gt;&lt;br /&gt;Now compare it to the equivalent MQL query:&lt;pre&gt;{&lt;br /&gt;  "type": "/chemistry/chemical_element",&lt;br /&gt;  "name": "Oxygen",&lt;br /&gt;  "symbol": null,&lt;br /&gt;  "atomic_number": null,&lt;br /&gt;  "ionization_energy": null,&lt;br /&gt;  "melting_point": null,&lt;br /&gt;  "isotopes": [{&lt;br /&gt;    "name": null,&lt;br /&gt;    "natural_abundance": null,&lt;br /&gt;    "stable": null&lt;br /&gt;  }]&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;The SQL query may be declarative, but compared to the MQL query, it requires a lot more knowledge of the underlying data model. 
All we had to do in MQL is specify an &lt;code&gt;isotopes&lt;/code&gt; property, and list whatever we want to retrieve from the corresponding isotope instances. The only way we can mess up the MQL query is when we specify the wrong property names, in which case our query would simply fail to execute. In the SQL query, we could've been mistaken about which columns to compare, and get no result at all, or worse, a rubbish result. And with just a bit of ill luck, we can accidentally cause a cartesian product. Just for the hell of it, spot the error in the following SQL statement:&lt;pre&gt;SELECT      e.name&lt;br /&gt;,           e.symbol&lt;br /&gt;,           e.atomic_number&lt;br /&gt;,           e.ionization_energy&lt;br /&gt;,           e.melting_point&lt;br /&gt;,           i.name&lt;br /&gt;,           i.natural_abundance&lt;br /&gt;,           i.stable&lt;br /&gt;FROM        chemistry.chemical_element         e&lt;br /&gt;INNER JOIN  chemistry.chemical_element_isotope i&lt;br /&gt;ON          e.atomic_number = e.atomic_number&lt;br /&gt;WHERE       e.name = 'Oxygen'&lt;/pre&gt;&lt;br /&gt;&lt;h4&gt;Computational Completeness&lt;/h4&gt;&lt;br /&gt;Everybody with some experience in SQL programming knows that SQL is much more than a database query language. Even standard SQL is chock-full of operators and functions that allow you to build complex expressions and calculations. In fact, most SQL dialects support so many functions and operators that the manual needs at least a separate chapter to cover them. Algebra, Encryption, String formatting, String matching, Trigonometry: these are just a few categories of functions you can find in almost any SQL dialect I have heard of.&lt;br /&gt;&lt;br /&gt;By contrast, MQL is all about the data. MQL defines a set of relational operators, but their function and scope are limited to finding objects, not doing calculations on them. 
There is exactly one construct in MQL that resembles a function, and it is used for counting the number of items in a result set.&lt;br /&gt;&lt;br /&gt;Personally, I think MQL would be better if it had a few more statistical or aggregate constructs like count. But overall, my current thinking is that the fact that MQL lacks the function-jungle present in most RDBMS-es is actually A Good Thing(tm). At the very least, it ensures queries stay focused on the data, and nothing but the data.&lt;br /&gt;&lt;h3&gt;Next Time&lt;/h3&gt;&lt;br /&gt;In this article I discussed the basics of the MQL query language, and I compared SQL and MQL on a number of accounts. In the next installment, I will discuss how this relates to developing data access services for web applications.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-7042325305385497085?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/7042325305385497085/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=7042325305385497085' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7042325305385497085'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7042325305385497085'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language_07.html' title='MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part II'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' 
src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm6.static.flickr.com/5081/5331599292_e1fc995307_t.jpg' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5367740322670057595</id><published>2011-01-06T22:10:00.011+01:00</published><updated>2011-01-07T08:36:00.941+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql user conference'/><category scheme='http://www.blogger.com/atom/ns#' term='mql-to-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='Ajax'/><category scheme='http://www.blogger.com/atom/ns#' term='json'/><category scheme='http://www.blogger.com/atom/ns#' term='MQL'/><category scheme='http://www.blogger.com/atom/ns#' term='mash-up'/><title type='text'>MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part I</title><content type='html'>Yesterday, &lt;a href="http://rpbouman.blogspot.com/2011/01/speaking-at-mysql-conference-2011.html"&gt;I wrote&lt;/a&gt; about how I think &lt;a href="http://en.oreilly.com/mysql2011/" target="mysqluc"&gt;this year's MySQL conference&lt;/a&gt; will differ from prior editions. I also wrote that I will attend and that &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17134" target="mysqlconf"&gt;I will be speaking&lt;/a&gt; on &lt;a href="http://code.google.com/p/mql-to-sql"&gt;MQL-to-SQL&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;I promised I would explain a little bit more background about my talk, so here's the first installment. 
&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Abstract:&lt;/b&gt; &lt;a href="http://www.freebase.com/docs/mql/ch03.html" target="fb"&gt;MQL&lt;/a&gt; is a &lt;a href="http://www.json.org/" target="json"&gt;JSON&lt;/a&gt;-based database query language that has some very interesting features as compared to SQL, especially for modern (&lt;a href="http://en.wikipedia.org/wiki/Ajax_(programming)" target="ajax"&gt;AJAX&lt;/a&gt;) web-applications. MQL is not a standard database query language, and currently only natively supported by &lt;a href="http://www.freebase.com/" target="fb"&gt;Freebase&lt;/a&gt;. However, with &lt;a href="http://code.google.com/p/mql-to-sql/"&gt;MQL-to-SQL&lt;/a&gt;, a project that provides a SQL adapter for MQL, you can run MQL queries against any RDBMS with a SQL interface.&lt;br /&gt;&lt;br /&gt;This article covers mostly background information on modern web applications, JavaScript and JSON. This background information should help you understand why a JSON-based database query language is a good fit for modern web applications. &lt;a href="http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language_07.html"&gt;A following installment&lt;/a&gt; will discuss the MQL database query language and how it relates to SQL. A third article will cover the mql-to-sql project itself.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;MQL, a JSON-based database query language&lt;/h2&gt;&lt;br /&gt;&lt;a href="http://www.freebase.com/docs/mql/ch03.html" target="fb"&gt;MQL&lt;/a&gt; (pronounced as Mickle) is an abbreviation of Metaweb Query Language, which is the &lt;a href="http://www.json.org/" target="json"&gt;JSON&lt;/a&gt;-based database query language natively supported by &lt;a href="http://www.freebase.com/" target="fb"&gt;Freebase&lt;/a&gt;. If you're unfamiliar with the terms Metaweb and Freebase, see this quote from &lt;a href="http://en.wikipedia.org/wiki/Metaweb" target="fb"&gt;Wikipedia about Metaweb&lt;/a&gt;:&lt;blockquote&gt;Metaweb Technologies, Inc. 
is a United States company based in San Francisco that is developing Freebase, described as an "open, shared database of the world's knowledge".&lt;/blockquote&gt;...and this quote &lt;a href="http://en.wikipedia.org/wiki/Freebase_(database)" target="fb"&gt;about Freebase&lt;/a&gt;:&lt;blockquote&gt;On March 3, 2007 Metaweb publicly announced Freebase, described by the company as "an open shared database of the world's knowledge," and "a massive, collaboratively-edited database of cross-linked data." Often understood as a wikipedia-turned-database, Freebase provides an interface that allows non-programmers to fill in structured, or 'meta-data', of general information, and to categorize or connect data items in meaningful, or 'semantic' ways.&lt;/blockquote&gt;So, Freebase is a database of just about everything (think of it as a machine readable version of wikipedia), and MQL is its query language.&lt;br /&gt;&lt;br /&gt;As a quick primer on the MQL query language, here's a simple MQL query (taken from the &lt;a href="http://mql.freebaseapps.com/ch01" target="fb"&gt;Freebase manual&lt;/a&gt;)&lt;pre&gt;{&lt;br /&gt;  "query": {&lt;br /&gt;    "type":"/music/artist",&lt;br /&gt;    "name":"The Police",&lt;br /&gt;    "album":[]&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;...and here's the result: &lt;pre&gt;{&lt;br /&gt;  "status": "200 OK", &lt;br /&gt;  "code": "/api/status/ok", &lt;br /&gt;  "transaction_id":"cache;cache01.p01.sjc1:8101;2008-09-18T17:56:28Z;0029",&lt;br /&gt;  "result": {&lt;br /&gt;    "type": "/music/artist", &lt;br /&gt;    "name": "The Police",&lt;br /&gt;    "album": [&lt;br /&gt;      "Outlandos d'Amour", &lt;br /&gt;      "Reggatta de Blanc", &lt;br /&gt;      "Zenyatta Mondatta", &lt;br /&gt;      "Ghost in the Machine", &lt;br /&gt;      "Synchronicity"&lt;br /&gt;    ]&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;I won't discuss these samples just yet - the MQL query language is the topic of the next installment. 
This is just a quick sample to give you an idea of what MQL queries and their results look like. If you didn't recognize the syntax already - read on, the JSON syntax is covered in the next section of this article.&lt;br /&gt;&lt;br /&gt;The collection of data accumulated in Freebase is impressive - you should check it out sometime. However, this post is not so much about Freebase - it's the MQL query language I want to discuss here.&lt;br /&gt;&lt;h3&gt;JSON&lt;/h3&gt;MQL is based on JSON (pronounced as Jason), which is an abbreviation of &lt;a href="http://en.wikipedia.org/wiki/JavaScript" target="json"&gt;JavaScript&lt;/a&gt; Object Notation. If you're not familiar with JSON, hang on and read all of this section - it provides just enough background to understand why it is highly relevant for modern web-applications, and it provides a good-enough description of the JSON syntax for you to read and write MQL queries. You might also want to check out the &lt;a href="http://en.wikipedia.org/wiki/JSON" target="json"&gt;Wikipedia entry on JSON&lt;/a&gt; for an objective overview.&lt;br /&gt;&lt;br /&gt;If you are already familiar with JSON, then you can safely skip through to the next section. But please do take a minute to review the JSON object code sample - it is used as the basis for developing simple MQL queries &lt;a href="http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language_07.html"&gt;in the next article in this series&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Now, back to JSON.&lt;h4&gt;What is JSON?&lt;/h4&gt;JSON is a data exchange format. Syntactically, JSON is a proper subset of JavaScript, the de-facto web browser scripting language. The JSON JavaScript subset is defined in such a way that it can express JavaScript values and objects of arbitrary complexity. 
However, JSON cannot be used to create an executable program - all dynamic, executable elements (functions) have been stripped away.&lt;br /&gt;&lt;br /&gt;Because JSON literally is JavaScript (just not the full set), JavaScript programs can easily read and write data expressed as JSON. For this reason, JSON is often characterized as a JavaScript object serialization format.&lt;br /&gt;&lt;br /&gt;An excellent description of the JSON syntax can be found at the &lt;a href="http://www.json.org/" target="json"&gt;JSON homepage&lt;/a&gt;. But if you don't feel like reading up on the details of JSON, no worries - you just need to remember a few JSON features to read and write MQL queries:&lt;ul&gt;&lt;li&gt;3 scalar* data types:&lt;ul&gt;&lt;li&gt;strings - for example: &lt;code&gt;"Oxygen"&lt;/code&gt;, or &lt;code&gt;"O"&lt;/code&gt; (JSON strings must be enclosed in double quotes; single quotes are valid in JavaScript, but not in JSON)&lt;/li&gt;&lt;li&gt;numbers - like &lt;code&gt;8&lt;/code&gt; (integer), &lt;code&gt;13.6181&lt;/code&gt; (float) and &lt;code&gt;-2.1835e+2&lt;/code&gt; (float)&lt;/li&gt;&lt;li&gt;booleans - &lt;code&gt;true&lt;/code&gt; and &lt;code&gt;false&lt;/code&gt; are the only possible boolean values&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;2 composite data types:&lt;ul&gt;&lt;li&gt;arrays - an ordered comma-separated list of values, enclosed in square brackets: &lt;code&gt;["Oxygen-16", "Oxygen-17", "Oxygen-18"]&lt;/code&gt;. The individual values in an array are usually referred to as elements&lt;/li&gt;&lt;li&gt;objects - an unordered comma-separated list of key/value pairs (which are uniquely named items), enclosed in curly braces:&lt;pre&gt;{&lt;br /&gt;  "name": "Oxygen",&lt;br /&gt;  "symbol": "O",&lt;br /&gt;  "atomic_number": 8,&lt;br /&gt;  "ionization_energy": 13.6181,&lt;br /&gt;  "melting_point": -2.1835e+2,&lt;br /&gt;  "isotopes": ["Oxygen-16", "Oxygen-17", "Oxygen-18"]&lt;br /&gt;}&lt;/pre&gt;The key/value pairs of an object are usually called &lt;em&gt;properties&lt;/em&gt; or &lt;em&gt;members&lt;/em&gt;. 
Note that key and value are separated by a colon (&lt;code&gt;:&lt;/code&gt;). Keys are strings, and have to be quoted. Values can be of any of the data types described above, including arrays (like the &lt;code&gt;"isotopes"&lt;/code&gt; property in the example above) and objects.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;A special &lt;code&gt;null&lt;/code&gt;-value (which actually constitutes a data type of its own)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;(* I should point out that JavaScript is an &lt;a href="http://en.wikipedia.org/wiki/Object-oriented_programming" target="json"&gt;object-oriented language&lt;/a&gt;. At runtime, all data types described above, including the array and "scalar" types, are in fact objects. Objects are by definition composite, and not scalar. However, the list above is about JSON's syntactic constructs, not about their runtime JavaScript representation.)&lt;br /&gt;&lt;br /&gt;As you can see, JSON provides a few very simple rules. Yet it allows you to build data structures of arbitrary complexity: the object example above illustrates how you can bundle a collection of named values together to represent structures of a higher order. In this particular case, we denoted an object that represents the chemical element Oxygen. And we needn't have stopped here: we could've added more key/value pairs to represent other complex properties of the element Oxygen, such as the person who first discovered it, a list of well-known substances that contain Oxygen, and so on.&lt;br /&gt;&lt;h4&gt;AJAX and JSON&lt;/h4&gt;&lt;br /&gt;I already mentioned something about JSON in relation to JavaScript and modern web applications. 
By modern web applications, I mean web applications that offer a rich and highly interactive user interface that is based on &lt;a href="http://en.wikipedia.org/wiki/Ajax_(programming)" target="json"&gt;AJAX&lt;/a&gt; (an acronym for Asynchronous JavaScript And XML) technology.&lt;br /&gt;&lt;br /&gt;A key feature of AJAX is the usage of client-side JavaScript code for maintaining non-blocking background communication with the web server. This is typically done using a specialized object called the &lt;code&gt;&lt;a href="http://en.wikipedia.org/wiki/XMLHttpRequest" target="json"&gt;XMLHttpRequest&lt;/a&gt;&lt;/code&gt;. AJAX applications also tend to use JavaScript to dynamically change the contents or appearance of the page (a technique called dynamic HTML or DHTML). This allows web developers to create applications that can avoid a relatively slow page reload most of the time, making the application appear more responsive.&lt;br /&gt; &lt;br /&gt;Going by the meaning of the AJAX acronym, you may be under the impression that AJAX applications are all about asynchronous XML data exchange using the &lt;code&gt;XMLHttpRequest&lt;/code&gt;, and thus the relevance or need for JSON might be lost on you. I would say that, yes, you're not wrong: AJAX is often implemented by using the &lt;code&gt;XMLHttpRequest&lt;/code&gt; object to communicate with the server using XML messages. But for several reasons, JSON is gaining popularity as a data exchange format instead of XML, and techniques like &lt;a href="http://en.wikipedia.org/wiki/JSONP#JSONP" target="json"&gt;JSONP&lt;/a&gt; are used in addition to the &lt;code&gt;XMLHttpRequest&lt;/code&gt; object. 
This technique actually has an advantage over using the &lt;code&gt;XMLHttpRequest&lt;/code&gt; because it can be used to do requests to services that reside on a different domain than the current web page, whereas this is not allowed with the &lt;code&gt;XMLHttpRequest&lt;/code&gt; (at least, not by default and not without explicitly asking the user for confirmation). This makes JSONP a great tool for creating mash-up applications. &lt;br /&gt;&lt;br /&gt;I'm sure you're not surprised to hear from me that the internet offers many places where you can get your XML vs JSON brawl on - I'm not particularly interested in that discussion, and I don't feel anybody has to choose sides. The point I want to bring across is that for AJAX applications, JSON is a respected data exchange format. It is excellently supported by both web browsers and popular AJAX frameworks, and companies like Amazon, YAHOO! and Google deliver more and more webservices that use JSON as their data exchange format.&lt;br /&gt;&lt;h3&gt;Next time&lt;/h3&gt; &lt;br /&gt;Now that you've seen how JSON works, and how it relates to AJAX web applications, you're ready to take a look at MQL queries. 
This is the topic of the next blog post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5367740322670057595?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5367740322670057595/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5367740322670057595' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5367740322670057595'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5367740322670057595'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/01/mql-to-sql-json-based-query-language.html' title='MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part I'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-6905816910315063839</id><published>2011-01-06T01:43:00.006+01:00</published><updated>2011-01-06T02:54:09.091+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MongoDB'/><category scheme='http://www.blogger.com/atom/ns#' term='mysqlconf'/><category scheme='http://www.blogger.com/atom/ns#' term='mql-to-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='CouchDB'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Freebase'/><category scheme='http://www.blogger.com/atom/ns#' term='drizzle'/><category scheme='http://www.blogger.com/atom/ns#' term='Ajax'/><category scheme='http://www.blogger.com/atom/ns#' term='MariaDB'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql user conference'/><category scheme='http://www.blogger.com/atom/ns#' term='php'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><category scheme='http://www.blogger.com/atom/ns#' term='MQL'/><title type='text'>Speaking at the MySQL conference 2011</title><content type='html'>I just received a confirmation that my &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/full#session17134"&gt;presentation proposal&lt;/a&gt; for the &lt;a href="http://en.oreilly.com/mysql2011"&gt;MySQL user conference 2011&lt;/a&gt; was accepted! The title for my proposal is &lt;a href="http://en.oreilly.com/mysql2011/public/schedule/detail/17134"&gt;MQL-to-SQL: a JSON-based Query Language for RDBMS Access from AJAX Applications&lt;/a&gt;, and it covers pretty much everything implied by the title.&lt;br /&gt;&lt;br /&gt;As always, the &lt;a href="http://www.santaclara.hyatt.com/hyatt/hotels/index.jsp"&gt;Hyatt Regency Hotel&lt;/a&gt; in Santa Clara, California serves as the venue. The conference will be held from April 11-14. Except for the venue and period, I think this year's conference will bear few similarities to previous editions. Let me try and explain.&lt;br /&gt;&lt;br /&gt;This year's theme is "MySQL, the ecosystem and Beyond". This means that the conference is using MySQL as an &lt;em&gt;anchor&lt;/em&gt; for a myriad of topics which are of interest to a large majority of MySQL users. This explicitly leaves room for subjects that may not be directly related to the MySQL product proper. 
&lt;br /&gt;&lt;br /&gt;So, not only products with a direct link to MySQL, such as &lt;a href="http://drizzle.org/Home.html"&gt;drizzle&lt;/a&gt; and &lt;a href="http://mariadb.org/"&gt;MariaDB&lt;/a&gt; are covered; NoSQL databases like &lt;a href="http://couchdb.apache.org/"&gt;CouchDB&lt;/a&gt;, &lt;a href="http://www.mongodb.org/"&gt;MongoDB&lt;/a&gt; and &lt;a href="http://cassandra.apache.org/"&gt;Cassandra&lt;/a&gt; are quite well represented and the conference committee actively reached out to the &lt;a href="http://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; community to submit proposals. Traditional topics like scalability, performance and tuning remain strongly present, just like high availability, failover, and replication. As always, some of the world experts in this field will be speaking. In addition, infrastructural topics like virtualization and cloud computing are well represented (but of course, especially with regard to database management). One of the things I'm thrilled about is the presence of developer- and application-centric topics like GIS, rapid application development, and object relational mapping.  &lt;br /&gt;&lt;br /&gt;Just &lt;a href="http://en.oreilly.com/mysql2011/public/schedule"&gt;take a look at the full schedule&lt;/a&gt; to get a taste of what this event will be offering. Personally, I think it's a great setup, and I'm happy and honored to attend! &lt;br /&gt;&lt;br /&gt;In a series of upcoming blog posts, I plan to explain some of the subject matter regarding my own talk. But for now, I just want to tell you that I think this is going to be a great conference! I'm looking forward to attending a lot of high quality sessions, and meeting world leading experts in the MySQL and open source database ecosystem. 
I hope to see you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-6905816910315063839?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/6905816910315063839/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=6905816910315063839' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6905816910315063839'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6905816910315063839'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2011/01/speaking-at-mysql-conference-2011.html' title='Speaking at the MySQL conference 2011'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-4441048954144292400</id><published>2010-12-18T02:43:00.007+01:00</published><updated>2010-12-22T15:07:19.199+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='kettle-cookbook'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho Kettle Solutions'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><title type='text'>Substituting variables in Kettle Parameter values</title><content type='html'>&lt;a 
href="http://kettle.pentaho.org"&gt;Kettle&lt;/a&gt; (a.k.a. Pentaho Data Integration) jobs and transformations offer support for &lt;a href="http://wiki.pentaho.com/display/EAI/Named+Parameters" target="kettle"&gt;named parameters&lt;/a&gt; (as of version 3.2.0). Named parameters form a special class of ordinary &lt;a href="http://wiki.pentaho.com/display/COM/Using+Variables+in+Kettle" target="kettle"&gt;kettle variables&lt;/a&gt; and are intended to clearly and explicitly define for which variables the caller should supply a value. &lt;br /&gt;&lt;br /&gt;One of my pet projects, the Pentaho auto-documentation solution &lt;a href="http://code.google.com/p/kettle-cookbook/" target="kettle-cookbook"&gt;kettle-cookbook&lt;/a&gt;, uses two named parameters called &lt;code&gt;INPUT_DIR&lt;/code&gt; and &lt;code&gt;OUTPUT_DIR&lt;/code&gt;. These allow you to specify the directory that contains the BI content that is to be documented (such as kettle transformation and job files, action sequence files and mondrian schema files), and the directory to store the generated documentation.&lt;br /&gt;&lt;br /&gt;Several kettle-cookbook users ran into problems attempting to use variable references in the values they supplied for the &lt;code&gt;INPUT_DIR&lt;/code&gt; and &lt;code&gt;OUTPUT_DIR&lt;/code&gt; parameters. In this case, the variables referenced in the supplied parameter values would be set by adding entries to the &lt;code&gt;kettle.properties&lt;/code&gt; file. I just committed revision 64 of kettle-cookbook which should fix this problem. In this article I briefly discuss the solution, as I think it may be useful to other kettle users. &lt;br /&gt;&lt;h3&gt;Substituting Kettle Variable References&lt;/h3&gt;&lt;br /&gt;Kettle doesn't automatically substitute variable references in parameter values (nor in ordinary variable values). 
So, if you need to support variable references inside parameter values, you have to substitute the variables yourself.&lt;br /&gt;&lt;h4&gt;Variable substitution in Kettle 4.01 and up&lt;/h4&gt;&lt;br /&gt;As of Kettle version 4.01, the &lt;a href="http://wiki.pentaho.com/display/EAI/Calculator" target="kettle"&gt;Calculator step&lt;/a&gt; supports a calculation type called "variable substitution in string A" that is intended exactly for that purpose. I have tested this, but unfortunately in 4.01 it doesn't seem to work, at least not for the built-in variable &lt;code&gt;${Internal.Transformation.Filename.Directory}&lt;/code&gt; which I used in my test. In the latest stable version, Kettle 4.10, it does work as advertised, so I would recommend using this method if you're a user of Kettle 4.10 (or later).&lt;br /&gt;&lt;h4&gt;Variable substitution in earlier Kettle versions&lt;/h4&gt;&lt;br /&gt;I have committed myself to making kettle-cookbook work on kettle 3.2.0, as my sources tell me that this is still an often-used version in many production environments. I'm even prepared to make kettle-cookbook work on Kettle versions earlier than 3.2.0, should there be sufficient demand for that. Anyway, the bottom line is that these versions do not support the "variable substitution in string A" calculation in the Calculator step, so you have to resort to a little trick.&lt;br /&gt;&lt;h3&gt;A Kettle 3.2.0 transformation to substitute variables in parameters&lt;/h3&gt;&lt;br /&gt;For kettle-cookbook, I added a single transformation called &lt;a href="http://code.google.com/p/kettle-cookbook/source/browse/trunk/pdi/substitute-variables-in-parameters.ktr?spec=svn64&amp;r=64" target="kettle-cookbook"&gt;substitute-variables-in-parameters.ktr&lt;/a&gt; as the first transformation of the main job. 
&lt;br /&gt;&lt;a href="http://code.google.com/p/kettle-cookbook/source/browse/trunk/pdi/substitute-variables-in-parameters.ktr?spec=svn64&amp;r=64" target="kettle-cookbook"&gt;&lt;br /&gt;&lt;img src="http://farm6.static.flickr.com/5246/5269672677_88f6a99d3b_b.jpg"/&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;The &lt;code&gt;substitute-variables-in-parameters.ktr&lt;/code&gt; transformation uses a "Get Variables" step to read the values of the &lt;code&gt;INPUT_DIR&lt;/code&gt; and &lt;code&gt;OUTPUT_DIR&lt;/code&gt; parameters. The values are then processed by a JavaScript function which substitutes all variable references with their values. Finally, a "Set Variables" step overwrites the original value of the variables with their replaced value.&lt;br /&gt;&lt;br /&gt;The code for the JavaScript step is shown below:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;function replace_variables(text){&lt;br /&gt;    var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,&lt;br /&gt;        match, from = 0,&lt;br /&gt;        variable_name, variable_value, &lt;br /&gt;        replaced_text = ""&lt;br /&gt;    ;&lt;br /&gt;&lt;br /&gt;    while ((match = re.exec(text)) !== null) {&lt;br /&gt;        variable_name  = match[1] ? match[1] : match[2];&lt;br /&gt;        variable_value = getVariable(variable_name, "");&lt;br /&gt;        replaced_text += text.substring(from, match.index);&lt;br /&gt;        replaced_text += variable_value;&lt;br /&gt;        from = match.index + match[0].length;&lt;br /&gt;    }&lt;br /&gt;    replaced_text += text.substring(from, text.length);&lt;br /&gt;    return replaced_text;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;var replaced_input_dir = replace_variables(input_dir);&lt;br /&gt;var replaced_output_dir = replace_variables(output_dir);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The script first defines &lt;code&gt;function replace_variables(text)&lt;/code&gt; which accepts the parameter value, and returns the substituted value. 
Then it calls the function, applying it to the &lt;code&gt;input_dir&lt;/code&gt; and &lt;code&gt;output_dir&lt;/code&gt; fields from the incoming stream. These fields originate in the preceding "Get variables" step which assigns them the value of the &lt;code&gt;INPUT_DIR&lt;/code&gt; and &lt;code&gt;OUTPUT_DIR&lt;/code&gt; variables. The output of the &lt;code&gt;replace_variables()&lt;/code&gt; function is assigned to the &lt;code&gt;replaced_input_dir&lt;/code&gt; and &lt;code&gt;replaced_output_dir&lt;/code&gt; JavaScript variables, which leave the JavaScript step as fields of the outgoing stream. In the final "Set variables" step, the &lt;code&gt;replaced_input_dir&lt;/code&gt; and &lt;code&gt;replaced_output_dir&lt;/code&gt; fields are used to overwrite the original value of the &lt;code&gt;INPUT_DIR&lt;/code&gt; and &lt;code&gt;OUTPUT_DIR&lt;/code&gt; variables.&lt;br /&gt;&lt;h3&gt;The &lt;code&gt;replace_variables()&lt;/code&gt; function&lt;/h3&gt;&lt;br /&gt;Let's take a closer look at the &lt;code&gt;replace_variables()&lt;/code&gt; function.&lt;br /&gt;&lt;br /&gt;The heart of the function is formed by a &lt;code&gt;while&lt;/code&gt; loop that executes a &lt;a href="http://www.w3schools.com/jsref/jsref_obj_regexp.asp" target="javascript"&gt;javascript regular expression&lt;/a&gt; called &lt;code&gt;re&lt;/code&gt; that matches variable references.&lt;br /&gt;&lt;br /&gt;The regular expression itself is defined at the top of the function:&lt;pre&gt;&lt;br /&gt;    var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Its intention is to recognize variable references of the form &lt;code&gt;${NAME}&lt;/code&gt; and &lt;code&gt;%%NAME%%&lt;/code&gt;. The part of the pattern for the name is enclosed in parentheses to form a capturing group. As we shall see later on, this allows us to extract the actual name of the referenced variable. 
The trailing &lt;code&gt;g&lt;/code&gt; is the global flag: it makes each call to &lt;code&gt;exec()&lt;/code&gt; resume the search right after the previous match, instead of starting over at the beginning of the string. This is necessary because we want to replace all variable references in the input text, not just the first one.&lt;br /&gt;&lt;br /&gt;The regular expression object is used to drive the &lt;code&gt;while&lt;/code&gt; loop by calling its &lt;code&gt;exec()&lt;/code&gt; method. In case of a match, &lt;code&gt;exec()&lt;/code&gt; returns an array that describes the text matched by the regular expression. If there's no match, &lt;code&gt;exec()&lt;/code&gt; returns &lt;code&gt;null&lt;/code&gt;.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    while ((match = re.exec(text)) !== null) {&lt;br /&gt;       ...&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If there is a match, we first extract the variable name:&lt;pre&gt;&lt;br /&gt;variable_name  = match[1] ? match[1] : match[2];&lt;br /&gt;&lt;/pre&gt;The first entry of the &lt;code&gt;match&lt;/code&gt; array (at index 0) is the text that was matched by the pattern as a whole. The array contains subsequent elements for each capturing group in the regular expression. Because our regular expression &lt;code&gt;re&lt;/code&gt; has 2 capturing groups, the &lt;code&gt;match&lt;/code&gt; array contains two more elements. If the variable is of the form &lt;code&gt;${NAME}&lt;/code&gt;, the element at index=1 contains the variable name. 
If it's of the form &lt;code&gt;%%NAME%%&lt;/code&gt;, it will be contained in the element at index=2.&lt;br /&gt;&lt;br /&gt;Once we have the variable name, we can use the &lt;code&gt;getVariable()&lt;/code&gt; javascript function to obtain its value:&lt;pre&gt;&lt;br /&gt;        variable_value = getVariable(variable_name, "");&lt;/pre&gt;&lt;br /&gt;Note that &lt;code&gt;getVariable()&lt;/code&gt; is not a standard javascript function; it is supplied by the kettle javascript step.&lt;br /&gt;&lt;br /&gt;To perform the actual substitution, we take the substring of the original text up to the location where the variable reference was matched. This location is conveniently supplied by the &lt;code&gt;index&lt;/code&gt; property of the &lt;code&gt;match&lt;/code&gt; array:&lt;pre&gt;&lt;br /&gt;        replaced_text += text.substring(from, match.index);&lt;br /&gt;&lt;/pre&gt;Right after that location, we need to put the variable value instead of its name:&lt;pre&gt;&lt;br /&gt;        replaced_text += variable_value;&lt;br /&gt;&lt;/pre&gt;The last action in the loop is to remember the location right behind the last replaced variable reference, so we can pick up at the right location in the original value the next time we match a variable:&lt;pre&gt;&lt;br /&gt;        from = match.index + match[0].length;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Right after the loop, we need to copy the final piece of original text occurring right behind the last variable reference to yield the complete replaced text:&lt;pre&gt;&lt;br /&gt;    replaced_text += text.substring(from, text.length);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h3&gt;Odds and Ends&lt;/h3&gt;&lt;br /&gt;While the &lt;code&gt;substitute-variables-in-parameters.ktr&lt;/code&gt; transformation works great for its intended purpose, substituting variables in the known parameters &lt;code&gt;INPUT_DIR&lt;/code&gt; and &lt;code&gt;OUTPUT_DIR&lt;/code&gt;, it is not really applicable beyond kettle-cookbook. 
What you'd really want to have is a job that replaces variables in all parameters, not just those that are known in advance. &lt;br /&gt;&lt;br /&gt;As it turns out, this is actually almost trivial to achieve; however, the solution is a bit too long-winded for this post. If anyone is interested in such a solution, please post a comment and let me know, and I'd be happy to provide it.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;UPDATE&lt;/b&gt;: A solution that substitutes all variable references occurring in the parameter values of the containing job is now available at the &lt;a href="http://wiki.pentaho.com/display/EAI/Substituting+variable+references+in+Job+Parameter+values" target="pentaho"&gt;kettle exchange area&lt;/a&gt; in the Pentaho wiki.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-4441048954144292400?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/4441048954144292400/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=4441048954144292400' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/4441048954144292400'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/4441048954144292400'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/12/substituting-variables-in-kettle.html' title='Substituting variables in Kettle Parameter values'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><media:thumbnail 
xmlns:media='http://search.yahoo.com/mrss/' url='http://farm6.static.flickr.com/5246/5269672677_88f6a99d3b_t.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-3732404779466472673</id><published>2010-12-09T09:25:00.006+01:00</published><updated>2010-12-09T11:50:25.853+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle Solutions'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho Kettle Solutions'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho Solutions'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><title type='text'>Parameterizing SQL statements in the Kettle Table Input step: Variables vs Parameters</title><content type='html'>I get this question on a regular basis, so I figured I might as well blog it, in the hope it will be useful for others. Here goes:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Let's say that I want to delete all records that match an id from a set of tables.  The table names come in as rows into the Execute SQL Script step (check execute for every row). Next I write:&lt;br /&gt;&lt;pre&gt;DELETE FROM {table_name} WHERE id = {identifier}&lt;/pre&gt;&lt;br /&gt;as the SQL to execute.  In the parameters grid at the bottom right, I have two fields: &lt;code&gt;table_name&lt;/code&gt; and &lt;code&gt;identifier&lt;/code&gt;.  
What is the syntax for substituting the &lt;code&gt;table_name&lt;/code&gt; and &lt;code&gt;identifier&lt;/code&gt; parameters in the sql script?&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;(Although this particular question focuses on the "Execute SQL Script" step, it also applies to the "Table Input" step, and probably a few more steps I can't recall right now.)&lt;br /&gt;&lt;br /&gt;The parameters grid can be used for prepared statement value placeholders. In the SQL statement these placeholders are denoted as question marks (&lt;code&gt;?&lt;/code&gt;). These are positional parameters: they get their value from those fields in the incoming stream that are entered in the parameters grid, in order. &lt;br /&gt;Here's an example of the correct usage of these placeholders:&lt;br /&gt;&lt;pre&gt;DELETE FROM myTable WHERE id = ?&lt;/pre&gt;&lt;br /&gt;Here, the &lt;code&gt;?&lt;/code&gt; in the &lt;code&gt;WHERE&lt;/code&gt; clause will be bound to the value of the first field from the incoming stream entered in the parameters grid. Because there is only one such placeholder, there can be only one field in the parameters grid.&lt;br /&gt;&lt;br /&gt;An important thing to realize is that these parameters can only be used to parameterize &lt;em&gt;value expressions&lt;/em&gt;. So, this kind of parameter does not work for identifiers, nor does it work for structural elements of the SQL statement, such as keywords. So this kind of parameter cannot be used to parameterize the table name, which seems to be the intention in the original example posed in the question.&lt;br /&gt;&lt;br /&gt;There is a way to parameterize the structural elements of the SQL statement as well as the parameters. You can apply &lt;em&gt;variable substitution&lt;/em&gt; to the SQL statement.&lt;br /&gt;&lt;br /&gt;Kettle Variables can be defined by a Set Variables step, or by specifying parameters at the transformation level. 
They get their value from "the environment": for example, parameters get their value initially when the transformation is started, and regular variables are typically set somewhere in the job that is calling your transformation. &lt;br /&gt;&lt;br /&gt;In text fields, including the SQL textarea of the Table input step or the Execute SQL Script step, you denote those variables with this syntax: &lt;code&gt;${VARIABLE_NAME}&lt;/code&gt;. So to parameterize the table name we could use something like this:&lt;br /&gt;&lt;pre&gt;DELETE FROM ${TABLE_NAME}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;In order to force kettle to apply variable substitution to the SQL statement, you have to check the "variable substitution" checkbox. If this checkbox is checked, then all variables are simply substituted with their (string) value during transformation initialization. This is a lot like the way macros are substituted by the pre-processor in C/C++ code. &lt;br /&gt;&lt;br /&gt;When comparing variables with parameters, two important things should be mentioned here:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Unlike value placeholders, variables can be used to manipulate any aspect of the SQL statement, not just value expressions. The variable value will simply become part of the text that makes up the SQL statement; it is your responsibility to ensure it results in a syntactically valid and correct SQL statement.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Variables are evaluated once during transformation initialization. So if you want to vary the variable value, you'll have to call the transformation again for the change to take effect. 
For the same reasons, you cannot set the value of a variable and read it within the same transformation: setting the variable value occurs at runtime, but evaluating it occurs at initialization time.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Finally, here's a screenshot that summarizes these different ways to parameterize SQL statements in kettle: &lt;img src="http://farm6.static.flickr.com/5242/5245670439_5bd9ced803_b_d.jpg"/&gt;&lt;br /&gt;&lt;br /&gt;If you want to read more about this topic, it's covered in both our books &lt;a href="http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/dp/0470484322/ref=bxgy_cc_b_img_a" target="_ps"&gt;Pentaho Solutions&lt;/a&gt; and &lt;a href="http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177/ref=pd_sim_b_1" target="_pks"&gt;Pentaho Kettle Solutions&lt;/a&gt;. Another title you might be interested in is Maria Roldan's &lt;a href="http://www.amazon.com/Pentaho-3-2-Data-Integration-Beginners/dp/1847199542/ref=pd_sim_b_3"&gt;Pentaho 3.2 Data Integration: Beginner's Guide&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-3732404779466472673?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/3732404779466472673/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=3732404779466472673' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3732404779466472673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3732404779466472673'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/12/parameterizing-sql-statements-in-kettle.html' 
title='Parameterizing SQL statements in the Kettle Table Input step: Variables vs Parameters'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-7445810499566340078</id><published>2010-08-11T01:05:00.002+02:00</published><updated>2010-08-11T01:05:00.204+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sakila'/><category scheme='http://www.blogger.com/atom/ns#' term='Indexed Database API'/><category scheme='http://www.blogger.com/atom/ns#' term='SQLite'/><category scheme='http://www.blogger.com/atom/ns#' term='kettle-cookbook'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle Solutions'/><category scheme='http://www.blogger.com/atom/ns#' term='mql-to-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='Web SQL Database API'/><category scheme='http://www.blogger.com/atom/ns#' term='Safari'/><category scheme='http://www.blogger.com/atom/ns#' term='Chrome'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='Quipu'/><category scheme='http://www.blogger.com/atom/ns#' term='Opera'/><category scheme='http://www.blogger.com/atom/ns#' term='Firefox'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Freebase'/><category scheme='http://www.blogger.com/atom/ns#' term='MQL'/><title type='text'>Back to blogging....</title><content type='html'>It has been a while since I posted on my blog - in fact, I believe this is the first time ever that more than one month passed between posts since I started 
blogging. There are a couple of reasons for the lag: &lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html" target="wiley"&gt;&lt;img src="http://media.wiley.com/product_data/coverImage/77/04706351/0470635177.jpg"/&gt;&lt;/a&gt;&lt;a href="http://www.ibridge.be/"&gt;Matt Casters&lt;/a&gt;, &lt;a href="http://www.tholis.nl/"&gt;Jos van Dongen&lt;/a&gt; and I have spent a lot of time finalizing our forthcoming book, &lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html" target="wiley"&gt;Pentaho Kettle Solutions&lt;/a&gt; (Wiley, ISBN: 978-0-470-63517-9). The book is currently being produced, and should be available according to schedule in early September 2010. If you're interested, you might like to read &lt;a href="http://rpbouman.blogspot.com/2010/03/writing-another-book-pentaho-kettle.html"&gt;one of my earlier posts&lt;/a&gt; that explains the organization and outline of the book.&lt;br /&gt;&lt;br /&gt;(I should point out that we have reorganized the outline as the project progressed, so the final result will not have all the chapters mentioned in that post. We do however cover most of the topics mentioned.)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.datawarehousemanagement.org/" target="quipu"&gt;&lt;img src="http://www.datawarehousemanagement.org/Portals/0/Quipu%20400%20lime%20smaller_resize.png"/&gt;&lt;/a&gt;I have been checking out &lt;a href="http://www.datawarehousemanagement.org/"&gt;Quipu&lt;/a&gt;, a promising Open Source data warehouse management solution. Quipu provides a repository-based extensible code-generator that allows you to generate and maintain a data warehouse based on the Data Vault model. 
One of the things I want to do in the short term is write templates that allow it to work for MySQL, and after that, I want to see if I can get Quipu to generate Kettle Jobs and transformations.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;I have been working on a couple of software projects. Two of them are currently available as open source on google code:&lt;dl&gt;&lt;br /&gt;&lt;dt&gt;&lt;a href="http://code.google.com/p/mql-to-sql/"&gt;mql-to-sql&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;This allows you to use the &lt;a href="http://wiki.freebase.com/wiki/MQL"&gt;Metaweb Query Language&lt;/a&gt; (MQL) to query an RDBMS. If you're wondering what this is all about: MQL is the query language used by &lt;a href="http://wiki.freebase.com/wiki/Main_Page"&gt;Freebase&lt;/a&gt; (which is a collaborative "database-of-everything" or "wikipedia-gone-database").&lt;br /&gt;&lt;br /&gt;While Freebase is interesting in its own right, I am particularly enthused about the MQL query language. I feel that MQL is an exceptionally good solution for flexible, expressive and secure data access for modern (AJAX) Web applications. Even though MQL was not developed with relational database systems in mind, I think it is a pretty good fit.&lt;br /&gt;&lt;br /&gt;Anyway, this is very much a work in progress, and I appreciate your feedback on it. If you're interested, you can read a bit more about my take on RDBMS data access for web applications and MQL on the &lt;a href="http://code.google.com/p/mql-to-sql/" target="mql-to-sql"&gt;mql-to-sql project home page&lt;/a&gt;. 
I have also put up an &lt;a href="http://mql.qbmetrix.com/mqlread/mql-to-sql-query-editor.php" target="mql-to-sql"&gt;online demo&lt;/a&gt; that allows you to query the &lt;a href="http://dev.mysql.com/doc/sakila/en/sakila.html" target="mysql"&gt;sakila MySQL sample database&lt;/a&gt; using MQL.&lt;/dd&gt;&lt;br /&gt;&lt;dt&gt;&lt;a href="http://code.google.com/p/kettle-cookbook/" target="kettle-cookbook"&gt;kettle-cookbook&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;This is a project that provides auto-documentation for &lt;a href="http://kettle.pentaho.com/" target="kettle-cookbook"&gt;Kettle&lt;/a&gt; (a.k.a. Pentaho Data Integration). &lt;br /&gt;&lt;br /&gt;The project consists of a bunch of Kettle Jobs and transformations (as well as some XSLT stylesheets) that extract data from Kettle Jobs and transformations and transform it into a collection of human-readable HTML documents along with a table of contents. The resulting documentation looks and feels a bit like JavaDoc documentation. &lt;br /&gt;&lt;br /&gt;If all goes well, I will be presenting kettle-cookbook in a Pentaho web seminar, which is currently scheduled for September 15, 2010.&lt;/dd&gt;&lt;br /&gt;&lt;/dl&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;I've been enjoying 4 weeks of vacation (I started working this week). There's not much to tell about that, other than that it was great spending a lot of time with my family. I plan to do this more often, and now that Kettle Solutions is finished I should be able to find more time to do it.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;I've been looking at two emerging HTML5 APIs for client-side structured storage, the &lt;a href="http://dev.w3.org/html5/webdatabase/" target="html5"&gt;Web SQL Database API&lt;/a&gt; and the &lt;a href="http://www.w3.org/TR/IndexedDB/" target="html5"&gt;Indexed DB API&lt;/a&gt;. 
&lt;br /&gt;&lt;br /&gt;I have developed a few thoughts about the ongoing debate (see &lt;a href="http://hacks.mozilla.org/2010/06/beyond-html5-database-apis-and-the-road-to-indexeddb/" target="webdb"&gt;this&lt;/a&gt; and &lt;a href="http://hacks.mozilla.org/2010/06/comparing-indexeddb-and-webdatabase/" target="webdb"&gt;this&lt;/a&gt; article for some background) about which one is better and I see a role for something like MQL here too. I will probably write something about this in the next few weeks.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Yesterday, I got wind of the &lt;a href="http://js1k.com/home" target="js1k"&gt;JS1k contest&lt;/a&gt;! Basically, the challenge is to write an interesting standalone JavaScript demo program that must be no larger than 1024 bytes. It is amazing and inspiring to &lt;a href="http://js1k.com/demos"&gt;see what people manage to do&lt;/a&gt; with a modern browser in 1k of JavaScript code. &lt;br /&gt;&lt;br /&gt;I decided to try it myself, and you can find &lt;a href="http://js1k.com/demo/244" target="js1k"&gt;my submission here&lt;/a&gt;: An interactive SQL query tool for the Web SQL DB API. &lt;br /&gt;&lt;br /&gt;Essentially, you get a textarea where you can enter arbitrary SQLite queries, a button to execute the SQL, and a result area that will print a feedback message and a result table (if applicable). As a bonus, there's a button to get a listing of the available database objects (using a query on sqlite_master) and an explain button to show the query plan of the current SQL statement.&lt;br /&gt;&lt;br /&gt;The demo works on recent versions of Google Chrome, Apple Safari and Opera. It can run offline, and does not require any plugins. I should say that I expect my submission will be rejected by the judges since the demo is not functional on Mozilla Firefox, which is a requirement. (That is, the script will detect that the Web SQL Database API is not supported and print a message to that effect). 
However, it was still fun to try my hand at it. &lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Ok - that's it for now. I will try and post more regularly and write about these and other things in the near future. Don't hesitate to leave a comment if you have any questions or suggestions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-7445810499566340078?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/7445810499566340078/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=7445810499566340078' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7445810499566340078'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7445810499566340078'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/08/back-to-blogging.html' title='Back to blogging....'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-7122558264129037578</id><published>2010-05-26T21:45:00.009+02:00</published><updated>2010-05-28T01:30:01.723+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='PostgreSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Ms SQL 
Server'/><category scheme='http://www.blogger.com/atom/ns#' term='ISO 9075'/><title type='text'>A small issue of SQL standards</title><content type='html'>From a functional perspective, the core SQL support in all major and minor RDBMS-es is reasonably similar. In this light, it's sometimes quite disturbing to find how some very basic things work so differently across different products. Consider this simple statement:&lt;pre&gt;SELECT  'a' /* this is a comment */ 'b'&lt;br /&gt;FROM    onerow&lt;/pre&gt;What should the result be? (You can assume that &lt;code&gt;onerow&lt;/code&gt; is an existing table that contains one row)&lt;br /&gt;&lt;br /&gt;It turns out popular RDBMS-es mostly disagree with one another.&lt;br /&gt;&lt;br /&gt;In Oracle XE, we get this:&lt;pre&gt;SELECT  'a' /* comment */ 'b'&lt;br /&gt;                         *&lt;br /&gt;ERROR at line 1:&lt;br /&gt;ORA-00923: FROM keyword not found where expected&lt;/pre&gt;&lt;br /&gt;PostgreSQL 8.4 also treats it as a syntax error, and thus seems compatible with Oracle's behavior: &lt;pre&gt;ERROR:  syntax error at or near "'b'"&lt;br /&gt;LINE 1: SELECT  'a' /* this is a comment */ 'b'&lt;/pre&gt;&lt;br /&gt;In Microsoft SQL Server 2008 we get: &lt;pre&gt;b&lt;br /&gt;-&lt;br /&gt;a&lt;br /&gt;&lt;br /&gt;(1 rows affected)&lt;/pre&gt;As you can see, MS SQL treats the query as &lt;code&gt;SELECT 'a' AS b FROM onerow&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Finally, in MySQL 5.1, we get:&lt;pre&gt;+----+&lt;br /&gt;| a  |&lt;br /&gt;+----+&lt;br /&gt;| ab |&lt;br /&gt;+----+&lt;br /&gt;1 row in set (0.00 sec)&lt;/pre&gt;So in MySQL, it's as if the comment isn't there at all, and as if the string literals &lt;code&gt;'a'&lt;/code&gt; and &lt;code&gt;'b'&lt;/code&gt; are actually just one string literal &lt;code&gt;'ab'&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;So what does the SQL standard say? 
In my copy of the 2003 edition, I find this (ISO/IEC 9075-2:2003 (E) 5.3 &amp;lt;literal&amp;gt;, page 145):&lt;blockquote&gt;Syntax Rules&lt;br /&gt;1) In a &lt;code&gt;&amp;lt;character string literal&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;national character string literal&amp;gt;&lt;/code&gt;, the sequence:&lt;pre&gt;&amp;lt;quote&amp;gt; &amp;lt;character representation&amp;gt;... &amp;lt;quote&amp;gt; &amp;lt;separator&amp;gt; &amp;lt;quote&amp;gt; &amp;lt;character representation&amp;gt;... &amp;lt;quote&amp;gt;&lt;/pre&gt;is equivalent to the sequence&lt;pre&gt;&amp;lt;quote&amp;gt; &amp;lt;character representation&amp;gt;... &amp;lt;character representation&amp;gt;... &amp;lt;quote&amp;gt;&lt;/pre&gt;&lt;/blockquote&gt;If we look up the definition of &lt;code&gt;&amp;lt;separator&amp;gt;&lt;/code&gt;, it reads: &lt;pre&gt;&amp;lt;separator&amp;gt; ::= { &amp;lt;comment&amp;gt; | &amp;lt;white space&amp;gt; }...&lt;/pre&gt;So in this case, MySQL does the "right" thing, and basically ignores the comment, treating &lt;code&gt;'a'&lt;/code&gt; and &lt;code&gt;'b'&lt;/code&gt; as a single string constant &lt;code&gt;'ab'&lt;/code&gt;.&lt;br /&gt;&lt;div style="background-color: red; color: white"&gt;UPDATE 1: As always, the devil is in the details. And trust me, the SQL standard has many of them (details that is - I'll leave it up to the reader to decide for the devils, although I have a suspicion in a particular direction). Read on, and make sure to read Nick's comment on this post - it turns out PostgreSQL seems to behave exactly according to the standard in this case.&lt;br /&gt;&lt;br /&gt;UPDATE 2: Serg also posted a comment citing yet another part of the standard that states that all comments implicitly count as a newline. This would mean that there doesn't have to be a literal newline character in or following the comment. 
In this case, my original remark that MySQL got it right would hold again.&lt;br /&gt;&lt;br /&gt;I should state that I think very highly of both Nick and Serg, and as far as I am concerned, they're both right. I can't help but see this as yet more support for my statement that the SQL standard is so complex it is almost or perhaps completely impossible to get it right. &lt;br /&gt;&lt;br /&gt;Do you find this too bold? If so, I'd really love to hear your thoughts on it. Please help us solve this paradox, I only want to understand what the standard really says.&lt;/div&gt;If you try the same thing with a single line comment, all products mentioned react the same as with the initial query, except for PostgreSQL, which now treats the query according to the standard.&lt;br /&gt;&lt;br /&gt;Now don't get me wrong. This post is not designed to bash or glorify any of the products mentioned. I think all of them are great in their own way. I am quite aware that although MySQL happens to adhere to the standard here, it violates it in other places. Finally, I should point out that I don't have a specific opinion on what the right behavior should be. I just want it to be the same on all platforms.&lt;br /&gt;&lt;br /&gt;At the same time, I realize that for SQL it's probably too late - up to an extent, incompatibility is considered normal, and database professionals tend to be specialized in particular products anyway. So I'm not holding my breath for the grand unification of SQL dialects.&lt;br /&gt;&lt;br /&gt;When I encountered this issue, I did have to think about that other rathole of incompatibilities I have to deal with professionally, which is web-browsers. An interesting development there is the HTML 5 specification, which actually &lt;a href="http://www.w3.org/TR/2010/WD-html5-20100304/syntax.html#parsing" target="html5"&gt;defines an algorithm for parsing HTML&lt;/a&gt; - even invalid HTML. 
This is quite different from the approach taken by most standards, which typically define only an abstract grammar, but leave the implementation entirely up to the vendors. In theory, providing parsing instructions as detailed as done in HTML 5 should make it easier to create correct parsers, and hopefully this will contribute to a more robust web.&lt;br /&gt;&lt;br /&gt;Anyway. That was all. Back to work...&lt;br /&gt;&lt;br /&gt;&lt;div style="background-color: red; color: white"&gt;UPDATE: I just heard that Sybase (unsurprisingly) behaves similarly to MS SQL for this query (that is, the query is valid, and returns &lt;code&gt;'a'&lt;/code&gt; in a column called &lt;code&gt;b&lt;/code&gt;). I checked SQLite myself, which is also in that camp.&lt;br /&gt;&lt;br /&gt;Nick pointed out that LucidDB also provides a standard compliant implementation, in other words, it behaves exactly like PostgreSQL for this particular query. However, Julian, who was and is closely involved in LucidDB, agrees with Serg that the comment should probably count as a newline, and filed a bug for LucidDB.&lt;br /&gt;&lt;br /&gt;I checked Firebird 2.1.3, and they are in the Oracle camp: in both cases, the query gives a syntax error.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-7122558264129037578?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/7122558264129037578/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=7122558264129037578' title='23 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7122558264129037578'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/15319370/posts/default/7122558264129037578'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/05/small-issue-of-sql-standards.html' title='A small issue of SQL standards'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>23</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-8185806028367157881</id><published>2010-05-25T11:00:00.001+02:00</published><updated>2010-05-25T12:06:56.266+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='PostgreSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle'/><title type='text'>MySQL, Oracle and NoSQL: In the grand scheme...</title><content type='html'>...NoSQL is just larger than a fly's dropping, and MySQL and Oracle are more alike than either of their respective fanboys would like to admit. 
&lt;br /&gt;&lt;br /&gt;Courtesy of Google trends:&lt;br /&gt;&lt;br /&gt;&lt;a target="google" href="http://www.google.com/trends?q=MySQL,+Oracle,+Microsoft+SQL+Server,+NoSQL,+PostgreSQL&amp;ctab=0&amp;geo=all&amp;date=all&amp;sort=0"&gt;&lt;iframe scrolling="no" frameborder="0" width="1024px" height="1024px"  src="http://www.google.com/trends?q=MySQL,+Oracle,+Microsoft+SQL+Server,+NoSQL,+PostgreSQL&amp;ctab=0&amp;geo=all&amp;date=all&amp;sort=0"&gt;&lt;/iframe&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I guess I won't be changing my career just yet.&lt;br /&gt;&lt;br /&gt;&lt;div style="background-color: red; margin: 1em; color: white"&gt;&lt;br /&gt;UPDATE: I tried a few terms for "Microsoft SQL Server" before posting (SQL Server, MS SQL) but found none that came up with what I felt was a realistic volume (they are all much, much lower than I expected). &lt;a href="http://twitter.com/MarkGStacey" target="MarkGStacey"&gt;@MarkGStacey&lt;/a&gt; suggested trying "SQL 2008", "SQL 2005" and "SQL 2000", and those return much better results indeed (though still much lower than MySQL or Oracle). Anyway - I'd love to have some way of bunching up all those terms and have Google Trends show them as one trend, but I haven't figured out a way to do that. If you know how, please drop a line and let me know. 
&lt;br /&gt;&lt;br /&gt;I'll adjust the blog if I find a more satisfactory solution.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-8185806028367157881?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/8185806028367157881/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=8185806028367157881' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8185806028367157881'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8185806028367157881'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/05/mysql-oracle-and-nosql-in-grand-scheme.html' title='MySQL, Oracle and NoSQL: In the grand scheme...'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5933188302518578702</id><published>2010-05-04T20:15:00.001+02:00</published><updated>2010-05-05T00:41:20.713+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='integer'/><title type='text'>MySQL: The maximum value of an integer</title><content type='html'>Did you ever have the need to find the maximum value of an integer in MySQL? Yeah, me neither. 
Anyway, &lt;a href="http://stackoverflow.com/questions/2679064/in-sql-how-do-i-get-the-maximum-value-for-an-integer/2679152" target="so"&gt;some people&lt;/a&gt; seem to need this, and this is what I came up with:&lt;pre&gt;&lt;br /&gt;SELECT ~0 as max_bigint_unsigned&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 32 AS max_int_unsigned&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 40 AS max_mediumint_unsigned&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 48 AS max_smallint_unsigned&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 56 AS max_tinyint_unsigned&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 1  AS max_bigint_signed&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 33 AS max_int_signed&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 41 AS max_mediumint_signed&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 49 AS max_smallint_signed&lt;br /&gt;,      ~0 &amp;gt;&amp;gt; 57 AS max_tinyint_signed&lt;br /&gt;\G&lt;br /&gt;&lt;br /&gt;*************************** 1. row ***************************&lt;br /&gt;   max_bigint_unsigned: 18446744073709551615&lt;br /&gt;      max_int_unsigned: 4294967295&lt;br /&gt;max_mediumint_unsigned: 16777215&lt;br /&gt; max_smallint_unsigned: 65535&lt;br /&gt;  max_tinyint_unsigned: 255&lt;br /&gt;     max_bigint_signed: 9223372036854775807&lt;br /&gt;        max_int_signed: 2147483647&lt;br /&gt;  max_mediumint_signed: 8388607&lt;br /&gt;   max_smallint_signed: 32767&lt;br /&gt;    max_tinyint_signed: 127&lt;br /&gt;1 row in set (0.00 sec)&lt;/pre&gt;&lt;br /&gt;In case you're wondering how it works, read up on what the tilde (&lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/bit-functions.html#operator_bitwise-invert" target="mysql"&gt;~&lt;/a&gt;&lt;/code&gt;) does: it performs a &lt;em&gt;bitwise negation&lt;/em&gt;. In other words, it flips bits that are &lt;code&gt;1&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt;, and vice versa. 
So, &lt;code&gt;~0&lt;/code&gt; means: set all the bits to &lt;code&gt;1&lt;/code&gt;, because in the integer &lt;code&gt;0&lt;/code&gt;, all the bits are a binary &lt;code&gt;0&lt;/code&gt;. Now, in MySQL, at runtime, there is only one integer type, which is an 8-byte integer value or a &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/numeric-types.html" target="mysql"&gt;bigint&lt;/a&gt;&lt;/code&gt;. Therefore, &lt;code&gt;~0&lt;/code&gt; is by definition the largest possible integer value. &lt;br /&gt;&lt;br /&gt;MySQL defines a family of integer types for storage: &lt;code&gt;bigint&lt;/code&gt; (8 bytes), &lt;code&gt;int&lt;/code&gt; (4 bytes), &lt;code&gt;mediumint&lt;/code&gt; (3 bytes), &lt;code&gt;smallint&lt;/code&gt; (2 bytes) and &lt;code&gt;tinyint&lt;/code&gt; (1 byte). To find the maximum values of those types, we can use the right-bitshift operator &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/bit-functions.html#operator_right-shift" target="mysql"&gt;&amp;gt;&amp;gt;&lt;/a&gt;&lt;/code&gt; to push the most-significant bits at the left side of &lt;code&gt;~0&lt;/code&gt; down to the right, for the appropriate number of bytes to get the maximum values of the other integer flavors. 
So, &lt;br /&gt;&lt;pre&gt;&lt;br /&gt;int type: big                                int      medium    small    tiny&lt;br /&gt;bit #:    64       56       48       40       32       24       16        8      1  &lt;br /&gt;~0       = 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 = 18446744073709551615&lt;br /&gt;~0 &amp;gt;&amp;gt; &lt;b&gt;32&lt;/b&gt; = &lt;b&gt;00000000 00000000 00000000 00000000&lt;/b&gt; 11111111 11111111 11111111 11111111 = 4294967295&lt;br /&gt;~0 &amp;gt;&amp;gt; &lt;b&gt;40&lt;/b&gt; = &lt;b&gt;00000000 00000000 00000000 00000000 00000000&lt;/b&gt; 11111111 11111111 11111111 = 16777215&lt;br /&gt;~0 &amp;gt;&amp;gt; &lt;b&gt;48&lt;/b&gt; = &lt;b&gt;00000000 00000000 00000000 00000000 00000000 00000000&lt;/b&gt; 11111111 11111111 = 65535&lt;br /&gt;~0 &amp;gt;&amp;gt; &lt;b&gt;56&lt;/b&gt; = &lt;b&gt;00000000 00000000 00000000 00000000 00000000 00000000 00000000&lt;/b&gt; 11111111 = 255&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now, for each of the integer flavors, MySQL lets you define them to be either signed or unsigned. This is implemented using a so-called &lt;em&gt;sign bit&lt;/em&gt;. The sign bit is the most significant bit (so, bit #64 in a &lt;code&gt;bigint&lt;/code&gt;, bit #32 in an &lt;code&gt;int&lt;/code&gt;, and so on and so forth). If the sign bit equals 0, the integer is positive and if it equals 1, the integer is negative. 
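&lt;br /&gt;&lt;br /&gt;The shift arithmetic can also be cross-checked outside MySQL. The following is only a minimal Python sketch, not anything MySQL itself runs: Python integers are unbounded, so &lt;code&gt;~0&lt;/code&gt; would evaluate to -1 there, and the 64-bit all-ones value is therefore emulated with a hypothetical constant &lt;code&gt;ALL_ONES_64&lt;/code&gt;. The helper names &lt;code&gt;max_unsigned&lt;/code&gt; and &lt;code&gt;max_signed&lt;/code&gt; are likewise made up for this illustration.&lt;pre&gt;
```python
# Cross-check of the shift arithmetic outside MySQL.
# Assumption: Python integers are unbounded, so ~0 evaluates to -1 here;
# the 64-bit all-ones value that MySQL's ~0 yields is emulated explicitly.
ALL_ONES_64 = 2**64 - 1  # same value as MySQL's ~0

# Storage size in bits for each MySQL integer type (sizes taken from the text).
TYPE_BITS = {"bigint": 64, "int": 32, "mediumint": 24, "smallint": 16, "tinyint": 8}


def max_unsigned(bits):
    """Shift away the high bits that the type does not store."""
    return ALL_ONES_64 >> (64 - bits)


def max_signed(bits):
    """Shift one position further so the sign bit is cleared as well."""
    return ALL_ONES_64 >> (64 - bits + 1)


for name, bits in TYPE_BITS.items():
    print(name, max_unsigned(bits), max_signed(bits))
```
&lt;/pre&gt;Each printed line shows a type name followed by its unsigned and signed maximum, so the numbers can be compared with the MySQL output at the top of this post.&lt;br /&gt;&lt;br /&gt;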
So, to get the maximum values for the signed integer flavors, we can use the same recipe, we just need to push the bits even one more bit to the right, like so:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;int type: big                                int      medium    small    tiny&lt;br /&gt;bit #:    64       56       48       40       32       24       16        8      1  &lt;br /&gt;~0 &amp;gt;&amp;gt;  1 = &lt;b&gt;0&lt;/b&gt;1111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 = 9223372036854775807&lt;br /&gt;~0 &amp;gt;&amp;gt; 33 = 00000000 00000000 00000000 00000000 &lt;b&gt;0&lt;/b&gt;1111111 11111111 11111111 11111111 = 2147483647&lt;br /&gt;~0 &amp;gt;&amp;gt; 41 = 00000000 00000000 00000000 00000000 00000000 &lt;b&gt;0&lt;/b&gt;1111111 11111111 11111111 = 8388607&lt;br /&gt;~0 &amp;gt;&amp;gt; 49 = 00000000 00000000 00000000 00000000 00000000 00000000 &lt;b&gt;0&lt;/b&gt;1111111 11111111 = 32767&lt;br /&gt;~0 &amp;gt;&amp;gt; 57 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 &lt;b&gt;0&lt;/b&gt;1111111 = 127&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5933188302518578702?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5933188302518578702/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5933188302518578702' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5933188302518578702'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5933188302518578702'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/05/mysql-maximum-value-of-integer.html' title='MySQL: The maximum value of an 
integer'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5022183796037634328</id><published>2010-04-20T01:35:00.001+02:00</published><updated>2010-05-01T15:33:04.314+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql 5.5'/><category scheme='http://www.blogger.com/atom/ns#' term='xsltproc'/><category scheme='http://www.blogger.com/atom/ns#' term='XML'/><category scheme='http://www.blogger.com/atom/ns#' term='LOAD XML'/><category scheme='http://www.blogger.com/atom/ns#' term='XSLT'/><category scheme='http://www.blogger.com/atom/ns#' term='mysqldump'/><title type='text'>Restoring XML-formatted MySQL dumps</title><content type='html'>&lt;span style="display:none"&gt;EAVB_VFZUHIARHI&lt;/span&gt; To whom it may concern - &lt;br /&gt;&lt;br /&gt;The &lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html"&gt;mysqldump&lt;/a&gt; program can be used to make logical database backups. Although the vast majority of people use it to create SQL dumps, it is possible to dump both schema structure and data in XML format. There are a few bugs (&lt;a target="mysql" href="http://bugs.mysql.com/bug.php?id=52792"&gt;#52792&lt;/a&gt;, &lt;a target="mysql" href="http://bugs.mysql.com/bug.php?id=52793"&gt;#52793&lt;/a&gt;) in this feature, but these are not the topic of this post.&lt;h3&gt;XML output from mysqldump&lt;/h3&gt;Dumping in XML format is done with &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html#option_mysqldump_xml" target="mysql"&gt;the --xml or -X option&lt;/a&gt;. 
In addition, you should use the &lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html#option_mysqldump_hex-blob"&gt;--hex-blob&lt;/a&gt; option, otherwise the BLOB data will be dumped as raw binary data, which usually results in characters that are not valid, either according to the XML spec or according to the UTF-8 encoding. (Arguably, this is also a bug. I haven't filed it though.)&lt;br /&gt;&lt;br /&gt;For example, a line like:&lt;pre&gt;&lt;br /&gt;mysqldump -uroot -pmysql -X --hex-blob --databases sakila&lt;br /&gt;&lt;/pre&gt; dumps the sakila database to the following XML format: &lt;pre&gt;&lt;br /&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;br /&gt;&amp;lt;mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&amp;gt;&lt;br /&gt;  &amp;lt;database name="sakila"&amp;gt;&lt;br /&gt;    &amp;lt;table_structure name="actor"&amp;gt;&lt;br /&gt;      &amp;lt;field Field="actor_id" Type="smallint(5) unsigned" Null="NO" Key="PRI" Extra="auto_increment" /&amp;gt;&lt;br /&gt;      &amp;lt;field Field="first_name" Type="varchar(45)" Null="NO" Key="" Extra="" /&amp;gt;&lt;br /&gt;      &amp;lt;field Field="last_name" Type="varchar(45)" Null="NO" Key="MUL" Extra="" /&amp;gt;&lt;br /&gt;      &amp;lt;field Field="last_update" Type="timestamp" Null="NO" Key="" Default="CURRENT_TIMESTAMP" Extra="on update CURRENT_TIMESTAMP" /&amp;gt;&lt;br /&gt;      &amp;lt;key Table="actor" Non_unique="0" Key_name="PRIMARY" Seq_in_index="1" Column_name="actor_id" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" /&amp;gt;&lt;br /&gt;      &amp;lt;key Table="actor" Non_unique="1" Key_name="idx_actor_last_name" Seq_in_index="1" Column_name="last_name" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" /&amp;gt;&lt;br /&gt;      &amp;lt;options Name="actor" Engine="InnoDB" Version="10" Row_format="Compact" Rows="200" Avg_row_length="81" Data_length="16384" Max_data_length="0" Index_length="16384" 
Data_free="233832448" Auto_increment="201" Create_time="2009-10-10 10:04:56" Collation="utf8_general_ci" Create_options="" Comment="" /&amp;gt;&lt;br /&gt;    &amp;lt;/table_structure&amp;gt;&lt;br /&gt;    &amp;lt;table_data name="actor"&amp;gt;&lt;br /&gt;      &amp;lt;row&amp;gt;&lt;br /&gt;        &amp;lt;field name="actor_id"&amp;gt;1&amp;lt;/field&amp;gt;&lt;br /&gt;        &amp;lt;field name="first_name"&amp;gt;PENELOPE&amp;lt;/field&amp;gt;&lt;br /&gt;        &amp;lt;field name="last_name"&amp;gt;GUINESS&amp;lt;/field&amp;gt;&lt;br /&gt;        &amp;lt;field name="last_update"&amp;gt;2006-02-15 03:34:33&amp;lt;/field&amp;gt;&lt;br /&gt;      &amp;lt;/row&amp;gt;&lt;br /&gt;&lt;br /&gt;...many more rows and table structures...&lt;br /&gt;&lt;br /&gt;  &amp;lt;/database&amp;gt;&lt;br /&gt;&amp;lt;/mysqldump&amp;gt;&lt;br /&gt;&lt;/pre&gt;I don't want to spend too much time discussing why it would be useful to make backups in this way. There are definitely a few drawbacks - for example, for sakila, the plain SQL dump, even with &lt;code&gt;--hex-blob&lt;/code&gt;, is 3.26 MB (3,429,358 bytes), whereas the XML output is 13.7 MB (14,415,665 bytes). Even after zip compression, the XML-formatted dump is still one third larger than the plain SQL dump: 936 kB versus 695 kB.&lt;h3&gt;Restoring XML output from mysqldump&lt;/h3&gt;A more serious problem is that MySQL doesn't seem to offer any tool to restore XML-formatted dumps. The &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.5/en/load-xml.html" target="mysql"&gt;LOAD XML&lt;/a&gt;&lt;/code&gt; feature, kindly &lt;a href="http://eriksdiary.blogspot.com/2007/08/load-xml-contribution-added-to-mysql.html" target="erik"&gt;contributed&lt;/a&gt; by &lt;a href="http://www.blogger.com/profile/07865795210715993795" target="erik"&gt;Erik Wetterberg&lt;/a&gt;, could be used to some extent for this purpose. 
However, this feature is not yet available (it will be available in the upcoming version &lt;a href="http://dev.mysql.com/downloads/mysql/5.5.html" target="mysql"&gt;MySQL 5.5&lt;/a&gt;), and from what I can tell, it can only load data - not restore tables or databases. I also believe that this feature does not (yet) provide any way to properly restore hex-dumped BLOB data, but I really should test it to know for sure. &lt;br /&gt;&lt;br /&gt;Anyway.&lt;br /&gt;&lt;br /&gt;Between sessions at the past MySQL Users Conference, I cobbled together an &lt;a href="http://en.wikipedia.org/wiki/Xslt" target="xslt"&gt;XSLT stylesheet&lt;/a&gt; that can convert &lt;code&gt;mysqldump&lt;/code&gt;'s XML output back to SQL script output. It is available under the LGPL license, and it is hosted on Google Code as the &lt;a href="http://code.google.com/p/mysqldump-x-restore/" target="mysqldump-x-restore"&gt;mysqldump-x-restore&lt;/a&gt; project. To get started, you need to download the &lt;a href="http://mysqldump-x-restore.googlecode.com/files/mysqldump-xml-to-sql.xslt" target="mysqldump-x-restore"&gt;mysqldump-xml-to-sql.xslt&lt;/a&gt; XSLT stylesheet. You also need a command line XSLT processor, like &lt;a href="http://xmlsoft.org/XSLT/xsltproc2.html" target="xslt"&gt;xsltproc&lt;/a&gt;. This utility is part of the Gnome libxslt project, and is included in packages for most Linux distributions. 
There is a Windows port available for which you can &lt;a href="http://www.zlatkovic.com/libxml.en.html" target="xslt"&gt;download the binaries&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Assuming that &lt;code&gt;xsltproc&lt;/code&gt; is in your path, and the XML dump and the &lt;code&gt;mysqldump-xml-to-sql.xslt&lt;/code&gt; are in the current working directory, you can use this command to convert the XML dump to SQL: &lt;pre&gt;&lt;br /&gt;xsltproc mysqldump-xml-to-sql.xslt sakila.xml &amp;gt; sakila.sql&lt;br /&gt;&lt;/pre&gt;On Unix-based systems you should be able to pipe the SQL directly into &lt;code&gt;mysql&lt;/code&gt; using&lt;pre&gt;&lt;br /&gt;xsltproc mysqldump-xml-to-sql.xslt sakila.xml | mysql -uroot -pmysql&lt;br /&gt;&lt;/pre&gt;The stylesheet comes with a number of options, which can be set through &lt;code&gt;xsltproc&lt;/code&gt;'s &lt;code&gt;--stringparam&lt;/code&gt; option. For example, setting the &lt;code&gt;schema&lt;/code&gt; parameter to &lt;code&gt;N&lt;/code&gt; will result in an SQL script that only contains DML statements:&lt;pre&gt;&lt;br /&gt;xsltproc --stringparam schema N mysqldump-xml-to-sql.xslt sakila.xml &amp;gt; sakila.sql&lt;/pre&gt;Setting the &lt;code&gt;data&lt;/code&gt; option to &lt;code&gt;N&lt;/code&gt; will result in an SQL script that only contains DDL statements:&lt;pre&gt;&lt;br /&gt;xsltproc --stringparam data N mysqldump-xml-to-sql.xslt sakila.xml &amp;gt; sakila.sql&lt;/pre&gt;There are additional options to control how often a &lt;code&gt;COMMIT&lt;/code&gt; should be issued, whether to add &lt;code&gt;DROP&lt;/code&gt; statements, whether to generate single row &lt;code&gt;INSERT&lt;/code&gt; statements, and to set the &lt;code&gt;max_allowed_packet&lt;/code&gt; size.&lt;h3&gt;What's next?&lt;/h3&gt;Nothing much really. I don't really recommend that people use &lt;code&gt;mysqldump&lt;/code&gt;'s XML output. 
I wrote mysqldump-x-restore for people who inherited a bunch of XML-formatted dumps and don't know what to do with them. I haven't thoroughly tested it - please file a bug if you find one. If you actually think it's useful and you want more features, please let me know, and I'll look into it. I don't have much use for this myself, so if you have great ideas to move this forward, I'll let you have commit access.&lt;br /&gt;&lt;br /&gt;That is all.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5022183796037634328?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5022183796037634328/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5022183796037634328' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5022183796037634328'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5022183796037634328'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/04/restoring-xml-formatted-mysql-dumps.html' title='Restoring XML-formatted MySQL dumps'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5302892138341606385</id><published>2010-04-14T17:20:00.002+02:00</published><updated>2010-04-15T17:21:42.690+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql stored routine'/><category scheme='http://www.blogger.com/atom/ns#' 
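&lt;br /&gt;&lt;br /&gt;To make the idea behind such a conversion concrete, here is a minimal Python sketch that turns the &lt;code&gt;row&lt;/code&gt; elements of one &lt;code&gt;table_data&lt;/code&gt; element into &lt;code&gt;INSERT&lt;/code&gt; statements. It is not the XSLT stylesheet and makes no claim to match its output; the function names &lt;code&gt;sql_literal&lt;/code&gt; and &lt;code&gt;rows_to_inserts&lt;/code&gt; are hypothetical, every value is quoted as a string, and schema structure, hex-dumped BLOBs and transaction handling are ignored. The sample row is built with the ElementTree API rather than parsed from a file, so the sketch is self-contained.&lt;pre&gt;
```python
import xml.etree.ElementTree as ET

# mysqldump marks NULL values with an xsi:nil="true" attribute.
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"


def sql_literal(field):
    """Render one field element as an SQL literal (always a quoted string or NULL)."""
    if field.get(XSI_NIL) == "true":
        return "NULL"
    text = field.text or ""
    # Escape backslashes first, then double any single quotes.
    return "'" + text.replace("\\", "\\\\").replace("'", "''") + "'"


def rows_to_inserts(table_data):
    """Yield one INSERT statement per row element of a table_data element."""
    table = table_data.get("name")
    for row in table_data.findall("row"):
        fields = row.findall("field")
        columns = ", ".join("`" + f.get("name") + "`" for f in fields)
        values = ", ".join(sql_literal(f) for f in fields)
        yield "INSERT INTO `" + table + "` (" + columns + ") VALUES (" + values + ");"


# Rebuild the sample actor row from the dump shown earlier in this post.
table_data = ET.Element("table_data", {"name": "actor"})
row = ET.SubElement(table_data, "row")
for name, value in [
    ("actor_id", "1"),
    ("first_name", "PENELOPE"),
    ("last_name", "GUINESS"),
    ("last_update", "2006-02-15 03:34:33"),
]:
    ET.SubElement(row, "field", {"name": name}).text = value

for statement in rows_to_inserts(table_data):
    print(statement)
```
&lt;/pre&gt;For a real dump, the same function could be applied to each &lt;code&gt;table_data&lt;/code&gt; element found by parsing the file, but for anything beyond a quick experiment the stylesheet is the better starting point.&lt;br /&gt;&lt;br /&gt;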
term='Oracle / Sun deal'/><category scheme='http://www.blogger.com/atom/ns#' term='Database Stored Procedures'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL UC'/><title type='text'>MySQL Conference 2010 Presentation: Optimizing Stored Routines</title><content type='html'>Yesterday I delivered my presentation for the &lt;a href="http://en.oreilly.com/mysql2010" target="mysqluc"&gt;MySQL User Conference and Expo 2010&lt;/a&gt;: &lt;a target="mysqluc" href="http://en.oreilly.com/mysql2010/public/schedule/detail/13035"&gt;Optimizing MySQL Stored Routines&lt;/a&gt;. If you are interested in the slides, you can find them on both the &lt;a href="http://assets.en.oreilly.com/1/event/36/Optimizing%20MySQL%20Stored%20Routines%20Presentation%201.pdf" target="mysqluc"&gt;MySQL conference site&lt;/a&gt; and &lt;a href="http://www.slideshare.net/guest82ee09/optimizing-mysql-stored-routines-uc2010" target="slideshare"&gt;slideshare.net&lt;/a&gt;. Here's the abstract of my presentation so you can decide if this is interesting for you: &lt;blockquote&gt;MySQL stored routines (functions, procedures, triggers and events) can be useful. But many casually written stored routines are unnecessarily slow. The main reason is that MySQL does not apply even simple code optimizations to stored routine code. Many developers are not aware of this, and as a result, write stored routine code that can quite easily be tuned, increasing performance by 50%-100% by only applying very straightforward code optimizations.&lt;/blockquote&gt;I was very pleased to see so many people attend: I had the impression that MySQL stored routines are quite unpopular, due to performance issues, and a syntax that is often regarded as "clunky", so I didn't expect more than about 20 people to show up. Much to my pleasure, the ballroom was about two-thirds full, and I estimate there were 70-something people in the room. 
&lt;br /&gt;&lt;br /&gt;A quick survey of the audience indicated that all of them were in fact using stored routines in production, so I assume they didn't show up out of morbid curiosity :) Interestingly, only a few people reported performance issues. It would be interesting to do more research to find out what people are in fact doing with MySQL stored routines. Among yesterday's attendees, there were people using MySQL stored routines for managing user privileges, processing astronomical data, and checking complex dynamic business rules. To be sure - these were all different users - not just one isolated fanatic going wild with stored routines.&lt;br /&gt;&lt;br /&gt;Coincidentally, &lt;a href="http://mituzas.lt/" target="domas"&gt;Domas Mituzas&lt;/a&gt; from Facebook also mentioned stored routines in &lt;a href="http://en.oreilly.com/mysql2010/public/schedule/detail/13285" target="mysqluc"&gt;his presentation on high concurrency MySQL&lt;/a&gt; as a way to reduce the lock gap when performing multiple changes in a single transaction. I'm just saying - perhaps MySQL stored routines aren't that bad at all, they just need more love and dedication from the MySQL developers so they can mature and gain wider applicability. &lt;br /&gt;&lt;br /&gt;I &lt;a href="http://rpbouman.blogspot.com/2009/12/validating-mysql-data-entry-with_15.html" target="rpb"&gt;recently wrote&lt;/a&gt; about an improvement in &lt;a href="http://dev.mysql.com/downloads/mysql/5.5.html" target="mysql"&gt;MySQL 5.5&lt;/a&gt;, the long anticipated &lt;a href="http://dev.mysql.com/doc/refman/5.5/en/signal-resignal.html" target="mysql"&gt;SIGNAL / RESIGNAL&lt;/a&gt; syntax. I hope more improvements will follow soon now the dust is settling after Oracle's acquisition of Sun. 
After hearing &lt;a href="http://en.oreilly.com/mysql2010/public/schedule/speaker/78864" target="mysqluc"&gt;Edward Screven&lt;/a&gt; unfold Oracle's strategy for MySQL in &lt;a href="http://en.oreilly.com/mysql2010/public/schedule/detail/12440" target="mysqluc"&gt;yesterday's keynote&lt;/a&gt;, I can tell you without reservation that I am quite optimistic :)&lt;br /&gt;&lt;br /&gt;Anyway - that is all for now. Two days of conference ahead :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5302892138341606385?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5302892138341606385/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5302892138341606385' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5302892138341606385'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5302892138341606385'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/04/mysql-conference-2010-presentation.html' title='MySQL Conference 2010 Presentation: Optimizing Stored Routines'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-6531770173905277751</id><published>2010-03-30T11:00:00.005+02:00</published><updated>2010-03-30T23:29:49.116+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='partitioning'/><category scheme='http://www.blogger.com/atom/ns#' 
term='MySQLForge'/><category scheme='http://www.blogger.com/atom/ns#' term='mysqlconf'/><category scheme='http://www.blogger.com/atom/ns#' term='DBA'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL command line tool'/><category scheme='http://www.blogger.com/atom/ns#' term='information_schema'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL Conference'/><category scheme='http://www.blogger.com/atom/ns#' term='Giuseppe Maxia'/><category scheme='http://www.blogger.com/atom/ns#' term='mysqldump'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='André Simões'/><category scheme='http://www.blogger.com/atom/ns#' term='backup'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL command line client'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL UC'/><title type='text'>MySQL: Partition-wise backups with mysqldump</title><content type='html'>To whom it may concern,&lt;br /&gt;&lt;br /&gt;in response to &lt;a target="andre" href="http://twitter.com/ITXpander/status/11257597174"&gt;a query&lt;/a&gt; from André Simões (also known as &lt;a href="http://itxpander.wordpress.com/" target="andre"&gt;ITXpander&lt;/a&gt;), I slapped together a MySQL script that outputs &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html" target="mysql"&gt;mysqldump&lt;/a&gt;&lt;/code&gt; commands for backing up individual &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/partitioning.html" target="mysql"&gt;partitions&lt;/a&gt; of the tables in the current schema. The script is maintained as &lt;a target="mysqlforge" href="http://forge.mysql.com/tools/tool.php?id=258"&gt;a snippet&lt;/a&gt; at &lt;a href="http://forge.mysql.com/" target="mysqlforge"&gt;MySQL Forge&lt;/a&gt;. 
&lt;h3&gt;How it works&lt;/h3&gt;The script works by querying the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/partitions-table.html" target="mysql"&gt;information_schema.PARTITIONS&lt;/a&gt;&lt;/code&gt; system view to generate an appropriate expression for mysqldump's &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html#option_mysqldump_where" target="mysql"&gt;--where&lt;/a&gt;&lt;/code&gt; option. The generated command also redirects the output to a file with this name pattern:&lt;pre&gt;&amp;lt;schema&amp;gt;.&amp;lt;table&amp;gt;.&amp;lt;partition-name&amp;gt;.sql&lt;/pre&gt;For example, for this table (&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-types.html" target="mysql"&gt;taken from&lt;/a&gt; the MySQL reference manual):&lt;pre&gt;CREATE TABLE members (&lt;br /&gt;    firstname VARCHAR(25) NOT NULL,&lt;br /&gt;    lastname VARCHAR(25) NOT NULL,&lt;br /&gt;    username VARCHAR(16) NOT NULL,&lt;br /&gt;    email VARCHAR(35),&lt;br /&gt;    joined DATE NOT NULL&lt;br /&gt;)&lt;br /&gt;PARTITION BY RANGE( YEAR(joined) ) (&lt;br /&gt;    PARTITION p0 VALUES LESS THAN (1960),&lt;br /&gt;    PARTITION p1 VALUES LESS THAN (1970),&lt;br /&gt;    PARTITION p2 VALUES LESS THAN (1980),&lt;br /&gt;    PARTITION p3 VALUES LESS THAN (1990),&lt;br /&gt;    PARTITION p4 VALUES LESS THAN MAXVALUE&lt;br /&gt;);&lt;/pre&gt; the script generates the following commands:&lt;pre&gt;mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) &amp;lt; 1960" test members &amp;gt; test.members.p0.sql&lt;br /&gt;mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) &amp;gt;= 1960 and  YEAR(joined) &amp;lt; 1970" test members &amp;gt; test.members.p1.sql&lt;br /&gt;mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) &amp;gt;= 1970 and  YEAR(joined) &amp;lt; 1980" test members &amp;gt; test.members.p2.sql&lt;br /&gt;mysqldump 
--user=username --password=password --no-create-info --where=" YEAR(joined) &amp;gt;= 1980 and  YEAR(joined) &amp;lt; 1990" test members &amp;gt; test.members.p3.sql&lt;br /&gt;mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) &amp;gt;= 1990 and  YEAR(joined) &amp;lt; 18446744073709551615" test members &amp;gt; test.members.p4.sql&lt;/pre&gt;Tip: in order to obtain directly executable output from the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql.html" target="mysql"&gt;mysql&lt;/a&gt;&lt;/code&gt; command line tool, run the script with the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-command-options.html#option_mysql_skip-column-names" target="mysql"&gt;--skip-column-names&lt;/a&gt;&lt;/code&gt; (or &lt;code&gt;-N&lt;/code&gt;) option.&lt;h3&gt;Features&lt;/h3&gt;Currently, the script supports the following &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-types.html" target="mysql"&gt;partitioning methods&lt;/a&gt;:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;&lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-hash.html"&gt;HASH&lt;/a&gt;&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;&lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-list.html"&gt;LIST&lt;/a&gt;&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;&lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-range.html"&gt;RANGE&lt;/a&gt;&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;h3&gt;Limitations&lt;/h3&gt;The &lt;code&gt;&lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-hash.html"&gt;LINEAR HASH&lt;/a&gt;&lt;/code&gt; method is currently not supported, but I may implement that in the future. 
&lt;br /&gt;&lt;br /&gt;Currently I do not have plans to implement the &lt;code&gt;&lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-key.html"&gt;KEY&lt;/a&gt;&lt;/code&gt; and &lt;code&gt;&lt;a target="mysql" href="http://dev.mysql.com/doc/refman/5.1/en/partitioning-linear-key.html"&gt;LINEAR KEY&lt;/a&gt;&lt;/code&gt; partitioning methods, but I may reconsider if and when I have more information about the storage-engine specific partitioning functions used by these methods.&lt;br /&gt;&lt;br /&gt;Finally, I should point out that querying the &lt;code&gt;information_schema.PARTITIONS&lt;/code&gt; table is dog-slow. This may not be too big of an issue, however it is pretty annoying. If anybody has some tips to increase performance, please let me know.&lt;h3&gt;Acknowledgements&lt;/h3&gt;Thanks to André for posing the problem. I had a fun hour of procrastination to implement this, and it made me read part of the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/partitioning.html" target="mysql"&gt;MySQL reference manual on partitioning&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;I also would like to thank Giuseppe Maxia (&lt;a href="http://datacharmer.blogspot.com/" target="giuseppe"&gt;the Datacharmer&lt;/a&gt;) for providing valuable feedback. 
If you're interested in either partitioning or the mysql command line, you should visit &lt;a href="http://en.oreilly.com/mysql2010/public/schedule/speaker/32" target="mysqlconf"&gt;his tutorials&lt;/a&gt; at the &lt;a href="http://en.oreilly.com/mysql2010/" target="mysqlconf"&gt;MySQL conference&lt;/a&gt;, April 12-15, 2010.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-6531770173905277751?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/6531770173905277751/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=6531770173905277751' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6531770173905277751'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6531770173905277751'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/03/mysql-partition-wise-backups-with.html' title='MySQL: Partition-wise backups with mysqldump'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-2938328248767492139</id><published>2010-03-19T02:45:00.003+01:00</published><updated>2010-03-21T19:59:45.939+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Percentile'/><category scheme='http://www.blogger.com/atom/ns#' term='Median'/><category scheme='http://www.blogger.com/atom/ns#' 
term='GROUP_CONCAT'/><category scheme='http://www.blogger.com/atom/ns#' term='Alex Bolenok'/><category scheme='http://www.blogger.com/atom/ns#' term='GROUP BY'/><category scheme='http://www.blogger.com/atom/ns#' term='Explain Extended'/><title type='text'>Greatest N per group: top 3 with GROUP_CONCAT()</title><content type='html'>In my opinion, one of the best things that happened to &lt;a href="http://planet.mysql.com/" target="alex"&gt;Planet MySQL&lt;/a&gt; lately, is &lt;a href="http://explainextended.com/" target="alex"&gt;Explain Extended&lt;/a&gt;, a blog by Alex Bolenok (also known as &lt;a href="http://stackoverflow.com/users/55159/quassnoi" target="stackoverflow"&gt;Quassnoi&lt;/a&gt; on &lt;a href="http://stackoverflow.com/" target="stackoverflow"&gt;Stackoverflow&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;I never had the pleasure of meeting Alex in person, but his articles are always interesting and of high quality, and the SQL wizardry he pulls off is downright inspiring. I really feel humbled by the creativity of some of his solutions and his apparent experience with multiple RDBMS products. &lt;br /&gt;&lt;br /&gt;Alex' &lt;a href="http://explainextended.com/2010/03/18/greatest-n-per-group-dealing-with-aggregates/" target="alex"&gt;most recent post&lt;/a&gt; is about aggregation, and finding a top 3 based on the aggregate:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;In &lt;strong&gt;MySQL&lt;/strong&gt; I have a table called &lt;code&gt;meanings&lt;/code&gt; with three columns: &lt;code&gt;word&lt;/code&gt;, &lt;code&gt;meaning&lt;/code&gt;, &lt;code&gt;person&lt;/code&gt;. &lt;code&gt;word&lt;/code&gt; has &lt;strong&gt;16&lt;/strong&gt; possible values, &lt;code&gt;meaning&lt;/code&gt; has &lt;strong&gt;26&lt;/strong&gt;. A person assigns one or more meanings to each word. In the sample above, person &lt;strong&gt;1&lt;/strong&gt; assigned two meanings to word &lt;strong&gt;2&lt;/strong&gt;. There will be thousands of persons. 
I need to find the top three meanings for each of the &lt;strong&gt;16&lt;/strong&gt; words, with their frequencies. Is it possible to solve this with a single &lt;strong&gt;MySQL&lt;/strong&gt; query?&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;Alex presents a solution that uses &lt;code&gt;GROUP_CONCAT&lt;/code&gt; basically as a poor man's windowing function, a technique I have described on several occasions in the past for &lt;a href="http://rpbouman.blogspot.com/2009/09/mysql-another-ranking-trick.html" target="rpb"&gt;ranking&lt;/a&gt;, &lt;a href="http://rpbouman.blogspot.com/2007/12/calculating-financial-median-in-mysql.html" target="rpb"&gt;median&lt;/a&gt; and &lt;a href="http://rpbouman.blogspot.com/2008/07/calculating-percentiles-with-mysql.html" target="rpb"&gt;percentile&lt;/a&gt; solutions in MySQL.&lt;br /&gt;&lt;br /&gt;Now, Alex' solution is very clever and there are some elements that I think are very creative. That said, I think his solution can be improved still. Normally I wouldn't write a blog about it, and simply leave a comment on his blog, but his blog supports comments only for general articles, which is why I present it here:&lt;pre&gt;&lt;br /&gt;SELECT  word&lt;br /&gt;,       CONCAT(&lt;br /&gt;            SUBSTRING_INDEX(&lt;br /&gt;                GROUP_CONCAT(meaning ORDER BY num DESC), ',', 1&lt;br /&gt;            )&lt;br /&gt;        ,   ' ('&lt;br /&gt;        ,   SUBSTRING_INDEX(&lt;br /&gt;                GROUP_CONCAT(num ORDER BY num DESC), ',', 1&lt;br /&gt;            ) / SUM(num) * 100&lt;br /&gt;        ,   '%)'&lt;br /&gt;        ) rank1&lt;br /&gt;,       CONCAT(&lt;br /&gt;            SUBSTRING_INDEX(&lt;br /&gt;                SUBSTRING_INDEX(&lt;br /&gt;                    GROUP_CONCAT(meaning ORDER BY num DESC), ',', 2&lt;br /&gt;                ), ',', -1&lt;br /&gt;            )&lt;br /&gt;        ,   ' ('&lt;br /&gt;        ,   SUBSTRING_INDEX(&lt;br /&gt;                SUBSTRING_INDEX(&lt;br /&gt;                
    GROUP_CONCAT(num ORDER BY num DESC), ',', 2&lt;br /&gt;                ), ',', -1&lt;br /&gt;            ) / SUM(num) * 100&lt;br /&gt;        ,   '%)'&lt;br /&gt;        ) rank2&lt;br /&gt;,       CONCAT(&lt;br /&gt;            SUBSTRING_INDEX(&lt;br /&gt;                SUBSTRING_INDEX(&lt;br /&gt;                    GROUP_CONCAT(meaning ORDER BY num DESC), ',', 3&lt;br /&gt;            ), ',', -1)&lt;br /&gt;        ,   ' ('&lt;br /&gt;        ,   SUBSTRING_INDEX(&lt;br /&gt;                SUBSTRING_INDEX(&lt;br /&gt;                    GROUP_CONCAT(num ORDER BY num DESC), ',', 3&lt;br /&gt;                ), ',', -1&lt;br /&gt;            ) / SUM(num) * 100&lt;br /&gt;        ,   '%)'&lt;br /&gt;        ) rank3&lt;br /&gt;FROM    (&lt;br /&gt;        SELECT      word, meaning, COUNT(*) num&lt;br /&gt;        FROM        t_meaning m&lt;br /&gt;        GROUP BY    word,meaning&lt;br /&gt;        ) a&lt;br /&gt;GROUP BY word&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This gives me output like this:&lt;pre&gt;&lt;br /&gt;+------+--------------+--------------+--------------+&lt;br /&gt;| word | rank1        | rank2        | rank3        |&lt;br /&gt;+------+--------------+--------------+--------------+&lt;br /&gt;|    1 | 16 (3.9728%) | 17 (3.9648%) | 12 (3.9632%) |&lt;br /&gt;|    2 | 9 (3.9792%)  | 10 (3.9632%) | 20 (3.9328%) |&lt;br /&gt;|    3 | 20 (3.9744%) | 13 (3.968%)  | 1 (3.9648%)  |&lt;br /&gt;|    4 | 26 (3.952%)  | 7 (3.9456%)  | 17 (3.9424%) |&lt;br /&gt;|    5 | 9 (4.008%)   | 21 (3.9824%) | 20 (3.936%)  |&lt;br /&gt;|    6 | 19 (3.9504%) | 10 (3.9488%) | 13 (3.9408%) |&lt;br /&gt;|    7 | 23 (4.0464%) | 12 (3.976%)  | 19 (3.9648%) |&lt;br /&gt;|    8 | 23 (4.0112%) | 3 (4.0096%)  | 8 (3.9328%)  |&lt;br /&gt;|    9 | 10 (4.016%)  | 19 (3.984%)  | 15 (3.9616%) |&lt;br /&gt;|   10 | 10 (4.0304%) | 14 (3.9344%) | 11 (3.9312%) |&lt;br /&gt;|   11 | 16 (3.9584%) | 6 (3.9296%)  | 19 (3.9232%) |&lt;br /&gt;|   12 | 7 (3.9968%)  | 1 (3.9392%)  | 26 (3.9264%) 
|&lt;br /&gt;|   13 | 8 (4.048%)   | 25 (3.9712%) | 23 (3.9616%) |&lt;br /&gt;|   14 | 16 (3.9936%) | 26 (3.9632%) | 4 (3.9536%)  |&lt;br /&gt;|   15 | 22 (4.0608%) | 12 (4.0048%) | 1 (3.9632%)  |&lt;br /&gt;|   16 | 14 (4.0032%) | 18 (3.9712%) | 4 (3.9488%)  |&lt;br /&gt;+------+--------------+--------------+--------------+&lt;br /&gt;16 rows in set (0.63 sec)&lt;/pre&gt;&lt;br /&gt;On my laptop, my solution is about 30% faster than the one presented by Alex. Personally I think mine is easier to understand too, but that is a matter of taste.&lt;br /&gt;&lt;br /&gt;Anyway, I'm just posting this to share my solution - I do not intend to downplay the one presented by Alex. Instead, I invite everyone interested in SQL, MySQL and PostgreSQL to keep an eye on Alex' blog as well as his excellent answers on Stackoverflow. He's an SQL jedi master in my book :)&lt;br /&gt;&lt;br /&gt;Of course, if you have a better solution to crack this problem in MySQL, please leave a comment. I'd love to hear what other people are doing to cope with these kinds of queries.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-2938328248767492139?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/2938328248767492139/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=2938328248767492139' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/2938328248767492139'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/2938328248767492139'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/03/greatest-n-per-group-top-3-with.html' title='Greatest N per group: top 3 with 
GROUP_CONCAT()'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-2730626224943105240</id><published>2010-03-14T21:35:00.007+01:00</published><updated>2010-03-15T15:53:55.872+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Matt Casters'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Pentaho Solutions&quot;'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='BI'/><category scheme='http://www.blogger.com/atom/ns#' term='Jos van Dongen'/><title type='text'>Writing another book: Pentaho Kettle Solutions</title><content type='html'>&lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html" target="wiley"&gt;&lt;img style="float:left; margin:1em" src="http://media.wiley.com/product_data/coverImage/22/04704843/0470484322.jpg"/&gt;&lt;/a&gt;Last year, at about this time of the year, I was deeply involved in the process of writing the book &lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html" target="wiley"&gt;"Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL"&lt;/a&gt; for &lt;a href="http://www.wiley.com/" target="wiley"&gt;Wiley&lt;/a&gt;. To date, "Pentaho Solutions" is still the only all-round book on the open source &lt;a href="http://www.pentaho.com/"&gt;Pentaho Business Intelligence suite&lt;/a&gt;. 
&lt;a href="http://www.pentaho.com" target="pentaho"&gt;&lt;img style="float:right;margin:1em" src="http://www.pentaho.com/images/de_logo.png"/&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have none but happy memories of the collaboration with my full co-author &lt;a href="http://www.tholis.com/" target="jos"&gt;Jos van Dongen&lt;/a&gt;, our technical editors Jens Bleuel, Jeroen Kuiper, &lt;a href="http://pentahomusings.blogspot.com/" target="tom"&gt;Tom Barber&lt;/a&gt; and &lt;a href="http://www.sherito.org/" target="tom"&gt;Tomas Morgner&lt;/a&gt;, several of the Pentaho Developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.&lt;br /&gt;&lt;br /&gt;When the book was finally published, late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive &lt;a href="http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/product-reviews/0470484322/ref=dp_top_cm_cr_acr_txt?ie=UTF8&amp;showViewpoints=1"&gt;reviews on places like Amazon&lt;/a&gt;, and &lt;a href="http://www.xaprb.com/blog/2009/12/13/review-pentaho-solutions-bouman-dongen/" target="baron"&gt;elsewhere&lt;/a&gt; &lt;a href="http://www.bluefiredatasolutions.com/blog/2009/09/pentaho-solutions-a-needed-resource-with-potential-for-more/" target="alex"&gt;on&lt;/a&gt; &lt;a href="http://open-bi.blogspot.com/2009/12/pentaho-solutions-book-by-roland-bouman.html" target="vincent"&gt;the&lt;/a&gt; &lt;a href="http://www.prashantraju.com/2010/01/pentaho-solutions-review/" target="prashant"&gt;web&lt;/a&gt;. 
I'd like to use this opportunity to thank everybody that took the time to review the book: Thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enormously that you all took the time to spread the word. Beer is on me next time we meet :)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Announcing "Pentaho Kettle Solutions"&lt;/h3&gt;&lt;br /&gt;&lt;a href="http://kettle.pentaho.org/"&gt;&lt;img style="float:right;margin:1em" src="http://www.pentaho.com/images/logo_kettle_lrg.png"/&gt;&lt;/a&gt; In the autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and renowned publisher in the field of data warehousing and business intelligence, voiced interest in another Pentaho book by Jos and me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.&lt;br /&gt;&lt;br /&gt;As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to &lt;a href="http://www.ibridge.be/" target="_matt"&gt;Matt Casters&lt;/a&gt;, chief of Pentaho Data Integration and founding developer of &lt;a href="http://kettle.pentaho.org/" target="_pentaho"&gt;Kettle&lt;/a&gt; (a.k.a. Pentaho Data Integration). Neither Jos nor I expected Matt to be able to free up any time in his ever-busy schedule to help us write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author! &lt;br /&gt;&lt;br /&gt;Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. 
Our proposal was accepted in December 2009, and we have been writing since, focusing on the forthcoming Kettle version, Kettle 4.0. The tentative title of the book is &lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html" target="wiley"&gt;Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration&lt;/a&gt;. It is planned to be published in September 2010, and it will have approximately 750 pages.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html" target="wiley"&gt;&lt;img src="http://media.wiley.com/product_data/coverImage300/77/04706351/0470635177.jpg" style="margin:1em"/&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we have finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservations about releasing that already: &lt;ul&gt;&lt;br /&gt;&lt;li&gt;Part I: Getting Started&lt;ul&gt;&lt;br /&gt;&lt;li&gt;ETL Primer&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Kettle Concepts&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Installation and Configuration&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Sample ETL Solution&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part II: ETL Subsystems&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Overview of the 34 Subsystems of ETL&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Data Extraction&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Cleansing and Conforming&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Handling Dimension Tables&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Fact Tables&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Loading OLAP Cubes&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part III: Management and Deployment&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Testing and Debugging&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Scheduling and Monitoring&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Versioning and Migration&lt;/li&gt;&lt;br 
/&gt;&lt;li&gt;Lineage and Auditing&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Securing your Environment&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Documenting&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part IV: Performance and Scalability&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Performance Tuning&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Parallelization and Partitioning&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Dynamic Clustering in the Cloud&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Realtime and Streaming data&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part V: Integrating and Extending Kettle&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Pentaho BI Integration&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Third-party Kettle Integration&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Extending Kettle&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part VI: Advanced Topics&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Webservices and Web APIs&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Complex File Handling&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Data Vault Management&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Working with ERP Systems&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Feel free to ask me any questions about this new book. 
If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-2730626224943105240?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/2730626224943105240/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=2730626224943105240' title='37 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/2730626224943105240'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/2730626224943105240'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/03/writing-another-book-pentaho-kettle.html' title='Writing another book: Pentaho Kettle Solutions'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>37</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5419815226495194527</id><published>2010-02-17T23:00:00.002+01:00</published><updated>2010-02-18T09:50:57.262+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mysql stored routine'/><category scheme='http://www.blogger.com/atom/ns#' term='geert'/><category scheme='http://www.blogger.com/atom/ns#' term='Refactoring'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql user conference'/><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql connector/python'/><category scheme='http://www.blogger.com/atom/ns#' 
term='information_schema'/><title type='text'>MySQL - the best stored routine is the one you don't write</title><content type='html'>At &lt;a href="http://fosdem.org/2010/" target="fosdem"&gt;Fosdem 2010&lt;/a&gt;, already two weeks ago, I had the pleasure of hearing &lt;a href="http://geert.vanderkelen.org/" target="geert"&gt;Geert van der Kelen&lt;/a&gt; explain the work he has been doing on connecting &lt;a href="http://www.mysql.com/" target="mysql"&gt;MySQL&lt;/a&gt; and &lt;a href="http://python.org/" target="python"&gt;Python&lt;/a&gt;. I don't know anything about Python, but anybody that has the courage, perseverance and coding skills to create an implementation of the MySQL wire protocol from scratch is a class-A programmer in my book. So, I encourage everyone that needs MySQL connectivity for Python programs to check out Geert's brainchild, &lt;a href="https://launchpad.net/myconnpy" target="myconnpy"&gt;MySQL Connector/Python&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In relation to MySQL Connector/Python, I just read &lt;a href="http://geert.vanderkelen.org/2010/02/stuffing-gaps-in-collations-table-using.html" target="geert"&gt;a post from Geert&lt;/a&gt; about how he uses &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/information-schema.html" target="mysql"&gt;the MySQL &lt;code&gt;information_schema&lt;/code&gt;&lt;/a&gt; to generate some Python code. In this particular case, he needs the data from the &lt;code&gt;COLLATIONS&lt;/code&gt; table to maintain a data structure that describes all collations supported by MySQL. &lt;br /&gt;&lt;br /&gt;For some reason that I cannot fathom, Geert needed to generate a structure for each possible collation, not just the ones for which the &lt;code&gt;COLLATIONS&lt;/code&gt; table contains a row. 
To do this, he wrote a &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/create-procedure.html" target="_mysql"&gt;stored procedure&lt;/a&gt; that uses a &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/cursors.html" target="mysql"&gt;cursor&lt;/a&gt; to loop through the &lt;code&gt;COLLATIONS&lt;/code&gt; table. In the loop, he detects whenever there's a gap in the sequence of values from the &lt;code&gt;ID&lt;/code&gt; column, and then starts a new loop to "fill the gaps". For each iteration of the outer cursor loop, a piece of text is emitted that conforms to the syntax of a &lt;a href="http://docs.python.org/library/stdtypes.html#typesseq" target="python"&gt;Python&lt;/a&gt; tuple describing the collation, and each iteration of the inner loop generates the text &lt;code&gt;None&lt;/code&gt;, a &lt;a href="http://www.python.org/doc/2.6/library/constants.html" target="python"&gt;Python built-in constant&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The final result of the procedure is a snippet of Python code shown below (abbreviated): &lt;pre&gt;&lt;br /&gt;..&lt;br /&gt;("cp1251","cp1251_bulgarian_ci"), # 14&lt;br /&gt;("latin1","latin1_danish_ci"), # 15&lt;br /&gt;("hebrew","hebrew_general_ci"), # 16&lt;br /&gt;None,&lt;br /&gt;("tis620","tis620_thai_ci"), # 18&lt;br /&gt;("euckr","euckr_korean_ci"), # 19&lt;br /&gt;..&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;In the final code, these lines are themselves used to form yet another tuple:&lt;pre&gt;&lt;br /&gt;desc = (&lt;br /&gt;    None,&lt;br /&gt;    ("big5","big5_chinese_ci"), # 1&lt;br /&gt;    ("latin2","latin2_czech_cs"), # 2&lt;br /&gt;    ("dec8","dec8_swedish_ci"), # 3&lt;br /&gt;    ("cp850","cp850_general_ci"), # 4&lt;br /&gt;..&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This is an excellent use of the information schema! However, I am not too thrilled about using a stored routine for this. 
Enter my FOSDEM talk about &lt;a href="http://fosdem.org/2010/schedule/events/mysql_refactoring" target="fosdem"&gt;refactoring stored routines&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In this case, performance is not really an issue, so I won't play that card. But many people that do need well-performing stored procedures might start out like Geert and write a cursor loop, and perhaps do some looping inside that loop. One of the big take-aways in my presentation is to become aware of the ways that you can avoid a stored procedure. Geert's procedure is an excellent candidate to illustrate the point. As a bonus, I'm adding the code that is necessary to generate the entire snippet, not just the collection of tuples inside the outer pair of parentheses. &lt;br /&gt;&lt;br /&gt;So, here goes:&lt;pre&gt;&lt;br /&gt;set group_concat_max_len := @@max_allowed_packet;&lt;br /&gt;&lt;br /&gt;select      concat('desc = (',&lt;br /&gt;                group_concat('\n   '&lt;br /&gt;                ,   if( collations.id is null, 'None',   &lt;br /&gt;                        concat('(', '"', character_set_name, '"',&lt;br /&gt;                            ',', '"', collation_name, '"', ')')&lt;br /&gt;                    )&lt;br /&gt;                ,   if(ids.id=255, '', ','), ' #', ids.id&lt;br /&gt;                order by ids.id&lt;br /&gt;                separator ''    &lt;br /&gt;                ), '\n)'&lt;br /&gt;            )&lt;br /&gt;from       (select (t0.id&amp;lt;&amp;lt;0) + (t1.id&amp;lt;&amp;lt;1) + (t2.id&amp;lt;&amp;lt;2)&lt;br /&gt;            +      (t3.id&amp;lt;&amp;lt;3) + (t4.id&amp;lt;&amp;lt;4) + (t5.id&amp;lt;&amp;lt;5)&lt;br /&gt;            +      (t6.id&amp;lt;&amp;lt;6) + (t7.id&amp;lt;&amp;lt;7)            id&lt;br /&gt;            from   (select 0 id union all select 1) t0&lt;br /&gt;            ,      (select 0 id union all select 1) t1&lt;br /&gt;            ,      (select 0 id union all select 1) t2&lt;br /&gt;            ,      (select 0 id union 
all select 1) t3&lt;br /&gt;            ,      (select 0 id union all select 1) t4&lt;br /&gt;            ,      (select 0 id union all select 1) t5&lt;br /&gt;            ,      (select 0 id union all select 1) t6&lt;br /&gt;            ,      (select 0 id union all select 1) t7) ids&lt;br /&gt;left join   information_schema.collations on ids.id = collations.id;&lt;/pre&gt;&lt;br /&gt;This query works by first generating 256 rows with ids ranging from 0 to 255. (I think I recall Alexander Barkov mentioning that this is currently the maximum number of collations that MySQL supports - perhaps I am wrong there.) This is done by cross-joining a simple derived table that generates two rows:&lt;pre&gt;&lt;br /&gt;(select 0 id union all select 1)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So, one row that yields &lt;code&gt;0&lt;/code&gt;, and one that yields &lt;code&gt;1&lt;/code&gt;. By cross-joining 8 of these derived tables, we get 2 to the 8th power rows, which equals 256. In the &lt;code&gt;SELECT&lt;/code&gt;-list, I use the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/bit-functions.html#operator_left-shift" target="mysql"&gt;left bitshift operator&lt;/a&gt; &lt;code&gt;&amp;lt;&amp;lt;&lt;/code&gt; to shift the original &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt; by 0, 1, 2 and so on up to 7 positions. Each derived table thus contributes exactly one bit, so by adding those values together, we fill up exactly one byte, and obtain all possible values from 0 through 255:&lt;pre&gt;&lt;br /&gt;(select (t0.id&amp;lt;&amp;lt;0) + (t1.id&amp;lt;&amp;lt;1) + (t2.id&amp;lt;&amp;lt;2)&lt;br /&gt; +      (t3.id&amp;lt;&amp;lt;3) + (t4.id&amp;lt;&amp;lt;4) + (t5.id&amp;lt;&amp;lt;5)&lt;br /&gt; +      (t6.id&amp;lt;&amp;lt;6) + (t7.id&amp;lt;&amp;lt;7)            id&lt;br /&gt; from   (select 0 id union all select 1) t0&lt;br /&gt; ,      ...                              
t1&lt;br /&gt; ,      ...&lt;br /&gt; ,      (select 0 id union all select 1) t7) ids&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Once we have this, the rest is straightforward - all we have to do now is use a &lt;code&gt;LEFT JOIN&lt;/code&gt; to pick up the collation from the &lt;code&gt;information_schema.COLLATIONS&lt;/code&gt; table whenever the value of its &lt;code&gt;ID&lt;/code&gt; column matches the value we computed with the bit-shifting jiggery-pokery. For the matching rows, we use &lt;code&gt;CONCAT&lt;/code&gt; to generate a Python tuple describing the collation, and for the non-matching rows, we generate &lt;code&gt;None&lt;/code&gt;:&lt;pre&gt;&lt;br /&gt;if( collations.id is null, 'None',   &lt;br /&gt;    concat('(', '"', character_set_name, '"',&lt;br /&gt;           ',', '"', collation_name, '"', ')')&lt;br /&gt;)&lt;/pre&gt;&lt;br /&gt;The final touch is a &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_group-concat" target="mysql"&gt;GROUP_CONCAT&lt;/a&gt;&lt;/code&gt; that we use to bunch these up into a comma separated list that is used as entries for the outer tuple. As always, you should set the value of the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_group_concat_max_len" target="mysql"&gt;group_concat_max_len&lt;/a&gt;&lt;/code&gt; server variable to a sufficiently high value to hold the contents of the generated string, and if you want to be on the safe side and not run the risk of getting a truncated result, you should set it to the value of &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_max_allowed_packet"&gt;max_allowed_packet&lt;/a&gt;&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;I have the honour of speaking at &lt;a href="http://en.oreilly.com/mysql2010" target="mysqlconf"&gt;the MySQL user conference&lt;/a&gt;, April 12-15 later this year. 
There, I will be doing a related talk called &lt;a href="http://en.oreilly.com/mysql2010/public/schedule/detail/13035" target="mysqlconf"&gt;Optimizing MySQL Stored Routines&lt;/a&gt;. In this talk, I will explain how stored routines impact performance, and provide some tips on how you can avoid them, but also on how to improve your stored procedure code in case you really do need them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5419815226495194527?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5419815226495194527/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5419815226495194527' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5419815226495194527'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5419815226495194527'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/02/mysql-best-stored-routine-is-one-you.html' title='MySQL - the best stored routine is the one you don&apos;t write'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-6429701815921158712</id><published>2010-01-27T18:10:00.000+01:00</published><updated>2010-01-27T18:11:28.626+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Data  warehousing'/><category 
scheme='http://www.blogger.com/atom/ns#' term='easter'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Easter Eggs for MySQL and Kettle</title><content type='html'>To whom it may concern, &lt;br /&gt;&lt;h3&gt;A MySQL stored function to calculate easter day&lt;/h3&gt;&lt;br /&gt;I uploaded a &lt;a href="http://forge.mysql.com/tools/tool.php?id=250" target="_forge"&gt;MySQL forge snippet&lt;/a&gt; for the &lt;code&gt;f_easter()&lt;/code&gt; function. You can use this function in MySQL statements to calculate easter sunday for any given year: &lt;pre&gt;&lt;br /&gt;mysql&amp;gt;  select f_easter(year(now()));&lt;br /&gt;+-----------------------+&lt;br /&gt;| f_easter(year(now())) |&lt;br /&gt;+-----------------------+&lt;br /&gt;| 2010-04-04            |&lt;br /&gt;+-----------------------+&lt;br /&gt;1 row in set (0.00 sec)&lt;br /&gt;&lt;/pre&gt; &lt;br /&gt;&lt;h3&gt;Anonymous Gregorian algorithm&lt;/h3&gt;&lt;br /&gt;To implement it, I simply transcribed the code of the &lt;a href="http://en.wikipedia.org/wiki/Computus#Anonymous_Gregorian_algorithm" target="_wiki"&gt;"Anonymous Gregorian algorithm"&lt;/a&gt; from wikipedia's &lt;a href="http://en.wikipedia.org/wiki/Computus" target="_wiki"&gt;Computus&lt;/a&gt; article.&lt;br /&gt;&lt;br /&gt;You might ask yourself: "how does it work?". Frankly, I don't know. Much like a tax form, I treat the calculation as a black box. But, it's wikipedia, so it must be right, right?&lt;br /&gt;&lt;h3&gt;A Javascript snippet to calculate easter day&lt;/h3&gt;&lt;br /&gt;I also transcribed the algorithm to javascript, so I could use it in &lt;a href="http://kettle.pentaho.org/" target="_kettle"&gt;Kettle&lt;/a&gt; (a.k.a. Pentaho Data Integration). 
Of course, nothing should stop you from using it for another environment, such as a webpage.&lt;br /&gt;&lt;br /&gt;I don't have a proper place to host that code, so I'm listing it here:&lt;pre&gt;&lt;br /&gt;//Script to calculate Easter day&lt;br /&gt;//according to the "Anonymous Gregorian algorithm" &lt;br /&gt;function easterDay(year) {&lt;br /&gt;     var a = year % 19,&lt;br /&gt;     b = Math.floor(year / 100),&lt;br /&gt;     c = year % 100,&lt;br /&gt;     d = Math.floor(b / 4),&lt;br /&gt;     e = b % 4,&lt;br /&gt;     f = Math.floor((b + 8) / 25),&lt;br /&gt;     g = Math.floor((b - f + 1) / 3),&lt;br /&gt;     h = (19 * a + b - d - g + 15) % 30,&lt;br /&gt;     i = Math.floor(c / 4),&lt;br /&gt;     k = c % 4,&lt;br /&gt;     L = (32 + 2 * e + 2 * i - h - k) % 7,&lt;br /&gt;     m = Math.floor((a + 11 * h + 22 * L) / 451),&lt;br /&gt;     n = h + L - 7 * m + 114;&lt;br /&gt;     return new Date(year,&lt;br /&gt;                  Math.floor(n / 31) - 1,&lt;br /&gt;                  (n % 31) + 1);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;easter = easterDay(year);&lt;/pre&gt;&lt;br /&gt;To use this in your Kettle transformations, create a stream with a field of the &lt;code&gt;Integer&lt;/code&gt; type called &lt;code&gt;year&lt;/code&gt;. The &lt;code&gt;year&lt;/code&gt; field should take on the value of some year. In the JavaScript step that holds this script, create one output field of the &lt;code&gt;Date&lt;/code&gt; type to take on the value of the &lt;code&gt;easter&lt;/code&gt; script variable. (For usage outside Kettle, just use the &lt;code&gt;easterDay()&lt;/code&gt; function as you see fit.)&lt;br /&gt;&lt;h3&gt;Nice, but so what?&lt;/h3&gt;&lt;br /&gt;The thought may have crossed your mind: "So what, who cares - why should I ever want to know when it's easter day?"&lt;br /&gt;&lt;br /&gt;Apparently, if you think like that, you don't like eggs very much. That's ok - I don't blame you. 
But I happen to like eggs, and people in the egg business like people that like eggs like me so they can sell them more eggs. In fact, they like selling eggs so much, that it makes a big difference to them whether their business intelligence reports say: "On March 22, 2008, we sold 10 times more eggs than on February 22 and May 22 of the same year" as compared to "In 2008, on the day before Easter, we only sold half the amount of eggs as compared to the day before Easter in 2009".&lt;br /&gt;&lt;br /&gt;In order to report these facts, special events and holidays like easter are stored in a date dimension. (I wrote about &lt;a href="http://rpbouman.blogspot.com/2007/04/kettle-tip-using-java-locales-for-date.html"&gt;creating a localized date dimension&lt;/a&gt;, a date dimension that speaks your language some time ago)&lt;br /&gt;&lt;br /&gt;So there you go: you could use these solutions in order to build a date dimension that understands easter. The nice thing about easter is that it can be used to derive a whole bunch of other Christian holidays, like good friday, ascension, and pentecost, and in many western countries, these will be special days with regard to the normal course of business. 
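&lt;br /&gt;&lt;br /&gt;As a minimal sketch of how such a derivation could look (this is not part of the original snippet: the helper names &lt;code&gt;addDays()&lt;/code&gt; and &lt;code&gt;easterHolidays()&lt;/code&gt; are made up for illustration, and the offsets assume the usual Western reckoning of Good Friday at 2 days before, Ascension at 39 days after, and Pentecost at 49 days after Easter Sunday), something along these lines should work. The &lt;code&gt;easterDay()&lt;/code&gt; function is repeated only to keep the example self-contained:

```javascript
// Easter Sunday according to the "Anonymous Gregorian algorithm"
// (same steps as the easterDay() snippet listed earlier in this post).
function easterDay(year) {
    var a = year % 19,
        b = Math.floor(year / 100),
        c = year % 100,
        d = Math.floor(b / 4),
        e = b % 4,
        f = Math.floor((b + 8) / 25),
        g = Math.floor((b - f + 1) / 3),
        h = (19 * a + b - d - g + 15) % 30,
        i = Math.floor(c / 4),
        k = c % 4,
        L = (32 + 2 * e + 2 * i - h - k) % 7,
        m = Math.floor((a + 11 * h + 22 * L) / 451),
        n = h + L - 7 * m + 114;
    return new Date(year, Math.floor(n / 31) - 1, (n % 31) + 1);
}

// Hypothetical helper: returns a new Date shifted by a whole number of days.
// The Date constructor normalizes day values that run past the month boundaries.
function addDays(date, days) {
    return new Date(date.getFullYear(), date.getMonth(), date.getDate() + days);
}

// Hypothetical helper: a few holidays at fixed day offsets from Easter Sunday.
// The offsets are an assumption of this sketch, not taken from the original post.
function easterHolidays(year) {
    var easter = easterDay(year);
    return {
        goodFriday: addDays(easter, -2),
        easter: easter,
        ascension: addDays(easter, 39),
        pentecost: addDays(easter, 49)
    };
}
```

For 2010, &lt;code&gt;easterHolidays(2010).pentecost&lt;/code&gt; comes out as May 23, seven weeks after the April 4 shown for &lt;code&gt;f_easter()&lt;/code&gt; above. Other movable holidays follow the same pattern with different offsets.&lt;br /&gt;&lt;br /&gt;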
I leave all these as an exercise to the reader, but trust me - calculating easter is the key to solving a lot of these problems.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-6429701815921158712?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/6429701815921158712/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=6429701815921158712' title='19 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6429701815921158712'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6429701815921158712'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2010/01/easter-eggs-for-mysql-and-kettle.html' title='Easter Eggs for MySQL and Kettle'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>19</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-7441979623852267366</id><published>2009-12-15T17:45:00.001+01:00</published><updated>2009-12-16T00:34:39.652+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Database Stored Procedures'/><category scheme='http://www.blogger.com/atom/ns#' term='SIGNAL'/><category scheme='http://www.blogger.com/atom/ns#' term='stored routine'/><category scheme='http://www.blogger.com/atom/ns#' term='trigger'/><title type='text'>Validating MySQL data entry with triggers: A quick look at the SIGNAL syntax</title><content 
type='html'>The &lt;a href="http://blogs.mysql.com/kaj/2009/12/15/mysql-550-m2-a-milestone-ready-to-download/" target="_mysql"&gt;latest MySQL 5.5 milestone release&lt;/a&gt; offers support for an ANSI/ISO standard feature called the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.5/en/signal.html" target="_mysql"&gt;SIGNAL&lt;/a&gt;&lt;/code&gt; syntax. You can use this syntax inside stored routines (including triggers) to raise an error condition which can be used to invoke specific error handling, or otherwise abort the stored routine. In addition, you can use the &lt;code&gt;SIGNAL&lt;/code&gt; syntax to convey information about what went wrong, which may be used by the caller to handle the error.&lt;br /&gt;&lt;br /&gt;I have written about MySQL data entry validation procedures in the past. At the time, &lt;a href="http://rpbouman.blogspot.com/2006/02/dont-you-need-proper-error-handling.html" target="_mysql"&gt;MySQL did not support any proper means to raise an error condition&lt;/a&gt; inside a stored routine, and one had to work around that by deliberately causing a runtime error, for example by referring to a non-existent table, setting a non-nullable column to null, or &lt;a href="http://rpbouman.blogspot.com/2005/11/using-udf-to-raise-errors-from-inside.html" target="_mysql"&gt;writing a specially crafted UDF&lt;/a&gt;. In this article, I'm taking a closer look at how to implement more robust data entry validation in MySQL using the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.5/en/signal.html" target="_mysql"&gt;SIGNAL&lt;/a&gt;&lt;/code&gt; syntax.&lt;br /&gt;&lt;h3&gt;Triggers&lt;/h3&gt;&lt;br /&gt;For those of you who are unfamiliar with the subject, MySQL offers support for &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/triggers.html" target="_mysql"&gt;triggers&lt;/a&gt; as of version 5.0. 
Triggers are stored routines that are executed automatically right before or after data change events like a row being inserted, updated or deleted. Because triggers are executed as part of the SQL statement (and its containing transaction) causing the row change event, and because the trigger code has direct access to the changed row, you could in theory use them to correct or reject invalid data.&lt;br /&gt;&lt;h3&gt;Example data validation problem&lt;/h3&gt;&lt;br /&gt;Let's take a quick look at the following example. Suppose you have a table called &lt;code&gt;person&lt;/code&gt; to store data about persons:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;CREATE TABLE person (&lt;br /&gt;    id              INT         NOT NULL AUTO_INCREMENT PRIMARY KEY&lt;br /&gt;,   first_name      VARCHAR(64) NOT NULL    &lt;br /&gt;,   last_name       VARCHAR(64) NOT NULL&lt;br /&gt;,   initials        VARCHAR(8)&lt;br /&gt;)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Let's consider a simple validation method for the initials column:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;We require the first letter of the value for the &lt;code&gt;initials&lt;/code&gt; column to match the first letter of the value of the &lt;code&gt;first_name&lt;/code&gt; column.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;In addition, we require that the values for the &lt;code&gt;initials&lt;/code&gt; column consist of uppercase letters separated by periods.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;In order to implement this, we design the following algorithm: &lt;ul&gt;&lt;br /&gt;&lt;li&gt;If the value for &lt;code&gt;first_name&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, we do nothing. 
The &lt;code&gt;NOT NULL&lt;/code&gt; table constraint will prevent the data from being entered anyway, so further attempts at validation or correction are pointless.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;If the value for &lt;code&gt;initials&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, we correct it by automatically filling in the first character of the value for &lt;code&gt;first_name&lt;/code&gt;.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;If the values for &lt;code&gt;first_name&lt;/code&gt; as well as &lt;code&gt;initials&lt;/code&gt; are both not &lt;code&gt;NULL&lt;/code&gt;, we require that the first character of the value for &lt;code&gt;first_name&lt;/code&gt; equals the first character of the value for &lt;code&gt;initials&lt;/code&gt;.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Finally, we use a regular expression to check if the value for &lt;code&gt;initials&lt;/code&gt; matches the desired pattern of uppercase letters separated by periods.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;h3&gt;A data validation stored procedure&lt;/h3&gt;&lt;br /&gt;Let's start by creating a stored procedure to perform this algorithm. 
Here's the code for the &lt;code&gt;p_validate_initials&lt;/code&gt; procedure which validates and possibly corrects the &lt;code&gt;initials&lt;/code&gt; value based on the value for &lt;code&gt;first_name&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;DELIMITER go&lt;br /&gt;&lt;br /&gt;CREATE PROCEDURE p_validate_initials(&lt;br /&gt;    IN      p_first_name    VARCHAR(64)&lt;br /&gt;,   INOUT   p_initials      VARCHAR(64) &lt;br /&gt;)&lt;br /&gt;DETERMINISTIC                                                                   -- same arguments yield same result, always&lt;br /&gt;NO SQL                                                                          -- does not execute SQL statements, only procedural logic&lt;br /&gt;_main: BEGIN&lt;br /&gt;&lt;br /&gt;    DECLARE WARN_CORRECTED_INITIALS CONDITION FOR SQLSTATE '01000';&lt;br /&gt;    DECLARE ERR_INITIALS_DONT_MATCH_FIRSTNAME CONDITION FOR SQLSTATE '45000';&lt;br /&gt;    DECLARE ERR_INITIALS_ILLFORMATTED CONDITION FOR SQLSTATE '45000';&lt;br /&gt;&lt;br /&gt;    IF      p_first_name IS NULL THEN&lt;br /&gt;        LEAVE _main;                                                            -- nothing to validate&lt;br /&gt;    ELSEIF   p_initials IS NULL THEN                                            -- initials are NULL, correct:&lt;br /&gt;        SET p_initials := CONCAT(LEFT(p_first_name, 1), '.');                   -- take the first letter of first_name&lt;br /&gt;        SIGNAL WARN_CORRECTED_INITIALS                                          -- warn about the corrective measure&lt;br /&gt;        SET MESSAGE_TEXT = 'Corrected NULL value for initials to match value for first_name.';&lt;br /&gt;    ELSEIF   BINARY LEFT(p_first_name, 1) != LEFT(p_initials, 1) THEN           -- initials don't match first_name&lt;br /&gt;        SIGNAL ERR_INITIALS_DONT_MATCH_FIRSTNAME                                -- raise an error&lt;br /&gt;        SET MESSAGE_TEXT = 'The first letter of the value for initials does 
not match the first letter of the value for first_name';&lt;br /&gt;    END IF;&lt;br /&gt;    IF NOT p_initials REGEXP '^([A-Z][.])+$' THEN                               -- if initials don't match the correct pattern&lt;br /&gt;        SIGNAL ERR_INITIALS_ILLFORMATTED                                        -- raise an error&lt;br /&gt;        SET MESSAGE_TEXT = 'The value for initials must consist of upper case letters separated by periods.';&lt;br /&gt;    END IF;&lt;br /&gt;END;&lt;br /&gt;go&lt;br /&gt;&lt;br /&gt;DELIMITER ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Let's take a look at how this procedure works.&lt;br /&gt;&lt;h3&gt;How to issue warnings&lt;/h3&gt;&lt;br /&gt;First, let's pass &lt;code&gt;NULL&lt;/code&gt; for the initials to see if they are properly corrected:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; set @initials := null;&lt;br /&gt;Query OK, 0 rows affected (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; call p_validate_initials('Roland', @initials);&lt;br /&gt;Query OK, 0 rows affected, 1 warning (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; select @initials;&lt;br /&gt;+-----------+&lt;br /&gt;| @initials |&lt;br /&gt;+-----------+&lt;br /&gt;| R.        |&lt;br /&gt;+-----------+&lt;br /&gt;1 row in set (0.00 sec)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Note that executing the procedure placed the correct value into the &lt;code&gt;@initials&lt;/code&gt; user-defined variable. 
Also note that a warning was issued.&lt;br /&gt;&lt;br /&gt;In this case, the procedure ran through the following branch of the first &lt;code&gt;IF&lt;/code&gt; statement: &lt;pre&gt;&lt;br /&gt;    ...&lt;br /&gt;    ELSEIF   p_initials IS NULL THEN&lt;br /&gt;        SET p_initials := CONCAT(LEFT(p_first_name, 1), '.');&lt;br /&gt;        &lt;b&gt;SIGNAL WARN_CORRECTED_INITIALS&lt;br /&gt;        SET MESSAGE_TEXT = 'Corrected NULL value for initials to match value for first_name.';&lt;/b&gt;&lt;br /&gt;    ELSEIF ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The warning is caused by the &lt;code&gt;SIGNAL&lt;/code&gt; statement (which appears in bold text in the snippet above). The general syntax for the &lt;code&gt;SIGNAL&lt;/code&gt; statement is:&lt;pre&gt;&lt;br /&gt;SIGNAL &lt;i&gt;condition_value&lt;/i&gt;&lt;br /&gt;    [SET &lt;i&gt;signal_information&lt;/i&gt; [, &lt;i&gt;signal_information&lt;/i&gt;] ...]&lt;br /&gt;    &lt;br /&gt;&lt;i&gt;condition_value&lt;/i&gt;:&lt;br /&gt;    SQLSTATE [VALUE] &lt;i&gt;sqlstate_value&lt;/i&gt;&lt;br /&gt;  | &lt;i&gt;condition_name&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;signal_information&lt;/i&gt;:&lt;br /&gt;    &lt;i&gt;condition_information_item&lt;/i&gt; = &lt;i&gt;simple_value_specification&lt;/i&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So, in this case, we used the condition name &lt;code&gt;WARN_CORRECTED_INITIALS&lt;/code&gt; as &lt;code&gt;&lt;i&gt;condition_value&lt;/i&gt;&lt;/code&gt;. This &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/declare-condition.html" target="_mysql"&gt;condition is declared&lt;/a&gt; in the top of the procedure:&lt;pre&gt;&lt;br /&gt;    DECLARE &lt;b&gt;WARN_CORRECTED_INITIALS&lt;/b&gt; CONDITION FOR &lt;b&gt;SQLSTATE '01000'&lt;/b&gt;;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Basically, condition declarations like these serve simply to tie a human readable name to otherwise obscure &lt;code&gt;SQLSTATE&lt;/code&gt; values. 
As per the SQL standard, SQLSTATE values are strings of five characters (digits and uppercase letters). The prefix &lt;code&gt;01&lt;/code&gt; indicates a warning.&lt;br /&gt;&lt;br /&gt;(Condition declarations like these are not only useful to clarify the meaning of your &lt;code&gt;SIGNAL&lt;/code&gt; statements, you can also use them to declare error &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/declare-handler.html" target="_mysql"&gt;HANDLER&lt;/a&gt;&lt;/code&gt;s)&lt;br /&gt;&lt;br /&gt;An alternative syntax for &lt;code&gt;SIGNAL&lt;/code&gt; allows you to directly refer to the &lt;code&gt;SQLSTATE&lt;/code&gt; without explicitly declaring a &lt;code&gt;CONDITION&lt;/code&gt;. So, if you feel that declaring explicit conditions is too much trouble, you can also omit that and write: &lt;pre&gt;&lt;br /&gt;        SIGNAL SQLSTATE '01000' &lt;br /&gt;            ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(However, I like using explicit condition names better because they do a better job of explaining the intention of the code.)&lt;br /&gt;&lt;h3&gt;Conveying &lt;code&gt;SIGNAL&lt;/code&gt; context information&lt;/h3&gt;&lt;br /&gt;The &lt;code&gt;SIGNAL&lt;/code&gt; statement also features a &lt;code&gt;SET&lt;/code&gt;-clause which is used to convey signal information. In our example the set clause was:&lt;pre&gt;&lt;br /&gt;        &lt;b&gt;SET MESSAGE_TEXT = 'Corrected NULL value for initials to match value for first_name.';&lt;/b&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;As you can see, the &lt;code&gt;SET&lt;/code&gt;-clause in the example contains an assignment. In the context of the &lt;code&gt;SIGNAL&lt;/code&gt; statement, such an assignment is referred to as &lt;code&gt;&lt;i&gt;signal_information&lt;/i&gt;&lt;/code&gt;. The left-hand side of the assignment must be one of the predefined &lt;code&gt;&lt;i&gt;condition_information_item&lt;/i&gt;&lt;/code&gt;s. 
The &lt;code&gt;SET&lt;/code&gt;-clause can have multiple of these &lt;code&gt;&lt;i&gt;signal_information&lt;/i&gt;&lt;/code&gt; items which can be used to capture and communicate program state to the client program. &lt;br /&gt;&lt;br /&gt;In the case of the example we can demonstrate how this works using the MySQL command-line client. By issuing a &lt;code&gt;SHOW WARNINGS&lt;/code&gt; statement, we can see the message text was conveyed with the &lt;code&gt;&lt;i&gt;signal_information&lt;/i&gt;&lt;/code&gt; item:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; show warnings;&lt;br /&gt;+---------+------+------------------------------------------------------------------+&lt;br /&gt;| Level   | Code | Message                                                          |&lt;br /&gt;+---------+------+------------------------------------------------------------------+&lt;br /&gt;| Warning | 1642 | Corrected NULL value for initials to match value for first_name. |&lt;br /&gt;+---------+------+------------------------------------------------------------------+&lt;br /&gt;1 row in set (0.01 sec)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Unfortunately, MySQL does not currently support a way for any calling stored routines to capture the &lt;code&gt;&lt;i&gt;signal_information&lt;/i&gt;&lt;/code&gt; items. They are currently only available in the APIs you use to communicate with MySQL, so you can capture them in your application code. &lt;br /&gt;&lt;br /&gt;(&lt;a href="http://bugs.mysql.com/bug.php?id=11660" target="_mysql"&gt;A bug&lt;/a&gt; has been filed to ask for the ability to refer to signal information items in stored routines. 
This should become available whenever MySQL implements a &lt;code&gt;DIAGNOSTICS&lt;/code&gt; feature)&lt;br /&gt;&lt;h4&gt;Predefined condition information items&lt;/h4&gt;&lt;br /&gt;I just mentioned that the left-hand side of the &lt;code&gt;&lt;i&gt;signal_information&lt;/i&gt;&lt;/code&gt; item assignment must be one of the predefined &lt;code&gt;&lt;i&gt;condition_information_item&lt;/i&gt;&lt;/code&gt;s. These are dictated by the standard, and although MySQL allows all of the standard &lt;code&gt;&lt;i&gt;condition_information_item&lt;/i&gt;&lt;/code&gt;s, only two of them are currently relevant: &lt;code&gt;MESSAGE_TEXT&lt;/code&gt; and &lt;code&gt;MYSQL_ERRNO&lt;/code&gt;. We already illustrated using &lt;code&gt;MESSAGE_TEXT&lt;/code&gt;. The &lt;code&gt;MYSQL_ERRNO&lt;/code&gt; is a non-standard condition information item that can be used to convey custom error codes. &lt;br /&gt;&lt;br /&gt;This currently leaves three items to convey information about the context of the &lt;code&gt;SIGNAL&lt;/code&gt; statement:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;SQLSTATE&lt;/code&gt;: available in the C API as &lt;code&gt;mysql_sqlstate()&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;MYSQL_ERRNO&lt;/code&gt;: available in the C API as &lt;code&gt;mysql_errno()&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;code&gt;MESSAGE_TEXT&lt;/code&gt;: available in the C API as &lt;code&gt;mysql_error()&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;(You should be able to obtain the information also in PHP through the corresponding &lt;code&gt;mysqli_&lt;/code&gt;, &lt;code&gt;pdo_&lt;/code&gt; and &lt;code&gt;mysql_&lt;/code&gt; functions.)&lt;br /&gt;&lt;h3&gt;How to issue errors&lt;/h3&gt;&lt;br /&gt;We just discussed how to cause your stored routine to issue warnings. Issuing errors is exactly the same process; it just relies on a different class of &lt;code&gt;SQLSTATE&lt;/code&gt; values (as determined by the code prefix). 
Let's see the errors in action: &lt;pre&gt;&lt;br /&gt;mysql&amp;gt; set @initials := 'r';&lt;br /&gt;Query OK, 0 rows affected (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; call p_validate_initials('Roland', @initials);&lt;br /&gt;ERROR 1644 (45000): The first letter of the value for initials does not match the first letter of the value for first_name&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In this case, the stored routine ran through the last branch of the first &lt;code&gt;IF&lt;/code&gt; statement: &lt;pre&gt;&lt;br /&gt;    ELSEIF   BINARY LEFT(p_first_name, 1) != LEFT(p_initials, 1) THEN&lt;br /&gt;        &lt;b&gt;SIGNAL ERR_INITIALS_DONT_MATCH_FIRSTNAME&lt;/b&gt;&lt;br /&gt;        SET MESSAGE_TEXT = 'The first letter of the value for initials does not match the first letter of the value for first_name';&lt;br /&gt;    END IF;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;As you can see, the syntax for the actual &lt;code&gt;SIGNAL&lt;/code&gt; statement is exactly the same as what we saw in the example illustrating warnings. The most important difference is that in this case, the condition that is being signalled is declared with a &lt;code&gt;SQLSTATE&lt;/code&gt; value of 45000:&lt;pre&gt;&lt;br /&gt;    DECLARE ERR_INITIALS_DONT_MATCH_FIRSTNAME CONDITION FOR SQLSTATE '45000';&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The &lt;code&gt;SQLSTATE&lt;/code&gt; code &lt;code&gt;45000&lt;/code&gt; is special: it indicates a general user-defined exception.&lt;br /&gt;&lt;h3&gt;Using the procedure in a trigger&lt;/h3&gt;&lt;br /&gt;All we need to do now is create our triggers on the &lt;code&gt;person&lt;/code&gt; table that call the procedure to perform the actual validation. We need to apply the validation when data is inserted into the table, but also when data is updated. If it turns out the data is invalid, we need to reject the change. For this reason, we want to create triggers that fire before the data change is applied to the table. 
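&lt;br /&gt;&lt;br /&gt;To make the idea of rejecting a change concrete, here is a minimal sketch of a &lt;code&gt;BEFORE INSERT&lt;/code&gt; trigger that signals an error directly. The table &lt;code&gt;t_demo&lt;/code&gt;, its column &lt;code&gt;n&lt;/code&gt; and the trigger name &lt;code&gt;bir_t_demo&lt;/code&gt; are made up for illustration and are not part of the &lt;code&gt;person&lt;/code&gt; example. Assuming MySQL 5.5 or later, inserting a negative value should be aborted by the &lt;code&gt;SIGNAL&lt;/code&gt; statement, so that the row never reaches the table:&lt;pre&gt;&lt;br /&gt;-- hypothetical table, used only for this sketch&lt;br /&gt;CREATE TABLE t_demo (&lt;br /&gt;    n   INT NOT NULL&lt;br /&gt;);&lt;br /&gt;&lt;br /&gt;DELIMITER go&lt;br /&gt;&lt;br /&gt;CREATE TRIGGER bir_t_demo&lt;br /&gt;BEFORE INSERT ON t_demo&lt;br /&gt;FOR EACH ROW&lt;br /&gt;BEGIN&lt;br /&gt;    IF NEW.n &amp;lt; 0 THEN                                                           -- invalid data: reject the row&lt;br /&gt;        SIGNAL SQLSTATE '45000'                                                 -- general user-defined exception&lt;br /&gt;        SET MESSAGE_TEXT = 'The value for n must not be negative.';&lt;br /&gt;    END IF;&lt;br /&gt;END;&lt;br /&gt;go&lt;br /&gt;&lt;br /&gt;DELIMITER ;&lt;br /&gt;&lt;/pre&gt;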
&lt;br /&gt;&lt;br /&gt;So, to enforce validation, we need two triggers: one that fires &lt;code&gt;BEFORE INSERT&lt;/code&gt; events, and one that fires &lt;code&gt;BEFORE UPDATE&lt;/code&gt; events. Because the validation process itself is the same regardless of the type of change event, both triggers can call the &lt;code&gt;p_validate_initials&lt;/code&gt; procedure to perform the actual validation. This allows us to write (and maintain!) the validation logic only once, and reuse it whenever we need it. &lt;br /&gt;&lt;pre&gt;&lt;br /&gt;DELIMITER go&lt;br /&gt;&lt;br /&gt;CREATE TRIGGER bir_person &lt;br /&gt;BEFORE INSERT ON person&lt;br /&gt;FOR EACH ROW&lt;br /&gt;BEGIN&lt;br /&gt;    CALL p_validate_initials(&lt;br /&gt;        NEW.first_name&lt;br /&gt;    ,   NEW.initials &lt;br /&gt;    );&lt;br /&gt;END;&lt;br /&gt;go&lt;br /&gt;&lt;br /&gt;CREATE TRIGGER bur_person &lt;br /&gt;BEFORE UPDATE ON person&lt;br /&gt;FOR EACH ROW&lt;br /&gt;BEGIN&lt;br /&gt;    CALL p_validate_initials(&lt;br /&gt;        NEW.first_name&lt;br /&gt;    ,   NEW.initials &lt;br /&gt;    );&lt;br /&gt;END;&lt;br /&gt;go&lt;br /&gt;&lt;br /&gt;DELIMITER ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;A quick check indicates that data validation is now enforced as intended:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; INSERT INTO person (id, first_name, last_name, initials)&lt;br /&gt;    -&amp;gt; VALUES (2, 'Roland', 'Bouman', 'r');&lt;br /&gt;ERROR 1644 (45000): The first letter of the value for initials does not match the first letter of the value for first_name&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; INSERT INTO person (id, first_name, last_name, initials)&lt;br /&gt;    -&amp;gt; VALUES (2, 'Roland', 'Bouman', 'R');&lt;br /&gt;ERROR 1644 (45000): The value for initials must consist of upper case letters separated by periods.&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; INSERT INTO person (id, first_name, last_name, initials)&lt;br /&gt;    -&amp;gt; VALUES (2, 'Roland', 'Bouman', NULL);&lt;br /&gt;Query 
OK, 1 row affected, 1 warning (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; SHOW WARNINGS;&lt;br /&gt;+---------+------+------------------------------------------------------------------+&lt;br /&gt;| Level   | Code | Message                                                          |&lt;br /&gt;+---------+------+------------------------------------------------------------------+&lt;br /&gt;| Warning | 1642 | Corrected NULL value for initials to match value for first_name. |&lt;br /&gt;+---------+------+------------------------------------------------------------------+&lt;br /&gt;1 row in set (0.00 sec)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h3&gt;Final words&lt;/h3&gt;&lt;br /&gt;For more information on MySQL 5.5, check out &lt;a href="http://datacharmer.blogspot.com" target="_gm"&gt;Giuseppe Maxia&lt;/a&gt;'s article on &lt;a href="http://datacharmer.blogspot.com/2009/12/getting-started-with-mysql-55.html" target="_mysql"&gt;Getting Started with MySQL 5.5&lt;/a&gt;. Detailed information on the &lt;code&gt;SIGNAL&lt;/code&gt; syntax is available in the reference manual here: &lt;a href="http://dev.mysql.com/doc/refman/5.5/en/signal-resignal.html" target="_mysql"&gt;http://dev.mysql.com/doc/refman/5.5/en/signal-resignal.html&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-7441979623852267366?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/7441979623852267366/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=7441979623852267366' title='21 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7441979623852267366'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/15319370/posts/default/7441979623852267366'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/12/validating-mysql-data-entry-with_15.html' title='Validating MySQL data entry with triggers: A quick look at the SIGNAL syntax'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>21</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-824115688560580731</id><published>2009-11-16T02:30:00.008+01:00</published><updated>2009-11-17T02:54:48.266+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Pentaho Data Integration: Javascript Step Performance</title><content type='html'>I just read &lt;a href="http://open-bi.blogspot.com/2009/11/cleaning-strings.html" target="_vince"&gt;a post&lt;/a&gt; from &lt;a href="http://open-bi.blogspot.com/" target="_vince"&gt;Vincent Teyssier&lt;/a&gt; on cleaning strings using the javascript capabilities of &lt;a href="http://kettle.pentaho.org/" target="_pdi"&gt;Pentaho Data Integration&lt;/a&gt; (also known as Kettle) and &lt;a href="http://www.talend.com/" target="_talend"&gt;Talend&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In this post, I am looking at a few details of Vincent's approach to using Javascript in his transformation. 
I will present a few modifications that considerably improve performance of the Javascript execution. Some of these improvements are generic: because they apply to the use of the javascript language, they are likely to improve performance in both Talend as well as Kettle. Other improvements have to do with the way the incoming and outgoing record streams are bound to the javascript step in Kettle.&lt;h3&gt;Original Problem&lt;/h3&gt;&lt;br /&gt;The problem described by Vincent is simple enough: for each input string, return the string in lower case, except for the initial character, which should be in upper case. For example: &lt;code&gt;vIncEnt&lt;/code&gt; should become &lt;code&gt;Vincent&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Vincent illustrates his solution using Pentaho Data Integration's &lt;a href="http://wiki.pentaho.com/display/EAI/Modified+Java+Script+Value" target="_pdi"&gt;"Modified Javascript Value"&lt;/a&gt; step. He uses it to execute the following piece of code:&lt;pre&gt;&lt;br /&gt;//First letter in uppercase, others in lowercase&lt;br /&gt;var c = Input.getString().substr(0,1);&lt;br /&gt;if (parseInt(Input.getString().length)==1)&lt;br /&gt;{&lt;br /&gt;    var cc = upper(c);&lt;br /&gt;}&lt;br /&gt;else&lt;br /&gt;{&lt;br /&gt;    var cc = upper(c) + lower(Input.getString().slice(1));&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(The original post explains that one should be able to execute the code with minimal modification in Talend. While I don't have much experience with that tool, I think the proper step to use in that case is the tRhino step. Both tools use an embedded Rhino engine as javascript runtime, but I can imagine that there are slight differences with regard to binding the input and output fields and the support for built-in functions. 
Please feel free to leave a comment if you can provide more detailed information with regard to this matter.)&lt;br /&gt;&lt;br /&gt;In the script, &lt;code&gt;Input&lt;/code&gt; is the string field in the incoming stream that is to be modified, and &lt;code&gt;cc&lt;/code&gt; is added to the output stream, containing the modified value. For some reason, the original example uses the javascript step in &lt;a href="http://wiki.pentaho.com/display/EAI/Modified+Java+Script+Value#ModifiedJavaScriptValue-Whatdoesthecompatibilityswitchdo%3F" target="_pdi"&gt;compatibility mode&lt;/a&gt;, necessitating expressions such as &lt;code&gt;Input.getString()&lt;/code&gt; to obtain the value from the field.&lt;br /&gt;&lt;br /&gt;I used the following transformation to test this script:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/4107067431/" title="v0 by roland.bouman, on Flickr"&gt;&lt;img src="http://farm3.static.flickr.com/2729/4107067431_c595cdc16c_o.png" width="801" height="621" alt="v0" /&gt;&lt;/a&gt;&lt;br /&gt;The transformation uses a &lt;a href="http://wiki.pentaho.com/display/EAI/Generate+Rows" target="_pdi"&gt;Generate Rows step&lt;/a&gt; to generate 1 million rows having a single String type field with the default value &lt;code&gt;vIncEnt&lt;/code&gt;. The rows are processed by the Modified Javascript Value step, using the original code and compatibility mode as described in Vincent's original post. Finally, I used a &lt;a href="http://wiki.pentaho.com/display/EAI/Dummy+(do+nothing)" target="_pdi"&gt;Dummy step&lt;/a&gt;. 
I am not entirely sure the Dummy step has any effect on the performance of the javascript step, but I figured it would be a good idea to ensure the output of the script is actually copied to an outgoing stream.&lt;br /&gt;&lt;br /&gt;On my laptop, using Pentaho Data Integration 3.2, this transformation takes 21.6 seconds to complete, and the Javascript step processes the rows at a rate of 46210.7 rows/second.&lt;br /&gt;&lt;h3&gt;Caching calls to getString()&lt;/h3&gt;&lt;br /&gt;As I mentioned, the original transformation uses the Javascript step in compatibility mode. Compatibility mode affects the way the fields of the stream are bound to the javascript step. With compatibility mode enabled, the step behaves like it did in Kettle 2.5 (and earlier versions): fields from the input stream are considered to be objects, and a special getter method is required to obtain their value. This is why we need an expression like &lt;code&gt;Input.getString()&lt;/code&gt; to obtain the actual value. &lt;br /&gt;&lt;br /&gt;The first improvement I'd like to present is based on simply caching the return value from the getter method. So instead of writing &lt;code&gt;Input.getString()&lt;/code&gt; all the time, we simply write a line like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;var input = Input.getString();&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Afterwards, we simply refer only to &lt;code&gt;input&lt;/code&gt; instead of &lt;code&gt;Input.getString()&lt;/code&gt;. 
With this modification, the script becomes:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;//First letter in uppercase, others in lowercase&lt;br /&gt;&lt;b&gt;var input = Input.getString();&lt;/b&gt;&lt;br /&gt;var c = &lt;b&gt;input&lt;/b&gt;.substr(0,1);&lt;br /&gt;if (parseInt(&lt;b&gt;input&lt;/b&gt;.length)==1)&lt;br /&gt;{&lt;br /&gt;    var cc = upper(c);&lt;br /&gt;}&lt;br /&gt;else&lt;br /&gt;{&lt;br /&gt;    var cc = upper(c) + lower(&lt;b&gt;input&lt;/b&gt;.slice(1));&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(Note that &lt;code&gt;input&lt;/code&gt; and &lt;code&gt;Input&lt;/code&gt; are two different things here: &lt;code&gt;Input&lt;/code&gt; refers to the field object from the incoming record stream, and &lt;code&gt;input&lt;/code&gt; refers to a global javascript variable which we use to cache the return value from the &lt;code&gt;getString()&lt;/code&gt; method of the &lt;code&gt;Input&lt;/code&gt; field object.)&lt;br /&gt;&lt;br /&gt;If you compare this code to the original, you will notice that although this modified example adds an assignment to cache the value, it saves at least one call to the &lt;code&gt;getString()&lt;/code&gt; method in the generic case. However, because the input value used in the example is longer than one character, it also saves another call done in the &lt;code&gt;else&lt;/code&gt; branch of the &lt;code&gt;if&lt;/code&gt; statement. So all in all, we can avoid two calls to &lt;code&gt;getString()&lt;/code&gt; in this example. &lt;br /&gt;&lt;br /&gt;This may not seem like that big a deal, but still, this improvement allows the javascript step to process rows at a rate of 51200.6 rows per second, which is an improvement of about 11%. Scripts that would have more than two calls to the getter method would benefit even more from this simple improvement.&lt;br /&gt;&lt;h3&gt;Disabling Compatibility mode&lt;/h3&gt;&lt;br /&gt;The compatibility mode is just that: a way to stay compatible with the old Kettle 2.5 behaviour. 
While this is useful to ensure your old transformations don't break, you really should consider not using it for new transformations.&lt;br /&gt;&lt;br /&gt;When disabling compatibility mode, you will need to change the script. In compatibility mode, the names of the fields from the input stream behave like variables that point to the field objects. With compatibility mode disabled, fieldnames still behave like variables, but now they point to the actual value of the field, and not the field object. So we need to change the script like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;var c = &lt;b&gt;Input&lt;/b&gt;.substr(0,1);&lt;br /&gt;var cc;&lt;br /&gt;if (parseInt(&lt;b&gt;Input&lt;/b&gt;.length)==1){&lt;br /&gt;    cc = upper(c);&lt;br /&gt;} &lt;br /&gt;else {&lt;br /&gt;    cc = upper(c) + lower(&lt;b&gt;Input&lt;/b&gt;.slice(1));&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;As you can see, we don't need to use the &lt;code&gt;getString()&lt;/code&gt; method anywhere anymore, and this also makes our first improvement obsolete. Personally, I feel this is an improvement code-wise. In addition, the transformation now performs considerably better: now it takes 14.8 seconds, and the javascript step is processing 67159.1 rows per second, which is about 30% better than the previous solution, and 45% better than the original.&lt;br /&gt;&lt;h3&gt;Eliminating unnecessary code&lt;/h3&gt;&lt;br /&gt;The fastest code is the code you don't execute. The original script contains a call to the &lt;a href="http://www.w3schools.com/jsref/jsref_parseInt.asp" target="_w3c"&gt;javascript built-in parseInt() function&lt;/a&gt; which is applied to the &lt;code&gt;length&lt;/code&gt; property of &lt;code&gt;Input&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;if (&lt;b&gt;parseInt(Input.length)&lt;/b&gt;==1){&lt;br /&gt;...snip...&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt; The intended usage of &lt;code&gt;parseInt()&lt;/code&gt; is to parse strings into integer values. 
Because the type of the &lt;code&gt;length&lt;/code&gt; property of a string is already an integer, the call to &lt;code&gt;parseInt()&lt;/code&gt; is simply redundant, and can be removed without any issue. This cuts down execution time to 12.8 seconds, and the Javascript step is now processing at a rate of 75204.9 rows per second: an improvement of 12% as compared to the previous improvement, and 63% as compared to the original.&lt;br /&gt;&lt;h3&gt;Optimizing the flow&lt;/h3&gt;&lt;br /&gt;Although it may look like we optimized the original javascript as much as we could, there is still room for improvement. We can rewrite the &lt;code&gt;if&lt;/code&gt; statements using the ternary operator, like so:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;var cc = Input.length==0 &lt;br /&gt;        ? ""&lt;br /&gt;        : Input.length==1 &lt;br /&gt;        ? Input.toUpperCase()&lt;br /&gt;        : Input.substr(0,1).toUpperCase() &lt;br /&gt;        + Input.substr(1).toLowerCase() &lt;br /&gt;;&lt;/pre&gt;&lt;br /&gt;(Note that I am now using the &lt;code&gt;toLowerCase()&lt;/code&gt; and &lt;code&gt;toUpperCase()&lt;/code&gt; methods of the &lt;a href="http://www.w3schools.com/jsref/jsref_obj_string.asp" target="_js"&gt;javascript String object&lt;/a&gt; in favor of the kettle built-in &lt;code&gt;lower()&lt;/code&gt; and &lt;code&gt;upper()&lt;/code&gt; functions.)&lt;br /&gt;Not everybody may appreciate this code-wise, as it may appear a lot less explicit than the original &lt;code&gt;if&lt;/code&gt; logic. In its defense, the approach of this solution has a more functional feel (as opposed to the procedural logic of the prior examples), which may feel more natural for the problem at hand. 
Regardless of any code-maintenance or aesthetic arguments, this code is actually slightly faster: It takes 12.3 seconds total, and the javascript step is processing 80301.9 rows per second, which is a 7% improvement as compared to the previous solution, and a 74% improvement as compared to the original.&lt;br /&gt;&lt;h3&gt;Not using Javascript at all&lt;/h3&gt;&lt;br /&gt;The Javascript step can be very useful. But always keep in mind that it really is a general purpose scripting device. With the javascript step, you can do loops, open files, write to databases and whatnot. Do we really need all this power to solve the original problem? Especially if you are proficient in Javascript, it may be somewhat of a challenge to find better ways to solve the problem at hand, but really - it is often worth it.&lt;br /&gt;&lt;br /&gt;First, let us realize that the original problem does not presume a particularly difficult transformation. We just need "something" that takes one input value, and returns one output value. We don't need any side effects, like writing to a file. We also don't need to change the grain: every input row is matched by exactly one output row, which is similar in layout to the input row, save for the addition of a field to hold the transformed value. &lt;br /&gt;&lt;br /&gt;When discussing the previous solution, I already hinted that it was more "functional" as compared to the more "procedural" examples before that. We will now look at a few solutions that are also functional in nature:&lt;br /&gt;&lt;h3&gt;The Formula step&lt;/h3&gt;&lt;br /&gt;So, basically, we need to write a function. The &lt;a href="http://wiki.pentaho.com/display/EAI/Formula" target="_pdi"&gt;Formula step&lt;/a&gt; lets you combine several built-in functions in about the same manner as you can in spreadsheet programs like OpenOffice and Microsoft Excel. 
Using the formula step we can enter the following formula:&lt;pre&gt;&lt;br /&gt;UPPER(LEFT([Input];1)) &amp;amp; LOWER(MID([Input];2;LEN([Input])))&lt;br /&gt;&lt;/pre&gt;If, like me, your eyes are bleeding now, you might appreciate this formatted overview of this calculation:&lt;pre&gt;&lt;br /&gt;    UPPER(&lt;br /&gt;        LEFT(&lt;br /&gt;            [Input]&lt;br /&gt;        ;   1&lt;br /&gt;        )&lt;br /&gt;    ) &lt;br /&gt;&amp;amp;   LOWER(&lt;br /&gt;        MID(&lt;br /&gt;            [Input]&lt;br /&gt;        ;   2&lt;br /&gt;        ;   LEN([Input])&lt;br /&gt;        )&lt;br /&gt;    )&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This solution takes 8.5 seconds to complete, and the formula step is processing rows at a rate of 117868.9 per second, which is 47% better than the previous solution, and 155% better than the original (!!!)&lt;br /&gt;&lt;h3&gt;The Calculator step&lt;/h3&gt;&lt;br /&gt;While not as flexible as the Formula step, the &lt;a href="http://wiki.pentaho.com/display/EAI/Calculator" target="_pdi"&gt;Calculator step&lt;/a&gt; offers a reasonable range of often used functions, and has the advantage of often being faster than the formula step. In this case, we're lucky, and we can set up two calculations: one "LowerCase of a string A" to convert the input value entirely to lower case, and then a "First letter of each word in capital of a string A". By feeding the output of the former into the latter, we get the desired result:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/4107067435/" title="v4 by roland.bouman, on Flickr"&gt;&lt;img src="http://farm3.static.flickr.com/2566/4107067435_9a90947c9c_o.png" width="826" height="226" alt="v4" /&gt;&lt;/a&gt;&lt;br /&gt;(To be fair, because the calculation will actually capitalize every word in the input, the result will be different as compared to any of the other transformations. 
However, in many cases, you might be able to guarantee that there is actually one word in the input, or otherwise, it may be considered desirable to capitalize all words.)&lt;br /&gt;&lt;br /&gt;This transformation completes in just 6.5 seconds, and the calculator processes rows at a rate of 155327.7 per second. This is 32% better than the previous solution and 236% better than the original.&lt;br /&gt;&lt;h3&gt;User-defined Java Expression&lt;/h3&gt;&lt;br /&gt;The final kicker is the &lt;a href="http://wiki.pentaho.com/display/EAI/User+Defined+Java+Expression" target="_pdi"&gt;user-defined java expression step&lt;/a&gt;. The user-defined java expression step allows you to write a java expression, which is compiled while the transformation is initialized. The expression I used is quite like the last javascript solution I discussed, except that we have to use methods of the &lt;a href=" http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#substring(int)" target="_java"&gt;Java String object&lt;/a&gt; (and not the JavaScript string object):&lt;pre&gt;&lt;br /&gt;Input.length()==0?"":Input.length()==1?Input.toUpperCase():Input.substring(0,1).toUpperCase() + Input.substring(1).toLowerCase()&lt;/pre&gt;&lt;br /&gt;The result is truly amazing: The transformation completes in just 3.1 seconds, with the user-defined Java expression step processing at a rate of 324886.2 rows per second. This is 109% faster than the previous solution, and 603% faster than the original.&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;&lt;br /&gt;Javascript is a powerful device in data integration transformations, but it is quite slow. Consider replacing the javascript step with either the formula step, the calculator step or the user-defined Java expression step. 
Depending on your requirements, there may be other steps that deliver the functionality you need.&lt;br /&gt;&lt;br /&gt;If you really do need javascript, and you are using Pentaho Data Integration, consider disabling the compatibility mode. On the other hand, if you do need the compatibility mode, be sure to avoid repeated calls to the getter methods of the field objects to obtain the value. Instead, call the getter methods just once, and use global script variables to cache the return value. &lt;br /&gt;&lt;h3&gt;Summary&lt;/h3&gt;&lt;br /&gt;Here's a summary of the measurements:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    Transformation   |  Rows per second&lt;br /&gt;    -----------------+-----------------&lt;br /&gt;    Original         |   46201.7&lt;br /&gt;    Cache getString()|   51200.6&lt;br /&gt;    No Compatmode    |   67159.1&lt;br /&gt;    No parseInt()    |   75204.9&lt;br /&gt;    Optimize flow    |   80301.9&lt;br /&gt;    Formula          |  117868.9&lt;br /&gt;    Calculator       |  155327.7&lt;br /&gt;    Java Expression  |  324886.2&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;...and here, a bar chart showing the results:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/4107067443/" title="v1000 by roland.bouman, on Flickr"&gt;&lt;img src="http://farm3.static.flickr.com/2612/4107067443_ca83b13bda_o.png" width="539" height="261" alt="v1000" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;h3&gt;Final thoughts&lt;/h3&gt;&lt;br /&gt;One of the things I haven't looked at in detail is adding more parallelism. By simply modifying the number of copies of the transforming step, we can use more cores/processors, but this is an excellent subject for a separate blog post.&lt;br /&gt;&lt;br /&gt;&lt;div style="border:solid; border-color: red"&gt;&lt;br /&gt;&lt;b&gt;UPDATE&lt;/b&gt;&lt;br /&gt;&lt;a href="http://daniele.livejournal.com/" target="_deinsp"&gt;Daniel Einspanjer&lt;/a&gt; from &lt;a href="https://blog.mozilla.com/data/" target="_moz"&gt;Mozilla&lt;/a&gt; Corp. 
created &lt;a href="http://people.mozilla.com/~deinspanjer/KettleJSPerformance.mov"&gt;a 30 min. video&lt;/a&gt; demonstrating this hands-on! He adds a few very interesting approaches to squeeze out even more performance.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-824115688560580731?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/824115688560580731/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=824115688560580731' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/824115688560580731'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/824115688560580731'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/11/pentaho-data-integration-javascript.html' title='Pentaho Data Integration: Javascript Step Performance'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-3110806468379653967</id><published>2009-10-27T14:54:00.004+01:00</published><updated>2009-10-27T16:03:18.136+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='partitioning'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Data  warehousing'/><category scheme='http://www.blogger.com/atom/ns#' term='column oriented databases'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Infobright'/><category scheme='http://www.blogger.com/atom/ns#' term='MonetDB'/><category scheme='http://www.blogger.com/atom/ns#' term='LucidDB'/><category scheme='http://www.blogger.com/atom/ns#' term='analytic databases'/><category scheme='http://www.blogger.com/atom/ns#' term='Calpont'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Calpont opens up: InfiniDB Open Source Analytical Database (based on MySQL)</title><content type='html'>Open source business intelligence and data warehousing are on the rise!&lt;br /&gt;&lt;br /&gt;If you kept up with the &lt;a href="http://www.mysqlperformanceblog.com/" target="_mysqlperf"&gt;MySQL Performance Blog&lt;/a&gt;, you might have &lt;a href="http://www.mysqlperformanceblog.com/2009/10/26/air-traffic-queries-in-luciddb/" target="_mysqlperf"&gt;noticed&lt;/a&gt; a &lt;a href="http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/" target="_mysqlperf"&gt;number&lt;/a&gt; of &lt;a href="http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/" target="_mysqlperf"&gt;posts&lt;/a&gt; comparing the open source analytical databases &lt;a href="http://www.infobright.org/"&gt;Infobright&lt;/a&gt;, &lt;a href="http://www.luciddb.org/" target="_lucid"&gt;LucidDB&lt;/a&gt;, and &lt;a href="http://monetdb.cwi.nl/"&gt;MonetDB&lt;/a&gt;. 
LucidDB &lt;a href="http://www.nicholasgoodman.com/bt/blog/2009/10/24/luciddb-dynamobi-is-running-with-it/" target="_lucid"&gt;got&lt;/a&gt; some &lt;a href="http://n2.nabble.com/Introducing-Dynamo-BI-td3883211.html" target="_lucid"&gt;more&lt;/a&gt; news &lt;a href="http://thinkwaitfast.blogspot.com/2009/10/introducing-dynamo-bi.html"&gt;last&lt;/a&gt; week when &lt;a href="http://www.nicholasgoodman.com/bt/blog/" target="_nick"&gt;Nick Goodman&lt;/a&gt; announced that the Dynamo Business Intelligence Corporation will be offering services around LucidDB, branding it as DynamoDB.&lt;br /&gt;&lt;br /&gt;Now, to top it off, &lt;a href="http://www.calpont.com/" target="_calpont"&gt;Calpont&lt;/a&gt; has just released &lt;a href="http://infinidb.org/resources/what-is-infinidb" target="_calpont"&gt;InfiniDB&lt;/a&gt;, a GPLv2 open source version of its analytical database offering, which is based on the MySQL server.&lt;br /&gt;&lt;br /&gt;So, let's take a quick look at InfiniDB. I haven't yet played around with it, but the features sure look interesting:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Column-oriented architecture (like all other analytical database products mentioned)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Transparent compression&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Vertical and horizontal partitioning: on top of being column-oriented, data is also partitioned, potentially allowing for less IO to access data.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;MVCC and support for high concurrency. 
It would be interesting to see how much benefit this gives when loading data, because this is usually one of the bottlenecks for column-oriented databases&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Support for ACID/Transactions&lt;/li&gt;&lt;br /&gt;&lt;li&gt;High performance bulkloader&lt;/li&gt;&lt;br /&gt;&lt;li&gt;No specialized hardware - InfiniDB is a pure software solution that can run on commodity hardware&lt;/li&gt;&lt;br /&gt;&lt;li&gt;MySQL compatible&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;The website sums up a few more features and benefits, but I think this covers the most important ones. &lt;br /&gt;&lt;br /&gt;Calpont also offers a closed source enterprise edition, which differs from the open source edition by offering multi-node scale-out support. By that, they do not mean regular MySQL replication scale-out. Instead, the enterprise edition features a true distributed database architecture which allows you to divide incoming requests across a layer of so-called "user modules" (MySQL front ends) and "performance modules" (the actual workhorses that partition, retrieve and cache data). In this scenario, the user modules break the queries they receive from client applications into pieces, and send them to one or more performance modules in a parallel fashion. The performance modules then retrieve the actual data from either their cache or from the disk, and send those back to the user modules, which re-assemble the partial and intermediate results into the final resultset which is sent back to the client. 
(see picture)&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/4049476409/" title="shared-disk-arch-simple by roland.bouman, on Flickr"&gt;&lt;img src="http://farm3.static.flickr.com/2563/4049476409_a124c2b147_o.jpg" width="821" height="465" alt="shared-disk-arch-simple" /&gt;&lt;/a&gt;&lt;br /&gt;Given the MySQL compatibility and otherwise similar features, I think it is fair to compare the open source InfiniDB offering to the Infobright community edition. Interesting differences are that InfiniDB supports all usual DML statements (&lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;), and that InfiniDB offers the same bulkloader in both the community edition and the enterprise edition: Infobright community edition does not support DML, and offers a bulk loader that is less performant than the one included in its enterprise edition. I have not heard of an Infobright multi-node option, so when comparing the enterprise edition featuresets, that seems like an advantage too in Calpont's offering.&lt;br /&gt;&lt;br /&gt;Please understand that I am not endorsing one of these products over the other: I'm just doing a checkbox feature list comparison here. What it mostly boils down to, is that users who need an affordable analytical database now have even more choice than before. In addition, it adds a bit more competition for the vendors, and I expect them all to improve as a result of that. 
These are interesting times for the BI and data warehousing market :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-3110806468379653967?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/3110806468379653967/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=3110806468379653967' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3110806468379653967'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3110806468379653967'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/10/calpont-opens-up-infinidb-open-source.html' title='Calpont opens up: InfiniDB Open Source Analytical Database (based on MySQL)'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-8192519070945540628</id><published>2009-09-15T15:15:00.003+02:00</published><updated>2009-09-16T00:53:56.626+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Percentile'/><category scheme='http://www.blogger.com/atom/ns#' term='GROUP_CONCAT'/><category scheme='http://www.blogger.com/atom/ns#' term='FIND_IN_SET'/><category scheme='http://www.blogger.com/atom/ns#' term='User defined variable'/><category scheme='http://www.blogger.com/atom/ns#' term='self-join'/><category scheme='http://www.blogger.com/atom/ns#' 
term='RANK'/><category scheme='http://www.blogger.com/atom/ns#' term='join'/><title type='text'>MySQL: Another Ranking trick</title><content type='html'>I just read &lt;a href="http://code.openark.org/blog/mysql/sql-ranking-without-self-join"  target="_shlomi"&gt;SQL: Ranking without self join&lt;/a&gt;, in which &lt;a href="http://code.openark.org/blog/" target="_shlomi"&gt;Shlomi Noach&lt;/a&gt; shares a nice MySQL-specific trick based on &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/user-variables.html" target="_mysql"&gt;user-defined variables&lt;/a&gt; to compute rankings. &lt;br /&gt;&lt;br /&gt;Shlomi's trick reminds me somewhat of the trick I came across little over a year ago to &lt;a href="http://rpbouman.blogspot.com/2008/07/calculating-percentiles-with-mysql.html" target="_rpb"&gt;calculate percentiles&lt;/a&gt;. At that time, several people also pointed out to me that using user-defined variables in this way can be unreliable.&lt;h3&gt;The problem with user-defined variables&lt;/h3&gt;So what is the problem exactly? Well, whenever a query assigns to a variable, and that same variable is read in another part of the query, you're on thin ice. That's because the result of the read is likely to differ depending on whether the assignment took place before or after the read. Not surprising when you think about it - the whole point of variable assignment is to change its value, which by definition causes a different result when subsequently reading the variable (unless you assigned the already assigned value of course, duh...). &lt;br /&gt;&lt;br /&gt;Now read that previous statement carefully - the word &lt;em&gt;subsequently&lt;/em&gt; is all-important.&lt;br /&gt;&lt;br /&gt;See, that's the problem. The semantics of a SQL &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/select.html" target="_mysql"&gt;SELECT&lt;/a&gt;&lt;/code&gt; statement is to obtain a (tabular) resultset - not to specify an algorithm to construct that resultset. 
It is the job of the RDBMS to figure out an algorithm and thus, you can't be sure in what order individual expressions (including variable evaluation and assignment) are executed. &lt;br /&gt;&lt;br /&gt;The MySQL manual states it like this:&lt;blockquote&gt;&lt;br /&gt;The order of evaluation for user variables is undefined and may change based on the elements contained within a given query. In &lt;code&gt;SELECT @a, @a := @a+1 ...&lt;/code&gt;, you might think that MySQL will evaluate &lt;code&gt;@a&lt;/code&gt; first and then do an assignment second, but changing the query (for example, by adding a &lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, or &lt;code&gt;ORDER BY&lt;/code&gt; clause) may change the order of evaluation.&lt;br /&gt;&lt;br /&gt;The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed.&lt;/blockquote&gt;&lt;h3&gt;So what good are these variables anyway?&lt;/h3&gt;On the one hand, this looks really lame: can't MySQL just figure out the correct order of doing the calculations? Well, that is one way of looking at it. But there is an equally valid reason not to do that. If the calculations were to influence execution order, it would drastically lessen the number of ways that are available to optimize the statement. &lt;br /&gt;&lt;br /&gt;This raises the question: Why is it possible at all to assign values to the user-defined variables? The answer is quite simple: you can use it to pass values between statements. My hunch is the variables were created in the olden days to overcome some limitations resulting from the lack of support for subqueries. Having variables at least enables you to execute a query and assign the result temporarily for use in a subsequent statement. 
For example, to find the student with the highest score, you can do:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; select @score:=max(score) from score;&lt;br /&gt;+--------------------+&lt;br /&gt;| @score:=max(score) |&lt;br /&gt;+--------------------+&lt;br /&gt;|                 97 |&lt;br /&gt;+--------------------+&lt;br /&gt;1 row in set (0.00 sec)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; select * from score where score = @score;&lt;br /&gt;+----------+--------------+-------+&lt;br /&gt;| score_id | student_name | score |&lt;br /&gt;+----------+--------------+-------+&lt;br /&gt;|        2 | Gromit       |    97 |&lt;br /&gt;+----------+--------------+-------+&lt;br /&gt;1 row in set (0.03 sec)&lt;br /&gt;&lt;/pre&gt;There is nothing wrong with this approach - problems start arising only when reading and writing the same variable in one and the same statement.&lt;h3&gt;Another way - serializing the set with &lt;code&gt;GROUP_CONCAT&lt;/code&gt;&lt;/h3&gt;&lt;br /&gt;Anyway, the percentile post I just linked to contains another solution for that problem that relies on &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html#function_group-concat" target="_mysql"&gt;GROUP_CONCAT&lt;/a&gt;&lt;/code&gt;. It turns out we can use the same trick here.&lt;br /&gt;&lt;br /&gt;(Some people may like to point out that using &lt;code&gt;GROUP_CONCAT&lt;/code&gt; is not without issues either, because it may truncate the list in case the pre-assigned string buffer is not large enough. I wrote about dealing with that limitation in several places and I still recommend setting the &lt;code&gt;group_concat_max_len&lt;/code&gt; server variable to the value set for the &lt;code&gt;max_allowed_packet&lt;/code&gt; server variable like so: &lt;pre&gt;SET @@group_concat_max_len := @@max_allowed_packet;&lt;/pre&gt;)&lt;br /&gt;&lt;br /&gt;The best way to understand how it works is to think of the problem in a few steps. 
First, we make an ordered list of all the values we want to rank. We can do this with &lt;code&gt;GROUP_CONCAT&lt;/code&gt; like this:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; SELECT  &lt;b&gt;GROUP_CONCAT(&lt;/b&gt;&lt;br /&gt;    -&amp;gt;             &lt;b&gt;DISTINCT score&lt;/b&gt;&lt;br /&gt;    -&amp;gt;             &lt;b&gt;ORDER BY score  DESC&lt;/b&gt;&lt;br /&gt;    -&amp;gt;         &lt;b&gt;)                   AS scores&lt;/b&gt;&lt;br /&gt;    -&amp;gt; FROM    score&lt;br /&gt;    -&amp;gt; ;&lt;br /&gt;+-------------+&lt;br /&gt;| scores      |&lt;br /&gt;+-------------+&lt;br /&gt;| 97,95,92,85 |&lt;br /&gt;+-------------+&lt;br /&gt;1 row in set (0.00 sec)&lt;/pre&gt;&lt;br /&gt;Now that we have this list, we can use the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_find-in-set" target="_mysql"&gt;FIND_IN_SET&lt;/a&gt;&lt;/code&gt; function to look up the position of any particular value contained in the list. Because the list is ordered in descending order (due to the &lt;code&gt;ORDER BY ... DESC&lt;/code&gt;), and contains only unique values (due to the &lt;code&gt;DISTINCT&lt;/code&gt;), this position is in fact the rank number. 
For example, if we want to know the rank of all scores with the value 92, we can do:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; SELECT &lt;b&gt;FIND_IN_SET&lt;/b&gt;(92, '97,95,92,85');&lt;br /&gt;+--------------------------------+&lt;br /&gt;| FIND_IN_SET(92, '97,95,92,85') |&lt;br /&gt;+--------------------------------+&lt;br /&gt;|                              3 |&lt;br /&gt;+--------------------------------+&lt;br /&gt;1 row in set (0.00 sec)&lt;/pre&gt;So, the answer is &lt;code&gt;3&lt;/code&gt; because &lt;code&gt;92&lt;/code&gt; is the third entry in the list.&lt;br /&gt;&lt;br /&gt;(If you're wondering how it's possible that we can pass the integer &lt;code&gt;92&lt;/code&gt; as first argument for &lt;code&gt;FIND_IN_SET&lt;/code&gt;: the function expects string arguments, and automatically converts whichever non-string typed value we pass to a string. In the case of the integer &lt;code&gt;92&lt;/code&gt;, it is silently converted to the string &lt;code&gt;'92'&lt;/code&gt;.)&lt;br /&gt;&lt;br /&gt;Of course, we aren't really interested in looking up ranks for individual numbers one at a time; rather, we'd like to combine this with a query on the &lt;code&gt;score&lt;/code&gt; table that does it for us. 
Likewise, we don't really want to manually supply the list of values as a string constant; we want to substitute that with the query we wrote to generate that list.&lt;br /&gt;So, we get:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; SELECT score_id, student_name, score&lt;br /&gt;    -&amp;gt; ,      FIND_IN_SET(&lt;br /&gt;    -&amp;gt;            score&lt;br /&gt;    -&amp;gt;        ,  (SELECT  GROUP_CONCAT(&lt;br /&gt;    -&amp;gt;                        DISTINCT score&lt;br /&gt;    -&amp;gt;                        ORDER BY score  DESC&lt;br /&gt;    -&amp;gt;                    )&lt;br /&gt;    -&amp;gt;            FROM    score)&lt;br /&gt;    -&amp;gt;        ) as rank&lt;br /&gt;    -&amp;gt; FROM   score;&lt;br /&gt;+----------+--------------+-------+------+&lt;br /&gt;| score_id | student_name | score | rank |&lt;br /&gt;+----------+--------------+-------+------+&lt;br /&gt;|        1 | Wallace      |    95 |    2 |&lt;br /&gt;|        2 | Gromit       |    97 |    1 |&lt;br /&gt;|        3 | Shaun        |    85 |    4 |&lt;br /&gt;|        4 | McGraw       |    92 |    3 |&lt;br /&gt;|        5 | Preston      |    92 |    3 |&lt;br /&gt;+----------+--------------+-------+------+&lt;br /&gt;5 rows in set (0.00 sec)&lt;/pre&gt;&lt;br /&gt;Alternatively, if you think that subqueries are for the devil, you can rewrite this to a &lt;code&gt;CROSS JOIN&lt;/code&gt; like so:&lt;pre&gt;&lt;br /&gt;SELECT      score_id, student_name, score&lt;br /&gt;,           FIND_IN_SET(&lt;br /&gt;                score&lt;br /&gt;            ,   scores&lt;br /&gt;            ) AS rank&lt;br /&gt;FROM        score&lt;br /&gt;CROSS JOIN (SELECT  GROUP_CONCAT(&lt;br /&gt;                       DISTINCT score&lt;br /&gt;                       ORDER BY score  DESC&lt;br /&gt;                    ) AS scores&lt;br /&gt;            FROM    score) scores&lt;/pre&gt;&lt;br /&gt;Now that we have a solution, let's see how it compares to Shlomi's original method. 
To do this, I am using the &lt;code&gt;payment&lt;/code&gt; table from the &lt;a href="http://dev.mysql.com/doc/sakila/en/sakila.html" target="_mysql"&gt;sakila&lt;/a&gt; sample database.&lt;br /&gt;&lt;br /&gt;First, Shlomi's method:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; SELECT   payment_id&lt;br /&gt;    -&amp;gt; ,        amount&lt;br /&gt;    -&amp;gt; ,        @prev := @curr&lt;br /&gt;    -&amp;gt; ,        @curr := amount&lt;br /&gt;    -&amp;gt; ,        @rank := IF(@prev = @curr, @rank, @rank+1) AS rank&lt;br /&gt;    -&amp;gt; FROM     sakila.payment&lt;br /&gt;    -&amp;gt; ,       (SELECT @curr := null, @prev := null, @rank := 0) sel1&lt;br /&gt;    -&amp;gt; ORDER BY amount DESC;&lt;br /&gt;+------------+--------+----------------+-----------------+------+&lt;br /&gt;| payment_id | amount | @prev := @curr | @curr := amount | rank |&lt;br /&gt;+------------+--------+----------------+-----------------+------+&lt;br /&gt;|        342 |  11.99 |           NULL |           11.99 |    1 |&lt;br /&gt;.        ... .  ..... .          ..... .           ..... .    . .&lt;br /&gt;|      15456 |   0.00 |           0.00 |            0.00 |   19 |&lt;br /&gt;+------------+--------+----------------+-----------------+------+&lt;br /&gt;16049 rows in set (0.09 sec)&lt;/pre&gt;&lt;br /&gt;Wow! 
It sure is fast :) Now, the &lt;code&gt;GROUP_CONCAT&lt;/code&gt; solution, using a subquery:&lt;pre&gt;&lt;br /&gt;mysql&amp;gt; SELECT payment_id, amount&lt;br /&gt;    -&amp;gt; ,      FIND_IN_SET(&lt;br /&gt;    -&amp;gt;            amount&lt;br /&gt;    -&amp;gt;        ,  (SELECT  GROUP_CONCAT(&lt;br /&gt;    -&amp;gt;                        DISTINCT amount&lt;br /&gt;    -&amp;gt;                        ORDER BY amount  DESC&lt;br /&gt;    -&amp;gt;                    )&lt;br /&gt;    -&amp;gt;            FROM    sakila.payment)&lt;br /&gt;    -&amp;gt;        ) as rank&lt;br /&gt;    -&amp;gt; FROM   sakila.payment;&lt;br /&gt;+------------+--------+------+&lt;br /&gt;| payment_id | amount | rank |&lt;br /&gt;+------------+--------+------+&lt;br /&gt;|          1 |   2.99 |   15 |&lt;br /&gt;.          . .   .... .   .. .&lt;br /&gt;|      16049 |   2.99 |   15 |&lt;br /&gt;+------------+--------+------+&lt;br /&gt;16049 rows in set (0.14 sec)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(In case you're wondering why the results are different, this is because the result set for Shlomi's solution is necessarily ordered by ascending rank (or descending amount - same difference). To obtain the identical result, you need to add an &lt;code&gt;ORDER BY&lt;/code&gt; clause to my query. But since the point was to calculate the ranks, I didn't bother. Of course, adding an &lt;code&gt;ORDER BY&lt;/code&gt; could slow things down even more.)&lt;br /&gt;&lt;br /&gt;Quite a bit slower, bummer. But at least we can't run into nasties with the user variables anymore. For this data set, I get about the same performance with the &lt;code&gt;CROSS JOIN&lt;/code&gt;, but I should warn that I did not do a real benchmark.&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;Don't fall into the trap of reading and writing the same user-defined variable in the same statement. 
Although it seems like a great device and can give you very good performance, you cannot really control the order of reads and writes. Even if you can, you must check it again whenever you have reason to believe the query will be solved differently by the server. This is of course the case whenever you upgrade the server. But even seemingly harmless changes, like adding an index to a table, may change the order of execution.&lt;br /&gt;&lt;br /&gt;In almost all cases where people want to read and write the same user variables within the same query, they are dealing with a kind of serialization problem. They are trying to maintain state in a variable in order to use it across rows. In many cases, the right way to do that is to use a self-join. But this may not always be feasible, as pointed out in Shlomi's original post. For example, rewriting the payment rank query using a self-join is not going to make you happy. &lt;br /&gt;&lt;br /&gt;Often, there is a way out. You can use &lt;code&gt;GROUP_CONCAT&lt;/code&gt; to serialize a set of rows. 
Granted, you need at least one pass for that, and another one to do something useful with the result, but this is still a lot better than dealing with semi-cartesian self-join issues.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-8192519070945540628?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/8192519070945540628/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=8192519070945540628' title='38 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8192519070945540628'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8192519070945540628'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/09/mysql-another-ranking-trick.html' title='MySQL: Another Ranking trick'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>38</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-8581340365503538920</id><published>2009-09-12T18:30:00.000+02:00</published><updated>2009-09-12T18:27:40.284+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Data  warehousing'/><category scheme='http://www.blogger.com/atom/ns#' term='Infobright'/><category scheme='http://www.blogger.com/atom/ns#' term='Kickfire'/><category scheme='http://www.blogger.com/atom/ns#' term='antitrust'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Oracle / Sun deal'/><category scheme='http://www.blogger.com/atom/ns#' term='EU'/><category scheme='http://www.blogger.com/atom/ns#' term='ScaleDB'/><category scheme='http://www.blogger.com/atom/ns#' term='Calpont'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='Exadata'/><title type='text'>EU Should Protect MySQL-based Special Purpose Database Vendors</title><content type='html'>In &lt;a href="http://rpbouman.blogspot.com/2009/09/mysql-factor-in-eus-decision.html" target="_rpb"&gt;my recent post&lt;/a&gt; on the EU antitrust regulators' probe into the Oracle / Sun merger I did not mention an important class of stakeholders: the MySQL-based special purpose database startups. By these I mean:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.kickfire.com/" target="_kf"&gt;Kickfire&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.infobright.org/" target="_ib"&gt;Infobright&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.calpont.com/" target="_cp"&gt;Calpont&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.scaledb.com/" target="_sdb"&gt;ScaleDB&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;I think it's safe to say the first three are comparable in the sense that they are all analytical databases: they are designed for data warehousing and business intelligence applications. ScaleDB might be a good fit for those applications, but I think its architecture is sufficiently different from the first three that I would not call it an analytical database.&lt;br /&gt;&lt;br /&gt;For Kickfire and Infobright, the selling point is that they are offering a relatively cheap solution to build large data warehouses and responsive business intelligence applications. (I can't really find enough information on Calpont pricing, although they do mention low total cost of ownership.) 
An extra selling point is that they are MySQL compatible, which may make some difference for some customers. But that compatibility is in my opinion not as important as the availability of a serious data warehousing solution at a really sharp price.&lt;br /&gt;&lt;br /&gt;Now, in my previous post, I mentioned that the MySQL and Oracle RDBMS products are very different, and I do not perceive them as competing. Instead of trying to kill the plain MySQL database server product, Oracle should take advantage of a huge opportunity to help shape the web by being a good steward, leading ongoing MySQL development, and in addition, enabling their current Oracle Enterprise customers to build cheap LAMP-based websites (with the possibility of adding value by offering Oracle to MySQL data integration).&lt;br /&gt;&lt;br /&gt;For these analytical database solutions, things may be different though. &lt;br /&gt;&lt;br /&gt;I think these MySQL-based analytical databases really are competitive with Oracle's &lt;a href="http://www.oracle.com/database/exadata.html" target="_ora"&gt;Exadata&lt;/a&gt; analytical appliance. Oracle could pose a serious threat to these MySQL-based analytical database vendors. After the merger, Oracle would certainly be in a position to hamper these vendors by restricting the non-GPL licensed usage of MySQL.&lt;br /&gt;&lt;a href="http://www.oracle.com/features/suncustomers.html" target="_ora"&gt;In a recent ad&lt;/a&gt;, Oracle vowed to increase investments in developing Sun's hardware and operating system technology. 
And this would eventually put them in an even better position to create appliances like Exadata, allowing them to ditch an external hardware partner like HP (which is their Exadata hardware partner).&lt;br /&gt;&lt;br /&gt;So, all in all, in my opinion the EU should definitely take a serious look at the dynamics of the analytical database market and decide how much impact the Oracle / Sun merger could have on this particular class of MySQL OEM customers. The rise of these relatively cheap MySQL-based analytical databases is a very interesting development for the business intelligence and data warehousing space in general, and means a big win for customers that need affordable data warehousing / business intelligence. It would be a shame if it were curtailed by Oracle. After the merger, Oracle sure would have the means and the motive, so if someone needs protection, I think it would be these MySQL-based vendors of analytical databases.&lt;br /&gt;&lt;br /&gt;As always, these are just my musings and opinions - speculation is free. 
Feel free to correct me, add applause or point out my ignorance :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-8581340365503538920?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/8581340365503538920/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=8581340365503538920' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8581340365503538920'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/8581340365503538920'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/09/eu-should-protect-mysql-based-special.html' title='EU Should Protect MySQL-based Special Purpose Database Vendors'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-896013464236860455</id><published>2009-09-03T14:10:00.002+02:00</published><updated>2009-09-03T14:15:45.025+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Customers'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category scheme='http://www.blogger.com/atom/ns#' term='Sun'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle / Sun deal'/><category scheme='http://www.blogger.com/atom/ns#' term='EU'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Enterprise'/><category scheme='http://www.blogger.com/atom/ns#' term='Neelie Kroes'/><title type='text'>MySQL a factor in EU's decision</title><content type='html'>I just read &lt;a href="http://blog.thinkphp.de/archives/429-EU-to-probe-Oracle-Sun-deal.html" target="_bjorn"&gt;Björn Schotte's post&lt;/a&gt; on the activities of the European Union antitrust regulators concerning the &lt;a href="http://www.sun.com/third-party/global/oracle/" target="_sun"&gt;intended takeover&lt;/a&gt; of &lt;a href="http://www.sun.com/" target="_sun"&gt;Sun Microsystems&lt;/a&gt; by &lt;a href="http://www.oracle.com/us/sun/index.htm" target="_oracle"&gt;Oracle&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Björn mentions a &lt;a href="http://news.yahoo.com/s/nm/20090903/bs_nm/us_sun_oracle_eu_1" target="_"&gt;news article&lt;/a&gt; that cites EU Competition Commissioner Neelie Kroes saying that the commission has the obligation to protect the customers from reduced choice, higher costs or both. But to me, this bit is not the most interesting. Later on the article reads:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;The Commission said it was concerned that the open source nature of Sun's MySQL database might not eliminate fully the potential for anti-competitive effects.&lt;br /&gt;&lt;br /&gt;With both Oracle's databases and MySQL competing directly in many sectors of the database market, MySQL is widely expected to represent a greater competitive constraint as it becomes increasingly functional, the EU executive said.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;In other words, the commission is working to protect the MySQL users :)&lt;br /&gt;&lt;br /&gt;Personally, I (and many other MySQL community members) don't fear for the future of MySQL as a product. But I do think it is justified to worry about customers that are now paying Sun for some licensed usage of MySQL, most notably OEM customers and a bunch of Enterprise users. 
&lt;br /&gt;&lt;br /&gt;Ever since the news was disclosed concerning the intention of Oracle to acquire Sun, it has been speculated that Oracle may try to "upsell" the Oracle RDBMS to current MySQL enterprise users. However, I don't think that this would be the brightest of moves. I did a bit of speculation myself back in April in response to questions put forward in the &lt;a href="http://www.sswug.org/editorials/default.aspx?id=1657" target="_sswug"&gt;SSWUG newsletter&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I maintain the opinions I stated there: &lt;ul&gt;&lt;br /&gt;&lt;li&gt;MySQL / Oracle are completely different beasts and customers realize this, and most likely Oracle does so too. People running MySQL for web-related applications won't move to Oracle. Period. Oracle may be able to grab some customers that use MySQL for data warehousing, but I think that even in these cases a choice for Infobright or Kickfire makes more sense.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Not all problems are database problems - if Oracle does a decent job of supporting and developing MySQL, they may become a respectable enough partner for current (larger) MySQL users to help them solve other problems such as systems integration.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Instead of looking at the benefits for MySQL customers of using Oracle, look at the benefits for Oracle customers using MySQL. Suddenly Oracle can offer support for the most popular webstack in the world - now all these enterprise customers running expensive Oracle installations can finally build cheap websites based on MySQL and even get support from Oracle on connecting their backend Enterprise Oracle instances to the MySQL web front ends.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;It's not all about the products. Open Source adds a whole new dynamic to the development process. I'm not just talking about outside developers that offer new features and code patches, as this does not happen too often. 
There's more to it than code though.&lt;br /&gt;&lt;br /&gt;In all successful open source projects I know there is a very lively culture of users engaging with developers and voicing their opinion on what is good and what is not so good. There is a very real chance for the user to influence the direction of the development process (although this does not mean everybody gets what they want in equal amounts). Conversely, this provides a great opportunity for the development organization to learn about what the users really need and wish for.&lt;br /&gt;&lt;br /&gt;In short, Oracle may want to use Sun/MySQL to learn how to do better business with more empowered users.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Of course, it's all just my opinion - speculation is free. So you should feel free, too, to post your ideas on the matter. Go ahead and leave a comment ;)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-896013464236860455?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/896013464236860455/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=896013464236860455' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/896013464236860455'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/896013464236860455'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/09/mysql-factor-in-eus-decision.html' title='MySQL a factor in EU&apos;s decision'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' 
src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-3222850260591564167</id><published>2009-09-03T04:25:00.000+02:00</published><updated>2009-09-03T04:19:02.945+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='webdata'/><title type='text'>Roland Bouman's blog goes i18n (Powered by Google Translate)</title><content type='html'>Now that Pentaho Solutions is in print, and the first few copies are finding their way towards the readers, I felt like doing something completely unrelated. So, I hacked up a little page translation widget, based on the &lt;a href="http://code.google.com/apis/ajaxlanguage/documentation/" target="_glapi"&gt;Google Language API&lt;/a&gt;. You can see the result at the top of the left sidebar of my blog right now:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3883162854/" title="translator by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3447/3883162854_df21a3446d_o.png" width="583" height="361" alt="translator" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Using it is very simple: just pick the language of choice, and the page (text and some attributes like alt and title) will be translated. Pick the first entry in the list to see the original language again.&lt;br /&gt;&lt;br /&gt;This all happens inline by dynamic DOM manipulation, without having to reload the page. I tested it on Chrome 2, Firefox 3.5, Opera 10, Safari 4 and Internet Explorer 6 and 8. So far, it seems to work in all these browsers. &lt;br /&gt;&lt;br /&gt;Personally, I feel that the user experience you get with this widget is superior to what you would get with the &lt;a href="http://translate.google.com/translate_tools"&gt;google translation gadget&lt;/a&gt;. 
In addition, it is pretty easy to configure the &lt;code&gt;Translator&lt;/code&gt; class. &lt;br /&gt;&lt;br /&gt;The code to add this to your page is in my opinion reasonably simple:&lt;pre&gt;&lt;br /&gt;    &amp;lt;!-- add a placeholder for the user interface --&amp;gt;&lt;br /&gt;    &amp;lt;div id="toolbar"&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;&lt;br /&gt;    &amp;lt;!-- Include script that defines the Translator class --&amp;gt;&lt;br /&gt;    &amp;lt;script type="text/javascript" src="Translator-min.js"&amp;gt;&amp;lt;/script&amp;gt;&lt;br /&gt;    &amp;lt;!-- Instantiate a translator, have it create its gui and render to placeholder --&amp;gt;&lt;br /&gt;    &amp;lt;script type="text/javascript"&amp;gt;&lt;br /&gt;        var translator = new Translator();&lt;br /&gt;        var gui = translator.createGUI(null, "Language");&lt;br /&gt;        document.getElementById("toolbar").appendChild(gui);&lt;br /&gt;    &amp;lt;/script&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This really is all the code you need - there are no dependencies on external Javascript frameworks. If you don't need or like the gui, you can of course skip the gui placeholder code as well as the second script and interact with the &lt;code&gt;Translator&lt;/code&gt; object programmatically.&lt;br /&gt;&lt;br /&gt;The minified javascript file is about 7k, which is not too bad in my opinion. I haven't worried too much about optimizations, and I think it should be possible to cut down on code size.&lt;br /&gt;&lt;br /&gt;Another thing I haven't focused on just now is integration with frameworks - on the contrary, I made sure you can use it standalone. But in order to do that, I had to write a few methods to facilitate DOM manipulation and JSON parsing, and it's almost certain you will find that functions like that are already in your framework. &lt;br /&gt;&lt;br /&gt;Anyway, readers, I'd like to hear from you...is this a useful feature on this blog? 
Would you like to use it on your own blog? If there's enough people that want it, I will make it available on google code or something like that.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-3222850260591564167?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/3222850260591564167/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=3222850260591564167' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3222850260591564167'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3222850260591564167'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/09/roland-boumans-blog-goes-i18n-powered.html' title='Roland Bouman&apos;s blog goes i18n (Powered by Google Translate)'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5688946370954625167</id><published>2009-08-22T00:40:00.003+02:00</published><updated>2009-08-22T01:05:56.262+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Pentaho Solutions&quot;'/><title type='text'>"Pentaho Solutions": copies hit the mail</title><content type='html'>Hi! &lt;br /&gt;&lt;br /&gt;Just a few hours ago, I arrived home after a very quiet and peaceful two-week holiday with my family. It was great! 
I deliberately didn't bring a computer. I brought a mobile phone, but on purpose didn't answer it either :) Result: absolute relaxation, with lots of time to hike, cycle, and read, and occasional visits to museums and historic sites. Bliss :)&lt;br /&gt;&lt;br /&gt;Anyway, now that the bags are unpacked, and the kids are asleep, it's time to face the dragon better known as my inbox. What I found brought a big smile to my face:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3844027690/" title="Pentaho corp. posing with a copy of &amp;quot;Pentaho Solutions&amp;quot; by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3467/3844027690_3bc1f8f42c_o.jpg" width="800" height="600" alt="Pentaho corp. posing with a copy of &amp;quot;Pentaho Solutions&amp;quot;" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Yes - it's true!! Copies of my and &lt;a href="http://www.tholis.com/" target="_jos"&gt;Jos'&lt;/a&gt; book &lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html"&gt;Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL&lt;/a&gt; have hit the mail, and at least one copy has reached the Pentaho office. &lt;br /&gt;&lt;br /&gt;Pentaho-ers, thanks for your kind email, and thanks for &lt;a href="http://www.pentaho.com/" target="_pentaho"&gt;a great product&lt;/a&gt;! &lt;br /&gt;&lt;br /&gt;I haven't received one myself (yet), and it will probably take some time still to ship the books to Europe. But it's certainly good to see physical proof of our work. &lt;br /&gt;&lt;br /&gt;Anyway - if you are expecting a copy of the book because you pre-ordered one, or if Jos or I promised you a copy, fear not, it should be heading your way. 
I hope you like it - Enjoy :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5688946370954625167?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5688946370954625167/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5688946370954625167' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5688946370954625167'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5688946370954625167'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/08/pentaho-solutions-copies-hit-mail.html' title='&quot;Pentaho Solutions&quot;: copies hit the mail'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5741515368545511773</id><published>2009-07-24T01:16:00.003+02:00</published><updated>2009-07-24T01:34:14.656+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OSCON 2009'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='presentation'/><title type='text'>OSCON 2009 Presentation "Taming your Data: Practical Data Integration Solutions with Kettle" now online</title><content type='html'>I just delivered &lt;a href="http://en.oreilly.com/oscon2009/public/schedule/detail/7905" 
target="_oscon"&gt;my OSCON 2009 presentation&lt;/a&gt; "Taming your data: Practical Data Integration Solutions with Kettle". &lt;br /&gt;&lt;br /&gt;&lt;img src="http://farm4.static.flickr.com/3280/2805545289_ee65bab385.jpg"/&gt;&lt;br /&gt;&lt;br /&gt;I think it went pretty well, and I got good responses from the audience. I did have much more material than time, and I probably should have proposed to do a tutorial instead. Maybe next year :)&lt;br /&gt;&lt;br /&gt;Anyway, you can find the presentation and the examples on the &lt;a href="http://en.oreilly.com/oscon2009/public/schedule/detail/7905" target="_oscon"&gt;OSCON 2009 website&lt;/a&gt;. The &lt;a href="http://assets.en.oreilly.com/1/event/27/Taming%20Your%20Data_%20Practical%20Data%20Integration%20Solutions%20with%20Kettle%20Presentation%201.pdf" target="_oscon"&gt;slides&lt;/a&gt; are available in pdf format. There's also &lt;a href="http://assets.en.oreilly.com/1/event/27/Taming%20Your%20Data_%20Practical%20Data%20Integration%20Solutions%20with%20Kettle%20Presentation.zip" target="_oscon"&gt;a zip file&lt;/a&gt; that contains the presentation as well as the kettle sample transformations and jobs.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5741515368545511773?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5741515368545511773/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5741515368545511773' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5741515368545511773'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5741515368545511773'/><link rel='alternate' type='text/html' 
href='http://rpbouman.blogspot.com/2009/07/oscon-2009-presentation-taming-your.html' title='OSCON 2009 Presentation &quot;Taming your Data: Practical Data Integration Solutions with Kettle&quot; now online'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3280/2805545289_ee65bab385_t.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-7117471596638400392</id><published>2009-07-18T01:53:00.005+02:00</published><updated>2009-07-18T02:24:50.255+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='OSCON 2009'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Pentaho Solutions&quot;'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>My OSCON 2009 Session: Taming your Data...</title><content type='html'>Yes! &lt;br /&gt;&lt;br /&gt;Finally, it's here: in a few hours, I will be flying off to San Francisco to attend &lt;a href="http://en.oreilly.com/oscon2009" target="_oscon"&gt;OSCON 2009&lt;/a&gt; in San Jose, California. This is the first time I'm attending, and I'm tremendously excited to be there! The sessions look very promising, and I'm looking forward to seeing some excellent speakers. I expect to learn a lot.&lt;br /&gt;&lt;br /&gt;I'm also very proud and feel honoured to have the chance to deliver a session myself. 
It's called &lt;a href="http://en.oreilly.com/oscon2009/public/schedule/detail/7905" target="_href"&gt;Taming Your Data: Practical Data Integration Solutions with Kettle&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Unsurprisingly, I will be talking a lot about &lt;a href="http://kettle.pentaho.org/" target="_pentaho"&gt;Kettle, a.k.a. Pentaho Data Integration&lt;/a&gt;. Recently, I &lt;a href="http://www.mysqlconf.com/mysql2009/public/schedule/detail/7016"&gt;talked about Kettle too at the MySQL user's conference&lt;/a&gt;, and more recently, at &lt;a href="http://forge.mysql.com/wiki/Starring_Sakila_-_A_Data_Warehouse_Mini-Tutorial"&gt;a MySQL University session&lt;/a&gt;. Those sessions were focused mainly on how Kettle can help you load a data warehouse. &lt;br /&gt;&lt;br /&gt;But...there's much more to this tool than just data warehousing, and in this session, I will be exploring rougher ground, like making sense of raw &lt;a href="http://www.imdb.com/interfaces#plain" target="_imdb"&gt;imdb text files&lt;/a&gt;, loading and generating XML, clustering and more. This session will also be much more of a hands-on demonstration than the Sakila sessions. If you're interested and you are also attending, don't hesitate to drop by! I'm looking forward to meeting you :)&lt;br /&gt;&lt;br /&gt;And...because the topic of the session kind of relates &lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html"&gt;to my upcoming book, "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL"&lt;/a&gt; (ISBN: 978-0-470-48432-6, 600+ pages, list price $50.00), my publisher Wiley decided to throw in a little extra. Yup, that's right - I've got discount coupons for the book, so if you are interested in picking up a copy, or if you just want to give one away to a friend or colleague, come find me at my session (or somewhere else at OSCON) and I'll make sure you get one. 
Thanks Wiley!!&lt;br /&gt;&lt;br /&gt;Anyway - I'm hoping to meet you there: see you soon!!!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-7117471596638400392?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/7117471596638400392/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=7117471596638400392' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7117471596638400392'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/7117471596638400392'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/07/my-oscon-2009-session-taming-your-data.html' title='My OSCON 2009 Session: Taming your Data...'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5373570750250527867</id><published>2009-07-14T01:07:00.004+02:00</published><updated>2009-07-14T02:31:44.834+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='xsltproc'/><category scheme='http://www.blogger.com/atom/ns#' term='IE8'/><category scheme='http://www.blogger.com/atom/ns#' term='Safari'/><category scheme='http://www.blogger.com/atom/ns#' term='CSS'/><category scheme='http://www.blogger.com/atom/ns#' term='Chrome'/><category scheme='http://www.blogger.com/atom/ns#' term='MS Project'/><category 
scheme='http://www.blogger.com/atom/ns#' term='HTML'/><category scheme='http://www.blogger.com/atom/ns#' term='XSLT'/><category scheme='http://www.blogger.com/atom/ns#' term='Firefox'/><title type='text'>open-msp-viewer: Free XSLT utilities to render MS Project files as HTML web pages</title><content type='html'>For my day job, I've been working on a few things that allow you to render &lt;a href="http://msdn.microsoft.com/en-us/library/aa167886(office.11).aspx" target="_msp2003"&gt;Microsoft Project 2003&lt;/a&gt; projects on a web page. &lt;br /&gt;&lt;br /&gt;The code I wrote for my work is proprietary, and probably not directly useful for most people. But I figured that at least some of the work might be useful for others, so I wrote an open source version from scratch and I published that as the &lt;a href="http://code.google.com/p/open-msp-viewer/" target="_omspv"&gt;open-msp-viewer&lt;/a&gt; project on google code. If you like, &lt;a href="http://code.google.com/p/open-msp-viewer/source/checkout" target="omspv"&gt;check out the code&lt;/a&gt; and give it a spin.&lt;br /&gt;&lt;br /&gt;It works by first saving the project in the &lt;a href="http://msdn.microsoft.com/en-us/library/aa679870(office.11).aspx" target="_ms"&gt;MS Project XML format&lt;/a&gt; using standard MS Project functionality (Menu \ Save As..., then pick .XML) and then applying an &lt;a href="http://www.w3.org/TR/xslt" target="_w3c"&gt;XSLT transformation&lt;/a&gt; to generate HTML.&lt;br /&gt;&lt;br /&gt;Currently, the project includes an xslt stylesheet that renders MS Project XML files as a Gantt chart. 
To give you a quick idea, take a look at these screenshots:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3718062747/" title="msp1 by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3461/3718062747_143aeb3354_b.jpg" width="1024" height="640" alt="msp1" /&gt;&lt;/a&gt;&lt;br /&gt;and&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3718062777/" title="msp2 by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3431/3718062777_2d2696f7bd_b.jpg" width="1024" height="640" alt="msp2" /&gt;&lt;/a&gt;&lt;br /&gt;The web Gantt chart is rendered using an HTML 4.01 variant and CSS 2.1, and uses JavaScript to let the user collapse and expand individual tasks in the work breakdown structure. Currently, the HTML does not validate due to a few custom attributes I introduced to support dynamically collapsing and expanding the chart with JavaScript. In addition, the XSLT transformation introduces the msp namespace into the result document, which results in a validation error.&lt;br /&gt;&lt;br /&gt;You can either &lt;a href="http://kent.w3.org/TR/xml-stylesheet/"&gt;associate the XSLT stylesheet directly with the MS Project XML file&lt;/a&gt;, or you can use &lt;a href="http://xmlsoft.org/XSLT/xsltproc2.html" target="_xsltproc"&gt;an external tool like xsltproc&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;In the &lt;code&gt;trunk/xml&lt;/code&gt; subdirectory, you can find a couple of sample projects in XML format that already have the stylesheet association. I have tested these in IE8, Chrome 2, Safari 4 and Firefox 3.5, and they work well in all of these browsers. In the &lt;code&gt;trunk/html&lt;/code&gt; directory, you'll find HTML output as created by xsltproc.&lt;br /&gt;&lt;br /&gt;In the future, more XSLT stylesheets may be added to support alternative views. 
Things that I think I will add soon are a resources list and a calendar view.&lt;br /&gt;&lt;br /&gt;Enjoy, and let me know if you find a bug or would like to contribute.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5373570750250527867?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5373570750250527867/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5373570750250527867' title='16 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5373570750250527867'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5373570750250527867'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/07/open-msp-viewer-free-xslt-utilities-to.html' title='open-msp-viewer: Free XSLT utilities to render MS Project files as HTML web pages'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3461/3718062747_143aeb3354_t.jpg' height='72' width='72'/><thr:total>16</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-4351089717211342195</id><published>2009-07-12T12:55:00.002+02:00</published><updated>2009-07-12T13:01:20.836+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='IE8'/><category scheme='http://www.blogger.com/atom/ns#' term='Malware'/><category scheme='http://www.blogger.com/atom/ns#' term='Virus'/><category 
scheme='http://www.blogger.com/atom/ns#' term='hacked'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows Update'/><category scheme='http://www.blogger.com/atom/ns#' term='Apple'/><title type='text'>WTF? Apple favicon on the Windows update site?</title><content type='html'>I just noticed the weirdest thing while visiting the Windows Update site:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3712786138/" title="windows-apple by roland.bouman, on Flickr"&gt;&lt;img src="http://farm3.static.flickr.com/2487/3712786138_53875f885a_o.png" width="859" height="720" alt="windows-apple" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;As you can see, the shortcut icon is actually an Apple icon...WTF?!! I looked at the page source, and this is the markup that sets the favicon: &lt;pre&gt;&amp;lt;link rel='shortcut icon' href='shared/images/banners/favicon.ico' type='image/x-icon'/&amp;gt;&lt;/pre&gt; When I navigate directly to &lt;code&gt;&lt;a href="http://update.microsoft.com/windowsupdate/v6/shared/images/banners/favicon.ico"&gt;http://update.microsoft.com/windowsupdate/v6/shared/images/banners/favicon.ico&lt;/a&gt;&lt;/code&gt; I get the normal icon, that is to say, this: &lt;img src="http://update.microsoft.com/windowsupdate/v6/shared/images/banners/favicon.ico"/&gt;. When I look with IE6, I actually see that icon as the favicon...so the question is, what's up with the Apple icon in IE8?! Do I have some virus or malware that has modified IE8? I googled for the problem, but couldn't find any references...am I the only one? 
Anybody else experiencing this?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-4351089717211342195?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/4351089717211342195/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=4351089717211342195' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/4351089717211342195'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/4351089717211342195'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/07/wtf-apple-favicon-on-windows-update.html' title='WTF? Apple favicon on the Windows update site?'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-28548798604876459</id><published>2009-07-03T09:13:00.002+02:00</published><updated>2009-07-03T09:37:43.613+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Data  warehousing'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL Forge'/><category scheme='http://www.blogger.com/atom/ns#' term='Mondrian'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='MySQL University'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho 
data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='OLAP'/><title type='text'>Starring Sakila: MySQL university recording, slides and materials available onMySQLForge</title><content type='html'>Hi!&lt;br /&gt;&lt;br /&gt;Yesterday I had the honour of presenting my mini BI/data warehousing tutorial "Starring Sakila" for MySQL University. I did a modified version of the &lt;a href="http://www.mysqlconf.com/mysql2009/public/schedule/detail/7016" target="_mysqluc"&gt;presentation I did together with Matt Casters at the MySQL user's conference 2009&lt;/a&gt;. The structure of the presentation is still largely the same, although I condensed various bits, and I added practical examples of setting up the ETL process and creating a Pentaho Analysis View (OLAP pivot table) on top of a Mondrian Cube.&lt;br /&gt;&lt;br /&gt;The slides, session recording, and materials such as the SQL script, Pentaho Data Integration jobs and transformations, and the Sakila Rentals Cube for Mondrian are all &lt;a href="http://forge.mysql.com/wiki/Starring_Sakila_-_A_Data_Warehouse_Mini-Tutorial" target="_mysqlforge"&gt;available here on MySQL Forge&lt;/a&gt;.&lt;br /&gt;&lt;h3&gt;Copyright Notice&lt;/h3&gt;&lt;br /&gt;The presentation slides, and materials such as the SQL script, Pentaho Data Integration jobs and transformations, and the Sakila Rentals Cube for Mondrian are all Copyright Roland Bouman. Feel free to download and learn from them. But please do not distribute the materials yourself - instead, point people to the wiki page to get their own copy of the materials. Personal use of the files is allowed. Using these materials to create training materials, or using them as training materials, is explicitly not allowed without prior written consent. 
(Just mail me at roland dot bouman at gmail dot com if you would like to use the materials for such purposes, and we can work something out.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-28548798604876459?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/28548798604876459/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=28548798604876459' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/28548798604876459'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/28548798604876459'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/07/starring-sakila-mysql-university.html' title='Starring Sakila: MySQL university recording, slides and materials available onMySQLForge'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-1636546707320109072</id><published>2009-06-18T01:00:00.000+02:00</published><updated>2009-06-18T01:00:01.000+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='stored routine'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>MySQL Stored Functions: Impact of DECLARE HANDLER on Performance</title><content type='html'>Hi again!&lt;br /&gt;&lt;br /&gt;Just a moment ago, I read &lt;a 
href="http://blogs.mysql.com/peterg/2009/06/17/get-the-error-return-value-in-a-variable/" target="_pg"&gt;this post&lt;/a&gt; by &lt;a href="http://blogs.mysql.com/peterg" target="_pg"&gt;Peter Gulutzan&lt;/a&gt;. In this post, Peter explains a little trick that allows you to capture the SQL state in a variable whenever an error occurs in your &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/stored-routines.html" target="_mysql"&gt;MySQL stored routine&lt;/a&gt; code.&lt;h3&gt;MySQL &lt;code&gt;CONDITION&lt;/code&gt;s and &lt;code&gt;HANDLER&lt;/code&gt;s&lt;/h3&gt;For the uninitiated: in MySQL stored routines, you can declare &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.0/en/declare-handler.html" target="_mysql"&gt;HANDLER&lt;/a&gt;&lt;/code&gt;s which are pieces of code that are executed only in case a particular &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.0/en/declare-condition.html" target="_mysql"&gt;CONDITION&lt;/a&gt;&lt;/code&gt; occurs. This device serves the same purpose as a &lt;code&gt;try ... catch&lt;/code&gt; block which is supported in many popular programming languages like Java, JavaScript, C++, PHP5 and C#. &lt;br /&gt;&lt;br /&gt;Now, one of the &lt;a href="http://bugs.mysql.com/bug.php?id=11661" target="_mysql"&gt;long-standing problems&lt;/a&gt; with MySQL &lt;code&gt;HANDLER&lt;/code&gt;s - the inability to explicitly raise a CONDITION  - has recently been solved by implementing the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/6.0/en/signal.html" target="_mysql"&gt;SIGNAL&lt;/a&gt;&lt;/code&gt; syntax. The most important issue that remains is the &lt;a href="http://bugs.mysql.com/bug.php?id=11660" target="_mysql"&gt;inability to discover exactly which error occurred&lt;/a&gt;. 
You need this especially when you are writing a generic 'catch all' &lt;code&gt;HANDLER&lt;/code&gt; (for example, one that is triggered in response to &lt;code&gt;SQLEXCEPTION&lt;/code&gt;) and you want to write information regarding the error to a log.&lt;h3&gt;Peter's Trick to Capture SQLSTATE&lt;/h3&gt;To cut a long story short, Peter's solution is based on writing a &lt;code&gt;HANDLER&lt;/code&gt; for all known &lt;code&gt;CONDITION&lt;/code&gt;s in advance. Here's a fragment of his code to explain:&lt;pre&gt;BEGIN&lt;br /&gt;  ...&lt;br /&gt;  DECLARE EXIT HANDLER FOR SQLSTATE '01000' BEGIN SET @e='01000'; RESIGNAL; END;&lt;br /&gt;  ...&lt;br /&gt;  DECLARE EXIT HANDLER FOR SQLSTATE 'XAE09' BEGIN SET @e='XAE09'; RESIGNAL; END;&lt;br /&gt;&lt;br /&gt;  ...remainder of code...&lt;br /&gt;&lt;br /&gt;END;&lt;/pre&gt;As Peter points out, it's tedious, but it works. (That is to say, it works better than not having anything.)&lt;h3&gt;Performance?&lt;/h3&gt;Now, one particular paragraph in Peter's post caught my eye:&lt;blockquote&gt;I added 38 &lt;code&gt;DECLARE EXIT HANDLER&lt;/code&gt; statements at the start of my procedure, just after the variable declarations. These lines are always the same for any procedure. They’re not executed unless an error happens so I don’t worry about speed.&lt;/blockquote&gt;I respect Peter a great deal - if he's got something to say, you do well to listen and take his word for it. However, this time I was curious to find out if I could measure the effect of the &lt;code&gt;HANDLER&lt;/code&gt; code at all.&lt;h3&gt;Method&lt;/h3&gt;The code I used is a simplification of Peter's code. I tested it on MySQL 5.1 because I was just interested in the impact of a &lt;code&gt;DECLARE HANDLER&lt;/code&gt; statement. In my case, the &lt;code&gt;BEGIN...END&lt;/code&gt; block of the handler does not contain the &lt;code&gt;RESIGNAL&lt;/code&gt; statement, and my function does not drop a table but simply returns 1. 
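&lt;br /&gt;&lt;br /&gt;For illustration, the variant with two handlers could look roughly like the sketch below. The function name &lt;code&gt;f_handlers_2&lt;/code&gt; and the two SQLSTATE values are assumptions made for this example, not the code that was actually measured; the larger variants would simply repeat the &lt;code&gt;DECLARE&lt;/code&gt; line with other SQLSTATE values:&lt;pre&gt;-- Hypothetical test function: two handlers that are declared but never triggered&lt;br /&gt;DELIMITER //&lt;br /&gt;CREATE FUNCTION f_handlers_2() RETURNS INT&lt;br /&gt;DETERMINISTIC NO SQL&lt;br /&gt;BEGIN&lt;br /&gt;  -- no RESIGNAL here: each handler body only records its SQLSTATE&lt;br /&gt;  DECLARE EXIT HANDLER FOR SQLSTATE '01000' BEGIN SET @e='01000'; END;&lt;br /&gt;  DECLARE EXIT HANDLER FOR SQLSTATE 'XAE09' BEGIN SET @e='XAE09'; END;&lt;br /&gt;  RETURN 1;  -- nothing can fail here, so no handler ever fires&lt;br /&gt;END//&lt;br /&gt;DELIMITER ;&lt;/pre&gt;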
This is important, as none of the &lt;code&gt;HANDLER&lt;/code&gt;s is ever triggered by my code.&lt;br /&gt;&lt;br /&gt;Last week I wrote how seemingly small changes in MySQL stored routine code can have a surprisingly large impact on performance. In that particular case, I already had a hunch about which things could be improved. In this case, I just didn't know so I created a series of functions with 2, 4, 8, 16, 32 and 38 &lt;code&gt;DECLARE HANDLER&lt;/code&gt; statements, and again I used the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function_benchmark" target="_mysql"&gt;BENCHMARK()&lt;/a&gt;&lt;/code&gt; function to measure the time it takes to execute it 100,000 times. I did warm-up runs, and repeated the measurement 5 times for each function variant.&lt;h3&gt;Results&lt;/h3&gt;The graph below summarizes my observations:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3636702900/" title="handler by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3626/3636702900_b1bfba3d2f_o.png" width="405" height="211" alt="handler" /&gt;&lt;/a&gt;&lt;br /&gt;The squares in the graph are the measurements - each one represents a version of the function. Along the horizontal axis you see the number of &lt;code&gt;DECLARE HANDLER&lt;/code&gt; statements in that particular version of the function. The number of seconds it took to execute the function 100,000 times on my laptop using &lt;code&gt;BENCHMARK()&lt;/code&gt; is on the vertical axis.&lt;br /&gt;As you can see, there seems to be a linear relationship between the number of &lt;code&gt;DECLARE HANDLER&lt;/code&gt; statements and the time it takes to execute the function. This is in itself a surprise. I mean, I would expect a little overhead per &lt;code&gt;DECLARE&lt;/code&gt; statement when the function is compiled initially. 
After that, it is cached at the session level, and beyond that point I would not expect statements that do not execute to have any impact. &lt;br /&gt;&lt;br /&gt;So how badly do the &lt;code&gt;HANDLER&lt;/code&gt; declarations slow our function down? Well, I measured an average of 0.38 seconds for 2, and an average of 0.55 seconds for 38 &lt;code&gt;DECLARE HANDLER&lt;/code&gt; statements respectively. The difference, 0.18 seconds, is a little less than 50% of the time measured for the function variant with 2 &lt;code&gt;DECLARE HANDLER&lt;/code&gt; statements, and a little more than 30% of the time measured for the function with 38 &lt;code&gt;DECLARE HANDLER&lt;/code&gt; statements. &lt;h3&gt;Conclusion&lt;/h3&gt;To be fair, the function I tested doesn't actually do anything, and if your function or stored procedure does some real processing, the overhead may be negligible. However, you can clearly see that even just declaring a handler has a measurable negative impact on performance. The essence of Peter's trick is to always write a &lt;code&gt;DECLARE HANDLER&lt;/code&gt; for each possible condition and to do this for each stored routine. You will certainly suffer a performance hit for small functions, especially 
if they get called a lot.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-1636546707320109072?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/1636546707320109072/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=1636546707320109072' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/1636546707320109072'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/1636546707320109072'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/06/mysql-stored-functions-impact-of.html' title='MySQL Stored Functions: Impact of DECLARE HANDLER on Performance'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5794785261579691520</id><published>2009-06-10T20:45:00.004+02:00</published><updated>2009-06-11T10:26:08.792+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Stored function'/><category scheme='http://www.blogger.com/atom/ns#' term='Refactoring'/><category scheme='http://www.blogger.com/atom/ns#' term='perfomance'/><title type='text'>MySQL: Refactoring a Stored Function</title><content type='html'>Hi All!&lt;br /&gt;&lt;br /&gt;I was just reading &lt;a href="http://planet.mysql.com/" target="_planet"&gt;PlanetMySQL&lt;/a&gt; and noticed &lt;a 
href="http://www.mikehillyer.com/" target="_mh"&gt;Mike Hillyer&lt;/a&gt;'s recent post on &lt;a href="http://www.mikehillyer.com/mysql/user-friendly-age-function-for-mysql/" target="_mh"&gt;a user-friendly age function&lt;/a&gt; for MySQL. Basically, this function accepts two &lt;code&gt;DATETIME&lt;/code&gt; values and returns an indication of the time between the two dates in the form of a human-readable string. For example:&lt;pre&gt;mysql&amp;gt; select  TimeDiffUnits('2001-05-01', '2002-01-01')&lt;br /&gt;+-------------------------------------------+&lt;br /&gt;| TimeDiffUnits('2001-05-01', '2002-01-01') |&lt;br /&gt;+-------------------------------------------+&lt;br /&gt;| 8 Months                                  |&lt;br /&gt;+-------------------------------------------+&lt;br /&gt;1 row in set (0.00 sec)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Just for fun, I decided to try and refactor it, and I'm writing this to share the results. Now for a little disclaimer. The purpose of this post is not to bash Mike or belittle his code: I consider him a friend, and I respect him and his skills a lot. The point is to show how seemingly small differences in MySQL stored function code can have quite an impact on performance. The good news is that there is a rationale behind all this - I did not refactor based on trial and error. I hope I can shed some light on that in the remainder of this post when discussing the individual improvements.&lt;br /&gt;&lt;h3&gt;Summary&lt;/h3&gt;&lt;br /&gt;I changed three pieces of code in Mike's original function. 
Each of these changes helps to increase performance just a little bit, and because none of the changes alters the overall structure of the original code, I'm inclined to consider them improvements. I'm including a graph here:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3614352632/" title="refactor by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3634/3614352632_1e0020d9b7_o.png" width="311" height="271" alt="refactor" /&gt;&lt;/a&gt;&lt;br /&gt;The Y-axis of the graph represents the average (N=5) number of seconds it took to run &lt;pre&gt;SELECT BENCHMARK(100000, TimeDiffUnits('2001-01-01', '2002-01-01'))&lt;/pre&gt;The original function is at the left, and each bar to its right represents one step of refactoring. All in all, at the far right, you can see that the final result is a function with exactly the same behavior which runs about 70% faster than the original :-). &lt;br /&gt;&lt;h3&gt;About the Measurements&lt;/h3&gt;&lt;br /&gt;I should clarify a thing or two to explain the measurements I did. &lt;br /&gt;&lt;br /&gt;First of all, &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function_benchmark" target="_mysql"&gt;&lt;code&gt;BENCHMARK()&lt;/code&gt;&lt;/a&gt; is usually frowned upon as a general benchmarking tool. 
However, in this case the measured function is completely computational in nature, and I think that it isn't too bad to get a rough idea of relative performance.&lt;br /&gt;&lt;br /&gt;Second, the actual code that is being measured, &lt;code&gt;TimeDiffUnits('2001-01-01', '2002-01-01')&lt;/code&gt; is in fact a worst-case&lt;sup&gt;*&lt;/sup&gt; scenario for this particular function, and it's quite likely that testing it with real data does not automatically result in 70% performance increase.&lt;br /&gt;&lt;br /&gt;(&lt;sup&gt;*&lt;/sup&gt;: It is quite literally a worst-case scenario because the input values are such that the &lt;code&gt;CASE&lt;/code&gt;-statement used within the function has to mull through all branches before finding a matching branch)&lt;br /&gt;&lt;br /&gt;Finally I should point out that the vast majority of database performance problems have to do with IO or resource contention, and are not computational in nature. So chances are that you can use none of the information in the remainder of this post to improve your performance problem. (You can always stop reading now of course ;-)&lt;br /&gt;&lt;br /&gt;That said, if you use MySQL stored functions, it can't hurt to be aware of their performance issues, and it is not too hard to make a habit of writing the fastest possible code. 
In many cases, you will find that the fastest solution is also the cleanest, shortest, and most maintainable one.&lt;br /&gt;&lt;br /&gt;Now, without further ado - how I refactored that function...&lt;br /&gt;&lt;h3&gt;Original Function&lt;/h3&gt;&lt;br /&gt;First, meet the &lt;a href="http://www.mikehillyer.com/mysql/user-friendly-age-function-for-mysql/" target="_mike"&gt;original function&lt;/a&gt; as it appeared on Mike's blog.&lt;br /&gt;&lt;pre&gt;CREATE FUNCTION TimeDiffUnits (old DATETIME, new DATETIME)RETURNS CHAR(50) &lt;br /&gt;DETERMINISTIC NO SQL&lt;br /&gt;BEGIN&lt;br /&gt;  DECLARE diff INTEGER;&lt;br /&gt;  SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);&lt;br /&gt;  CASE&lt;br /&gt;    WHEN (diff &amp;lt; 3600)     THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');&lt;br /&gt;    WHEN (diff &amp;lt; 86400)    THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');&lt;br /&gt;    WHEN (diff &amp;lt; 604800)   THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');&lt;br /&gt;    WHEN (diff &amp;lt; 2592000)  THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');&lt;br /&gt;    WHEN (diff &amp;lt; 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');&lt;br /&gt;    ELSE                        RETURN CONCAT(FLOOR(diff / 31536000), ' Years');&lt;br /&gt;  END CASE;&lt;br /&gt;END;&lt;/pre&gt;This is pretty straight-forward:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;First, the time between the input datetime values is computed as a number of seconds. This is done by converting the input values to a number of seconds using the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_unix-timestamp" target="_mysql"&gt;UNIX_TIMESTAMP&lt;/a&gt;&lt;/code&gt; function. Then, the value for the old datetime is subtracted from the new datetime. 
The result is assigned to the &lt;code&gt;diff&lt;/code&gt; variable where it is stored for later use.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Second, a &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/case-statement.html" target="_mysql"&gt;searched &lt;code&gt;CASE&lt;/code&gt;-statement&lt;/a&gt; is used to determine the order of magnitude: Minutes, Hours, and so on up to Years. For example, if the number of seconds is less than 3600 (which is an hour, 60 seconds times 60 minutes), the first &lt;code&gt;WHEN&lt;/code&gt; branch is entered to calculate a number of minutes.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Third, the &lt;code&gt;WHEN&lt;/code&gt; branches of the &lt;code&gt;CASE&lt;/code&gt;-statement calculate how many of the selected units (minutes, hours, etc.) fit into the elapsed time calculated in step 1. Using &lt;code&gt;CONCAT()&lt;/code&gt;, this number is turned into a nice human-readable string, which is immediately returned from the function. &lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;On my laptop, &lt;code&gt;BENCHMARK(100000, TimeDiffUnits('2001-01-01', '2002-01-01'))&lt;/code&gt; takes between 2.91 and 3.05 seconds.&lt;br /&gt;&lt;h3&gt;Step 1: Using &lt;code&gt;DEFAULT&lt;/code&gt; instead of &lt;code&gt;SET&lt;/code&gt;&lt;/h3&gt;The first thing I did was to get rid of the assignment to &lt;code&gt;diff&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;&lt;span style="color:red;text-decoration:line-through"&gt;SET diff = UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;Instead, I used the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/declare-local-variable.html" target="_mysql"&gt;&lt;code&gt;DECLARE&lt;/code&gt;&lt;/a&gt; statement of &lt;code&gt;diff&lt;/code&gt; to assign the elapsed time right away using the &lt;code&gt;DEFAULT&lt;/code&gt; clause:&lt;pre&gt;DECLARE diff INTEGER &lt;span style="color:blue; font-weight:bold"&gt;DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);&lt;/span&gt;&lt;/pre&gt;For completeness, here's the modified 
function:&lt;pre&gt;CREATE FUNCTION TimeDiffUnits1 (old DATETIME, new DATETIME) RETURNS CHAR(50)&lt;br /&gt;DETERMINISTIC NO SQL&lt;br /&gt;BEGIN&lt;br /&gt;  DECLARE diff INTEGER &lt;span style="font-weight:bold"&gt;DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);&lt;/span&gt;&lt;br /&gt;  CASE&lt;br /&gt;    WHEN (diff &amp;lt; 3600)     THEN RETURN CONCAT(FLOOR(diff / 60) , ' Minutes');&lt;br /&gt;    WHEN (diff &amp;lt; 86400)    THEN RETURN CONCAT(FLOOR(diff / 3600), ' Hours');&lt;br /&gt;    WHEN (diff &amp;lt; 604800)   THEN RETURN CONCAT(FLOOR(diff / 86400), ' Days');&lt;br /&gt;    WHEN (diff &amp;lt; 2592000)  THEN RETURN CONCAT(FLOOR(diff / 604800), ' Weeks');&lt;br /&gt;    WHEN (diff &amp;lt; 31536000) THEN RETURN CONCAT(FLOOR(diff / 2592000), ' Months');&lt;br /&gt;    ELSE                        RETURN CONCAT(FLOOR(diff / 31536000), ' Years');&lt;br /&gt;  END CASE;&lt;br /&gt;END;&lt;/pre&gt;This is not a big deal, right? Well, it isn't. I mean, this is certainly not a big change and I think the code is still just as clear. Did this gain us some performance? Well, just a bit. For me, &lt;code&gt;SELECT BENCHMARK(100000, TimeDiffUnits1('2001-01-01', '2002-01-01'));&lt;/code&gt; takes between 2.89 and 2.98 seconds to run. This is about 2% better than the original. Admittedly, next to nothing, but considering that we casually eliminated only one assignment, I think it is rather good! &lt;br /&gt;&lt;br /&gt;Take-away: Don't assign when you don't have to. Each local variable declaration is an implicit assignment - use it if you can.&lt;br /&gt;&lt;h3&gt;Step 2: Using &lt;code&gt;DIV&lt;/code&gt; instead of float division and &lt;code&gt;FLOOR()&lt;/code&gt;&lt;/h3&gt;The second change I introduced is a bit larger than the previous one. To compute the number of elapsed units, the original code uses &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/arithmetic-functions.html#operator_divide" target="_mysql"&gt;the division operator (&lt;code&gt;/&lt;/code&gt;)&lt;/a&gt;. 
This yields a fractional result (for integer operands, MySQL computes it as an exact-value DECIMAL rather than a true floating point number), and to get a nice integer result, the division is wrapped inside the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mathematical-functions.html#function_floor" target="_mysql"&gt;&lt;code&gt;FLOOR()&lt;/code&gt;&lt;/a&gt; function. In fact, this is a pattern that I have observed earlier in other code (yes, I'm guilty too :(), and &lt;a href="http://rpbouman.blogspot.com/2008/07/mysql-divide-and-conquer.html" target="_rpb"&gt;I wrote about it&lt;/a&gt; in the past. &lt;br /&gt;&lt;br /&gt;As it turns out, we don't need the division operator to perform division. At least, not in this case. MySQL provides the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/arithmetic-functions.html#operator_div" target="_mysql"&gt;&lt;code&gt;DIV&lt;/code&gt; operator&lt;/a&gt;, which is designed to perform integer division. This is great news for two reasons:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;No time is wasted converting the numbers to non-integer values to perform the calculation.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Because the result of &lt;code&gt;DIV&lt;/code&gt; is also an integer, we don't need &lt;code&gt;FLOOR&lt;/code&gt; to convert back to integer again.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;So, for example, this:&lt;br /&gt;&lt;pre&gt;WHEN (diff &amp;lt; 31536000) THEN RETURN CONCAT(&lt;span style="color:red;text-decoration:line-through;"&gt;FLOOR(diff / 2592000)&lt;/span&gt;, ' Months');&lt;/pre&gt;can be rewritten as &lt;pre&gt;WHEN (diff &amp;lt; 31536000) THEN RETURN CONCAT(&lt;span style="color:blue;font-weight:bold"&gt;diff DIV 2592000&lt;/span&gt;, ' Months');&lt;br /&gt;&lt;/pre&gt;This should be faster, and it's less code too. 
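&lt;br /&gt;&lt;br /&gt;A quick way to check that the two forms agree is to evaluate them side by side. This is only a sketch: the value 31535999 (one second short of 365 days) is an arbitrary example and not part of the original function:&lt;pre&gt;SELECT 31535999 / 2592000        AS with_division,  -- fractional result&lt;br /&gt;       FLOOR(31535999 / 2592000) AS with_floor,     -- fraction removed by FLOOR()&lt;br /&gt;       31535999 DIV 2592000      AS with_div;       -- integer division, no FLOOR() needed&lt;/pre&gt;Both &lt;code&gt;with_floor&lt;/code&gt; and &lt;code&gt;with_div&lt;/code&gt; come out as 12. Note that the two forms only agree for non-negative values: &lt;code&gt;FLOOR()&lt;/code&gt; rounds down, whereas &lt;code&gt;DIV&lt;/code&gt; discards the fractional part, so for a negative difference the results can differ by one.&lt;br /&gt;&lt;br /&gt;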
Here's the modified function for completeness:&lt;pre&gt;CREATE FUNCTION TimeDiffUnits2 (old DATETIME, new DATETIME) RETURNS CHAR(50)&lt;br /&gt;DETERMINISTIC NO SQL&lt;br /&gt;BEGIN&lt;br /&gt;  DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);&lt;br /&gt;  CASE&lt;br /&gt;    WHEN (diff &amp;lt; 3600)     THEN RETURN CONCAT(&lt;span style="font-weight:bold"&gt;diff DIV 60&lt;/span&gt; , ' Minutes');&lt;br /&gt;    WHEN (diff &amp;lt; 86400)    THEN RETURN CONCAT(&lt;span style="font-weight:bold"&gt;diff DIV 3600&lt;/span&gt;, ' Hours');&lt;br /&gt;    WHEN (diff &amp;lt; 604800)   THEN RETURN CONCAT(&lt;span style="font-weight:bold"&gt;diff DIV 86400&lt;/span&gt;, ' Days');&lt;br /&gt;    WHEN (diff &amp;lt; 2592000)  THEN RETURN CONCAT(&lt;span style="font-weight:bold"&gt;diff DIV 604800&lt;/span&gt;, ' Weeks');&lt;br /&gt;    WHEN (diff &amp;lt; 31536000) THEN RETURN CONCAT(&lt;span style="font-weight:bold"&gt;diff DIV 2592000&lt;/span&gt;, ' Months');&lt;br /&gt;    ELSE                        RETURN CONCAT(&lt;span style="font-weight:bold"&gt;diff DIV 31536000&lt;/span&gt;, ' Years');&lt;br /&gt;  END CASE;&lt;br /&gt;END;&lt;/pre&gt;After the modification, &lt;code&gt;BENCHMARK(100000, TimeDiffUnits2('2001-01-01', '2002-01-01'))&lt;/code&gt; takes between 2.61 and 2.72 seconds to run on my laptop. This is about 11% faster than the original, and about 9% faster than my first improvement. &lt;br /&gt;&lt;br /&gt;Take-away: If you are doing division, take a minute to think about what data types you are using. Do you really need float arithmetic? If you don't, then don't use the division operator (&lt;code&gt;/&lt;/code&gt;), simply use &lt;code&gt;DIV&lt;/code&gt; instead. It may be less well known, but it is more explicit and can give you some extra performance. If you are using &lt;code&gt;FLOOR&lt;/code&gt;, ask yourself why you are throwing away fractional digits. 
There are a bunch of cases where you just need to format something that is intrinsically fractional, but if you couldn't care less about the fractional numbers, chances are you can chuck the &lt;code&gt;FLOOR&lt;/code&gt; away and simply avoid the fractional numbers by using straight integer division.&lt;br /&gt;&lt;h3&gt;Step 3: Using the CASE operator instead of the CASE statement&lt;/h3&gt;Finally, I replaced the &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/case-statement.html" target="_mysql"&gt;&lt;code&gt;CASE&lt;/code&gt;-statement&lt;/a&gt; with a &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/control-flow-functions.html#operator_case" target="_mysql"&gt;&lt;code&gt;CASE&lt;/code&gt;-operator&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Now, I have seen on numerous occasions that people are confused about &lt;code&gt;CASE&lt;/code&gt;, so here's a quick overview, just to get it completely straight:&lt;h4&gt;The &lt;code&gt;CASE&lt;/code&gt;-statement&lt;/h4&gt;&lt;ul&gt;&lt;li&gt;The &lt;code&gt;CASE&lt;/code&gt;-statement is not the same as the &lt;code&gt;CASE&lt;/code&gt;-operator.&lt;/li&gt;&lt;li&gt;The &lt;code&gt;CASE&lt;/code&gt;-statement is a &lt;em&gt;program-flow control statement&lt;/em&gt;, and is allowed in stored routines (procedures, functions, triggers and events), but not in regular SQL statements.&lt;/li&gt;&lt;li&gt;The &lt;code&gt;CASE&lt;/code&gt;-statement is used to choose between a number of alternative code paths. Each &lt;code&gt;WHEN&lt;/code&gt; branch must contain at least one statement, and may contain multiple statements. Note that statements that appear in the &lt;code&gt;WHEN&lt;/code&gt; branch of a &lt;code&gt;CASE&lt;/code&gt;-statement are always followed by a semi-colon statement terminator.&lt;/li&gt;&lt;li&gt;The &lt;code&gt;CASE&lt;/code&gt;-statement is syntactically terminated using the keywords &lt;code&gt;END CASE&lt;/code&gt;. 
Note that &lt;code&gt;END CASE&lt;/code&gt; is typically followed by a semi-colon that acts as the statement terminator for the &lt;code&gt;CASE&lt;/code&gt;-statement. (The only exception is when the &lt;code&gt;CASE&lt;/code&gt; statement appears as the top-level statement in a stored routine, in which case the semi-colon is allowed but not required.)&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;The &lt;code&gt;CASE&lt;/code&gt;-operator&lt;/h4&gt;&lt;ul&gt;&lt;li&gt;The &lt;code&gt;CASE&lt;/code&gt;-operator is not the same as the &lt;code&gt;CASE&lt;/code&gt;-statement.&lt;/li&gt;&lt;li&gt;The &lt;code&gt;CASE&lt;/code&gt;-operator (aka &lt;code&gt;CASE&lt;/code&gt;-expression) is a &lt;em&gt;value-expression&lt;/em&gt;. It is allowed in almost all places where you can use a value. So, you can use the &lt;code&gt;CASE&lt;/code&gt;-operator in regular SQL statements as well as in stored routines.&lt;/li&gt;&lt;li&gt;The &lt;code&gt;WHEN&lt;/code&gt;-branches of the &lt;code&gt;CASE&lt;/code&gt;-operator must contain a single value-expression (which may itself be composite). The &lt;code&gt;WHEN&lt;/code&gt;-branches of the &lt;code&gt;CASE&lt;/code&gt;-operator can thus not contain statements, and cannot contain multiple expressions - it just wouldn't make sense because the &lt;code&gt;CASE&lt;/code&gt;-operator evaluates to a single value, just like any other expression. Because statements are not allowed in the &lt;code&gt;WHEN&lt;/code&gt; branches of &lt;code&gt;CASE&lt;/code&gt;-operators, there are never terminating semi-colons inside these &lt;code&gt;WHEN&lt;/code&gt; branches.&lt;/li&gt;&lt;li&gt;The &lt;code&gt;CASE&lt;/code&gt;-expression is terminated with the &lt;code&gt;END&lt;/code&gt; keyword. 
Note that this is different from the terminator of the &lt;code&gt;CASE&lt;/code&gt;-statement, which is &lt;code&gt;END CASE&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Also, note that a &lt;code&gt;CASE&lt;/code&gt;-expression does not itself have a semi-colon as statement terminator for the simple reason that it is not a statement. Of course, it is possible for a &lt;code&gt;CASE&lt;/code&gt;-expression to appear at the end of a statement. In this case, there will be a semi-colon statement terminator immediately after the &lt;code&gt;END&lt;/code&gt; of the &lt;code&gt;CASE&lt;/code&gt;-expression, but it is important to realize that that semi-colon terminates the statement that contains the &lt;code&gt;CASE&lt;/code&gt;-expression - it does not terminate the &lt;code&gt;CASE&lt;/code&gt;-expression itself.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Simple &lt;code&gt;CASE&lt;/code&gt; and searched &lt;code&gt;CASE&lt;/code&gt;&lt;/h4&gt;So, I just discussed two different occurrences of &lt;code&gt;CASE&lt;/code&gt;: as statement, and as operator/expression. Now to complicate things even more, each of these can appear in two different forms, namely the &lt;em&gt;simple&lt;/em&gt; and the &lt;em&gt;searched&lt;/em&gt; &lt;code&gt;CASE&lt;/code&gt;, making up a total of 4 different &lt;code&gt;CASE&lt;/code&gt; variants. Fortunately, I wrote about the simple and searched &lt;code&gt;CASE&lt;/code&gt; earlier in &lt;a href="http://rpbouman.blogspot.com/2008/03/mysql-stored-procedures-case-syntax.html" target="_rpb"&gt;another article&lt;/a&gt;.&lt;br /&gt;&lt;h4&gt;&lt;code&gt;CASE&lt;/code&gt;-statement versus &lt;code&gt;CASE&lt;/code&gt;-expression&lt;/h4&gt;&lt;br /&gt;As you can see in the original code, Mike uses a &lt;code&gt;CASE&lt;/code&gt;-statement. Each of the &lt;code&gt;WHEN&lt;/code&gt;-branches contains a single &lt;code&gt;RETURN&lt;/code&gt; statement that passes the return value to the caller. 
With a small modification, we can rewrite this to a single &lt;code&gt;RETURN&lt;/code&gt; statement that uses a &lt;code&gt;CASE&lt;/code&gt;-expression to pick the right value. The result is shown below:&lt;pre&gt;&lt;br /&gt;CREATE FUNCTION TimeDiffUnits3(old DATETIME, new DATETIME) RETURNS char(50)&lt;br /&gt;NO SQL DETERMINISTIC&lt;br /&gt;BEGIN&lt;br /&gt;  DECLARE diff INTEGER DEFAULT UNIX_TIMESTAMP(new) - UNIX_TIMESTAMP(old);&lt;br /&gt;  &lt;span style="font-weight: bold;"&gt;RETURN CASE&lt;/span&gt;&lt;br /&gt;    WHEN (diff &lt; 3600)     THEN CONCAT(diff div 60 , ' Minutes')&lt;br /&gt;    WHEN (diff &lt; 86400)    THEN CONCAT(diff div 3600, ' Hours')&lt;br /&gt;    WHEN (diff &lt; 604800)   THEN CONCAT(diff div 86400, ' Days')&lt;br /&gt;    WHEN (diff &lt; 2592000)  THEN CONCAT(diff div 604800, ' Weeks')&lt;br /&gt;    WHEN (diff &lt; 31536000) THEN CONCAT(diff div 2592000, ' Months')&lt;br /&gt;    ELSE                        CONCAT(diff div 31536000, ' Years')&lt;br /&gt;  &lt;span style="font-weight: bold;"&gt;END&lt;/span&gt;;&lt;br /&gt;END&lt;/pre&gt;While testing, I found that &lt;code&gt;BENCHMARK(100000, TimeDiffUnits3('2001-01-01', '2002-01-01'))&lt;/code&gt; takes between 1.69 and 1.78 seconds to run. That is a 70% improvement over the original, and a 66% and 53% improvement with regard to the prior versions respectively. &lt;br /&gt;&lt;br /&gt;Personally, I am curious why this is such a big improvement, and my guess is that this can be explained by assuming that a &lt;code&gt;CASE&lt;/code&gt; statement is chopped up into many byte-code instructions which are each executed individually and sequentially, whereas the &lt;code&gt;CASE&lt;/code&gt;-operator is written as a single C-function. At any rate, I think this was rather worth it, and personally I feel the final solution is a bit cleaner than the original.&lt;br /&gt;&lt;br /&gt;Take-away: If you need some choice structure, think about what you are choosing. 
Are you deciding between different code paths, or are you picking between values? If you are picking values, simply write a &lt;code&gt;CASE&lt;/code&gt;-expression. It's more explicit, and it is a lot faster than a &lt;code&gt;CASE&lt;/code&gt;-statement. Another thing to consider:&lt;br /&gt;do you need multiple statements in the branches, or can you get by with a single statement? If you can get by with a single statement, and it is always a &lt;code&gt;RETURN&lt;/code&gt; or a &lt;code&gt;SET&lt;/code&gt;-statement, then you can rewrite it to a &lt;code&gt;CASE&lt;/code&gt;-expression.&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;&lt;br /&gt;Because the changes I made are really quite small, I think this mostly shows that MySQL stored function compilation is poorly optimized (if at all). I am not a compiler expert but my gut feeling is that most of the optimizations could have been done automatically. &lt;br /&gt;&lt;h3&gt;Finally&lt;/h3&gt;&lt;br /&gt;If you are interested in refactoring MySQL stored functions and procedures, you might also like a few other articles I wrote on the subject:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://rpbouman.blogspot.com/2006/09/refactoring-mysql-cursors.html"&gt;Refactoring Cursors&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://rpbouman.blogspot.com/2006/10/refactoring-derived-table-unionwtf.html"&gt;Refactoring &lt;code&gt;UNION&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://rpbouman.blogspot.com/2009/03/faster-mysql-database-size-google-chart.html"&gt;Refactoring information schema queries&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5794785261579691520?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5794785261579691520/comments/default' title='Post 
Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5794785261579691520' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5794785261579691520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5794785261579691520'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/06/mysql-refactoring-stored-function.html' title='MySQL: Refactoring a Stored Function'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-1587135614866889230</id><published>2009-05-26T21:46:00.003+02:00</published><updated>2009-05-26T22:03:43.165+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Open Community Camp'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>OpenCommunityCamp 2009</title><content type='html'>&lt;a href="http://opencommunitycamp.org/2009/" target="_occ"&gt;OpenCommunityCamp 2009&lt;/a&gt; is near! OpenCommunityCamp is a true Camp: there's no registration, it's completely free, and geeks and geekettes will be spending the evenings around the campfire and sleeping in tents beneath the starry skies of Oegstgeest, the Netherlands. &lt;br /&gt;&lt;br /&gt;If you can find the time, I'd be glad to &lt;a href="http://opencommunitycamp.org/2009/?q=node/36" target="_occ"&gt;meet you there&lt;/a&gt;, 26th July through 2nd August 2009. 
&lt;br /&gt;&lt;br /&gt;I'll be &lt;a href="http://opencommunitycamp.org/2009/?q=node/28" target="_occ"&gt;speaking&lt;/a&gt; on Open Source Business Intelligence, but there's &lt;a href="http://opencommunitycamp.org/2009/?q=node/15" target="_occ"&gt;a lot more going on&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Hope to see you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-1587135614866889230?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/1587135614866889230/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=1587135614866889230' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/1587135614866889230'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/1587135614866889230'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/05/opencommunitycamp-2009.html' title='OpenCommunityCamp 2009'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-3634017746737380668</id><published>2009-05-21T21:58:00.003+02:00</published><updated>2009-05-22T22:35:58.020+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Wiley'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Building Pentaho Solutions&quot;'/><category scheme='http://www.blogger.com/atom/ns#' term='Jos van 
Dongen'/><category scheme='http://www.blogger.com/atom/ns#' term='book'/><title type='text'>Completed Draft for the "Pentaho Solutions" Book</title><content type='html'>Yes! &lt;br /&gt;&lt;br /&gt;Last night, I completed the draft of &lt;a href="http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html" target="_wiley"&gt;"Pentaho Solutions"&lt;/a&gt;, which is a book I'm writing together with my friend and colleague &lt;a href="http://tholis.webnode.com/blog/" target="_href"&gt;Jos van Dongen&lt;/a&gt; for Wiley. &lt;br /&gt;&lt;br /&gt;(Actually, the full title is: "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL")&lt;br /&gt;&lt;br /&gt;Here's an overview of the contents, just to give you an idea what we have been doing:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Part I: Getting Started, Prerequisites, Installation and Configuration and Overview&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Chapter 1: Quick Start: Pentaho PCI Examples&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 2: Prerequisites&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 3: Server Installation and Configuration&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 4: The Pentaho BI Stack&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part II: Business Case, Dimensional Modeling, Data Warehouse and Data Mart Design&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Chapter 5: Example Business Case: World Class Movies&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 6: Data warehouse primer&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 7: Modeling the Business: Logical Design using Star Schemas&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 8: The Data Mart Design Process&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part III: ETL and Data Integration&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Chapter 9: Pentaho Data Integration&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 10: Designing Pentaho Data Integration Solutions&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 11: Deploying Pentaho Data Integration 
Solutions&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Part IV: Business Intelligence Applications&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Chapter 12: The Metadata Layer&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 13: Reporting using Jfree&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 14: Scheduling, Subscription and Bursting&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 15: OLAP Solutions using Pentaho Analysis Services&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 16: Data Mining with Weka&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Chapter 17: Building Dashboards&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;So - Jos and I will be processing comments from the copy and tech editors for the remainder of this month, and then, if all goes well, this is going to result in a book of about 550 pages, which should be available in the first week of September (2009). You can actually pre-order a copy already on Amazon and benefit from the special 37%(!) pre-order discount:&lt;br /&gt;&lt;a href="http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/dp/0470484322" target="amazon"&gt;&lt;img src="http://media.wiley.com/product_data/coverImage300/22/04704843/0470484322.jpg" border="0"/&gt;&lt;/a&gt;&lt;br /&gt;I admit that it has been way more work than I thought, and I'm very glad we have reached this milestone. That said, I really am quite happy with what we've written so far, and I am convinced we will be delivering a valuable book for everybody who is interested in getting started with Business Intelligence, or just Pentaho, or both. Although Jos and I did the writing, we got great support from Pentaho developers as well as various community members, and this really was a huge help to us. (I don't want to name names now because I would risk leaving someone out accidentally - we'll make sure to get a full list of acknowledgements.)&lt;br /&gt;&lt;br /&gt;As a project I think things have developed pretty well too. 
We experienced some delay with respect to our original schedule, but to be frank I expected that would happen when we started this. All in all, we are two weeks behind, but over a period of more than six months, I don't think that's too bad. I really should mention that so far, we received excellent coaching and advice from our contacts at Wiley, Bob Elliot and Sara Shlaer. Thanks! And on to the last few bits of work...&lt;br /&gt;&lt;br /&gt;Update: I forgot to mention - we are covering the new and upcoming version of the Pentaho platform, Pentaho 3. However, tools within the platform follow their own version scheme so a single version number doesn't say that much (for example, the data integration chapters focus on Pentaho Data Integration 3.2).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-3634017746737380668?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/3634017746737380668/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=3634017746737380668' title='26 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3634017746737380668'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3634017746737380668'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/05/completed-draft-for-pentaho-solutions.html' title='Completed Draft for the &quot;Pentaho Solutions&quot; Book'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' 
src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>26</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5167438599473576991</id><published>2009-05-18T11:38:00.005+02:00</published><updated>2009-05-18T12:12:06.507+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='MS Sharepoint'/><category scheme='http://www.blogger.com/atom/ns#' term='CSS'/><category scheme='http://www.blogger.com/atom/ns#' term='Microsoft Sharepoint'/><category scheme='http://www.blogger.com/atom/ns#' term='HTML'/><category scheme='http://www.blogger.com/atom/ns#' term='coding horror'/><title type='text'>Wanna discuss Microsoft Sharepoint?</title><content type='html'>Currently, I'm developing something that requires integration with Microsoft Sharepoint. &lt;br /&gt;&lt;br /&gt;If you're considering doing the same, and are used to standard HTML, CSS and JavaScript techniques like object-detection and namespacing, I seriously advise you to have some painkillers in reach, as well as some hard liquor to wash them down. 
&lt;br /&gt;&lt;br /&gt;If you'd like to discuss my point of view, please use this function from &lt;code&gt;core.js&lt;/code&gt;, kindly provided by the open minded crew that is responsible for the web development disaster better known as Microsoft Sharepoint:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;function Discuss(strUrl)&lt;br /&gt;{&lt;br /&gt; var L_IE5upRequired_Text="'Discuss' requires a Windows SharePoint Services-compatible application and Microsoft Internet Explorer 6.0 or greater.";&lt;br /&gt; if (browseris.ie5up &amp;&amp; browseris.win32)&lt;br /&gt;  window.parent.location.href=strUrl;&lt;br /&gt; else&lt;br /&gt;  alert(L_IE5upRequired_Text);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Thank you.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5167438599473576991?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5167438599473576991/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5167438599473576991' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5167438599473576991'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5167438599473576991'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/05/wanna-discuss-sharepoint.html' title='Wanna discuss Microsoft Sharepoint?'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' 
src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-6007676505561538796</id><published>2009-04-21T15:31:00.001+02:00</published><updated>2009-04-21T15:33:39.441+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='Sun'/><category scheme='http://www.blogger.com/atom/ns#' term='NULL'/><category scheme='http://www.blogger.com/atom/ns#' term='Oracle'/><title type='text'>Not so random slashdot comment Gem</title><content type='html'>&lt;blockquote&gt;Java 8 will replace String with String2, which will treat empty string and null the same.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;By characterZer0 (138196), on Monday April 20, @08:44AM&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-6007676505561538796?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/6007676505561538796/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=6007676505561538796' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6007676505561538796'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/6007676505561538796'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/04/not-so-random-slashdot-comment-gem.html' title='Not so random slashdot comment Gem'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' 
src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5646878138542592170</id><published>2009-03-26T12:11:00.002+01:00</published><updated>2009-03-26T12:54:43.647+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='database size'/><category scheme='http://www.blogger.com/atom/ns#' term='google chart api'/><category scheme='http://www.blogger.com/atom/ns#' term='information_schema'/><title type='text'>More fun visualizing MySQL Database Size</title><content type='html'>Hi again!&lt;br /&gt;&lt;br /&gt;As a sidekick for my &lt;a href="http://rpbouman.blogspot.com/2009/03/faster-mysql-database-size-google-chart.html" target="_rpb"&gt;previous post&lt;/a&gt;, I came up with a snippet of code that generates the &lt;a href="http://code.google.com/apis/chart/" target="_google"&gt;Google Chart&lt;/a&gt; URL to visualize table size for the current database. For example, for &lt;a href="http://dev.mysql.com/doc/sakila/en/sakila.html" target="_mysql"&gt;the sakila sample database&lt;/a&gt;, we get URL's like this:&lt;pre&gt;http://chart.apis.google.com/chart?cht=bhs&amp;chbh=19,2,2&amp;chs=653x459&amp;chtt=sakila%20Size%20(MB)&amp;chco=4D89F9,C6D9FD&amp;chd=t:0.0156,0.0156,0.0156,0.0156,0.0156,0.0469,0.0625,0.0625,0.0781,0.0781,0.1875,0.1875,0.1875,0.1163,0.1719,1.5156,1.5156|0.0000,0.0000,0.0000,0.0156,0.0313,0.0156,0.0156,0.0313,0.0156,0.0469,0.0781,0.0781,0.0781,0.2002,0.1875,0.6094,1.2031&amp;chds=0,2.7188&amp;chxt=y,x&amp;chxl=0:|rental%20(InnoDB)|payment%20(InnoDB)|inventory%20(InnoDB)|film_text%20(MyISAM)|films%20(InnoDB)|film_actor%20(InnoDB)|film%20(InnoDB)|customer%20(InnoDB)|staff%20(InnoDB)|address%20(InnoDB)|film_category%20(InnoDB)|city%20(InnoDB)|store%20(InnoDB)|actor%20(InnoDB)|language%20(InnoDB)|country%20(InnoDB)|category%20(InnoDB)|1:|0|2.72MB&amp;chm=N*f2*,000000,0,-1,11|N*f2*,000000,1,-1,11&lt;/pre&gt;The graph looks like this: &lt;img 
alt="Google Chart: Sakila database size" src="http://chart.apis.google.com/chart?cht=bhs&amp;chbh=19,2,2&amp;chs=653x459&amp;chtt=sakila%20Size%20(MB)&amp;chco=4D89F9,C6D9FD&amp;chd=t:0.0156,0.0156,0.0156,0.0156,0.0156,0.0469,0.0625,0.0625,0.0781,0.0781,0.1875,0.1875,0.1875,0.1163,0.1719,1.5156,1.5156|0.0000,0.0000,0.0000,0.0156,0.0313,0.0156,0.0156,0.0313,0.0156,0.0469,0.0781,0.0781,0.0781,0.2002,0.1875,0.6094,1.2031&amp;chds=0,2.7188&amp;chxt=y,x&amp;chxl=0:|rental%20(InnoDB)|payment%20(InnoDB)|inventory%20(InnoDB)|film_text%20(MyISAM)|films%20(InnoDB)|film_actor%20(InnoDB)|film%20(InnoDB)|customer%20(InnoDB)|staff%20(InnoDB)|address%20(InnoDB)|film_category%20(InnoDB)|city%20(InnoDB)|store%20(InnoDB)|actor%20(InnoDB)|language%20(InnoDB)|country%20(InnoDB)|category%20(InnoDB)|1:|0|2.72MB&amp;chm=N*f2*,000000,0,-1,11|N*f2*,000000,1,-1,11"/&gt;Here's the script I used:&lt;pre&gt;SET @maxpixels:=300000;&lt;br /&gt;SET @barwidth:=19;&lt;br /&gt;SET @spacebetweenbars:=2;&lt;br /&gt;SET @spacebetweengroups:=2;&lt;br /&gt;SET @totalbarwidth:=@barwidth+(2*@spacebetweenbars)+(2*@spacebetweengroups);&lt;br /&gt;SET @megabytes:=1024*1024;&lt;br /&gt;SET @decimals:=2;&lt;br /&gt;&lt;br /&gt;SELECT      CONCAT(&lt;br /&gt;                 'http://chart.apis.google.com/chart'&lt;br /&gt;            ,    '?cht=bhs'&lt;br /&gt;            ,    '&amp;chbh=',@barwidth,',',@spacebetweenbars,',',@spacebetweengroups&lt;br /&gt;            ,    '&amp;chs=', @maxpixels div (COUNT(*) * @totalbarwidth),'x', COUNT(*) * @totalbarwidth&lt;br /&gt;            ,    '&amp;chtt=', table_schema, ' Size (MB)' &lt;br /&gt;            ,    '&amp;chco=4D89F9,C6D9FD'&lt;br /&gt;            ,    '&amp;chd=t:',    GROUP_CONCAT(data_length / @megabytes ORDER BY (data_length+index_length))&lt;br /&gt;            ,    '|', GROUP_CONCAT(index_length / @megabytes ORDER BY (data_length+index_length))&lt;br /&gt;            ,    '&amp;chds=' ,0, ',', MAX(data_length+index_length)/@megabytes&lt;br /&gt;        
    ,    '&amp;chxt=y,x'&lt;br /&gt;            ,    '&amp;chxl=0:|', GROUP_CONCAT(table_name, ' (', engine,')' ORDER BY (data_length + index_length) DESC separator '|')&lt;br /&gt;            ,         '|1:|', 0, '|', ROUND(MAX(data_length+index_length) / @megabytes, @decimals), 'MB'&lt;br /&gt;            ,    '&amp;chm=N*f',@decimals,'*,000000,0,-1,11|N*f',@decimals,'*,000000,1,-1,11'&lt;br /&gt;            )&lt;br /&gt;FROM        information_schema.tables&lt;br /&gt;WHERE       table_schema = SCHEMA()&lt;br /&gt;AND         table_type   = 'BASE TABLE'&lt;br /&gt;GROUP BY    table_schema&lt;/pre&gt;I'm not really satisfied yet... I keep hitting limitations with regard to Google Charts. I built a little bit of logic that will ensure the resulting picture is within the upper limit of 300000 pixels. I just found out there is another limitation that says the chart height can't exceed 1000 pixels. I'm going to stop looking at this for a while, and maybe later on I will come up with some works-most-of-the-time kind of logic to control this.&lt;br /&gt;&lt;br /&gt;Feel free to drop me a line if you have some ideas regarding this.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5646878138542592170?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5646878138542592170/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5646878138542592170' title='20 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5646878138542592170'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5646878138542592170'/><link rel='alternate' type='text/html' 
href='http://rpbouman.blogspot.com/2009/03/more-fun-visualizing-mysql-database.html' title='More fun visualizing MySQL Database Size'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>20</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-472283629512966495</id><published>2009-03-26T12:01:00.000+01:00</published><updated>2009-03-26T13:57:10.026+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Refactoring'/><category scheme='http://www.blogger.com/atom/ns#' term='GROUP_CONCAT'/><category scheme='http://www.blogger.com/atom/ns#' term='database size'/><category scheme='http://www.blogger.com/atom/ns#' term='google chart api'/><category scheme='http://www.blogger.com/atom/ns#' term='information_schema'/><category scheme='http://www.blogger.com/atom/ns#' term='cursor'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>A Faster MySQL Database Size Google Chart</title><content type='html'>Abstract - As &lt;a href="http://blog.olindata.com/2009/02/using-the-google-graph-api-with-mysql-stored-functions/" target="_olindata"&gt;described&lt;/a&gt; by Walter Heck, MySQL database size can be visualized using &lt;a href="http://code.google.com/apis/chart/" target="_google"&gt;Google Charts&lt;/a&gt;. With a minor code improvement the URL for the chart can be obtained twice as fast. 
With some more modification, the number of lines can be cut down, resulting in a function that is half as long.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hi!&lt;/h3&gt;It's been a while since I posted - I admit I'm struggling a bit to balance time and attention to the day job, &lt;a href="http://rpbouman.blogspot.com/2009/01/writing-book-building-pentaho-solutions.html" target="_rpb"&gt;writing a book&lt;/a&gt;, preparing &lt;a href="http://en.oreilly.com/mysql2009/public/schedule/speaker/198" target="_uc"&gt;my talks&lt;/a&gt; for the &lt;a href="http://en.oreilly.com/mysql2009/public/content/home" target="_uc"&gt;MySQL user's conference&lt;/a&gt; and of course family life.&lt;br /&gt;&lt;br /&gt;A month ago or so I &lt;a href="http://www.pythian.com/news/1490/google-charts-for-dba-tablespaces-allocation" target="_pythian"&gt;read&lt;/a&gt; a &lt;a href="http://blog.olindata.com/2009/02/using-the-google-graph-api-with-mysql-stored-functions/" target="_olindata"&gt;couple&lt;/a&gt; of &lt;a href="http://www.ruturaj.net/mysql-google-pie-charts" target="_ruturaj"&gt;posts&lt;/a&gt; about using the &lt;a href="http://code.google.com/apis/chart/" target="_google"&gt;Google chart API&lt;/a&gt; to visualize database size. Although I personally would not consider using Google Charts (in its current form) for serious application monitoring, I am quite charmed by its ease of use and availability. Time to give it a try myself.&lt;br /&gt;&lt;br /&gt;I inspected the PL/SQL code provided by &lt;a href="http://www.pythian.com/news/1490/google-charts-for-dba-tablespaces-allocation" target="_pythian"&gt;Alex Gorbachev&lt;/a&gt;, and the MySQL code by &lt;a href="http://blog.olindata.com/2009/02/using-the-google-graph-api-with-mysql-stored-functions/" target="_olindata"&gt;Walter Heck&lt;/a&gt;, as well as the improved implementation by &lt;a href="http://www.ruturaj.net/mysql-google-pie-charts" target="_ruturaj"&gt;Ruturaj Vartak&lt;/a&gt;. 
&lt;br /&gt;&lt;br /&gt;Although I applaud both Walter and Ruturaj's efforts in porting this code, I think their code can be improved still. In this short article I'd like to illustrate how a minor modification can double performance.&lt;h3&gt;Code Analysis&lt;/h3&gt;Let's take a brief moment to analyze Walter's original code. I will cite a few fragments of his code. (I have made some simplifications and removed some distractions to make it as easy as possible to understand the logic of the code. If you find an error in my citation, please consider the possibility that it is my doing, not Walter's).&lt;br /&gt;&lt;br /&gt;His program takes the form of a &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/create-procedure.html" target="_mysql"&gt;MySQL stored function&lt;/a&gt;. It accepts a few parameters to configure the chart (chart type and size) and outputs a fully functional URL which can be used to retrieve a .png image that shows the size (in MB) of all databases managed by the MySQL server. This is the function signature:&lt;pre&gt;CREATE FUNCTION FNC_GOOGRAPH_DB_SIZE(&lt;br /&gt;  p_chart_type CHAR,&lt;br /&gt;  p_height INT,&lt;br /&gt;  p_width INT&lt;br /&gt;) RETURNS varchar(255)&lt;/pre&gt;The stored function is implemented by retrieving size metadata from the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/tables-table.html" target="_mysql"&gt;information_schema.TABLES&lt;/a&gt;&lt;/code&gt; table. The size of each table's data and indexes are then added and summed per database. The results are traversed using a &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/cursors.html" target="_mysql"&gt;cursor&lt;/a&gt; loop. 
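&lt;br /&gt;&lt;br /&gt;For orientation, this is roughly the shape such a cursor loop takes in a MySQL stored routine. It is only a minimal sketch, not Walter's actual code: the procedure name &lt;code&gt;p_cursor_loop_sketch&lt;/code&gt;, the variables &lt;code&gt;v_done&lt;/code&gt; and &lt;code&gt;v_schema_name&lt;/code&gt; and the label &lt;code&gt;schema_loop&lt;/code&gt; are hypothetical placeholders, and the &lt;code&gt;DELIMITER&lt;/code&gt; lines assume the mysql command line client:&lt;pre&gt;DELIMITER //&lt;br /&gt;CREATE PROCEDURE p_cursor_loop_sketch()&lt;br /&gt;READS SQL DATA&lt;br /&gt;BEGIN&lt;br /&gt;    -- hypothetical names, for illustration only&lt;br /&gt;    DECLARE v_done BOOL DEFAULT FALSE;&lt;br /&gt;    DECLARE v_schema_name VARCHAR(64);&lt;br /&gt;    DECLARE v_data_length_sum DOUBLE;&lt;br /&gt;&lt;br /&gt;    -- cursors must be declared after variables and before handlers&lt;br /&gt;    DECLARE c_schema_sizes CURSOR FOR&lt;br /&gt;        SELECT   t.table_schema&lt;br /&gt;        ,        SUM(t.data_length + t.index_length) / 1024 / 1024&lt;br /&gt;        FROM     information_schema.tables t&lt;br /&gt;        GROUP BY t.table_schema;&lt;br /&gt;&lt;br /&gt;    -- a FETCH past the last row raises NOT FOUND; remember that in v_done&lt;br /&gt;    DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done := TRUE;&lt;br /&gt;&lt;br /&gt;    OPEN c_schema_sizes;&lt;br /&gt;    schema_loop: LOOP&lt;br /&gt;        FETCH c_schema_sizes INTO v_schema_name, v_data_length_sum;&lt;br /&gt;        IF v_done THEN&lt;br /&gt;            LEAVE schema_loop;&lt;br /&gt;        END IF;&lt;br /&gt;        -- per-row work (building the label and data lists) would go here&lt;br /&gt;    END LOOP schema_loop;&lt;br /&gt;    CLOSE c_schema_sizes;&lt;br /&gt;END//&lt;br /&gt;DELIMITER ;&lt;/pre&gt;The handler is what lets the loop terminate: once &lt;code&gt;FETCH&lt;/code&gt; runs out of rows, &lt;code&gt;v_done&lt;/code&gt; is set and the loop is left before any per-row work is done.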
&lt;br /&gt;&lt;br /&gt;Here's the cursor declaration:&lt;pre&gt;DECLARE c_schema_sizes cursor FOR&lt;br /&gt;SELECT    t.table_schema&lt;br /&gt;,         &lt;b&gt;SUM(t.data_length + t.index_length)&lt;/b&gt; / 1024 / 1024&lt;br /&gt;FROM      information_schema.tables t&lt;br /&gt;&lt;b&gt;GROUP BY  t.table_schema&lt;/b&gt;&lt;/pre&gt;Note that the figure is immediately represented as megabytes by dividing twice by &lt;code&gt;1024&lt;/code&gt;. (In other words, divide by &lt;code&gt;1024&lt;/code&gt; to make kilobytes of the raw bytes, then divide by &lt;code&gt;1024&lt;/code&gt; again to make megabytes out of kilobytes.) &lt;br /&gt;&lt;br /&gt;During iteration, a list of database names and a list of database sizes are built through string concatenation. These are required to supply the data series and labels to the chart. &lt;br /&gt;&lt;br /&gt;Only the code for constructing the data series is shown here:&lt;pre&gt;/* Get the percentage of the total size as the graph's data */&lt;br /&gt;IF v_chart_data = '' THEN&lt;br /&gt;    SET v_chart_data = &lt;br /&gt;        ROUND(v_data_length_sum &lt;b&gt;/ v_data_length_total&lt;/b&gt;, 2) &lt;b&gt;* 100&lt;/b&gt;;&lt;br /&gt;ELSE&lt;br /&gt;    SET v_chart_data = CONCAT(v_chart_data,',',&lt;br /&gt;        ROUND(v_data_length_sum &lt;b&gt;/ v_data_length_total&lt;/b&gt;, 2) &lt;b&gt;* 100&lt;/b&gt;);&lt;br /&gt;END IF;&lt;/pre&gt;Note that instead of making a list of actual sizes, a percentage is taken. Keep that in mind, we'll discuss this in more detail in the next section.&lt;br /&gt;&lt;br /&gt;Finally, the actual URL is built, using the list of database sizes as the data series and the list of database names as data labels. 
Here's the URL construction code: &lt;pre&gt;SET v_url = 'http://chart.apis.google.com/chart?';&lt;br /&gt;SET v_url = CONCAT(v_url, 'cht=', p_chart_type);&lt;br /&gt;SET v_url = CONCAT(v_url, '&amp;chs=', p_width , 'x', p_height);&lt;br /&gt;SET v_url = CONCAT(v_url, '&amp;chtt=Database Sizes (MB)');&lt;br /&gt;&lt;b&gt;SET v_url = CONCAT(v_url, '&amp;chl=', v_chart_labels);&lt;/b&gt;&lt;br /&gt;&lt;b&gt;SET v_url = CONCAT(v_url, '&amp;chd=t:', v_chart_data);&lt;/b&gt;&lt;br /&gt;&lt;b&gt;SET v_url = CONCAT(v_url, '&amp;chdl=', v_legend_labels);&lt;/b&gt;&lt;/pre&gt;&lt;h3&gt;About Google Chart Data Formats&lt;/h3&gt;We just mentioned that the data points are not simply concatenated - rather, data size per database is expressed as a &lt;em&gt;percentage&lt;/em&gt; of the total size of all databases. This is not just some arbitrary choice - it is in fact required by the Google Chart API. Here's a quote from the &lt;a href="http://code.google.com/apis/chart/formats.html#overview" target="_google"&gt;Overview of data formats&lt;/a&gt;:&lt;blockquote&gt;Before you can create a chart you must encode your data into a form that is understood by the Chart API. Use one of the following formats:&lt;ul&gt;&lt;li&gt;Text encoding uses a string of positive floating point numbers from zero to one hundred.&lt;/li&gt;&lt;li&gt;...&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;For this reason, the code divides everything by &lt;code&gt;v_data_length_total&lt;/code&gt; and multiplies by &lt;code&gt;100&lt;/code&gt;. In our code analysis, we did not discuss &lt;code&gt;v_data_length_total&lt;/code&gt; because it was not really essential for the logic of the program. This is how it is computed:&lt;pre&gt;SELECT  ROUND(SUM(t.data_length + t.index_length) / 1024 / 1024) &lt;br /&gt;INTO    v_data_length_total&lt;br /&gt;FROM    information_schema.tables t&lt;/pre&gt;Note that this query is very similar to the query used in the cursor declaration. 
In both cases, the &lt;code&gt;information_schema.TABLES&lt;/code&gt; table is queried to calculate the combined size of table data and indexes. The main difference is grouping: in the cursor declaration a &lt;code&gt;GROUP BY table_schema&lt;/code&gt; was used to calculate the size &lt;em&gt;per database&lt;/em&gt;. In this case, grouping is absent and the size is calculated across databases, i.e. the combined size of all tables and indexes in the entire server.&lt;h4&gt;Text encoding with Data Scaling&lt;/h4&gt;&lt;br /&gt;Now, as it turns out, the text encoding used by this code comes in two flavours. This is the other one:&lt;blockquote&gt;Text encoding with data scaling uses a string of positive and negative floating point numbers in combination with a scaling parameter.&lt;/blockquote&gt;So, as an alternative to 'pre-scaling' the values in the data series to a 100-point scale, we can also opt to let Google do the work, provided we pass this scale factor.&lt;br /&gt;&lt;br /&gt;Because the code to calculate the scale factor is already present, we can simply remove the scaling computation inside the loop, and add the scale factor to the URL like so:&lt;pre&gt;&lt;br /&gt;    -- inside loop: remove percentage calculation:&lt;br /&gt;    IF v_chart_data = '' THEN&lt;br /&gt;        SET v_chart_data = ROUND(v_data_length_sum, 2);&lt;br /&gt;    ELSE&lt;br /&gt;        SET v_chart_data = CONCAT(v_chart_data, ',', ROUND(v_data_length_sum, 2));&lt;br /&gt;    END IF;&lt;br /&gt;&lt;br /&gt;    -- outside loop: add scaling parameter to URL:&lt;br /&gt;    SET v_url = CONCAT(v_url, '&amp;chds=0,', v_data_length_total);&lt;br /&gt;&lt;/pre&gt;However, we can do better than that.&lt;h3&gt;An alternative method to calculate the scaling factor&lt;/h3&gt;The problem with the current solution is that it has to do a separate query to obtain the total size. We already noticed that the two queries are quite similar, save for the grouping. 
Because we are already looping through all results, we can try to calculate the total ourselves. This would allow us to avoid doing the &lt;code&gt;SELECT...INTO&lt;/code&gt; statement altogether. It works because the total is now only needed after the loop has finished, when the &lt;code&gt;chds&lt;/code&gt; parameter is added to the URL. &lt;br /&gt;&lt;br /&gt;We would need to add proper initialization for the &lt;code&gt;v_data_length_total&lt;/code&gt; variable (to &lt;code&gt;0&lt;/code&gt;, because adding anything to &lt;code&gt;NULL&lt;/code&gt; yields &lt;code&gt;NULL&lt;/code&gt;) and add the following line inside the loop:&lt;pre&gt;SET v_data_length_total := v_data_length_total + v_data_length_sum;&lt;/pre&gt;When comparing performance before and after this change, we can observe a dramatic difference. On my laptop, the original function takes anywhere between 45 and 50 seconds. After incorporating these changes, the time is cut in half and the function takes 'only' 22 to 24 seconds. &lt;br /&gt;&lt;br /&gt;The fact that the second solution is twice as fast (~23 seconds instead of ~46) is not a coincidence: it's because we're doing half the number of queries (1 instead of 2). Don't get me wrong - this is still very slow. But this all comes down to poor &lt;code&gt;information_schema&lt;/code&gt; performance. In this particular case, there is not much we can do to improve the code further to gain performance.&lt;h3&gt;More performance improvements?&lt;/h3&gt;I am very much convinced the bottleneck in the database size chart function is the query on the &lt;code&gt;information_schema.TABLES&lt;/code&gt; table. To be more exact, it's the fact that we're accessing the &lt;code&gt;DATA_LENGTH&lt;/code&gt; and &lt;code&gt;INDEX_LENGTH&lt;/code&gt; columns. 
Just watch this:&lt;pre&gt;mysql&amp;gt; select count(*) from information_schema.tables;&lt;br /&gt;+----------+&lt;br /&gt;| count(*) |&lt;br /&gt;+----------+&lt;br /&gt;|     1576 |&lt;br /&gt;+----------+&lt;br /&gt;1 row in set (&lt;b&gt;0.08 sec&lt;/b&gt;)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; select count(data_length) from information_schema.tables;&lt;br /&gt;+--------------------+&lt;br /&gt;| count(data_length) |&lt;br /&gt;+--------------------+&lt;br /&gt;|               1563 |&lt;br /&gt;+--------------------+&lt;br /&gt;1 row in set (&lt;b&gt;25.98 sec&lt;/b&gt;)&lt;br /&gt;&lt;br /&gt;mysql&amp;gt; select count(index_length) from information_schema.tables;&lt;br /&gt;+---------------------+&lt;br /&gt;| count(index_length) |&lt;br /&gt;+---------------------+&lt;br /&gt;|                1563 |&lt;br /&gt;+---------------------+&lt;br /&gt;1 row in set (&lt;b&gt;25.41 sec&lt;/b&gt;)&lt;/pre&gt;As you can see, simply accessing &lt;code&gt;DATA_LENGTH&lt;/code&gt; and/or &lt;code&gt;INDEX_LENGTH&lt;/code&gt; causes the query to be slow. &lt;br /&gt;&lt;br /&gt;Notice also that the time spent to execute this stand-alone query is about the same as it takes for the improved function to complete. Basically, this tells us the function spends all its time performing the query - timewise, the contribution of the remainder of the function, such as cursor traversal, creating the value lists and building the URL is simply negligible.&lt;br /&gt;&lt;br /&gt;This means that we simply can't improve the performance of the database size chart function unless we can improve the performance of this query. Because we have no other general way of obtaining index and data size, this is the end of the line.&lt;h3&gt;Other Improvements&lt;/h3&gt;Even though we probably can't improve the performance of Walter's function anymore, I still think it's possible to improve the code. 
&lt;br /&gt;&lt;br /&gt;I guess this is kind of a pet peeve of mine, but I dislike using MySQL cursors. There's a lot of &lt;a href="http://rpbouman.blogspot.com/2005/09/why-repeat-and-while-are-usually-not.html" target="_rpb"&gt;syntax involved in setting them up&lt;/a&gt;. Usually, they &lt;a href="http://rpbouman.blogspot.com/2006/09/refactoring-mysql-cursors.html" target="_rpb"&gt;can be avoided&lt;/a&gt; anyway.&lt;br /&gt;&lt;br /&gt;MySQL cursors are also pretty slow. This is something you really start to notice when traversing tens of thousands of rows. Not that that really matters for building Google Chart URLs: most likely, you will hit a limitation in either &lt;a href="http://code.google.com/apis/chart/basics.html#chart_size" target="_google"&gt;chart size&lt;/a&gt; or &lt;a href="http://code.google.com/apis/chart/faq.html#url_length" target="_google"&gt;URL length&lt;/a&gt;, so you can't really gain a lot of performance by eliminating cursors for this particular purpose.&lt;br /&gt;&lt;br /&gt;Still, I think that eliminating the cursor will help to make the code less complex, so let's try anyway.&lt;h4&gt;Building Lists: &lt;code&gt;GROUP_CONCAT()&lt;/code&gt;&lt;/h4&gt;We just explained how the cursor was used to build various lists of values. As it happens, MySQL supports the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/group-by-functions.html" target="_mysql"&gt;GROUP_CONCAT()&lt;/a&gt;&lt;/code&gt; function, which was designed especially for that purpose. &lt;code&gt;GROUP_CONCAT()&lt;/code&gt; is an aggregate function, just like &lt;code&gt;COUNT()&lt;/code&gt;, &lt;code&gt;MIN()&lt;/code&gt; and &lt;code&gt;MAX()&lt;/code&gt;. &lt;br /&gt;&lt;br /&gt;Like other aggregate functions, &lt;code&gt;GROUP_CONCAT()&lt;/code&gt; can produce a single summary result for a group of rows. 
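&lt;br /&gt;&lt;br /&gt;As a quick illustration, the following query returns one row per database, with the names of its base tables collapsed into a single comma-separated string. It is a sketch of my own rather than something taken from the posts mentioned above, and the aliases &lt;code&gt;table_count&lt;/code&gt; and &lt;code&gt;table_names&lt;/code&gt; are just placeholders:&lt;pre&gt;SELECT   table_schema&lt;br /&gt;,        COUNT(*) AS table_count&lt;br /&gt;,        GROUP_CONCAT(table_name ORDER BY table_name SEPARATOR ', ') AS table_names&lt;br /&gt;FROM     information_schema.tables&lt;br /&gt;WHERE    table_type = 'BASE TABLE'&lt;br /&gt;GROUP BY table_schema;&lt;/pre&gt;Each group of rows, here all base tables in one database, ends up as one row with one list.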
The function does so by first concatenating its arguments for each row in the group, and then concatenating the per-row results for the entire group of rows, separating row results with a separator (a comma by default).&lt;br /&gt;&lt;br /&gt;Now, there is an important limitation with this function that causes many people to avoid &lt;code&gt;GROUP_CONCAT&lt;/code&gt; altogether. Here's the relevant text from the manual:&lt;blockquote&gt;The result is truncated to the maximum length that is given by the &lt;code&gt;&lt;a href="http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_group_concat_max_len" target="_mysql"&gt;group_concat_max_len&lt;/a&gt;&lt;/code&gt; system variable, which has a default value of 1024.&lt;/blockquote&gt;Basically, this says that if you don't take proper precautions, the lists you generate with &lt;code&gt;GROUP_CONCAT&lt;/code&gt; may be truncated. This is obviously bad news! However, there is a very simple workaround, which is also hinted at in the documentation:&lt;blockquote&gt;The value can be set higher, although the effective maximum length of the return value is constrained by the value of &lt;code&gt;&lt;a target="_mysql" href="http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_max_allowed_packet"&gt;max_allowed_packet&lt;/a&gt;&lt;/code&gt;.&lt;/blockquote&gt;So, the workaround is simple: assigning the value of &lt;code&gt;max_allowed_packet&lt;/code&gt; to &lt;code&gt;group_concat_max_len&lt;/code&gt; will allow the longest possible list of values. You may argue that this might still not be long enough. However, that is a moot point. 
Just read up on &lt;code&gt;&lt;a target="_mysql" href="http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_max_allowed_packet"&gt;max_allowed_packet&lt;/a&gt;&lt;/code&gt;:&lt;blockquote&gt;The maximum size of one packet or any generated/intermediate string.&lt;/blockquote&gt;In other words, no MySQL string can ever be longer than &lt;code&gt;max_allowed_packet&lt;/code&gt; - the same limit holds for any other method of concatenating strings within MySQL, including cursor loops.&lt;h3&gt;A cursor-less database size chart function&lt;/h3&gt;Without further ado, this is how I would write the database size chart function:&lt;pre&gt;CREATE FUNCTION  f_dbsize_google_chart(&lt;br /&gt;    p_chart_type ENUM('bhs','p')&lt;br /&gt;,   p_width      MEDIUMINT UNSIGNED&lt;br /&gt;,   p_height     MEDIUMINT UNSIGNED&lt;br /&gt;)&lt;br /&gt;RETURNS LONGTEXT&lt;br /&gt;READS SQL DATA&lt;br /&gt;BEGIN&lt;br /&gt;    DECLARE v_sum_size, v_max_size DOUBLE;&lt;br /&gt;    DECLARE v_size_series, v_size_labels LONGTEXT;&lt;br /&gt;&lt;br /&gt;    -- store current group_concat_max_len so we can reset&lt;br /&gt;    DECLARE v_group_concat_max_len BIGINT UNSIGNED DEFAULT @@group_concat_max_len;&lt;br /&gt;&lt;br /&gt;    -- ensure group_concat capacity&lt;br /&gt;    SET @@group_concat_max_len := @@max_allowed_packet;&lt;br /&gt;&lt;br /&gt;    -- get the database size&lt;br /&gt;    SELECT  ROUND(SUM(size),2), MAX(size)&lt;br /&gt;    ,       GROUP_CONCAT(&lt;br /&gt;                ROUND(size,2)&lt;br /&gt;                ORDER BY size, table_schema&lt;br /&gt;            )&lt;br /&gt;    ,       GROUP_CONCAT(&lt;br /&gt;                table_schema &lt;br /&gt;                ORDER BY size, table_schema&lt;br /&gt;                SEPARATOR '|'&lt;br /&gt;            )&lt;br /&gt;    INTO    v_sum_size, v_max_size&lt;br /&gt;    ,       v_size_series, v_size_labels&lt;br /&gt;    FROM    (&lt;br /&gt;            SELECT      table_schema&lt;br /&gt;     
       ,           SUM(data_length + index_length) / 1024 / 1024 size&lt;br /&gt;            FROM        information_schema.tables &lt;br /&gt;            WHERE       table_type = 'BASE TABLE'&lt;br /&gt;            GROUP BY    table_schema&lt;br /&gt;            ) a;&lt;br /&gt;&lt;br /&gt;    -- restore original group_concat_max_len&lt;br /&gt;    SET @@group_concat_max_len := v_group_concat_max_len;&lt;br /&gt;&lt;br /&gt;    -- build URL&lt;br /&gt;    RETURN CONCAT(&lt;br /&gt;        'http://chart.apis.google.com/chart'&lt;br /&gt;    ,   '?cht='   , p_chart_type&lt;br /&gt;    ,   '&amp;chs='   , p_width, 'x', p_height&lt;br /&gt;    ,   '&amp;chds=0,', v_max_size&lt;br /&gt;    ,   '&amp;chd=t:' , v_size_series&lt;br /&gt;    ,   '&amp;chdl='  , v_size_labels&lt;br /&gt;    ,   '&amp;chl='   , replace(v_size_series, ',', '|')&lt;br /&gt;    ,   '&amp;chtt=MySQL Database Size (', v_sum_size, 'MB)'&lt;br /&gt;    );&lt;br /&gt;END;&lt;/pre&gt;Here's a quick summary that points out some differences with regard to the original code:&lt;ul&gt;&lt;li&gt;Instead of accepting &lt;code&gt;CHAR(1)&lt;/code&gt; for the chart type, an &lt;code&gt;ENUM('bhs','p')&lt;/code&gt; is used to restrict the value to a listed type.&lt;/li&gt;&lt;li&gt;Instead of returning a &lt;code&gt;varchar(3000)&lt;/code&gt;, this function returns &lt;code&gt;LONGTEXT&lt;/code&gt;, effectively leaving it to the Google chart API to report a URL length limitation.&lt;/li&gt;&lt;li&gt;Instead of a cursor, we use a single &lt;code&gt;SELECT...INTO&lt;/code&gt; statement.&lt;/li&gt;&lt;li&gt;The query in the &lt;code&gt;SELECT...INTO&lt;/code&gt; statement uses a subquery in the &lt;code&gt;FROM&lt;/code&gt; clause which is functionally equivalent to the cursor in the original code. A small but important improvement is the addition of a condition to restrict the result only to base tables. 
This automatically excludes the &lt;code&gt;information_schema&lt;/code&gt; and databases that contain only views.&lt;/li&gt;&lt;li&gt;The outer query is functionally equivalent to the cursor loop and uses one &lt;code&gt;GROUP_CONCAT()&lt;/code&gt; expression to obtain a list of values, and one &lt;code&gt;GROUP_CONCAT()&lt;/code&gt; expression to obtain a list of labels. Here, we also calculate the scaling factor by taking the &lt;code&gt;MAX&lt;/code&gt; of the database size.&lt;/li&gt;&lt;li&gt;To work around the truncation problem with &lt;code&gt;GROUP_CONCAT&lt;/code&gt;, we set &lt;code&gt;group_concat_max_len&lt;/code&gt; to the maximum practical value. We don't just set &lt;code&gt;group_concat_max_len&lt;/code&gt; and leave it at that: we restore its original value at the end of the stored function.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-472283629512966495?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/472283629512966495/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=472283629512966495' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/472283629512966495'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/472283629512966495'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/03/faster-mysql-database-size-google-chart.html' title='A Faster MySQL Database Size Google Chart'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' 
src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5667216309480331322</id><published>2009-03-04T20:43:00.004+01:00</published><updated>2009-03-04T21:02:41.450+01:00</updated><title type='text'>Woot! MySQL bug 11661 fixed, finally...</title><content type='html'>Yeah...title says it all.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://bugs.mysql.com/bug.php?id=11661" target="_mysql"&gt;Bug #11661&lt;/a&gt;, &lt;em&gt;Raising Exceptions from within stored procedures: Support for SIGNAL statement&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;I can't recall the exact moment of filing this bug, because I filed a lot of them. However, this one is special to me because I filed it when I was in the process of getting involved with MySQL and its community. I remember it as a very intense period of learning, meeting other community members online (mostly through &lt;a href="http://forums.mysql.com"&gt;forums.mysql.com&lt;/a&gt;) and blogging on a regular basis. &lt;br /&gt;&lt;br /&gt;A lot has happened since then, and if you ask me, most of it is good stuff. The only thing that casts somewhat of a shadow is that it took so long to fix this. But I don't want to whine about it. Rather, I'd like to use this opportunity to extend my big, BIG THANK YOU!! to Marc Alff, who has been working very hard on implementing this feature. I should also mention &lt;a href="http://blogs.mysql.com/peterg/2009/03/03/signal-and-resignal-are-in-60-main-tree/" target="_mysql"&gt;Peter Gulutzan&lt;/a&gt;, whose enduring attention to quality and efforts to make MySQL comply more closely with the SQL Standard have made MySQL a better product.&lt;br /&gt;&lt;br /&gt;Thanks Guys! 
I'm looking forward to using this and other features in &lt;a href="http://dev.mysql.com/downloads/mysql/6.0.html" target="_mysql"&gt;MySQL 6.0&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-5667216309480331322?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/5667216309480331322/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=5667216309480331322' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5667216309480331322'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/5667216309480331322'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/03/woot-mysql-bug-11661-fixed-finally.html' title='Woot! 
MySQL bug 11661 fixed, finally...'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-397778535733358435</id><published>2009-02-21T02:50:00.001+01:00</published><updated>2009-02-21T02:55:39.103+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='wikipedia'/><category scheme='http://www.blogger.com/atom/ns#' term='loss of service'/><category scheme='http://www.blogger.com/atom/ns#' term='outage'/><title type='text'>Rare find...</title><content type='html'>First time in my lifetime....&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3296575934/" title="wikiout by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3342/3296575934_d6b6d64687_o.png" width="767" height="515" alt="wikiout" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;And indeed, about a minute later all is back to normal. &lt;br /&gt;&lt;br /&gt;My respect goes out to the architects and admins of wikipedia that somehow manage to keep this large scale operation running so smoothly. 
It's truly amazing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-397778535733358435?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/397778535733358435/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=397778535733358435' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/397778535733358435'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/397778535733358435'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/02/rare-find.html' title='Rare find...'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-3308692306835894546</id><published>2009-02-20T16:10:00.001+01:00</published><updated>2009-02-20T16:13:12.517+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='.kjb file'/><category scheme='http://www.blogger.com/atom/ns#' term='Matt Casters'/><category scheme='http://www.blogger.com/atom/ns#' term='Pentaho'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Building Pentaho Solutions&quot;'/><category scheme='http://www.blogger.com/atom/ns#' term='Kettle repository'/><category scheme='http://www.blogger.com/atom/ns#' term='pentaho data integration'/><category scheme='http://www.blogger.com/atom/ns#' term='.ktr file'/><title type='text'>Exporting 
a Kettle Repository to Files</title><content type='html'>Hi All! &lt;br /&gt;&lt;br /&gt;Today I'd like to announce &lt;a href="http://code.google.com/p/krex/wiki/Introduction_to_KREX" target="_krex"&gt;KREX&lt;/a&gt;, a small solution I put together to export a &lt;a href="http://kettle.pentaho.org/" target="_kettle"&gt;Kettle&lt;/a&gt; (a.k.a. &lt;a href="http://www.pentaho.org/" target="_pentaho"&gt;Pentaho&lt;/a&gt; Data Integration) Repository to individual transformation (&lt;code&gt;.ktr&lt;/code&gt;) and job (&lt;code&gt;.kjb&lt;/code&gt;) files. &lt;br /&gt;&lt;br /&gt;The idea to create this was inspired by this &lt;a href="http://forums.pentaho.org/showthread.php?t=68017" target="_pentaho"&gt;thread&lt;/a&gt; on the &lt;a href="http://forums.pentaho.org/" target="_pentaho"&gt;Pentaho forums&lt;/a&gt;, started by &lt;a href="http://forums.pentaho.org/member.php?u=18848" target="_pentaho"&gt;kandrews&lt;/a&gt;. He (she?) wrote:&lt;blockquote&gt;Has anyone ever been able to export a PDI repository and convert it somehow into regular non-repository .kjb &amp; .ktr files? If you have done this already or this functionality already exists please let me know.&lt;br /&gt;&lt;br /&gt;My initial thoughts are possibly an XLS translation against the XML from the repository export. Thoughts?&lt;/blockquote&gt;Well, I hope this helps! Enjoy, and let me know if it's useful. 
Be advised that in the &lt;a href="http://forums.pentaho.org/showthread.php?t=68017" target="_pentaho"&gt;same thread&lt;/a&gt;, &lt;a href="http://www.ibridge.be/" target="_matt"&gt;Matt Casters&lt;/a&gt; already revealed that the functionality to do this will soon be built into PDI, but until then this may be of use.&lt;br /&gt;&lt;br /&gt;To start using KREX:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/krex/source/checkout" target="_krex"&gt;Check out&lt;/a&gt; the repository or &lt;a href="http://code.google.com/p/krex/source/browse/#svn/trunk" target="_krex"&gt;download&lt;/a&gt; the Job and Transformation files to your file system.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Open the main Job file &lt;code&gt;export_repository_to_files.kjb&lt;/code&gt; using &lt;a href="http://sourceforge.net/forum/forum.php?forum_id=918065" target="_sf"&gt;Pentaho Data Integration 3.2&lt;/a&gt;'s &lt;a href="http://sourceforge.net/forum/forum.php?forum_id=918065" target="_pentaho"&gt;Spoon&lt;/a&gt; (currently &lt;a href="http://www.ibridge.be/?p=156" target="_matt"&gt;a Milestone 1 release&lt;/a&gt;).&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Configure the &lt;code&gt;Set Source Repository Step&lt;/code&gt; in the &lt;code&gt;set_source_repo_and_target_directory&lt;/code&gt; transformation to match the repository you want to export.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Run the main job file (&lt;code&gt;export_repository_to_files.kjb&lt;/code&gt;).&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;If all goes well, you should now have a directory called &lt;code&gt;pdi_repo_export&lt;/code&gt; in your home directory. It contains a subdirectory named after your exported repository, which in turn holds the directory tree with the &lt;code&gt;.ktr&lt;/code&gt; and &lt;code&gt;.kjb&lt;/code&gt; files.&lt;br /&gt;&lt;br /&gt;Here's a quick screenshot of the main job, just to give you an idea:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3295374476/" title="krex by roland.bouman, on 
Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3378/3295374476_0760ea2c7c_o.png" width="976" height="248" alt="krex" /&gt;&lt;/a&gt;The heart of the job is formed by the very last transformation, which does the actual legwork of extracting and saving the individual transformations:&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/21931585@N07/3295374472/" title="krex2 by roland.bouman, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3464/3295374472_88803ebc85_o.png" width="858" height="524" alt="krex2" /&gt;&lt;/a&gt;&lt;br /&gt;The steps before that mainly take care of configuration and ensure that the directory tree that is to contain the files is created before we attempt to write any files.&lt;br /&gt;&lt;br /&gt;If you have any suggestions or comments, I welcome you to post them here. If you are trying to use KREX but run into an issue, please use the &lt;a href="http://code.google.com/p/krex/issues/list" target="_krex"&gt;KREX issue list&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If you are looking for more tips and tricks for Kettle and Pentaho in general, stay tuned. 
The &lt;a href="http://rpbouman.blogspot.com/2009/01/writing-book-building-pentaho-solutions.html" target="_bps"&gt;"Building Pentaho Solutions" book&lt;/a&gt; I'm writing for &lt;a href="http://www.wiley.com/WileyCDA/" target="_wiley"&gt;Wiley&lt;/a&gt; together with &lt;a href="http://www.linkedin.com/pub/0/169/771" target="_jos"&gt;Jos van Dongen&lt;/a&gt; will contain tons and tons of practical tips and solutions, and explain many of its technologies and concepts in thorough detail.&lt;br /&gt;&lt;br /&gt;Cheers and until next time,&lt;br /&gt;&lt;br /&gt;Roland&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/15319370-3308692306835894546?l=rpbouman.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://rpbouman.blogspot.com/feeds/3308692306835894546/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=15319370&amp;postID=3308692306835894546' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3308692306835894546'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/15319370/posts/default/3308692306835894546'/><link rel='alternate' type='text/html' href='http://rpbouman.blogspot.com/2009/02/exporting-kettle-repository-to-files.html' title='Exporting a Kettle Repository to Files'/><author><name>Roland Bouman</name><uri>http://www.blogger.com/profile/13365137747952711328</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='23' height='32' src='http://www.xcdsql.org/people/rbouman/roland.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-15319370.post-5810299338052382197</id><published>2009-01-31T01:19:00.002+01:00</published><updated>2009-01-31T01:36:12.932+01:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Data  warehousing'/><category scheme='http://www.blogger.com/atom/ns#' term='SCD Type 1'/><category scheme='http://www.blogger.com/atom/ns#' term='&quot;Building Pentaho Solutions&quot;'/><category scheme='http://www.blogger.com/atom/ns#' term='SCD Type 2'/><category scheme='http://www.blogger.com/atom/ns#' term='Jeff Prenevost'/><category scheme='http://www.blogger.com/atom/ns#' term='BI'/><category scheme='http://www.blogger.com/atom/ns#' term='slowly changing dimension'/><category scheme='http://www.blogger.com/atom/ns#' term='Jos van Dongen'/><category scheme='http://www.blogger.com/atom/ns#' term='GROUP BY'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Loading a dimension table with SCD1 and SCD2 attributes</title><content type='html'>&lt;a href="http://www.tholis.com/blog/" target="_jos"&gt;Jos&lt;/a&gt;, my co-author for &lt;a href="http://rpbouman.blogspot.com/2009/01/writing-book-building-pentaho-solutions.html" target="_rpb"&gt;the "Building Pentaho Solutions" book&lt;/a&gt;, just pointed me to a recent article by Jeff Prenevost entitled &lt;a href="http://www.information-management.com/infodirect/2009_106/10014831-1.html" target="_jeff"&gt;"The Problem with History"&lt;/a&gt;.&lt;h3&gt;Abstract&lt;/h3&gt;Jeff's topic, loading a hybrid Type 1 / Type 2 &lt;a href="http://en.wikipedia.org/wiki/Slowly_Changing_Dimension" target="_wiki"&gt;slowly changing dimension&lt;/a&gt; table, is related to &lt;a href="http://en.wikipedia.org/wiki/Data_warehouse" target="_wiki"&gt;data warehousing&lt;/a&gt; but may be of interest outside of that context as well. &lt;br /&gt;&lt;br /&gt;As it turns out, the particular problem described by Jeff is non-trivial, but can be solved quite elegantly in a single SQL statement. 
This may be a compelling alternative to the multi-step, multi-pass solution proposed in his article.&lt;h3&gt;Type 1 and Type 2 Slowly Changing Dimensions&lt;/h3&gt;In his article, Jeff describes a method to load a slowly changing dimension (SCD) table from an &lt;a href="http://en.wikipedia.org/wiki/Audit_trail" target="_wiki"&gt;audit trail&lt;/a&gt;. This would be quite straightforward if we were dealing with a &lt;a href="http://en.wikipedia.org/wiki/Slowly_Changing_Dimension#Type_2" target="_wiki"&gt;Type 2 slowly changing dimension&lt;/a&gt;. In that case, each row in the audit trail would also yield one row in the dimension table. If the dimension were a &lt;a href="http://en.wikipedia.org/wiki/Slowly_Changing_Dimension#Type_1" target="_wiki"&gt;Type 1 slowly changing dimension&lt;/a&gt;, the matter would only be slightly more complicated - in this case only the most recent version of each object would be loaded into the dimension table.&lt;h4&gt;A Hybrid SCD Type 1/2&lt;/h4&gt;The interesting thing about the problem described in the article is that the dimension table is a hybrid Type 1 / Type 2 dimension. 
That is, for some attributes, history needs to be tracked in the dimension table (Type 2 attributes), whereas only the most recent data is required for other attributes (Type 1 attributes).&lt;h3&gt;Sample Data&lt;/h3&gt;To make things tangible, here's a sample Employee audit trail:&lt;pre&gt;         |    SCD Type 1        |    SCD Type 2  |&lt;br /&gt;+--------+--------+-------------+--------+-------+------------+------------+&lt;br /&gt;| empkey | name   | ssn         | gender | state | valid_from | valid_to   |&lt;br /&gt;+--------+--------+-------------+--------+-------+------------+------------+&lt;br /&gt;|     14 | Jo     | 323-10-1116 |&lt;span style="background-color:rgb(255,230,230)"&gt; F      | MI    &lt;/span&gt;| 1998-12-03 | 1998-12-27 |&lt;br /&gt;|     14 | Jo     | &lt;b&gt;323-10-1119&lt;/b&gt; |&lt;span style="background-color:rgb(255,230,230)"&gt; F      | MI    &lt;/span&gt;| 1998-12-28 | 2005-04-22 |&lt;br /&gt;|     14 | &lt;b&gt;Joe&lt;/b&gt;    | 323-10-1119 |&lt;span style="background-color:rgb(255,230,230)"&gt; F      | MI    &lt;/span&gt;| 2005-04-23 | 2005-08-07 |&lt;br /&gt;|     14 | &lt;b&gt;Joseph&lt;/b&gt; | 323-10-1119 |&lt;span style="background-color:rgb(230,255,230)"&gt; &lt;b&gt;M&lt;/b&gt;      | MI    &lt;/span&gt;| 2005-08-08 | 2006-02-12 |&lt;br /&gt;|     14 | &lt;b&gt;Joe&lt;/b&gt;    | 323-10-1119 |&lt;span style="background-color:rgb(230,255,230)"&gt; M      | MI    &lt;/span&gt;| 2006-02-13 | 2006-07-04 |&lt;br /&gt;|     14 | &lt;b&gt;Joseph&lt;/b&gt; | 323-10-1119 |&lt;span style="background-color:rgb(230,230,255)"&gt; M      | &lt;b&gt;NY&lt;/b&gt;    &lt;/span&gt;| 2006-07-05 | 2006-12-24 |&lt;br /&gt;&lt;span style="font-style:italic;background-color:rgb(200,200,200)"&gt;|     14 | Joseph | 323-10-1119 |&lt;span style="background-color:orange"&gt; M      | &lt;b&gt;MI&lt;/b&gt;    &lt;/span&gt;| 2006-12-25 | NULL       |&lt;/span&gt;&lt;br /&gt;|     15 | Jim    | 224-57-2726 |&lt;span 
style="background-color:rgb(255,200,200)"&gt; M      | IL    &lt;/span&gt;| 2002-01-16 | 2004-03-15 |&lt;br /&gt;|     15 | &lt;b&gt;James&lt;/b&gt;  | 224-57-2726 |&lt;span style="background-color:rgb(255,200,200)"&gt; M      | IL    &lt;/span&gt;| 2004-03-16 | 2007-06-22 |&lt;br /&gt;&lt;span style="font-style:italic;background-color:rgb(200,200,200)"&gt;|     15 | James  | 224-57-2726 |&lt;span style="background-color:rgb(230,255,230)"&gt; M      | &lt;b&gt;IN&lt;/b&gt;    &lt;/span&gt;| 2007-06-23 | 2007-08-31 |&lt;/span&gt;&lt;br /&gt;+--------+--------+-------------+--------+-------+------------+------------+&lt;/pre&gt;The data shows the history for the employees with &lt;code&gt;empkey&lt;/code&gt; 14 and 15. All rows with the same &lt;code&gt;empkey&lt;/code&gt; value form a time line of data change events. Each row represents a change event, updating some of the employee's data. The &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt; columns are used to record when the data change event occurred, so the values in these columns change for each row. For the other columns, I used &lt;b&gt;bold&lt;/b&gt; markup to make it easier to spot the change.&lt;h4&gt;SCD Type 1 and 2 Attributes&lt;/h4&gt;In Jeff's article, the columns &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;ssn&lt;/code&gt; end up as SCD Type 1 attributes, and the columns &lt;code&gt;gender&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt; as SCD Type 2 attributes. I used color highlighting to mark up groups of consecutive rows where the values for the &lt;code&gt;gender&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt; columns did not change. After loading the dimension table we must end up with one row for each such group, capturing all change events for these columns. The &lt;code&gt;ssn&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt; must always contain the most recent data. 
I highlighted the most recent row using grey markup.&lt;h4&gt;Resulting Dimension Table&lt;/h4&gt;The data for the resulting dimension table would look like this:&lt;pre&gt;+--------+--------+--------+-------------+--------+-------+------------+------------+------------+&lt;br /&gt;| dw_key | empkey | name   | ssn         | gender | state | valid_from | valid_to   | is_current |&lt;br /&gt;+--------+--------+--------+-------------+--------+-------+------------+------------+------------+&lt;br /&gt;|      1 |     14 | Joseph | 323-10-1119 | F      | MI    | 1998-12-03 | 2005-08-07 | N          |&lt;br /&gt;|      2 |     14 | Joseph | 323-10-1119 | M      | MI    | 2005-08-08 | 2006-07-04 | N          |&lt;br /&gt;|      3 |     14 | Joseph | 323-10-1119 | M      | NY    | 2006-07-05 | 2006-12-24 | N          |&lt;br /&gt;|      4 |     14 | Joseph | 323-10-1119 | M      | MI    | 2006-12-25 | 9999-12-31 | Y          |&lt;br /&gt;|      5 |     15 | James  | 224-57-2726 | M      | IL    | 2002-01-16 | 2007-06-22 | N          |&lt;br /&gt;|      6 |     15 | James  | 224-57-2726 | M      | IN    | 2007-06-23 | 9999-12-31 | Y          |&lt;br /&gt;+--------+--------+--------+-------------+--------+-------+------------+------------+------------+&lt;/pre&gt;As you can see, only the changes in the &lt;code&gt;gender&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt; columns are recorded, ignoring any changes in the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;ssn&lt;/code&gt; columns. Instead, these columns get the values of the most recent change.&lt;h3&gt;Jeff's Solution&lt;/h3&gt;Jeff's article describes a step-wise solution to this problem. To give you an idea and some context, here's a quote from the article describing the mindset in developing this approach:&lt;blockquote&gt;&lt;br /&gt;While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. 
We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.&lt;/blockquote&gt;The actual steps that make up the described solution can be summarized as follows:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Rank all rows according to the timeline&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Join consecutive rows by rank in case there is a change in the SCD attributes&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Break the joined rows back into two rows (Jeff calls this 'semi-pivot')&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Add the last and first version for each object to the set&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Add another rank number to the result set&lt;/li&gt;&lt;br /&gt;&lt;li&gt;At this point, the rows where the new rank number is odd contain the &lt;code&gt;valid_from&lt;/code&gt; date, and rows with an even number contain the &lt;code&gt;valid_to&lt;/code&gt; date. Join each odd row to its consecutive even row.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Join to the last version of each object to fill in the SCD Type 1 attributes, and set a flag for the latest version of the object.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;I admit I have a hard time understanding Jeff's method, and I am completely puzzled as to how he invented it. My apologies if my summary isn't exactly crystal clear - I encourage you to read &lt;a href="http://www.information-management.com/infodirect/2009_106/10014831-1.html" target="_jeff"&gt;the original article, "The Problem with History"&lt;/a&gt; yourself if you are interested in the details.&lt;h3&gt;My Alternative&lt;/h3&gt;First of all, let me say that I am impressed with Jeff's creativity. 
I would never have thought of this solution and I am not really sure I even understand it. I also am not categorically opposed to Jeff's initial mindset: I too think that it is sometimes better to avoid complex SQL statements in favor of a series of relatively simple ones.&lt;br /&gt;&lt;br /&gt;That said, when I first read Jeff's article, my first impression was: "Wow...that's a lot of moving parts". My second impression was: "Darn, that's a lot of multiple passes over the same data set". All the while, I had this hunch that it wouldn't be too hard to do it in two statements (one &lt;code&gt;SELECT...INTO&lt;/code&gt;, one &lt;code&gt;UPDATE&lt;/code&gt;), or perhaps even in one (&lt;code&gt;SELECT...INTO&lt;/code&gt;).&lt;h4&gt;Preparation&lt;/h4&gt;I started by setting up the test data in MySQL 5.1. I'm including the script here for the readers' convenience:&lt;pre&gt;&lt;br /&gt;CREATE TABLE  hrsource (&lt;br /&gt;    empkey     int         NOT NULL&lt;br /&gt;,   name       varchar(10) NOT NULL&lt;br /&gt;,   ssn        char(11)    NOT NULL&lt;br /&gt;,   gender     char(1)     NOT NULL&lt;br /&gt;,   state      char(2)     NOT NULL&lt;br /&gt;,   valid_from date        NOT NULL&lt;br /&gt;,   valid_to   date       &lt;br /&gt;,   PRIMARY KEY(empkey, valid_from)&lt;br /&gt;);&lt;br /&gt;&lt;br /&gt;INSERT INTO hrsource &lt;br /&gt;(empkey,name,ssn,gender,state,valid_from,valid_to) VALUES&lt;br /&gt; (14, 'Jo'    , '323-10-1116', 'F', 'MI', '1998-12-03', '1998-12-27')&lt;br /&gt;,(14, 'Jo'    , '323-10-1119', 'F', 'MI', '1998-12-28', '2005-04-22')&lt;br /&gt;,(14, 'Joe'   , '323-10-1119', 'F', 'MI', '2005-04-23', '2005-08-07')&lt;br /&gt;,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2005-08-08', '2006-02-12')&lt;br /&gt;,(14, 'Joe'   , '323-10-1119', 'M', 'MI', '2006-02-13', '2006-07-04')&lt;br /&gt;,(14, 'Joseph', '323-10-1119', 'M', 'NY', '2006-07-05', '2006-12-24')&lt;br /&gt;,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2006-12-25', NULL)&lt;br /&gt;,(15, 'Jim'   , 
'224-57-2726', 'M', 'IL', '2002-01-16', '2004-03-15')&lt;br /&gt;,(15, 'James' , '224-57-2726', 'M', 'IL', '2004-03-16', '2007-06-22')&lt;br /&gt;,(15, 'James' , '224-57-2726', 'M', 'IN', '2007-06-23', '2007-08-31');&lt;br /&gt;&lt;/pre&gt;Of course, this is not a lot of data, but it is the data from the original article, dutifully hacked into my computer's keyboard by yours truly.&lt;h4&gt;Solution&lt;/h4&gt;Without further ado, this is my solution for selecting the dimension table's dataset:&lt;pre&gt;SELECT     prv.empkey&lt;br /&gt;,          curr.name&lt;br /&gt;,          curr.ssn&lt;br /&gt;,          prv.gender&lt;br /&gt;,          prv.state&lt;br /&gt;,          MIN(prv.valid_from)                        valid_from&lt;br /&gt;,          COALESCE(&lt;br /&gt;               nxt.valid_from - INTERVAL 1 DAY&lt;br /&gt;           ,   '9999-12-31'&lt;br /&gt;           )                                          valid_to&lt;br /&gt;,          CASE &lt;br /&gt;               WHEN nxt.valid_from IS NULL THEN 'Y'&lt;br /&gt;               ELSE 'N'&lt;br /&gt;           END                                        is_current&lt;br /&gt;FROM       hrsource                curr&lt;br /&gt;INNER JOIN (&lt;br /&gt;            SELECT   empkey&lt;br /&gt;            ,        MAX(valid_from) valid_from&lt;br /&gt;            FROM     hrsource&lt;br /&gt;            GROUP BY empkey&lt;br /&gt;           )                       curr1&lt;br /&gt;ON         curr.empkey           = curr1.empkey&lt;br /&gt;AND        curr.valid_from       = curr1.valid_from&lt;br /&gt;INNER JOIN hrsource                prv&lt;br /&gt;ON         curr.empkey           = prv.empkey&lt;br /&gt;LEFT JOIN  hrsource                nxt&lt;br /&gt;ON         prv.empkey            = nxt.empkey&lt;br /&gt;AND        prv.valid_to          &amp;lt; nxt.valid_from&lt;br /&gt;AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state)&lt;br /&gt;LEFT JOIN  hrsource                inb&lt;br /&gt;ON         
prv.empkey            = inb.empkey&lt;br /&gt;AND        prv.valid_to          &amp;lt; inb.valid_from&lt;br /&gt;AND        nxt.valid_from        &amp;gt; inb.valid_to&lt;br /&gt;AND       (prv.gender,prv.state)!= (inb.gender,inb.state)&lt;br /&gt;WHERE      inb.empkey IS NULL&lt;br /&gt;GROUP BY   prv.empkey&lt;br /&gt;,          nxt.valid_from&lt;/pre&gt;Now I would certainly not qualify this as a simple statement. At the same time, I feel this does not resemble the "enormous heaps of nested SQL" so dreaded by Jeff. Let me explain how it works, and you can judge for yourself.&lt;h3&gt;Explanation&lt;/h3&gt;I will now give a step-by-step explanation of my statement. I hope it clears up any doubt you might have as to how my solution solves the problem.&lt;h4&gt;SCD Type 1 attributes: Isolating the Most Recent Change&lt;/h4&gt;In the introduction, I mentioned that loading the dimension table would be easy if it were either a Type 1 or a Type 2 slowly changing dimension. Well, we know that, come what may, we will always need the most recent row for each &lt;code&gt;empkey&lt;/code&gt; to supply values for the Type 1 attributes &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;ssn&lt;/code&gt;. So, we start by attacking that part of the problem. &lt;br /&gt;&lt;br /&gt;The following fragment of my solution does exactly that:&lt;pre&gt;FROM       hrsource                curr&lt;br /&gt;INNER JOIN (&lt;br /&gt;            SELECT   empkey&lt;br /&gt;            ,        MAX(valid_from) valid_from&lt;br /&gt;            FROM     hrsource&lt;br /&gt;            GROUP BY empkey&lt;br /&gt;           )                       curr1&lt;br /&gt;ON         curr.empkey           = curr1.empkey&lt;br /&gt;AND        curr.valid_from       = curr1.valid_from&lt;/pre&gt;The subquery &lt;code&gt;curr1&lt;/code&gt; finds the highest value for the &lt;code&gt;valid_from&lt;/code&gt; column for each distinct &lt;code&gt;empkey&lt;/code&gt;. 
The &lt;code&gt;GROUP BY empkey&lt;/code&gt; clause ensures we get exactly one row for each distinct value of &lt;code&gt;empkey&lt;/code&gt;. From all rows with the same &lt;code&gt;empkey&lt;/code&gt; value, the &lt;code&gt;MAX(valid_from)&lt;/code&gt; bit finds the largest &lt;code&gt;valid_from&lt;/code&gt; value.&lt;br /&gt;&lt;br /&gt;The combination &lt;code&gt;empkey, valid_from&lt;/code&gt; is the primary key of the employee audit trail. This means we can now use the &lt;code&gt;empkey, MAX(valid_from)&lt;/code&gt; pair from the &lt;code&gt;curr1&lt;/code&gt; subquery to point out the most recent row for each distinct &lt;code&gt;empkey&lt;/code&gt;. The &lt;code&gt;INNER JOIN&lt;/code&gt; with the &lt;code&gt;hrsource&lt;/code&gt; table (dubbed &lt;code&gt;curr&lt;/code&gt;) does exactly that.&lt;h3&gt;Grouping consecutive rows with identical Type 2 attributes&lt;/h3&gt;We must now deal with the second part of the problem, the SCD Type 2 attributes. &lt;br /&gt;&lt;br /&gt;If you look back at the color highlighting in the sample data set, you may realize this is basically a grouping problem: for each distinct &lt;code&gt;empkey&lt;/code&gt; we need to group rows with identical values in the Type 2 columns &lt;code&gt;gender&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt;. From the point of view of the dimension table, no change occurred within this group, so we should store it as one row in the dimension table, and reconstruct the change dates by taking the minimum &lt;code&gt;valid_from&lt;/code&gt; and maximum &lt;code&gt;valid_to&lt;/code&gt; values from the group. &lt;br /&gt;&lt;br /&gt;We must be careful though: we can't simply make groups of all distinct combinations of the SCD Type 2 columns. Look for example at the rows having &lt;code&gt;14&lt;/code&gt; for the &lt;code&gt;empkey&lt;/code&gt; column. 
The rows with the &lt;code&gt;valid_from&lt;/code&gt; values &lt;code&gt;2005-08-08&lt;/code&gt;, &lt;code&gt;2006-02-13&lt;/code&gt; and &lt;code&gt;2006-12-25&lt;/code&gt; all have the identical combination (&lt;code&gt;M&lt;/code&gt;, &lt;code&gt;MI&lt;/code&gt;) in the Type 2 columns &lt;code&gt;gender&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt; respectively. However, only the first two of these belong together in a group. The row with &lt;code&gt;2006-12-25&lt;/code&gt; in the &lt;code&gt;valid_from&lt;/code&gt; column does not belong to the group because another change event occurred on &lt;code&gt;2006-07-05&lt;/code&gt;, which lies between &lt;code&gt;2006-02-13&lt;/code&gt; and &lt;code&gt;2006-12-25&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;So, we must sharpen up the definition of the problem. It is not enough to make groups of identical combinations of Type 2 attributes: we must also demand that the rows in the group are &lt;em&gt;consecutive&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;The first step in attacking this problem is to generate combinations of the audit trail rows so that we have the first row of each group along with the first row of the next group. 
The following fragment solves part of that problem by combining each row (alias: &lt;code&gt;prv&lt;/code&gt;) with all more recent rows (alias: &lt;code&gt;nxt&lt;/code&gt;) that have the same &lt;code&gt;empkey&lt;/code&gt; value but different values in the Type 2 columns &lt;code&gt;gender&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt;:&lt;pre&gt;&lt;br /&gt;&lt;span style="text-decoration: line-through"&gt;INNER JOIN&lt;/span&gt; hrsource                prv&lt;br /&gt;&lt;span style="text-decoration: line-through"&gt;ON         curr.empkey           = prv.empkey&lt;/span&gt;&lt;br /&gt;LEFT JOIN  hrsource                nxt&lt;br /&gt;ON         prv.empkey            = nxt.empkey             -- timeline of same employee&lt;br /&gt;AND        prv.valid_to          &amp;lt; nxt.valid_from         -- nxt must be more recent than prv&lt;br /&gt;AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state) -- type 2 attributes differ&lt;/pre&gt;(Please ignore the part I struck out - it's not relevant at this point)&lt;br /&gt;&lt;br /&gt;Note the usage of the &lt;code&gt;LEFT JOIN&lt;/code&gt;. It ensures that if &lt;code&gt;prv&lt;/code&gt; is the most recent row, and there is by definition no row in &lt;code&gt;nxt&lt;/code&gt; that is more recent, the &lt;code&gt;prv&lt;/code&gt; row is still retained. Had we used an &lt;code&gt;INNER JOIN&lt;/code&gt;, we would have lost that row, messing up the result.&lt;br /&gt;&lt;br /&gt;If we run this fragment in isolation, we can certainly see rows that pair the first row of a group with the first row of the next group. 
For example, if you run this query:&lt;pre&gt;SELECT    prv.empkey&lt;br /&gt;,         prv.valid_from prv_from, prv.valid_to prv_to&lt;br /&gt;,         nxt.valid_from nxt_from, nxt.valid_to nxt_to&lt;br /&gt;,         prv.gender prv_gender&lt;br /&gt;,         nxt.gender nxt_gender&lt;br /&gt;FROM      hrsource prv&lt;br /&gt;LEFT JOIN hrsource nxt&lt;br /&gt;ON        prv.empkey   = nxt.empkey&lt;br /&gt;AND       prv.valid_to &amp;lt; nxt.valid_from&lt;br /&gt;AND      (prv.gender, prv.state) != (nxt.gender, nxt.state)&lt;/pre&gt;We get results like this:&lt;pre&gt;&lt;br /&gt;+--------+------------+------------+------------+------------+------------+------------+&lt;br /&gt;| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |&lt;br /&gt;+--------+------------+------------+------------+------------+------------+------------+&lt;br /&gt;&lt;span style="background-color:rgb(200,255,200)"&gt;|     14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F          | M          |&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color:rgb(255,200,200)"&gt;|     14 | 1998-12-03 | 1998-12-27 | 2006-02-13 | 2006-07-04 | F          | M          |&lt;br /&gt;|     14 | 1998-12-03 | 1998-12-27 | 2006-07-05 | 2006-12-24 | F          | M          |&lt;/span&gt;&lt;br /&gt;.        .            .            .            .            .            .            .&lt;br /&gt;.        .            ...more rows...           .            .            .            .&lt;br /&gt;.        .            .            .            .            .            .            .&lt;br /&gt;+--------+------------+------------+------------+------------+------------+------------+&lt;br /&gt;&lt;/pre&gt;The first row highlighted in green is a desired combination because the &lt;code&gt;valid_from&lt;/code&gt; date of &lt;code&gt;nxt&lt;/code&gt; is closest to that of &lt;code&gt;prv&lt;/code&gt;. 
It is this row from &lt;code&gt;nxt&lt;/code&gt; that marks the end of the group of rows starting with &lt;code&gt;prv&lt;/code&gt;. For the rows highlighted in red, the &lt;code&gt;nxt&lt;/code&gt; part indicates a change that occurred beyond that point, so these combinations are not desired.&lt;br /&gt;&lt;br /&gt;To get rid of these undesired rows, we simply have to demand that no row indicating a change in the SCD Type 2 attributes occurs between &lt;code&gt;prv&lt;/code&gt; and &lt;code&gt;nxt&lt;/code&gt;. In my query, the following fragment is responsible for that part:&lt;pre&gt;LEFT JOIN  hrsource                inb&lt;br /&gt;ON         prv.empkey            = inb.empkey             -- timeline of same employee&lt;br /&gt;AND        prv.valid_to          &amp;lt; inb.valid_from         -- more recent than prv&lt;br /&gt;AND       (prv.gender,prv.state)!= (inb.gender,inb.state) -- Type 2 columns changed&lt;br /&gt;AND        nxt.valid_from        &amp;gt; inb.valid_to           -- less recent than nxt&lt;br /&gt;WHERE      inb.empkey IS NULL                             -- does not exist&lt;/pre&gt;This fragment essentially states that there must not be any row in between &lt;code&gt;prv&lt;/code&gt; and &lt;code&gt;nxt&lt;/code&gt; that indicates a change in the SCD Type 2 attributes as compared to &lt;code&gt;prv&lt;/code&gt;. Note that the first part of the join condition is identical to the one we used to join &lt;code&gt;nxt&lt;/code&gt; to &lt;code&gt;prv&lt;/code&gt;. The extra condition is that &lt;code&gt;inb&lt;/code&gt; must lie before &lt;code&gt;nxt&lt;/code&gt;. Because we used a &lt;code&gt;LEFT JOIN&lt;/code&gt;, the result row is retained in case no such &lt;code&gt;inb&lt;/code&gt; row exists. The &lt;code&gt;WHERE inb.empkey IS NULL&lt;/code&gt; condition explicitly filters for these cases. 
By definition, this means that the &lt;code&gt;nxt&lt;/code&gt; part in the result row marks the end of whatever group the &lt;code&gt;prv&lt;/code&gt; belongs to.&lt;h4&gt;Rolling Up the Rows in their Groups&lt;/h4&gt;Now, if we put these parts together we get something like this:&lt;pre&gt;&lt;br /&gt;SELECT    prv.empkey&lt;br /&gt;,         prv.valid_from prv_from, prv.valid_to prv_to&lt;br /&gt;,         nxt.valid_from nxt_from, nxt.valid_to nxt_to&lt;br /&gt;,         prv.gender prv_gender&lt;br /&gt;,         nxt.gender nxt_gender&lt;br /&gt;FROM      hrsource                  prv&lt;br /&gt;LEFT JOIN hrsource                  nxt&lt;br /&gt;ON        prv.empkey              = nxt.empkey&lt;br /&gt;AND       prv.valid_to            &amp;lt; nxt.valid_from&lt;br /&gt;AND      (prv.gender, prv.state) != (nxt.gender, nxt.state)&lt;br /&gt;LEFT JOIN  hrsource                 inb&lt;br /&gt;ON         prv.empkey             = inb.empkey             &lt;br /&gt;AND        prv.valid_to           &amp;lt; inb.valid_from         &lt;br /&gt;AND       (prv.gender,prv.state) != (inb.gender,inb.state) &lt;br /&gt;AND        nxt.valid_from         &amp;gt; inb.valid_to           &lt;br /&gt;WHERE      inb.empkey IS NULL&lt;/pre&gt;Some of the results are here:&lt;pre&gt;+--------+------------+------------+------------+------------+------------+------------+&lt;br /&gt;| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |&lt;br /&gt;+--------+------------+------------+------------+------------+------------+------------+&lt;br /&gt;|     14 | 1998-12-03 | 1998-12-27 | &lt;b&gt;2005-08-08&lt;/b&gt; | 2006-02-12 | F          | M          |&lt;br /&gt;|     14 | 1998-12-28 | 2005-04-22 | &lt;b&gt;2005-08-08&lt;/b&gt; | 2006-02-12 | F          | M          |&lt;br /&gt;|     14 | 2005-04-23 | 2005-08-07 | &lt;b&gt;2005-08-08&lt;/b&gt; | 2006-02-12 | F          | M          |&lt;br /&gt;.        .            .            .            .         
   .            .            .&lt;br /&gt;.        .            ...more rows...           .            .            .            .&lt;br /&gt;.        .            .            .            .            .            .            .&lt;br /&gt;+--------+------------+------------+------------+------------+------------+------------+&lt;/pre&gt;Now, we still see the individual rows. However, because we have paired them with the row that definitely marks the end of the group it belongs to, we can now lump them together using &lt;code&gt;GROUP BY&lt;/code&gt;. This explains the final bits of the query:&lt;pre&gt;GROUP BY   prv.empkey&lt;br /&gt;,          nxt.valid_from&lt;br /&gt;&lt;/pre&gt;For each value in &lt;code&gt;empkey&lt;/code&gt;, this &lt;code&gt;GROUP BY&lt;/code&gt; clause essentially rolls up the group of rows that showed no change in the SCD Type 2 attributes until the occurrence of the &lt;code&gt;nxt&lt;/code&gt; row. In the &lt;code&gt;SELECT&lt;/code&gt; list, we can now use the &lt;code&gt;MIN&lt;/code&gt; function to find the date of the first row in the group.&lt;br /&gt;&lt;br /&gt;Note that the &lt;code&gt;GROUP BY&lt;/code&gt; list is the thinnest one possible. The &lt;code&gt;SELECT&lt;/code&gt; list contains many more columns that do not appear in the &lt;code&gt;GROUP BY&lt;/code&gt; list. However, it is perfectly safe to do so in this case. For more details on this matter, please &lt;a href="http://rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html" target="_rpb"&gt;
