Saturday, December 18, 2010

Substituting variables in Kettle Parameter values

Kettle (a.k.a. Pentaho Data Integration) jobs and transformations offers support for named parameters (as of version 3.2.0). Named parameters form a special class of ordinary kettle variables and are intended to clearly and explicitly define for which variables the caller should supply a value.

One of my pet projects, the pentaho auto-documentation solution kettle-cookbook, uses two named parameters called INPUT_DIR and OUTPUT_DIR. These allow you to specify the directory that contains the BI content that is to be documented (such as kettle transformation and job files, action sequence files and mondrian schema files), and the directory to store the generated documentation.

Several kettle-cookbook users ran into problems attempting to use variable references in the values they supplied for the INPUT_DIR and OUTPUT_DIR variables. In this case, the variables referenced in the supplied parameter values would be set by adding entries in kettle.properties file. I just committed revision 64 of kettle-cookbook which should fix this problem. In this article I briefly discuss the solution, as I think it may be useful to other kettle users.

Substituting Kettle Variable References


Kettle doesn't automatically substitute variable references in parameter values (nor in ordinary variable values). So, if you need to support variable references inside parameter values, you have to substitute the variables yourself.

Variable substitution in Kettle 4.01 and up


As of Kettle version 4.01, the Calculator step supports a calculation type called "variable substitution in string A" that is intended exactly for that purpose. I have tested this but unfortunately in 4.01 it doesn't seem to work, at least not for the built-in variable ${Internal.Transformation.Filename.Directory} which I used in my test. In the latest stable version, Kettle 4.10 it does work as advertised, I would recommend using this method if you're a user of Kettle 4.10 (or later).

Variable substitution in earlier Kettle versions


I have committed myself to making kettle-cookbook work on kettle 3.2.0, as my sources tell me that this is still an often-used version in many production environments. I'm even prepared to make kettle-cookbook work on Kettle versions earlier than 3.20, should there be sufficient demand for that. Anyway, the bottom line is, these versions do not support the "variable substitution in string A" calculation in the Calculator step, so you have to resort to a little trick.

A Kettle 3.2.0 transformation to substitute variables in parameters


For kettle-cookbook, I added a single transformation called substitute-variables-in-parameters.ktr as the first transformation of the main job.



The substitute-variables-in-parameters.ktr transformation uses a "Get Variables" step to read the values of the INPUT_DIR and OUTPUT_DIR parameters. The values are then processed by a javascript function which substitutes all variable references with their values. Finally, a "Set Variables" step overwrites the original value of the variables with their replaced value.

The code for the JavaScript step is shown below:

function replace_variables(text){
var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,
match, from = 0,
variable_name, variable_value,
replaced_text = ""
;

while ((match = re.exec(text)) !== null) {
variable_name = match[1] ? match[1] : match[2];
variable_value = getVariable(variable_name, "");
replaced_text += text.substring(from, match.index);
replaced_text += variable_value;
from = match.index + match[0].length;
}
replaced_text += text.substring(from, text.length);
return replaced_text;
}

var replaced_input_dir = replace_variable(input_dir);
var replaced_output_dir = replace_variable(output_dir);

The script first defines function replace_variables(text) which accepts the parameter value, and returns the substituted value. Then it calls the function, applying it to the input_dir and output_dir fields from the incoming stream. These fields originate in the preceding "Get variables" step which assigns them the value of the INPUT_DIR and OUTPUT_DIR variables. The output of the replace_variables() function is assigned to the replaced_input_dir and replaced_output_dir javascript variables, which leave the JavaScript step as fields of the outgoing stream. In the final "Set variables" step, the replaced_input_dir and replaced_output_dir fields are used to overwrite the original value of the INPUT_DIR and OUTPUT_DIR values.

The replace_variables() function


Let's take a closer look at the replace_variables() function.

The heart of the function is formed by a while loop that executes a javascript regular expression called re that matches variable references.

The regular expression itself is defined in the top of the function:

var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,

It's intention is to recognize variable references of the form ${NAME} and %%NAME%%. The part of the pattern for the name is enclosed in parenthesis to form a capturing group. As we shall see later on, this allows us to extract the actual name of the referenced variable. The trailing g indicates that the pattern should be matched anywhere in the string. This is necessary because we want to replace all variable references in the input text, not just the first one.

The regular expression object is used to drive the while loop by calling its exec() method. In case of a match, the exec() returns an array that describes the text matched by the regular expression. If there's no match, exec() returns null

while ((match = re.exec(text)) !== null) {
...
}

If there is a match, we first extract the variable name:

variable_name = match[1] ? match[1] : match[2];
The first entry of the match array (at index 0) is the text that was matched by the pattern as a whole. The array contains subsequent elements for each capturing group in the regular expression. Because our regular expression re has 2 capturing groups, the match array contains two more elements. If the variable is of the form ${NAME}, the element at index=1 contains the variable name. If it's of the form %%NAME%%, it will be contained in the element at index=2.

Once we have the variable name, we can use the getVariable() javascript function to obtain its value:

variable_value = getVariable(variable_name, "");

The getVariable() is not a standard javascript function, but supplied by the kettle javascript step.

To perform the actual substitution, we take the substring of the original text up to the location where the variable reference was matched. This location is conveniently supplied by the match array:

replaced_text += text.substring(from, match.index);
Right after that location, we need to put the variable value instead of its name:

replaced_text += variable_value;
The last action in the loop is to remember to location right behind the last replaced variable reference, so we can pick up at the right location in the original value the next time we match a variable:

from = match.index + match[0].length;

Right after the loop, we need to copy the final piece of original text occurring right behind the last variable reference to yield the complete replaced text:

replaced_text += text.substring(from, text.length);

Odds and Ends


While the substitute-variables-in-parameters.ktr transformation works great for its intended purpose, substituting variables in the known parameters INPUT_DIR and OUTPUT_DIR, it is not really applicable beyond kettle cookbook. What you'd really want to have is a job that replaces variables in all parameters, not just those that are known in advance.

As it turns out this is actually almost trivial to achieve, however to solution is a bit too long-winded for this post. If anyone is interested in such a solution, please post a comment and let me know, and I'd be happy to provide it.

UPDATE: A solution that substitutes all variable references occurring in the parameter values of the containing job is now available at the kettle exchange area in the Pentaho wiki.

Thursday, December 09, 2010

Parameterizing SQL statements in the Kettle Table Input step: Variables vs Parameters

I get this question on a regular basis, so I figured I might as well blog it, in the hope it will be useful for others. Here goes:

Let's say that I want to delete all records that match an id from a set of tables. The table names come in as rows into the Execute SQL Script step (check execute for every row). Next I write:
DELETE FROM {table_name} WHERE id = {identifier}

as the SQL to execute. In the parameters grid at the bottom right, I have two fields: table_name and identifier. What is the syntax for substituting the table_name and identifier parameters in the sql script?


(Although this particular question focuses on the "Execute SQL Script" step, it also applies to the "Table Input" step, and probably a few more steps I can't recall right now.)

The parameters grid can be used for prepared statement value placeholders. In the SQL statement these placeholders are denoted as questionmarks (?). These are positional parameters: they get their value from those fields in the incoming stream that are entered in the parameters grid, in order.
Here's an example of the correct usage of these placeholders:
DELETE FROM myTable WHERE id = ?

Here, the ? in the WHERE clause will be bound to the value of the first field from the incoming stream entered in the parameters grid. Because there is only one such placeholder, there can be only one field in the parameters grid.

An important thing to realize is that these parameters can only be used to parameterize value expressions. So, this kind of parameter does not work for identifiers, nor do they work for structural elements of the SQL statement, such as keywords. So this kind of parameter cannot be used to parameterize the table name which seems to be the intention in the original example posed in the question.

There is a way to parameterize the structural elements of the SQL statement as well as the parameters. You can apply variable substitution to the SQL statetment.

Kettle Variables can be defined by a Set Variables step, or by specifying parameters at the transformation level. They get their value from "the environment": for example, parameters get their value initially when the transformation is started, and regular variables are typically set somewhere in the job that is calling your transformation.

In text fields, including the SQL textarea of the Table input step or the Execute SQL Script step, you denote those variables with this syntax: ${VARIABLE_NAME}. So to parameterize the table name we could use something like this:
DELETE FROM ${TABLE_NAME}


In order to force kettle to apply variable substitution to the SQL statement, you have to check the "variable substitution" checkbox. If this checkbox is checked, then all variables are simply substituted with their (string)value during transformation initialization. This is a lot like the way macro's are substituted by the pre-processor in C/C++ code.

When comparing variables with parameters, two important things should be mentioned here:

  • Unlike value placeholders, variables can be used to manipulate any aspect of the SQL statement, not just value expressions. The variable value will simply become the text that makes up the SQL statement, it is your responsibility it results in a syntactically valid and correct SQL statement.

  • Variables are evaluated once during transformation initalization. So if you want to vary the variable value, you'll have to call the transformation again for the change to take effect. For the same reasons, you cannot set the value of a variable and read it within the same transformation: setting the variable value occurs at runtime, but evaluating it occurs at initialization time.



Finally, here's a screenshot that summarizes these different ways to parameterize SQL statements in kettle:

If you want to read more about this topic, it's covered in both our books Pentaho Solutions and Pentaho Kettle Solutions. Another title you might be interested in is Maria Roldan's Pentaho 3.2 Data Integration: Beginner's Guide.

Wednesday, August 11, 2010

Back to blogging....

It has been a while since I posted on my blog - in fact, I believe this is the first time ever that more than one month passed between posts since I started blogging. There are a couple of reasons for the lag:

  • Matt Casters, Jos van Dongen and me have spent a lot of time finalizing our forthcoming book, Pentaho Kettle Solutions (Wiley, ISBN: 978-0-470-63517-9). The book is currently being produced, and should be available according to schedule in early September 2010. If you're interested, you might like to read one of my earlier posts that explains the organization and outline of the book.

    (I should point out that we have reorganized the outline as the project progressed, so the final result will not have all the chapters mentioned in that post. We do however cover most of the topics mentioned.)

  • I have been checking out Quipu, a promising Open Source data warehouse management solution. Quipu provides a repository-based extensible code-generator that allows you to generate and maintain a data warehouse based on the Data Vault model. One of the things I want to do in the short term is write templates that allows it to work for MySQL, and after that, I want to see if I can get Quipu to generate Kettle Jobs and transformations.

  • I have been working on a couple of software projects. Two of them are currently available as open source on google code:

    mql-to-sql
    This allows you to use the Metaweb Query Language (MQL) to query a RDBMS. If you're wondering what this is all about: MQL is the query language used by Freebase (which is a collaborative "database-of-everything" or "wikipedia-gone-database").

    While Freebase is interesting in its own right, I am particularly enthused about the MQL query language. I feel that MQL is an exceptionally good solution for flexible, expressive and secure data access for modern (AJAX) Web applications. Even though MQL was not developed with relational database systems in mind, I think it is a pretty good fit.

    Anyway, this is very much a work in progress, and I appreciate your feedback on it. If you're interested, you can read a bit more about my take on RDBMS data access for web applications and MQL on the mql-to-sql project home page. I have also put up an online demo that allows you to query the sakila MySQL sample database using MQL.

    kettle-cookbook
    This is a project that provides auto-documentation for Kettle (a.k.a. Pentaho Data Integration).

    The project consists of a bunch of Kettle Jobs and transformations (as well as some XSLT stylesheets) that extract data from Kettle Jobs and transformation and transform it into a collection of human-readable HTML documents along with a table of contents. The resulting documentation looks and feels a bit like JavaDoc documentation.

    If all goes well, I will be presenting kettle-cookbook in a Pentaho web seminar, which is currently scheduled for September 15, 2010


  • I've been enjoying 4 weeks of vacation (I started working this week). There's not much to tell about that, other than that it was great spending a lot of time with my family. I plan to do this more often, and now that Kettle Solutions is finished I should be able to find more time to do it.

  • I've been looking at two emerging HTML5 APIs for client-side structured storage, the Web SQL Database API and the Indexed DB API.

    I have developed a few thoughts about the ongoing debate (see this and article for some background) about which one is better and I see a role for something like MQL here too. I will probably write something about this in the next few weeks

  • Yesterday, I got wind of the JS1k contest! Basically, the challenge is to write an interesting standalone JavaScript demo program that must be no larger than 1024 bytes. It is amazing and inspiring to see what people manage to do with a modern browser in 1k of JavaScript code.

    I decided to try it myself, and you can find my submission here: An interactive SQL query tool for the Web SQL DB API.

    Essentially, you get a textarea where you can enter arbitrary SQLite queries, and button to execute the SQL, and a result area that will print a feedback message and a result table (if applicable). As a bonus, there's a button to get a listing of the available database objects (using a query on sqlite_master) and an explain button to show the query plan of the current SQL statement.

    The demo works on recent versions of Google Chrome, Apple Safari and Opera. It can run offline, and does not require any plugins. I should say that I expect my submission will be rejected by the judges since the demo is not functional on Mozilla Firefox, which is a requirement. (That is, the script will detect that the Web SQL Database API is not supported and print a message to that effect). However, it was still fun to try my hand at it.


Ok - that's it for now. I will try and post more regularly and write about these and other things in the near future. Don't hesitate to leave a comment if you have any questions or suggestions.

Wednesday, May 26, 2010

A small issue of SQL standards

From a functional perspective, the core SQL support in all major and minor RDBMS-es is reasonably similar. In this light, it's sometimes quite disturbing to find how some very basic things work so differently across different products. Consider this simple statement:
SELECT  'a' /* this is a comment */ 'b'
FROM onerow
What should the result be? (You can assume that onerow is an existing table that contains one row)

It turns out popular RDBMS-es mostly disagree with one another.

In Oracle XE, we get this:
SELECT  'a' /* comment */ 'b'
*
ERROR at line 1:
ORA-00923: FROM keyword not found where expected

PostgreSQL 8.4 also treats it as a syntax error, and thus seems compatible with Oracle's behavior:
ERROR:  syntax error at or near "'b'"
LINE 1: SELECT 'a' /* this is a comment */ 'b'

In Microsoft SQL Server 2008 we get:
b
-
a

(1 rows affected)
As you can see, MS SQL treats the query as SELECT 'a' AS b FROM onerow.

Finally, in MySQL 5.1, we get:
+----+
| a |
+----+
| ab |
+----+
1 row in set (0.00 sec)
So in MySQL, its as if the comment isn't there at all, and as if the string literals 'a' and 'b' are actually just one string literal 'ab'.

So what does the SQL standard say? In my copy of the 2003 edition, I find this (ISO/IEC 9075-2:2003 (E) 5.3 <literal>, page 145):
Syntax Rules
1) In a <character string literal> or <national character string literal>, the sequence:
<quote> <character representation>... <quote> <separator> <quote> <character representation>... <quote>
is equivalent to the sequence
<quote> <character representation>... <character representation>... <quote>
If we lookup the definition of <separator>, it reads:
<separator> ::= { <comment> | <white space> }...
So in this case, MySQL does the "right" thing, and basically ignores the comment, treating 'a' and 'b' as a single string constant 'ab'.
UPDATE 1: As always, the devil is in the details. And trust me, the SQL standard has many of them (details that is - I'll leave it up to the reader to decide for the devils, although I have a suspicion in a particular direction). Read on, and make sure to read Nick's comment on this post - it turns out PostgreSQL seems to behave exactly according to the standard in this case.

UPDATE 2: Serg also posted a comment citing yet another part of the standard that states that all comments implicitly count as a newline. This would mean that there doesn't have to be a literal newline character in or following the comment. In this case, my original remark that MySQL got it right would hold again.

I should state that I think very highly of both Nick and Serg, and as far as I am concerned, they're both right. I can't help but seeing this as yet more support for my statement that the SQL standard is so complex it is almost or perhaps completely impossible to get it right.

Do you find this too bold? If so, I'd really love to hear your thoughts on it. Please help us solve this paradox, I only want to understand what the standard really says.
If you try the same thing with a single line comment, all products mentioned react the same as with the initial query, except for PostgreSQL, which now treats the query according to the standard.

Now don't get me wrong. This post is not designed to bash or glorify any of the products mentioned. I think all of them are great in their own way. I am quite aware that although MySQL happens to adhere to the standard here, it violates it in other places. Finally, I should point out that I don't have a specific opinion on what the right behavior should be. I just want it to be the same on all platforms.

At the same time, I realize that for SQL it's probably too late - up to an extent, incompatibility is considered normal, and database professionals tend to be specialized in particular products anyway. So I'm not holding my breath for the grand unification of SQL dialects.

When I encountered this issue, I did have to think about that other rathole of incompatibilities I have to deal with professionally, which is web-browsers. An interesting development there is the HTML 5 specification, which actually defines an algorithm for parsing HTML - even invalid HTML. This is quite different from the approach taken by most standards, which typically define only an abstract grammar, but leave the implementation entirely up to the vendors. In theory, providing parsing instructions as detailed as done in HTML 5 should make it easier to create correct parsers, and hopefully this will contribute to a more robust web.

Anyway. That was all. Back to work...

UPDATE: I just heard that Sybase (unsurprisingly) behaves similar to MS SQL for this query (that is, query is valid, and returns 'a' in a column called b). I checked SQLite myself, which is also in that camp.

Nick also pointed out that LucidDB also provides a standard compliant implementation, in other words, it behaves exactly like PostgreSQL for this particular query. However, Julian, who was and is closely involved in LucidDB agrees with Serg that the comment should probably count as a newline, and filed a bug for LucidDB.

I checked Firebird 2.1.3, and they are in the Oracle camp: in both cases, the query gives a syntax error.

Tuesday, May 25, 2010

MySQL, Oracle and NoSQL: In the grand scheme...

...NoSQL is just larger than a fly's dropping, and MySQL and Oracle are more alike than either of their respective fanboys would like to admit.

Courtesy of Google trends:



I guess I won't be changing my career just yet.


UPDATE: I tried a few terms for "Microsoft SQL Server" before posting (SQL Server, MS SQL) but found none that came up with what I felt like was a realistic volume (they are all much, much lower than I expected). @MarkGStacey suggested trying "SQL 2008", "SQL 2005" and "SQL 2000", and those return much better results indeed (though still much lower than MySQL or Oracle). Anyway - I'd love to have some way of bunching up all those terms and have Google Trends show them as one trend, but I haven't figured out a way to do that. If you know how, please drop a line at let me know.

I'll adjust the blog if I find a more satisfactory solution.

Tuesday, May 04, 2010

MySQL: The maximum value of an integer

Did you ever have the need to find the maximum value of an integer in MySQL? Yeah, me neither. Anyway, some people seem to need this, and this is what I came up with:

SELECT ~0 as max_bigint_unsigned
, ~0 >> 32 AS max_int_unsigned
, ~0 >> 40 AS max_mediumint_unsigned
, ~0 >> 48 AS max_smallint_unsigned
, ~0 >> 56 AS max_tinyint_unsigned
, ~0 >> 1 AS max_bigint_signed
, ~0 >> 33 AS max_int_signed
, ~0 >> 41 AS max_mediumint_signed
, ~0 >> 49 AS max_smallint_signed
, ~0 >> 57 AS max_tinyint_signed
\G

*************************** 1. row ***************************
max_bigint_unsigned: 18446744073709551615
max_int_unsigned: 4294967295
max_mediumint_unsigned: 16777215
max_smallint_unsigned: 65535
max_tinyint_unsigned: 255
max_bigint_signed: 9223372036854775807
max_int_signed: 2147483647
max_mediumint_signed: 8388607
max_smallint_signed: 32767
max_tinyint_signed: 127
1 row in set (0.00 sec)

In case you're wondering how it works, read up on what the tilde (~) does: it peforms a bitwise negation. In other words, if flips bits that are 1 to 0, and vice versa. So, ~0 means, set all the bits to 1, because in the integer one 0, all the bits are a binary 0. Now, in MySQL, at runtime, there is only one integer type, which is an 8-byte integer value or a bigint. Therefore, ~0 is by definition the largest possible integer value.

MySQL defines a family of integer types for storage: bigint (8 bytes), int (4 bytes), mediumint (3 bytes), smallint (2 bytes) and tinyint (1 byte). To find the maximum values of those types, we can use the right-bitshift operator >> to push the most-significant bits at the left side of ~0 down to the right, for the appropriate number of bytes to get the maximum values of the other integer flavors. So,

int type: big int medium small tiny
bit #: 64 56 48 40 32 24 16 8 1
~0 = 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 = 18446744073709551615
~0 >> 32 = 00000000 00000000 00000000 00000000 11111111 11111111 11111111 11111111 = 4294967295
~0 >> 40 = 00000000 00000000 00000000 00000000 00000000 11111111 11111111 11111111 = 16777215
~0 >> 48 = 00000000 00000000 00000000 00000000 00000000 00000000 11111111 11111111 = 65535
~0 >> 56 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 11111111 = 255

Now, for each of the integer flavors, MySQL lets you define them to be either signed or unsigned. This is implemented using a so-called sign bit. The sign bit is the most significant bit (so, bit #64 in a bigint, bit #32 in an int, and so on and so forth). If the sign bit equals 0, the integer is positive and if it equals 1, the integer is negative. So, to get the maximum values for the signed integer flavors, we can use the same recipe, we just need to push the bits even one more bit to the right, like so:

int type: big int medium small tiny
bit #: 64 56 48 40 32 24 16 8 1
~0 >> 1 = 01111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 = 9223372036854775807
~0 >> 33 = 00000000 00000000 00000000 00000000 01111111 11111111 11111111 11111111 = 2147483647
~0 >> 41 = 00000000 00000000 00000000 00000000 00000000 01111111 11111111 11111111 = 8388607
~0 >> 49 = 00000000 00000000 00000000 00000000 00000000 00000000 01111111 11111111 = 32767
~0 >> 57 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 01111111 = 127

Tuesday, April 20, 2010

Restoring XML-formatted MySQL dumps

EAVB_VFZUHIARHI To whom it may concern -

The mysqldump program can be used to make logical database backups. Although the vast majority of people use it to create SQL dumps, it is possible to dump both schema structure and data in XML format. There are a few bugs (#52792, #52793) in this feature, but these are not the topic of this post.

XML output from mysqldump

Dumping in XML format is done with the --xml or -X option. In addition, you should use the --hex-blob option otherwise the BLOB data will be dumped as raw binary data, which usually results in characters that are not valid, either according to the XML spec or according to the UTF-8 encoding. (Arguably, this is also a bug. I haven't filed it though.)

For example, a line like:

mysqldump -uroot -pmysql -X --hex-blob --databases sakila
dumps the sakila database to the following XML format:

<?xml version="1.0"?>
<mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<database name="sakila">
<table_structure name="actor">
<field Field="actor_id" Type="smallint(5) unsigned" Null="NO" Key="PRI" Extra="auto_increment" />
<field Field="first_name" Type="varchar(45)" Null="NO" Key="" Extra="" />
<field Field="last_name" Type="varchar(45)" Null="NO" Key="MUL" Extra="" />
<field Field="last_update" Type="timestamp" Null="NO" Key="" Default="CURRENT_TIMESTAMP" Extra="on update CURRENT_TIMESTAMP" />
<key Table="actor" Non_unique="0" Key_name="PRIMARY" Seq_in_index="1" Column_name="actor_id" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
<key Table="actor" Non_unique="1" Key_name="idx_actor_last_name" Seq_in_index="1" Column_name="last_name" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
<options Name="actor" Engine="InnoDB" Version="10" Row_format="Compact" Rows="200" Avg_row_length="81" Data_length="16384" Max_data_length="0" Index_length="16384" Data_free="233832448" Auto_increment="201" Create_time="2009-10-10 10:04:56" Collation="utf8_general_ci" Create_options="" Comment="" />
</table_structure>
<table_data name="actor">
<row>
<field name="actor_id">1</field>
<field name="first_name">PENELOPE</field>
<field name="last_name">GUINESS</field>
<field name="last_update">2006-02-15 03:34:33</field>
</row>

...many more rows and table structures...

</database>
</mysqldump>
I don't want to spend too much time discussing why it would be useful to make backups in this way. There are definitely a few drawbacks - for example, for sakila, the plain SQL dump, even with --hex-blob is 3.26 MB (3.429.358 bytes), whereas the XML output is 13.7 MB (14,415,665 bytes). Even after zip compression, the XML formatted dump is still one third larger than the plain SQL dump: 936 kB versus 695 kB.

Restoring XML output from mysqldump

A more serious problem is that MySQL doesn't seem to offer any tool to restore XML formatted dumps. The LOAD XML feature, kindly contributed by Erik Wetterberg could be used to some extent for this purpose. However, this feature is not yet available (it will be available in the upcoming version MySQL 5.5), and from what I can tell, it can only load data - not restore tables or databases. I also believe that this feature does not (yet) provide any way to properly restore hex-dumped BLOB data, but I really should test it to know for sure.

Anyway.

In between sessions of the past MySQL users conference I cobbled up an XSLT stylesheet that can convert mysqldump's XML output back to SQL script output. It is available under the LGPL license, and it is hosted on google code as the mysqldump-x-restore project. To get started, you need to download the mysqldump-xml-to-sql.xslt XSLT stylesheet. You also need a command line XSLT processor, like xsltproc. This utility is part of the Gnome libxslt project, and is included in packages for most linux distributions. There is a windows port available for which you can download the binaries.

Assuming that xsltproc is in your path, and the XML dump and the mysqldump-xml-to-sql.xslt are in the current working directory, you can use this command to convert the XML dump to SQL:

xsltproc mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
On Unix-based systems you should be able to directly pipline the SQL into mysql using

mysql -uroot -pmysql < `xsltproc mysqldump-xml-to-sql.xslt sakila.xml`
The stylesheet comes with a number of options, which can be set through xsltproc's --stringparam option. For example, setting the schema parameter to N will result in an SQL script that only contains DML statements:

xsltproc --stringparam schema N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
Setting the data option to N will result in an SQL script that only contains DDL statements:

xsltproc --stringparam data N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
. There are additional options to control how often a COMMIT should be issued, whether to add DROP statements, whether to generate single row INSERT statements, and to set the max_allowed_packet size.

What's next?

Nothing much really. I don't really recommend people to use mysqldump's XML output. I wrote mysqldump-x-restore for those people that inherited a bunch of XML formatted dumps, and don't know what to do with them. I haven't thouroughly tested it - please file a bug if you find one. If you actually think it's useful and you want more features, please let me know, and I'll look into it. I don't have much use for this myself, so if you have great ideas to move this forward, I'll let you have commit access.

That is all.

Wednesday, April 14, 2010

MySQL Conference 2010 Presentation: Optimizing Stored Routines

Yesterday I delivered my presentation for the MySQL User Conference and Expo 2010: Optimizing MySQL Stored Routines. If you are interested in the slides, you can find them on both the MySQL conference site as well as on slideshare.net. Here's the abstract of my presentation so you can decide if this is interesting for you:
MySQL stored routines (functions, procedures, triggers and events) can be useful. But many casually written stored routines are unnecessarily slow. The main reason is that MySQL does not apply even simple code optimizations to stored routine code. Many developers are not aware of this, and as a result, write stored routine code that can quite easily be tuned, increasing performance by 50%-100% by only applying very straightforward code optimizations.
It was very pleased to see so many people attend: I had the impression that MySQL stored routines are quite impopular, due to performance issues, and a syntax that is often regarded as "clunky", so I didn't expect more than about 20 people to show up. Much to my pleasure, the ballroom was filled for about two-thirds, and I estimate there were 70-something people in the room.

A quick survey of the audience indicated that all of them were in fact using stored routines in production, so I assume they didn't show up out of morbid curiosity :) Interestingly, only few people reported performance issues. It would be interesting to do more research to find out what people are in fact doing with MySQL stored routines. Among yesterday's attendees, there were people using MySQL stored routines for managing user privileges, processing astronomical data, and checking complex dynamic business rules. To be sure - these were all different users - not just one isolated fanatic going wild with stored routines.

Co-incidentally, Domas Mituzas from facebook also mentioned stored routines in his presentation on high concurrency MySQL as a way to reduce the lock gap when performing multiple changes in a single transaction. I'm just saying - perhaps MySQL stored routines aren't that bad at all, they just need more love and dedication from the MySQL developers so they can mature and gain wider applicability.

Recently, I already wrote about a recent improvement in MySQL 5.5, the long anticipated SIGNAL /RESIGNAL syntax. I hope more improvements will follow soon now the dust is settling after Oracle's acquisition of Sun. After hearing Edward Screven unfold Oracle's strategy for MySQL in yesterday's keynote, I can tell you without reservation that I am quite optimistic :)

Anyway - that is all for now. Two days of conference ahead :)

Tuesday, March 30, 2010

MySQL: Partition-wise backups with mysqldump

To whom it may concern,

in response to a query from André Simões (also known as ITXpander), I slapped together a MySQL script that outputs mysqldump commands for backing up individual partitions of the tables in the current schema. The script is maintained as a snippet at MySQL Forge a gist on github.

How it works

The script works by querying the information_schema.PARTITIONS system view to generate an appropriate expression for mysqldump's --where option. The generated command also redirects the output to a file with this name pattern:
<schema>.<table>.<partition-name>.sql
For example, for this table (taken from the MySQL reference manual):
CREATE TABLE members (
firstname VARCHAR(25) NOT NULL,
lastname VARCHAR(25) NOT NULL,
username VARCHAR(16) NOT NULL,
email VARCHAR(35),
joined DATE NOT NULL
)
PARTITION BY RANGE( YEAR(joined) ) (
PARTITION p0 VALUES LESS THAN (1960),
PARTITION p1 VALUES LESS THAN (1970),
PARTITION p2 VALUES LESS THAN (1980),
PARTITION p3 VALUES LESS THAN (1990),
PARTITION p4 VALUES LESS THAN MAXVALUE
);
the script generates the following commands:
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) < 1960" test members > test.members.p0.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1960 and YEAR(joined) < 1970" test members > test.members.p1.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1970 and YEAR(joined) < 1980" test members > test.members.p2.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1980 and YEAR(joined) < 1990" test members > test.members.p3.sql
mysqldump --user=username --password=password --no-create-info --where=" YEAR(joined) >= 1990 and YEAR(joined) < 18446744073709551615" test members > test.members.p4.sql
Tip: in order to obtain directly executable output from the mysql command line tool, run the script with the --skip-column-names (or -N) option.

Features

Currently, the script supports the following partitioning methods:

Limitations

The LINEAR HASH method is currently not supported, but I may implement that in the future.

Currently I do not have plans to implement the KEY and LINEAR KEY partitioning methods, but I may reconsider if and when I have more information about the storage-engine specific partitioning functions used by these methods.

Finally, I should point out that querying the information_schema.PARTITIONS table is dog-slow. This may not be too big of an issue, however it is pretty annoying. If anybody has some tips to increase performance, please let me know.

Acknowledgements

Thanks to André for posing the problem. I had a fun hour of procrastination to implement this, and it made me read part of the MySQL reference manual on partitioning.

I also would like to thank Giuseppe Maxia (the Datacharmer) for providing valuable feedback. If you're interested in either partitioning or the mysql command line, you should visit his tutorials at the MySQL conference, april 12-15, 2010.

Friday, March 19, 2010

Greatest N per group: top 3 with GROUP_CONCAT()

In my opinion, one of the best things that happened to Planet MySQL lately, is Explain Extended, a blog by Alex Bolenok (also known as Quassnoi on Stackoverflow).

I never had the pleasure of meeting Alex in person, but his articles are always interesting and of high quality, and the SQL wizardry he pulls off is downright inspiring. I really feel humbled by the creativity of some of his solutions and his apparent experience with multiple RDBMS products.

Alex' most recent post is about aggregation, and finding a top 3 based on the aggregate:

In MySQL I have a table called meanings with three columns: word, meaning, person. word has 16 possible values, meaning has 26. A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2. There will be thousands of persons. I need to find the top three meanings for each of the 16 words, with their frequencies. Is it possible to solve this with a single MySQL query?


Alex presents a solution that uses GROUP_CONCAT basically as a poor man's windowing function, a technique I have described on several occasions in the past for ranking, median and percentile solutions in MySQL.

Now, Alex' solution is very clever and there are some elements that I think are very creative. That said, I think his solution can be improved still. Normally I wouldn't write a blog about it, and simply leave a comment on his blog, but his blog supports comments only for general articles, which is why I present it here:

SELECT word
, CONCAT(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 1
)
, ' ('
, SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 1
) / SUM(num) * 100
, '%)'
) rank1
, CONCAT(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 2
), ',', -1
)
, ' ('
, SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 2
), ',', -1
) / SUM(num) * 100
, '%)'
) rank2
, CONCAT(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 3
), ',', -1)
, ' ('
, SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 3
), ',', -1
) / SUM(num) * 100
, '%)'
) rank3
FROM (
SELECT word, meaning, COUNT(*) num
FROM t_meaning m
GROUP BY word,meaning
) a
GROUP BY word

This gives me output like this:

+------+--------------+--------------+--------------+
| word | rank1 | rank2 | rank3 |
+------+--------------+--------------+--------------+
| 1 | 16 (3.9728%) | 17 (3.9648%) | 12 (3.9632%) |
| 2 | 9 (3.9792%) | 10 (3.9632%) | 20 (3.9328%) |
| 3 | 20 (3.9744%) | 13 (3.968%) | 1 (3.9648%) |
| 4 | 26 (3.952%) | 7 (3.9456%) | 17 (3.9424%) |
| 5 | 9 (4.008%) | 21 (3.9824%) | 20 (3.936%) |
| 6 | 19 (3.9504%) | 10 (3.9488%) | 13 (3.9408%) |
| 7 | 23 (4.0464%) | 12 (3.976%) | 19 (3.9648%) |
| 8 | 23 (4.0112%) | 3 (4.0096%) | 8 (3.9328%) |
| 9 | 10 (4.016%) | 19 (3.984%) | 15 (3.9616%) |
| 10 | 10 (4.0304%) | 14 (3.9344%) | 11 (3.9312%) |
| 11 | 16 (3.9584%) | 6 (3.9296%) | 19 (3.9232%) |
| 12 | 7 (3.9968%) | 1 (3.9392%) | 26 (3.9264%) |
| 13 | 8 (4.048%) | 25 (3.9712%) | 23 (3.9616%) |
| 14 | 16 (3.9936%) | 26 (3.9632%) | 4 (3.9536%) |
| 15 | 22 (4.0608%) | 12 (4.0048%) | 1 (3.9632%) |
| 16 | 14 (4.0032%) | 18 (3.9712%) | 4 (3.9488%) |
+------+--------------+--------------+--------------+
16 rows in set (0.63 sec)

On my laptop, my solution is about 30% faster than the one presented by Alex. Personally I think mine is easier to understand too, but that is a matter of taste.

Anyway, I'm just posting this to share my solution - I do not intend to downplay the one presented by Alex. Instead, I invite everyone interested in SQL, MySQL and PostgreSQL to keep an eye on Alex' blog as well as his excellent answers on Stackoverflow. He's an SQL jedi master in my book :)

Of course, if you have a better solution to crack this problem in MySQL, please leave a comment. I'd love to hear what other people are doing to cope with these kinds of queries.

Sunday, March 14, 2010

Writing another book: Pentaho Kettle Solutions

Last year, at about this time of the year, I was well involved in the process of writing the book Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have none but happy memories of the collaboration with my full co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Tomas Morgner, several of the Pentaho Developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews on places like Amazon, and elsewhere on the web. I'd like to use this opportunity to thank everybody that took the time to review the book: Thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enourmously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"


In the autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and well-reknowned publisher in the field of data warehousing and business intelligence, voiced interested in another Pentaho book by Jos an me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, chief Pentaho Data Integration and founding developer of Kettle (a.k.a. Pentaho Data Integration). Both Jos and I didn't expect Matt to be able to free up any time in his ever busy schedule to help us to write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing since, focusing on the forthcoming Kettle version, Kettle 4.0 . The tentative title of the book is Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. It is planned to be published in September 2010, and it will have approximately 750 pages.



Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservation in releasing that already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution


  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes


  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting


  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallization and Partitioning

    • Dynamic Clustering in the Cloud

    • Realtime and Streaming data


  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle


  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems



Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.

Wednesday, February 17, 2010

MySQL - the best stored routine is the one you don't write

At Fosdem 2010, already two weeks ago, I had the pleasure of hearing Geert van der Kelen explain the work he has been doing on connecting MySQL and Python. I don't know anything about Python, but anybody that has the courage, perseverance and coding skills to create an implementation of the the MySQL wire protocol from scratch is a class-A programmer in my book. So, I encourage everyone that needs MySQL connectivity for Python programs to check out Geert's brainchild, MySQL Connector/Python.

In relation to MySQL Connector/Python, I just read a post from Geert about how he uses the MySQL information_schema to generate some Python code. In this particular case, he needs the data from the COLLATIONS table to maintain a data structure that describes all collations supported by MySQL.

For some reasons that I cannot fathom, Geert needed to generate a structure for each possible collation, not just the ones for which the COLLATIONS table contains a row. To do this, he wrote a stored procedure that uses a cursor to loop through the COLLATIONS table. In the loop, he detects it whenever there's a gap in the sequence of values from the ID column, and then starts a new loop to "fill the gaps". For each iteration of the outer cursor loop, a piece of text is emitted that conforms to the syntax of a Python tuple describing the collation, and each iteration of the inner loop generates the text None, a Python built-in constant.

The final result of the procedure is a snippet of Python code shown below (abbreviated):

..
("cp1251","cp1251_bulgarian_ci"), # 14
("latin1","latin1_danish_ci"), # 15
("hebrew","hebrew_general_ci"), # 16
None,
("tis620","tis620_thai_ci"), # 18
("euckr","euckr_korean_ci"), # 19
..


In the final code, these lines are themselves used to form yet another tuple:

desc = (
None,
("big5","big5_chinese_ci"), # 1
("latin2","latin2_czech_cs"), # 2
("dec8","dec8_swedish_ci"), # 3
("cp850","cp850_general_ci"), # 4
..


This is excellent use of the information schema! However, I am not too thrilled about using a stored routine for this. Enter my fosdem talk about refactoring stored routines.

In this case, performance is not really an issue, so I won't play that card. But many people that do need well-performing stored procedures might start out like Geert and write a cursor loop, and perhaps do some looping inside that loop. One of the big take-aways in my presentation is to become aware of the ways that you can avoid a stored procedure. Geerts procedure is an excellent candidate to illustrate the point. As a bonus, I'm adding the code that is necessary to generate the entire snippet, not just the collection of tuples inside the outer pair of parenthesis.

So, here goes:

set group_concat_max_len := @@max_allowed_packet;

select concat('desc = (',
group_concat('\n '
, if( collations.id is null, 'None',
concat('(', '"', character_set_name, '"',
',', '"', collation_name, '"', ')')
)
, if(ids.id=255, '', ','), ' #', ids.id
order by ids.id
separator ''
), '\n)'
)
from (select (t0.id<<0) + (t1.id<<1) + (t2.id<<2)
+ (t3.id<<3) + (t4.id<<4) + (t5.id<<5)
+ (t6.id<<6) + (t7.id<<7) id
from (select 0 id union all select 1) t0
, (select 0 id union all select 1) t1
, (select 0 id union all select 1) t2
, (select 0 id union all select 1) t3
, (select 0 id union all select 1) t4
, (select 0 id union all select 1) t5
, (select 0 id union all select 1) t6
, (select 0 id union all select 1) t7) ids
left join information_schema.collations on ids.id = collations.id;

This query works first by generating 256 rows having id's ranging from 0 to 255. (I think I recall Alexander Barkov mentioning that this is currently the maximum number of collations that MySQL supports - perhaps I am wronge there). This is done by cross-joining a simple derived table that generates two rows:

(select 0 id union all select 1)

So, one row that yields 0, and one that yields 1. By cross-joining 8 of these derived tables, we get 2 to the 8th power rows, which equals 256. In the SELECT-list, I use the left bitshift operator << to shift the original 0 and 1 0, 1, 2 and so on up to 7 positions. By then adding those values together, we fill up exactly one byte, and gain all possible values from 0 through 255:

(select (t0.id<<0) + (t1.id<<1) + (t2.id<<2)
+ (t3.id<<3) + (t4.id<<4) + (t5.id<<5)
+ (t6.id<<6) + (t7.id<<7) id
from (select 0 id union all select 1) t0
, ... t1
, ...
, (select 0 id union all select 1) t7) ids

Once we have this, the rest is straightforward - all we have to do now is use a LEFT JOIN to find any collations from the information_schema.COLLATIONS table in case the value of its ID column matches the value we computed with the bit-shifting jiggery-pokery. For the matching rows, we use CONCAT to generate a Python tuple describing the collation, and for the non-matching rows, we generate None:

if( collations.id is null, 'None',
concat('(', '"', character_set_name, '"',
',', '"', collation_name, '"', ')')
)

The final touch is a GROUP_CONCAT that we use to bunch these up into a comma separated list that is used as entries for the outer tuple. As always, you should set the value of the group_concat_max_len server variable to a sufficiently high value to hold the contents of the generated string, and if you want to be on the safe side and not run the risk of getting a truncated result, you should use max_allowed_packet.

I have the honour of speaking at the MySQL user conference, april 12-15 later this year. There, I will be doing a related talk called Optimizing MySQL Stored Routines. In this talk, I will explain how stored routines impact performance, and provide some tips on how you can avoid them, but also on how to improve your stored procedure code in case you really do need them.

Wednesday, January 27, 2010

Easter Eggs for MySQL and Kettle

To whom it may concern,

A MySQL stored function to calculate easter day


I uploaded a MySQL forge snippet for the f_easter() function. You can use this function in MySQL statements to calculate easter sunday for any given year:

mysql> select f_easter(year(now()));
+-----------------------+
| f_easter(year(now())) |
+-----------------------+
| 2010-04-04 |
+-----------------------+
1 row in set (0.00 sec)

Anonymous Gregorian algorithm


To implement it, I simply transcribed the code of the "Anonymous Gregorian algorithm" from wikipedia's Computus article.

You might ask yourself: "how does it work?". Frankly, I don't know. Much like a tax form, I treat the calculation as a black box. But, it's wikipedia, so it must be right, right?

A Javascript snippet to calculate easter day


I also transcribed the algorithm to javascript, so I could use it in Kettle (a.k.a. Pentaho Data Integration). Of course, nothing should stop you from using it for another environment, such as a webpage.

I don't have a proper place to host that code, so I'm listing it here:

//Script to calculate Easter day
//according to the "Anonymous Gregorian algorithm"
function easterDay(year) {
var a = year % 19,
b = Math.floor(year / 100),
c = year % 100,
d = Math.floor(b / 4),
e = b % 4,
f = Math.floor((b + 8) / 25),
g = Math.floor((b - f + 1) / 3),
h = (19 * a + b - d - g + 15) % 30,
i = Math.floor(c / 4),
k = c % 4,
L = (32 + 2 * e + 2 * i - h - k) % 7,
m = Math.floor((a + 11 * h + 22 * L) / 451),
n = h + L - 7 * m + 114;
return new Date(year,
Math.floor(n / 31) - 1,
(n % 31) + 1);
}

easter = easterDay(year);

To use this in your kettle transformations, create a stream with an field of the Integer type called year. The year field should take on the value of some year. In the step, create one output field of the Date type to take on the value of the easter script variable. (For usage outside kettle, just use the easterDay() function as you see fit)

Nice, but so what?


The thought may have crossed your mind: "So what, who cares - why should I ever want to know when it's easter day?"

Apparently, if you think like that, you don't like eggs very much. That's ok - I don't blame you. But I happen to like eggs, and people in the egg business like people that like eggs like me so they can sell them more eggs. In fact, they like selling eggs so much, that it makes a big difference to them whether their business intelligence reports say: "On March 22, 2008, we sold 10 times more eggs than on February 22 and May 22 of the same year" as compared to "In 2008, on the day before Easter, we only sold half the amount of eggs as compared to the day before Easter in 2009".

In order to report these facts, special events and holidays like easter are stored in a date dimension. (I wrote about creating a localized date dimension, a date dimension that speaks your language some time ago)

So there you go: you could use these solutions in order to build a date dimension that understands easter. The nice thing about easter is that it can be used to derive a whole bunch of other Christian holidays, like good friday, ascension, and pentecost, and in many western countries, these will be special days with regard to the normal course of business. I leave all these as an exercise to the reader, but trust me - calculating easter is the key to a solving a lot of these problems.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...