Roland Bouman's blog: January 2009

Saturday, January 31, 2009

Loading a dimension table with SCD1 and SCD2 attributes

Jos, my co-author for the "Building Pentaho Solutions" book just pointed me to a recent article by Jeff Prenevost entitled "The Problem with History".

Abstract

Jeff's topic, loading a hybrid Type 1 / Type 2 slowly changing dimension table is related to data warehousing but maybe of interest outside of that context as well.

As it turns out, the particular problem described by Jeff is non-trivial, but can be solved quite elegantly in a single SQL statment. This may be a compelling alternative to the multi-step, multi-pass solution proposed in his article.

Type 1 and Type 2 Slowly Changing Dimensions

In his article, Jeff describes a method to load a slowly changing dimension (SCD) table from an audit trail. This would be quite straight forward in case we are dealing with a Type 2 slowly changing dimension. In that case, each row in the audit trail would also yield one row in the dimension table. If the dimension would be a Type 1 slowly changing dimension, the matter would only be slightly more complicated - in this case only the most recent version of each object would be loaded into the dimension table.

A Hybrid SCD Type 1/2

The interesting thing about the problem described in the article is that the dimension table is a hybrid Type 1 / Type 2 dimension. That is, for some attributes, history needs to be tracked in the dimension table (Type 2 attributes), whereas only the most recent data is required for other attributes (Type 1 attributes).

Sample Data

To make things tangible, here's a sample Employee audit trail:

         |    SCD Type 1        |    SCD Type 2  |
+--------+--------+-------------+--------+-------+------------+------------+
| empkey | name   | ssn         | gender | state | valid_from | valid_to   |
+--------+--------+-------------+--------+-------+------------+------------+
|     14 | Jo     | 323-10-1116 | F      | MI    | 1998-12-03 | 1998-12-27 |
|     14 | Jo     | 323-10-1119 | F      | MI    | 1998-12-28 | 2005-04-22 |
|     14 | Joe    | 323-10-1119 | F      | MI    | 2005-04-23 | 2005-08-07 |
|     14 | Joseph | 323-10-1119 | M      | MI    | 2005-08-08 | 2006-02-12 |
|     14 | Joe    | 323-10-1119 | M      | MI    | 2006-02-13 | 2006-07-04 |
|     14 | Joseph | 323-10-1119 | M      | NY    | 2006-07-05 | 2006-12-24 |
|     14 | Joseph | 323-10-1119 | M      | MI    | 2006-12-25 | NULL       |
|     15 | Jim    | 224-57-2726 | M      | IL    | 2002-01-16 | 2004-03-15 |
|     15 | James  | 224-57-2726 | M      | IL    | 2004-03-16 | 2007-06-22 |
|     15 | James  | 224-57-2726 | M      | IN    | 2007-06-23 | 2007-08-31 |
+--------+--------+-------------+--------+-------+------------+------------+

The data shows the history for the employees with empkey 14 and 15. All rows with the same empkey value form a time line of data change events. Each row represents a change event, updating some of the employee's data. The valid_from and valid_to columns are used to record when the data change event occurred, so the values in these columns change for each row. For the other columns, I used bold markup to make it easier to spot the change.

SCD Type 1 and 2 Attributes

In Jeff's article, the columns name and ssn end up as SCD Type 1 attributes, and the columns gender and state as SCD Type 2 attributes. I used color highlighting to mark up groups of consecutive rows where the values for the gender and state columns did not change. After loading the dimension table we must end up with one row for each such group, capturing all change events for these columns. The ssn and name must always contain the most recent data. I highlighted the most recent row using grey markup.

Resulting Dimension Table

The data for the resulting dimension table would look like this:

+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
| dw_key | empkey | name   | ssn         | gender | state | valid_from | valid_to   | is_current |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
|      1 |     14 | Joseph | 323-10-1119 | F      | MI    | 1998-12-03 | 2005-08-07 | N          |
|      2 |     14 | Joseph | 323-10-1119 | M      | MI    | 2005-08-08 | 2006-07-04 | N          |
|      3 |     14 | Joseph | 323-10-1119 | M      | NY    | 2006-07-05 | 2006-12-24 | N          |
|      4 |     14 | Joseph | 323-10-1119 | M      | MI    | 2006-12-25 | 9999-12-31 | Y          |
|      5 |     15 | James  | 224-57-2726 | M      | IL    | 2002-01-16 | 2007-06-22 | N          |
|      6 |     15 | James  | 224-57-2726 | M      | IN    | 2007-06-23 | 9999-12-31 | Y          |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------

As you can see, only the changes in the gender and state columns are recorded, ignoring any changes in the name and ssn columns. Instead, these columns get the values of the most recent change.

Jeff's Solution

Jeff's article describes a step-wise solution to this problem. To give you an idea and some context, here's a quote from the article describing the mindset in developing this approach:

While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.

The actual steps that make up the described solution can be summarized as follows:

Rank all rows according to the timeline

Join consecutive rows by rank in case the there is a change in the SCD attributes

Break the joined rows back into two rows (Jeff calls this 'semi-pivot')

Add the last and first version for each object to the set

Add another rank number to the result set

At this point, the rows where the new rank number is odd contain the valid_from date, and rows with an even number contain the valid_to date. Join each odd row to its consecutive even rows.

Join to the last version of each object to fill in for the SCD Type 1 attributes, and set a flag for the latest version of the object.

I admit I have a hard time understanding Jeff's method, and I am completely puzzled as to how he invented it. My apologies if my summary isn't exactly crystal clear - I encourage you to read the original article, "The Problem with History" yourself if you are interested in the details.

My Alternative

First of all, let me say that I am impressed with Jeff's creativity. I would never think of this solution and I am not really sure I even understand it. I also am not categorically opposed to Jeff's initial mindset: I too think that it is sometimes better to avoid complex SQL statements in favor of a series of relatively simple ones.

That said, when I first read Jeff's article, my first impression was: "Wow...that are a lot of moving parts". My second impression was: "Darn, that's a lot of multiple passes over the same data set". All the time, I had this hunch that it wouldn't be to hard to do it in two statements (one SELECT...INTO, one UPDATE), or perhaps even in one (SELECT...INTO).

Preparation

I started by setting up the test data in MySQL 5.1. I'm including the script here for the readers' convenience:


CREATE TABLE  hrsource (
    empkey     int         NOT NULL
,   name       varchar(10) NOT NULL
,   ssn        char(11)    NOT NULL
,   gender     char(1)     NOT NULL
,   state      char(2)     NOT NULL
,   valid_from date        NOT NULL
,   valid_to   date       
,   PRIMARY KEY(empkey, valid_from)
);

INSERT INTO hrsource 
(empkey,name,ssn,gender,state,valid_from,valid_to) VALUES
 (14, 'Jo'    , '323-10-1116', 'F', 'MI', '1998-12-03', '1998-12-27')
,(14, 'Jo'    , '323-10-1119', 'F', 'MI', '1998-12-28', '2005-04-22')
,(14, 'Joe'   , '323-10-1119', 'F', 'MI', '2005-04-23', '2005-08-07')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2005-08-08', '2006-02-12')
,(14, 'Joe'   , '323-10-1119', 'M', 'MI', '2006-02-13', '2006-07-04')
,(14, 'Joseph', '323-10-1119', 'M', 'NY', '2006-07-05', '2006-12-24')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2006-12-25', NULL)
,(15, 'Jim'   , '224-57-2726', 'M', 'IL', '2002-01-16', '2004-03-15')
,(15, 'James' , '224-57-2726', 'M', 'IL', '2004-03-16', '2007-06-22')
,(15, 'James' , '224-57-2726', 'M', 'IN', '2007-06-23', '2007-08-31');

Of course, this is not a lot of data, but it is the data from the original article, dutifully hacked into my computer's keyboard by your's truly.

Solution

Without further ado, this is my solution for selecting the dimension's table dataset:

SELECT     prv.empkey
,          curr.name
,          curr.ssn
,          prv.gender
,          prv.state
,          MIN(prv.valid_from)                        valid_from
,          COALESCE(
               nxt.valid_from - INTERVAL 1 DAY
           ,   '9999-12-31'
           )                                          valid_to
,          CASE 
               WHEN nxt.valid_from IS NULL THEN 'Y'
               ELSE 'N'
           END                                        is_current
FROM       hrsource                curr
INNER JOIN (
            SELECT   empkey
            ,        MAX(valid_from) valid_from
            FROM     hrsource
            GROUP BY empkey
           )                       curr1
ON         curr.empkey           = curr1.empkey
AND        curr.valid_from       = curr1.valid_from
INNER JOIN hrsource                prv
ON         curr.empkey           = prv.empkey
LEFT JOIN  hrsource                nxt
ON         prv.empkey            = nxt.empkey
AND        prv.valid_to          < nxt.valid_from
AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state)
LEFT JOIN  hrsource                inb
ON         prv.empkey            = inb.empkey
AND        prv.valid_to          < inb.valid_from
AND        nxt.valid_from        > inb.valid_to
AND       (prv.gender,prv.state)!= (inb.gender,inb.state)
WHERE      inb.empkey IS NULL
GROUP BY   prv.empkey
,          nxt.valid_from

Now I would certainly not qualify this as a simple statement. At the same time, I feel this does not resemble the "enormous heaps of nested SQL" so dreaded by Jeff. Let me explain how it works, and you can judge for yourself.

Explanation

I will know do a step-by-step explanation of my statement. I hope it clears up any doubt you might have as to how my solution solves the problem.

SCD Type 1 attributes: Isolating the Most Recent Change

In the introduction, I mentioned that the loading the dimension table would be easy in case it was either a Type 1 or a Type 2 slowly changing dimension. Well, we know that, come what may, we will always need the most recent row for each empkey to supply values for the Type 1 attributes name and ssn. So, we start by attacking that part of the problem.

The following fragment of my solution does exactly that:

FROM       hrsource                curr
INNER JOIN (
            SELECT   empkey
            ,        MAX(valid_from) valid_from
            FROM     hrsource
            GROUP BY empkey
           )                       curr1
ON         curr.empkey           = curr1.empkey
AND        curr.valid_from       = curr1.valid_from

The subquery curr1 finds us the highest value for the valid_from column for each distinct empkey. The GROUP BY empkey clause ensures we get exactly one row for each distinct value of empkey. From all rows with the same empkey value, the MAX(valid_from) bit finds the largest valid_from value.

The combination empkey, valid_from is the primary key of the employee audit trail. This means we can now use the empkey, MAX(valid_from) pair from the curr1 subquery to point out the most recent row for each distinct empkey. The INNER JOIN with the hrsource table (dubbed curr) does exactly that.

Grouping consecutive rows with identical Type 2 attributes

We must now deal with the second part of the problem, the SCD Type 2 attributes.

If you look back at the color highlighting in the sample data set, you may realize this is basically a grouping problem: for each distinct empkey we need to group rows with identical values in the Type 2 columns gender and state. From the point of view of the dimension table, no change occurred within this group, so we should store it as one row in the dimension table, and reconstruct the change dates by taking the minimum valid_from and maximum valid_to values from the group.

We must be careful though: we can't simply make groups of all distinct combinations of the SCD Type 2 columns. Look for example at the rows having 14 for the empkey column. The rows with the valid_from values 2005-08-08, 2006-02-13 and 2006-12-25 all have the identical combination (M, MI) in the Type 2 columns gender and state respectively. However, only the first two of these belong together in a group. The row with 2006-12-25 in the valid_from column does not belong to the group because another change event occurred on 2006-07-05, which lies between 2006-02-13 and 2006-12-25.

So, we must sharpen up the definition of the problem. It is not enough to make groups of identical combinations of Type 2 attributes: we must also demand that the rows in the group are consecutive.

The first step in attacking this problem is generate combinations of the audit trail rows so that we have the first row of each group along the first row of the next group. The following fragment solves part of that problem by combining each row (alias: prv) with all more recent rows (alias: nxt) that have the same empkey value but different values in the Type 2 columns gender and state:


INNER JOIN hrsource                prv
ON         curr.empkey           = prv.empkey
LEFT JOIN  hrsource                nxt
ON         prv.empkey            = nxt.empkey             -- timeline of same employee
AND        prv.valid_to          < nxt.valid_from         -- nxt must be more recent than prv
AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state) -- type 2 attributes differ

(Please ignore the part I struck out - it's not relevant at this point)

Note the usage of the LEFT JOIN. It ensures that if prv is the most recent row, and there is by definition no row in nxt that is more recent, the prv row is still retained. Had we used an INNER JOIN, we would have lost that row, messing up the result.

If we run this fragment in isolation, we can certainly see rows that pair the first row of a group with the first row of the next group. For example, if you run this query:

SELECT    prv.empkey
,         prv.valid_from prv_from, prv.valid_to prv_to
,         nxt.valid_from nxt_from, nxt.valid_to nxt_to
,         prv.gender prv_gender
,         nxt.gender nxt_gender
FROM      hrsource prv
LEFT JOIN hrsource nxt
ON        prv.empkey   = nxt.empkey
AND       prv.valid_to < nxt.valid_from
AND      (prv.gender, prv.state) != (nxt.gender, nxt.state)

We get results like this:


+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
|     14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F          | M          |
|     14 | 1998-12-03 | 1998-12-27 | 2006-02-13 | 2006-07-04 | F          | M          |
|     14 | 1998-12-03 | 1998-12-27 | 2006-07-05 | 2006-12-24 | F          | M          |
.        .            .            .            .            .            .            .
.        .            ...more rows...           .            .            .            .
.        .            .            .            .            .            .            .
+--------+------------+------------+------------+------------+------------+------------+

The first row highlighted in green is a desired combination because the valid_from date of nxt is closest to that of prv. It is this row from nxt that marks the end of the group of rows starting with prv. For the rows highlighted in red, the nxt part indicates a change that occurred beyond that point and are thus not desired.

To get rid of these undesired rows, we simply have to demand that no row indicating a change in the SCD Type 2 attributes occurs between prv and nxt. In my query, the following fragment is responsible for that part:

LEFT JOIN  hrsource                inb
ON         prv.empkey            = inb.empkey             -- time line of same employee
AND        prv.valid_to          < inb.valid_from         -- more recent than prv
AND       (prv.gender,prv.state)!= (inb.gender,inb.state) -- Type 2 columns changed
AND        nxt.valid_from        > inb.valid_to           -- less recent than nxt
WHERE      inb.empkey IS NULL                             -- does not exist

This fragment essentially states that there must not be any row in between prv and nxt that indicate a change in the SCD type 2 attributes as compared to prv. Note that the first part of the join condition is identical to the one we used to join nxt to prv. The extra's are that we also ask that inb lies before nxt. Because we used a LEFT JOIN, the result row is retained in case no such inb row exists. The WHERE inb.empkey IS NULL condition explicitly filters for these cases. By definition, this means that the nxt part in the result row marks the end of whatever group the prv belongs to.

Rolling Up the Rows in their Groups

Now, if we put these parts together we get something like this:


SELECT    prv.empkey
,         prv.valid_from prv_from, prv.valid_to prv_to
,         nxt.valid_from nxt_from, nxt.valid_to nxt_to
,         prv.gender prv_gender
,         nxt.gender nxt_gender
FROM      hrsource                  prv
LEFT JOIN hrsource                  nxt
ON        prv.empkey              = nxt.empkey
AND       prv.valid_to            < nxt.valid_from
AND      (prv.gender, prv.state) != (nxt.gender, nxt.state)
LEFT JOIN  hrsource                 inb
ON         prv.empkey             = inb.empkey             
AND        prv.valid_to           < inb.valid_from         
AND       (prv.gender,prv.state) != (inb.gender,inb.state) 
AND        nxt.valid_from         > inb.valid_to           
WHERE      inb.empkey IS NULL

Some of the results are here:

+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
|     14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F          | M          |
|     14 | 1998-12-28 | 2005-04-22 | 2005-08-08 | 2006-02-12 | F          | M          |
|     14 | 2005-04-23 | 2005-08-07 | 2005-08-08 | 2006-02-12 | F          | M          |
.        .            .            .            .            .            .            .
.        .            ...more rows...           .            .            .            .
.        .            .            .            .            .            .            .
+--------+------------+------------+------------+------------+------------+------------+

Now, we still see the individual rows. However, because we have paired them with the row that definitely marks the end of the group it belongs to, we can now lump them together using GROUP BY. This explains the final bits of the query:

GROUP BY   prv.empkey
,          nxt.valid_from

For each value in empkey, this GROUP BY clause essentially rolls up the group of rows that showed no change in the SCD Type 2 attributes until the occurrence of the nxt row. In the SELECT list, we can now use the MIN function to find the date of the first row in the group.

Note that the GROUP BY list is the thinnest one possible. The SELECT list contains many more columns that do not appear in the GROUP BY list. However, it is perfectly safe to do so in this case. For more details on this matter, please read my Debunking Group By Myths article.

Glueing the SCD Type 1 and Type 2 Solutions together

We developed the solution as two separate queries - one to take care of the SCD Type 1 attributes, one for the SCD Type 2 attributes. These two are simply joined together over the empkey. The code to do this was displayed struck out in the first query fragment that deals with SCD type 2 attributes.

Calculating the Result columns

The only part of the query that was not discussed yet is the SELECT list:

SELECT     prv.empkey
,          curr.name
,          curr.ssn
,          prv.gender
,          prv.state
,          MIN(prv.valid_from)                        valid_from
,          COALESCE(
               nxt.valid_from - INTERVAL 1 DAY
           ,   '9999-12-31'
           )                                          valid_to
,          CASE 
               WHEN nxt.valid_from IS NULL THEN 'Y'
               ELSE 'N'
           END                                        is_current

Note that we select the name and ssn columns from the curr table. This ensures these columns will always reflect the data of the most recent change which is typical for SCD Type 1 attributes. The gender and state columns on the other hand are drawn from prv. They reflect the change history for each individual employee.

The valid_from column is aggregated using the MIN function. This essentially yields the date of the first row of each group of consecutive rows having the same empkey and values in the SCD Type 2 columns.

The valid_to date is calculated from the valid_from date of the first row of the next group of rows that indicates a change of values in the SCD Type 2 columns. This is the relevant code:

nxt.valid_from - INTERVAL 1 DAY

Now, to be completely fair - this is in fact cheating. It implies I trust the data to be such that the value for the valid_to date is always exactly one day less than the value for the valid_from date for the next row in the history. This was certainly the case with the sample data, but we cannot really rely on that. That said, it is not that hard to get the 'real' valid_to date, but it would make this example a lot harder to understand.

Another thing about the calculation of valid_to is the treatment of NULL values. For each history associated with one distinct empkey value, the last row will have NULL here, and this is substituted by a large date instead. Dimension tables tend to avoid nullable columns and in this case the maximum value '9999-12-31' expresses that this row is the current row. Using that same logic, and exrta flag column is added to indicate whether the row is the current row.

Wednesday, January 28, 2009

Fosdem, 9 days from now

FOSDEM, the free and open source developer's meeting is just 9 days away.

I will be doing one talk on MySQL 5.1 Plugins on Sunday, February 8 in the MySQL Room. Initially I was scheduled for two talks, but one was canceled to make way for another talk. I'm actually quite happy it went that way, because Jurriaan Persyn's talk about Database Sharding that is now scheduled instead looks very interesting indeed. I think it really complements the already nice mix of topics in the MySQL track.

Apart from being an interesting tech/geek conference, Fosdem is also a great social event, and I am looking forward to meeting new people and hanging out with old friends and colleagues. Who knows, maybe I'll meet you there?

Monday, January 26, 2009

Repository for MySQL UDFs: new sys_eval() function

Thanks to Bernardo Damele, the lib_mysqludf_sys package hosted by the Repository for MySQL UDFs can now boast a new sys_eval() function. You can read the details here on his blog. Bernardo also created a similar function for PostgreSQL.

Thank you, Bernardo!!!

sys_eval() is quite like sys_exec(): both functions can execute an arbitrary command string. The difference is that sys_eval() returns the output of the command, whereas sys_exec() returns the exit code. Go and check it out for yourself!

Sunday, January 11, 2009

MySQL's sql_mode: My Suggestions

Just the previous week, Anders mentioned the sql_mode with regard to Joomla. Past week, Shlomi's started a nice discussion pertaining to MySQL's sql_mode. He offers the suggestion to modify the installation procedure so MySQL is by default installed using a so-called "strict" sql_mode. In addition, he suggest it would be better if it would not be allowed to change the sql_mode after starting the database server.

Although I understand Shlomi's motivation in wanting this, I respectfully disagree. I have a few ideas too about the sql_mode and what should be done to make it more valuable and less dangerous to use it. In short, these are my suggestions:

Make setting the sql_mode after the server has started a privileged action.

Allow session settings to be defined as part of the user account so they are automatically applied as part of creating a session for the user

Read on if you want to know more about the sql_mode and my motivations for proposing these two changes.

UPDATE: Everybody that wants to automatically set a specific sql mode depending on the user that is connecting, look up this entry in the MySQL bug database:

http://bugs.mysql.com/bug.php?id=42035

Shane Bester put up a comment there that explains how to do it using mysqld's --init-connect option.

The `sql_mode`

What is the sql_mode, and what does "strict" really mean? Well, the sql_mode is a MySQL server variable. It's value can affect a number of aspects regarding how MySQL parses SQL, and what semantics are attached to it. Functionally, the sql_mode is best understood as a compatibility feature that allows MySQL to behave more like other RDBMS products.

The value for the sql_mode is a set, which is essentially a comma-separated list of predefined values. You can find all possible values here in the reference manual. For example, on windows, the default sql mode looks like this:


mysql> SELECT @@sql_mode;
+----------------------------------------------------------------+
| @@sql_mode                                                     |
+----------------------------------------------------------------+
| STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+----------------------------------------------------------------+
1 row in set (0.00 sec)

You can set the value for the sql_mode using the SET syntax. You can also set its value on the mysqld command line, or in an option file.

There is quite a good bunch of atomic pre-defined values that alter or attenuate just one specific aspect of MySQL's behaviour. Some examples in this category are IGNORE_SPACE and ONLY_FULL_GROUP_BY, and all modes shown in the previous listing also fall into this category. There is another bunch of mode values that have the effect of adding a predefined collection of values to the sql_mode. Examples in this category are ANSI (which is equivalent to REAL_AS_FLOAT,PIPES_AS_CONCAT,ANSI_QUOTES,IGNORE_SPACE) and POSTGRESQL.

The "strict" mode

Now what is this "strict" mode? Well, first of all it is good to realize that there is not one "strict mode". Just check the manual:

When this manual refers to “strict mode,” it means a mode where at least one of STRICT_TRANS_TABLES or STRICT_ALL_TABLES is enabled.

What do these modes do? Basically, these modes affect the rules for modifying table data. Old-style MySQL behavior was to implicitly provide defaults for missing values, and to convert out-of range values to the nearest in-range value. You would get a warning to inform you it happened, but that statement would continue. STRICT_ALL_TABLES causes the statement to be aborted in case of invalid data. For transactional tables, this will also cause the statement to be rolled back. At any rate, an error (instead of a warning) is issued in this case. The STRICT_TRANS_TABLES mode works similar to STRICT_ALL_TABLES with regard to transactional tables, but still displays the old behaviour for non-transactional tables.

Now, there is some additional complication, because even in "strict" mode, not all out of range values are automatically taken care of. MySQL supports date values that are invalid in the sense that they do not correspond to a real existing date. Most notably, there is the "zero date", 0000-00-00 00:00:00. By default, it is also allowed to have a zero field within a date, like so: 2001-01-00 (Note the zero day field. Personally I don't mind these date semantics, and I happen to find them convenient. For example, if you know the year, and you know the month, you can still enter a value that calculates like a date, even though the exact day is presently not known). These can be influenced using the NO_ZERO_DATE and NO_ZERO_IN_DATE settings for the sql_mode.

If you are looking for a way to use both a strict mode, and avoid these invalid dates as well, check out the TRADITIONAL mode. This combines the strict modes and disallows the invalid dates, and includes even a few more options to make MySQL behave more like a traditional RDBMS.

The snag with `sql_mode`

Personally, I think that having the sql_mode is a pretty good thing. Setting an appropriate strict sql_mode allows you to write more robust applications without losing the ability to run older applications that lack the concept and rely on the forgiving old-style MySQL behaviour. However, there are still a few snags to look out for.

You can run into real trouble when specifying a non-default sql_mode in the option file. In a prior article, I wrote that existing applications may not be prepared for this, and they might break. For this reason, the default sql_mode can be overridden on the session level, allowing each application to set exactly the sql_mode they require.

At the same time, the ability to change the sql_mode on the session level is also a source of trouble. Not only does it enable users to change the rules concerning what data can be entered into a table, it can actually lead to corruption and data loss when used in conjunction with partitioned tables. Obviously, this is a serious problem. It just blows my mind why nothing was done to prevent this.

Proposed Solution - two suggestions

Like Shlomi, I agree there is an inherent problem in the way it is now possible to change the sql_mode. I also feel it should change, but I respectfully disagree with Shlomi's suggestions. Perhaps the sql_mode as set by the windows installer (only STRICT_TRANS_TABLES, NO_AUTOCREATE_USER and NO_ENGINE_SUBSTITUTION) is doable, but something like the TRADITIONAL mode would simply break too many existing applications. Being able to set the sql_mode is tied to this matter - if the global sql_mode does not meet the requirements (either more or less strict) of a particular application, you simple need to override it at the session level. At least I don't see how it can be done otherwise.

I think the solution to all these problems is to ensure that the sql_mode is not altered willy-nilly. First of all, it is clear that some users should be able to alter the mode for application compatibility. It is also clear that there are large classes of users that should not be able to do this. Secondly, most cases that require changing the sql_mode involve running specific applications. In these cases, the change of sql_mode is very well defined, and completely confined to the application's schema.

Suggestion 1: altering the `sql_mode` should be a privileged action

So what I mean is that it should by default not be possible for an arbitrary user to change the sql_mode. I guess it would make sense to allow at least root and perhaps the SUPER users to do it, but otherwise, you would need to be explicitly granted the privilege, something like:


GRANT SET SESSION SQL_MODE TO user@host;

If you like this suggestion, or if you have some other thoughts concerning this matter, go to bug 42034, and add a comment.

Suggestion 2: define default session settings for the account

Many applications use a single database account. It would be extremely helpful if we could define the default session settings at the account level. These could then be applied as soon as the user establishes a connection. For example:


CREATE USER user@host IDENTIFIED BY 'password'
SESSION DEFAULTS SET 
    SQL_MODE := 'TRADITIONAL,NO_ENGINE_SUBSTITUTION,IGNORE_SPACE,PIPES_AS_CONCAT,ANSI_QUOTES'
,   AUTOCOMMIT := OFF
,   STORAGE_ENGINE := INNODB
;

In many cases, this would be enough - if the session settings are applied automatically, there will in most cases not be any need at all to change the sql_mode (or the autocommit mode, or the storage_engine). If you like this suggestion, go to bug 42035 and leave your thoughts there.

Thursday, January 08, 2009

Talking at FOSDEM (7-8 februari, MySQL Room)

Wohoo!

I just got word that the proposals I submitted for FOSDEM are appoved! I was going to go to FOSDEM no matter what, and when I saw Lenz Grimmer's call for participation, I decided to give it a go.

I'll be doing two talks:

MySQL Plugins - Learn about this exciting MySQL 5.1 feature

Starring Sakila - A datawarhousing and Business Intelligence mini-tutorial using the Sakila sample database

Having the opportunity to do these talks adds to the fun, and I feel honored for being given the chance. Thank you!

This year, Fosdem looks very promising for the MySQL crowd. Lenz himself is doing a talk on MySQL HA solutions, and this year, MySQL even has a Developer's room too (on Sunday, Februari 8). Check out the tentative schedule for the MySQL room here. I will probably hang out all day in the MySQL dev room to see all the talks. Although all of them seem interesting to me, there are three talks I'd like to highlight:

PBXT best practices, by Vladimir Kolesnikov

Percona MySQL patches and the XtraDB storage engine, by Ewen Fortune

Boost performance with MySQL 5.1 partitions, by Giuseppe Maxia( yup - he knows what he's talking about)

The talks by Kris Buytaert (MySQL monitoring) and Geert Vanderkelen (MySQL Cluster) should be good too. I just pitched the ones that seem likely to contain stuff that's mostly new to me.

So, see you there at FOSDEM!! I''m looking forward to meet you there, and have a trappiste or two...after that I usually lose count anyway ;)

Wednesday, January 07, 2009

Thank you, Wikipedia

Wikipedia users: I bet you noticed the banners...

The past weeks there was a banner on top of all wikipedia articles asking for donations. Today I looked and saw the "Thank you banner" to convey that it all worked out - enough donations were made to keep wikipedia going.

Maybe you think I'm being pompous, but I feel that wikipedia is nothing short of being a world wonder. It certainly is one of the greatest achievements of the past decade, and I suspect it will remain relevant for many years to come.

I'd like to say "thank you" to everybody that donated, and a "Big Thank You" to the all the people behind wikipedia - contributors, moderators, editors, developers, system administrators and management (and whoever I left out). Wikipedia is one of my top resources for technical and scientific knowledge, and it's really hard to think how I would do my work without it.

Sunday, January 04, 2009

MySQL Stored Procedure Result Sets in Pentaho Data Integration

Quick tip - suppose you need the result set of a MySQL stored procedure in your Pentaho Data Integration (a.k.a. Kettle) Transformation, what do you do?

A Call DB Procedure sounds promising, but as it turns out, you can't use it to retrieve any result sets. Rather, this type of step is meant have an input stream from another source:

drive stored procedure execution presumably for some useful side-effect

invoke a database stored function and obtain the scalar result

So, what can we do? The answer is simpler than might be expected.

Just use an ordinary Table input step. Everybody that has used PDI before knows that the Table input step retrieves data based on an SQL statement. However, the SQL statement is not required to be a SELECT statement. Any statement that returns a result set will do, including SHOW statements, and CALL.

Stay tuned for more tips like these - more will come as Jos and I are completing our "Building Pentaho Solutions" Book.

Friday, January 02, 2009

Writing a Book: Building Pentaho Solutions

Ok - this has been stewing for some time now, and I think now is the right time to announce that I am working together with Jos van Dongen from Tholis Consulting to create a book for Wiley with the tentative title "Building Pentaho Solutions".

My personal aim is to make this book the primary point of reference for DBAs and Application Developers that are familiar with Open Source products like MySQL and PostgreSQL but have no prior BI skills, as well as BI professionals that are familiar with closed source BI products like Microsoft BI and Business Objects that want to learn how to get things done with Pentaho.

The book will cover all distinct components and sub-products that make up the Pentaho BI Suite. For each component (and where applicable), installation, usage and maintenance are discussed and illustrated.

Background theory is given as needed to provide context for those readers with no prior BI knowledge or experience. This means we will cover topics such as dimensional modeling, data warehousing, data integration and much more. At the same time, the book will have a strong "hands-on" focus, and for that purpose, Jos and I have put together a fairly realistic online DVD Rental Company, www.worldclassmovies.com.

We went as far as creating different database schemas for the operational applications (for customer orders, inventory management, and purchase orders), and generating non-trivial volumes of example data for it (tables with 100k to 10m rows), covering over 7 years of DVD rental business - and this is just the operational system. And although generated, it's not just random data - things like customer age and location distribution, per year and per week ordering behaviour etc. - it's all in our sample data.

In our schemas and sample solutions, we are making an effort to cater to MS SQL, MySQL, Oracle and PostgreSQL users. We will also explore a few highly interesting open source database products like LucidDB and MonetDB.

Anyway - I will soon post some more details about the book and it's contents as we progress. I have taken Baron Schwartz's advice on writing a book to heart, and we are working from a pretty detailed outline trying to meet a darn-tight schedule.

If all goes well, we should end up with a 450-500 page book in a store near you by August 2009.

UPDATE:
Jos and I are in the final stages of writing the book. We will then have to process reviewer comments and set up the book's website (including all code samples).

We've had a few changes to our schedule. Instead of a 400-500 page book, it will be a 500-600 page book. It was due to be available late August, but this will now be Early September.

The book can be pre-ordered on Amazon. More information on the book can be found on the Wiley website

UPDATE2: The book is in print and available since late August 2009! So far, we've had a couple of very nice reviews.

The book is available directly from the Wiley site and also from Amazon. Wiley also provides an e-book. If the link to the e-book is not working or claiming the book is not available as e-book then you most likely ended up on the European version of the website. To fix that and order the e-book anyway, click the "choose location" link on the Wiley homepage, and set it to United States and try again.