Roland Bouman's blog

Tuesday, April 21, 2009

Not so random slashdot comment Gem

Java 8 will replace String with String2, which will treat empty string and null the same.

By characterZer0 (138196), on Monday April 20, @08:44AM

Thursday, March 26, 2009

More fun visualizing MySQL Database Size

Hi again!

As a follow-up to my previous post, I came up with a snippet of code that generates the Google Chart URL to visualize table size for the current database. For example, for the sakila sample database, we get URLs like this:

http://chart.apis.google.com/chart?cht=bhs&chbh=19,2,2&chs=653x459&chtt=sakila%20Size%20(MB)&chco=4D89F9,C6D9FD&chd=t:0.0156,0.0156,0.0156,0.0156,0.0156,0.0469,0.0625,0.0625,0.0781,0.0781,0.1875,0.1875,0.1875,0.1163,0.1719,1.5156,1.5156|0.0000,0.0000,0.0000,0.0156,0.0313,0.0156,0.0156,0.0313,0.0156,0.0469,0.0781,0.0781,0.0781,0.2002,0.1875,0.6094,1.2031&chds=0,2.7188&chxt=y,x&chxl=0:|rental%20(InnoDB)|payment%20(InnoDB)|inventory%20(InnoDB)|film_text%20(MyISAM)|films%20(InnoDB)|film_actor%20(InnoDB)|film%20(InnoDB)|customer%20(InnoDB)|staff%20(InnoDB)|address%20(InnoDB)|film_category%20(InnoDB)|city%20(InnoDB)|store%20(InnoDB)|actor%20(InnoDB)|language%20(InnoDB)|country%20(InnoDB)|category%20(InnoDB)|1:|0|2.72MB&chm=N*f2*,000000,0,-1,11|N*f2*,000000,1,-1,11

The graph looks like this: Google Chart: Sakila database size

Here's the script I used:

SET @maxpixels:=300000;
SET @barwidth:=19;
SET @spacebetweenbars:=2;
SET @spacebetweengroups:=2;
SET @totalbarwidth:=@barwidth+(2*@spacebetweenbars)+(2*@spacebetweengroups);
SET @megabytes:=1024*1024;
SET @decimals:=2;

SELECT      CONCAT(
                 'http://chart.apis.google.com/chart'
            ,    '?cht=bhs'
            ,    '&chbh=',@barwidth,',',@spacebetweenbars,',',@spacebetweengroups
            ,    '&chs=', @maxpixels div (COUNT(*) * @totalbarwidth),'x', COUNT(*) * @totalbarwidth
            ,    '&chtt=', table_schema, ' Size (MB)' 
            ,    '&chco=4D89F9,C6D9FD'
            ,    '&chd=t:',    GROUP_CONCAT(data_length / @megabytes ORDER BY (data_length+index_length))
            ,    '|', GROUP_CONCAT(index_length / @megabytes ORDER BY (data_length+index_length))
            ,    '&chds=' ,0, ',', MAX(data_length+index_length)/@megabytes
            ,    '&chxt=y,x'
            ,    '&chxl=0:|', GROUP_CONCAT(table_name, ' (', engine,')' ORDER BY (data_length + index_length) DESC separator '|')
            ,         '|1:|', 0, '|', ROUND(MAX(data_length+index_length) / @megabytes, @decimals), 'MB'
            ,    '&chm=N*f',@decimals,'*,000000,0,-1,11|N*f',@decimals,'*,000000,1,-1,11'
            )
FROM        information_schema.tables
WHERE       table_schema = SCHEMA()
AND         table_type   = 'BASE TABLE'
GROUP BY    table_schema;

I'm not really satisfied yet... I keep hitting limitations with regard to Google Charts. I built a little bit of logic that ensures the resulting picture stays within the upper limit of 300000 pixels. I just found out there is another limitation that says the chart height can't exceed 1000 pixels. I'm going to stop looking at this for a while, and maybe later on I will come up with some works-most-of-the-time kind of logic to control this.
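
One way the height limit could be folded into the size calculation is sketched below. This is only a sketch under assumptions: @maxheight is a hypothetical variable that is not part of the script above, and capping the height does nothing to make more bars fit, so charts for schemas with many tables would still come out cut off.

```sql
SET @maxpixels     := 300000;
SET @maxheight     := 1000;                     -- hypothetical: the height limit mentioned above
SET @totalbarwidth := 19 + (2 * 2) + (2 * 2);   -- same formula and values as in the script

-- height: one bar slot per table, but never more than @maxheight
-- width:  whatever is left of the pixel budget for that height
SELECT  LEAST(COUNT(*) * @totalbarwidth, @maxheight)                  AS chart_height
,       @maxpixels DIV LEAST(COUNT(*) * @totalbarwidth, @maxheight)   AS chart_width
FROM    information_schema.tables
WHERE   table_schema = SCHEMA()
AND     table_type   = 'BASE TABLE';
```

For the 17 tables of the sakila example this gives a height of 459 and a width of 653, which matches the chs=653x459 parameter in the URL above. If no default database is selected, the count is zero and the width comes out as NULL.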

Feel free to drop me a line if you have some ideas regarding this.

A Faster MySQL Database Size Google Chart

Abstract - As described by Walter Heck, MySQL database size can be visualized using Google Charts. With a minor code improvement the URL for the chart can be obtained twice as fast. With some more modification, the number of lines can be cut down resulting in a function that is half as long.

Hi!

It's been a while since I posted - I admit I've been struggling a bit to balance time and attention between the day job, writing a book, preparing my talks for the MySQL Users Conference and of course family life.

A month ago or so I read a couple of posts about using the Google Chart API to visualize database size. Although I personally would not consider using Google Charts (in its current form) for serious monitoring applications, I am quite charmed by its ease of use and availability. Time to give it a try myself.

I inspected the PL/SQL code provided by Alex Gorbachev, and the MySQL code by Walter Heck, as well as the improved implementation by Ruturaj Vartak.

Although I applaud both Walter's and Ruturaj's efforts in porting this code, I think their code can still be improved. In this short article I'd like to illustrate how a minor modification can double performance.

Code Analysis

Let's take a brief moment to analyze Walter's original code. I will cite a few fragments of his code. (I have made some simplifications and removed some distractions to make it as easy as possible to understand the logic of the code. If you find an error in my citation, please consider the possibility that it is my doing, not Walter's).

His program takes the form of a MySQL stored function. It accepts a few parameters to configure the chart (chart type and size) and outputs a fully functional URL which can be used to retrieve a .png image that shows the size (in MB) of all databases managed by the MySQL server. This is the function signature:

CREATE FUNCTION FNC_GOOGRAPH_DB_SIZE(
  p_chart_type CHAR,
  p_height INT,
  p_width INT
) RETURNS varchar(255)

The stored function is implemented by retrieving size metadata from the information_schema.TABLES table. The sizes of each table's data and indexes are then added together and summed per database. The results are traversed using a cursor loop.

Here's the cursor declaration:

DECLARE c_schema_sizes cursor FOR
SELECT    t.table_schema
,         SUM(t.data_length + t.index_length) / 1024 / 1024
FROM      information_schema.tables t
GROUP BY  t.table_schema

Note that the figure is immediately represented as megabytes by dividing twice by 1024. (In other words, divide by 1024 to make kilobytes of the raw bytes, then divide by 1024 again to make megabytes out of kilobytes.)

During iteration, a list of database names and a list of database sizes are built through string concatenation. These are required to supply the data series and labels to the chart.

Only the code for constructing the data series is shown here:

/* Get the percentage of the total size as the graph's data */
IF v_chart_data = '' THEN
    SET v_chart_data = 
        ROUND(v_data_length_sum / v_data_length_total, 2) * 100;
ELSE
    SET v_chart_data = CONCAT(v_chart_data,',',
        ROUND(v_data_length_sum / v_data_length_total, 2) * 100);
END IF;

Note that instead of making a list of actual sizes, a percentage is taken. Keep that in mind, we'll discuss this in more detail in the next section.

Finally, the actual URL is built, using the list of database sizes as the data series and the list of database names as data labels. Here's the URL construction code:

SET v_url = 'http://chart.apis.google.com/chart?';
SET v_url = CONCAT(v_url, 'cht=', p_chart_type);
SET v_url = CONCAT(v_url, '&chs=', p_width , 'x', p_height);
SET v_url = CONCAT(v_url, '&chtt=Database Sizes (MB)');
SET v_url = CONCAT(v_url, '&chl=', v_chart_labels);
SET v_url = CONCAT(v_url, '&chd=t:', v_chart_data);
SET v_url = CONCAT(v_url, '&chdl=', v_legend_labels);

About Google Chart Data Formats

We just mentioned that the data points are not simply concatenated - rather, data size per database is expressed as a percentage of the total size of all databases. This is not just some arbitrary choice - it is in fact required by the Google Chart API. Here's a quote from the Overview of data formats:

Before you can create a chart you must encode your data into a form that is understood by the Chart API. Use one of the following formats:
Text encoding uses a string of positive floating point numbers from zero to one hundred.
...

For this reason, the code divides everything by v_data_length_total and multiplies by 100. In our code analysis, we did not discuss v_data_length_total because it was not really essential for the logic of the program. This is how it is computed:

SELECT  ROUND(SUM(t.data_length + t.index_length) / 1024 / 1024) 
INTO    v_data_length_total
FROM    information_schema.tables t

Note that this query is very similar to the query used in the cursor declaration. In both cases, the information_schema.TABLES table is queried to calculate the combined size of table data and indexes. The main difference is grouping: in the cursor declaration a GROUP BY table_schema was used to calculate the size per database. In this case, grouping is absent and the size is calculated across databases, i.e. the combined size of all tables and indexes in the entire server.

Text encoding with Data Scaling

Now, as it turns out, the text encoding used by this code comes in two flavours:

Text encoding with data scaling uses a string of positive and negative floating point numbers in combination with a scaling parameter.

So, as an alternative to 'pre-scaling' the values in the data series to a 100-point scale, we can also opt to let Google do the work, provided we pass this scale factor.

Because the code to calculate the scale factor is already present, we can simply remove the scaling computation inside the loop, and add the scale factor to the URL like so:


-- inside loop: remove percentage calculation:
IF v_chart_data = '' THEN
    SET v_chart_data = ROUND(v_data_length_sum, 2);
ELSE
    SET v_chart_data = CONCAT(v_chart_data, ',', ROUND(v_data_length_sum, 2));
END IF;

-- outside loop: add scaling parameter to URL:
SET v_url = CONCAT(v_url, '&chds=0,', v_data_length_total);

However, we can do better than that.

An alternative method to calculate the scaling factor

The problem with the current solution is that it has to do a separate query to obtain the total size. We already noticed that the two queries are quite similar, save for the grouping. Because we are already looping through all results, we can try to calculate the total ourselves. This would allow us to avoid doing the SELECT...INTO statement altogether.

We would need to add proper initialization for the v_data_length_total variable and add the following line inside the loop:

SET v_data_length_total := v_data_length_total + v_data_length_sum;
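
To make the idea concrete, here is a minimal, self-contained sketch of a cursor loop that accumulates the total while iterating. It is not Walter's code: the procedure name p_total_db_size, the v_done flag and the v_schema variable are made up for this illustration, and the COALESCE is an added precaution for schemas whose summed size is NULL (for example, schemas that contain only views). The DELIMITER lines assume the mysql command line client.

```sql
DELIMITER //

CREATE PROCEDURE p_total_db_size()   -- hypothetical name, for illustration only
READS SQL DATA
BEGIN
    DECLARE v_done              BOOL   DEFAULT FALSE;
    DECLARE v_schema            VARCHAR(64);
    DECLARE v_data_length_sum   DOUBLE;
    -- proper initialization: start the running total at zero
    DECLARE v_data_length_total DOUBLE DEFAULT 0;

    -- same query as the cursor declaration discussed earlier
    DECLARE c_schema_sizes CURSOR FOR
        SELECT    t.table_schema
        ,         SUM(t.data_length + t.index_length) / 1024 / 1024
        FROM      information_schema.tables t
        GROUP BY  t.table_schema;

    DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done := TRUE;

    OPEN c_schema_sizes;
    schema_loop: LOOP
        FETCH c_schema_sizes INTO v_schema, v_data_length_sum;
        IF v_done THEN
            LEAVE schema_loop;
        END IF;
        -- accumulate the total inside the loop instead of running a second query
        SET v_data_length_total := v_data_length_total
                                 + COALESCE(v_data_length_sum, 0);
    END LOOP;
    CLOSE c_schema_sizes;

    -- return the total (in MB) as a one-row result set
    SELECT v_data_length_total AS total_size_mb;
END//

DELIMITER ;
```

Calling it with CALL p_total_db_size(); runs the information_schema query only once and returns the combined size of all databases.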

When comparing performance before and after this change, we can observe a dramatic difference. On my laptop, the original function takes anywhere between 45 and 50 seconds. After incorporating these changes, the time is cut in half and the function takes 'only' 22 to 24 seconds.

The fact that the second solution is twice as fast (~23 seconds instead of ~46) is not a coincidence: it's because we're doing half the number of queries (1 instead of 2). Don't get me wrong - this is still very slow. But this all comes down to poor information_schema performance. In this particular case, there is not much we can do to improve the code further to gain performance.

More performance improvements?

I am very much convinced the bottleneck in the database size chart function is the query on the information_schema.TABLES table. To be more exact, it's the fact that we're accessing the DATA_LENGTH and INDEX_LENGTH columns. Just watch this:

mysql> select count(*) from information_schema.tables;
+----------+
| count(*) |
+----------+
|     1576 |
+----------+
1 row in set (0.08 sec)

mysql> select count(data_length) from information_schema.tables;
+--------------------+
| count(data_length) |
+--------------------+
|               1563 |
+--------------------+
1 row in set (25.98 sec)

mysql> select count(index_length) from information_schema.tables;
+---------------------+
| count(index_length) |
+---------------------+
|                1563 |
+---------------------+
1 row in set (25.41 sec)

As you can see, simply accessing DATA_LENGTH and/or INDEX_LENGTH causes the query to be slow.

Notice also that the time spent to execute this stand-alone query is about the same as it takes for the improved function to complete. Basically, this tells us the function spends all its time performing the query - timewise, the contribution of the remainder of the function, such as cursor traversal, creating the value lists and building the URL is simply negligible.

This means that we simply can't improve the performance of the database size chart function unless we can improve the performance of this query. Because we have no other general way of obtaining index and data size, this is the end of the line.

Other Improvements

Even though we probably can't improve the performance of Walter's function anymore, I still think it's possible to improve the code.

I guess this is kind of a pet peeve of mine, but I dislike using MySQL cursors. There's a lot of syntax involved to set them up. Usually, they can be avoided anyway.

MySQL Cursors are also pretty slow. This is something you really start to notice when traversing tens of thousands of rows. Not that that really matters for building Google Chart URLs: Most likely, you will hit a limitation in either chart size or URL length, so you can't really gain a lot of performance by eliminating cursors for this particular purpose.

Still, I think that eliminating the cursor will help to make the code less complex, so let's try anyway.

Building Lists: `GROUP_CONCAT()`

We just explained how the cursor was used to build various lists of values. As it happens, MySQL supports the GROUP_CONCAT() function, which was designed especially for that purpose. GROUP_CONCAT() is an aggregate function, just like COUNT(), MIN() and MAX().

Like other aggregate functions, GROUP_CONCAT() can produce a single summary result on a group of rows. It does so by first concatenating its arguments for each row in the group, and then concatenating the per-row result for the entire group of rows, optionally separating row results using some separator.
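
As an illustration (a sketch that is not part of the chart code), the following query builds one pipe-separated list of table names per database. The column alias table_list is just a name picked for this example.

```sql
SELECT    table_schema
,         GROUP_CONCAT(
              -- per row: concatenate the arguments, e.g. 'actor (InnoDB)'
              table_name, ' (', engine, ')'
              -- per group: sort the row results and glue them together with '|'
              ORDER BY table_name
              SEPARATOR '|'
          ) AS table_list
FROM      information_schema.tables
WHERE     table_type = 'BASE TABLE'
GROUP BY  table_schema;
```

When the SEPARATOR clause is omitted, the row results are separated by a comma, which is exactly the format needed for the chart's data series.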

Now, there is an important limitation with this function that causes many people to avoid GROUP_CONCAT altogether. Here's the relevant text from the manual:

The result is truncated to the maximum length that is given by the group_concat_max_len system variable, which has a default value of 1024.

Basically, this says that if you don't take proper precautions, the lists you generate with GROUP_CONCAT may be truncated. This is obviously bad news! However, there is a very simple workaround, which is also hinted at in the documentation:

The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet.

So, the workaround is simple: assigning the value of max_allowed_packet to group_concat_max_len allows the longest possible list of values. You may argue that this might still not be long enough. However, that is a moot point. Just read up on max_allowed_packet:

The maximum size of one packet or any generated/intermediate string.

In other words, no MySQL string will ever grow longer than max_allowed_packet - the same limit holds for any other method of concatenating strings within MySQL, including cursor loops.

A cursor-less database size chart function

Without further ado, this is how I would write the database size chart function:

CREATE FUNCTION  f_dbsize_google_chart(
    p_chart_type ENUM('bhs','p')
,   p_width      MEDIUMINT UNSIGNED
,   p_height     MEDIUMINT UNSIGNED
)
RETURNS LONGTEXT
READS SQL DATA
BEGIN
    DECLARE v_sum_size, v_max_size DOUBLE;
    DECLARE v_size_series, v_size_labels LONGTEXT;

    -- store current group_concat_max_len so we can reset
    DECLARE v_group_concat_max_len BIGINT UNSIGNED DEFAULT @@group_concat_max_len;

    -- ensure group_concat capacity
    SET @@group_concat_max_len := @@max_allowed_packet;

    -- get the database size
    SELECT  ROUND(SUM(size),2), MAX(size)
    ,       GROUP_CONCAT(
                ROUND(size,2)
                ORDER BY size, table_schema
            )
    ,       GROUP_CONCAT(
                table_schema 
                ORDER BY size, table_schema
                SEPARATOR '|'
            )
    INTO    v_sum_size, v_max_size
    ,       v_size_series, v_size_labels
    FROM    (
            SELECT      table_schema
            ,           SUM(data_length + index_length) / 1024 / 1024 size
            FROM        information_schema.tables 
            WHERE       table_type = 'BASE TABLE'
            GROUP BY    table_schema
            ) a;

    -- restore original group_concat_max_len
    SET @@group_concat_max_len := v_group_concat_max_len;

    -- build URL
    RETURN CONCAT(
        'http://chart.apis.google.com/chart'
    ,   '?cht='   , p_chart_type
    ,   '&chs='   , p_width, 'x', p_height
    ,   '&chds=0,', v_max_size
    ,   '&chd=t:' , v_size_series
    ,   '&chdl='  , v_size_labels
    ,   '&chl='   , replace(v_size_series, ',', '|')
    ,   '&chtt=MySQL Database Size (', v_sum_size, 'MB)'
    );
END;

Here's a quick summary that points out some differences with regard to the original code:

Instead of accepting CHAR(1) for the chart type, an ENUM('bhs','p') is used to restrict the value to a listed type.
Instead of returning a varchar(3000), this function returns LONGTEXT, effectively leaving it to the Google Chart API to report a URL length limitation.
Instead of a cursor, we use a single SELECT...INTO statement.
The query in the SELECT...INTO statement uses a subquery in the FROM clause which is functionally equivalent to the actual cursor in the original code. A small but important improvement is the addition of a condition to restrict the result only to base tables. This automatically excludes the information_schema and databases that contain only views.
The outer query is functionally equivalent to the cursor loop and uses one GROUP_CONCAT() expression to obtain a list of values, and one GROUP_CONCAT() expression to obtain a list of labels. Here, we also calculate the scaling factor by taking the MAX of the database size.
To work around the truncation problem with GROUP_CONCAT, we set group_concat_max_len to the maximum practical value. We don't just set group_concat_max_len and leave it at that. We restore its original value at the end of the stored function.
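
Once the function has been created (in the mysql command line client this requires temporarily changing the statement delimiter, because the body contains semicolons), calling it could look like the sketch below. The argument values are arbitrary examples, chosen so that width times height stays under the 300000 pixel limit; chart_url is just an alias picked for this example.

```sql
-- horizontal bar chart, 600 pixels wide and 400 pixels high
SELECT f_dbsize_google_chart('bhs', 600, 400) AS chart_url;

-- the same data as a pie chart
SELECT f_dbsize_google_chart('p', 600, 400) AS chart_url;
```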

Wednesday, March 04, 2009

Woot! MySQL bug 11661 fixed, finally...

Yeah...title says it all.

Bug #11661, Raising Exceptions from within stored procedures: Support for SIGNAL statement.

I can't recall the exact moment of filing this bug, because I filed a lot of them. However, this one is special to me because I filed it when I was in the process of getting involved with MySQL and its community. I remember it as a very intense period of learning, meeting other community members online (mostly through forums.mysql.com) and blogging on a regular basis.

A lot has happened since then, and if you ask me, most of it is good stuff. The only thing that casts somewhat of a shadow is that it took so long to fix this. But I don't want to whine about it. Rather, I'd like to use this opportunity to extend a big, BIG THANK YOU!! to Marc Alff, who has been working very hard on implementing this feature. I should also mention Peter Gulutzan, whose enduring attention to quality and efforts to make MySQL comply more closely with the SQL Standard have made MySQL a better product.

Thanks Guys! I'm looking forward to using this and other features in MySQL 6.0.

Saturday, February 21, 2009

Rare find...

First time in my lifetime....

And indeed, about a minute later all is back to normal.

My respect goes out to the architects and admins of Wikipedia, who somehow manage to keep this large-scale operation running so smoothly. It's truly amazing.

Friday, February 20, 2009

Exporting a Kettle Repository to Files

Hi All!

Today I'd like to announce KREX, a small solution I put together to export a Kettle (a.k.a. Pentaho Data Integration) Repository to individual transformation (.ktr) and job (.kjb) files.

The idea to create this was inspired by this thread on the pentaho forums, started by kandrews. He (she?) wrote:

Has anyone ever been able to export a PDI repository and convert it somehow into regular non-repository .kjb & .ktr files? If you have done this already or this functionality already exists please let me know.

My initial thoughts are possibly an XLS translation against the XML from the repository export. Thoughts?

Well, I hope this helps! Enjoy, and let me know if it's useful. Be advised that in the same thread, Matt Casters already revealed that the functionality to do this will soon be built into PDI, but until then this may be of use.

To start using KREX,

Check out the repository or download the Job and Transformation files to your file system.

Open the main Job file export_repository_to_files.kjb using Pentaho Data Integration 3.2's Spoon (currently a Milestone 1 release).

Configure the Set Source Repository step in the set_source_repo_and_target_directory transformation to match the repository you want to export.

Run the main job file (export_repository_to_files.kjb).

If all goes well, you should now have a directory called pdi_repo_export in your home directory which contains a subdirectory named after your exported repository containing the directory tree with the .ktr and .kjb files.

Here's a quick screenshot of the main job, just to give you an idea:

The heart of the job is formed by the very last transformation, which does the actual legwork of extracting and saving the individual transformations:

The steps before that are mainly configuration and ensuring that the directory tree that is to contain the files is created before we attempt to write any files.

If you have any suggestions or comments, I welcome you to post them here. If you are trying to use KREX but run into an issue, please use the KREX issue list.

If you are looking for more tips and tricks with Kettle and Pentaho in general, stay tuned. The "Building Pentaho Solutions" book I'm writing for Wiley together with Jos van Dongen will contain tons and tons of practical tips and solutions, and explain many of its technologies and concepts in thorough detail.

Cheers and until next time,

Roland

Saturday, January 31, 2009

Loading a dimension table with SCD1 and SCD2 attributes

Jos, my co-author for the "Building Pentaho Solutions" book just pointed me to a recent article by Jeff Prenevost entitled "The Problem with History".

Abstract

Jeff's topic, loading a hybrid Type 1 / Type 2 slowly changing dimension table, is related to data warehousing but may be of interest outside of that context as well.

As it turns out, the particular problem described by Jeff is non-trivial, but can be solved quite elegantly in a single SQL statement. This may be a compelling alternative to the multi-step, multi-pass solution proposed in his article.

Type 1 and Type 2 Slowly Changing Dimensions

In his article, Jeff describes a method to load a slowly changing dimension (SCD) table from an audit trail. This would be quite straightforward if we were dealing with a Type 2 slowly changing dimension. In that case, each row in the audit trail would also yield one row in the dimension table. If the dimension were a Type 1 slowly changing dimension, the matter would only be slightly more complicated - in this case only the most recent version of each object would be loaded into the dimension table.

A Hybrid SCD Type 1/2

The interesting thing about the problem described in the article is that the dimension table is a hybrid Type 1 / Type 2 dimension. That is, for some attributes, history needs to be tracked in the dimension table (Type 2 attributes), whereas only the most recent data is required for other attributes (Type 1 attributes).

Sample Data

To make things tangible, here's a sample Employee audit trail:

         |    SCD Type 1        |    SCD Type 2  |
+--------+--------+-------------+--------+-------+------------+------------+
| empkey | name   | ssn         | gender | state | valid_from | valid_to   |
+--------+--------+-------------+--------+-------+------------+------------+
|     14 | Jo     | 323-10-1116 | F      | MI    | 1998-12-03 | 1998-12-27 |
|     14 | Jo     | 323-10-1119 | F      | MI    | 1998-12-28 | 2005-04-22 |
|     14 | Joe    | 323-10-1119 | F      | MI    | 2005-04-23 | 2005-08-07 |
|     14 | Joseph | 323-10-1119 | M      | MI    | 2005-08-08 | 2006-02-12 |
|     14 | Joe    | 323-10-1119 | M      | MI    | 2006-02-13 | 2006-07-04 |
|     14 | Joseph | 323-10-1119 | M      | NY    | 2006-07-05 | 2006-12-24 |
|     14 | Joseph | 323-10-1119 | M      | MI    | 2006-12-25 | NULL       |
|     15 | Jim    | 224-57-2726 | M      | IL    | 2002-01-16 | 2004-03-15 |
|     15 | James  | 224-57-2726 | M      | IL    | 2004-03-16 | 2007-06-22 |
|     15 | James  | 224-57-2726 | M      | IN    | 2007-06-23 | 2007-08-31 |
+--------+--------+-------------+--------+-------+------------+------------+

The data shows the history for the employees with empkey 14 and 15. All rows with the same empkey value form a time line of data change events. Each row represents a change event, updating some of the employee's data. The valid_from and valid_to columns are used to record when the data change event occurred, so the values in these columns change for each row. For the other columns, I used bold markup to make it easier to spot the change.

SCD Type 1 and 2 Attributes

In Jeff's article, the columns name and ssn end up as SCD Type 1 attributes, and the columns gender and state as SCD Type 2 attributes. I used color highlighting to mark up groups of consecutive rows where the values for the gender and state columns did not change. After loading the dimension table we must end up with one row for each such group, capturing all change events for these columns. The ssn and name must always contain the most recent data. I highlighted the most recent row using grey markup.

Resulting Dimension Table

The data for the resulting dimension table would look like this:

+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
| dw_key | empkey | name   | ssn         | gender | state | valid_from | valid_to   | is_current |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
|      1 |     14 | Joseph | 323-10-1119 | F      | MI    | 1998-12-03 | 2005-08-07 | N          |
|      2 |     14 | Joseph | 323-10-1119 | M      | MI    | 2005-08-08 | 2006-07-04 | N          |
|      3 |     14 | Joseph | 323-10-1119 | M      | NY    | 2006-07-05 | 2006-12-24 | N          |
|      4 |     14 | Joseph | 323-10-1119 | M      | MI    | 2006-12-25 | 9999-12-31 | Y          |
|      5 |     15 | James  | 224-57-2726 | M      | IL    | 2002-01-16 | 2007-06-22 | N          |
|      6 |     15 | James  | 224-57-2726 | M      | IN    | 2007-06-23 | 9999-12-31 | Y          |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+

As you can see, only the changes in the gender and state columns are recorded, ignoring any changes in the name and ssn columns. Instead, these columns get the values of the most recent change.

Jeff's Solution

Jeff's article describes a step-wise solution to this problem. To give you an idea and some context, here's a quote from the article describing the mindset in developing this approach:

While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.

The actual steps that make up the described solution can be summarized as follows:

Rank all rows according to the timeline

Join consecutive rows by rank in case there is a change in the SCD attributes

Break the joined rows back into two rows (Jeff calls this 'semi-pivot')

Add the last and first version for each object to the set

Add another rank number to the result set

At this point, the rows where the new rank number is odd contain the valid_from date, and rows with an even number contain the valid_to date. Join each odd row to its consecutive even rows.

Join to the last version of each object to fill in for the SCD Type 1 attributes, and set a flag for the latest version of the object.

I admit I have a hard time understanding Jeff's method, and I am completely puzzled as to how he invented it. My apologies if my summary isn't exactly crystal clear - I encourage you to read the original article, "The Problem with History" yourself if you are interested in the details.

My Alternative

First of all, let me say that I am impressed with Jeff's creativity. I would never think of this solution and I am not really sure I even understand it. I also am not categorically opposed to Jeff's initial mindset: I too think that it is sometimes better to avoid complex SQL statements in favor of a series of relatively simple ones.

That said, when I first read Jeff's article, my first impression was: "Wow...that's a lot of moving parts". My second impression was: "Darn, that's a lot of passes over the same data set". All the time, I had this hunch that it wouldn't be too hard to do it in two statements (one SELECT...INTO, one UPDATE), or perhaps even in one (SELECT...INTO).

Preparation

I started by setting up the test data in MySQL 5.1. I'm including the script here for the readers' convenience:


CREATE TABLE  hrsource (
    empkey     int         NOT NULL
,   name       varchar(10) NOT NULL
,   ssn        char(11)    NOT NULL
,   gender     char(1)     NOT NULL
,   state      char(2)     NOT NULL
,   valid_from date        NOT NULL
,   valid_to   date       
,   PRIMARY KEY(empkey, valid_from)
);

INSERT INTO hrsource 
(empkey,name,ssn,gender,state,valid_from,valid_to) VALUES
 (14, 'Jo'    , '323-10-1116', 'F', 'MI', '1998-12-03', '1998-12-27')
,(14, 'Jo'    , '323-10-1119', 'F', 'MI', '1998-12-28', '2005-04-22')
,(14, 'Joe'   , '323-10-1119', 'F', 'MI', '2005-04-23', '2005-08-07')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2005-08-08', '2006-02-12')
,(14, 'Joe'   , '323-10-1119', 'M', 'MI', '2006-02-13', '2006-07-04')
,(14, 'Joseph', '323-10-1119', 'M', 'NY', '2006-07-05', '2006-12-24')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2006-12-25', NULL)
,(15, 'Jim'   , '224-57-2726', 'M', 'IL', '2002-01-16', '2004-03-15')
,(15, 'James' , '224-57-2726', 'M', 'IL', '2004-03-16', '2007-06-22')
,(15, 'James' , '224-57-2726', 'M', 'IN', '2007-06-23', '2007-08-31');

Of course, this is not a lot of data, but it is the data from the original article, dutifully hacked into my computer's keyboard by yours truly.

Solution

Without further ado, this is my solution for selecting the dimension table's dataset:

SELECT     prv.empkey
,          curr.name
,          curr.ssn
,          prv.gender
,          prv.state
,          MIN(prv.valid_from)                        valid_from
,          COALESCE(
               nxt.valid_from - INTERVAL 1 DAY
           ,   '9999-12-31'
           )                                          valid_to
,          CASE 
               WHEN nxt.valid_from IS NULL THEN 'Y'
               ELSE 'N'
           END                                        is_current
FROM       hrsource                curr
INNER JOIN (
            SELECT   empkey
            ,        MAX(valid_from) valid_from
            FROM     hrsource
            GROUP BY empkey
           )                       curr1
ON         curr.empkey           = curr1.empkey
AND        curr.valid_from       = curr1.valid_from
INNER JOIN hrsource                prv
ON         curr.empkey           = prv.empkey
LEFT JOIN  hrsource                nxt
ON         prv.empkey            = nxt.empkey
AND        prv.valid_to          < nxt.valid_from
AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state)
LEFT JOIN  hrsource                inb
ON         prv.empkey            = inb.empkey
AND        prv.valid_to          < inb.valid_from
AND        nxt.valid_from        > inb.valid_to
AND       (prv.gender,prv.state)!= (inb.gender,inb.state)
WHERE      inb.empkey IS NULL
GROUP BY   prv.empkey
,          nxt.valid_from

Now I would certainly not qualify this as a simple statement. At the same time, I feel this does not resemble the "enormous heaps of nested SQL" so dreaded by Jeff. Let me explain how it works, and you can judge for yourself.

Explanation

I will now give a step-by-step explanation of my statement. I hope it clears up any doubt you might have as to how my solution solves the problem.

SCD Type 1 attributes: Isolating the Most Recent Change

In the introduction, I mentioned that loading the dimension table would be easy if it were purely a Type 1 or purely a Type 2 slowly changing dimension. Well, we know that, come what may, we will always need the most recent row for each empkey to supply values for the Type 1 attributes name and ssn. So, we start by attacking that part of the problem.

The following fragment of my solution does exactly that:

FROM       hrsource                curr
INNER JOIN (
            SELECT   empkey
            ,        MAX(valid_from) valid_from
            FROM     hrsource
            GROUP BY empkey
           )                       curr1
ON         curr.empkey           = curr1.empkey
AND        curr.valid_from       = curr1.valid_from

The subquery curr1 finds us the highest value for the valid_from column for each distinct empkey. The GROUP BY empkey clause ensures we get exactly one row for each distinct value of empkey. From all rows with the same empkey value, the MAX(valid_from) bit finds the largest valid_from value.

The combination empkey, valid_from is the primary key of the employee audit trail. This means we can now use the empkey, MAX(valid_from) pair from the curr1 subquery to point out the most recent row for each distinct empkey. The INNER JOIN with the hrsource table (dubbed curr) does exactly that.
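To see this step on its own, the fragment can be wrapped in a small stand-alone query. This is only a sketch: it assumes the hrsource table has been loaded with the sample rows shown above, and the columns in the SELECT list are picked purely for illustration.

```sql
-- Most recent audit trail row per employee (the source of the Type 1 attributes).
-- Assumes the hrsource table holds the sample rows from this article.
SELECT     curr.empkey
,          curr.name
,          curr.ssn
,          curr.valid_from
FROM       hrsource                curr
INNER JOIN (
            SELECT   empkey
            ,        MAX(valid_from) valid_from   -- latest change per employee
            FROM     hrsource
            GROUP BY empkey
           )                       curr1
ON         curr.empkey           = curr1.empkey
AND        curr.valid_from       = curr1.valid_from
ORDER BY   curr.empkey;
```

For the two employees in the sample rows above, this should return Joseph (valid from 2006-12-25) for empkey 14 and James (valid from 2007-06-23) for empkey 15.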

Grouping consecutive rows with identical Type 2 attributes

We must now deal with the second part of the problem, the SCD Type 2 attributes.

If you look back at the color highlighting in the sample data set, you may realize this is basically a grouping problem: for each distinct empkey we need to group rows with identical values in the Type 2 columns gender and state. From the point of view of the dimension table, no change occurred within this group, so we should store it as one row in the dimension table, and reconstruct the change dates by taking the minimum valid_from and maximum valid_to values from the group.

We must be careful though: we can't simply make groups of all distinct combinations of the SCD Type 2 columns. Look for example at the rows having 14 for the empkey column. The rows with the valid_from values 2005-08-08, 2006-02-13 and 2006-12-25 all have the identical combination (M, MI) in the Type 2 columns gender and state respectively. However, only the first two of these belong together in a group. The row with 2006-12-25 in the valid_from column does not belong to the group because another change event occurred on 2006-07-05, which lies between 2006-02-13 and 2006-12-25.

So, we must sharpen up the definition of the problem. It is not enough to make groups of identical combinations of Type 2 attributes: we must also demand that the rows in the group are consecutive.

The first step in attacking this problem is to generate combinations of the audit trail rows so that we have the first row of each group alongside the first row of the next group. The following fragment solves part of that problem by combining each row (alias: prv) with all more recent rows (alias: nxt) that have the same empkey value but different values in the Type 2 columns gender and state:


INNER JOIN hrsource                prv
ON         curr.empkey           = prv.empkey
LEFT JOIN  hrsource                nxt
ON         prv.empkey            = nxt.empkey             -- timeline of same employee
AND        prv.valid_to          < nxt.valid_from         -- nxt must be more recent than prv
AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state) -- type 2 attributes differ

(Please ignore the part I struck out, the INNER JOIN to prv in the first two lines - it's not relevant at this point)

Note the usage of the LEFT JOIN. It ensures that if prv is the most recent row, and there is by definition no row in nxt that is more recent, the prv row is still retained. Had we used an INNER JOIN, we would have lost that row, messing up the result.
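The row comparison in the last join condition may look unfamiliar. In MySQL, (a, b) != (x, y) is shorthand for a != x OR b != y, so the condition is true as soon as at least one of the Type 2 columns differs. A minimal sketch with literal values, not tied to the hrsource table (the column aliases are made up for illustration):

```sql
-- Row comparison versus the spelled-out form, using literal values.
-- 1 means "differs", 0 means "same"; both forms should agree.
SELECT ('M','MI') != ('M','NY')       AS row_form_differs    -- 1
,      ('M' != 'M' OR 'MI' != 'NY')   AS long_form_differs   -- 1
,      ('M','MI') != ('M','MI')       AS row_form_same       -- 0
,      ('M' != 'M' OR 'MI' != 'MI')   AS long_form_same;     -- 0
```

Note that this assumes gender and state are never NULL. If one of them is NULL, the comparison can evaluate to NULL instead of true, and the join condition would then not match.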

If we run this fragment in isolation, we can certainly see rows that pair the first row of a group with the first row of the next group. For example, if you run this query:

SELECT    prv.empkey
,         prv.valid_from prv_from, prv.valid_to prv_to
,         nxt.valid_from nxt_from, nxt.valid_to nxt_to
,         prv.gender prv_gender
,         nxt.gender nxt_gender
FROM      hrsource prv
LEFT JOIN hrsource nxt
ON        prv.empkey   = nxt.empkey
AND       prv.valid_to < nxt.valid_from
AND      (prv.gender, prv.state) != (nxt.gender, nxt.state)

We get results like this:


+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
|     14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F          | M          |
|     14 | 1998-12-03 | 1998-12-27 | 2006-02-13 | 2006-07-04 | F          | M          |
|     14 | 1998-12-03 | 1998-12-27 | 2006-07-05 | 2006-12-24 | F          | M          |
.        .            .            .            .            .            .            .
.        .            ...more rows...           .            .            .            .
.        .            .            .            .            .            .            .
+--------+------------+------------+------------+------------+------------+------------+

The first row, highlighted in green, is a desired combination because the valid_from date of nxt is closest to that of prv. It is this row from nxt that marks the end of the group of rows starting with prv. For the two rows below it, highlighted in red, the nxt part indicates a change that occurred beyond that point, so these combinations are not desired.

To get rid of these undesired rows, we simply have to demand that no row indicating a change in the SCD Type 2 attributes occurs between prv and nxt. In my query, the following fragment is responsible for that part:

LEFT JOIN  hrsource                inb
ON         prv.empkey            = inb.empkey             -- time line of same employee
AND        prv.valid_to          < inb.valid_from         -- more recent than prv
AND       (prv.gender,prv.state)!= (inb.gender,inb.state) -- Type 2 columns changed
AND        nxt.valid_from        > inb.valid_to           -- less recent than nxt
WHERE      inb.empkey IS NULL                             -- does not exist

This fragment essentially states that there must not be any row in between prv and nxt that indicates a change in the SCD Type 2 attributes as compared to prv. Note that the first part of the join condition is identical to the one we used to join nxt to prv. The extra condition is that inb must also lie before nxt. Because we used a LEFT JOIN, the result row is retained in case no such inb row exists. The WHERE inb.empkey IS NULL condition explicitly filters for these cases. By definition, this means that the nxt part in the result row marks the end of whatever group the prv row belongs to.
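The LEFT JOIN ... IS NULL construct is one way to say "no such row exists". As a sketch of an alternative spelling (not the form used in the solution above), the same requirement could be written with NOT EXISTS. It assumes the same hrsource sample table, and the reduced SELECT list is only there to keep the output readable:

```sql
-- Pair each prv row with the first differing nxt row, using NOT EXISTS
-- instead of LEFT JOIN ... IS NULL. Assumes the hrsource sample table.
SELECT    prv.empkey
,         prv.valid_from prv_from
,         nxt.valid_from nxt_from
FROM      hrsource prv
LEFT JOIN hrsource nxt
ON        prv.empkey   = nxt.empkey
AND       prv.valid_to < nxt.valid_from
AND      (prv.gender, prv.state) != (nxt.gender, nxt.state)
WHERE     NOT EXISTS (
              SELECT 1
              FROM   hrsource inb
              WHERE  inb.empkey     = prv.empkey      -- timeline of same employee
              AND    inb.valid_from > prv.valid_to    -- more recent than prv
              AND   (inb.gender, inb.state) != (prv.gender, prv.state) -- Type 2 columns changed
              AND    inb.valid_to   < nxt.valid_from  -- less recent than nxt
          )
ORDER BY  prv.empkey, prv.valid_from;
```

When prv has no nxt row at all, nxt.valid_from is NULL, the subquery finds nothing, and the prv row is kept, just as with the LEFT JOIN version.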

Rolling Up the Rows in their Groups

Now, if we put these parts together we get something like this:


SELECT    prv.empkey
,         prv.valid_from prv_from, prv.valid_to prv_to
,         nxt.valid_from nxt_from, nxt.valid_to nxt_to
,         prv.gender prv_gender
,         nxt.gender nxt_gender
FROM      hrsource                  prv
LEFT JOIN hrsource                  nxt
ON        prv.empkey              = nxt.empkey
AND       prv.valid_to            < nxt.valid_from
AND      (prv.gender, prv.state) != (nxt.gender, nxt.state)
LEFT JOIN hrsource                  inb
ON        prv.empkey              = inb.empkey
AND       prv.valid_to            < inb.valid_from
AND      (prv.gender, prv.state) != (inb.gender, inb.state)
AND       nxt.valid_from          > inb.valid_to
WHERE     inb.empkey IS NULL

Some of the results are here:

+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
|     14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F          | M          |
|     14 | 1998-12-28 | 2005-04-22 | 2005-08-08 | 2006-02-12 | F          | M          |
|     14 | 2005-04-23 | 2005-08-07 | 2005-08-08 | 2006-02-12 | F          | M          |
.        .            .            .            .            .            .            .
.        .            ...more rows...           .            .            .            .
.        .            .            .            .            .            .            .
+--------+------------+------------+------------+------------+------------+------------+

Now, we still see the individual rows. However, because we have paired each of them with the row that definitely marks the end of the group it belongs to, we can now lump them together using GROUP BY. This explains the final bits of the query:

GROUP BY   prv.empkey
,          nxt.valid_from

For each value in empkey, this GROUP BY clause essentially rolls up the group of rows that showed no change in the SCD Type 2 attributes until the occurrence of the nxt row. In the SELECT list, we can now use the MIN function to find the date of the first row in the group.

Note that the GROUP BY list is the thinnest one possible. The SELECT list contains several more columns that appear neither in the GROUP BY list nor inside an aggregate function. However, it is perfectly safe to do so in this case, because each of those columns has the same value in every row of a group. For more details on this matter, please read my Debunking Group By Myths article.
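One caveat: whether MySQL accepts such a thin GROUP BY list depends on the server's SQL mode. With ONLY_FULL_GROUP_BY enabled (part of the default SQL mode since MySQL 5.7.5), a statement that selects non-aggregated columns such as prv.gender without grouping on them is likely to be rejected. A minimal sketch of a workaround, assuming the same hrsource table, is to wrap those columns in MIN. Since every row in a group carries the same value, the result should not change:

```sql
-- Variant of the full query that avoids non-aggregated, non-grouped columns.
-- MIN() is used only to satisfy ONLY_FULL_GROUP_BY; the wrapped columns are
-- constant within each group, so MIN() simply returns that value.
SELECT     prv.empkey
,          MIN(curr.name)                             name
,          MIN(curr.ssn)                              ssn
,          MIN(prv.gender)                            gender
,          MIN(prv.state)                             state
,          MIN(prv.valid_from)                        valid_from
,          COALESCE(
               nxt.valid_from - INTERVAL 1 DAY
           ,   '9999-12-31'
           )                                          valid_to
,          CASE
               WHEN nxt.valid_from IS NULL THEN 'Y'
               ELSE 'N'
           END                                        is_current
FROM       hrsource                curr
INNER JOIN (
            SELECT   empkey
            ,        MAX(valid_from) valid_from
            FROM     hrsource
            GROUP BY empkey
           )                       curr1
ON         curr.empkey           = curr1.empkey
AND        curr.valid_from       = curr1.valid_from
INNER JOIN hrsource                prv
ON         curr.empkey           = prv.empkey
LEFT JOIN  hrsource                nxt
ON         prv.empkey            = nxt.empkey
AND        prv.valid_to          < nxt.valid_from
AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state)
LEFT JOIN  hrsource                inb
ON         prv.empkey            = inb.empkey
AND        prv.valid_to          < inb.valid_from
AND        nxt.valid_from        > inb.valid_to
AND       (prv.gender,prv.state)!= (inb.gender,inb.state)
WHERE      inb.empkey IS NULL
GROUP BY   prv.empkey
,          nxt.valid_from;
```

The expressions on nxt.valid_from need no wrapping, because that column is part of the GROUP BY list.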

Glueing the SCD Type 1 and Type 2 Solutions together

We developed the solution as two separate queries - one to take care of the SCD Type 1 attributes, one for the SCD Type 2 attributes. These two are simply joined together over the empkey. The code to do this is the INNER JOIN hrsource prv ON curr.empkey = prv.empkey part, which was displayed struck out in the first query fragment that deals with the SCD Type 2 attributes.

Calculating the Result columns

The only part of the query that was not discussed yet is the SELECT list:

SELECT     prv.empkey
,          curr.name
,          curr.ssn
,          prv.gender
,          prv.state
,          MIN(prv.valid_from)                        valid_from
,          COALESCE(
               nxt.valid_from - INTERVAL 1 DAY
           ,   '9999-12-31'
           )                                          valid_to
,          CASE 
               WHEN nxt.valid_from IS NULL THEN 'Y'
               ELSE 'N'
           END                                        is_current

Note that we select the name and ssn columns from the curr table. This ensures these columns will always reflect the data of the most recent change, which is typical for SCD Type 1 attributes. The gender and state columns, on the other hand, are drawn from prv. They reflect the change history for each individual employee.

The valid_from column is aggregated using the MIN function. This essentially yields the date of the first row of each group of consecutive rows having the same empkey and values in the SCD Type 2 columns.

The valid_to date is calculated from the valid_from date of the first row of the next group of rows that indicates a change of values in the SCD Type 2 columns. This is the relevant code:

nxt.valid_from - INTERVAL 1 DAY

Now, to be completely fair - this is in fact cheating. It implies I trust the data to be such that the value for the valid_to date is always exactly one day less than the value for the valid_from date for the next row in the history. This was certainly the case with the sample data, but we cannot really rely on that. That said, it is not that hard to get the 'real' valid_to date, but it would make this example a lot harder to understand.
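One possible way to get at the recorded valid_to is sketched below. It rests on assumptions that go beyond the example: that the 'real' valid_to means the valid_to of the last row in each group, that periods for one employee do not overlap, and that only an open-ended row has NULL in valid_to. The join to curr for the Type 1 attributes and the is_current flag are left out to keep the sketch focused on the date.

```sql
-- Sketch: take valid_to from the data instead of deriving it from nxt.
-- Assumes the hrsource sample table and non-overlapping periods per employee.
SELECT     prv.empkey
,          prv.gender
,          prv.state
,          MIN(prv.valid_from)                        valid_from
,          COALESCE(
               CASE
                   -- COUNT(col) skips NULLs, so the two counts differ only
                   -- when the group contains an open-ended row
                   WHEN COUNT(*) = COUNT(prv.valid_to)
                   THEN MAX(prv.valid_to)
               END
           ,   '9999-12-31'
           )                                          valid_to
FROM       hrsource                prv
LEFT JOIN  hrsource                nxt
ON         prv.empkey            = nxt.empkey
AND        prv.valid_to          < nxt.valid_from
AND       (prv.gender,prv.state)!= (nxt.gender,nxt.state)
LEFT JOIN  hrsource                inb
ON         prv.empkey            = inb.empkey
AND        prv.valid_to          < inb.valid_from
AND        nxt.valid_from        > inb.valid_to
AND       (prv.gender,prv.state)!= (inb.gender,inb.state)
WHERE      inb.empkey IS NULL
GROUP BY   prv.empkey
,          nxt.valid_from;
```

With the sample rows, the only visible difference should be the last group for empkey 15, which would end at the recorded 2007-08-31 rather than at '9999-12-31'. Whether such a closed last row should still count as the current row is a separate decision that this sketch does not make.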

Another thing about the calculation of valid_to is the treatment of NULL values. For each history associated with one distinct empkey value, the last row will have NULL here, and this is substituted by a large date instead. Dimension tables tend to avoid nullable columns, and in this case the maximum value '9999-12-31' expresses that this row is the current row. Using that same logic, an extra flag column is added to indicate whether the row is the current row.