
Tuesday, May 10, 2011

Managing kettle job configuration

Over time I've grown a habit of making a configuration file for my kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done but against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, webservice URLs and file system paths.

Variables


In the past, much has been written about using kettle variables, parameters and arguments. Variables are the basic feature that provides the mechanism to configure transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm just focusing on how to manage the configuration once you have already used variable references inside your jobs and transformations.

Managing configuration


To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to variables so any subsequent jobs and transformations can access the configuration data through variable references. The main job has one parameter called ${CONFIG_DIR} which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file:


Reading configuration properties


The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and the value the appropriate value. The following snippet should give you an idea:
#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9
The set-variables.ktr transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:
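The parsing done by the "Property Input" step can be sketched in a few lines of Python (a stand-in for illustration only - Kettle does this for you, and the read_properties helper is hypothetical):

```python
# Sketch (not Kettle itself) of what the "Property Input" step yields:
# a stream of rows with two fields, "Key" and "Value".

def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        rows.append({"Key": key.strip(), "Value": value.strip()})
    return rows

config = """\
#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
"""

for row in read_properties(config):
    print(row["Key"], "=", row["Value"])
```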


Pivoting key/value pairs to use the "set variables" step


In the past, I used to set the variables using the "Set variables" step. This step creates a variable from selected fields in the incoming stream and assigns the field value to it. This means that you can't just feed the stream of key/value pairs from the property input step into the set variables step: the stream coming out of the property input step contains multiple rows with just two fields, "Key" and "Value". Feeding it directly into the "Set variables" step would create just two variables, called Key and Value, which would be reassigned once for every key/value pair in the stream. So in order to assign the variables meaningfully, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream, using the "Row Denormaliser" step:

As you can see in the screenshot, "Key" is the key field: the value of this field is scanned to determine in which output fields to put the corresponding value. There are no fields that make up a grouping: rather, we want all key/value pairs to end up in one big row. Or put another way, there is just one group comprising all key/value pairs. Finally, the grid below specifies for each distinct value of the "Key" field to which output field name it should be mapped, and in all cases, we want the value of the "Value" field to be stored in those fields.
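The pivot performed by the "Row Denormaliser" step can be sketched in Python (a stand-in, not Kettle code; the field names are taken from the config.properties example above):

```python
# Sketch of what the "Row Denormaliser" step does here: pivot a stream of
# key/value rows into a single wide row with one field per key.
rows = [
    {"Key": "STAGING_HOST", "Value": "localhost"},
    {"Key": "STAGING_PORT", "Value": "3351"},
    {"Key": "STAGING_USER", "Value": "staging"},
]

# There is just one group comprising all key/value pairs -> one output row.
wide_row = {row["Key"]: row["Value"] for row in rows}
print(wide_row)
```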

Drawbacks


There are two important drawbacks to this approach:

  • The "Row Denormaliser" step uses the value of the keys to map each value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you need to manually keep the config.properties file and the "Row Denormaliser" step synchronized, and in practice it's very easy to make mistakes here.
  • Because the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific to just one particular project.

Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variable names anyway, I was tempted to store the configuration values themselves inside the transformation as well (using a "Generate rows" or "Data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.

Solution: Javascript


As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation now looks like this:



The actual variable assignment is done with Kettle's built-in setVariable(key, value, scope) function. The key and value from the incoming stream are passed as arguments to the key and value arguments of the setVariable() function. The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide

  • "r" - up to the root

  • "p" - up to the parent job of this transformation

  • "g" - up to the grandparent job of this transformation

For my purpose, I settle for "r".
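The approach can be sketched in Python (a stand-in for the JavaScript step; the variables dict and set_variable function are hypothetical stand-ins for Kettle's variable space and built-in setVariable):

```python
# Python stand-in for the JavaScript step: loop over the key/value stream
# and assign each pair as a variable. In Kettle it is one call per row:
#     setVariable(Key, Value, "r");
variables = {}  # stands in for Kettle's variable space at "r" scope

def set_variable(key, value, scope="r"):
    # scope would be "s", "r", "p" or "g" in Kettle; ignored in this sketch
    variables[key] = value

stream = [("STAGING_HOST", "localhost"), ("STAGING_PORT", "3351")]
for key, value in stream:
    set_variable(key, value, "r")
```

Because the loop never mentions any particular key, the same step body works for any config.properties file.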

The bonus is that this set-variables.ktr is less complex than the previous one and is now even completely independent of the content of the configuration. It has become a reusable transformation that you can use over and over.

Sunday, March 14, 2010

Writing another book: Pentaho Kettle Solutions

Last year, at about this time of the year, I was well involved in the process of writing the book "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have nothing but happy memories of the collaboration with my full co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Tomas Morgner, several of the Pentaho developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews on places like Amazon, and elsewhere on the web. I'd like to use this opportunity to thank everybody who took the time to review the book: Thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enormously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"


In the autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and well-renowned publisher in the field of data warehousing and business intelligence, voiced interest in another Pentaho book by Jos and me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, chief of data integration at Pentaho and founding developer of Kettle (a.k.a. Pentaho Data Integration). Neither Jos nor I expected Matt to be able to free up any time in his ever-busy schedule to help us write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing ever since, focusing on the forthcoming Kettle version, Kettle 4.0. The tentative title of the book is Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. It is planned to be published in September 2010, and it will have approximately 750 pages.



Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we have finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservations about releasing it already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution


  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes


  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting


  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallelization and Partitioning

    • Dynamic Clustering in the Cloud

    • Realtime and Streaming data


  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle


  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems



Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.

Saturday, January 31, 2009

Loading a dimension table with SCD1 and SCD2 attributes

Jos, my co-author for the "Pentaho Solutions" book, just pointed me to a recent article by Jeff Prenevost entitled "The Problem with History".

Abstract

Jeff's topic, loading a hybrid Type 1 / Type 2 slowly changing dimension table, is related to data warehousing, but may be of interest outside of that context as well.

As it turns out, the particular problem described by Jeff is non-trivial, but can be solved quite elegantly in a single SQL statement. This may be a compelling alternative to the multi-step, multi-pass solution proposed in his article.

Type 1 and Type 2 Slowly Changing Dimensions

In his article, Jeff describes a method to load a slowly changing dimension (SCD) table from an audit trail. This would be quite straightforward if we were dealing with a Type 2 slowly changing dimension: each row in the audit trail would also yield one row in the dimension table. If the dimension were a Type 1 slowly changing dimension, the matter would be only slightly more complicated - in that case, only the most recent version of each object would be loaded into the dimension table.

A Hybrid SCD Type 1/2

The interesting thing about the problem described in the article is that the dimension table is a hybrid Type 1 / Type 2 dimension. That is, for some attributes, history needs to be tracked in the dimension table (Type 2 attributes), whereas only the most recent data is required for other attributes (Type 1 attributes).

Sample Data

To make things tangible, here's a sample Employee audit trail:
         |      SCD Type 1      |   SCD Type 2   |
+--------+--------+-------------+--------+-------+------------+------------+
| empkey | name   | ssn         | gender | state | valid_from | valid_to   |
+--------+--------+-------------+--------+-------+------------+------------+
| 14     | Jo     | 323-10-1116 | F      | MI    | 1998-12-03 | 1998-12-27 |
| 14     | Jo     | 323-10-1119 | F      | MI    | 1998-12-28 | 2005-04-22 |
| 14     | Joe    | 323-10-1119 | F      | MI    | 2005-04-23 | 2005-08-07 |
| 14     | Joseph | 323-10-1119 | M      | MI    | 2005-08-08 | 2006-02-12 |
| 14     | Joe    | 323-10-1119 | M      | MI    | 2006-02-13 | 2006-07-04 |
| 14     | Joseph | 323-10-1119 | M      | NY    | 2006-07-05 | 2006-12-24 |
| 14     | Joseph | 323-10-1119 | M      | MI    | 2006-12-25 | NULL       |
| 15     | Jim    | 224-57-2726 | M      | IL    | 2002-01-16 | 2004-03-15 |
| 15     | James  | 224-57-2726 | M      | IL    | 2004-03-16 | 2007-06-22 |
| 15     | James  | 224-57-2726 | M      | IN    | 2007-06-23 | 2007-08-31 |
+--------+--------+-------------+--------+-------+------------+------------+
The data shows the history for the employees with empkey 14 and 15. All rows with the same empkey value form a time line of data change events. Each row represents a change event, updating some of the employee's data. The valid_from and valid_to columns are used to record when the data change event occurred, so the values in these columns change for each row. For the other columns, I used bold markup to make it easier to spot the change.

SCD Type 1 and 2 Attributes

In Jeff's article, the columns name and ssn end up as SCD Type 1 attributes, and the columns gender and state as SCD Type 2 attributes. I used color highlighting to mark up groups of consecutive rows where the values for the gender and state columns did not change. After loading the dimension table we must end up with one row for each such group, capturing all change events for these columns. The ssn and name must always contain the most recent data. I highlighted the most recent row using grey markup.

Resulting Dimension Table

The data for the resulting dimension table would look like this:
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
| dw_key | empkey | name   | ssn         | gender | state | valid_from | valid_to   | is_current |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
| 1      | 14     | Joseph | 323-10-1119 | F      | MI    | 1998-12-03 | 2005-08-07 | N          |
| 2      | 14     | Joseph | 323-10-1119 | M      | MI    | 2005-08-08 | 2006-07-04 | N          |
| 3      | 14     | Joseph | 323-10-1119 | M      | NY    | 2006-07-05 | 2006-12-24 | N          |
| 4      | 14     | Joseph | 323-10-1119 | M      | MI    | 2006-12-25 | 9999-12-31 | Y          |
| 5      | 15     | James  | 224-57-2726 | M      | IL    | 2002-01-16 | 2007-06-22 | N          |
| 6      | 15     | James  | 224-57-2726 | M      | IN    | 2007-06-23 | 9999-12-31 | Y          |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
As you can see, only the changes in the gender and state columns are recorded, ignoring any changes in the name and ssn columns. Instead, these columns get the values of the most recent change.

Jeff's Solution

Jeff's article describes a step-wise solution to this problem. To give you an idea and some context, here's a quote from the article describing the mindset in developing this approach:

While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.
The actual steps that make up the described solution can be summarized as follows:

  • Rank all rows according to the timeline

  • Join consecutive rows by rank in case there is a change in the SCD attributes

  • Break the joined rows back into two rows (Jeff calls this 'semi-pivot')

  • Add the last and first version for each object to the set

  • Add another rank number to the result set

  • At this point, the rows where the new rank number is odd contain the valid_from date, and the rows with an even number contain the valid_to date. Join each odd row to its consecutive even row.

  • Join to the last version of each object to fill in for the SCD Type 1 attributes, and set a flag for the latest version of the object.

I admit I have a hard time understanding Jeff's method, and I am completely puzzled as to how he invented it. My apologies if my summary isn't exactly crystal clear - I encourage you to read the original article, "The Problem with History" yourself if you are interested in the details.

My Alternative

First of all, let me say that I am impressed with Jeff's creativity. I would never have thought of this solution, and I am not really sure I even understand it. I am also not categorically opposed to Jeff's initial mindset: I too think that it is sometimes better to avoid complex SQL statements in favor of a series of relatively simple ones.

That said, when I first read Jeff's article, my first impression was: "Wow... that's a lot of moving parts". My second impression was: "Darn, that's a lot of passes over the same data set". All the time, I had this hunch that it wouldn't be too hard to do it in two statements (one SELECT...INTO, one UPDATE), or perhaps even in one (SELECT...INTO).

Preparation

I started by setting up the test data in MySQL 5.1. I'm including the script here for the readers' convenience:

CREATE TABLE hrsource (
  empkey     int          NOT NULL
, name       varchar(10)  NOT NULL
, ssn        char(11)     NOT NULL
, gender     char(1)      NOT NULL
, state      char(2)      NOT NULL
, valid_from date         NOT NULL
, valid_to   date
, PRIMARY KEY (empkey, valid_from)
);

INSERT INTO hrsource
(empkey,name,ssn,gender,state,valid_from,valid_to) VALUES
(14, 'Jo' , '323-10-1116', 'F', 'MI', '1998-12-03', '1998-12-27')
,(14, 'Jo' , '323-10-1119', 'F', 'MI', '1998-12-28', '2005-04-22')
,(14, 'Joe' , '323-10-1119', 'F', 'MI', '2005-04-23', '2005-08-07')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2005-08-08', '2006-02-12')
,(14, 'Joe' , '323-10-1119', 'M', 'MI', '2006-02-13', '2006-07-04')
,(14, 'Joseph', '323-10-1119', 'M', 'NY', '2006-07-05', '2006-12-24')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2006-12-25', NULL)
,(15, 'Jim' , '224-57-2726', 'M', 'IL', '2002-01-16', '2004-03-15')
,(15, 'James' , '224-57-2726', 'M', 'IL', '2004-03-16', '2007-06-22')
,(15, 'James' , '224-57-2726', 'M', 'IN', '2007-06-23', '2007-08-31');
Of course, this is not a lot of data, but it is the data from the original article, dutifully hacked into my computer's keyboard by yours truly.

Solution

Without further ado, this is my solution for selecting the dimension table's dataset:
SELECT     prv.empkey
,          curr.name
,          curr.ssn
,          prv.gender
,          prv.state
,          MIN(prv.valid_from) valid_from
,          COALESCE(
               nxt.valid_from - INTERVAL 1 DAY
           ,   '9999-12-31'
           ) valid_to
,          CASE
               WHEN nxt.valid_from IS NULL THEN 'Y'
               ELSE 'N'
           END is_current
FROM       hrsource curr
INNER JOIN (
               SELECT   empkey
               ,        MAX(valid_from) valid_from
               FROM     hrsource
               GROUP BY empkey
           ) curr1
ON         curr.empkey = curr1.empkey
AND        curr.valid_from = curr1.valid_from
INNER JOIN hrsource prv
ON         curr.empkey = prv.empkey
LEFT JOIN  hrsource nxt
ON         prv.empkey = nxt.empkey
AND        prv.valid_to < nxt.valid_from
AND        (prv.gender, prv.state) != (nxt.gender, nxt.state)
LEFT JOIN  hrsource inb
ON         prv.empkey = inb.empkey
AND        prv.valid_to < inb.valid_from
AND        nxt.valid_from > inb.valid_to
AND        (prv.gender, prv.state) != (inb.gender, inb.state)
WHERE      inb.empkey IS NULL
GROUP BY   prv.empkey
,          nxt.valid_from
Now I would certainly not qualify this as a simple statement. At the same time, I feel this does not resemble the "enormous heaps of nested SQL" so dreaded by Jeff. Let me explain how it works, and you can judge for yourself.

Explanation

I will now do a step-by-step explanation of my statement. I hope it clears up any doubt you might have as to how my solution solves the problem.

SCD Type 1 attributes: Isolating the Most Recent Change

In the introduction, I mentioned that loading the dimension table would be easy if it were either a pure Type 1 or a pure Type 2 slowly changing dimension. Well, we know that, come what may, we will always need the most recent row for each empkey to supply values for the Type 1 attributes name and ssn. So, we start by attacking that part of the problem.

The following fragment of my solution does exactly that:
FROM       hrsource curr
INNER JOIN (
               SELECT   empkey
               ,        MAX(valid_from) valid_from
               FROM     hrsource
               GROUP BY empkey
           ) curr1
ON         curr.empkey = curr1.empkey
AND        curr.valid_from = curr1.valid_from
The subquery curr1 finds us the highest value for the valid_from column for each distinct empkey. The GROUP BY empkey clause ensures we get exactly one row for each distinct value of empkey. From all rows with the same empkey value, the MAX(valid_from) bit finds the largest valid_from value.

The combination empkey, valid_from is the primary key of the employee audit trail. This means we can now use the empkey, MAX(valid_from) pair from the curr1 subquery to point out the most recent row for each distinct empkey. The INNER JOIN with the hrsource table (dubbed curr) does exactly that.
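The effect of the curr/curr1 join can be sketched in Python (an illustration only, not the SQL itself; the field subset is trimmed for brevity):

```python
# Sketch of the curr/curr1 join: pick the most recent audit row
# (highest valid_from) per empkey to supply the Type 1 attributes.
rows = [
    {"empkey": 14, "name": "Joe",    "valid_from": "2006-02-13"},
    {"empkey": 14, "name": "Joseph", "valid_from": "2006-12-25"},
    {"empkey": 15, "name": "Jim",    "valid_from": "2002-01-16"},
    {"empkey": 15, "name": "James",  "valid_from": "2004-03-16"},
]

most_recent = {}
for row in rows:
    prev = most_recent.get(row["empkey"])
    # ISO date strings compare correctly as text
    if prev is None or row["valid_from"] > prev["valid_from"]:
        most_recent[row["empkey"]] = row
```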

Grouping consecutive rows with identical Type 2 attributes

We must now deal with the second part of the problem, the SCD Type 2 attributes.

If you look back at the color highlighting in the sample data set, you may realize this is basically a grouping problem: for each distinct empkey we need to group rows with identical values in the Type 2 columns gender and state. From the point of view of the dimension table, no change occurred within this group, so we should store it as one row in the dimension table, and reconstruct the change dates by taking the minimum valid_from and maximum valid_to values from the group.

We must be careful though: we can't simply make groups of all distinct combinations of the SCD Type 2 columns. Look for example at the rows having 14 for the empkey column. The rows with the valid_from values 2005-08-08, 2006-02-13 and 2006-12-25 all have the identical combination (M, MI) in the Type 2 columns gender and state respectively. However, only the first two of these belong together in a group. The row with 2006-12-25 in the valid_from column does not belong to the group because another change event occurred on 2006-07-05, which lies between 2006-02-13 and 2006-12-25.

So, we must sharpen up the definition of the problem. It is not enough to make groups of identical combinations of Type 2 attributes: we must also demand that the rows in the group are consecutive.
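The "consecutive groups" requirement can be illustrated with a small Python sketch (my own illustration, using the (gender, state) values of empkey 14 in timeline order):

```python
# Walk one employee's timeline in order and start a new group whenever the
# Type 2 columns (gender, state) change from the previous row.
timeline = [  # empkey 14, ordered by valid_from
    ("F", "MI"), ("F", "MI"), ("F", "MI"),
    ("M", "MI"), ("M", "MI"),
    ("M", "NY"),
    ("M", "MI"),  # same combination as before, but a NEW group
]

groups = []  # list of [scd2_values, row_count]
for scd2 in timeline:
    if not groups or groups[-1][0] != scd2:
        groups.append([scd2, 0])
    groups[-1][1] += 1

print(groups)
```

Note that (M, MI) yields two separate groups, because the (M, NY) change event lies in between.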

The first step in attacking this problem is to generate combinations of the audit trail rows so that we have the first row of each group alongside the first row of the next group. The following fragment solves part of that problem by combining each row (alias: prv) with all more recent rows (alias: nxt) that have the same empkey value but different values in the Type 2 columns gender and state:

INNER JOIN hrsource prv
ON         curr.empkey = prv.empkey
LEFT JOIN  hrsource nxt
ON         prv.empkey = nxt.empkey                             -- timeline of same employee
AND        prv.valid_to < nxt.valid_from                       -- nxt must be more recent than prv
AND        (prv.gender, prv.state) != (nxt.gender, nxt.state)  -- type 2 attributes differ
(Please ignore the first two lines - the join of prv to curr - for now; that part was displayed struck out in the original, and it is not relevant at this point.)

Note the usage of the LEFT JOIN. It ensures that if prv is the most recent row, and there is by definition no row in nxt that is more recent, the prv row is still retained. Had we used an INNER JOIN, we would have lost that row, messing up the result.

If we run this fragment in isolation, we can certainly see rows that pair the first row of a group with the first row of the next group. For example, if you run this query:
SELECT     prv.empkey
,          prv.valid_from prv_from, prv.valid_to prv_to
,          nxt.valid_from nxt_from, nxt.valid_to nxt_to
,          prv.gender prv_gender
,          nxt.gender nxt_gender
FROM       hrsource prv
LEFT JOIN  hrsource nxt
ON         prv.empkey = nxt.empkey
AND        prv.valid_to < nxt.valid_from
AND        (prv.gender, prv.state) != (nxt.gender, nxt.state)
We get results like this:

+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
| 14     | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F          | M          |
| 14     | 1998-12-03 | 1998-12-27 | 2006-02-13 | 2006-07-04 | F          | M          |
| 14     | 1998-12-03 | 1998-12-27 | 2006-07-05 | 2006-12-24 | F          | M          |
| ...    | ...        | ...        | ...more rows...         | ...        | ...        |
+--------+------------+------------+------------+------------+------------+------------+
The first row in this result is a desired combination, because the valid_from date of nxt is the closest to that of prv: it is this row from nxt that marks the end of the group of rows starting with prv. For the other rows, the nxt part indicates a change that occurred beyond that point, and these are thus not desired.

To get rid of these undesired rows, we simply have to demand that no row indicating a change in the SCD Type 2 attributes occurs between prv and nxt. In my query, the following fragment is responsible for that part:
LEFT JOIN  hrsource inb
ON         prv.empkey = inb.empkey                             -- time line of same employee
AND        prv.valid_to < inb.valid_from                       -- more recent than prv
AND        (prv.gender, prv.state) != (inb.gender, inb.state)  -- Type 2 columns changed
AND        nxt.valid_from > inb.valid_to                       -- less recent than nxt
WHERE      inb.empkey IS NULL                                  -- does not exist
This fragment essentially states that there must not be any row in between prv and nxt that indicates a change in the SCD Type 2 attributes as compared to prv. Note that the first part of the join condition is identical to the one we used to join nxt to prv. The extra condition is that we also demand that inb lies before nxt. Because we used a LEFT JOIN, the result row is retained in case no such inb row exists, and the WHERE inb.empkey IS NULL condition explicitly filters for these cases. By definition, this means that the nxt part in the result row marks the end of whatever group prv belongs to.

Rolling Up the Rows in their Groups

Now, if we put these parts together we get something like this:

SELECT     prv.empkey
,          prv.valid_from prv_from, prv.valid_to prv_to
,          nxt.valid_from nxt_from, nxt.valid_to nxt_to
,          prv.gender prv_gender
,          nxt.gender nxt_gender
FROM       hrsource prv
LEFT JOIN  hrsource nxt
ON         prv.empkey = nxt.empkey
AND        prv.valid_to < nxt.valid_from
AND        (prv.gender, prv.state) != (nxt.gender, nxt.state)
LEFT JOIN  hrsource inb
ON         prv.empkey = inb.empkey
AND        prv.valid_to < inb.valid_from
AND        (prv.gender, prv.state) != (inb.gender, inb.state)
AND        nxt.valid_from > inb.valid_to
WHERE      inb.empkey IS NULL
Some of the results are here:
+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from   | prv_to     | nxt_from   | nxt_to     | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
| 14     | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F          | M          |
| 14     | 1998-12-28 | 2005-04-22 | 2005-08-08 | 2006-02-12 | F          | M          |
| 14     | 2005-04-23 | 2005-08-07 | 2005-08-08 | 2006-02-12 | F          | M          |
| ...    | ...        | ...        | ...more rows...         | ...        | ...        |
+--------+------------+------------+------------+------------+------------+------------+
Now, we still see the individual rows. However, because we have paired them with the row that definitely marks the end of the group it belongs to, we can now lump them together using GROUP BY. This explains the final bits of the query:
GROUP BY   prv.empkey
, nxt.valid_from
For each value in empkey, this GROUP BY clause essentially rolls up the group of rows that showed no change in the SCD Type 2 attributes until the occurrence of the nxt row. In the SELECT list, we can now use the MIN function to find the date of the first row in the group.
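To double-check the logic, here is the whole rollup restated as a plain Python sketch over the sample data (my own illustration; it walks each employee's ordered timeline instead of doing the self-joins, but produces the same dimension rows):

```python
# End-to-end sketch: group consecutive rows per empkey on the Type 2 columns
# (gender, state), take MIN(valid_from) per group, and fill the Type 1 columns
# (name, ssn) from each employee's most recent row.
from datetime import date, timedelta

audit = [
    # (empkey, name, ssn, gender, state, valid_from, valid_to), ordered per empkey
    (14, "Jo",     "323-10-1116", "F", "MI", "1998-12-03", "1998-12-27"),
    (14, "Jo",     "323-10-1119", "F", "MI", "1998-12-28", "2005-04-22"),
    (14, "Joe",    "323-10-1119", "F", "MI", "2005-04-23", "2005-08-07"),
    (14, "Joseph", "323-10-1119", "M", "MI", "2005-08-08", "2006-02-12"),
    (14, "Joe",    "323-10-1119", "M", "MI", "2006-02-13", "2006-07-04"),
    (14, "Joseph", "323-10-1119", "M", "NY", "2006-07-05", "2006-12-24"),
    (14, "Joseph", "323-10-1119", "M", "MI", "2006-12-25", None),
    (15, "Jim",    "224-57-2726", "M", "IL", "2002-01-16", "2004-03-15"),
    (15, "James",  "224-57-2726", "M", "IL", "2004-03-16", "2007-06-22"),
    (15, "James",  "224-57-2726", "M", "IN", "2007-06-23", "2007-08-31"),
]

def day_before(iso):
    y, m, d = map(int, iso.split("-"))
    return (date(y, m, d) - timedelta(days=1)).isoformat()

dim = []
prev = None  # (empkey, gender, state) of the previous audit row
for empkey, name, ssn, gender, state, valid_from, valid_to in audit:
    if (empkey, gender, state) != prev:
        # change in the Type 2 columns (or a new employee): start a new group
        dim.append({"empkey": empkey, "gender": gender, "state": state,
                    "valid_from": valid_from})
    prev = (empkey, gender, state)

# Fill in the Type 1 columns from the most recent row, and derive valid_to
# and is_current from the next group (as the SQL does via nxt and COALESCE).
latest = {row[0]: row for row in audit}  # last row per empkey wins (input is ordered)
for i, d in enumerate(dim):
    d["name"], d["ssn"] = latest[d["empkey"]][1], latest[d["empkey"]][2]
    nxt = dim[i + 1] if i + 1 < len(dim) and dim[i + 1]["empkey"] == d["empkey"] else None
    d["valid_to"] = "9999-12-31" if nxt is None else day_before(nxt["valid_from"])
    d["is_current"] = "Y" if nxt is None else "N"
```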

Note that the GROUP BY list is the thinnest one possible. The SELECT list contains many more columns that do not appear in the GROUP BY list. However, it is perfectly safe to do so in this case. For more details on this matter, please read my Debunking Group By Myths article.

Gluing the SCD Type 1 and Type 2 Solutions together

We developed the solution as two separate queries - one to take care of the SCD Type 1 attributes, one for the SCD Type 2 attributes. These two are simply joined together over the empkey. The code to do this - the INNER JOIN of prv to curr - was displayed struck out in the first query fragment that deals with the SCD Type 2 attributes.

Calculating the Result columns

The only part of the query that was not discussed yet is the SELECT list:
SELECT     prv.empkey
,          curr.name
,          curr.ssn
,          prv.gender
,          prv.state
,          MIN(prv.valid_from) valid_from
,          COALESCE(
               nxt.valid_from - INTERVAL 1 DAY
           ,   '9999-12-31'
           ) valid_to
,          CASE
               WHEN nxt.valid_from IS NULL THEN 'Y'
               ELSE 'N'
           END is_current
Note that we select the name and ssn columns from the curr table. This ensures these columns will always reflect the data of the most recent change which is typical for SCD Type 1 attributes. The gender and state columns on the other hand are drawn from prv. They reflect the change history for each individual employee.

The valid_from column is aggregated using the MIN function. This essentially yields the date of the first row of each group of consecutive rows having the same empkey and values in the SCD Type 2 columns.

The valid_to date is calculated from the valid_from date of the first row of the next group of rows that indicates a change of values in the SCD Type 2 columns. This is the relevant code:
nxt.valid_from - INTERVAL 1 DAY
Now, to be completely fair - this is in fact cheating. It implies I trust the data to be such that the value for the valid_to date is always exactly one day less than the value for the valid_from date for the next row in the history. This was certainly the case with the sample data, but we cannot really rely on that. That said, it is not that hard to get the 'real' valid_to date, but it would make this example a lot harder to understand.
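One way to get that 'real' valid_to (my own sketch, not from the article): take the group's own maximum valid_to instead, substituting the large date for the open-ended NULL. In SQL this would be roughly COALESCE(MAX(prv.valid_to), '9999-12-31'):

```python
# Sketch: the 'real' valid_to, without assuming the timeline is gap-free.
def real_valid_to(group_valid_to):
    """group_valid_to: the valid_to values (ISO strings or None) of one group."""
    return max(v if v is not None else "9999-12-31" for v in group_valid_to)

closed_group = ["1998-12-27", "2005-04-22", "2005-08-07"]  # a closed group
open_group = [None]                                        # the current, open-ended group
```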

Another thing about the calculation of valid_to is the treatment of NULL values. For the history associated with each distinct empkey value, the last row will have NULL here, and this is substituted by a large date instead. Dimension tables tend to avoid nullable columns, and in this case the maximum value '9999-12-31' expresses that this row is the current row. Using that same logic, an extra flag column is added to indicate whether the row is the current row.

Wednesday, October 15, 2008

Maturity of Open Source ETL / Why do people fail to do their homework?

I just read this post on Matt Casters' blog. Here, Matt describes why Element 61's Jan Claes is dead wrong in the way he assesses the maturity of open source ETL tools.

Well, I've just read Jan Claes' article in the "research and insights" area of the Element61 website, and frankly, it is pretty easy to see how unsubstantiated it is. Some may be tempted to classify the article as FUD, although typically 'real' FUD articles are more sophisticated and are usually not so blatantly sloppy with facts. I could get all worked up over it, but mostly it leaves me with a genuine confusion as to why a seemingly serious IT consultancy company would want to have their professionals post this kind of stuff on the company website and let it pass as "research" or even "insight".

Anyway, let's take a look at the article and I'll try and explain why it won't wash...
(Disclaimer - I can easily debunk Jan's article when I look at Kettle, a.k.a. Pentaho data integration. I do not have enough experience with the other open source ETL tools mentioned in his article, Talend and CloverETL, so it is entirely possible that the article does these tools similar injustice).

The article starts off by explaining what ETL is, and correctly mentions that prerequisite tools and resources are generally considered expensive. Quite unsurprisingly, the article moves on to explore open source ETL tools in order to find a cost-reducing alternative. So far, so good - the notion that cost reduction is the primary motive for most businesses to start using open source software is far from controversial.

From here on the article wanders off, and starts talking about open source in general. Some sympathetic but unsubstantiated and unillustrated claims are made about the relation between "standards" and open source software.

The article then becomes more critical and downplays the viability of an ETL open source community as compared to the Linux community. According to Jan Claes, Linux is successful ...because the developers united in a “quest” against Windows.... According to him, a common motive lacks in the ETL community. He has it that ...[ETL] is a specialist's world..., and because of that, ...progress might be significantly slower than in the case of other –more successful- Open Source programs.

Just take a second to let it sink in. Anybody met a Linux dev lately that was out to beat Windows? Last time I checked, Linux was meant to be a free (as in speech) UNIX system. Most Linux users I know couldn't care less about Windows, and they are certainly not 'united in a quest' to create something like it. I take it that by tagging ETL as belonging to the 'specialist's world', I am to understand that Linux kernel development is 'generalistic'.

If I am to believe Jan, everybody is so busy with their quest against Windows, doing all that 'general' hacking in C/C++ on the Linux kernel, that nobody is specialist enough to help that poor quailing open source ETL community develop say, a MySQL bulk loader step, or a Regular Expression Matching step, or a Normalizing step. (Those that have been following the development of Kettle know that these and more are all contributed by the community.)

To make sure he gets the point across, the article then says that ...The most advanced suppliers of Open Source ETL actually admit today that they are not really community-driven, and that each company works with their own developers.... No link. No date. No nothing, we should probably just take Jan's word for it.

Look, don't get me wrong - I don't want to deny that ETL is a specialist area of expertise. But in my opinion, this is in part due to the fact that ETL (and BI tools in general) have been inaccessible to most people. Let me explain that.

My take on it is that open source BI is now in the same boat where the open source databases were a few years ago. While the 'real' experts and specialists mocked and ignored the open source 'toy' databases, refusing to touch them and deigning to work only with traditional 'enterprise' databases, those same open source databases caused an emancipation among application developers and web developers.

Databases are now a commodity, and there are quite a lot of developers that know how to use one. Of course, apart from a large body of database users with basic skills, there are still 'real' experts, but some of them have become one with little or no typical DBA background, thanks to the fact that open source databases disrupted the market and commoditized it. I see the same thing happening for BI: because the tools are now accessible and available, BI is popping up where it didn't before. In stealth mode, it is becoming a commodity.

The article continues, and now starts discussing a number of products: CloverETL, Pentaho, and Talend. Jan claims that these ETL tools provide ...just enough transformations to make it a viable option.

Huh?! Say what?! Just enough?... Of course, no comparison is made with a closed source product, so we can only guess what he was expecting. But really people, when I open Kettle 3.1, the latest GA version, I see:

  • Over 20 input steps, ranging from standard things like simple database tables, CSV files and Excel workbooks to slightly more advanced things like XML and LDAP to more exotic things like XBase and Web services

  • In addition to regular input steps, specialized bulk loaders for Oracle, MySQL, Postgres, Greenplum and MonetDB

  • Over 10 output steps (again ranging from standard stuff to more advanced things)

  • About 10 flavors of lookup, join and dimension maintenance steps

  • No less than 30 transformation steps, including normalization, denormalization, XSLT, XML injection

  • A bunch of flow control steps


And this is just for transformations! For kettle jobs, there's another bucket of steps - too many to mention here.

The article continues and does a run-down of "when to use/when not to use" open source ETL tools, and it is in this section where we see most of the inaccuracies.

Here are the "when to use" cases according to Jan Claes:


No complex transformations

The straightforward transformations can be implemented very quickly, ... When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.


This is like, "so what"?! I mean, seriously - basically this complaint can be simplified to "when it's hard, it's hard". I dare anybody to come up with a single case where any tool provides an "out-of-the-box-solution" for an arbitrarily complex problem. I suspect Jan has not tried, and if he did, why doesn't he mention what he tried? I mean, man, just one example of what was expected but not found, surely that is not too much to ask? Gosh...


Few data sources to connect to

Most reputed ETL-vendors provide an extensive amount of connections to all sorts of data sources. This is a problem with Open Source ETL tools: most of them are based on Java architecture and need JDBC to connect to a database.


Seriously, what crap is this?

First of all, it is completely untrue. I mean, I won't bore you with the big 5 databases, and I know you have all heard of the less well known ones like Greenplum, Intersystems Cache and Netezza, but have you ever heard of "ExtenDB"? Or "Kingbase ES"? Or "Gupta"? No?! Well, Kettle has, and it ships all of this for free. Yup, that's right: no less than 32 different RDBMS systems, and for most of them there isn't just one driver but multiple drivers available - no additional installation or fee required.

Second, how does it make sense to whine about the underlying protocol or API used to connect to the data source? I have never heard anybody complain about MS SQL only offering AdoDB or ODBC connectivity, have you? In practice there is a JDBC driver available for every database, and if not, you can still use ODBC connectivity. You are also not confined to the pre-delivered drivers: you can simply drop your own drivers in the kettle jdbc directory and connect to other databases. If anything, Kettle's choice of JDBC has done anything but limit connectivity.

Do your homework, dude. All it takes is one download and one unzip command.


Small scaled projects

Small scaled projects with one or two, very common sources and a limited number of target tables don't need an expensive and complicated ETL tool. An Open Source ETL tool can bring a cheap and easy to install solution, which even could run on a local PC.


Well, it is certainly true that you can run Kettle on a single PC with modest resource usage. But you can also think big, and benefit from built-in threading to burn all the cores of your high-end server. Or you can cluster a bunch of machines and have them divide the work to perform a single transformation. You can run it in a cloud and use sophisticated features like parallel input, you can use it to load shards ("database partitioning") - in fact, there are no obvious limitations to Kettle's deployment scale.

So, as it should be obvious by now, Jan has managed to write his "when to use cases" as if they are "when not to use" cases. The real "when not to use" cases follow:


Java & XML knowledge required for complex transformations

As all the Open Source ETL tools are Java and XML based, you need at least a basic understanding and knowledge of Java and XML to get the most out of them...[bla, bla and more bla]...the user has to develop the transformation rules, and therefore needs an advanced knowledge of Java.


This is a blatant lie. I won't deny that it is possible to muck around coding Java as a Kettle user, but technically you would be a plugin writer. And the XML claim is completely off track - there is no requirement whatsoever for the end user to edit raw XML, ever. Really, I have just gone down the road describing the wealth of steps available to build your transformations, and really, you can trust me when I say: this is enough for about 95% of the cases, if not 99%. Who can blame the product for allowing you to write a Java plugin if you really want it?

Speaking of which, does Jan mean to imply that the "traditional" products he is writing about do not allow you to write plugins? If that is so, then it seems an easy choice for me. I'd rather have more options, not less. That said, I know that, for example, MS SQL Server Integration Services allows .NET plugins. How is this functionally different?


Connectivity to multiple & less common data sources required.



Heh ;) As I have argued before, this is nonsense....Elementary, Watson.


Lack of skills, knowledge & resources


Well, I guess this is one of the things I got to give them. Clearly, certain skills, knowledge and resources are missing in Element61's outfit. Otherwise I can't explain why they'd be advertising such an easy target article.

Anyway, maybe you have thoughts on Open Source ETL that you want to share. If so, be my guest.

Thank you for bearing with me, and please leave a comment if you feel like it.
