Abstract
Jeff's topic, loading a hybrid Type 1 / Type 2 slowly changing dimension table is related to data warehousing but maybe of interest outside of that context as well.As it turns out, the particular problem described by Jeff is non-trivial, but can be solved quite elegantly in a single SQL statment. This may be a compelling alternative to the multi-step, multi-pass solution proposed in his article.
Type 1 and Type 2 Slowly Changing Dimensions
In his article, Jeff describes a method to load a slowly changing dimension (SCD) table from an audit trail. This would be quite straight forward in case we are dealing with a Type 2 slowly changing dimension. In that case, each row in the audit trail would also yield one row in the dimension table. If the dimension would be a Type 1 slowly changing dimension, the matter would only be slightly more complicated - in this case only the most recent version of each object would be loaded into the dimension table.A Hybrid SCD Type 1/2
The interesting thing about the problem described in the article is that the dimension table is a hybrid Type 1 / Type 2 dimension. That is, for some attributes, history needs to be tracked in the dimension table (Type 2 attributes), whereas only the most recent data is required for other attributes (Type 1 attributes).Sample Data
To make things tangible, here's a sample Employee audit trail:| SCD Type 1 | SCD Type 2 |The data shows the history for the employees with
+--------+--------+-------------+--------+-------+------------+------------+
| empkey | name | ssn | gender | state | valid_from | valid_to |
+--------+--------+-------------+--------+-------+------------+------------+
| 14 | Jo | 323-10-1116 | F | MI | 1998-12-03 | 1998-12-27 |
| 14 | Jo | 323-10-1119 | F | MI | 1998-12-28 | 2005-04-22 |
| 14 | Joe | 323-10-1119 | F | MI | 2005-04-23 | 2005-08-07 |
| 14 | Joseph | 323-10-1119 | M | MI | 2005-08-08 | 2006-02-12 |
| 14 | Joe | 323-10-1119 | M | MI | 2006-02-13 | 2006-07-04 |
| 14 | Joseph | 323-10-1119 | M | NY | 2006-07-05 | 2006-12-24 |
| 14 | Joseph | 323-10-1119 | M | MI | 2006-12-25 | NULL |
| 15 | Jim | 224-57-2726 | M | IL | 2002-01-16 | 2004-03-15 |
| 15 | James | 224-57-2726 | M | IL | 2004-03-16 | 2007-06-22 |
| 15 | James | 224-57-2726 | M | IN | 2007-06-23 | 2007-08-31 |
+--------+--------+-------------+--------+-------+------------+------------+
empkey
14 and 15. All rows with the same empkey
value form a time line of data change events. Each row represents a change event, updating some of the employee's data. The valid_from
and valid_to
columns are used to record when the data change event occurred, so the values in these columns change for each row. For the other columns, I used bold markup to make it easier to spot the change.SCD Type 1 and 2 Attributes
In Jeff's article, the columnsname
and ssn
end up as SCD Type 1 attributes, and the columns gender
and state
as SCD Type 2 attributes. I used color highlighting to mark up groups of consecutive rows where the values for the gender
and state
columns did not change. After loading the dimension table we must end up with one row for each such group, capturing all change events for these columns. The ssn
and name
must always contain the most recent data. I highlighted the most recent row using grey markup.Resulting Dimension Table
The data for the resulting dimension table would look like this:+--------+--------+--------+-------------+--------+-------+------------+------------+------------+As you can see, only the changes in the
| dw_key | empkey | name | ssn | gender | state | valid_from | valid_to | is_current |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
| 1 | 14 | Joseph | 323-10-1119 | F | MI | 1998-12-03 | 2005-08-07 | N |
| 2 | 14 | Joseph | 323-10-1119 | M | MI | 2005-08-08 | 2006-07-04 | N |
| 3 | 14 | Joseph | 323-10-1119 | M | NY | 2006-07-05 | 2006-12-24 | N |
| 4 | 14 | Joseph | 323-10-1119 | M | MI | 2006-12-25 | 9999-12-31 | Y |
| 5 | 15 | James | 224-57-2726 | M | IL | 2002-01-16 | 2007-06-22 | N |
| 6 | 15 | James | 224-57-2726 | M | IN | 2007-06-23 | 9999-12-31 | Y |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------
gender
and state
columns are recorded, ignoring any changes in the name
and ssn
columns. Instead, these columns get the values of the most recent change.Jeff's Solution
Jeff's article describes a step-wise solution to this problem. To give you an idea and some context, here's a quote from the article describing the mindset in developing this approach:The actual steps that make up the described solution can be summarized as follows:
While an ETL tool like Informatica or DataStage would be handy, any of this could be done fairly easily in straight SQL. We’ll also keep all the steps simple, easy to understand and discrete. It’s possible to create enormous heaps of nested SQL to do everything in one statement, but it's best to keep everything understandable. We create and label "tables," but whether you choose to actually create real database tables, or the tables are just record sets inside a flow, the process remains essentially unchanged.
- Rank all rows according to the timeline
- Join consecutive rows by rank in case the there is a change in the SCD attributes
- Break the joined rows back into two rows (Jeff calls this 'semi-pivot')
- Add the last and first version for each object to the set
- Add another rank number to the result set
- At this point, the rows where the new rank number is odd contain the
valid_from
date, and rows with an even number contain thevalid_to
date. Join each odd row to its consecutive even rows. - Join to the last version of each object to fill in for the SCD Type 1 attributes, and set a flag for the latest version of the object.
My Alternative
First of all, let me say that I am impressed with Jeff's creativity. I would never think of this solution and I am not really sure I even understand it. I also am not categorically opposed to Jeff's initial mindset: I too think that it is sometimes better to avoid complex SQL statements in favor of a series of relatively simple ones.That said, when I first read Jeff's article, my first impression was: "Wow...that are a lot of moving parts". My second impression was: "Darn, that's a lot of multiple passes over the same data set". All the time, I had this hunch that it wouldn't be to hard to do it in two statements (one
SELECT...INTO
, one UPDATE
), or perhaps even in one (SELECT...INTO
).Preparation
I started by setting up the test data in MySQL 5.1. I'm including the script here for the readers' convenience:Of course, this is not a lot of data, but it is the data from the original article, dutifully hacked into my computer's keyboard by your's truly.
CREATE TABLE hrsource (
empkey int NOT NULL
, name varchar(10) NOT NULL
, ssn char(11) NOT NULL
, gender char(1) NOT NULL
, state char(2) NOT NULL
, valid_from date NOT NULL
, valid_to date
, PRIMARY KEY(empkey, valid_from)
);
INSERT INTO hrsource
(empkey,name,ssn,gender,state,valid_from,valid_to) VALUES
(14, 'Jo' , '323-10-1116', 'F', 'MI', '1998-12-03', '1998-12-27')
,(14, 'Jo' , '323-10-1119', 'F', 'MI', '1998-12-28', '2005-04-22')
,(14, 'Joe' , '323-10-1119', 'F', 'MI', '2005-04-23', '2005-08-07')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2005-08-08', '2006-02-12')
,(14, 'Joe' , '323-10-1119', 'M', 'MI', '2006-02-13', '2006-07-04')
,(14, 'Joseph', '323-10-1119', 'M', 'NY', '2006-07-05', '2006-12-24')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2006-12-25', NULL)
,(15, 'Jim' , '224-57-2726', 'M', 'IL', '2002-01-16', '2004-03-15')
,(15, 'James' , '224-57-2726', 'M', 'IL', '2004-03-16', '2007-06-22')
,(15, 'James' , '224-57-2726', 'M', 'IN', '2007-06-23', '2007-08-31');
Solution
Without further ado, this is my solution for selecting the dimension's table dataset:SELECT prv.empkeyNow I would certainly not qualify this as a simple statement. At the same time, I feel this does not resemble the "enormous heaps of nested SQL" so dreaded by Jeff. Let me explain how it works, and you can judge for yourself.
, curr.name
, curr.ssn
, prv.gender
, prv.state
, MIN(prv.valid_from) valid_from
, COALESCE(
nxt.valid_from - INTERVAL 1 DAY
, '9999-12-31'
) valid_to
, CASE
WHEN nxt.valid_from IS NULL THEN 'Y'
ELSE 'N'
END is_current
FROM hrsource curr
INNER JOIN (
SELECT empkey
, MAX(valid_from) valid_from
FROM hrsource
GROUP BY empkey
) curr1
ON curr.empkey = curr1.empkey
AND curr.valid_from = curr1.valid_from
INNER JOIN hrsource prv
ON curr.empkey = prv.empkey
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey
AND prv.valid_to < nxt.valid_from
AND (prv.gender,prv.state)!= (nxt.gender,nxt.state)
LEFT JOIN hrsource inb
ON prv.empkey = inb.empkey
AND prv.valid_to < inb.valid_from
AND nxt.valid_from > inb.valid_to
AND (prv.gender,prv.state)!= (inb.gender,inb.state)
WHERE inb.empkey IS NULL
GROUP BY prv.empkey
, nxt.valid_from
Explanation
I will know do a step-by-step explanation of my statement. I hope it clears up any doubt you might have as to how my solution solves the problem.SCD Type 1 attributes: Isolating the Most Recent Change
In the introduction, I mentioned that the loading the dimension table would be easy in case it was either a Type 1 or a Type 2 slowly changing dimension. Well, we know that, come what may, we will always need the most recent row for eachempkey
to supply values for the Type 1 attributes name
and ssn
. So, we start by attacking that part of the problem. The following fragment of my solution does exactly that:
FROM hrsource currThe subquery
INNER JOIN (
SELECT empkey
, MAX(valid_from) valid_from
FROM hrsource
GROUP BY empkey
) curr1
ON curr.empkey = curr1.empkey
AND curr.valid_from = curr1.valid_from
curr1
finds us the highest value for the valid_from
column for each distinct empkey
. The GROUP BY empkey
clause ensures we get exactly one row for each distinct value of empkey
. From all rows with the same empkey
value, the MAX(valid_from)
bit finds the largest valid_from
value.The combination
empkey, valid_from
is the primary key of the employee audit trail. This means we can now use the empkey, MAX(valid_from)
pair from the curr1
subquery to point out the most recent row for each distinct empkey
. The INNER JOIN
with the hrsource
table (dubbed curr
) does exactly that.Grouping consecutive rows with identical Type 2 attributes
We must now deal with the second part of the problem, the SCD Type 2 attributes.If you look back at the color highlighting in the sample data set, you may realize this is basically a grouping problem: for each distinct
empkey
we need to group rows with identical values in the Type 2 columns gender
and state
. From the point of view of the dimension table, no change occurred within this group, so we should store it as one row in the dimension table, and reconstruct the change dates by taking the minimum valid_from
and maximum valid_to
values from the group. We must be careful though: we can't simply make groups of all distinct combinations of the SCD Type 2 columns. Look for example at the rows having
14
for the empkey
column. The rows with the valid_from
values 2005-08-08
, 2006-02-13
and 2006-12-25
all have the identical combination (M
, MI
) in the Type 2 columns gender
and state
respectively. However, only the first two of these belong together in a group. The row with 2006-12-25
in the valid_from
column does not belong to the group because another change event occurred on 2006-07-05
, which lies between 2006-02-13
and 2006-12-25
.So, we must sharpen up the definition of the problem. It is not enough to make groups of identical combinations of Type 2 attributes: we must also demand that the rows in the group are consecutive.
The first step in attacking this problem is generate combinations of the audit trail rows so that we have the first row of each group along the first row of the next group. The following fragment solves part of that problem by combining each row (alias:
prv
) with all more recent rows (alias: nxt
) that have the same empkey
value but different values in the Type 2 columns gender
and state
:(Please ignore the part I struck out - it's not relevant at this point)
INNER JOIN hrsource prv
ON curr.empkey = prv.empkey
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey -- timeline of same employee
AND prv.valid_to < nxt.valid_from -- nxt must be more recent than prv
AND (prv.gender,prv.state)!= (nxt.gender,nxt.state) -- type 2 attributes differ
Note the usage of the
LEFT JOIN
. It ensures that if prv
is the most recent row, and there is by definition no row in nxt
that is more recent, the prv
row is still retained. Had we used an INNER JOIN
, we would have lost that row, messing up the result.If we run this fragment in isolation, we can certainly see rows that pair the first row of a group with the first row of the next group. For example, if you run this query:
SELECT prv.empkeyWe get results like this:
, prv.valid_from prv_from, prv.valid_to prv_to
, nxt.valid_from nxt_from, nxt.valid_to nxt_to
, prv.gender prv_gender
, nxt.gender nxt_gender
FROM hrsource prv
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey
AND prv.valid_to < nxt.valid_from
AND (prv.gender, prv.state) != (nxt.gender, nxt.state)
The first row highlighted in green is a desired combination because the
+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from | prv_to | nxt_from | nxt_to | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
| 14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F | M |
| 14 | 1998-12-03 | 1998-12-27 | 2006-02-13 | 2006-07-04 | F | M |
| 14 | 1998-12-03 | 1998-12-27 | 2006-07-05 | 2006-12-24 | F | M |
. . . . . . . .
. . ...more rows... . . . .
. . . . . . . .
+--------+------------+------------+------------+------------+------------+------------+
valid_from
date of nxt
is closest to that of prv
. It is this row from nxt
that marks the end of the group of rows starting with prv
. For the rows highlighted in red, the nxt
part indicates a change that occurred beyond that point and are thus not desired.To get rid of these undesired rows, we simply have to demand that no row indicating a change in the SCD Type 2 attributes occurs between
prv
and nxt
. In my query, the following fragment is responsible for that part:LEFT JOIN hrsource inbThis fragment essentially states that there must not be any row in between
ON prv.empkey = inb.empkey -- time line of same employee
AND prv.valid_to < inb.valid_from -- more recent than prv
AND (prv.gender,prv.state)!= (inb.gender,inb.state) -- Type 2 columns changed
AND nxt.valid_from > inb.valid_to -- less recent than nxt
WHERE inb.empkey IS NULL -- does not exist
prv
and nxt
that indicate a change in the SCD type 2 attributes as compared to prv
. Note that the first part of the join condition is identical to the one we used to join nxt
to prv
. The extra's are that we also ask that inb
lies before nxt
. Because we used a LEFT JOIN
, the result row is retained in case no such inb
row exists. The WHERE inb.empkey IS NULL
condition explicitly filters for these cases. By definition, this means that the nxt
part in the result row marks the end of whatever group the prv
belongs to.Rolling Up the Rows in their Groups
Now, if we put these parts together we get something like this:Some of the results are here:
SELECT prv.empkey
, prv.valid_from prv_from, prv.valid_to prv_to
, nxt.valid_from nxt_from, nxt.valid_to nxt_to
, prv.gender prv_gender
, nxt.gender nxt_gender
FROM hrsource prv
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey
AND prv.valid_to < nxt.valid_from
AND (prv.gender, prv.state) != (nxt.gender, nxt.state)
LEFT JOIN hrsource inb
ON prv.empkey = inb.empkey
AND prv.valid_to < inb.valid_from
AND (prv.gender,prv.state) != (inb.gender,inb.state)
AND nxt.valid_from > inb.valid_to
WHERE inb.empkey IS NULL
+--------+------------+------------+------------+------------+------------+------------+Now, we still see the individual rows. However, because we have paired them with the row that definitely marks the end of the group it belongs to, we can now lump them together using
| empkey | prv_from | prv_to | nxt_from | nxt_to | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
| 14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F | M |
| 14 | 1998-12-28 | 2005-04-22 | 2005-08-08 | 2006-02-12 | F | M |
| 14 | 2005-04-23 | 2005-08-07 | 2005-08-08 | 2006-02-12 | F | M |
. . . . . . . .
. . ...more rows... . . . .
. . . . . . . .
+--------+------------+------------+------------+------------+------------+------------+
GROUP BY
. This explains the final bits of the query:GROUP BY prv.empkeyFor each value in
, nxt.valid_from
empkey
, this GROUP BY
clause essentially rolls up the group of rows that showed no change in the SCD Type 2 attributes until the occurrence of the nxt
row. In the SELECT
list, we can now use the MIN
function to find the date of the first row in the group.Note that the
GROUP BY
list is the thinnest one possible. The SELECT
list contains many more columns that do not appear in the GROUP BY
list. However, it is perfectly safe to do so in this case. For more details on this matter, please read my Debunking Group By Myths article.Glueing the SCD Type 1 and Type 2 Solutions together
We developed the solution as two separate queries - one to take care of the SCD Type 1 attributes, one for the SCD Type 2 attributes. These two are simply joined together over theempkey
. The code to do this was displayed struck out in the first query fragment that deals with SCD type 2 attributes.Calculating the Result columns
The only part of the query that was not discussed yet is theSELECT
list:SELECT prv.empkeyNote that we select the
, curr.name
, curr.ssn
, prv.gender
, prv.state
, MIN(prv.valid_from) valid_from
, COALESCE(
nxt.valid_from - INTERVAL 1 DAY
, '9999-12-31'
) valid_to
, CASE
WHEN nxt.valid_from IS NULL THEN 'Y'
ELSE 'N'
END is_current
name
and ssn
columns from the curr
table. This ensures these columns will always reflect the data of the most recent change which is typical for SCD Type 1 attributes. The gender
and state
columns on the other hand are drawn from prv
. They reflect the change history for each individual employee. The
valid_from
column is aggregated using the MIN
function. This essentially yields the date of the first row of each group of consecutive rows having the same empkey
and values in the SCD Type 2 columns. The
valid_to
date is calculated from the valid_from
date of the first row of the next group of rows that indicates a change of values in the SCD Type 2 columns. This is the relevant code:nxt.valid_from - INTERVAL 1 DAYNow, to be completely fair - this is in fact cheating. It implies I trust the data to be such that the value for the
valid_to
date is always exactly one day less than the value for the valid_from
date for the next row in the history. This was certainly the case with the sample data, but we cannot really rely on that. That said, it is not that hard to get the 'real' valid_to
date, but it would make this example a lot harder to understand.Another thing about the calculation of
valid_to
is the treatment of NULL
values. For each history associated with one distinct empkey
value, the last row will have NULL
here, and this is substituted by a large date instead. Dimension tables tend to avoid nullable columns and in this case the maximum value '9999-12-31' expresses that this row is the current row. Using that same logic, and exrta flag column is added to indicate whether the row is the current row.
7 comments:
Hi Roland,
This would be my version using CTE's and window functions introduced in PostgreSQL 8.4:
WITH scd1 AS (
-- get the scd1 (last value) using the DISTINCT ON with an ORDER BY
SELECT DISTINCT ON (empkey) empkey
, name
, ssn
FROM hrsource
ORDER
BY empkey
, valid_from DESC
)
, scd2 AS (
-- detect the matations in gender and state with the LAG window function
-- compare the previous value with the current
-- the IS DISTINCT FROM operator (<>) treats NULL as a 'normal' value
SELECT empkey
, valid_from
, valid_to
, gender
, state
, CASE WHEN gender IS DISTINCT FROM LAG(gender) OVER timeline THEN 1
ELSE 0
END AS gender_mutation
, CASE WHEN state IS DISTINCT FROM LAG(state) OVER timeline THEN 1
ELSE 0
END AS state_mutation
FROM public.hrsource
WINDOW timeline AS (PARTITION BY empkey ORDER BY valid_from)
)
-- combine scd1 and scd2
-- filter scd2 on only the changed values
-- compute the valid_to as next valid_from with the LEAD window function
-- if not found then use maximum date '9999-12-31'
SELECT scd2.empkey
, scd1.name
, scd1.ssn
, scd2.gender
, scd2.state
, scd2.valid_from
, LEAD(scd2.valid_from,1,'9999-12-31'::date) OVER timeline AS valid_to
FROM scd1
JOIN scd2
ON scd1.empkey = scd2.empkey
AND 1 in (gender_mutation,state_mutation)
WINDOW timeline AS (PARTITION BY scd2.empkey ORDER BY scd2.valid_from)
Window functions can be emulated in MySQL, see this article:
http://www.oreillynet.com/pub/a/mysql/2007/04/12/emulating-analytic-aka-ranking-functions-with-mysql.html
Salud,
JJ.
Hi Roland,
I've made the query with the window-emulation technique described in the linkt I mentioned in my previous comment:
select empkey
-- for a string value concat the 0's from the least and use substing to
-- loose them afterwards
-- keep the last value (the first in reverse ordering!) until the empkey changes
, case when (@empkey_n = empkey)
then @name
else substring(concat(least(0,@empkey_n := empkey),least(0,@name := name),name) ,3)
end as name
, case when (@empkey_x = empkey)
then @ssn
else substring(concat(least(0,@empkey_x := empkey),least(0,@ssn := ssn),ssn),3)
end as ssn
, gender
, state
, valid_from
-- the variables seem to be calculated before the where condition
-- so a check on mutations to keep the right previous valid_from is necessary
-- for a datetime's use an add interval for the least constructions
, case when @empkey_v = empkey
then case when 1 in (gender_mutation,state_mutation)
then @valid_from + interval least(0, @valid_from := valid_from) day
else @valid_from
end
else '9999-12-31' + interval least(0,@empkey_v := empkey) + least(0, @valid_from := valid_from) day
end as valid_to
from
(
select empkey
, name
, ssn
, gender
, state
, valid_from
-- generate an indicator to tell which record contains the mutation
-- compare the current value and the previous value in the variable
, case when @empkey_g = empkey
then case when @gender <> gender
then 1
else 0
end
else 1 + least(0,@empkey_g := empkey)
end + least(0, @gender := gender) as gender_mutation
, case when @empkey_s = empkey
then case when @state <> state
then 1
else 0 end
else 1 + least(0,@empkey_s := empkey)
end + least(0, @state := state) as state_mutation
from hrsource
, (select (@empkey_g := 0)) as v1-- init
, (select (@empkey_s := 0)) as v2 -- init
, (select (@gender := '')) as v3-- init
, (select (@state := '')) as v4-- init
-- order the valid_from to get the 'previous record' values in the variables
order
by empkey
, valid_from
) as scd2
, (select (@empkey_n := 0)) as v1 -- init
, (select (@empkey_x := 0)) as v2 -- init
, (select (@empkey_v := 0)) as v3 -- init
, (select (@name := '')) as v4 -- init
, (select (@ssn := '')) as v5 -- init
, (select (@valid_from := '9999-12-31')) as v6 -- init
-- filter only the mutated records of the scd2
where 1 in (gender_mutation,state_mutation)
-- order the valid_from in reverse to get te 'next record' values in the variables
order
by empkey
, valid_from desc
It uses a more ETL-tool/streaming approach, by making use of sorting and variables to keep the previous/next values. I don't know which is faster. I just liked to try it with the variables approach.
Salud,
JJ.
Hi Juan Jose,
wow - this is pretty impressive! Thanks for posting your comments.
I still think that for the particular topic at hand my solution is the simplest one, but I understand the benefit of windowing functions (esp. when reporting on a DWH). I know they have been implemented in PG and I've been meaning to check them out.
Your emulation query looks interesting. I haven't looked at it in detail, but I will do so shortly - it looks I could learn quite a lot from it.
No wonder you're running behind on schedule... I won't send any more articles, too distracting ;-)
Hi Roland,
I made the scd in Kettle using the Analytic Query step twice and a Group by step. I cannot post the picture in the comment. So I'll post the url with the xml, so you can load in Kettle: http://public.me.com/delostilos
Salud,
JJ.
Thank you so much for sharing this awesome code!!! I had been researching the SCD2 dimensions attempting insert/update, merge, and other queries with no success. This is just what I needed...ingenious way to handle these slowly changing dimensions.
Hi and thanks for ur best website and best information. I enjoy alot
Post a Comment