Another approach to obtain the data for a date dimension is to generate it using an ETL tool like Pentaho Data Integration, a.k.a. Kettle. I think this approach makes sense for a number of reasons:
- When you tend to use a particular ETL tool, you will be able to reuse the date dimension generator over an over, and on different database platforms.
- You won't need special database privileges beyond the ones you need already. Privileges for creating tables and to perform DML will usually be available, whereas you might need to convince a DBA that you require extra privileges to create and execute stored procedures.
Prerequisites
In this tip, the steps to create a date dimension are described using Kettle 2.5.1 (Generally available Release) and MySQL 5.1.20 (Beta). You will be able to follow through the example using earlier (and later) versions of both products though - I am not using any functionality that is specific to these particular version of the products. The recipe does not really require that you understand anything about data warehouses or date dimensions, but you will probably appreciate it better if you do ;)
Overview
The transformation to generate the data for the date dimension follows a pretty straightforward design. The graphical representation of the transformation is shown below:
First, the dimension table is created (Prepare). After that, rows are generated to fill it (Input). However, the generated rows are almost empty and barren - we still need to derive and add data to fill the attributes of the date dimension (Transformation). Finally, the data is stored in the date dimension table (Output).
Step-by-Step
The remainder of this article describes in detail how to build this transformation. The majority of steps is probably not very interesting to moderately experienced Kettle users, but may be of use to beginning users.
Note for users that are completely new to Kettle - it is advisable to review the first few chapters of the Spoon user guide (Spoon is the name of Kettle tool you use to design the ETL process). It explains how to start up the tool, create a new transformation, add and connect steps etc. You can find it in the docs/English directory beneath the Kettle home directory.
MySQL JDBC driver: setting the characterEncoding property to UTF8
You need to create a (JDBC) connection to MySQL in the usual, straightforward way:
In addition, you need to set the
characterEncoding
property of the JDBC driver:This ensures MySQL will be able to understand the utf8 encoded data that we may produce to generate a date dimension in the, say, Chinese language. Note that you cannot just use a statement like
SET NAMES utf8
to do this. This is not specific to Kettle, but has to do with the way the MySQL JDBC driver (Connector/J) handles character sets. Please refer to the "Using character sets and unicode" section of the Connector/J documentation for more information on this topic.Creating the date dimension table
In this particular case, it seemed convenient to create the dimension table as part of the transformation. This is done using the "Execute SQL Script" step shown below:The "Execute SQL Script" step executes the following script to create the date dimension table:
DROP TABLE IF EXISTS dim_dateThis is by no means a complete date dimension. The most important limitation is that it only contains attributes that are immediately derivable from the calendar. So, attributes to denote business specific periods like the fiscal year and holidays are not included.
;
CREATE TABLE IF NOT EXISTS dim_date (
date_key smallint unsigned NOT NULL,
date date NOT NULL,
date_short char(12) NOT NULL,
date_medium char(16) NOT NULL,
date_long char(24) NOT NULL,
date_full char(32) NOT NULL,
day_in_year smallint unsigned NOT NULL,
day_in_month tinyint unsigned NOT NULL,
is_first_day_in_month char(10) NOT NULL,
is_last_day_in_month char(10) NOT NULL,
day_abbreviation char(3) NOT NULL,
day_name char(12) NOT NULL,
week_in_year tinyint unsigned NOT NULL,
week_in_month tinyint unsigned NOT NULL,
is_first_day_in_week char(10) NOT NULL,
is_last_day_in_week char(10) NOT NULL,
month_number tinyint unsigned NOT NULL,
month_abbreviation char(3) NOT NULL,
month_name char(12) NOT NULL,
year2 char(2) NOT NULL,
year4 year NOT NULL,
quarter_name char(2) NOT NULL,
quarter_number tinyint NOT NULL,
year_quarter char(7) NOT NULL,
year_month_number char(7) NOT NULL,
year_month_abbreviation char(8) NOT NULL,
PRIMARY KEY(date_key),
UNIQUE(date)
)
ENGINE=MyISAM
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_unicode_ci
Generating 10 years worth of days
The grain of the date dimension is days - a row in the date dimension represents a single day. In this case, the "Generate Rows" is configured to generate 3660 rows, which roughly corresponds with enough days to last 10 years:In the example, this step is also used to provide parameters to generate the date dimension data. As we'll see in a moment, the
inital_date
field effectively specifies the first date that goes into the date dimension. The language_code
and country_code
fields are used to localize the textual attributes of the date dimension, and the local_yes
and local_no
fields are used for boolean attributes.There are other ways to get these parameters into our transformation. For example, we could have used an "Add Constants" step with a similar result. Another possibility would be to get this data from the environment using a "Get Variables" step, and this would allow the parameters to be specified at transformation run-time.
Counting the days
Although we certainly generate enough rows, they are all identical. In order to have each row represent a single distinct day, we need a way to 'count' the generated rows. We do this by adding a "Sequence" step:In this case, we use the "Add sequence" step to generate an incrementing number within the scope of the transformation. As we'll see later on, we can add this to our initial date to get a series of consecutive dates.
Calculating date dimension Attributes
The previous steps form a basis from which we can derive all of the attributes that currently make up our date dimension. To actually calculate the date attributes, we use a "Modified Java Script Value" step:Kettle comes with an embedded Rhino javascript engine. The "Modified Java Script Value" step lets you use it to run javascript code to as part of the transformation.
The javascript code is executed for each row that comes out of the previous steps. In the script code, one can reference the values from the input rows, perform some processing on them, and generate new output fields.
One of the fortunate characteristics of the Rhino engine is that it lets us use java classes inside the javascript code. Let's take a look at the script to see how we can use that to generate the localized data for our data dimension attributes.
Initialization
The first thing we do in the javascript code is to get data from the current input row. In the "Generate Rows" step, we added thelanguage_code
and country_code
fields to specify a locale. Here, in the script, we use the following piece of code to turn that into a java.util.Locale
object://Create a Locale according to the specified language codeThe
var locale = new java.util.Locale(
language_code.getString() //get the ISO639 language_code from the input row
, country_code.getString() //get the ISO3166 country_code from the input row
);
java.util.Locale
class represents a particular cultural region. It forms a cornerstone of the internationalization support built into the java platform, and provides information to many other classes to generate appropriately localized output. We will be using the locale on a number of occasions, but first, we use our it to initialize a
java.util.Calendar
object:(Note that the java platform currently only provides one concrete Calender Class: the
//Create a calendar, use the specified locale
var calendar = new java.util.GregorianCalendar(locale);
java.util.GregorianCalendar
. Unfortunately, java does not seem to provide a built-in recipe for dealing with, for example, Islamic or Hebrew calendars).We require the calendar object to obtain an instance of the
java.util.Date
Class that represents the date corresponding to the current row. To do that, we first set the calendar's current date using the initial_date
field that was specified in the "Generate Rows" step://Set the initial dateWe need this to add the number of days generated by our "Add Sequence" step:
calendar.setTime(initial_date.getDate());
//set the calendar to the current date by adding DaySequence days
calendar.add(calendar.DAY_OF_MONTH,DaySequence.getInteger() - 1);
(Note that we substract
1
from the DaySequence value. This is because our sequence starts at 1
, and we want the specified initial date to be included in our date dimension).We conclude the initialization of the script by retrieving a
java.util.Date
object that represents the date for the current row.//get the calendar dateThis
var date = new java.util.Date(calendar.getTimeInMillis());
java.util.Date
instance is assigned to the date
variable in the script, allowing it to be used as an output field of the javascript step. We require this in order to fill the date
column of the date dimension table. We will also be using the date
variable later on in this script to derive the value of other date dimension attributes.Getting Text representations of full dates
Our date dimension has a number of attributes to denote a complete date containing day, month and year parts, in various formats:date_short
, date_medium
, date_long
and date_full
. These are all generated using the java.text.DateFormat
class. To do that, we first need to create an appropriate
DateFormat
instance using the static getDateInstance()
method, passing our locale object as well as a constant that specifies whether we want to short, medium, long or full format. Then, we can pass the java.util.Date
object for which we want to obtain the textual representation to the format
method of the newly created java.text.DateFormat
instance://en-us example: 9/3/07
var date_short = java.text.DateFormat.getDateInstance(
java.text.DateFormat.SHORT
, locale
).format(date);
//en-us example: Sep 3, 2007
var date_medium = java.text.DateFormat.getDateInstance(
java.text.DateFormat.MEDIUM
, locale
).format(date);
//en-us example: September 3, 2007
var date_long = java.text.DateFormat.getDateInstance(
java.text.DateFormat.LONG
, locale
).format(date);
//en-us example: Monday, September 3, 2007
var date_full = java.text.DateFormat.getDateInstance(
java.text.DateFormat.FULL
, locale
).format(date);
Formatting date parts
Extracting and formatting different date parts is most easily done by applying theformat
function on a subclass of java.text.Dateformat
, the java.text.SimpleDateFormat
class. The java.text.SimpleDateFormat
class allows formatting of dates based on date and time patterns://day in year: 1..366In this example, we pass both the locale and a date pattern to the constructor to create an instance of the
var simpleDateFormat = java.text.SimpleDateFormat("D",locale);
var day_in_year = simpleDateFormat.format(date);
java.text.SimpleDateFormat
class. The pattern is passed as the string "D"
, specifying a day-in-year format. Once we created the
java.text.SimpleDateFormat
instance, we can apply a new pattern to it using the applyPattern()
method. Calling the format
method again, we obtain the date in the desired format: //day in month: 1..31
simpleDateFormat.applyPattern("d");
var day_in_month = simpleDateFormat.format(date);
//en-us example: "Monday"
simpleDateFormat.applyPattern("EEEE");
var day_name = simpleDateFormat.format(date);
//en-us example: "Mon"
simpleDateFormat.applyPattern("E");
var day_abbreviation = simpleDateFormat.format(date);
//week in year, 1..53
simpleDateFormat.applyPattern("ww");
var week_in_year = simpleDateFormat.format(date);
//week in month, 1..5
simpleDateFormat.applyPattern("W");
var week_in_month = simpleDateFormat.format(date);
//month number in year, 1..12
simpleDateFormat.applyPattern("MM");
var month_number = simpleDateFormat.format(date);
//en-us example: "September"
simpleDateFormat.applyPattern("MMMM");
var month_name = simpleDateFormat.format(date);
//en-us example: "Sep"
simpleDateFormat.applyPattern("MMM");
var month_abbreviation = simpleDateFormat.format(date);
//2 digit representation of the year, example: "07" for 2007
simpleDateFormat.applyPattern("y");
var year2 = simpleDateFormat.format(date);
//4 digit representation of the year, example: 2007
simpleDateFormat.applyPattern("yyyy");
var year4 = simpleDateFormat.format(date);
Dealing with Quarters
Although thejava.text.SimpleDateFormat
class is useful, it does not provide any functionality for working with quarters. We do want our date dimension to contain attributes to represent the quarter, so we have to reside to computing these manually://handling Quarters is a DIYAlthough this will do for now, this solution doesn't really cut it because it does not produce localized output. Anyway, it is better than nothing so we'll just have to make do with it.
var quarter_name = "Q";
var quarter_number;
switch(parseInt(month_number)){
case 1: case 2: case 3: quarter_number = "1"; break;
case 4: case 5: case 6: quarter_number = "2"; break;
case 7: case 8: case 9: quarter_number = "3"; break;
case 10: case 11: case 12: quarter_number = "4"; break;
}
quarter_name += quarter_number;
Period demarcation flags
Our date dimension has a few attributes that are used to indicate the start and end of week and month periods. We use simple yes/no type flags, but we allow the actual "yes" and "no" values to be specified by the user in the "Generate Rows" step. We retrieve them with the following piece of code://get the local yes/no valuesWe can now use them these to flag the start and end of week and month periods.
var yes = local_yes.getString();
var no = local_no.getString();
The start (and of course, also the end) of the week are subject to the locale. In order to find out if we are dealing with the first day of a week, we use the
getFirstDayOfWeek()
method of the java.util.Calendar
class. By comparing its return value with the day of week of the current row, we can see if we happen to be dealing with the first day of the week://initialize for week calculationsNote that we obtain the current day of the week by passing the value of the
var first_day_of_week = calendar.getFirstDayOfWeek();
var day_of_week = java.util.Calendar.DAY_OF_WEEK;
//find out if this is the first day of the week
var is_first_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_first_day_in_week = yes;
} else {
is_first_day_in_week = no;
}
DAY_OF_WEEK
constant to the get method of the java.util.Calendar
object that we initialized at the start of the script.In order to set the value for the
is_last_day_in_week
attribute of the date dimension, we simply find out if the next day happens to be the first day of the week. If it is, then by definition, the current row represents the last day of the week://calculate the next day(Note that we have already used similar code to add a day to a date when we added the day sequence to the initial date.)
calendar.add(calendar.DAY_OF_MONTH,1);
//get the next calendar date
var next_day = new java.util.Date(calendar.getTimeInMillis());
//find out if this is the first day of the week
var is_last_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_last_day_in_week = yes;
} else {
is_last_day_in_week = no;
}
We can use similar logic to calculate the values for the
is_first_day_in_month
and is_last_day_in_month
indicators. This is actually easier, because the first day in the month is not dependant upon the locale (at least - not within one calendar). So, we only need to find out if the day of month is equal to one://find out if this is the first day of the month
var is_first_day_of_month;
if(day_in_month == 1){
is_first_day_in_month = yes;
} else {
is_first_day_in_month = no;
}
//find out if this is the last day in the month
var is_last_day_of_month;
if(java.text.SimpleDateFormat("d",locale).format(next_day)==1){
is_last_day_in_month = yes;
} else {
is_last_day_in_month = no;
}
A few more date attributes
We conclude the computation of the date attributes by adding a few more useful labels://a few useful labelsLike when we calculated the quarters, this is actually not a very good method because the results will not be localized. That said, the result will make sense for many locales, and we don't really have a better way to deal with it right now.
var year_quarter = year4 + "-" + quarter_name;
var year_month_number = year4 + "-" + month_number;
var year_month_abbreviation = year4 + "-" + month_abbreviation;
Defining the step outputs
We just calculated all the required values to fill the attributes of our date dimension. We just need to get them out of the script and into the outputs of the step.Every variable declared in the javascript (using the
var
keyword) can be used as an output field of the javascript step. The easiest way to generate the outputs is by hitting the "Get Variables" button at the bottom of the dialog. This simply adds an output field for each variable declared in the script:By default, the data type for all the outputs added in this way is set to the String type. Although it is good practice to choose a more specific data type, it is almost always unncessary in this case, as all integer type values will be correctly converted implicitly when we insert them into the database. There is one exception in this case, and that is the
date
output. Inside the script, it is an instance of a java.util.Date
class, and we must set the type to "Date" in the output too. Otherwise, the (java) string representation of the java.util.Date
object will be sent as output, and this is not automatically recognized as a date
by MySQL.Discarding Fields
We are now almost ready to insert the rows into the date dimension table. We only need to discard all fields in the stream that do not correspond with any of the columns in our date dimension table. We use a "Select Values" step to do that:We use the "Get Fields To Select" button to pull in all available fields, and after that, simply select and delete each field that we do not need. As a final step, we rename the
DaySequence
field to date_key
to map it to the date_key
column in our date dimension table.Inserting data into the table
In the final step, we add the generated data to thedim_date
table we created in the very first step of the transformation:We only need to specify the connection and the table name here, and the step will then automatically attempt to map the fields of the incoming rows to table columns.
We could have used the "Insert / Update" step, or even the "Execute SQL Script" step too to write the data to the dimension table, but that would require a little bit of extra work.
Running the transformation
After building the transformation, you can run it by hitting the "running man" icon on the toolbar. This will open a dialog where you can set a number of properties for the transformation. Hit "Launch" button there and after that, the transformation will be executed:
Closing Notes
I hope you enjoyed this tip. If you want to, you can download the kettle transformation here, and use it as you see fit.
If you are interested in open source data warehousing, register for the MySQL Enterprise Data Warehousing Seminar, Thursday, September 06, 2007 and hear what Robin Schumacher has to say about that subject. (Note that this is a general MySQL data warehousing seminar - this post and the seminar are unrelated)
7 comments:
Hi Mr.Roland, I am a begginer with kettle.
I Have a database with a table names "program". On the other hand, I have a colum with a table names "nameofprogram". I need to change the names of programs that start with "ESP" to code "01"; at the same time I need to rename the programs thar start with "ING" to code "02", the other one of names need to be rename with "03"
all these information need to be save in the same colum and the others colums don´t change.
Example:
Before:
TABLE: PROGRAM
nameofprogram colum2 ...
ESPECIALIZACIÓN EN GERIATRIA infcolum2 ...
INGENIERIA MECATRONICA infcolum2 ..
ESPECIALIZACIÓN EN ODONTOLOGIA infcolum2 ...
DERECHO
After:
TABLE: PROGRAM2
code .....
01 infcolum2 ....
02 infcolum2 ....
01 infcolum2 ....
03 infcolum2 ....
How can I do this ? can you send me a transformation file or something like this?
Thank you for your help,
Andrés García
Thanks so much for the awesome tip/tutorial! I just followed your steps and was able to generate my dim_date table, which will be very helpful. I wanted to point out one minor error in the field names that caused me some problems until I figured it out. In the create table step, you give a column names of is_first_day_in_month and is_last_day_of_month.
However, during the modified JS scripting step, you give these output field names of is_first_day_OF_month and is_last_day_OF_month. Running this in Spoon 4.0 and MySQL 5.1 caused the mapping to fail, so the fields were initially populated as 'undefined' instead of yes/no character strings. I changed the name of the columns in the table to match the JS output and it worked fine. Again, thanks for the patient explanation. I feel like I understand both how to create the date dimensions and how to use Kettle better after going through this process.
Hi thoughtcrimes!
thanks for the kind words, and many thanks for pointing out these bugs. I'm not sure when I'll get round to fixing them, but I can now at least point to your comment if there's someone else having issues.
I'm glad this post was useful to you.
If you're interested in working with Kettle more, I'd like to point out that recently, the book "Pentaho Kettle Solutions" was published by Wiley and written by me, Jos van Dongen and Matt Casters, chief data integration at Pentaho and Kettle founder and lead developer. You can find it here on amazon: http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177
Thank you very much for your excelent post! Nevertheless, I think there is a small error on the Javascript part: when you format the 2 digit year, the pattern is "y", and it should be "yy":
simpleDateFormat.applyPattern("yy");
var year2 = simpleDateFormat.format(date);
Hope it helps. Thanks a lot!
click en get variables button from javascritpt show this:
General error executing script:
ReferenceError: "initial_date" is not defined.
Spoon v. 4.2.0
Hi Roland, I am a begginer with kettle..
I want to thank you for this awesome tutorial! I just followed your steps to generate my dim_date table. I had also generated a dim_time table.
Can you please tell me the script that I should write in the step "Modified Java Script Value" in order to extract only the TIME [09:20:08] from an attribut in a PostgresSQL database (its type is "timestamp without time zone" ) which has this forme " 2014-03-06 09:20:08.742 " ?
... in that case the step "Strings cut" isn't useful.
Thank you for your help,
Sara
Hi Sara,
the Javascript step is like a swiss army knife. It can do anything, but it might not be the most efficient or straightforward tool for the job.
To extract the time, you can try a couple of things. You could use a "Select Values" step, and use the "Modify" tab to cast the timstamp values to strings. The fields grid has a "format" column where you can specify the desired format. So there you would type the format string that extracts only the time. Look here for the format strings: http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
I think this format does the job:
HH:mm:ss
Alternatively you could use the regex step to extract the timepart.
Post a Comment