Tuesday, September 04, 2007

Kettle Tip: Using java locales for a Date Dimension

The Date dimension is a well known construct in general data warehousing. In many cases, the data for a date dimension is generated using a database stored procedure or shell-script.

Another approach to obtain the data for a date dimension is to generate it using an ETL tool like Pentaho Data Integration, a.k.a. Kettle. I think this approach makes sense for a number of reasons:

  • When you tend to use a particular ETL tool, you will be able to reuse the date dimension generator over an over, and on different database platforms.

  • You won't need special database privileges beyond the ones you need already. Privileges for creating tables and to perform DML will usually be available, whereas you might need to convince a DBA that you require extra privileges to create and execute stored procedures.

In addition to these general considerations, you can pull a neat little trick with Kettle to localize the data and format of the date attributes. I wouldn't go as far as to say that this feature is Kettle specific: rather, it relies on the localization support built into the java platform and the way you can put that to use in Kettle transformations.

Prerequisites


In this tip, the steps to create a date dimension are described using Kettle 2.5.1 (Generally available Release) and MySQL 5.1.20 (Beta). You will be able to follow through the example using earlier (and later) versions of both products though - I am not using any functionality that is specific to these particular version of the products. The recipe does not really require that you understand anything about data warehouses or date dimensions, but you will probably appreciate it better if you do ;)

Overview


The transformation to generate the data for the date dimension follows a pretty straightforward design. The graphical representation of the transformation is shown below:

localized_date_dimension1

First, the dimension table is created (Prepare). After that, rows are generated to fill it (Input). However, the generated rows are almost empty and barren - we still need to derive and add data to fill the attributes of the date dimension (Transformation). Finally, the data is stored in the date dimension table (Output).

Step-by-Step


The remainder of this article describes in detail how to build this transformation. The majority of steps is probably not very interesting to moderately experienced Kettle users, but may be of use to beginning users.

Note for users that are completely new to Kettle - it is advisable to review the first few chapters of the Spoon user guide (Spoon is the name of Kettle tool you use to design the ETL process). It explains how to start up the tool, create a new transformation, add and connect steps etc. You can find it in the docs/English directory beneath the Kettle home directory.

MySQL JDBC driver: setting the characterEncoding property to UTF8


You need to create a (JDBC) connection to MySQL in the usual, straightforward way:

kettle-mysql-connection

In addition, you need to set the characterEncoding property of the JDBC driver:

kettle-mysql-connection-utf8

This ensures MySQL will be able to understand the utf8 encoded data that we may produce to generate a date dimension in the, say, Chinese language. Note that you cannot just use a statement like SET NAMES utf8 to do this. This is not specific to Kettle, but has to do with the way the MySQL JDBC driver (Connector/J) handles character sets. Please refer to the "Using character sets and unicode" section of the Connector/J documentation for more information on this topic.

Creating the date dimension table

In this particular case, it seemed convenient to create the dimension table as part of the transformation. This is done using the "Execute SQL Script" step shown below:

kettle-sql-script

The "Execute SQL Script" step executes the following script to create the date dimension table:
DROP TABLE IF EXISTS dim_date
;
CREATE TABLE IF NOT EXISTS dim_date (
date_key smallint unsigned NOT NULL,
date date NOT NULL,
date_short char(12) NOT NULL,
date_medium char(16) NOT NULL,
date_long char(24) NOT NULL,
date_full char(32) NOT NULL,
day_in_year smallint unsigned NOT NULL,
day_in_month tinyint unsigned NOT NULL,
is_first_day_in_month char(10) NOT NULL,
is_last_day_in_month char(10) NOT NULL,
day_abbreviation char(3) NOT NULL,
day_name char(12) NOT NULL,
week_in_year tinyint unsigned NOT NULL,
week_in_month tinyint unsigned NOT NULL,
is_first_day_in_week char(10) NOT NULL,
is_last_day_in_week char(10) NOT NULL,
month_number tinyint unsigned NOT NULL,
month_abbreviation char(3) NOT NULL,
month_name char(12) NOT NULL,
year2 char(2) NOT NULL,
year4 year NOT NULL,
quarter_name char(2) NOT NULL,
quarter_number tinyint NOT NULL,
year_quarter char(7) NOT NULL,
year_month_number char(7) NOT NULL,
year_month_abbreviation char(8) NOT NULL,
PRIMARY KEY(date_key),
UNIQUE(date)
)
ENGINE=MyISAM
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_unicode_ci
This is by no means a complete date dimension. The most important limitation is that it only contains attributes that are immediately derivable from the calendar. So, attributes to denote business specific periods like the fiscal year and holidays are not included.

Generating 10 years worth of days

The grain of the date dimension is days - a row in the date dimension represents a single day. In this case, the "Generate Rows" is configured to generate 3660 rows, which roughly corresponds with enough days to last 10 years:

kettle-generating-10-years

In the example, this step is also used to provide parameters to generate the date dimension data. As we'll see in a moment, the inital_date field effectively specifies the first date that goes into the date dimension. The language_code and country_code fields are used to localize the textual attributes of the date dimension, and the local_yes and local_no fields are used for boolean attributes.

There are other ways to get these parameters into our transformation. For example, we could have used an "Add Constants" step with a similar result. Another possibility would be to get this data from the environment using a "Get Variables" step, and this would allow the parameters to be specified at transformation run-time.

Counting the days

Although we certainly generate enough rows, they are all identical. In order to have each row represent a single distinct day, we need a way to 'count' the generated rows. We do this by adding a "Sequence" step:

kettle-adding-sequence

In this case, we use the "Add sequence" step to generate an incrementing number within the scope of the transformation. As we'll see later on, we can add this to our initial date to get a series of consecutive dates.

Calculating date dimension Attributes

The previous steps form a basis from which we can derive all of the attributes that currently make up our date dimension. To actually calculate the date attributes, we use a "Modified Java Script Value" step:

kettle-javascript

Kettle comes with an embedded Rhino javascript engine. The "Modified Java Script Value" step lets you use it to run javascript code to as part of the transformation.

The javascript code is executed for each row that comes out of the previous steps. In the script code, one can reference the values from the input rows, perform some processing on them, and generate new output fields.

One of the fortunate characteristics of the Rhino engine is that it lets us use java classes inside the javascript code. Let's take a look at the script to see how we can use that to generate the localized data for our data dimension attributes.
Initialization
The first thing we do in the javascript code is to get data from the current input row. In the "Generate Rows" step, we added the language_code and country_code fields to specify a locale. Here, in the script, we use the following piece of code to turn that into a java.util.Locale object:
//Create a Locale according to the specified language code
var locale = new java.util.Locale(
language_code.getString() //get the ISO639 language_code from the input row
, country_code.getString() //get the ISO3166 country_code from the input row
);
The java.util.Locale class represents a particular cultural region. It forms a cornerstone of the internationalization support built into the java platform, and provides information to many other classes to generate appropriately localized output.

We will be using the locale on a number of occasions, but first, we use our it to initialize a java.util.Calendar object:

//Create a calendar, use the specified locale
var calendar = new java.util.GregorianCalendar(locale);
(Note that the java platform currently only provides one concrete Calender Class: the java.util.GregorianCalendar. Unfortunately, java does not seem to provide a built-in recipe for dealing with, for example, Islamic or Hebrew calendars).

We require the calendar object to obtain an instance of the java.util.Date Class that represents the date corresponding to the current row. To do that, we first set the calendar's current date using the initial_date field that was specified in the "Generate Rows" step:
//Set the initial date
calendar.setTime(initial_date.getDate());
We need this to add the number of days generated by our "Add Sequence" step:
//set the calendar to the current date by adding DaySequence days
calendar.add(calendar.DAY_OF_MONTH,DaySequence.getInteger() - 1);

(Note that we substract 1 from the DaySequence value. This is because our sequence starts at 1, and we want the specified initial date to be included in our date dimension).

We conclude the initialization of the script by retrieving a java.util.Date object that represents the date for the current row.
//get the calendar date
var date = new java.util.Date(calendar.getTimeInMillis());
This java.util.Date instance is assigned to the date variable in the script, allowing it to be used as an output field of the javascript step. We require this in order to fill the date column of the date dimension table. We will also be using the date variable later on in this script to derive the value of other date dimension attributes.
Getting Text representations of full dates
Our date dimension has a number of attributes to denote a complete date containing day, month and year parts, in various formats: date_short, date_medium, date_long and date_full. These are all generated using the java.text.DateFormat class.

To do that, we first need to create an appropriate DateFormat instance using the static getDateInstance() method, passing our locale object as well as a constant that specifies whether we want to short, medium, long or full format. Then, we can pass the java.util.Date object for which we want to obtain the textual representation to the format method of the newly created java.text.DateFormat instance:
//en-us example: 9/3/07
var date_short = java.text.DateFormat.getDateInstance(
java.text.DateFormat.SHORT
, locale
).format(date);
//en-us example: Sep 3, 2007
var date_medium = java.text.DateFormat.getDateInstance(
java.text.DateFormat.MEDIUM
, locale
).format(date);
//en-us example: September 3, 2007
var date_long = java.text.DateFormat.getDateInstance(
java.text.DateFormat.LONG
, locale
).format(date);
//en-us example: Monday, September 3, 2007
var date_full = java.text.DateFormat.getDateInstance(
java.text.DateFormat.FULL
, locale
).format(date);
Formatting date parts
Extracting and formatting different date parts is most easily done by applying the format function on a subclass of java.text.Dateformat, the java.text.SimpleDateFormat class. The java.text.SimpleDateFormat class allows formatting of dates based on date and time patterns:
//day in year: 1..366
var simpleDateFormat = java.text.SimpleDateFormat("D",locale);
var day_in_year = simpleDateFormat.format(date);
In this example, we pass both the locale and a date pattern to the constructor to create an instance of the java.text.SimpleDateFormat class. The pattern is passed as the string "D", specifying a day-in-year format.

Once we created the java.text.SimpleDateFormat instance, we can apply a new pattern to it using the applyPattern() method. Calling the format method again, we obtain the date in the desired format:
//day in month: 1..31
simpleDateFormat.applyPattern("d");
var day_in_month = simpleDateFormat.format(date);
//en-us example: "Monday"
simpleDateFormat.applyPattern("EEEE");
var day_name = simpleDateFormat.format(date);
//en-us example: "Mon"
simpleDateFormat.applyPattern("E");
var day_abbreviation = simpleDateFormat.format(date);
//week in year, 1..53
simpleDateFormat.applyPattern("ww");
var week_in_year = simpleDateFormat.format(date);
//week in month, 1..5
simpleDateFormat.applyPattern("W");
var week_in_month = simpleDateFormat.format(date);
//month number in year, 1..12
simpleDateFormat.applyPattern("MM");
var month_number = simpleDateFormat.format(date);
//en-us example: "September"
simpleDateFormat.applyPattern("MMMM");
var month_name = simpleDateFormat.format(date);
//en-us example: "Sep"
simpleDateFormat.applyPattern("MMM");
var month_abbreviation = simpleDateFormat.format(date);
//2 digit representation of the year, example: "07" for 2007
simpleDateFormat.applyPattern("y");
var year2 = simpleDateFormat.format(date);
//4 digit representation of the year, example: 2007
simpleDateFormat.applyPattern("yyyy");
var year4 = simpleDateFormat.format(date);
Dealing with Quarters
Although the java.text.SimpleDateFormat class is useful, it does not provide any functionality for working with quarters. We do want our date dimension to contain attributes to represent the quarter, so we have to reside to computing these manually:
//handling Quarters is a DIY
var quarter_name = "Q";
var quarter_number;
switch(parseInt(month_number)){
case 1: case 2: case 3: quarter_number = "1"; break;
case 4: case 5: case 6: quarter_number = "2"; break;
case 7: case 8: case 9: quarter_number = "3"; break;
case 10: case 11: case 12: quarter_number = "4"; break;
}
quarter_name += quarter_number;
Although this will do for now, this solution doesn't really cut it because it does not produce localized output. Anyway, it is better than nothing so we'll just have to make do with it.
Period demarcation flags
Our date dimension has a few attributes that are used to indicate the start and end of week and month periods. We use simple yes/no type flags, but we allow the actual "yes" and "no" values to be specified by the user in the "Generate Rows" step. We retrieve them with the following piece of code:
//get the local yes/no values
var yes = local_yes.getString();
var no = local_no.getString();
We can now use them these to flag the start and end of week and month periods.

The start (and of course, also the end) of the week are subject to the locale. In order to find out if we are dealing with the first day of a week, we use the getFirstDayOfWeek() method of the java.util.Calendar class. By comparing its return value with the day of week of the current row, we can see if we happen to be dealing with the first day of the week:
//initialize for week calculations
var first_day_of_week = calendar.getFirstDayOfWeek();
var day_of_week = java.util.Calendar.DAY_OF_WEEK;

//find out if this is the first day of the week
var is_first_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_first_day_in_week = yes;
} else {
is_first_day_in_week = no;
}
Note that we obtain the current day of the week by passing the value of the DAY_OF_WEEK constant to the get method of the java.util.Calendar object that we initialized at the start of the script.

In order to set the value for the is_last_day_in_week attribute of the date dimension, we simply find out if the next day happens to be the first day of the week. If it is, then by definition, the current row represents the last day of the week:
//calculate the next day
calendar.add(calendar.DAY_OF_MONTH,1);

//get the next calendar date
var next_day = new java.util.Date(calendar.getTimeInMillis());

//find out if this is the first day of the week
var is_last_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_last_day_in_week = yes;
} else {
is_last_day_in_week = no;
}
(Note that we have already used similar code to add a day to a date when we added the day sequence to the initial date.)

We can use similar logic to calculate the values for the is_first_day_in_month and is_last_day_in_month indicators. This is actually easier, because the first day in the month is not dependant upon the locale (at least - not within one calendar). So, we only need to find out if the day of month is equal to one:
//find out if this is the first day of the month
var is_first_day_of_month;
if(day_in_month == 1){
is_first_day_in_month = yes;
} else {
is_first_day_in_month = no;
}

//find out if this is the last day in the month
var is_last_day_of_month;
if(java.text.SimpleDateFormat("d",locale).format(next_day)==1){
is_last_day_in_month = yes;
} else {
is_last_day_in_month = no;
}
A few more date attributes
We conclude the computation of the date attributes by adding a few more useful labels:
//a few useful labels
var year_quarter = year4 + "-" + quarter_name;
var year_month_number = year4 + "-" + month_number;
var year_month_abbreviation = year4 + "-" + month_abbreviation;
Like when we calculated the quarters, this is actually not a very good method because the results will not be localized. That said, the result will make sense for many locales, and we don't really have a better way to deal with it right now.
Defining the step outputs
We just calculated all the required values to fill the attributes of our date dimension. We just need to get them out of the script and into the outputs of the step.

Every variable declared in the javascript (using the var keyword) can be used as an output field of the javascript step. The easiest way to generate the outputs is by hitting the "Get Variables" button at the bottom of the dialog. This simply adds an output field for each variable declared in the script:

kettle-javascript-outputs

By default, the data type for all the outputs added in this way is set to the String type. Although it is good practice to choose a more specific data type, it is almost always unncessary in this case, as all integer type values will be correctly converted implicitly when we insert them into the database. There is one exception in this case, and that is the date output. Inside the script, it is an instance of a java.util.Date class, and we must set the type to "Date" in the output too. Otherwise, the (java) string representation of the java.util.Date object will be sent as output, and this is not automatically recognized as a date by MySQL.

Discarding Fields

We are now almost ready to insert the rows into the date dimension table. We only need to discard all fields in the stream that do not correspond with any of the columns in our date dimension table. We use a "Select Values" step to do that:

kettle-select-values

We use the "Get Fields To Select" button to pull in all available fields, and after that, simply select and delete each field that we do not need. As a final step, we rename the DaySequence field to date_key to map it to the date_key column in our date dimension table.

Inserting data into the table

In the final step, we add the generated data to the dim_date table we created in the very first step of the transformation:

kettle-table-output

We only need to specify the connection and the table name here, and the step will then automatically attempt to map the fields of the incoming rows to table columns.

We could have used the "Insert / Update" step, or even the "Execute SQL Script" step too to write the data to the dimension table, but that would require a little bit of extra work.

Running the transformation


After building the transformation, you can run it by hitting the "running man" icon on the toolbar. This will open a dialog where you can set a number of properties for the transformation. Hit "Launch" button there and after that, the transformation will be executed:

kettle-run-transformation

Closing Notes


I hope you enjoyed this tip. If you want to, you can download the kettle transformation here, and use it as you see fit.

If you are interested in open source data warehousing, register for the MySQL Enterprise Data Warehousing Seminar, Thursday, September 06, 2007 and hear what Robin Schumacher has to say about that subject. (Note that this is a general MySQL data warehousing seminar - this post and the seminar are unrelated)

7 comments:

Andrés García said...

Hi Mr.Roland, I am a begginer with kettle.
I Have a database with a table names "program". On the other hand, I have a colum with a table names "nameofprogram". I need to change the names of programs that start with "ESP" to code "01"; at the same time I need to rename the programs thar start with "ING" to code "02", the other one of names need to be rename with "03"
all these information need to be save in the same colum and the others colums don´t change.

Example:

Before:

TABLE: PROGRAM

nameofprogram colum2 ...

ESPECIALIZACIÓN EN GERIATRIA infcolum2 ...
INGENIERIA MECATRONICA infcolum2 ..
ESPECIALIZACIÓN EN ODONTOLOGIA infcolum2 ...
DERECHO

After:

TABLE: PROGRAM2

code .....

01 infcolum2 ....
02 infcolum2 ....
01 infcolum2 ....
03 infcolum2 ....

How can I do this ? can you send me a transformation file or something like this?

Thank you for your help,
Andrés García

thoughtcrimes said...

Thanks so much for the awesome tip/tutorial! I just followed your steps and was able to generate my dim_date table, which will be very helpful. I wanted to point out one minor error in the field names that caused me some problems until I figured it out. In the create table step, you give a column names of is_first_day_in_month and is_last_day_of_month.

However, during the modified JS scripting step, you give these output field names of is_first_day_OF_month and is_last_day_OF_month. Running this in Spoon 4.0 and MySQL 5.1 caused the mapping to fail, so the fields were initially populated as 'undefined' instead of yes/no character strings. I changed the name of the columns in the table to match the JS output and it worked fine. Again, thanks for the patient explanation. I feel like I understand both how to create the date dimensions and how to use Kettle better after going through this process.

rpbouman said...

Hi thoughtcrimes!

thanks for the kind words, and many thanks for pointing out these bugs. I'm not sure when I'll get round to fixing them, but I can now at least point to your comment if there's someone else having issues.

I'm glad this post was useful to you.

If you're interested in working with Kettle more, I'd like to point out that recently, the book "Pentaho Kettle Solutions" was published by Wiley and written by me, Jos van Dongen and Matt Casters, chief data integration at Pentaho and Kettle founder and lead developer. You can find it here on amazon: http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177

Roberto Novaes said...

Thank you very much for your excelent post! Nevertheless, I think there is a small error on the Javascript part: when you format the 2 digit year, the pattern is "y", and it should be "yy":
simpleDateFormat.applyPattern("yy");
var year2 = simpleDateFormat.format(date);

Hope it helps. Thanks a lot!

Anonymous said...

click en get variables button from javascritpt show this:
General error executing script:
ReferenceError: "initial_date" is not defined.

Spoon v. 4.2.0

Anonymous said...

Hi Roland, I am a begginer with kettle..

I want to thank you for this awesome tutorial! I just followed your steps to generate my dim_date table. I had also generated a dim_time table.

Can you please tell me the script that I should write in the step "Modified Java Script Value" in order to extract only the TIME [09:20:08] from an attribut in a PostgresSQL database (its type is "timestamp without time zone" ) which has this forme " 2014-03-06 09:20:08.742 " ?
... in that case the step "Strings cut" isn't useful.


Thank you for your help,
Sara

rpbouman said...

Hi Sara,

the Javascript step is like a swiss army knife. It can do anything, but it might not be the most efficient or straightforward tool for the job.

To extract the time, you can try a couple of things. You could use a "Select Values" step, and use the "Modify" tab to cast the timstamp values to strings. The fields grid has a "format" column where you can specify the desired format. So there you would type the format string that extracts only the time. Look here for the format strings: http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

I think this format does the job:

HH:mm:ss

Alternatively you could use the regex step to extract the timepart.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...