Percona Live MySQL User's Conference, San Francisco, April 10-12th, 2012 Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Monday, September 24, 2007

Creating MySQL UDFs with Microsoft Visual C++ Express

Some time ago, I announced the MySQL UDF Repository. In short, the MySQL UDF Repository tries to be a one stop place to obtain high quality LGPL licensed libraries containing MySQL UDFs, including documentation and binaries. Since the announcement, our Google Group has grown to a 22 members (including a number of MySQL employees and prominent community members), and we've gained a few interesting new UDF libraries:

lib_mysqludf_preg
A library authored by Rich Waters providing PERL compatible regular expressions.

lib_mysqludf_xql
A library authored by Arnold Daniels with many useful functions to map and export relational data from MySQL to XML.

Another thing that we see happening now is that people are starting to ask for windows binaries. Although we intend to provide binaries for major operating systems, we haven't really got round to it yet for Windows in particular.

I want to make a start now by explaining how to create and run MySQL UDFs on Windows using the Express Edition of the popular Microsoft Visual C++ IDE. I hope this information will be useful to the many MS oriented developers out there that have trouble getting started with MySQL UDFs.

Preparation

Before we can actually start, we need to install and configure some software and obtain a few resources.

Installing Visual C++ 2005 Express Edition

First, you'll need to download and install Microsoft Visual C++ 2005 Express Edition. At present, this is the latest stable release of the free (As in Beer) version of the popular Visual Studio IDE, set up to create, compile and debug C++ programs.

The installation procedure can take a little while, but is otherwise pretty straightforward.

If you already have a paid-for version of Visual Studio 2005, you should not install the express edition (you are at risk of messing up the existing installation if you do). In that case, you should use your paid-for version or alternatively, download the upcoming release (Microsoft Visual C++ 2008 Express Edition, now in Beta) and give that a spin.

Installing the Microsoft Platform SDK

Apart from Visual Studio, you also need to have the Microsoft Platform SDK installed. Although this SDK is officially entitled "Microsoft ® Windows Server® 2003 R2 Platform SDK", it includes the resources for many flavours of Windows, including Windows 2000, Windows 2003 and Windows XP.

Installing the MySQL Development resources

Source files for UDFs contain references to header files supplied by MySQL. The easiest way to obtain them is by installing them using the Setup.exe installer program that you use to install the MySQL Server.

If you are installing a new server, you can ensure that the files are installed by choosing "custom" in the "Setup Type" step of the wizard started by the Setup.exe:

MySQL-custom

After that, you will be able to choose which features you want to install. You need to ensure that the "C Include files/Lib Files" under the "Developer Components" is selected:

MySQL-dev-components

If you are not installing a new server, you should first check to see if you have an include directory immediately beneath the MySQL base direction. If so, you probably don't need to do anything right now.

If you don't have the include directory, it probably means you did not choose to install the "C Include files/Lib Files" when installing the server (by default, they are not installed). Running the Setup.exe program again will offer you the possibility to add new components to the installation:

MySQL-modify

And from here, you will be led to the step where you can choose to install the include files and library files.

Setting up a VC++EE Project for MySQL UDFs

Once you fulfilled all necessary prerequisites, the next step is to create a Visual Studio Project. In this context, a project is a container for source files, resources, references to existing libraries, as well as a number of options to compile the source files.

In this article we will set up a VS project to compile the (existing) source of lib_mysqludf_udf. This is a library that demonstrates the basic features of the MySQL UDF interface, such as processing arguments, returning values and handling errors. It is great to get started programming MySQL UDFs. In order to walk through this example, you only need to download the C source file. (Tip: right click the link and choose an option from the context menu to download the file to your local file system.)

Creating a new Project

To actually create the project, we can use the File/New/Project... menu or else the "create project" hyperlink on the the Visual Studio Startpage:

VC++EE-new-project-menu

This opens a dialog where we must enter a few details about our project:

VC++EE-new-project-dialog-project

For the Visual C++ Express edition, it works best to choose a General/Empty Project. (The paid-for edition of Visual Studio provides templates for projects to create dynamically linked libraries a.k.a. DLLs but as we shall see later on we have to configure this manually.)

We are also required to provide a name for the project. In this case, we use a name that corresponds directly to the source file: lib_mysqludf_udf.

Visual Studio Solution

In Visual Studio, a project is always part of a Solution, which is basically a container for a number of related projects. Because we just started a new project, we are implicitly creating a new solution too, so we have to specify a few things about that as well:

VC++EE-new-project-dialog-solution

In this case, we create a separate directory for the solution itself, and we use the same name for the solution as for the project. It is important to realize that there can be multiple projects per solution, in which case it probably makes more sense to choose a distinct name for the solution as a whole.

After confirming the dialog, a number of directories and files are created:

VC++EE-project-dirs

Adding a source file

Now it is time to add the source file to our project, so if you didn't download the lib_mysqludf_udf C source file yet, you should do so now. Be sure to copy the lib_mysqludf_udf.c source file to the lib_mysqludf_udf project directory beneath the lib_mysqludf_udf solution directory:

VC++EE-copy-source-file

Copying the source file there is just a matter of convenience - I like to keep things that belong together in one place. If you don't keep the file in the project directory, things might may (and probably will) go wrong if you move the source file or the project to another location later on.

Copying the file to the directory still does not formally add the file to the project. To actually add the file to the project, you can right-click the "Source Files" folder beneath the project folder in the Visual Studio Solution Explorer window and use the context menu to add the existing item:

VC++EE-add-existing-item-menu

As an alternative to using the menu you can also add the source file to your project by simply dragging it into the Solution Explorer and dropping it into the "Source Files" folder of the project.

If all went well, the source file is now part of the project and can be opened from the solution explorer:

VC++EE-source-file-added

Project Configuration

Although we already defined the structure of the project, we need to configure it in order to compile it. The configuration can be edited through a single dialog which can be accessed by clicking the "Properties" item in the project folder's context menu:

VC++EE-opening-properties

General Properties

We first need to take care of the general configuration. A Visual Studio project can have several configurations - something which is very useful if you want to create different builds (debug or release) from the same project. However, it is a good idea to first configure all project properties that are the same for all configurations. To do that, we choose "All configurations" in the top left listbox of the configuration dialog. (For this example, we do not separately configure for debug and release builds.)

VC++EE-general-properties

The rest of the configuration process is a matter of editing individual properties. Related properties are organized in property pages, each of which covers a particular aspect of the project. By default, the "General" property page is selected and it makes sense to start editing properties there right away.

In the "General" property page we need to set the "Configuration Type" property to "Dynamic library (.dll)" as we need to be able to load the UDF library dynamically into the MySQL Server.

Configuring the Include path

MySQL UDFs refer to types and contants defined in C header files provided by MySQL. In turn, these refer to header files from the Microsoft Platform SDK. The project does not know automatically where to locate these header files, so we need to configure the project and point it to the location(s) manually.

To specify the location of the header files, we need to activate the "C/C++" property page and edit the "Additional Include Directories" property. You can either directly type the paths in the text box, or otherwise click the elipsis buttons (...) to browse for them.

VC++EE-include-path

For this example, we need to specify two locations:

  • The location of the "include" directory beneath the MySQL installation directory.

  • The location of the "include" directory beneath the Microsoft Platform SDK installation directory.

If you can't find these directorie, you most likely need to revisit the "preparation" section of this article.

Adding the HAVE_DLOPEN macro

The lib_mysqludf_udf.c source file was created using the udf_example.c source file from the MySQL source distribution as an example. The structure of that code uses conditional compilation according to wheter HAVE_DLOPEN is defined:

#ifdef HAVE_DLOPEN

...code goes here...

#endif /* HAVE_DLOPEN */
And this is also used in lib_mysqludf_udf.

I admit that I don't understand why that is there, or what it is supposed to achieve, and I would very much like someone to comment on this blog entry to explain it. Anyway, for Visual C++ it means we have to explicitly define it using a Preprocessor definition:

VC++EE-preprocessor

Configuring the library path

We configured the project to compile a Dynamic-Link Library. For the compiler, this means it cannot just compile the code and package it in a file: the dll target file needs to adhere to a certain specification. In order to make that happen, it needs to link to existing libraries from the platform SDK.

Just like we did for the include path, we need to tell Visual C++ where it can find the libraries it must link to. This can be configured by editing the "Additional Library Directories" property in the "Linker" property page:

VC++EE-linker

In this case, we only need to specify the path of the "Lib" directory find immediately beneath the Platform SDK installation directory.

Compiling the UDFs

At this point, we are ready to compile the project and/or solution. In most cases, you will want to choose the build configuration to choose between a debug or a release build. This can be done by clicking the "Configuration Manager" item in the build menu to invoke the Configuration Manager dialog:

VC++EE-config-manager

Actually building the project is done using the "Build Solution" or " Build Project" item in the "Build" menu:

VC++EE-build

The result of building the solution should be as indicated in the screenshot. If the final line does not read
1>lib_mysqludf_udf - 0 error(s), 0 warning(s)
you might want to read the remainder of this section to figure out what the problem is.

Common problems

Of course, no programming task is complete without running into trouble. In this section, a few common problems compiling the project are listed, as well as their solutions.
fatal error C1083: Cannot open include file: 'filename.h'
The output of the build process might look something like this:
1>Compiling...
1>lib_mysqludf_udf.c
1>..\..\..\temp\lib_mysqludf_udf.c(41) : fatal error C1083: Cannot open include file: 'my_global.h': No such file or directory
1>Build log was saved at "file://c:\projects\lib_mysqludf_udf\lib_mysqludf_udf\Release\BuildLog.htm"
1>lib_mysqludf_udf - 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
This indicates that you did not configure the include path properly. You should revisit the section on configuring the include path and ensure that the path does in fact point to the include directory that appears under the mysql installation directory.

The build output might look something like this:
1>Compiling...
1>lib_mysqludf_udf.c
1>C:\MySQL Server 5.1.21\include\config-win.h(30) : fatal error C1083: Cannot open include file: 'winsock2.h': No such file or directory
1>Build log was saved at "file://c:\projects\lib_mysqludf_udf\lib_mysqludf_udf\Release\BuildLog.htm"
1>lib_mysqludf_udf - 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
This is a similar problem. It occurs when you did include the "include" directory beneath the MySQL installation directory but forgot the one beneath the Microsoft Platform SDK installation directory. Because the latter is referenced by the former, both have to be added to the include path.
fatal error LNK1104: cannot open file 'file.lib'
Your build output might look like this:
1>------ Build started: Project: lib_mysqludf_udf, Configuration: Release Win32 ------
1>Linking...
1>LINK : fatal error LNK1104: cannot open file 'uuid.lib'
1>Build log was saved at "file://c:\projects\lib_mysqludf_udf\lib_mysqludf_udf\Release\BuildLog.htm"
1>lib_mysqludf_udf - 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
This indicates that you did not properly configure the path where Visual Studio looks for Additional Libraries. You should revisit the relevant section in this article and ensure the configured path does in fact contain the specified missing library.

Installing the UDFs

If you managed to succesfully compile the project you are ready to install the UDFs in your MySQL Server. Depending on the whether you chose to do a "Release" or a "Debug" build, you will find the lib_mysqludf_udf.dll in the "Release" or "Debug" directory directly beneath the Solution directory respectively:

VC++EE-dll-location

The .ddl needs to be copied to a location that is accessible to the MySQL Server. For MySQL versions lower than 5.1.19, the bin and/or lib directories right beneath the MySQL installation directory should work. For MySQL version 5.1.19 and beyond, you are required to copy the dll to the plugin_dir. The plugin_dir can be determined by running the following query:
mysql> show variables like 'plugin_dir';
+---------------+-----------------------------+
| Variable_name | Value |
+---------------+-----------------------------+
| plugin_dir | C:\MySQL Server 5.1.21\lib/ |
+---------------+-----------------------------+
1 row in set (0.00 sec)

When the dll is in place, the UDFs can be installed using the CREATE FUNCTION syntax. The following script will install all functions in lib_mysqludf_udf.dll:
CREATE FUNCTION lib_mysqludf_udf_info 
RETURNS STRING
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_count
RETURNS INTEGER
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_type
RETURNS INTEGER
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_value
RETURNS STRING
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_value_length
RETURNS INTEGER
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_maybe_null
RETURNS INTEGER
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_attribute
RETURNS STRING
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_arg_attribute_length
RETURNS INTEGER
SONAME 'lib_mysqludf_udf.dll'
;
CREATE FUNCTION udf_init_error
RETURNS INTEGER
SONAME 'lib_mysqludf_udf.dll'
;

Common Problems

Even if you successfully compiled the solution, you still might run into a few problems at this stage. A few common ones are described in the remainder of this section.
ERROR 1126 (HY000): Can't open shared library 'file.dll'
If you encounter this error, it means that MySQL cannot find the library you are referring to in the SONAME clause of the CREATE FUNCTION statement. You may have made a typo in your statement, or the MySQL may be looking in another location for the libarary than you might think it does. Verify that you typed the correct location. For MySQL 5.1.18 and earlier, ensure that the dll is copied to either the bin and/or lib directory beneath the MySQL installation directory. For MySQL 5.1.19 and beyond, ensure that the file is located in the plugin_dir.
ERROR 1127 (HY000): Can't find symbol 'functionname' in library
If you encounter this error, a few things might be the matter. You might have made a typo in the function identifier in the CREATE FUNCTION statement. Another possibility is that you forgot to add the HAVE_DLOPEN macro to the preprocessor definitions. If needed, revisit that section in this article.
ERROR 1046 (3D000): No database selected
This error occurs when you did not set the default database. The workaround is to set any database as default database using the USE statement:
USE test;
. Arguably this is a bug in the MySQL Server: it somehow thinks we are trying to create a stored function which is bound to a database. It's as if MySQL cannot distinguish between a UDF and a stored function at this point.
ERROR 1044 (42000): Access denied for user 'user'@'host' to database 'mysql'
This error occurs when the user that is trying to create the function is not privileged to write to the mysql system database. To the best of my knowledge, UDFs are written only the mysql.func table so I would expect that granting privileges on that table would be enough to be allowed to create UDFs. It turns out that this is not the case. Granting all privileges on the mysql database does allow a user to install UDFs, but I don't know if that is indeed the minimal set of privileges required to install UDFs.

Using UDFs

If you succesfully installed the UDFs, any user will be able to use them, regardless of the setting for the default database . Here is a quick set of examples for lib_mysqludf_udf. The examples in itself are not very useful in itself, but you they can be useful tools for general UDF development when debugging or testing. Also, you can learn from the code how to work with the UDF interface)
mysql> -- show the lib_mysqludf_udf version
mysql> select lib_mysqludf_udf_info();
+--------------------------------+
| lib_mysqludf_udf_info() |
+--------------------------------+
| lib_mysqludf_sys version 0.0.2 |
+--------------------------------+
1 row in set (0.00 sec)

mysql> -- returns the number of passed arguments
mysql> select udf_arg_count(1,2,3,4);
+------------------------+
| udf_arg_count(1,2,3,4) |
+------------------------+
| 4 |
+------------------------+
1 row in set (0.00 sec)

mysql> -- returns the type of the passed argument
mysql> select udf_arg_type(1), udf_arg_type('string');
+-----------------+------------------------+
| udf_arg_type(1) | udf_arg_type('string') |
+-----------------+------------------------+
| 2 | 0 |
+-----------------+------------------------+
1 row in set (0.00 sec)

mysql> -- returns the (string)value of the passed argument
mysql> select udf_arg_value(1);
+------------------+
| udf_arg_value(1) |
+------------------+
| 1 |
+------------------+
1 row in set (0.00 sec)

mysql> -- returns the string-length of the passed argument
mysql> select udf_arg_value_length(123);
+---------------------------+
| udf_arg_value_length(123) |
+---------------------------+
| 3 |
+---------------------------+
1 row in set (0.00 sec)

mysql> -- returns 1 or 0 according to whether the passed argument might be null
mysql> select udf_arg_maybe_null(null), udf_arg_maybe_null(1);
+--------------------------+-----------------------+
| udf_arg_maybe_null(null) | udf_arg_maybe_null(1) |
+--------------------------+-----------------------+
| 1 | 0 |
+--------------------------+-----------------------+
1 row in set (0.00 sec)

mysql> -- returns the expression text or its alias
mysql> select udf_arg_attribute(1+1), udf_arg_attribute(1+1 as alias);
+------------------------+---------------------------------+
| udf_arg_attribute(1+1) | udf_arg_attribute(1+1 as alias) |
+------------------------+---------------------------------+
| 1+1 | alias |
+------------------------+---------------------------------+
1 row in set (0.00 sec)

mysql> -- returns the length of the expression text or its alias
mysql> select udf_arg_attribute_length(1+1);
+-------------------------------+
| udf_arg_attribute_length(1+1) |
+-------------------------------+
| 3 |
+-------------------------------+
1 row in set (0.00 sec)

mysql> -- throws an error
mysql> select udf_init_error('Whoops!');
ERROR:
Whoops!

Tuesday, September 04, 2007

Kettle Tip: Using java locales for a Date Dimension

The Date dimension is a well known construct in general data warehousing. In many cases, the data for a date dimension is generated using a database stored procedure or shell-script.

Another approach to obtain the data for a date dimension is to generate it using an ETL tool like Pentaho Data Integration, a.k.a. Kettle. I think this approach makes sense for a number of reasons:

  • When you tend to use a particular ETL tool, you will be able to reuse the date dimension generator over an over, and on different database platforms.

  • You won't need special database privileges beyond the ones you need already. Privileges for creating tables and to perform DML will usually be available, whereas you might need to convince a DBA that you require extra privileges to create and execute stored procedures.

In addition to these general considerations, you can pull a neat little trick with Kettle to localize the data and format of the date attributes. I wouldn't go as far as to say that this feature is Kettle specific: rather, it relies on the localization support built into the java platform and the way you can put that to use in Kettle transformations.

Prerequisites


In this tip, the steps to create a date dimension are described using Kettle 2.5.1 (Generally available Release) and MySQL 5.1.20 (Beta). You will be able to follow through the example using earlier (and later) versions of both products though - I am not using any functionality that is specific to these particular version of the products. The recipe does not really require that you understand anything about data warehouses or date dimensions, but you will probably appreciate it better if you do ;)

Overview


The transformation to generate the data for the date dimension follows a pretty straightforward design. The graphical representation of the transformation is shown below:

localized_date_dimension1

First, the dimension table is created (Prepare). After that, rows are generated to fill it (Input). However, the generated rows are almost empty and barren - we still need to derive and add data to fill the attributes of the date dimension (Transformation). Finally, the data is stored in the date dimension table (Output).

Step-by-Step


The remainder of this article describes in detail how to build this transformation. The majority of steps is probably not very interesting to moderately experienced Kettle users, but may be of use to beginning users.

Note for users that are completely new to Kettle - it is advisable to review the first few chapters of the Spoon user guide (Spoon is the name of Kettle tool you use to design the ETL process). It explains how to start up the tool, create a new transformation, add and connect steps etc. You can find it in the docs/English directory beneath the Kettle home directory.

MySQL JDBC driver: setting the characterEncoding property to UTF8


You need to create a (JDBC) connection to MySQL in the usual, straightforward way:

kettle-mysql-connection

In addition, you need to set the characterEncoding property of the JDBC driver:

kettle-mysql-connection-utf8

This ensures MySQL will be able to understand the utf8 encoded data that we may produce to generate a date dimension in the, say, Chinese language. Note that you cannot just use a statement like SET NAMES utf8 to do this. This is not specific to Kettle, but has to do with the way the MySQL JDBC driver (Connector/J) handles character sets. Please refer to the "Using character sets and unicode" section of the Connector/J documentation for more information on this topic.

Creating the date dimension table

In this particular case, it seemed convenient to create the dimension table as part of the transformation. This is done using the "Execute SQL Script" step shown below:

kettle-sql-script

The "Execute SQL Script" step executes the following script to create the date dimension table:
DROP TABLE IF EXISTS dim_date
;
CREATE TABLE IF NOT EXISTS dim_date (
date_key smallint unsigned NOT NULL,
date date NOT NULL,
date_short char(12) NOT NULL,
date_medium char(16) NOT NULL,
date_long char(24) NOT NULL,
date_full char(32) NOT NULL,
day_in_year smallint unsigned NOT NULL,
day_in_month tinyint unsigned NOT NULL,
is_first_day_in_month char(10) NOT NULL,
is_last_day_in_month char(10) NOT NULL,
day_abbreviation char(3) NOT NULL,
day_name char(12) NOT NULL,
week_in_year tinyint unsigned NOT NULL,
week_in_month tinyint unsigned NOT NULL,
is_first_day_in_week char(10) NOT NULL,
is_last_day_in_week char(10) NOT NULL,
month_number tinyint unsigned NOT NULL,
month_abbreviation char(3) NOT NULL,
month_name char(12) NOT NULL,
year2 char(2) NOT NULL,
year4 year NOT NULL,
quarter_name char(2) NOT NULL,
quarter_number tinyint NOT NULL,
year_quarter char(7) NOT NULL,
year_month_number char(7) NOT NULL,
year_month_abbreviation char(8) NOT NULL,
PRIMARY KEY(date_key),
UNIQUE(date)
)
ENGINE=MyISAM
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_unicode_ci
This is by no means a complete date dimension. The most important limitation is that it only contains attributes that are immediately derivable from the calendar. So, attributes to denote business specific periods like the fiscal year and holidays are not included.

Generating 10 years worth of days

The grain of the date dimension is days - a row in the date dimension represents a single day. In this case, the "Generate Rows" is configured to generate 3660 rows, which roughly corresponds with enough days to last 10 years:

kettle-generating-10-years

In the example, this step is also used to provide parameters to generate the date dimension data. As we'll see in a moment, the inital_date field effectively specifies the first date that goes into the date dimension. The language_code and country_code fields are used to localize the textual attributes of the date dimension, and the local_yes and local_no fields are used for boolean attributes.

There are other ways to get these parameters into our transformation. For example, we could have used an "Add Constants" step with a similar result. Another possibility would be to get this data from the environment using a "Get Variables" step, and this would allow the parameters to be specified at transformation run-time.

Counting the days

Although we certainly generate enough rows, they are all identical. In order to have each row represent a single distinct day, we need a way to 'count' the generated rows. We do this by adding a "Sequence" step:

kettle-adding-sequence

In this case, we use the "Add sequence" step to generate an incrementing number within the scope of the transformation. As we'll see later on, we can add this to our initial date to get a series of consecutive dates.

Calculating date dimension Attributes

The previous steps form a basis from which we can derive all of the attributes that currently make up our date dimension. To actually calculate the date attributes, we use a "Modified Java Script Value" step:

kettle-javascript

Kettle comes with an embedded Rhino javascript engine. The "Modified Java Script Value" step lets you use it to run javascript code to as part of the transformation.

The javascript code is executed for each row that comes out of the previous steps. In the script code, one can reference the values from the input rows, perform some processing on them, and generate new output fields.

One of the fortunate characteristics of the Rhino engine is that it lets us use java classes inside the javascript code. Let's take a look at the script to see how we can use that to generate the localized data for our data dimension attributes.
Initialization
The first thing we do in the javascript code is to get data from the current input row. In the "Generate Rows" step, we added the language_code and country_code fields to specify a locale. Here, in the script, we use the following piece of code to turn that into a java.util.Locale object:
//Create a Locale according to the specified language code
var locale = new java.util.Locale(
language_code.getString() //get the ISO639 language_code from the input row
, country_code.getString() //get the ISO3166 country_code from the input row
);
The java.util.Locale class represents a particular cultural region. It forms a cornerstone of the internationalization support built into the java platform, and provides information to many other classes to generate appropriately localized output.

We will be using the locale on a number of occasions, but first, we use our it to initialize a java.util.Calendar object:

//Create a calendar, use the specified locale
var calendar = new java.util.GregorianCalendar(locale);
(Note that the java platform currently only provides one concrete Calender Class: the java.util.GregorianCalendar. Unfortunately, java does not seem to provide a built-in recipe for dealing with, for example, Islamic or Hebrew calendars).

We require the calendar object to obtain an instance of the java.util.Date Class that represents the date corresponding to the current row. To do that, we first set the calendar's current date using the initial_date field that was specified in the "Generate Rows" step:
//Set the initial date
calendar.setTime(initial_date.getDate());
We need this to add the number of days generated by our "Add Sequence" step:
//set the calendar to the current date by adding DaySequence days
calendar.add(calendar.DAY_OF_MONTH,DaySequence.getInteger() - 1);

(Note that we substract 1 from the DaySequence value. This is because our sequence starts at 1, and we want the specified initial date to be included in our date dimension).

We conclude the initialization of the script by retrieving a java.util.Date object that represents the date for the current row.
//get the calendar date
var date = new java.util.Date(calendar.getTimeInMillis());
This java.util.Date instance is assigned to the date variable in the script, allowing it to be used as an output field of the javascript step. We require this in order to fill the date column of the date dimension table. We will also be using the date variable later on in this script to derive the value of other date dimension attributes.
Getting Text representations of full dates
Our date dimension has a number of attributes to denote a complete date containing day, month and year parts, in various formats: date_short, date_medium, date_long and date_full. These are all generated using the java.text.DateFormat class.

To do that, we first need to create an appropriate DateFormat instance using the static getDateInstance() method, passing our locale object as well as a constant that specifies whether we want to short, medium, long or full format. Then, we can pass the java.util.Date object for which we want to obtain the textual representation to the format method of the newly created java.text.DateFormat instance:
//en-us example: 9/3/07
var date_short = java.text.DateFormat.getDateInstance(
java.text.DateFormat.SHORT
, locale
).format(date);
//en-us example: Sep 3, 2007
var date_medium = java.text.DateFormat.getDateInstance(
java.text.DateFormat.MEDIUM
, locale
).format(date);
//en-us example: September 3, 2007
var date_long = java.text.DateFormat.getDateInstance(
java.text.DateFormat.LONG
, locale
).format(date);
//en-us example: Monday, September 3, 2007
var date_full = java.text.DateFormat.getDateInstance(
java.text.DateFormat.FULL
, locale
).format(date);
Formatting date parts
Extracting and formatting different date parts is most easily done by applying the format function on a subclass of java.text.Dateformat, the java.text.SimpleDateFormat class. The java.text.SimpleDateFormat class allows formatting of dates based on date and time patterns:
//day in year: 1..366
var simpleDateFormat = java.text.SimpleDateFormat("D",locale);
var day_in_year = simpleDateFormat.format(date);
In this example, we pass both the locale and a date pattern to the constructor to create an instance of the java.text.SimpleDateFormat class. The pattern is passed as the string "D", specifying a day-in-year format.

Once we created the java.text.SimpleDateFormat instance, we can apply a new pattern to it using the applyPattern() method. Calling the format method again, we obtain the date in the desired format:
//day in month: 1..31
simpleDateFormat.applyPattern("d");
var day_in_month = simpleDateFormat.format(date);
//en-us example: "Monday"
simpleDateFormat.applyPattern("EEEE");
var day_name = simpleDateFormat.format(date);
//en-us example: "Mon"
simpleDateFormat.applyPattern("E");
var day_abbreviation = simpleDateFormat.format(date);
//week in year, 1..53
simpleDateFormat.applyPattern("ww");
var week_in_year = simpleDateFormat.format(date);
//week in month, 1..5
simpleDateFormat.applyPattern("W");
var week_in_month = simpleDateFormat.format(date);
//month number in year, 1..12
simpleDateFormat.applyPattern("MM");
var month_number = simpleDateFormat.format(date);
//en-us example: "September"
simpleDateFormat.applyPattern("MMMM");
var month_name = simpleDateFormat.format(date);
//en-us example: "Sep"
simpleDateFormat.applyPattern("MMM");
var month_abbreviation = simpleDateFormat.format(date);
//2 digit representation of the year, example: "07" for 2007
simpleDateFormat.applyPattern("y");
var year2 = simpleDateFormat.format(date);
//4 digit representation of the year, example: 2007
simpleDateFormat.applyPattern("yyyy");
var year4 = simpleDateFormat.format(date);
Dealing with Quarters
Although the java.text.SimpleDateFormat class is useful, it does not provide any functionality for working with quarters. We do want our date dimension to contain attributes to represent the quarter, so we have to reside to computing these manually:
//handling Quarters is a DIY
var quarter_name = "Q";
var quarter_number;
switch(parseInt(month_number)){
case 1: case 2: case 3: quarter_number = "1"; break;
case 4: case 5: case 6: quarter_number = "2"; break;
case 7: case 8: case 9: quarter_number = "3"; break;
case 10: case 11: case 12: quarter_number = "4"; break;
}
quarter_name += quarter_number;
Although this will do for now, this solution doesn't really cut it because it does not produce localized output. Anyway, it is better than nothing so we'll just have to make do with it.
Period demarcation flags
Our date dimension has a few attributes that are used to indicate the start and end of week and month periods. We use simple yes/no type flags, but we allow the actual "yes" and "no" values to be specified by the user in the "Generate Rows" step. We retrieve them with the following piece of code:
//get the local yes/no values
var yes = local_yes.getString();
var no = local_no.getString();
We can now use them these to flag the start and end of week and month periods.

The start (and of course, also the end) of the week are subject to the locale. In order to find out if we are dealing with the first day of a week, we use the getFirstDayOfWeek() method of the java.util.Calendar class. By comparing its return value with the day of week of the current row, we can see if we happen to be dealing with the first day of the week:
//initialize for week calculations
var first_day_of_week = calendar.getFirstDayOfWeek();
var day_of_week = java.util.Calendar.DAY_OF_WEEK;

//find out if this is the first day of the week
var is_first_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_first_day_in_week = yes;
} else {
is_first_day_in_week = no;
}
Note that we obtain the current day of the week by passing the value of the DAY_OF_WEEK constant to the get method of the java.util.Calendar object that we initialized at the start of the script.

In order to set the value for the is_last_day_in_week attribute of the date dimension, we simply find out if the next day happens to be the first day of the week. If it is, then by definition, the current row represents the last day of the week:
//calculate the next day
calendar.add(calendar.DAY_OF_MONTH,1);

//get the next calendar date
var next_day = new java.util.Date(calendar.getTimeInMillis());

//find out if this is the first day of the week
var is_last_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_last_day_in_week = yes;
} else {
is_last_day_in_week = no;
}
(Note that we have already used similar code to add a day to a date when we added the day sequence to the initial date.)

We can use similar logic to calculate the values for the is_first_day_in_month and is_last_day_in_month indicators. This is actually easier, because the first day in the month is not dependant upon the locale (at least - not within one calendar). So, we only need to find out if the day of month is equal to one:
//find out if this is the first day of the month
var is_first_day_of_month;
if(day_in_month == 1){
is_first_day_in_month = yes;
} else {
is_first_day_in_month = no;
}

//find out if this is the last day in the month
var is_last_day_of_month;
if(java.text.SimpleDateFormat("d",locale).format(next_day)==1){
is_last_day_in_month = yes;
} else {
is_last_day_in_month = no;
}
A few more date attributes
We conclude the computation of the date attributes by adding a few more useful labels:
//a few useful labels
var year_quarter = year4 + "-" + quarter_name;
var year_month_number = year4 + "-" + month_number;
var year_month_abbreviation = year4 + "-" + month_abbreviation;
Like when we calculated the quarters, this is actually not a very good method because the results will not be localized. That said, the result will make sense for many locales, and we don't really have a better way to deal with it right now.
Defining the step outputs
We just calculated all the required values to fill the attributes of our date dimension. We just need to get them out of the script and into the outputs of the step.

Every variable declared in the javascript (using the var keyword) can be used as an output field of the javascript step. The easiest way to generate the outputs is by hitting the "Get Variables" button at the bottom of the dialog. This simply adds an output field for each variable declared in the script:

kettle-javascript-outputs

By default, the data type for all the outputs added in this way is set to the String type. Although it is good practice to choose a more specific data type, it is almost always unncessary in this case, as all integer type values will be correctly converted implicitly when we insert them into the database. There is one exception in this case, and that is the date output. Inside the script, it is an instance of a java.util.Date class, and we must set the type to "Date" in the output too. Otherwise, the (java) string representation of the java.util.Date object will be sent as output, and this is not automatically recognized as a date by MySQL.

Discarding Fields

We are now almost ready to insert the rows into the date dimension table. We only need to discard all fields in the stream that do not correspond with any of the columns in our date dimension table. We use a "Select Values" step to do that:

kettle-select-values

We use the "Get Fields To Select" button to pull in all available fields, and after that, simply select and delete each field that we do not need. As a final step, we rename the DaySequence field to date_key to map it to the date_key column in our date dimension table.

Inserting data into the table

In the final step, we add the generated data to the dim_date table we created in the very first step of the transformation:

kettle-table-output

We only need to specify the connection and the table name here, and the step will then automatically attempt to map the fields of the incoming rows to table columns.

We could have used the "Insert / Update" step, or even the "Execute SQL Script" step too to write the data to the dimension table, but that would require a little bit of extra work.

Running the transformation


After building the transformation, you can run it by hitting the "running man" icon on the toolbar. This will open a dialog where you can set a number of properties for the transformation. Hit "Launch" button there and after that, the transformation will be executed:

kettle-run-transformation

Closing Notes


I hope you enjoyed this tip. If you want to, you can download the kettle transformation here, and use it as you see fit.

If you are interested in open source data warehousing, register for the MySQL Enterprise Data Warehousing Seminar, Thursday, September 06, 2007 and hear what Robin Schumacher has to say about that subject. (Note that this is a general MySQL data warehousing seminar - this post and the seminar are unrelated)