In my previous blog entry, I wrote about how I'm currently checking out the Pentaho open source Business Intelligence platform. Well, I've only done a little bit of all the checking out I planned to do, but here I'd like to present some of the things that I found out so far.
In this blog entry, I will focus on some features of the latest addition to the Pentaho platform: K.E.T.T.L.E, which forms its principal ETL component.
In case you're interested in Pentaho in general: I just heard that on June 21, MySQL will be conducting a web seminar on BI with Pentaho and MySQL, so you might want to register.
Huh? Kettle!? I thought you said Pentaho?!
I didn't mention this in my previous blog entry, but most of Pentaho is built on and from existing components rather than developed from scratch. For example, the reporting layer runs JasperReports, BIRT (discussed in one of my earlier blog entries) and JFreeReport, OLAP Services are provided by Mondrian, and one of the core components, the Workflow engine, is the Enhydra Shark Engine.
Pentaho actively contributes to and strives to improve some of these existing components, for example JFreeReport. The same goes for Kettle, which was actually acquired by Pentaho. Pentaho also bundles the components and provides the framework to let them work together.
Right now, the Pentaho developers are busy integrating Kettle with the rest of the platform, and probably sooner rather than later, the services that are now provided by Kettle will be known as 'Pentaho Data Integration'. (In fact, I just noticed that the welcome screen of the latest development release now announces itself as "Pentaho Data Integration - Previously Kettle").
So, what'd you say Kettle was...?
Anyway, back to Kettle. Kettle is a free, open source (LGPL) ETL (Extraction, Transformation and Loading) tool. The product name should actually be spelled as K.E.T.T.L.E, which is a recursive acronym for "Kettle Extraction, Transport, Transformation and Loading Environment".
Kettle was first conceived about four years ago by Matt Casters, who needed a platform-independent ETL tool for his work as a BI consultant. Matt's now working for Pentaho as Chief of Data Integration. Although there's a growing number of contributors, a lot of the work on Kettle is still being done by Matt himself.
E.T.L. and Datawarehousing
Being an ETL tool, Kettle is an environment that's designed to:
- collect data from a variety of sources (extraction)
- move and modify data (transport and transform) while cleansing, denormalizing, aggregating and enriching it in the process
- frequently (typically on a daily basis) store data (loading) in the final target destination, which is usually a large, dimensionally modelled database called a data warehouse
Although most of these concepts are equally applicable to almost any data importing or exporting processes, ETL is most frequently encountered in datawarehousing environments.
Kettle Architecture
Kettle is built with the Java programming language. It consists of four distinct applications (tip: follow the links for relevant .pdf documentation):
- Spoon
- is a graphically oriented end-user tool to model the flow of data from input through transformation to output. One such model is also called a transformation
- Pan
- is a commandline tool that executes transformations modelled with Spoon
- Chef
- is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads etc. that are placed in a flow of control
- Kitchen
- is a commandline tool used to execute jobs created with Chef
Model-driven
An interesting feature of Kettle is that it is model-driven. Both Spoon and Chef offer a graphical user interface to define the ETL processes on a high level. Typically, this involves no actual programming at all - rather, it's a purely declarative task which results in a model.
The command line tools Pan and Kitchen (or rather the underlying API) know how to read and interpret the models created by Spoon and Chef respectively. These tools actually execute the implied ETL processes. This is done all in one go: there is no intermediate code generation or compilation involved.
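For instance, a transformation saved by Spoon as an XML file can be handed straight to Pan, and a job saved by Chef straight to Kitchen; there is nothing to generate or compile in between. The file names below are made up for illustration, and the exact option syntax may vary a little between Kettle versions, so treat this as a sketch rather than a reference:

./pan.sh -file=/path/to/some_transformation.ktr -level=Basic
./kitchen.sh -file=/path/to/some_job.kjb -level=Basic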
Repository-Based
Models can be saved to file in a particular XML format, or they can be stored into a relational database: a repository. Using a repository can be a major advantage, especially when handling many models. Because the models are stored in a structured manner, arbitrary queries can be written against the repository.
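For example, if the repository lives in a MySQL database, you could list the stored transformations straight from the command line. Mind you, the database name here is made up, and the repository table name is from memory and may well differ between Kettle versions, so treat this purely as an assumption:

mysql -u kettle -p kettle_repository -e "SELECT NAME FROM R_TRANSFORMATION"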
The repository may also be used to store the logs that are generated when executing transformations and jobs. Certain environments, such as banks, require that every manipulation that is performed with financial data be stored for longer periods of time for auditing purposes. The repository sure seems to be the place to do that, at least, as far as the ETL process is concerned.
Installing and Running Kettle
In case you're worried: no, we do not need to build a datawarehouse to show how Kettle can help you schedule the extraction, transformation and loading of 'hello world' into it. All we need is a Java 1.4 runtime environment (which is probably sitting on your system anyway) and Kettle itself.
If you don't have a suitable Java version, find one on the Sun Microsystems download page. You should pick the distribution appropriate for your operating system, and install it according to the accompanying instructions.
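Not sure whether a suitable runtime is already present? Just ask it; it should report version 1.4 or later:

java -version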
Go to the Kettle downloads page to pick up Kettle. From here, you can download the stable 2.2.2 release (as well as a lot of other stuff, such as documentation, the latest development release, and the source downloads).
Installation is a breeze: just unzip the distribution in a location of your choice. Executables for the Kettle tools are located in the topmost Kettle directory. For the Windows operating system, these are .bat (batch) files: spoon.bat, pan.bat, chef.bat and kitchen.bat.
For *nix operating systems, .sh files (shell scripts) are provided: spoon.sh, pan.sh, chef.sh and kitchen.sh. These need to be made executable before they can be used. This can be done by executing this command from inside the Kettle directory:
chmod +x *.sh
Kettle Documentation is provided in the docs directory right below the main Kettle directory.
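So, on a *nix system, getting from download to a running Spoon comes down to something like this (the archive name is an assumption on my part; use whatever file you actually downloaded, and cd into whatever topmost Kettle directory it creates):

unzip Kettle-2.2.2.zip -d ~/kettle
cd ~/kettle
chmod +x *.sh
./spoon.sh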
Putting the Kettle on
Ok, let's give this tool a spin, and see what it looks like. In the remainder of this article, I will be using a transformation and some datafiles I prepared. You can download these here (2k .zip download). Just unzip the archive into a directory of your choice. Be sure to use this directory wherever the remainder of this article refers to a particular filename.
Launching Spoon
Launch Spoon by executing the .bat (Windows) or .sh (*nix) file. The Kettle splash screen pops up, and right after that, you're invited to connect to a repository. We'll be skipping that for now, so hit the 'No Repository' button at the bottom of the dialog.
I think the repository is a great feature, and it's even better that Kettle lets you decide whether you want to use it. Although we won't be using the repository right away, we can still connect to one at any time once Spoon is fully up and running.
Here's a picture of Spoon:
On the upper left side, a treeview is visible. The treeview acts as an explorer of the currently loaded model as well as a menu from which we can drag items into the graphical view.
On the lower left side, right below the treeview, there's a little window that maintains a history of the different types of steps that were added to the model.
The right side shows a number of tabbed views. In the picture, the graphical view of the transformation model is selected. The model shown here is the 'hello, world!' transformation I just referred to. It's included in the zip, but you can also pick it up right here.
Spoon concepts
Although the transformation shown here is very simple, it does illustrate some of the most important concepts of Spoon. So, let's take a closer look at its ingredients: the steps and the hops.
Steps
First of all, the graphical view shows a number of square icons. These are called transformation Steps. One such step denotes a particular kind of action that is performed on data. Steps are easily created by dragging them from the treeview and dropping them on the graphical model view.
Once a step is created, it can be opened by double-clicking it. A dialog window appears that can be used to parameterize the step to specify its exact behaviour. Most steps have multiple property pages according to the different categories of properties applicable to steps of that type.
In all cases, the name of the step may (and should) be modified to clarify the function of the step in the light of the entire ETL process. Most step types also have a separate property sheet to define the fields flowing in or out of the step.
Kettle provides a lot of different step types, and you can extend Kettle by plugging in your own. However, fundamentally, there are only three different kinds of steps: Inputs, Transformations, and Outputs.
- Input
- Input steps process some kind of 'raw' resource, such as a file, a database query or system variables, and create an output stream of records from it.
- Transformation
- Transforming steps process input streams and perform a particular action on them, often adding new fields or even new records. The result is then fed to one or more output streams. Kettle offers many transformation steps out of the box. Some steps in this category perform very simple tasks, such as renaming fields; some of them perform complex tasks, such as normalizing data or maintaining a slowly changing dimension in a datawarehouse.
- Output
- Output steps are like the reverse of input steps: they accept records, and store them in some external resource, such as a file or a database table.
Hops
In the graphical representation of the model, lines are visible that form connections between the steps. In Kettle, these connections are called hops. Hops between steps behave like pipelines: records may flow through them from one step to the other. The records indeed travel in a stream-like manner, and steps may buffer records until the next step in line is ready to accept them. This is actually implemented by creating a separate thread of execution for each hop.
Hops can be created by placing the mouse pointer above the source step, holding the shift button and then dragging (holding the left mouse button and the shift button) to the destination step. Hops may also be created by dragging the 'hops' node in the treeview onto the canvas. A dialog then appears that lets you select the source and destination steps from dropdown listboxes.
A Simple Recipe
Now that these Spoon concepts have been introduced, let's see how they are used to actually create an ETL process. Tip: click the picture to download the zipped data files and Kettle transformation if you haven't done so already.
By just looking at the graphical representation of the model, most people will grasp immediately what the implied ETL process looks like. First, in the step named Text file input, data is presumably read from a text file. From there, the data then flows into a step named Filter rows. There, a condition is checked for each record that flows through the input stream. The records are then split into two separate streams: data that passes the filter, and data that does not. Finally, each stream of filtered data is written to its own text file.
Text file input
Double-clicking the text file input step reveals a number of Property pages: File, Content, Error handling and Fields. The text file that is read for input is specified in the File tab. Use the Browse button to locate files. After that, hit the Add button to select the file for input.
A filename (or pattern) may be specified that defines the files that are to be read. Although you can add as many files as you like, only one file is specified in this case, spoon_hello_world_input.txt. Here are the contents:
Hello, World...
Welcome to Spoon!
(Actually, this way of specifying the file is not really recommended for production purposes. The filename is 'hardcoded' into the transformation, which is not convenient to deploy the transformation. Check out this technical tip for a more robust method of working with input files.)
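By the way, if you'd rather not download the zip, you can simply recreate the input file yourself from the two lines shown above; on a *nix system, for instance:

cat > spoon_hello_world_input.txt << 'EOF'
Hello, World...
Welcome to Spoon!
EOF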
Use the Content tab to specify the format of the input file. For this example, I modified the field separator to a comma, and I unchecked the header row option.
Use the Fields tab to define the record layout. You should be able to use the Get Fields Button, which tries to discover the record layout using data from the specified files. (Although I've successfully used this function before, this particular setup seems to provoke a bug). After that, you can rename the fields, and adorn them with more specific properties, such as type, length and format.
Use the Preview rows button to check if the data is extracted in the desired format:
Well, this definitely seems correct: the lines are interpreted as records. The first line is split at the comma. Because the second line did not contain a comma, only the first field got a value.
A lot of the things we've seen so far, such as the Fields tab and the Preview rows button, are common to many step types.
Filter rows
There's just one hop that leaves the Text file input step, and it arrives at the Filter rows step. The Filter rows step is a transforming step: it changes the contents of the input stream by filtering out the records that do not match the criterion specified in the step properties. Double-clicking the step pops up the following dialog:
In the lower part of the dialog, a condition has been defined:
Field2 IS NOT NULL
The two dropdown listboxes are used to link to subsequent steps: the first one specifies the step that will receive the data that passed the criterion, and the second one specifies the step that will receive all records that did not match the criterion.
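Just to make the semantics of the condition concrete, here is roughly the same split expressed as a throwaway shell one-liner. This is purely an illustration of the logic, not how Kettle executes it, and pass.txt and fail.txt are just placeholder names (the real output formatting will differ):

awk -F',' '{ if ($2 != "") print > "pass.txt"; else print > "fail.txt" }' spoon_hello_world_input.txt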
Output steps
There are two output steps in the model: one to catch the records that passed the filter, and one to catch those that were rejected. Both are simple Text file output steps. This type of step is quite like the reverse of the Text file input step, and the methods to specify the filename and fields are virtually the same. So, here too, a file name and location must be specified, and fields need to be defined so the step knows what fields it should write to file.
In the model, the filenames that are used in the Text file Output steps are spoon_hello_world_pass.txt and spoon_hello_world_fail.txt.
Verify, Preview and Run
Now that we have a transformation model, we can actually do things with it. Spoon's got a special transformation menu with items to work with the transformation model.
First of all, we should verify the model. The verify action can be found in the Spoon transformation menu, and also on the toolbar. Verification logically checks the model: it will report unreachable steps, field name mismatches and other logical errors.
We can then preview the transformation. Preview is just like a normal execution of a transformation, except that it is run for a limited number of rows. The Preview action can be found in the menu and on the toolbar. The action is also accessible from the bottom of the Log View. When you choose to do a Preview, Kettle switches automatically to the Log View:
The log contains all kinds of information on the process execution: number of rows that entered a step, the errors that occurred, the time it took to run the task and much, much more. The log view is a very valuable source for troubleshooting and debugging the transformation.
If you're happy with the preview, you can then run the transformation for real. This action too is available from the menu, the toolbar and the log view.
Results
Now, let's take a look at the results. Here's the content of the spoon_hello_world_pass.txt file:
Field1;Field2
Hello; World...
This is the content of the spoon_hello_world_fail.txt file:
Field1;Field2
Welcome to Spoon!;
This suggests the transformation works fine. The Filter rows step nicely separated the records that had comma-separated values from the records that didn't.
Final thoughts
There are still so many things I would like to write about Pentaho and Kettle, and I will most certainly do that in forthcoming blog entries. Just read on and see what you can expect in the remainder of this month and next month.
Realistic use case
The example shown here is of course very simple and does not approach the complexity of a real-world application. However, it did touch on a few of the fundamental concepts behind Spoon, and it served that purpose well. I will be providing and explaining a more detailed transformation shortly.
What about Chef?
Although Spoon transformations are an important part, Kettle entails more than just that. We haven't discussed Chef and jobs at all, and I'll write something about that soon.
Using the repository
Although I mentioned the repository, we didn't actually use one. I will most certainly explain this feature in greater detail soon.
JDBC configuration
Kettle comes with database support in the form of a lot of bundled JDBC drivers. Sometimes you'll want to upgrade a particular driver. Easy, if you know how...
(Pentaho) deployment
Here, we used Kettle as a standalone tool: we never deployed the transformation. It would be nice to see how ETL hooks in with the workflow engine.