In my previous blog entry, I wrote about how I'm currently checking out the Pentaho open source Business Intelligence platform. Well, I've only done a little bit of all the checking out I planned to do, but here I'd like to present some of the things that I found out so far.
In this blog entry, I will focus on some features of the latest addition to the Pentaho platform: K.E.T.T.L.E, which forms its principal ETL component.
In case you're interested in Pentaho in general: I just heard that on June 21, MySQL will be conducting a web seminar on BI with Pentaho and MySQL, so you might want to register.
Huh? Kettle!? I thought you said Pentaho?!
I didn't mention this in my previous blog entry, but most of Pentaho is built on and from existing components rather than developed from scratch. For example, the reporting layer runs JasperReports, BIRT (discussed in one of my earlier blog entries) and JFreeReport, OLAP Services are provided by Mondrian, and one of the core components, the Workflow engine, is the Enhydra Shark Engine.
Pentaho actively contributes to and strives to improve some of these existing components, for example JFreeReport. The same goes for Kettle, which was actually acquired by Pentaho. Pentaho also bundles the components and provides the framework to let them work together.
Right now, the Pentaho developers are busy integrating Kettle with the rest of the platform, and probably sooner rather than later, the services that are now provided by Kettle will be known as 'Pentaho Data Integration'. (In fact, I just noticed that the welcome screen of the latest development release now announces itself as "Pentaho Data Integration - Previously Kettle").
So, what'd you say Kettle was...?
Anyway, back to Kettle. Kettle is a free, open source (LGPL) ETL (Extraction, Transformation and Loading) tool. The product name should actually be spelled as K.E.T.T.L.E, which is a recursive acronym for "Kettle Extraction, Transport, Transformation and Loading Environment".
Kettle was first conceived about four years ago by Matt Casters, who needed a platform-independent ETL tool for his work as a BI consultant. Matt's now working for Pentaho as Chief of Data Integration. Although there's a growing number of contributors, a lot of the work on Kettle is still being done by Matt himself.
E.T.L. and Datawarehousing
Being an ETL tool, Kettle is an environment that's designed to:
- collect data from a variety of sources (extraction)
- move and modify data (transport and transform) while cleansing, denormalizing, aggregating and enriching it in the process
- frequently (typically on a daily basis) store data (loading) in the final target destination, which is usually a large, dimensionally modelled database called a data warehouse
Although most of these concepts are equally applicable to almost any data importing or exporting processes, ETL is most frequently encountered in datawarehousing environments.
Kettle Architecture
Kettle is built with the Java programming language. It consists of four distinct applications (tip: follow the links for relevant .pdf documentation):
- Spoon
- is a graphically oriented end-user tool to model the flow of data from input through transformation to output. One such model is also called a transformation
- Pan
- is a commandline tool that executes transformations modelled with Spoon
- Chef
- is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads etc. that are placed in a flow of control
- Kitchen
- is a commandline tool used to execute jobs created with Chef
Model-driven
An interesting feature of Kettle is that it is model-driven. Both Spoon and Chef offer a graphical user interface to define the ETL processes on a high level. Typically, this involves no actual programming at all - rather, it's a purely declarative task which results in a model.
The command line tools Pan and Kitchen (or rather the underlying API) know how to read and interpret the models created by Spoon and Chef respectively. These tools actually execute the implied ETL processes. This is done all in one go: there is no intermediate code generation or compilation involved.
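For instance, a transformation saved by Spoon as an XML file can be handed straight to Pan, and a job saved by Chef straight to Kitchen; there is nothing to generate or compile in between. The file names below are made up for illustration, and the exact option syntax may vary a little between Kettle versions, so treat this as a sketch rather than a reference:

./pan.sh -file=/path/to/some_transformation.ktr -level=Basic
./kitchen.sh -file=/path/to/some_job.kjb -level=Basic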
Repository-Based
Models can be saved to file in a particular XML format, or they can be stored into a relational database: a repository. Using a repository can be a major advantage, especially when handling many models. Because the models are stored in a structured manner, arbitrary queries can be written against the repository.
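For example, if the repository lives in a MySQL database, you could list the stored transformations straight from the command line. Mind you, the database name here is made up, and the repository table name is from memory and may well differ between Kettle versions, so treat this purely as an assumption:

mysql -u kettle -p kettle_repository -e "SELECT NAME FROM R_TRANSFORMATION"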
The repository may also be used to store the logs that are generated when executing transformations and jobs. Certain environments, such as banks, require that every manipulation that is performed with financial data be stored for longer periods of time for auditing purposes. The repository sure seems to be the place to do that, at least, as far as the ETL process is concerned.
Installing and Running Kettle
In case you're worried: no, we do not need to build a datawarehouse to show how Kettle can help you schedule the extraction, transformation and loading of 'hello world' into it. All we need is a Java 1.4 runtime environment (which is probably sitting on your system anyway) and Kettle itself.
If you don't have a suitable Java version, find one on the Sun Microsystems download page. You should pick the distribution appropriate for your operating system, and install it according to the accompanying instructions.
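Not sure whether a suitable runtime is already present? Just ask it; it should report version 1.4 or later:

java -version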
Go to the Kettle downloads page to pick up Kettle. From here, you can download the stable 2.2.2 release (as well as a lot of other stuff, such as documentation, the latest development release, and the source downloads).
Installation is a breeze: just unzip the distribution in a location of your choice. Executables for the Kettle tools are located in the topmost Kettle directory. For the Windows operating system, these are .bat (batch) files: spoon.bat, pan.bat, chef.bat and kitchen.bat.
For *nix operating systems, .sh files (shell scripts) are provided: spoon.sh, pan.sh, chef.sh and kitchen.sh. These need to be made executable before they can be used. This can be done by executing this command from inside the Kettle directory:
chmod +x *.sh
Kettle Documentation is provided in the docs directory right below the main Kettle directory.
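So, on a *nix system, getting from download to a running Spoon comes down to something like this (the archive name is an assumption on my part; use whatever file you actually downloaded, and cd into whatever topmost Kettle directory it creates):

unzip Kettle-2.2.2.zip -d ~/kettle
cd ~/kettle
chmod +x *.sh
./spoon.sh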
Putting the Kettle on
Ok, let's give this tool a spin, and see what it looks like. In the remainder of this article, I will be using a transformation and some datafiles I prepared. You can download these here (2k .zip download). Just unzip the archive into a directory of your choice. Be sure to use this directory wherever the remainder of this article refers to a particular filename.
Launching Spoon
Launch Spoon by executing the .bat (Windows) or .sh (*nix) file. The Kettle splash screen pops up, and right after that, you're invited to connect to a repository. We'll be skipping that for now, so hit the 'No Repository' button at the bottom of the dialog.
I think the repository is a great feature, and it's even better that Kettle lets you decide whether you want to use it. Although we won't be using the repository right away, we can still connect to one at any time once Spoon is fully up and running.
Here's a picture of Spoon:
On the upper left side, a treeview is visible. The treeview acts as an explorer of the currently loaded model as well as a menu from which we can drag items into the graphical view.
On the lower left side, right below the treeview, there's a little window that maintains a history of the different types of steps that were added to the model.
The right side shows a number of tabbed views. In the picture, the graphical view of the transformation model is selected. The model shown here is the 'hello, world!' transformation I just referred to. It's included in the zip, but you can also pick it up right here.
Spoon concepts
Although the transformation shown here is very simple, it does illustrate some of the most important concepts of Spoon. So, let's take a closer look at its ingredients: the steps and the hops.
Steps
First of all, the graphical view shows a number of square icons. These are called transformation Steps. One such step denotes a particular kind of action that is performed on data. Steps are easily created by dragging them from the treeview and dropping them on the graphical model view.
Once a step is created, it can be opened by double-clicking it. A dialog window appears that can be used to parameterize the step to specify its exact behaviour. Most steps have multiple property pages according to the different categories of properties applicable to steps of that type.
In all cases, the name of the step may (and should) be modified to clarify the function of the step in the light of the entire ETL process. Most step types also have a separate property sheet to define the fields flowing in or out of the step.
Kettle provides a lot of different step types, and you can extend Kettle by plugging in your own. However, fundamentally, there are only three different kinds of steps: Inputs, Transformations, and Outputs.
- Input
- Input steps process some kind of 'raw' resource, such as a file, a database query or system variables, and create an output stream of records from it.
- Transformation
- Transforming steps process input streams and perform a particular action on them, often adding new fields or even new records. The result is then fed to one or more output streams. Kettle offers many transformation steps out of the box. Some steps in this category perform very simple tasks, such as renaming fields; some of them perform complex tasks, such as normalizing data or maintaining a slowly changing dimension in a datawarehouse.
- Output
- Output steps are like the reverse of input steps: they accept records, and store them in some external resource, such as a file or a database table.
Hops
In the graphical representation of the model, lines are visible that form connections between the steps. In Kettle, these connections are called hops. Hops between steps behave like pipelines: records may flow through them from one step to the other. The records indeed travel in a stream-like manner, and steps may buffer records until the next step in line is ready to accept them. This is actually implemented by creating a separate thread of execution for each hop.
Hops can be created by placing the mouse pointer above the source step, holding the shift button and then dragging (holding the left mouse button and the shift button) to the destination step. Hops may also be created by dragging the 'hops' node in the treeview onto the canvas. A dialog then appears that lets you select the source and destination steps from dropdown listboxes.
A Simple Recipe
Now that these Spoon concepts have been introduced, let's see how they are used to actually create an ETL process. Tip: click the picture to download the zipped data files and Kettle transformation if you haven't done so already.
By just looking at the graphical representation of the model, most people will grasp immediately what the implied ETL process looks like. First, in the step named Text file input, data is presumably read from a text file. From there, the data then flows into a step named Filter rows. There, a condition is checked for each record that flows through the input stream. The records are then split into two separate streams: data that passes the filter, and data that does not. Finally, each stream of filtered data is written to its own text file.
Text file input
Double-clicking the text file input step reveals a number of Property pages: File, Content, Error handling and Fields. The text file that is read for input is specified in the File tab. Use the Browse button to locate files. After that, hit the Add button to select the file for input.
A filename (or pattern) may be specified that defines the files that are to be read. Although you can add as many files as you like, only one file is specified in this case, spoon_hello_world_input.txt. Here are the contents:
Hello, World...
Welcome to Spoon!
(Actually, this way of specifying the file is not really recommended for production purposes. The filename is 'hardcoded' into the transformation, which is not convenient to deploy the transformation. Check out this technical tip for a more robust method of working with input files.)
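By the way, if you'd rather not download the zip, you can simply recreate the input file yourself from the two lines shown above; on a *nix system, for instance:

cat > spoon_hello_world_input.txt << 'EOF'
Hello, World...
Welcome to Spoon!
EOF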
Use the Content tab to specify the format of the input file. For this example, I modified the field separator to a comma, and I unchecked the header row option.
Use the Fields tab to define the record layout. You should be able to use the Get Fields Button, which tries to discover the record layout using data from the specified files. (Although I've successfully used this function before, this particular setup seems to provoke a bug). After that, you can rename the fields, and adorn them with more specific properties, such as type, length and format.
Use the Preview rows button to check if the data is extracted in the desired format:
Well, this definitely seems correct: the lines are interpreted as records. The first line is split at the comma. Because the second line did not contain a comma, only the first field got a value.
A lot of the things we've seen so far, such as the Fields tab and the Preview rows button, are common to many step types.
Filter rows
There's just one hop that leaves the Text file input step, and it arrives at the Filter rows step. The Filter rows step is a transforming step: it changes the contents of the input stream by filtering out the records that do not match the criterion specified in the step properties. Double-clicking the step pops up the following dialog:
In the lower part of the dialog, a condition has been defined:
Field2 IS NOT NULL
The two dropdown listboxes are used to link to subsequent steps: the first one specifies the step that will receive the data that passed the criterion, and the second one specifies the step that will receive all records that did not match the criterion.
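Just to make the semantics of the condition concrete, here is roughly the same split expressed as a throwaway shell one-liner. This is purely an illustration of the logic, not how Kettle executes it, and pass.txt and fail.txt are just placeholder names (the real output formatting will differ):

awk -F',' '{ if ($2 != "") print > "pass.txt"; else print > "fail.txt" }' spoon_hello_world_input.txt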
Output steps
There are two output steps in the model: one to catch the records that passed the filter, and one to catch those that were rejected. Both are simple Text file output steps. This type of step is quite like the reverse of the Text file input step, and the methods to specify the filename and fields are virtually the same. So, here too, a file name and location must be specified, and fields need to be defined so the step knows what fields it should write to file.
In the model, the filenames that are used in the Text file Output steps are spoon_hello_world_pass.txt and spoon_hello_world_fail.txt.
Verify, Preview and Run
Now that we have a transformation model, we can actually do things with it. Spoon's got a special transformation menu with items to work with the transformation model.
First of all, we should verify the model. The verify action can be found in the Spoon transformation menu, and also on the toolbar. Verification logically checks the model: it will report unreachable steps, field name mismatches and other logical errors.
We can then preview the transformation. Preview is just like a normal execution of a transformation, except that it is run for a limited number of rows. The Preview action can be found in the menu and on the toolbar. The action is also accessible from the bottom of the Log View. When you choose to do a Preview, Kettle switches automatically to the Log View:
The log contains all kinds of information on the process execution: number of rows that entered a step, the errors that occurred, the time it took to run the task and much, much more. The log view is a very valuable source for troubleshooting and debugging the transformation.
If you're happy with the preview, you can then run the transformation for real. This action too is available from the menu, the toolbar and the log view.
Results
Now, let's take a look at the results. Here's the content of the spoon_hello_world_pass.txt file:
Field1;Field2
Hello; World...
This is the content of the spoon_hello_world_fail.txt file:
Field1;Field2
Welcome to Spoon!;
This suggests the transformation works fine. The Filter rows step nicely separated the records that had comma-separated values from the records that didn't.
Final thoughts
There are still so many things I would like to write about Pentaho and Kettle, and I will most certainly do that in forthcoming blog entries. Just read on and see what you can expect in the remainder of this month and next month.
Realistic use case
The example shown here is of course very simple and does not approach the complexity of a real-world application. However, it did touch on a few of the fundamental concepts behind Spoon, and it served that purpose well. I will be providing and explaining a more detailed transformation shortly.
What about Chef?
Although Spoon transformations are an important part, Kettle entails more than just that. We haven't discussed Chef and jobs at all, and I'll write something about that soon.
Using the repository
Although I mentioned the repository, we didn't actually use one. I will most certainly explain this feature in greater detail soon.
JDBC configuration
Kettle comes with database support in the form of a lot of bundled JDBC drivers. Sometimes you'll want to upgrade a particular driver. Easy, if you know how...
(Pentaho) deployment
Here, we used Kettle as a standalone tool: we never deployed the transformation. It would be nice to see how ETL hooks in with the workflow engine.