
Wednesday, March 25, 2015

A Generic Normalizer for Pentaho Data Integration - Revisited

A while ago, I wrote about how to create a generic normalizer for Pentaho Data Integration.

To freshen up your memory, the generic normalizer takes any input stream, and for each input row, it outputs one row for each field in the input stream. The output rows contain fields for input row number, input field number and input field value. As such it provides the same functionality as the built-in Row Normaliser step without requiring any configuration, thus allowing it to process arbitrary streams of data.

A reusable normalizer

Recently I received a comment asking for more information on how to make this normalizer more reusable:
I want to use this method to do field level auditing but I want to encapsulate it in a sub transformation to which I can pass the result rows from any step. In your image of "how all these steps work together", instead of a data grid, the input would need to be dynamic in terms of the number of fields/columns and the data types. Could you possibly provide a hint how to make the input to these steps (in your example, the datagrid) dynamic?
In the meantime, I learned a thing or two about Kettle's internals, and it seemed like a good idea to describe how to improve on the original example and make it suitable for use in a so-called Mapping, a.k.a. a subtransformation.

Design for use as Subtransformation

The design for the re-usable generic normalizer is shown below:
The User-defined Java class step in the middle actually implements the normalizer. The Mapping input and Mapping output specification steps allow the normalizer to be called from another transformation. They enable it to respectively receive input data from, and return output data to the calling transformation.

In the screenshot above, the configuration dialogs for both the Mapping input and output specification steps are shown. This is mainly to show that there is no configuration involved: the Mapping input specification step will faithfully pass all fields received from the incoming stream on to the normalizer, and the Mapping output specification will output all fields coming out of the normalizer to the outgoing stream.

Normalizer Improvements

The configuration of the user-defined Java class step differs in a number of aspects from what I used in the original normalizer example. In the original example the normalizer output consisted of three fields:
rownum
A sequential integer number identifying the position of the row to which the current output row applies.
fieldnum
A sequential integer number identifying the position of the field to which the current output row applies.
value
A string representation of the value to which the current output row applies.
The original example used a Metadata Structure of Stream step to obtain metadata of the input stream, and this metadata was then tied to the output of the normalizer using a Stream Lookup step, "joining" the output of the Metadata Structure step with the output of the normalizer using the field number.
The improved generic normalizer adds two more output fields:
fieldname
The name of the field as it appears in the input stream
fieldtype
The name of the data type of the field as it appears in the input stream
Arguably, these two items are the most important pieces of metadata that were previously provided by the Metadata Structure of Stream step in the original example. I felt many people would probably prefer to have all of that work done by the normalizer itself, rather than having to tie the pieces together with additional steps in the transformation.
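To illustrate, take a hypothetical input row with the fields id=1, first_name=John and last_name=Doe, and assume the output fields are laid out in the same order in which they are assigned in the code below. The improved normalizer would then emit one output row per input field, roughly like this:

rownum  fieldnum  fieldtype  fieldname    value
1       1         Integer    id           1
1       2         String     first_name   John
1       3         String     last_name    Doe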

Code

The code for the user-defined Java class is shown below:
static long rownum = 0;
static RowMetaInterface inputRowMeta;
static long numFields;
static String[] fieldNames;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  // get the current row
  Object[] r = getRow();

  // If the row object is null, we are done processing.
  if (r == null) {
    setOutputDone();
    return false;
  }

  // If this is the first row, cache some metadata.
  // We will reuse this metadata in processing the rest of the rows.
  if (first) {
    first = false;
    inputRowMeta = getInputRowMeta();
    numFields = inputRowMeta.size();
  }
    
  // Generate a new id number for the current row.
  rownum += 1;

  // Generate one output row for each field in the input stream.
  int fieldnum;
  ValueMetaInterface valueMetaInterface;
  for (fieldnum = 0; fieldnum < numFields; fieldnum++) {
    //get metadata for the current field
    valueMetaInterface = inputRowMeta.getValueMeta(fieldnum);
    Object[] outputRow = new Object[5];
    outputRow[0] = rownum;
    // Assign the field id. Note that we need to cast to long to match Kettle's type system.
    outputRow[1] = (long)fieldnum+1;
    //assign the data type name
    outputRow[2] = valueMetaInterface.getTypeDesc();
    //assign the field name
    outputRow[3] = valueMetaInterface.getName();
    //assign a string representation of the field value
    outputRow[4] = inputRowMeta.getString(r, fieldnum);
    //emit a row.
    putRow(data.outputRowMeta, outputRow);
  }
  
  return true;
}
The main difference with the original code is the addition of the two new output fields, fieldname and fieldtype. In order to obtain the values for these fields, the loop over the fields first obtains the ValueMetaInterface object for the current field. This is done by calling the getValueMeta() method of the RowMetaInterface object and passing the index of the desired field.
Using the ValueMetaInterface object, the field name is obtained using its getName() method. The data type name is obtained by calling its getTypeDesc() method.

Calling the normalizer as subtransformation

Using the improved normalizer is as simple as adding a Mapping-step to your transformation and pointing it to the transformation that contains the normalizer and Mapping input and output specifications:

Download samples

The transformations discussed in this post are available here. These transformations are in the public domain: you can use, copy, redistribute and modify them as you see fit. You are encouraged, but not obliged, to share any modifications that you make to these examples.

Tuesday, March 20, 2012

A Generic Normalizer step for Kettle

UPDATE: if you're interested in this article you might be interested in the follow-up article as well: A Generic Normalizer for Pentaho Data Integration - Revisited.

Abstract

Kettle (a.k.a. Pentaho Data Integration) offers the standard Row Normalizer step to "unpivot" columns to rows. However, this step requires some configuration and presumes that the structure of its input stream is static and known in advance. In this post, I explain how to construct a simple User-defined Java class step that implements a generic Row Normalizer: one that can unpivot an arbitrary input stream without manual configuration.

The Row Normalizer step


Kettle (a.k.a. Pentaho Data Integration) offers a standard step to "unpivot" columns to rows. This step is called the Row Normalizer. You can see it in action in the screenshot below:

In the screenshot, the input is a table of columns id, first name, and last name. The output is a table of columns id, fieldname, and value. The id column is preserved, but for each row coming from the input stream, two rows are created in the output stream: 1 for the first name field, and 1 for the last name field.

Essentially, the Row Normalizer step in this example is configured to treat the first name and last name fields as a repeating group. The repeating group is untangled by dumping all values for either field in the value column. The fieldname column is used to mark the kind of value: some values are of the "first name field" kind (in case they came from the original first name input field), others are of the "last name field" kind (when they derive from the last name input field).

There are several use cases for the operation performed by the Row normalizer step. It could be used to break down a genuine repeating group in order to create a more normalized dataset. Or you might need to convert a relational dataset into a graph consisting of subject-predicate-object tuples for loading a triple store. Or maybe you want to turn a table into a fine-grained stream of changes for auditing.

The problem

The Row Normalizer step works great for streams whose structure is known in advance. The structure needs to be known in advance in order to specify, in the step configuration, which fields are to be treated as the repeating group, so they can be broken out into separate kinds.

Sometimes, you don't know the structure of the input stream in advance, or it is just too inconvenient to specify it manually. In these cases, you'd really wish you could somehow unpivot any field that happens to be part of the input stream. In other words, you'd need a generic Row Normalizer step.

The Solution

In Kettle, there's always a solution, and often more than one. Here, I'd like to present a solution that dynamically unpivots an arbitrary input stream using a User-defined Java class step.

Below is a screenshot of the step configuration:

This configuration allows the step to take an arbitrary input stream and normalize it into a collection of triples consisting of:
  1. A rownum column. This column delivers generated integer values, and each distinct value uniquely identifies a single input row.
  2. A fieldnum column. This is a generated integer value that uniquely identifies a field within each input row.
  3. A value column. This is a string column that contains the value that appears in the field identified by the fieldnum column, within the row identified by the rownum column.

The Row Normalizer versus the UDJC generic Normalizer

For the input data set mentioned in the initial example, the output generated by this UDJC step is shown below:
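For a hypothetical input row with id 1, first name John and last name Doe, the output consists of three rows, one per input field:

rownum  fieldnum  value
1       1         1
1       2         John
1       3         Doe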
There are a few differences with regard to the output of kettle's Row Normalizer step:
  1. One obvious difference is that the Row Normalizer step has the ability to attach names to the values, whereas the UDJC step only delivers a generated field position. On the one hand, it's really nice to have field names. On the other hand, this is also one of the weaknesses of the Row Normalizer step, because providing the names must be done manually.
  2. Another difference is that the UDJC step delivers 3 output rows for each input row, instead of the 2 rows delivered by the Row Normalizer step. The "extra" row is due to the id column. Because the id column is the key of the original input data, the Row Normalizer step was configured to only unpivot the first name and last name fields, keeping the id field unscathed: this allows any downstream steps to see which fields belong to which row. The UDJC step however does not know which field or fields form the key of the input stream. Instead, it generates its own key, the rownum field, and the id field is simply treated like any other field and unpivoted, just like the first name and last name fields. So the main difference is that the downstream steps need to use the generated rownum field to see which fields belong to which row.

The Code

The code and comments are pretty straightforward:
static long rownum = 0;
static RowMetaInterface inputRowMeta;
static long numFields;
static String[] fieldNames;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  // get the current row
  Object[] r = getRow();

  // If the row object is null, we are done processing.
  if (r == null) {
    setOutputDone();
    return false;
  }

  // If this is the first row, cache some metadata.
  // We will reuse this metadata in processing the rest of the rows.
  if (first) {
    first = false;
    inputRowMeta = getInputRowMeta();
    fieldNames = inputRowMeta.getFieldNames();
    numFields = fieldNames.length;
  }

  // Generate a new id number for the current row.
  rownum += 1;

  // Generate one output row for each field in the input stream.
  int fieldnum;
  for (fieldnum = 0; fieldnum < numFields; fieldnum++) {
    Object[] outputRow = new Object[3];
    outputRow[0] = rownum;
    // Assign the field id. Note that we need to cast to long to match Kettle's type system.
    outputRow[1] = (long)fieldnum+1;
    outputRow[2] = inputRowMeta.getString(r, fieldnum);
    putRow(data.outputRowMeta, outputRow);
  }

  return true;
}

Getting Field information

So the UDJC step only generates a number to identify the field. For some purposes it may be useful to pull in other information about the fields, like their name, data type or data format. While we could also do this directly in the UDJC step by writing more Java code, it is easier and more flexible to use some of Kettle's built-in steps:
  1. The Get Metadata Structure step. This step takes an input stream, and generates one row for each distinct field. Each of these rows has a number of columns that describe the field from the input stream. One of the columns is a Position field, which uniquely identifies each field from the input stream using a generated integer, just like the fieldnum field from our UDJC step does.
  2. The Stream Lookup step. This step allows us to combine the output stream of our UDJC step with the output of the Get Metadata Structure step. By matching the Position field of the Get Metadata Structure step with the fieldnum field of the UDJC step, we can look up any metadata fields that we happen to find useful.


Below is a screenshot that shows how all these steps work together:
And here endeth the lesson.

Monday, August 15, 2011

Proposals for Codebits.EU

Codebits is an annual 3-day conference about software and, well, code. It's organized by SAPO and this year's edition is to be held on November 10 thru 12 at the Pavilhão Atlântico, Sala Tejo in Lisbon, Portugal.

I've never attended SAPO Codebits before, but I heard good things about it from Datacharmer Giuseppe Maxia. The interesting thing about the way this conference is organized is that all proposals are available to the public, which can also vote for the proposals. This year's proposals are looking very interesting already, with high quality proposals from Giuseppe about database replication with Tungsten replicator, Pentaho's chief of data integration Matt Casters about Kettle (aka Pentaho data integration), and Pedro Alves from webdetails who will be talking about "Big Data" analysis and dashboarding work he did for the Mozilla team.

There are many more interesting talks, and you should simply check out the proposals for yourself and give a thumbs up or a thumbs down according to whether you'd like to see a particular proposal at the conference. I decided to send in a few proposals as well. So, if you like what you see here, take a minute to vote and help shape this Codebits conference. I'm hoping to meet you there!

Monday, May 30, 2011

Cleaning webpages with Pentaho Data Integration and JTidy

Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. Pentaho Data Integration (kettle) has lots of functionality on-board to make this an easy process, except one: it does not support reading data directly from HTML.

In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract their data using the conventional "Get data from XML" step. The solution hinges on two ingredients:

Standard Kettle tools for Webservices

Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in Pentaho Kettle Solutions). But when you're dealing with plain old HTML, things can get pretty hairy.

If you're lucky, the page may be in XHTML, and in that case it's worth trying the Get Data from XML step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like &nbsp; entities, which are valid in XHTML, but not in plain XML. And of course, more often than not, you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.

Workaround: JavaScript string manipulation

In the past, I usually worked around these issues. In practice, some quick and dirty string manipulation using the Modified Javascript Value step and the built-in indexOf(), substring() and replace() functions goes a long way.

In most cases I don't really need the entire web page, but only a <table>, <ul> or <ol> element in the <body>. Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular id attribute, we can use a JavaScript snippet like this:

//table we're looking for
var startHandle = "<table class=\"lvw\" cellpadding=0 cellspacing=0>";
var startPosition= html.indexOf(startHandle);
//look beyond the start tag to lose the invalid unquoted attributes
startPosition += startHandle.length;

//find where this table ends (lucky us, no nested table elements :)
var endHandle = "</table>";
var endPosition = html.indexOf(endHandle, startPosition);

//make a complete table fragment out of it again
var table = "<table>" + html.substring(startPosition, endPosition + endHandle.length);

//replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes
table = table.replace(/&nbsp;|<img[^>]*>|nowrap/ig, "");


There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.

A better way: using JTidy

While the JavaScript workaround may just work for your particular case, it certainly has disadvantages. Sometimes it may just be not so simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow.

Fortunately, there is a better way.

Using a user-defined Java Class step we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's Get Data from XML step.

We need to do two things to make this work. First, you have to download JTidy, unzip it, and place the jtidy-r938.jar in the libext directory, which resides directly inside your Kettle installation directory. (Note that if you were running Spoon, you need to restart it before the jar will be picked up.) Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the Tidy class inside the jar. With some help from the Pentaho wiki and the JTidy JavaDoc documentation, I came up with the following Java snippet to make it work:

import org.w3c.tidy.Tidy;
import java.io.StringReader;
import java.io.StringWriter;

protected Tidy tidy;

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
  //create and configure a Tidy instance
  tidy = new Tidy();
  tidy.setXmlOut(true);
  return parent.initImpl(stepMetaInterface, stepDataInterface);
}

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
  Object[] r;
  //Get row from incoming stream.
  //Bail out if it's not there.
  if ((r = getRow()) == null) {
    setOutputDone();
    return false;
  }

  //read the value of the html input field
  //the html field happens to be the 5th field in the stream;
  //because java arrays start at 0, we use index 4 to reference it
  StringReader html = new StringReader((String)r[4]);

  //use tidy to parse html to xml
  StringWriter xml = new StringWriter();
  tidy.parse(html, xml);

  //assign the xml to the output row
  //note we simply overwrite the original html field from the input row.
  r[4] = xml.toString();

  //push the output row to the outgoing stream.
  putRow(data.outputRowMeta, r);
  return true;
}

(Tip: for more examples and background information on the user-defined java class step, check out the excellent blog posts by Slawomir Chodnicki, Matt Casters and the video walk-through by Dein Einspanjer)

The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.

Tuesday, May 10, 2011

Managing kettle job configuration

Over time I've grown a habit of making a configuration file for my kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done but against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, webservice urls and file system paths.

Variables


In the past, much has been written about using Kettle variables, parameters and arguments. Variables are the basic feature that provides the mechanism to configure transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm just focusing on how to manage the configuration once you have already used variable references inside your jobs and transformations.

Managing configuration


To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to variables so any subsequent jobs and transformations can access the configuration data through variable references. The main job has one parameter called ${CONFIG_DIR}, which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file:


Reading configuration properties


The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and each value the value that should be assigned to that variable. The following snippet should give you an idea:
#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9
The set-variables.ktr transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:


Pivoting key/value pairs to use the "set variables" step


In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from selected fields in the incoming stream and assigning the field value to it. This means that you can't just feed the stream of key/value pairs from the Property Input step into the "Set variables" step: the stream coming out of the Property Input step contains multiple rows with just two fields called "Key" and "Value". Feeding it directly into the "Set variables" step would just create two variables called Key and Value, which would be assigned a new value over and over again, once for each key/value pair in the stream. So in order to meaningfully assign variables, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream, using the "Row Denormaliser" step:

As you can see in the screenshot, "Key" is the key field: the value of this field is scanned to determine in which output fields to put the corresponding value. There are no fields that make up a grouping: rather, we want all key/value pairs to end up in one big row. Or put another way, there is just one group comprising all key/value pairs. Finally, the grid below specifies for each distinct value of the "Key" field to which output field name it should be mapped, and in all cases, we want the value of the "Value" field to be stored in those fields.
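To make this concrete, the key/value stream from the earlier config.properties snippet would be pivoted from something like this:

Key             Value
STAGING_HOST    localhost
STAGING_PORT    3351

into a single row like this:

STAGING_HOST    STAGING_PORT
localhost       3351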

Drawbacks


There are two important drawbacks to this approach:

  • The "Row Denormaliser" uses the value of the keys to map each value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you manually need to keep the config.properties file and the "Row Denormaliser" step synchronized, and in practice it's very easy to make mistakes here.
  • Because the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific to just one particular project.

Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variable names anyway, I was tempted to store the configuration values themselves inside the transformation as well (using a "Generate rows" or "Data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.

Solution: Javascript


As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation now looks like this:



The actual variable assignment is done with Kettle's built-in setVariable(key, value, scope) function. The key and value from the incoming stream are passed as arguments to the key and value arguments of the setVariable() function. The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide

  • "r" - up to the root

  • "p" - up to the parent job of this transformation

  • "g" - up to the grandparent job of this transformation

For my purpose, I settle for "r".
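For completeness, here is a minimal sketch of what the script inside the JavaScript step could look like. It assumes the incoming fields are called Key and Value, like the ones produced by the Property Input step; adjust the names if your stream differs:

//assign each incoming key/value pair to a variable,
//visible up to the root job ("r" scope)
setVariable(Key, Value, "r");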

The bonus is that this set-variables.ktr is less complex than the previous one and is now even completely independent of the content of the configuration. It has become a reusable transformation that you can use over and over.

Wednesday, January 26, 2011

NoSQL support lands in JasperSoft

JasperSoft, one of the leading open source BI suites, just announced it is delivering connectors for a range of so-called NoSQL databases. The big names are all there: Cassandra, MongoDB, Riak, HBase, CouchDB, Neo4J, Infinispan, VoltDB and Redis.

I used to explain to people that the lack of SQL support in NoSQL databases poses a challenge for traditional Business Intelligence tools, because those all talk either SQL or MDX (and maybe some XQuery/XPath). With this development, this is no longer true, and I want to congratulate JasperSoft on spearheading this innovation.

I still have a number of reservations though. Although I personally value the ability to report on data in my NoSQL database, I think its usefulness will have a number of limitations that are worth considering.

Admittedly I am not an expert in the NoSQL database field, but as far as my knowledge goes, both the Dynamo-style key/value stores like Riak and the Bigtable-style hashtable stores like HBase and Cassandra can basically do two types of read operations: fetch a single object by key, or scan everything. The fetched object can be complex and contain a lot of data, and it would certainly be nice if you could run a report on that. The scan-everything operation doesn't seem that useful at the report level: for all but trivial cases, you need considerable logic to make this scan useful, and I don't think a report is the right place for this. Apart from that, if the NoSQL solution was put in place because of the large data volume, then the report itself would probably need to be executed on a cluster just to achieve acceptable response time. I may be wrong, but I don't think JasperReports supports that.

So, for a full scan of those NoSQL databases, connectors at the data integration end seem more appropriate. I think the integration of Hadoop with Pentaho data integration (a.k.a Kettle) is a step in the right direction, but of course only applicable if you're a Hadoop user.

Another point is data quality. Typically, reporting is done on a data warehouse or reporting environment where the data quality is kept in check by processing the raw data with data integration and data quality tools. Directly reporting on any operational database can be problematic because you skip those checks. Because the NoSQL databases offer virtually no constraints, those checks are even more important. So to me this seems like another reason why NoSQL connectivity is more useful in data integration tools.

JasperSoft also offers connectivity for the MongoDB and CouchDB document stores. I think that for raw reporting on the actual source documents, the same reservations apply as I mentioned in relation to the Dynamo- and Bigtable-style solutions. But there may be a few more possibilities here, at least for CouchDB.

CouchDB has a feature called views, which allows you to "query" the raw documents using a map/reduce job. I can certainly see why it'd be useful to build a report on top of that. Of course, you would still have to implement the logic to do a useful scan, and you would still have to deal with data quality issues, but you can do it in the map/reduce job, which seems a more appropriate place to handle this than a report.
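To give an idea of what such a view looks like, here is a minimal sketch of a CouchDB map function; the document fields (type, customer and total) are hypothetical and depend entirely on your own documents:

//map function of a CouchDB view:
//emit one row per "order" document, keyed by customer
function(doc) {
  if (doc.type === "order") {
    emit(doc.customer, doc.total);
  }
}

A report could then query this view (possibly combined with a built-in reduce function such as _sum) instead of scanning the raw documents.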

All in all, I think this is a promising development, and I should probably get my feet wet and try it out myself. But for now, I would recommend to keep it out of the wrecking tentacles of unaware business users :)

Saturday, December 18, 2010

Substituting variables in Kettle Parameter values

Kettle (a.k.a. Pentaho Data Integration) jobs and transformations offer support for named parameters (as of version 3.2.0). Named parameters form a special class of ordinary Kettle variables and are intended to clearly and explicitly define for which variables the caller should supply a value.

One of my pet projects, the pentaho auto-documentation solution kettle-cookbook, uses two named parameters called INPUT_DIR and OUTPUT_DIR. These allow you to specify the directory that contains the BI content that is to be documented (such as kettle transformation and job files, action sequence files and mondrian schema files), and the directory to store the generated documentation.

Several kettle-cookbook users ran into problems attempting to use variable references in the values they supplied for the INPUT_DIR and OUTPUT_DIR variables. In this case, the variables referenced in the supplied parameter values would be set by adding entries to the kettle.properties file. I just committed revision 64 of kettle-cookbook, which should fix this problem. In this article I briefly discuss the solution, as I think it may be useful to other Kettle users.

Substituting Kettle Variable References


Kettle doesn't automatically substitute variable references in parameter values (nor in ordinary variable values). So, if you need to support variable references inside parameter values, you have to substitute the variables yourself.

Variable substitution in Kettle 4.0.1 and up


As of Kettle version 4.0.1, the Calculator step supports a calculation type called "variable substitution in string A" that is intended exactly for this purpose. I have tested this, but unfortunately in 4.0.1 it doesn't seem to work, at least not for the built-in variable ${Internal.Transformation.Filename.Directory} which I used in my test. In the latest stable version, Kettle 4.1.0, it does work as advertised, and I would recommend using this method if you're a user of Kettle 4.1.0 (or later).

Variable substitution in earlier Kettle versions


I have committed myself to making kettle-cookbook work on Kettle 3.2.0, as my sources tell me that this is still an often-used version in many production environments. I'm even prepared to make kettle-cookbook work on Kettle versions earlier than 3.2.0, should there be sufficient demand for that. Anyway, the bottom line is that these versions do not support the "variable substitution in string A" calculation in the Calculator step, so you have to resort to a little trick.

A Kettle 3.2.0 transformation to substitute variables in parameters


For kettle-cookbook, I added a single transformation called substitute-variables-in-parameters.ktr as the first transformation of the main job.



The substitute-variables-in-parameters.ktr transformation uses a "Get Variables" step to read the values of the INPUT_DIR and OUTPUT_DIR parameters. The values are then processed by a javascript function which substitutes all variable references with their values. Finally, a "Set Variables" step overwrites the original value of the variables with their replaced value.

The code for the JavaScript step is shown below:

function replace_variables(text){
  var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,
      match, from = 0,
      variable_name, variable_value,
      replaced_text = ""
  ;

  while ((match = re.exec(text)) !== null) {
    variable_name = match[1] ? match[1] : match[2];
    variable_value = getVariable(variable_name, "");
    replaced_text += text.substring(from, match.index);
    replaced_text += variable_value;
    from = match.index + match[0].length;
  }
  replaced_text += text.substring(from, text.length);
  return replaced_text;
}

var replaced_input_dir = replace_variables(input_dir);
var replaced_output_dir = replace_variables(output_dir);

The script first defines function replace_variables(text) which accepts the parameter value, and returns the substituted value. Then it calls the function, applying it to the input_dir and output_dir fields from the incoming stream. These fields originate in the preceding "Get variables" step which assigns them the value of the INPUT_DIR and OUTPUT_DIR variables. The output of the replace_variables() function is assigned to the replaced_input_dir and replaced_output_dir javascript variables, which leave the JavaScript step as fields of the outgoing stream. In the final "Set variables" step, the replaced_input_dir and replaced_output_dir fields are used to overwrite the original value of the INPUT_DIR and OUTPUT_DIR values.

The replace_variables() function


Let's take a closer look at the replace_variables() function.

The heart of the function is formed by a while loop that executes a javascript regular expression called re that matches variable references.

The regular expression itself is defined at the top of the function:

var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,

Its intention is to recognize variable references of the form ${NAME} and %%NAME%%. The part of the pattern for the name is enclosed in parentheses to form a capturing group. As we shall see later on, this allows us to extract the actual name of the referenced variable. The trailing g indicates that the pattern should be matched anywhere in the string. This is necessary because we want to replace all variable references in the input text, not just the first one.

The regular expression object is used to drive the while loop by calling its exec() method. In case of a match, exec() returns an array that describes the text matched by the regular expression. If there's no match, exec() returns null:

while ((match = re.exec(text)) !== null) {
...
}

If there is a match, we first extract the variable name:

variable_name = match[1] ? match[1] : match[2];
The first entry of the match array (at index 0) is the text that was matched by the pattern as a whole. The array contains subsequent elements for each capturing group in the regular expression. Because our regular expression re has two capturing groups, the match array contains two more elements. If the variable reference is of the form ${NAME}, the element at index 1 contains the variable name. If it's of the form %%NAME%%, the name will be contained in the element at index 2.
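For example, assuming the text contains a reference like ${INPUT_DIR} or %%OUTPUT_DIR%%, the match array would look roughly like this:

//matching "${INPUT_DIR}":
//  match[0] === "${INPUT_DIR}", match[1] === "INPUT_DIR", match[2] === undefined
//matching "%%OUTPUT_DIR%%":
//  match[0] === "%%OUTPUT_DIR%%", match[1] === undefined, match[2] === "OUTPUT_DIR"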

Once we have the variable name, we can use the getVariable() javascript function to obtain its value:

variable_value = getVariable(variable_name, "");

The getVariable() function is not a standard JavaScript function; it is supplied by Kettle's JavaScript step.

To perform the actual substitution, we take the substring of the original text up to the location where the variable reference was matched. This location is conveniently supplied by the match array:

replaced_text += text.substring(from, match.index);
Right after that location, we need to put the variable value instead of its name:

replaced_text += variable_value;
The last action in the loop is to remember the location right behind the variable reference we just replaced, so we can pick up at the right location in the original value the next time we match a variable:

from = match.index + match[0].length;

Right after the loop, we need to copy the final piece of original text occurring right behind the last variable reference to yield the complete replaced text:

replaced_text += text.substring(from, text.length);

Odds and Ends


While the substitute-variables-in-parameters.ktr transformation works great for its intended purpose - substituting variables in the known parameters INPUT_DIR and OUTPUT_DIR - it is not really applicable beyond kettle-cookbook. What you'd really want to have is a job that replaces variables in all parameters, not just those that are known in advance.

As it turns out, this is actually almost trivial to achieve; however, the solution is a bit too long-winded for this post. If anyone is interested in such a solution, please post a comment and let me know, and I'd be happy to provide it.

UPDATE: A solution that substitutes all variable references occurring in the parameter values of the containing job is now available at the kettle exchange area in the Pentaho wiki.

Thursday, December 09, 2010

Parameterizing SQL statements in the Kettle Table Input step: Variables vs Parameters

I get this question on a regular basis, so I figured I might as well blog it, in the hope it will be useful for others. Here goes:

Let's say that I want to delete all records that match an id from a set of tables. The table names come in as rows into the Execute SQL Script step (check execute for every row). Next I write:
DELETE FROM {table_name} WHERE id = {identifier}

as the SQL to execute. In the parameters grid at the bottom right, I have two fields: table_name and identifier. What is the syntax for substituting the table_name and identifier parameters in the sql script?


(Although this particular question focuses on the "Execute SQL Script" step, it also applies to the "Table Input" step, and probably a few more steps I can't recall right now.)

The parameters grid can be used for prepared statement value placeholders. In the SQL statement these placeholders are denoted as question marks (?). These are positional parameters: they get their values from the fields in the incoming stream that are entered in the parameters grid, in order.
Here's an example of the correct usage of these placeholders:
DELETE FROM myTable WHERE id = ?

Here, the ? in the WHERE clause will be bound to the value of the first field from the incoming stream entered in the parameters grid. Because there is only one such placeholder, there can be only one field in the parameters grid.

An important thing to realize is that these parameters can only be used to parameterize value expressions. So, this kind of parameter does not work for identifiers, nor does it work for structural elements of the SQL statement, such as keywords. This means it cannot be used to parameterize the table name, which seems to be the intention in the original example posed in the question.

There is a way to parameterize the structural elements of the SQL statement as well as the values: you can apply variable substitution to the SQL statement.

Kettle Variables can be defined by a Set Variables step, or by specifying parameters at the transformation level. They get their value from "the environment": for example, parameters get their value initially when the transformation is started, and regular variables are typically set somewhere in the job that is calling your transformation.

In text fields, including the SQL textarea of the Table input step or the Execute SQL Script step, you denote those variables with this syntax: ${VARIABLE_NAME}. So to parameterize the table name we could use something like this:
DELETE FROM ${TABLE_NAME}


In order to force Kettle to apply variable substitution to the SQL statement, you have to check the "variable substitution" checkbox. If this checkbox is checked, all variables are simply substituted with their (string) values during transformation initialization. This is a lot like the way macros are substituted by the preprocessor in C/C++ code.

When comparing variables with parameters, two important things should be mentioned here:

  • Unlike value placeholders, variables can be used to manipulate any aspect of the SQL statement, not just value expressions. The variable value simply becomes part of the text that makes up the SQL statement; it is your responsibility to ensure the result is a syntactically valid and correct SQL statement.

  • Variables are evaluated once, during transformation initialization. So if you want to vary the variable value, you'll have to call the transformation again for the change to take effect. For the same reason, you cannot set the value of a variable and read it within the same transformation: setting the variable value occurs at runtime, but evaluating it occurs at initialization time.



Finally, here's a screenshot that summarizes these different ways to parameterize SQL statements in kettle:

If you want to read more about this topic, it's covered in both our books Pentaho Solutions and Pentaho Kettle Solutions. Another title you might be interested in is Maria Roldan's Pentaho 3.2 Data Integration: Beginner's Guide.

Wednesday, August 11, 2010

Back to blogging....

It has been a while since I posted on my blog - in fact, I believe this is the first time ever that more than one month passed between posts since I started blogging. There are a couple of reasons for the lag:

  • Matt Casters, Jos van Dongen and I have spent a lot of time finalizing our forthcoming book, Pentaho Kettle Solutions (Wiley, ISBN: 978-0-470-63517-9). The book is currently being produced, and should be available according to schedule in early September 2010. If you're interested, you might like to read one of my earlier posts that explains the organization and outline of the book.

    (I should point out that we have reorganized the outline as the project progressed, so the final result will not have all the chapters mentioned in that post. We do however cover most of the topics mentioned.)

  • I have been checking out Quipu, a promising Open Source data warehouse management solution. Quipu provides a repository-based extensible code generator that allows you to generate and maintain a data warehouse based on the Data Vault model. One of the things I want to do in the short term is write templates that allow it to work with MySQL, and after that, I want to see if I can get Quipu to generate Kettle jobs and transformations.

  • I have been working on a couple of software projects. Two of them are currently available as open source on google code:

    mql-to-sql
    This allows you to use the Metaweb Query Language (MQL) to query a RDBMS. If you're wondering what this is all about: MQL is the query language used by Freebase (which is a collaborative "database-of-everything" or "wikipedia-gone-database").

    While Freebase is interesting in its own right, I am particularly enthused about the MQL query language. I feel that MQL is an exceptionally good solution for flexible, expressive and secure data access for modern (AJAX) Web applications. Even though MQL was not developed with relational database systems in mind, I think it is a pretty good fit.

    Anyway, this is very much a work in progress, and I appreciate your feedback on it. If you're interested, you can read a bit more about my take on RDBMS data access for web applications and MQL on the mql-to-sql project home page. I have also put up an online demo that allows you to query the sakila MySQL sample database using MQL.

    kettle-cookbook
    This is a project that provides auto-documentation for Kettle (a.k.a. Pentaho Data Integration).

    The project consists of a bunch of Kettle jobs and transformations (as well as some XSLT stylesheets) that extract data from Kettle jobs and transformations and transform it into a collection of human-readable HTML documents, along with a table of contents. The resulting documentation looks and feels a bit like JavaDoc documentation.

    If all goes well, I will be presenting kettle-cookbook in a Pentaho web seminar, which is currently scheduled for September 15, 2010


  • I've been enjoying 4 weeks of vacation (I started working this week). There's not much to tell about that, other than that it was great spending a lot of time with my family. I plan to do this more often, and now that Kettle Solutions is finished I should be able to find more time to do it.

  • I've been looking at two emerging HTML5 APIs for client-side structured storage, the Web SQL Database API and the Indexed DB API.

    I have developed a few thoughts about the ongoing debate (see this article for some background) about which one is better, and I see a role for something like MQL here too. I will probably write something about this in the next few weeks.

  • Yesterday, I got wind of the JS1k contest! Basically, the challenge is to write an interesting standalone JavaScript demo program that must be no larger than 1024 bytes. It is amazing and inspiring to see what people manage to do with a modern browser in 1k of JavaScript code.

    I decided to try it myself, and you can find my submission here: An interactive SQL query tool for the Web SQL DB API.

    Essentially, you get a textarea where you can enter arbitrary SQLite queries, and button to execute the SQL, and a result area that will print a feedback message and a result table (if applicable). As a bonus, there's a button to get a listing of the available database objects (using a query on sqlite_master) and an explain button to show the query plan of the current SQL statement.

    The demo works on recent versions of Google Chrome, Apple Safari and Opera. It can run offline, and does not require any plugins. I should say that I expect my submission will be rejected by the judges since the demo is not functional on Mozilla Firefox, which is a requirement. (That is, the script will detect that the Web SQL Database API is not supported and print a message to that effect). However, it was still fun to try my hand at it.


Ok - that's it for now. I will try and post more regularly and write about these and other things in the near future. Don't hesitate to leave a comment if you have any questions or suggestions.

Sunday, March 14, 2010

Writing another book: Pentaho Kettle Solutions

Last year, at about this time of the year, I was well involved in the process of writing the book Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have none but happy memories of the collaboration with my full co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Thomas Morgner, several of the Pentaho developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews on places like Amazon and elsewhere on the web. I'd like to use this opportunity to thank everybody that took the time to review the book: thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enormously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"


In the autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and renowned publisher in the field of data warehousing and business intelligence, voiced interest in another Pentaho book by Jos and me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, Pentaho's chief of data integration and founding developer of Kettle (a.k.a. Pentaho Data Integration). Neither Jos nor I expected Matt to be able to free up any time in his ever busy schedule to help us write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing since, focusing on the forthcoming Kettle version, Kettle 4.0. The tentative title of the book is Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. It is planned to be published in September 2010, and it will have approximately 750 pages.



Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we have finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservations about releasing that already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution


  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes


  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting


  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallelization and Partitioning

    • Dynamic Clustering in the Cloud

    • Realtime and Streaming data


  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle


  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems



Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.

Wednesday, January 27, 2010

Easter Eggs for MySQL and Kettle

To whom it may concern,

A MySQL stored function to calculate easter day


I uploaded a MySQL Forge snippet for the f_easter() function. You can use this function in MySQL statements to calculate Easter Sunday for any given year:

mysql> select f_easter(year(now()));
+------------------------+
| f_easter(year(now())) |
+------------------------+
| 2010-04-04             |
+------------------------+
1 row in set (0.00 sec)

Anonymous Gregorian algorithm


To implement it, I simply transcribed the code of the "Anonymous Gregorian algorithm" from wikipedia's Computus article.

You might ask yourself: "how does it work?". Frankly, I don't know. Much like a tax form, I treat the calculation as a black box. But, it's wikipedia, so it must be right, right?

A Javascript snippet to calculate easter day


I also transcribed the algorithm to javascript, so I could use it in Kettle (a.k.a. Pentaho Data Integration). Of course, nothing should stop you from using it for another environment, such as a webpage.

I don't have a proper place to host that code, so I'm listing it here:

//Script to calculate Easter day
//according to the "Anonymous Gregorian algorithm"
function easterDay(year) {
  var a = year % 19,
      b = Math.floor(year / 100),
      c = year % 100,
      d = Math.floor(b / 4),
      e = b % 4,
      f = Math.floor((b + 8) / 25),
      g = Math.floor((b - f + 1) / 3),
      h = (19 * a + b - d - g + 15) % 30,
      i = Math.floor(c / 4),
      k = c % 4,
      L = (32 + 2 * e + 2 * i - h - k) % 7,
      m = Math.floor((a + 11 * h + 22 * L) / 451),
      n = h + L - 7 * m + 114;
  return new Date(year,
                  Math.floor(n / 31) - 1,
                  (n % 31) + 1);
}

easter = easterDay(year);

To use this in your Kettle transformations, create a stream with a field of the Integer type called year. The year field should take on the value of some year. In the step, create one output field of the Date type to take on the value of the easter script variable. (For usage outside Kettle, just use the easterDay() function as you see fit.)

Nice, but so what?


The thought may have crossed your mind: "So what, who cares - why should I ever want to know when it's easter day?"

Apparently, if you think like that, you don't like eggs very much. That's ok - I don't blame you. But I happen to like eggs, and people in the egg business like people that like eggs like me so they can sell them more eggs. In fact, they like selling eggs so much, that it makes a big difference to them whether their business intelligence reports say: "On March 22, 2008, we sold 10 times more eggs than on February 22 and May 22 of the same year" as compared to "In 2008, on the day before Easter, we only sold half the amount of eggs as compared to the day before Easter in 2009".

In order to report these facts, special events and holidays like easter are stored in a date dimension. (I wrote about creating a localized date dimension, a date dimension that speaks your language some time ago)

So there you go: you could use these solutions to build a date dimension that understands Easter. The nice thing about Easter is that it can be used to derive a whole bunch of other Christian holidays, like Good Friday, Ascension, and Pentecost, and in many western countries these are special days with regard to the normal course of business. I leave all these as an exercise to the reader, but trust me - calculating Easter is the key to solving a lot of these problems.
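As a hint for that exercise, here is a sketch (not part of the original snippet) of how some of those derived dates could be computed from the easterDay() function, using the traditional offsets: Good Friday two days before Easter, Ascension 39 days after, and Pentecost 49 days after:

//derive related holidays from Easter day
function addDays(date, days) {
  //the Date constructor normalizes out-of-range day values
  return new Date(date.getFullYear(), date.getMonth(), date.getDate() + days);
}

var easter     = easterDay(year),
    goodFriday = addDays(easter, -2),
    ascension  = addDays(easter, 39),
    pentecost  = addDays(easter, 49);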

Monday, November 16, 2009

Pentaho Data Integration: Javascript Step Performance

I just read a post from Vincent Teyssier on cleaning strings using the javascript capabilities of Pentaho Data Integration (also known as Kettle) and Talend.

In this post, I am looking at a few details of Vincent's approach to using Javascript in his transformation. I will present a few modifications that considerably improve performance of the Javascript execution. Some of these improvements are generic: because they apply to the use of the javascript language, they are likely to improve performance in both Talend as well as Kettle. Other improvements have to do with the way the incoming and outgoing record streams are bound to the javascript step in Kettle.

Original Problem


The problem described by Vincent is simple enough: for each input string, return the string in lower case, except for the initial character, which should be in upper case. For example: vIncEnt should become Vincent.

Vincent illustrates his solution using Pentaho Data Integration's "Modified Javascript Value" step. He uses it to execute the following piece of code:

//First letter in uppercase, others in lowercase
var c = Input.getString().substr(0,1);
if (parseInt(Input.getString().length)==1)
{
var cc = upper(c);
}
else
{
var cc = upper(c) + lower(Input.getString().slice(1));
}

(The original post explains that one should be able to execute the code with minimal modification in Talend. While I don't have much experience with that tool, I think the proper step to use in that case is the tRhino step. Both tools use an embedded Rhino engine as their JavaScript runtime, but I can imagine that there are slight differences with regard to binding the input and output fields and the support for built-in functions. Please feel free to leave a comment if you can provide more detailed information on this matter.)

In the script, Input is the string field in the incoming stream that is to be modified, and cc is added to the output stream, containing the modified value. For some reason, the original example uses the javascript step in compatibility mode, necessitating expressions such as Input.getString() to obtain the value from the field.

I used the following transformation to test this script:
[screenshot: the test transformation - Generate Rows, Modified Javascript Value, Dummy]
The transformation uses a Generate Rows step to generate 1 million rows having a single String type field with the default value vIncEnt. The rows are processed by the Modified Javascript Value step, using the original code and compatibility mode as described in Vincent's original post. Finally, I used a Dummy step. I am not entirely sure the Dummy step has any effect on the performance of the javascript step, but I figured it would be a good idea to ensure the output of the script is actually copied to an outgoing stream.

On my laptop, using Pentaho Data Integration 3.2, this transformation takes 21.6 seconds to complete, and the Javascript step processes the rows at a rate of 46210.7 rows/second.

Caching calls to getString()


Like I mentioned, the original transformation uses the Javascript step in compatibility mode. Compatibility mode affects the way the fields of the stream are bound to the javascript step. With compatibility mode enabled, the step behaves like it did in Kettle 2.5 (and earlier versions): fields from the input stream are considered to be objects, and a special getter method is required to obtain their value. This is why we need an expression like Input.getString() to obtain the actual value.

The first improvement I'd like to present is based on simply caching the return value from the getter method. So instead of writing Input.getString() all the time, we simply write a line like this:

var input = Input.getString();

Afterwards, we simply refer to input instead of Input.getString(). With this modification, the script becomes:

//First letter in uppercase, others in lowercase
var input = Input.getString();
var c = input.substr(0,1);
if (parseInt(input.length)==1)
{
var cc = upper(c);
}
else
{
var cc = upper(c) + lower(input.slice(1));
}

(Note that input and Input are two different things here: Input refers to the field object from the incoming record stream, and input refers to a global javascript variable which we use to cache the return value from the getString() method of the Input field object.)

If you compare this code to the original, you will notice that although this modified example adds an assignment to cache the value, it saves at least one call to the getString() method in the generic case. However, because the input value used in the example is longer than one character, it also saves another call done in the else branch of the if statement. So all in all, we can avoid two calls to getString() in this example.

This may not seem like that big a deal, but still, this improvement allows the javascript step to process rows at a rate of 51200.6 rows per second, which is an improvement of about 11%. Scripts that would have more than two calls to the getter method would benefit even more from this simple improvement.

Disabling Compatibility mode


The compatibility mode is just that: a way to stay compatible with the old Kettle 2.5 behaviour. While this is useful to ensure your old transformations don't break, you really should consider not using it for new transformations.

When disabling compatibility mode, you will need to change the script. In compatibility mode, the names of the fields from the input stream behave like variables that point to the field objects. With compatibility mode disabled, fieldnames still behave like variables, but now they point to the actual value of the field, and not the field object. So we need to change the script like this:

var c = Input.substr(0,1);
var cc;
if (parseInt(Input.length)==1){
cc = upper(c);
}
else {
cc = upper(c) + lower(Input.slice(1));
}

As you can see, we don't need to use the getString() method anywhere anymore, and this also makes our first improvement obsolete. Personally, I feel this is an improvement code-wise. In addition, the transformation now performs considerably better: it now takes 14.8 seconds, and the javascript step is processing 67159.1 rows per second, which is 30% better than the previous solution, and 45% better than the original.

Eliminating unnecessary code


The fastest code is the code you don't execute. The original script contains a call to the javascript built-in parseInt() function which is applied to the length property of Input:

if (parseInt(Input.length)==1){
...snip...
}
The intended usage of parseInt() is to parse strings into integer values. Because the length property of a string is already an integer, the call to parseInt() is simply redundant, and can be removed without any issue. This cuts down execution time to 12.8 seconds, and the Javascript step is now processing at a rate of 75204.9 rows per second: an improvement of 12% as compared to the previous improvement, and 63% as compared to the original.
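
For completeness, this is presumably what the script looks like at this point - simply the previous script with the parseInt() call removed:

var c = Input.substr(0,1);
var cc;
if (Input.length==1){
cc = upper(c);
}
else {
cc = upper(c) + lower(Input.slice(1));
}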

Optimizing the flow


Although it may look like we optimized the original javascript as much as we could, there is still room for improvement. We can rewrite the if statements using the ternary operator, like so:

var cc = Input.length==0
? ""
: Input.length==1
? Input.toUpperCase()
: Input.substr(0,1).toUpperCase()
+ Input.substr(1).toLowerCase()
;

(Note that I am now using the toLowerCase() and toUpperCase() methods of the javascript String object in favor of the kettle built-in lower() and upper() functions.)
Not everybody may appreciate this code-wise, as it may appear a lot less explicit than the original if logic. In its defense, this solution has a more functional feel (as opposed to the procedural logic of the prior examples), which may feel more natural for the problem at hand. Regardless of any code-maintenance or aesthetic arguments, this code is actually slightly faster: it takes 12.3 seconds in total, and the javascript step is processing 80301.9 rows per second, which is a 7% improvement as compared to the previous solution, and a 74% improvement as compared to the original.

Not using Javascript at all


The Javascript step can be very useful. But always keep in mind that it really is a general purpose scripting device. With the javascript step, you can do loops, open files, write to databases and whatnot. Do we really need all this power to solve the original problem? Especially if you are proficient in Javascript, it may be somewhat of a challenge to find better ways to solve the problem at hand, but really - it is often worth it.

First, let us realize that the original problem does not presume a particularly difficult transformation. We just need "something" that takes one input value, and returns one output value. We don't need any side effects, like writing to a file. We also don't need to change the grain: every input row is matched by exactly one output row, which is similar in layout to the input row, save for the addition of a field to hold the transformed value.

When discussing the previous solution, I already hinted that it was more "functional" as compared to the more "procedural" examples before that. We will now look at a few solutions that are also functional in nature:

The Formula step


So, basically, we need to write a function. The Formula step lets you combine several built-in functions in about the same manner as you can in spreadsheet programs like OpenOffice and Microsoft Excel. Using the Formula step we can enter the following formula:

UPPER(LEFT([Input];1)) & LOWER(MID([Input];2;LEN([Input])))
If, like me, your eyes are bleeding now, you might appreciate this formatted overview of the same calculation:

UPPER(
LEFT(
[Input]
; 1
)
)
& LOWER(
MID(
[Input]
; 2
; LEN([Input])
)
)

This solution takes 8.5 seconds to complete, and the formula step is processing rows at a rate of 117868.9 per second, which is 47% better than the previous solution, and 155% better than the original (!!!)

The Calculator step


While not as flexible as the Formula step, the Calculator step offers a reasonable range of often used functions, and has the advantage of often being faster than the formula step. In this case, we're lucky, and we can set up two calculations: one "LowerCase of a string A" to convert the input value entirely to lower case, and then a "First letter of each word in capital of a string A". By feeding the output of the former into the latter, we get the desired result:
[screenshot: Calculator step configuration with the two chained calculations]
(To be fair, because this calculation will actually capitalize the first letter of every word in the input, the result can differ from that of the other transformations. However, in many cases you might be able to guarantee that the input contains only one word, or otherwise it may even be considered desirable to capitalize all words.)
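
To illustrate the difference, here is a small sketch in plain javascript (not a Kettle step, and only an approximation of what the two chained calculations do) showing the effect on a multi-word input:

//lowercase the entire input, then capitalize the first letter of each word
function calculatorStyle(input) {
var lower = input.toLowerCase();
return lower.replace(/\b[a-z]/g, function(ch) {
return ch.toUpperCase();
});
}
//calculatorStyle("vIncEnt tEySSiEr") returns "Vincent Teyssier",
//whereas the earlier scripts would return "Vincent teyssier"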

This transformation completes in just 6.5 seconds, and the calculator processes rows at a rate of 155327.7 per second. This is 32% better than the previous solution and 236% better than the original.

User-defined Java Expression


The final kicker is the User-defined Java expression step. The User-defined Java expression step allows you to write a Java expression, which is compiled when the transformation is initialized. The expression I used is quite like the last javascript solution I discussed, except that we have to use the methods of the Java String class (and not the javascript String object):

Input.length()==0?"":Input.length()==1?Input.toUpperCase():Input.substring(0,1).toUpperCase() + Input.substring(1).toLowerCase()

The result is truly amazing: The transformation completes in just 3.1 seconds, with the user-defined Java expression step processing at a rate of 324886.2 rows per second. This is 109% faster than the previous solution, and 603% faster than the original.

Conclusion


Javascript is a powerful device in data integration transformations, but it is quite slow. Consider replacing the javascript step with either the Formula step, the Calculator step or the User-defined Java expression step. Depending on your requirements, there may be other steps that deliver the functionality you need.

If you really do need javascript, and you are using Pentaho Data Integration, consider disabling the compatibility mode. On the other hand, if you do need the compatibility mode, be sure to avoid repeated calls to the getter methods of the field objects to obtain the value. Instead, call the getter methods just once, and use global script variables to cache the return values.

Summary


Here's a summary of the measurements:

Transformation    | Rows per second
------------------+----------------
Original          |  46201.7
Cache getString() |  51200.6
No compat mode    |  67159.1
No parseInt()     |  75204.9
Optimized flow    |  80301.9
Formula           | 117868.9
Calculator        | 155327.7
Java Expression   | 324886.2

...and here, a bar chart showing the results:
[bar chart: rows per second for each of the transformations listed above]

Final thoughts


One of the things I haven't looked at in detail is adding more parallelism. By simply modifying the number of copies of the transforming step, we can use more cores/processors, but this is an excellent subject for a separate blog post.


UPDATE
Daniel Einspanjer from Mozilla Corp. created a 30 min. video demonstrating this hands-on! He adds a few very interesting approaches to squeeze out even more performance.

Friday, July 24, 2009

OSCON 2009 Presentation "Taming your Data: Practical Data Integration Solutions with Kettle" now online

I just delivered my OSCON 2009 presentation "Taming your data: Practical Data Integration Solutions with Kettle". I think it went pretty well, and I got good responses from the audience. I did have much more material than time, and I probably should have proposed to do a tutorial instead. Maybe next year :) Anyway, you can find the presentation and the examples on the OSCON 2009 website. The slides are available in pdf format. There's also a zip file that contains the presentation as well as the kettle sample transformations and jobs.

Saturday, July 18, 2009

My OSCON 2009 Session: Taming your Data...

Yes!

Finally, it's there: In a few hours, I will be flying off to San Francisco to attend OSCON 2009 in San Jose, California. This is the first time I'm attending, and I'm tremendously excited to be there! The sessions look very promising, and I'm looking forward to seeing some excellent speakers. I expect to learn a lot.

I'm also very proud and feel honoured to have the chance to deliver a session myself. It's called Taming Your Data: Practical Data Integration Solutions with Kettle.

Unsurprisingly, I will be talking a lot about Kettle, a.k.a. Pentaho Data Integration. Recently, I talked about Kettle too at the MySQL user's conference, and more recently, at a MySQL university session. Those sessions were focused mainly on how Kettle can help you load a data warehouse.

But...there's much more to this tool than just data warehousing, and in this session, I will be exploring rougher grounds, like making sense of raw IMDb text files, loading and generating XML, clustering and more. This session will also be much more of a hands-on demonstration than the Sakila sessions. If you're interested and you are also attending, don't hesitate to drop by! I'm looking forward to meeting you :)

And...because the topic of the session kind of relates to my upcoming book, "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" (ISBN: 978-0-470-48432-6, 600+ pages, list price $50.00), my publisher Wiley decided to throw in a little extra. Yup, that's right - I've got discount coupons for the book, so if you are interested in picking up a copy, or if you just want to give one away to a friend or colleague, come find me at my session (or somewhere else at OSCON) and I'll make sure you'll get one. Thanks Wiley!!

Anyway - I'm hoping to meet you there: see you soon!!!

Friday, July 03, 2009

Starring Sakila: MySQL university recording, slides and materials available on MySQL Forge

Hi!

Yesterday I had the honour of presenting my mini-bi/datawarehousing tutorial "Starring Sakila" for MySQL University. I did a modified version of the presentation I did together with Matt Casters at the MySQL user's conference 2009 (video available here, thanks to Sheeri). The structure of the presentation is still largely the same, although I condensed various bits, and I added practical examples of setting up the ETL process and creating a Pentaho Analysis View (OLAP pivot table) on top of a Mondrian Cube.

The slides, session recording, and materials such as SQL script, pentaho data integration jobs and transformations, and Sakila Rentals Cube for Mondrian are all available here on MySQL Forge.

Copyright Notice


Presentation slides, and materials such as the SQL script, pentaho data integration jobs and transformations, and the Sakila Rentals Cube for Mondrian are all Copyright Roland Bouman. Feel free to download and learn from them. But please do not distribute the materials yourself - instead, point people to the wiki page so they can get their own copy of the materials. Personal use of the files is allowed. Using these materials to create training materials, or using them as training materials, is explicitly not allowed without prior written consent. (Just mail me at roland dot bouman at gmail dot com if you would like to use the materials for such purposes, and we can work something out.)

Friday, February 20, 2009

Exporting a Kettle Repository to Files

Hi All!

Today I'd like to announce KREX, a small solution I put together to export a Kettle (a.k.a. Pentaho Data Integration) repository to individual transformation (.ktr) and job (.kjb) files.

The idea to create this was inspired by this thread on the pentaho forums, started by kandrews. He (she?) wrote:
Has anyone ever been able to export a PDI repository and convert it somehow into regular non-repository .kjb & .ktr files? If you have done this already or this functionality already exists please let me know.

My initial thoughts are possibly an XLS translation against the XML from the repository export. Thoughts?
Well, I hope this helps! Enjoy, and let me know if it's useful. Be advised that in the same thread, Matt Casters already revealed that the functionality to do this will soon be built into PDI, but until then this may be of use.

To start using KREX,

  • Check out the repository or download the job and transformation files to your file system.

  • Open the main job file export_repository_to_files.kjb using Pentaho Data Integration 3.2's Spoon (currently a Milestone 1 release)

  • Configure the Set Source Repository Step in the set_source_repo_and_target_directory transformation to match the repository you want to export

  • Run the main job file (export_repository_to_files.kjb)

If all goes well, you should now have a directory called pdi_repo_export in your home directory which contains a subdirectory named after your exported repository containing the directory tree with the .ktr and .kjb files.

Here's a quick screenshot of the main job, just to give you an idea:
[screenshot: the main KREX job]
The heart of the job is formed by the very last transformation, which does the actual legwork of extracting and saving the individual transformations:
[screenshot: the transformation that extracts and saves the individual files]
The steps before that are mainly configuration and ensuring that the directory tree that is to contain the files is created before we attempt to write any files.

If you have any suggestions or comments, I welcome you to post them here. If you are trying to use KREX but run into an issue, please use the KREX issuelist.

If you are looking for more tips and tricks with Kettle and Pentaho in general, stay tuned. The "Building Pentaho Solutions" book I'm writing for Wiley together with Jos van Dongen will contain tons and tons of practical tips and solutions, and explain many of its technologies and concepts in thorough detail.

Cheers and until next time,

Roland
