Monday, November 16, 2009

Pentaho Data Integration: Javascript Step Performance

I just read a post from Vincent Teyssier on cleaning strings using the javascript capabilities of Pentaho Data Integration (also known as Kettle) and Talend.

In this post, I am looking at a few details of Vincent's approach to using Javascript in his transformation. I will present a few modifications that considerably improve performance of the Javascript execution. Some of these improvements are generic: because they apply to the use of the javascript language, they are likely to improve performance in both Talend as well as Kettle. Other improvements have to do with the way the incoming and outgoing record streams are bound to the javascript step in Kettle.

Original Problem


The problem described by Vincent is simple enough: for each input string, return the string in lower case, except for the initial character, which should be in upper case. For example: vIncEnt should become Vincent.

Vincent illustrates his solution using Pentaho Data Integration's "Modified Javascript Value" step. He uses it to execute the following piece of code:

//First letter in uppercase, others in lowercase
var c = Input.getString().substr(0,1);
if (parseInt(Input.getString().length)==1)
{
var cc = upper(c);
}
else
{
var cc = upper(c) + lower(Input.getString().slice(1));
}

(The original post explains that one should be able to execute the code with minimal modification in Talend. While I don't have much experience with that tool, I think the proper step to use in that case is the tRhino step. Both tools use an embedded Rhino engine as javascript runtime, but I can imagine that there are slight differences with regard to binding the input and output fields and the support for built-in functions. Please feel free and leave a comment if you can provide more detailed information with regard to this matter.)

In the script, Input is the string field in the incoming stream that is to be modified, and cc is added to the output stream, containing the modified value. For some reason, the original example uses the javascript step in compatibility mode, necessitating expressions such as Input.getString() to obtain the value from the field.

I used the following transformation to test this script:
v0
The transformation uses a Generate Rows step to generate 1 million rows having a single String type field with the default value vIncEnt. The rows are processed by the Modified Javascript Value step, using the original code and compatibility mode like described in Vincent's original post. Finally, I used a Dummy step. I am not entirely sure the dummy ste has any effect on the performance of the javascript step, but I figured it would be a good idea to ensure the output of the script is actually copied to an outgoing stream.

On my laptop, using Pentaho Data Integration 3.2, this transformation takes 21.6 seconds to complete, and the Javascript step processes the rows at a rate of 46210.7 rows/second.

Caching calls to getString()


Like I mentioned, the original transformation uses the Javascript step in compatibility mode. Compatibility mode affects the way the fields of the stream are bound to the javascript step. With compatibility mode enabled, the step behaves like it did in Kettle 2.5 (and earlier versions): fields from the input stream are considered to be objects, and a special getter method is required to obtain their value. This is why we need an expression like Input.getString() to obtain the actual value.

The first improvement I'd like to present is based on simply caching the return value from the getter method. So instead of writing Input.getString() all the time, we simply write a line like this:

var input = Input.getString();

Afterwards, we simply refer only to input instead of Input.toString(). With this modifcation, the script becomes:

//First letter in uppercase, others in lowercase
var input = Input.getString();
var c = input.substr(0,1);
if (parseInt(input.length)==1)
{
var cc = upper(c);
}
else
{
var cc = upper(c) + lower(input.slice(1));
}

(Note that input and Input are two different things here: Input refers to the field object from the incoming record stream, and input refers to a global javascript variable which we use to cache the return value from the getString() method of the Input field object.)

If you compare this code to the original, you will notice that although this modified example adds an assignment to cache the value, it saves at least one call to the getString() method in the generic case. However, because the input value used in the example is longer than one character, it also saves another call done in the else branch of the if statement. So all in all, we can avoid two calls to getString() in this example.

This may not seem like that big a deal, but still, this improvement allows the javascript step to process rows at a rate of 51200.6 rows per second, which is an improvement of about 11%. Scripts that would have more than two calls to the getter method would benefit even more from this simple improvement.

Disabling Compatibility mode


The compatibility mode is just that: a way to stay compatible with the old Kettle 2.5 behaviour. While this is useful to ensure your old transformations don't break, you really should consider not using it for new transformations.

When disabling compatibility mode, you will need to change the script. In compatibility mode, the names of the fields from the input stream behave like variables that point to the field objects. With compatibility mode disabled, fieldnames still behave like variables, but now they point to the actual value of the field, and not the field object. So we need to change the script like this:

var c = Input.substr(0,1);
var cc;
if (parseInt(Input.length)==1){
cc = upper(c);
}
else {
cc = upper(c) + lower(Input.slice(1));
}

As you can see, we don't need to use the getSting() method anywhere anymore, and this also makes our first improvement obsolete. Personally, I feel this is an improvement code-wise. In addition, the transformation now performs considerably better: now it takes 14,8 seconds, and the javascript step is processing 67159,1 rows per second, which 30% better than the previous solution, and 45% better than the original.

Eliminating unncessary code


The fastest code is the code you don't execute. The original script contains a call to the javascript built-in parseInt() function which is applied to the length property of Input:

if (parseInt(Input.length)==1){
...snip...
}
The intended usage of parseInt() is to parse strings into integer values. Because the type of the length property of a string is already an integer, the call to parseInt() is simply redundant, and can be removed without any issue. This cuts down execution time to 12.8 seconds, and the Javascript step is now processing at a rate of 75204,9 rows per second: an improvement of 12% as compared to the previous improvement, and 63% as compared to the original.

Optimizing the flow


Although it may look like we optimized the original javascript as much as we could, there is still room for improvement. We can rewrite the if statements using the ternary operator, like so:

var cc = Input.length==0
? ""
: Input.length==1
? Input.toUpperCase()
: Input.substr(0,1).toUpperCase()
+ Input.substr(1).toLowerCase()
;

(Note that I am now using the toLowerCase() and toUpperCase() methods of the javascrpt String object in favor of the kettle built-in lower() and upper() functions.)
Not everybody may appreciate this code-wise, as it may appear a lot less explcit than the original if logic. In its defense, the approach of this solution has a more functional feel (as opposed to the procedural logic of the prior examples), which may feel more natural for the problem at hand. Regardless of any code-maintenance or aesthetic arguments, this code is actually slightly faster: It takes 12.3 seconds total, and the javascript step is processing 80301,9
rows per second, which is a 7% improvement as compared to the previous solution, and a 74% improvement as compared to the original.

Not using Javascript at all


The Javascript step can be very useful. But always keep in mind that it really is a general purpose scripting device. With the javascript step, you can do loops, open files, write to databases and whatnot. Do we really need all this power to solve the original problem? Especially if you are proficient in Javascript, it may be somewhat of a challenge to find better ways to solve the problem at hand, but really - it is often worth it.

First, let us realize that the original problem does not presume a particularly difficult transformation. We just need "something" that takes one input value, and returns one output value. We don't need any side effects, like writing to a file. We also don't need to change the grain: every input row is matched by exactly one output row, which is similar in layout to the output row, save for the addition of a field to hold the transformed value.

When discussing the previous solution, I already hinted that it was more "functional" as compared to the more "procedural" examples before that. We will now look at a few solutions that are also functional in nature:

The Formula step


So, basically, we need to write a function. The Formula step lets you combine several built-in functions in about the same manner as you can in spreadsheet programs like open office and Microsoft Excel. Using the formula step we can enter the following formula:

UPPER(LEFT([Input];1)) & LOWER(MID([Input];2;LEN([Input])))
If, like me, your eyes are bleading now, you might appreciate this formatted overview of this calculation:

UPPER(
LEFT(
[Input]
; 1
)
)
& LOWER(
MID(
[Input]
; 2
; LEN([Input])
)
)

This solution takes 8.5 seconds to complete, and the formula step is processing rows at a rate of 117868.9 per second, which is 47% better than the previous solution, and 155% better than the original (!!!)

The Calculator step


While not as flexible as the Formula step, the Calculator step offers a reasonable range of often used functions, and has the advantage of often being faster than the formula step. In this case, we're lucky, and we can set up two calculations: one "LowerCase of a string A" to convert the input value entirely to lower case, and then a "First letter of each word in capital of a string A". By feeding the output of the former into the latter, we get the desired result:
v4
(To be fair, because the calculation will actually add a capital to every word in the input, the result will actually be different as compared to any of the other transformations. However, in many cases, you might be able to guarantee that there is actually one word in the input, or otherwise, it may be considered desirable to capitalize all words.)

This transformation complets in just 6.5 seconds, and the calculator processes rows at a rate of 155327,7 per second. This is 32% better than the previous solution and 236% better than the original.

User-defined Java Expression


The final kicker is the user-defined java expression step. The user-defined java expression step allows you to write a java expression, which is compiled while the transformation is initialized. The expression I used is quite like the last javascript solution I discussed, except that we have to use methods of the Java String object (and not the JavaScript string object)

Input.length()==0?"":Input.length()==1?Input.toUpperCase():Input.substring(0,1).toUpperCase() + Input.substring(1).toLowerCase()

The result is truly amazing: The transformation completes in just 3.1 seconds, with the user-defined Java expression step processing at a rate of 324886,2 rows per second. This is 109% faster than the previous solution, and 603% faster than the original.

Conclusion


Javascript is a powerful device in data intergration transfomations, but it is quite slow. Consider replacing the javascript step with either the formula step, the calculator step or the user-defined Java expression step. Depending on your requirements, there may be other steps that deliver the fuunctionality you need.

If you really do need javascript, and you are using Pentaho Data Integration, consider disabling the compatibility mode. On the other hand, if you do need the compatibility mode, be sure to avoid repeated calls the getter methods of the field objects to obtain the value. Instead, call the getter methods just once, and use global script variables to cache the return value.

Summary


Here's a summary of the measurements:

Transformation | Rows per second
-----------------+-----------------
Original | 46201,7
Cache getString()| 51200,6
No Compatmode | 67159,1
no parseInt() | 75204,9
Optimize flow | 80301,9
Formula | 117868,9
Calculator | 155327,7
Java Expression | 324886,2

...and here, a bar chart showing the results:
v1000

Final thoughts


One of the things I haven't looked at in detail is adding more parallelism. By simply modifying the number of copies of the transforming step, we can use more cores/processors, but this is an excellent subject for a separate blog post.


UPDATE
Daniel Einspanjer from Mozilla Coorp. created a 30 min. video demonstrating this hands-on! He adds a few very interesting approaches to squeeze out even more performance.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...