Saturday, December 18, 2010

Substituting variables in Kettle Parameter values

Kettle (a.k.a. Pentaho Data Integration) jobs and transformations offers support for named parameters (as of version 3.2.0). Named parameters form a special class of ordinary kettle variables and are intended to clearly and explicitly define for which variables the caller should supply a value.

One of my pet projects, the pentaho auto-documentation solution kettle-cookbook, uses two named parameters called INPUT_DIR and OUTPUT_DIR. These allow you to specify the directory that contains the BI content that is to be documented (such as kettle transformation and job files, action sequence files and mondrian schema files), and the directory to store the generated documentation.

Several kettle-cookbook users ran into problems attempting to use variable references in the values they supplied for the INPUT_DIR and OUTPUT_DIR variables. In this case, the variables referenced in the supplied parameter values would be set by adding entries in kettle.properties file. I just committed revision 64 of kettle-cookbook which should fix this problem. In this article I briefly discuss the solution, as I think it may be useful to other kettle users.

Substituting Kettle Variable References


Kettle doesn't automatically substitute variable references in parameter values (nor in ordinary variable values). So, if you need to support variable references inside parameter values, you have to substitute the variables yourself.

Variable substitution in Kettle 4.01 and up


As of Kettle version 4.01, the Calculator step supports a calculation type called "variable substitution in string A" that is intended exactly for that purpose. I have tested this but unfortunately in 4.01 it doesn't seem to work, at least not for the built-in variable ${Internal.Transformation.Filename.Directory} which I used in my test. In the latest stable version, Kettle 4.10 it does work as advertised, I would recommend using this method if you're a user of Kettle 4.10 (or later).

Variable substitution in earlier Kettle versions


I have committed myself to making kettle-cookbook work on kettle 3.2.0, as my sources tell me that this is still an often-used version in many production environments. I'm even prepared to make kettle-cookbook work on Kettle versions earlier than 3.20, should there be sufficient demand for that. Anyway, the bottom line is, these versions do not support the "variable substitution in string A" calculation in the Calculator step, so you have to resort to a little trick.

A Kettle 3.2.0 transformation to substitute variables in parameters


For kettle-cookbook, I added a single transformation called substitute-variables-in-parameters.ktr as the first transformation of the main job.



The substitute-variables-in-parameters.ktr transformation uses a "Get Variables" step to read the values of the INPUT_DIR and OUTPUT_DIR parameters. The values are then processed by a javascript function which substitutes all variable references with their values. Finally, a "Set Variables" step overwrites the original value of the variables with their replaced value.

The code for the JavaScript step is shown below:

function replace_variables(text){
var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,
match, from = 0,
variable_name, variable_value,
replaced_text = ""
;

while ((match = re.exec(text)) !== null) {
variable_name = match[1] ? match[1] : match[2];
variable_value = getVariable(variable_name, "");
replaced_text += text.substring(from, match.index);
replaced_text += variable_value;
from = match.index + match[0].length;
}
replaced_text += text.substring(from, text.length);
return replaced_text;
}

var replaced_input_dir = replace_variable(input_dir);
var replaced_output_dir = replace_variable(output_dir);

The script first defines function replace_variables(text) which accepts the parameter value, and returns the substituted value. Then it calls the function, applying it to the input_dir and output_dir fields from the incoming stream. These fields originate in the preceding "Get variables" step which assigns them the value of the INPUT_DIR and OUTPUT_DIR variables. The output of the replace_variables() function is assigned to the replaced_input_dir and replaced_output_dir javascript variables, which leave the JavaScript step as fields of the outgoing stream. In the final "Set variables" step, the replaced_input_dir and replaced_output_dir fields are used to overwrite the original value of the INPUT_DIR and OUTPUT_DIR values.

The replace_variables() function


Let's take a closer look at the replace_variables() function.

The heart of the function is formed by a while loop that executes a javascript regular expression called re that matches variable references.

The regular expression itself is defined in the top of the function:

var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,

It's intention is to recognize variable references of the form ${NAME} and %%NAME%%. The part of the pattern for the name is enclosed in parenthesis to form a capturing group. As we shall see later on, this allows us to extract the actual name of the referenced variable. The trailing g indicates that the pattern should be matched anywhere in the string. This is necessary because we want to replace all variable references in the input text, not just the first one.

The regular expression object is used to drive the while loop by calling its exec() method. In case of a match, the exec() returns an array that describes the text matched by the regular expression. If there's no match, exec() returns null

while ((match = re.exec(text)) !== null) {
...
}

If there is a match, we first extract the variable name:

variable_name = match[1] ? match[1] : match[2];
The first entry of the match array (at index 0) is the text that was matched by the pattern as a whole. The array contains subsequent elements for each capturing group in the regular expression. Because our regular expression re has 2 capturing groups, the match array contains two more elements. If the variable is of the form ${NAME}, the element at index=1 contains the variable name. If it's of the form %%NAME%%, it will be contained in the element at index=2.

Once we have the variable name, we can use the getVariable() javascript function to obtain its value:

variable_value = getVariable(variable_name, "");

The getVariable() is not a standard javascript function, but supplied by the kettle javascript step.

To perform the actual substitution, we take the substring of the original text up to the location where the variable reference was matched. This location is conveniently supplied by the match array:

replaced_text += text.substring(from, match.index);
Right after that location, we need to put the variable value instead of its name:

replaced_text += variable_value;
The last action in the loop is to remember to location right behind the last replaced variable reference, so we can pick up at the right location in the original value the next time we match a variable:

from = match.index + match[0].length;

Right after the loop, we need to copy the final piece of original text occurring right behind the last variable reference to yield the complete replaced text:

replaced_text += text.substring(from, text.length);

Odds and Ends


While the substitute-variables-in-parameters.ktr transformation works great for its intended purpose, substituting variables in the known parameters INPUT_DIR and OUTPUT_DIR, it is not really applicable beyond kettle cookbook. What you'd really want to have is a job that replaces variables in all parameters, not just those that are known in advance.

As it turns out this is actually almost trivial to achieve, however to solution is a bit too long-winded for this post. If anyone is interested in such a solution, please post a comment and let me know, and I'd be happy to provide it.

UPDATE: A solution that substitutes all variable references occurring in the parameter values of the containing job is now available at the kettle exchange area in the Pentaho wiki.

8 comments:

Joao Luis said...

Hi Roland

I'm sort of a rookie using Kettle 4.1.0 and I'm having problems in getting variables values with Calculator.

Can you please provide an example that works?

Thanks in advance for your help.

rpbouman said...

Hi Joao Luis,

please look in the samples/transformations directory underneath your kettle installation dir. You'll find lots of samples there, including one called "Calculator.ktr".

Karl Miller said...

Roland,
I'm trying to run a transform with 4.1 where I need to do something like the following (I'm a novice with JS, so bear with, please):

// If the current value is not null,
if (!(currval.isNull))
{
// Use the current vale in this record.
useval = currval;
// Save the current value for the next record's previous value
prevval = currval;
}
// If the current value is NULL and a PREVIOUS value exists,
else if ((currval.isNull) && (!(prevval.isNull)))
{
// Use the previous value for the current row.
currval = prevval;
}
// If there's no current value or previous value (e.g. first record has a blank),
else if ((currval.isNull) & (prevval.isNull))
{
// Use the DEFAULT value for both this run and preserve it for the next row, just in case.
useval = dfltval;
prevval = dfltval;
}

setVariable(FIELD_01,useval);
setVariable(PREV_01, prevval);

And then want my output to use the value of FIELD_01 and have the next record see the value of PREV_01 in case it needs it.

How can I accomplish this?

Thanks,
Karl

rpbouman said...

Hi Karl,

Just based on a general scan of the code and the comments, I'd say this:

// Use the previous value for the current row.
currval = prevval;

is not correct. That is, I'd expect it to be

useval = prevval;

As a more general comment on what you're trying to achieve:

"And then want my output to use the value of FIELD_01 and have the next record see the value of PREV_01 in case it needs it."

This is not really how Kettle variables were meant to be used. Variable references (like ${VARIABLE_NAME}) in configuration dialogs are resolved once in the init phase of a transformation / job. So while you can change the value of the variable with javascript or with the set variables step many times, this typically doesn't change anything since the variables were read out once already (although I expect dynamically reading out the variables with javascript would work).

The typical way that variables are used is that in a job, there is a transformation or set variables job entry that sets the value of the variables once, and then the subsequent job entries resolve their variable references against that.

There's more that could be said about your code snippet, but this may be superceded by the remark above.

Can you tell us abit about what you want to achieve by setting the variables dynamically. Chances are your problem can be solved in another way.

Karl Miller said...

Thanks, Roland.

I'm trying to load data from spreadsheets. Some of the column values a user would enter may be entered many many times. To alleviate this, I wanted to allow a configurable default setting on the job as a variable (dfltval from ${DEFAULT_VAL}), and allow the user to simply set a block of data in the spreadsheet by setting the value in the first row of the block (currVal). This value would need to be preserved and used across rows (prevval) until a new block is started. Whenever a new block is entered, update the value for the block (prevval = currval).

When I get the the output of the step, I wanted to just use the value from the JS step (useval), and leave the logic in just that step.

Leo G said...

Hi Roland
I'm having difficulties passing values in to transformation. I a job that run sql statement that generate multiple rows and I want run next transformation for each row. I can't seems to get values inside the transformation. Could you point me in the right direction. I'm on 4.4 stable. Thanks

Leo G said...

Hi Roland
I'm having difficulties passing values in to transformation. I a job that run sql statement that generate multiple rows and I want run next transformation for each row. I can't seems to get values inside the transformation. Could you point me in the right direction. I'm on 4.4 stable. Thanks

rpbouman said...

Hi Unknown,

Your question does not relate to this blog post. Please use general forums and community support next time.

As for your question:

"I a job that run sql statement"

Are you using the SQL job entry? I don't think that works if you want to use the resultset. Instead, you should build a transformation that uses a "table input" step (in the input category), and then use a "copy rows to result" step (in the job category) to export the result to the job containing the transformation.

Then, inside the job, you can right click on the receiving transformation and check the "run for each row" flag. To get the values into the receiving transformation, you can either use parameters or a "get rows from result" input step. In the case of parameters, double click the transformation to open its configuration dialog, and use the parameters tab to configure how fields from the resultset map to the parameters of the receiving transformation.

The following post may also be useful:

http://type-exit.org/adventures-with-open-source-bi/2010/06/run-kettle-job-for-each-row/

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...