One of my pet projects, the pentaho auto-documentation solution kettle-cookbook, uses two named parameters called
INPUT_DIR
and OUTPUT_DIR
. These allow you to specify the directory that contains the BI content that is to be documented (such as kettle transformation and job files, action sequence files and mondrian schema files), and the directory to store the generated documentation.Several kettle-cookbook users ran into problems attempting to use variable references in the values they supplied for the
INPUT_DIR
and OUTPUT_DIR
variables. In this case, the variables referenced in the supplied parameter values would be set by adding entries in kettle.properties
file. I just committed revision 64 of kettle-cookbook which should fix this problem. In this article I briefly discuss the solution, as I think it may be useful to other kettle users. Substituting Kettle Variable References
Kettle doesn't automatically substitute variable references in parameter values (nor in ordinary variable values). So, if you need to support variable references inside parameter values, you have to substitute the variables yourself.
Variable substitution in Kettle 4.01 and up
As of Kettle version 4.01, the Calculator step supports a calculation type called "variable substitution in string A" that is intended exactly for that purpose. I have tested this but unfortunately in 4.01 it doesn't seem to work, at least not for the built-in variable
${Internal.Transformation.Filename.Directory}
which I used in my test. In the latest stable version, Kettle 4.10 it does work as advertised, I would recommend using this method if you're a user of Kettle 4.10 (or later).Variable substitution in earlier Kettle versions
I have committed myself to making kettle-cookbook work on kettle 3.2.0, as my sources tell me that this is still an often-used version in many production environments. I'm even prepared to make kettle-cookbook work on Kettle versions earlier than 3.20, should there be sufficient demand for that. Anyway, the bottom line is, these versions do not support the "variable substitution in string A" calculation in the Calculator step, so you have to resort to a little trick.
A Kettle 3.2.0 transformation to substitute variables in parameters
For kettle-cookbook, I added a single transformation called substitute-variables-in-parameters.ktr as the first transformation of the main job.
The
substitute-variables-in-parameters.ktr
transformation uses a "Get Variables" step to read the values of the INPUT_DIR
and OUTPUT_DIR
parameters. The values are then processed by a javascript function which substitutes all variable references with their values. Finally, a "Set Variables" step overwrites the original value of the variables with their replaced value.The code for the JavaScript step is shown below:
function replace_variables(text){
var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,
match, from = 0,
variable_name, variable_value,
replaced_text = ""
;
while ((match = re.exec(text)) !== null) {
variable_name = match[1] ? match[1] : match[2];
variable_value = getVariable(variable_name, "");
replaced_text += text.substring(from, match.index);
replaced_text += variable_value;
from = match.index + match[0].length;
}
replaced_text += text.substring(from, text.length);
return replaced_text;
}
var replaced_input_dir = replace_variable(input_dir);
var replaced_output_dir = replace_variable(output_dir);
The script first defines
function replace_variables(text)
which accepts the parameter value, and returns the substituted value. Then it calls the function, applying it to the input_dir
and output_dir
fields from the incoming stream. These fields originate in the preceding "Get variables" step which assigns them the value of the INPUT_DIR
and OUTPUT_DIR
variables. The output of the replace_variables()
function is assigned to the replaced_input_dir
and replaced_output_dir
javascript variables, which leave the JavaScript step as fields of the outgoing stream. In the final "Set variables" step, the replaced_input_dir
and replaced_output_dir
fields are used to overwrite the original value of the INPUT_DIR
and OUTPUT_DIR
values.The replace_variables()
function
Let's take a closer look at the
replace_variables()
function.The heart of the function is formed by a
while
loop that executes a javascript regular expression called re
that matches variable references.The regular expression itself is defined in the top of the function:
var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,
It's intention is to recognize variable references of the form
${NAME}
and %%NAME%%
. The part of the pattern for the name is enclosed in parenthesis to form a capturing group. As we shall see later on, this allows us to extract the actual name of the referenced variable. The trailing g
indicates that the pattern should be matched anywhere in the string. This is necessary because we want to replace all variable references in the input text, not just the first one.The regular expression object is used to drive the
while
loop by calling its exec()
method. In case of a match, the exec()
returns an array that describes the text matched by the regular expression. If there's no match, exec()
returns null
while ((match = re.exec(text)) !== null) {
...
}
If there is a match, we first extract the variable name:
The first entry of the
variable_name = match[1] ? match[1] : match[2];
match
array (at index 0) is the text that was matched by the pattern as a whole. The array contains subsequent elements for each capturing group in the regular expression. Because our regular expression re
has 2 capturing groups, the match
array contains two more elements. If the variable is of the form ${NAME}
, the element at index=1 contains the variable name. If it's of the form %%NAME%%
, it will be contained in the element at index=2.Once we have the variable name, we can use the
getVariable()
javascript function to obtain its value:
variable_value = getVariable(variable_name, "");
The
getVariable()
is not a standard javascript function, but supplied by the kettle javascript step.To perform the actual substitution, we take the substring of the original text up to the location where the variable reference was matched. This location is conveniently supplied by the
match
array:Right after that location, we need to put the variable value instead of its name:
replaced_text += text.substring(from, match.index);
The last action in the loop is to remember to location right behind the last replaced variable reference, so we can pick up at the right location in the original value the next time we match a variable:
replaced_text += variable_value;
from = match.index + match[0].length;
Right after the loop, we need to copy the final piece of original text occurring right behind the last variable reference to yield the complete replaced text:
replaced_text += text.substring(from, text.length);
Odds and Ends
While the
substitute-variables-in-parameters.ktr
transformation works great for its intended purpose, substituting variables in the known parameters INPUT_DIR
and OUTPUT_DIR
, it is not really applicable beyond kettle cookbook. What you'd really want to have is a job that replaces variables in all parameters, not just those that are known in advance. As it turns out this is actually almost trivial to achieve, however to solution is a bit too long-winded for this post. If anyone is interested in such a solution, please post a comment and let me know, and I'd be happy to provide it.
UPDATE: A solution that substitutes all variable references occurring in the parameter values of the containing job is now available at the kettle exchange area in the Pentaho wiki.