Saturday, December 18, 2010

Substituting variables in Kettle Parameter values

Kettle (a.k.a. Pentaho Data Integration) jobs and transformations offers support for named parameters (as of version 3.2.0). Named parameters form a special class of ordinary kettle variables and are intended to clearly and explicitly define for which variables the caller should supply a value.

One of my pet projects, the pentaho auto-documentation solution kettle-cookbook, uses two named parameters called INPUT_DIR and OUTPUT_DIR. These allow you to specify the directory that contains the BI content that is to be documented (such as kettle transformation and job files, action sequence files and mondrian schema files), and the directory to store the generated documentation.

Several kettle-cookbook users ran into problems attempting to use variable references in the values they supplied for the INPUT_DIR and OUTPUT_DIR variables. In this case, the variables referenced in the supplied parameter values would be set by adding entries in kettle.properties file. I just committed revision 64 of kettle-cookbook which should fix this problem. In this article I briefly discuss the solution, as I think it may be useful to other kettle users.

Substituting Kettle Variable References


Kettle doesn't automatically substitute variable references in parameter values (nor in ordinary variable values). So, if you need to support variable references inside parameter values, you have to substitute the variables yourself.

Variable substitution in Kettle 4.01 and up


As of Kettle version 4.01, the Calculator step supports a calculation type called "variable substitution in string A" that is intended exactly for that purpose. I have tested this but unfortunately in 4.01 it doesn't seem to work, at least not for the built-in variable ${Internal.Transformation.Filename.Directory} which I used in my test. In the latest stable version, Kettle 4.10 it does work as advertised, I would recommend using this method if you're a user of Kettle 4.10 (or later).

Variable substitution in earlier Kettle versions


I have committed myself to making kettle-cookbook work on kettle 3.2.0, as my sources tell me that this is still an often-used version in many production environments. I'm even prepared to make kettle-cookbook work on Kettle versions earlier than 3.20, should there be sufficient demand for that. Anyway, the bottom line is, these versions do not support the "variable substitution in string A" calculation in the Calculator step, so you have to resort to a little trick.

A Kettle 3.2.0 transformation to substitute variables in parameters


For kettle-cookbook, I added a single transformation called substitute-variables-in-parameters.ktr as the first transformation of the main job.



The substitute-variables-in-parameters.ktr transformation uses a "Get Variables" step to read the values of the INPUT_DIR and OUTPUT_DIR parameters. The values are then processed by a javascript function which substitutes all variable references with their values. Finally, a "Set Variables" step overwrites the original value of the variables with their replaced value.

The code for the JavaScript step is shown below:

function replace_variables(text){
var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,
match, from = 0,
variable_name, variable_value,
replaced_text = ""
;

while ((match = re.exec(text)) !== null) {
variable_name = match[1] ? match[1] : match[2];
variable_value = getVariable(variable_name, "");
replaced_text += text.substring(from, match.index);
replaced_text += variable_value;
from = match.index + match[0].length;
}
replaced_text += text.substring(from, text.length);
return replaced_text;
}

var replaced_input_dir = replace_variable(input_dir);
var replaced_output_dir = replace_variable(output_dir);

The script first defines function replace_variables(text) which accepts the parameter value, and returns the substituted value. Then it calls the function, applying it to the input_dir and output_dir fields from the incoming stream. These fields originate in the preceding "Get variables" step which assigns them the value of the INPUT_DIR and OUTPUT_DIR variables. The output of the replace_variables() function is assigned to the replaced_input_dir and replaced_output_dir javascript variables, which leave the JavaScript step as fields of the outgoing stream. In the final "Set variables" step, the replaced_input_dir and replaced_output_dir fields are used to overwrite the original value of the INPUT_DIR and OUTPUT_DIR values.

The replace_variables() function


Let's take a closer look at the replace_variables() function.

The heart of the function is formed by a while loop that executes a javascript regular expression called re that matches variable references.

The regular expression itself is defined in the top of the function:

var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,

It's intention is to recognize variable references of the form ${NAME} and %%NAME%%. The part of the pattern for the name is enclosed in parenthesis to form a capturing group. As we shall see later on, this allows us to extract the actual name of the referenced variable. The trailing g indicates that the pattern should be matched anywhere in the string. This is necessary because we want to replace all variable references in the input text, not just the first one.

The regular expression object is used to drive the while loop by calling its exec() method. In case of a match, the exec() returns an array that describes the text matched by the regular expression. If there's no match, exec() returns null

while ((match = re.exec(text)) !== null) {
...
}

If there is a match, we first extract the variable name:

variable_name = match[1] ? match[1] : match[2];
The first entry of the match array (at index 0) is the text that was matched by the pattern as a whole. The array contains subsequent elements for each capturing group in the regular expression. Because our regular expression re has 2 capturing groups, the match array contains two more elements. If the variable is of the form ${NAME}, the element at index=1 contains the variable name. If it's of the form %%NAME%%, it will be contained in the element at index=2.

Once we have the variable name, we can use the getVariable() javascript function to obtain its value:

variable_value = getVariable(variable_name, "");

The getVariable() is not a standard javascript function, but supplied by the kettle javascript step.

To perform the actual substitution, we take the substring of the original text up to the location where the variable reference was matched. This location is conveniently supplied by the match array:

replaced_text += text.substring(from, match.index);
Right after that location, we need to put the variable value instead of its name:

replaced_text += variable_value;
The last action in the loop is to remember to location right behind the last replaced variable reference, so we can pick up at the right location in the original value the next time we match a variable:

from = match.index + match[0].length;

Right after the loop, we need to copy the final piece of original text occurring right behind the last variable reference to yield the complete replaced text:

replaced_text += text.substring(from, text.length);

Odds and Ends


While the substitute-variables-in-parameters.ktr transformation works great for its intended purpose, substituting variables in the known parameters INPUT_DIR and OUTPUT_DIR, it is not really applicable beyond kettle cookbook. What you'd really want to have is a job that replaces variables in all parameters, not just those that are known in advance.

As it turns out this is actually almost trivial to achieve, however to solution is a bit too long-winded for this post. If anyone is interested in such a solution, please post a comment and let me know, and I'd be happy to provide it.

UPDATE: A solution that substitutes all variable references occurring in the parameter values of the containing job is now available at the kettle exchange area in the Pentaho wiki.

Thursday, December 09, 2010

Parameterizing SQL statements in the Kettle Table Input step: Variables vs Parameters

I get this question on a regular basis, so I figured I might as well blog it, in the hope it will be useful for others. Here goes:

Let's say that I want to delete all records that match an id from a set of tables. The table names come in as rows into the Execute SQL Script step (check execute for every row). Next I write:
DELETE FROM {table_name} WHERE id = {identifier}

as the SQL to execute. In the parameters grid at the bottom right, I have two fields: table_name and identifier. What is the syntax for substituting the table_name and identifier parameters in the sql script?


(Although this particular question focuses on the "Execute SQL Script" step, it also applies to the "Table Input" step, and probably a few more steps I can't recall right now.)

The parameters grid can be used for prepared statement value placeholders. In the SQL statement these placeholders are denoted as questionmarks (?). These are positional parameters: they get their value from those fields in the incoming stream that are entered in the parameters grid, in order.
Here's an example of the correct usage of these placeholders:
DELETE FROM myTable WHERE id = ?

Here, the ? in the WHERE clause will be bound to the value of the first field from the incoming stream entered in the parameters grid. Because there is only one such placeholder, there can be only one field in the parameters grid.

An important thing to realize is that these parameters can only be used to parameterize value expressions. So, this kind of parameter does not work for identifiers, nor do they work for structural elements of the SQL statement, such as keywords. So this kind of parameter cannot be used to parameterize the table name which seems to be the intention in the original example posed in the question.

There is a way to parameterize the structural elements of the SQL statement as well as the parameters. You can apply variable substitution to the SQL statetment.

Kettle Variables can be defined by a Set Variables step, or by specifying parameters at the transformation level. They get their value from "the environment": for example, parameters get their value initially when the transformation is started, and regular variables are typically set somewhere in the job that is calling your transformation.

In text fields, including the SQL textarea of the Table input step or the Execute SQL Script step, you denote those variables with this syntax: ${VARIABLE_NAME}. So to parameterize the table name we could use something like this:
DELETE FROM ${TABLE_NAME}


In order to force kettle to apply variable substitution to the SQL statement, you have to check the "variable substitution" checkbox. If this checkbox is checked, then all variables are simply substituted with their (string)value during transformation initialization. This is a lot like the way macro's are substituted by the pre-processor in C/C++ code.

When comparing variables with parameters, two important things should be mentioned here:

  • Unlike value placeholders, variables can be used to manipulate any aspect of the SQL statement, not just value expressions. The variable value will simply become the text that makes up the SQL statement, it is your responsibility it results in a syntactically valid and correct SQL statement.

  • Variables are evaluated once during transformation initalization. So if you want to vary the variable value, you'll have to call the transformation again for the change to take effect. For the same reasons, you cannot set the value of a variable and read it within the same transformation: setting the variable value occurs at runtime, but evaluating it occurs at initialization time.



Finally, here's a screenshot that summarizes these different ways to parameterize SQL statements in kettle:

If you want to read more about this topic, it's covered in both our books Pentaho Solutions and Pentaho Kettle Solutions. Another title you might be interested in is Maria Roldan's Pentaho 3.2 Data Integration: Beginner's Guide.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...