Thursday, December 09, 2010

Parameterizing SQL statements in the Kettle Table Input step: Variables vs Parameters

I get this question on a regular basis, so I figured I might as well blog it, in the hope it will be useful for others. Here goes:

Let's say that I want to delete all records that match an id from a set of tables. The table names come in as rows into the Execute SQL Script step (check execute for every row). Next I write:
DELETE FROM {table_name} WHERE id = {identifier}

as the SQL to execute. In the parameters grid at the bottom right, I have two fields: table_name and identifier. What is the syntax for substituting the table_name and identifier parameters in the sql script?


(Although this particular question focuses on the "Execute SQL Script" step, it also applies to the "Table Input" step, and probably a few more steps I can't recall right now.)

The parameters grid can be used for prepared statement value placeholders. In the SQL statement, these placeholders are denoted as question marks (?). These are positional parameters: they get their values, in order, from the fields of the incoming stream that are entered in the parameters grid.
Here's an example of the correct usage of these placeholders:
DELETE FROM myTable WHERE id = ?

Here, the ? in the WHERE clause will be bound to the value of the first field from the incoming stream entered in the parameters grid. Because there is only one such placeholder, there can be only one field in the parameters grid.
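When multiple placeholders appear, binding is strictly positional. For example (the field names mentioned here are made up for illustration):

DELETE FROM myTable WHERE category = ? AND id = ?

Here, the first field entered in the parameters grid is bound to the first ?, and the second field to the second ?, regardless of what the fields are called.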

An important thing to realize is that these placeholders can only be used to parameterize value expressions. They do not work for identifiers, nor for structural elements of the SQL statement, such as keywords. So this kind of parameter cannot be used to parameterize the table name, which seems to be the intention in the original question.
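To illustrate: a statement like the following will not work, because the first placeholder appears where an identifier is expected, not a value:

DELETE FROM ? WHERE id = ?

The database will typically refuse to prepare such a statement, since a bind variable cannot stand in for a table name.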

There is a way to parameterize the structural elements of the SQL statement as well as the values: you can apply variable substitution to the SQL statement.

Kettle variables can be defined by a Set Variables step, or by specifying parameters at the transformation level. They get their value from "the environment": for example, parameters get their value when the transformation is started, and regular variables are typically set somewhere in the job that calls your transformation.

In text fields, including the SQL textarea of the Table input step or the Execute SQL Script step, you denote those variables with this syntax: ${VARIABLE_NAME}. So to parameterize the table name we could use something like this:
DELETE FROM ${TABLE_NAME}


In order to force Kettle to apply variable substitution to the SQL statement, you have to check the "variable substitution" checkbox. If this checkbox is checked, all variables are simply substituted with their (string) value during transformation initialization. This is a lot like the way macros are substituted by the preprocessor in C/C++ code.
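For example, assuming the variable TABLE_NAME holds the value myTable, the statement above reads like this after substitution:

DELETE FROM myTable

Note that this is plain text replacement, performed before the statement is sent to the database.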

When comparing variables with parameters, two important things should be mentioned here:

  • Unlike value placeholders, variables can be used to manipulate any aspect of the SQL statement, not just value expressions. The variable value simply becomes part of the text that makes up the SQL statement, and it is your responsibility to ensure it results in a syntactically valid and correct SQL statement (see the sketch after this list).

  • Variables are evaluated once, during transformation initialization. So if you want to vary the variable value, you'll have to call the transformation again for the change to take effect. For the same reason, you cannot set the value of a variable and read it within the same transformation: setting the variable value occurs at runtime, but evaluating it occurs at initialization time.
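To illustrate the first point: because substitution is plain text replacement, you are responsible for quoting whenever a variable holds a string value. A minimal sketch (the NAME variable is made up for illustration):

DELETE FROM ${TABLE_NAME} WHERE name = '${NAME}'

If TABLE_NAME holds myTable and NAME holds John, the substituted statement reads DELETE FROM myTable WHERE name = 'John'. Without the single quotes the statement would be invalid, and even with them you must take care of escaping any quotes that occur inside the value.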



Finally, here's a screenshot that summarizes these different ways to parameterize SQL statements in Kettle:

If you want to read more about this topic, it's covered in both our books Pentaho Solutions and Pentaho Kettle Solutions. Another title you might be interested in is Maria Roldan's Pentaho 3.2 Data Integration: Beginner's Guide.

13 comments:

Muthu said...

Thanks, can you please explain how to pass the connection hostname and database name as parameters?

rpbouman said...

Muthu, yes: except for the database type, all elements of Kettle database connections (host, port, credentials, etc.) can be specified as variables, and can thus be set using parameters.

Muthu said...

Can you please provide the steps and screenshots, like the previous ones, for passing db connection elements (hostname, database name, username and password) as variables?

rpbouman said...

Muthu, I can but I don't want to.

- Just create a new transformation.

- Within the transformation, create a new database connection. Use variables for the fields you want to parameterize. For example, use ${HOST} for the hostname, ${PORT} for the port number, etcetera.

- Open the transformation settings and, in the "Parameters" tab, add a parameter for each variable you used. In the parameters tab, use only the parameter name, and leave out the ${...} markers (those are only appropriate when *referencing* a variable).
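To actually supply values for those parameters, you can set them when the transformation is run. For example, when running from the command line with Pan, the call might look something like this (a minimal sketch; the file name and values are made up):

pan.sh -file=/path/to/my_transformation.ktr -param:HOST=dbserver.example.com -param:PORT=3306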

Muthu said...

Thanks man, I tried and it worked :)

rpbouman said...

Muthu, cool! Glad that worked.

SNG said...

Thank you so much! Now I know how to pass a variable into a SQL script. Oh yeah!!!

gattrinn said...

Do we have any option to pass a dynamic column in a SQL query?

rpbouman said...

@gattrinn,

"do we have any option to pass dynamic column in SQL query ?"

Yes. In all steps that accept a SQL statement in their configuration, the statement can be represented in part or in its entirety by a Kettle variable or parameter. So you can use this to dynamically set columns as well.

However, you will have to take care that the remainder of your transformation is resilient to a dynamic column make-up of the data stream as well.

Another thing you might want to look into for making your transformations more dynamic is metadata injection, which essentially lets you configure your steps dynamically (data driven): http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection

puneet said...

How do we pass a value to ? in case we want to run --> where column in ?

rpbouman said...

Hi Puneet,

I am assuming you are using the ? placeholder in a Table Input step? Those placeholders are used as JDBC bind variables. They can be bound if you configure the input step (see: https://help.pentaho.com/Documentation/8.2/Products/Data_Integration/Transformation_Step_Reference/Table_Input), and then the fields from the incoming stream will be bound by position.

I don't think you can bind a list to that. You could write something like:


SELECT * FROM t WHERE c in (?,?,?)


and then have just as many fields in your incoming stream. The drawback, apart from the positional binding, is that you need to know in advance how many values there will be in the list.

Alternatively, you might check the "Replace variables in script?" checkbox, and use variables (or, parameters at the transformation level, which are exposed as variables within the scope of the transformation).

You could for example write:

SELECT * FROM t WHERE c IN (${MY_LIST_OF_VALUES})

and then ensure that the variable MY_LIST_OF_VALUES has the string value '1,2,3'; then the resulting SQL will read:

SELECT * FROM t WHERE c IN (1,2,3)


Keep in mind that when you do this, any appearance of a variable placeholder in the SQL will be replaced with the string value of the variable - like a macro substitution. The drawback here is that you'll need to ensure yourself that the resulting statement, after variable substitution, is still valid SQL. Apart from that, within one transformation execution the variable value will always be constant - you cannot set the variable within the transformation; it has to be controlled from the calling job.
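In case you need to build such a list value dynamically, one way is to aggregate it in a preceding transformation and store it with a Set Variables step. A minimal sketch, assuming MySQL and a hypothetical ids_to_match table:

SELECT GROUP_CONCAT(id) AS MY_LIST_OF_VALUES FROM ids_to_match

You can then feed that field into a Set Variables step, and reference ${MY_LIST_OF_VALUES} in the next transformation in the job.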

HTH.

Unknown said...

I want to do it the other way around: I want to save the output of the SQL query into a transformation parameter value. Can someone please tell me how to do it?

rpbouman said...

Hi @Unknown, you asked

"wanna save the output of the SQL query into a transformation parameter value."

yes you can - use the "Set Variables" step.

https://wiki.pentaho.com/display/EAI/Set+Variables

However, the new variable value will not be visible within the same transformation that sets it. That's because all the steps are already running.

The new value will be visible from the next step in the job that started the transformation that changed the parameter value.

I hope this helps!
