Tuesday, May 10, 2011

Managing kettle job configuration

Over time I've grown a habit of making a configuration file for my kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done but against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, webservice urls and file system paths.

Variables


In the past, much has been written about using kettle variables, parameters and arguments. Variables are the basic features that provide the mechanism to configure the transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm just focussing on how to manage the configuration once you already used variable references inside your your jobs and transformations.

Managing configuration


To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to the variables so any subsequent jobs and transformations can access the configration data through variable references. The main job has one parameter called ${CONFIG_DIR} which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file:


Reading configuration properties


The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and the value the appropriate value. The following snippet should give you an idea:
#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9
The set-variables.ktr transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:


Pivoting key/value pairs to use the "set variables" step


In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from selected fields in the incoming stream and assigning the field value to it. This means that you can't just feed the stream of key/value pairs from the property input step into the set variables step: the stream coming out of the property input step contains multiple rows with just two fields called "Key" and "value". Feeding it directly into the "Set variables" step would just lead to creating two variables called Key and Value, and they would be assigned values multiple times for all key/value pairs in the stream. So in order to meaningfully assign variable, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream using the "Row Denormaliser" step:

As you can see in the screenshot, "Key" is the key field: the value of this field is scanned to determine in which output fields to put the corresponding value. There are no fields that make up a grouping: rather, we want all key/value pairs to end up in one big row. Or put another way, there is just one group comprising all key/value pairs. Finally, the grid below specifies for each distinct value of the "Key" field to which output field name it should be mapped, and in all cases, we want the value of the "Value" field to be stored in those fields.

Drawbacks


There are two important drawbacks to this approach:

  • The "Row Normaliser" uses the value of the keys to map the value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you manually need to keep he config.propeties and the "Denormaliser" synchronized, and in practice it's very easy to make mistakes here.
  • Due to the fact that the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific for just one particular project.

Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variables names anyway, I was tempted to store the configration values themselves also inside the transformation (using a "generate rows" or "data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.

Solution: Javascript


As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation now looks like this:



The actual variable assignemnt is done with Kettle's built-in setVariable(key, value, scope). The key and value from the incoming stream are passed as arguments to the key and value arguments of the setVariable() function. The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide

  • "r" - up to the root

  • "p" - up to the parent job of this transormation

  • "g" - up to the grandparent job of this transormation

For my purpose, I settle for "r".

The bonus is that this set-variables.ktr is less complex than the previous one and is now even completely independent of the content of the configuration. It has become a reusable transformation that you can use over and over.

33 comments:

Anonymous said...

Very good timing :) Thanks

Rodrigo said...

Wow, this is so cool... I can see this usefull in any PDI project. Thanks for sharing

Anonymous said...

Nice, I solved this at different engagements the same way, thus it's time for a Feature Request making this easier for all of us:
http://jira.pentaho.com/browse/PDI-6184

Anonymous said...

Why do you not just use kettle.properties for hosting the database connection strings, for example. The only downside I see is in switching environments while testing in spoon.

rpbouman said...

Hi Anonymous,

storing specific configurations in kettle.properties is a perfectly valid approach to of course: Thanks for suggesting that! You'd just need to write some launchers to pass the location of the one you'd like to use, and I could certainly work with that method too.

The point of my post was twofold though: on the one hand it was about one of the many ways to configure kettle jobs, and on the other hand about how the setVariable() javascript function can sometimes be easier to use than the "Set Variables" step.

I hope that clears it up :)

kind regards,

Roland

Anonymous said...

Hi Roland,
Tried it but when I get the value at the other transform it is somewhat trimmed.
i.e property file
TempPath=c:/exp/temp

Result on the receiving transform is
c:/exp

I followed the read property file and the javascript.

Thanks.
Gerald

Sunil said...

Hi,

If I am using kettle.properties file for declaring my variables will it work if I am going to schedule the job via Quartz scheduler using xactions?

I know that if we schedule using cron jobs, the kettle.properties file in the cron job running uniix users home directory will be taken. But what will be case if I am trying with quartz/xactions or by using Spoon based GUI scheduler?

Regards,
Sunil George.

Mihai said...

Hello Roland
Can this config.properties file be located in the repository instead of a server?
Thank you

Anonymous said...

Hi Roland,

How do I use them to create a database connection in Kettle for use in the other transformations? Appreciate if you could provide an example. Thanks

rpbouman said...

@Anonymous, I think values with backslashes get trimmed. I thought the forward slash should work.

@Sunil, I think the pentaho server uses just one properties file. So that approach would not be very practical when you want to run many different trasnformations each with their own settings.

@Mihai you can store the properties wherever you like, but the property reader step expects a properties *file*. However the main point of my post is that you can dynamically set whatever data you have as variables.

rpbouman said...

@Anonymous

"How do I use them to create a database connection in Kettle for use in the other transformations?"

Simply create whatever database connection you need, and use variables rather than literal values for host, port, user and password (and whatever other options need to be set for that type of connection)

Vamshi said...

Roland,

Any quick tips to encrypt the database password?

rpbouman said...

@Vamshi, sure. You could've googled yourself too if course.

http://www.nicholasgoodman.com/bt/blog/2010/01/29/encrypt-pdi-passwords/

Anonymous said...

@Roland,

Got it I did the same thing after I posted it here. Just wanted to see if Pentaho Guru has any better suggestion.
:)

-Vamshi

Anonymous said...

Hi,
iam logu.can anybody help me plz to solve my problem ,
iam facing issue like-i scheduled 10 table, job has to run every time for each table.failure or successful loading of table will make job properties to enter log status into log table.
log status will be table name.
log table inserting job name instead of table name for some table.

for remaining table its inserting table name,

Anonymous said...

Hello Roland,

in case of JNDI connection, can it be configured inside config.properties instead of simple-jndi/jdbc.properties?

Thanx, Vana

rpbouman said...

Hi @anonymous,

I don't think you can do it that way. Consider using variables for your connection properties.

Anonymous said...


Hi Roland,

I am trying to pass the db connection info. using parameters and it's not working.
Please check the link below. Not sure, what I missed? Greatly appreciate your help.

http://forums.pentaho.com/showthread.php?206942-Making-a-DB-connection-different-for-different-environment(eg-test-prod)-errors-out

Thanks,
Raji.

rpbouman said...

Raji, how is it "not working"? Any error messages you can share?

Nitin Varghese said...

Thanks a lot Roland..

AD said...

Hi Roland
I am experiencing an issue with the database name in the Connection element in the .ktr file.I have tried many options , but i keep getting javax.naming.NameNotFoundException:...not sure what to enter there ...i am using Tomcat 8 , and this has worked on IBM WAS before ,we are trying to make it work on Tomcat now.
Please let me know if you can help . i can provide more details ...

rpbouman said...

@AD: hey man, sounds like you ran into a bug? Let pentaho know. http://wiki.pentaho.com/display/EAI/Bug+Reports+and+Feature+Requests+FAQ

Nitish Mishra said...

Hi Roland,

How can we make the path of CONFIG_DIR dynamic? We are using Pentaho Enterprise Server with all the jobs uploaded in DI repository.

Kindly provide your inputs on this?

Thanks,
Nitish

rpbouman said...

Hi @Nitish Mishra,


As I wrote:

"The main job has one parameter called ${CONFIG_DIR} which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file:"

So, they caller has complete control over the value of config dir. That seems pretty dynamic to me. Or, maybe I misunderstood your question - if that is the case, then please explain with an example what you'd like to achieve.

best regards,

Roland.

Unknown said...

Hi there Roland,

we recently started implementing a best practice of totally separating code form configuration. Traditionally we used a lot of embedded parameters in our kettle job which became very messy. Your article really helped me a lot - thank you!

rpbouman said...

Hi @Unknown,

that's really great to hear. Thanks for posting!

Roland.

Sophie said...

That's really helpful information. Thanks for sharing your thoughts.

Unknown said...

Hi

How can i access the key value pairs as individual variables into another transformation.

Are all the Key/Values in SetVariables are accesible in main job in "Get Variables"

rpbouman said...

Hi @Mohammad,

first you need to ensure that you set the variables in the right scope. This is explained in the last section of the blog post: "The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:"

You can then access the value of the variables in your subjobs or transformations that you call inside the job in the usual way:

- most forms to configure steps/job entries allow you to enter their value as a variable (see https://wiki.pentaho.com/display/EAI/.07+Variables). You can recognize these fields because they have a little dollar icon
- if you want to grab variables and have them available in your transformation as fields in a stream, then you need to use the "get variables step". (see: https://wiki.pentaho.com/display/EAI/Get+Variable)

HTH.

Anonymous said...

I have a re-usable transformation that works fine using this method of setting the config dir at run time.

Where it breaks for me, and I have not found any suggestions on the www, is when 1 database requires database connection options that another doesn't.

The difficulty is not the standard USER, PORT, DATABASE, PASSWORD, HOST variables used in the database connector's definition being stored in a kettle_properties file, but rather how to dynamically set database connection options like ssl, sslFactory and prepareThreshold set via the database connection.options page.

Any suggestions/ideas how you might resolve getting fully reusable database connector(s)?

rpbouman said...

@pdi_doordie,

this is an interesting question.

I notied that the options page in the db connection does not support variables; it seems to me this is an omission and you might consider posting a feature request for that.

In the mean while, I suggest trying to use a little bit of javascript to achieve it.
Slawomir did a pretty nice writeup on how to access database connections using javascript. Please check out his blog:

http://type-exit.org/adventures-with-open-source-bi/2010/06/accessing-kettle-internals-with-javascript/

I think you'll want this method to set options:

https://javadoc.pentaho.com/kettle800/kettle-core-8.0.0.0-6-javadoc/org/pentaho/di/core/database/DatabaseInterface.html#addExtraOption-java.lang.String-java.lang.String-java.lang.String-

What puzzles me is the first argument. I have no idea what to put in there; i suggest grepping the kettle code to see what is expected here.

Good luck! Let us know how it went.

Roland.

Anonymous said...

Thanks Roland for pointing me at an excellent starting point, and should I come up with a solution, I'll be sure to post it. Really appreciate you engaging on this thread.

Bhavya said...

A nice article Roland this will be really helpful for newcomers learning PDI. People looking for job vacancy in the sector can also learn through this article.

UI5 Tips: Persistent UI State

This tip provides a way to centrally manage UI state, and to persist it - automatically and without requiring intrusive custom code sprinkle...