Thursday, July 05, 2007

Kettle Quick Tip: Data Driven Execution of Tasks

Currently I need to move a bit of data around. I like to use Kettle for this type of work rather than writing custom scripts for a number of reasons (which I won't discuss here).

Anyway, here is a quick tip I want to share with whomever it may concern. It is not rocket science, and many people may go "duh!", but I hope it will still be useful to others.

Quite often, you need to run a batch task, such as truncating a set of tables, deleting data, or dropping constraints. In Kettle, you might model this as a job, where each separate action is a step of the job.

The following screenshot illustrates this approach:
TRUNCATE_TABLES_JOB
So here, each step is just an SQL statement that performs exactly the task you want, and the steps are connected in order to perform the tasks sequentially. (You could also fork and start multiple steps in parallel.)

Anyway, in many cases, you can do the same thing with a transformation. One of the major differences between jobs and transformations is that transformations are about data. Consequently, transformations are great if you know how to drive your task with data. In this particular case, the following transformation does exactly the same work as the job from the previous screenshot:
TRUNCATE_TABLES_TRANSFORMATION
Here, the first step fetches all the table names from the current database using the following query on the MySQL information_schema.TABLES table:

SELECT table_name
FROM information_schema.tables
WHERE table_schema = schema()

(In fact, we don't even need the information_schema to do this - the SHOW TABLES statement works just as well. Kettle doesn't care whether we use a genuine SQL statement or a SHOW statement, as long as it returns a result set.)

The table input step will now output rows containing a table_name field:
TRUNCATE_TABLES_TRANSFORMATION_STEP1
The next step can now be driven using the data from the table input step. The following screenshot may clarify this:
TRUNCATE_TABLES_TRANSFORMATION_STEP2
First, we need to check the "Execute for each row" checkbox of the SQL script step. For most steps in a transformation, this behaviour is implicit, but the SQL script step is a bit odd in this respect. The reason is that executing an SQL script can be a task-oriented kind of thing, and as such belongs in the domain of jobs rather than transformations.

(Note that in itself, the SQL script step is not data-oriented: the script produces no data of its own, so it really is a task that just happens to be data driven. However, the input rows pass through and are still available as output of the SQL script step.)

Second, we need to write our script so that it accepts the data as input. In this case, the script consists of just a TRUNCATE statement:

TRUNCATE TABLE ?

The question mark here serves as a placeholder for values received at the input of this step.

Third, we need to specify which fields from the input are to be pushed into the value placeholders in the script. In this case, we only need to specify the table_name field that comes from the output of the table input step.
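
For readers who think more easily in code than in screenshots, here is a rough Python sketch of the same idea. It is not Kettle code and not how Kettle is implemented; it only mimics the two steps. Everything in it is an assumption made for illustration: the function names (table_input, sql_script_step, quote_identifier, run_demo) are made up, an in-memory SQLite database stands in for MySQL, sqlite_master stands in for information_schema.tables, and DELETE FROM stands in for TRUNCATE TABLE because SQLite has no TRUNCATE. It also assumes the field value is pasted into the script as text where the question mark is, which is what allows a table name to be used there.

```python
import sqlite3


def table_input(conn):
    """Stand-in for the table input step: emit one row per user table."""
    query = (
        "SELECT name AS table_name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    # Fetch everything first so the later statements do not disturb this read.
    for (table_name,) in conn.execute(query).fetchall():
        yield {"table_name": table_name}


def quote_identifier(name):
    """Quote a table name so it is safe to paste into SQL text."""
    return '"' + name.replace('"', '""') + '"'


def sql_script_step(conn, rows, script, fields):
    """Run the script once per input row, then pass the row through unchanged."""
    parts = script.split("?")
    if len(parts) - 1 != len(fields):
        raise ValueError("number of '?' placeholders and fields must match")
    for row in rows:
        # Each listed field fills the next "?" from left to right.
        statement = parts[0]
        for field, tail in zip(fields, parts[1:]):
            statement += quote_identifier(row[field]) + tail
        conn.execute(statement)
        yield row


def run_demo():
    """Build two small tables, empty them row by row, and report the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER)")
    conn.execute("CREATE TABLE orders (id INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?)", [(1,)])
    conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])

    rows = table_input(conn)
    passed = list(sql_script_step(conn, rows, "DELETE FROM ?", ["table_name"]))

    counts = {}
    for row in passed:
        sql = "SELECT COUNT(*) FROM " + quote_identifier(row["table_name"])
        counts[row["table_name"]] = conn.execute(sql).fetchone()[0]
    conn.close()
    return passed, counts


if __name__ == "__main__":
    passed_rows, row_counts = run_demo()
    print(passed_rows)
    print(row_counts)
```

The point of the sketch is the shape, not the details: one step produces rows, the next step runs its statement once for each row, and the rows come out the other end untouched. Adding a table to the database changes what gets emptied without touching the script, which is exactly what makes the transformation nicer than the hand-built job.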

That is all there is to it - I hope it was useful.
