Thursday, July 05, 2007

Kettle Quick Tip: Data Driven Execution of Tasks

Currently I need to move a bit of data around. I like to use Kettle for this type of work rather than writing custom scripts for a number of reasons (which I won't discuss here).

Anyway, here is a quick tip I want to share with whomever it may concern. It is not rocket science, and many people may go "duh!", but I hope it will still be useful to others.

Quite often, you need a batch task: truncating a set of tables, deleting data, dropping constraints, and so on. In Kettle, you might model this as a job. In that case, each separate action can be modelled as a step of the job.

The following screenshot illustrates this approach:
(screenshot: TRUNCATE_TABLES_JOB)
So here, each step is just an SQL statement that performs exactly the task you want, and the steps are connected so that the tasks are performed sequentially. (You could also fork and start multiple steps in parallel.)
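
For illustration, each of those steps might contain nothing more than a single statement along these lines (the table names are hypothetical examples, not the ones from my actual job):

-- hypothetical staging tables, one statement per job step
TRUNCATE TABLE stg_customer
TRUNCATE TABLE stg_order
TRUNCATE TABLE stg_order_line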

Anyway, in many cases you can do the same thing with a transformation. One of the major differences between jobs and transformations is that transformations are about data. And consequently, transformations are great if you know how to drive your task with data. In this particular case, the following transformation does exactly the same work as the job from the previous screenshot:
(screenshot: TRUNCATE_TABLES_TRANSFORMATION)
Here, the first step fetches all the table names from the current database using the following query on the MySQL information_schema.TABLES table:

SELECT table_name
FROM information_schema.tables
WHERE table_schema = schema()

(In fact, we don't even need the information_schema to do this - the SHOW TABLES statement works just as well. Kettle doesn't care whether we use a genuine SQL statement or a SHOW statement, as long as it returns a result set.)
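
If you do stick with the information_schema query, it is also easy to restrict the set of tables that will be truncated. The LIKE filter below is just a hypothetical refinement (assuming your staging tables share a stg_ prefix), not part of the original example:

SELECT table_name
FROM information_schema.tables
WHERE table_schema = schema()
AND table_name LIKE 'stg\_%'  -- hypothetical: only pick up the staging tables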

The table input step will now output rows containing a table_name field:
(screenshot: TRUNCATE_TABLES_TRANSFORMATION_STEP1)
The next step can now be driven using the data from the table input step. The following screenshot may clarify this:
(screenshot: TRUNCATE_TABLES_TRANSFORMATION_STEP2)
First, we need to check the "Execute for each row" checkbox of the SQL script step. For most steps in a transformation, this property is implicit, but the SQL script step is a bit odd in this respect. The reason is that executing a SQL script is often a task-oriented kind of thing, and as such it belongs in the domain of jobs rather than transformations.

(Note that in itself, the SQL script step is not data oriented: it does not generate any data of its own, so it really is a task that just happens to be data driven. However, the input rows do pass through and are still available as output of the SQL script step.)

Second, we need to write our script so that it accepts the data as input. In this case, the script consists of just a TRUNCATE statement:

TRUNCATE TABLE ?

The question mark here serves as a placeholder for values received at the input of this step.

Third, we need to specify which fields from the input are to be pushed into the placeholders in the script. In this case, we only need to specify the table_name field that comes from the output of the table input step.
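
To make the substitution concrete: if the table input step returns rows with the table_name values stg_customer and stg_order (the same hypothetical names as before), the SQL script step effectively runs one statement per incoming row:

TRUNCATE TABLE stg_customer
TRUNCATE TABLE stg_order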

That is all there is to it - I hope it was useful.

5 comments:

TP said...

Dear Roland

That was a useful tip for beginners like myself.

Hope you will continue to show us more of these when you get the chance.

Thanks.
Tau

Johan said...

I think this is an excellent example of how we can use visual tools nowadays to do this kind of work. Kettle has done a great job, it looks very similar to how Informatica or Microsoft SSIS would do the job. However, your first remark intrigues me, because somehow there are lots of programmers that still put this stuff into stored procedures, build giant loads of code, with cursors etc. Some of those programmers will use Kettle or SSIS as the control mechanism, but some programmers don't even do that. But, I personally think that the future is in the visual way of programming.

rpbouman said...

Hi Tau,

glad the tip was useful for you ;)

If you liked that tip, my introduction to kettle might be useful for you too:

http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html

Johan,

Thanks for your comments ;)

Yes. I must admit I usually have some reservations about visual programming languages. But so far Kettle has never let me down. It is very intuitive, and it is very easy to be productive with it. The UI is really fast and easy to use - even large diagrams don't stall or block when fooling around.

Personally, I think Kettle can easily withstand a comparison with MS SSIS and Informatica. In fact, I think I can say without exaggeration that kettle's database partitioning and ETL execution clustering are more advanced than what MS SSIS and Informatica have to offer. And of course, in addition, Kettle is free - free as in speech as well as free as in beer. (LGPL).

Regarding my "first remark": do you mean the one about writing scripts? Well, I don't want to be all black and white about it.

I think stored procedures do have their place for data processing. So if all data is on the same database instance and easy to reach, the advantage of a stored procedure is that you are not dependent upon an external component, and that you keep the code and database together (backups, migration to new host etc).

But indeed, for many developers and programmers, you get the hammer/nail thing. Because they are familiar and comfortable with Stored Procedures, they want to do everything with them. Others have the same with scripts.

Personally, I think dedicated ETL tools are a good choice when the data comes from a remote source, and when you need proper logging. Another thing which I have found is that it really is easier to keep an overview of the loading process if you are not constantly digging into the nitty gritty details of the actual implementation of data transfer. At least that is true for me.

Another thing with a tool like kettle is that often, it will perform better than your manual code. If you set up your transformations right, it will use multiple connections simultaneously and work with them in parallel to get the job done. This leads to higher throughput.

Code you write yourself is usually not designed from the start to do that. So when you are done writing it, you may find that it is actually not possible to parallelize things without a major overhaul, whereas this is typically easy to achieve with Kettle.

Anyway - I have a few more tips for kettle - damn where is the time to write about it ;)

Unknown said...

Yes Roland. Great little tip!

Pratap Gangula said...

Hi Roland

I have worked on Kettle as well as SSIS. I feel Kettle is still a better tool and much easier to use. Informatica is the best of the lot but considering the price constraints Kettle is definitely a better deal.

Thanks for your tips.
