Sunday, March 14, 2010

Writing another book: Pentaho Kettle Solutions

Last year, at about this time of the year, I was well involved in the process of writing the book "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have none but happy memories of the collaboration with my full co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Tomas Morgner, several of the Pentaho Developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews on places like Amazon, and elsewhere on the web. I'd like to use this opportunity to thank everybody who took the time to review the book: Thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enormously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"

In the autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and well-renowned publisher in the field of data warehousing and business intelligence, voiced interest in another Pentaho book by Jos and me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, Chief of Data Integration at Pentaho and founding developer of Kettle (a.k.a. Pentaho Data Integration). Neither Jos nor I expected Matt to be able to free up any time in his ever busy schedule to help us write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing since, focusing on the forthcoming Kettle version, Kettle 4.0. The tentative title of the book is Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. It is planned to be published in September 2010, and it will have approximately 750 pages.

Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we have finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservation in releasing that already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution

  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes

  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting

  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallelization and Partitioning

    • Dynamic Clustering in the Cloud

    • Realtime and Streaming data

  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle

  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems

Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.


Unknown said...

Same here Schone, same here.

andtorg said...

Judging from your previous work, this is bound to be another must-have to deliver bi solutions with Pentaho. I wish you all the best for it.

BTW, after Tolkien's and Kimball's, should I make room in my library for another trilogy?

Shlomi Noach said...

Hi Roland,

This sounds great. I'm sure it will be a very good read!

Vincent Teyssier said...

For the first time in my life, I can't wait for September to come!

Alex Meadows said...

Looking forward to this one as much as the last. I know it will be just as helpful!

Dan said...

That realtime chapter looks well worth a read! Will be ordering this too!

Richard said...

Looks awesome (as previous comment stated). Spend time on the Webservices and Web APIs section. More and more external data, like geo-location and government statistics, are enhancing the internal corporate data. The key to releasing this potential, as you know, is proper data integration. Show us how to do that properly!

Also...another plug...Realtime and Streaming are hot! Beyond performance considerations, show us how to use Kettle to sift for the nuggets in a HUGE data stream.

Unknown said...

Richard, that chapter is already done actually.
It shows you how to create, configure and monitor never-ending-streaming-real-time transformations.
I left looking for the nuggets as an exercise to the reader. :-)

John Dzilvelis said...

Great News! I'm especially looking forward to the section on Performance and Scaling.

Unknown said...

Excellent news! A must-have for me!

Fábio de Salles said...

You're kidding! First 4.0 (a big WOW for you ALL!) and now this! People are already saving money here for it! (Fábio, Brasil).

rpbouman said...

Hi Fabio!

hehe, nope, we're not kidding - we're well underway delivering the chapters.

Sometimes I find it hard to believe myself, but in about a month and a half, we'll be done writing, then off to process the technical reviews.

If all goes well, the book will be published according to schedule, somewhere in September 2010.

Anyway - thanks a lot for your support - it is really good to hear you're looking forward to it. I hope we'll manage to deliver a book that meets or even exceeds your expectations, but frankly, with Matt being part of the author team, I think we should be able to go a long way.

kind regards,


Anonymous said...

oh this is nice news. i'm ready! i have no idea what i'm doing in etl but it's interesting and i'm getting the idea what kettle does and what it could do.

rpbouman said...

Hi Anonymous,

thanks for your interest! Perhaps I should mention that "Pentaho Kettle Solutions" will be geared primarily at experienced ETL developers that want to learn how to use Kettle.

You might be interested in getting a copy of "Pentaho Solutions" first. This book is more general and geared more towards beginning BI developers. Nevertheless, it has pretty deep coverage of ETL, with 3 chapters (about 100 pages) devoted entirely to ETL with Kettle.

"Pentaho Solutions" is already available and has gotten some pretty good reviews. Just look at amazon:

Kind regards,


Francisco, Chile said...

I already want it! Hope everything works fine!

Renato Back said...

Hello Roland,
Great to hear you guys are writing a new book on Kettle. Looking forward to it!
All I have to ask you is a small little favor:
PLEASE hire new proofreaders!
The first book is full of typos, contradictions, missing and misleading information.
I don't mean to be harsh or evil, just would like to point that out.
Renato back

rpbouman said...

Hi Renato,

glad to see you're interested in the new book.

"Pentaho Solutions" was reviewed by a team of three different technical reviewers, edited by professional copy editors from Wiley, and then proofread again by us. We did spot a few errors (mostly typos) after publication, but I would expect that in any first edition of a book this size, and frankly my impression is that everybody did a pretty good job on that.

Even though you're the first one to point out that the book is "full of typos, contradictions, missing and misleading information" I do take this very seriously. Would you be able to let us know exactly which issues you encountered, and at what page they occur? Please send it to me, Jos or Wiley.

You might have earned yourself a free copy of Kettle Solutions with this information. :)

Renato Back said...

Hi Roland,
Thanks for the quick reply.

I'll start working on it as soon as I can.

Once again, I apologize if it sounded offensive at all. I do understand that the first edition might have a few errors.

Have yourself a nice weekend,

Renato Back

rpbouman said...

Hi Renato!

cool, I'm looking forward to receiving that. Just send it to

Don't apologize - you're not offensive at all. If there are errors, we need to correct them. Thanks again,

Roland Bouman.

Anonymous said...

Hopefully v4 will fix the rather involved process needed to iterate through lists of files without creating a job that calls an xForm for filenames, then another job, etc. The forums don't get this right; in any case, hopefully the "Pentaho 3.2 Data Integration: Beginner's Guide" will get it right...

Anonymous said...

Maybe acknowledge Postgres in there somewhere...

rpbouman said...

Hi Anonymous, not sure if last two messages are by the same person, but here goes:

@Anonymous #1: What do you mean exactly? Input steps that handle files all support regular expressions for specifying files/directories. If the pattern happens to match multiple files, they are all processed, one by one. So no need to build a Job to pick them up one by one.

@Anonymous #2: Yeah, PostgreSQL is a great product. But "Pentaho Kettle Solutions" is a database agnostic product. We have a few samples based on MySQL, because there happens to be more of it. But the samples will run on all JDBC databases.

Anonymous said...

AFA @Anonymous #1 is concerned:
I do realize the regex support for filenames but I have a different requirement. Let's say that at certain time intervals I need to scan a directory for files. More files can come in randomly. So I need to capture that set of files and do what I need to do (xform and move to archive) just on those files. If more files come in then the next scan will get them. Using regex in xforms will not work when more files that qualify come in before the xform is complete and get inadvertently moved.

rpbouman said...

@Anonymous #1: ok, I see. But still, you don't need a complex job like you described to do that. Just quickly thinking about this problem, I can see at least two ways that seem an adequate solution to me:

1) use a regular expression to have the text input step read all files in the directory. Be sure to add the file name to the stream (use "Include filename in output" flag in the content tab of the file input step). After the file processing pipeline, use something (a step like group by, or analytic query, or even javascript) to identify the last row coming out of each file. Then use a "process files" step (in the utility folder) to move the processed file out of the directory.

Of course, this would change the situation somewhat as you now have two directories, one for the input files, and one for the processed files. But if you think about it that isn't such a bad situation: keeping only a single input directory is going to lead to a problem at some point, as the directory will just keep filling up with files.
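
Outside of Kettle, the idea behind option 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the transformation itself: the function name `process_and_archive`, the `handle` callback and the directory names are made up for the example. It takes one snapshot of the files whose names match a regular expression, processes each file completely, and only then moves it to a separate directory, so files that arrive during a run are simply left for the next scan.

```python
import re
import shutil
import tempfile
from pathlib import Path


def process_and_archive(inbox, archive, pattern, handle):
    """Process each file in `inbox` whose name matches `pattern`, then move it to `archive`."""
    regex = re.compile(pattern)
    # One snapshot of the matching files; anything arriving later waits for the next scan.
    batch = sorted(p for p in inbox.iterdir() if p.is_file() and regex.fullmatch(p.name))
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in batch:
        handle(path)  # stands in for the file processing pipeline
        # Move the file only after it has been handled completely.
        shutil.move(str(path), str(archive / path.name))
        moved.append(path.name)
    return moved


def demo_archive():
    """Run one scan against a throwaway directory (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        inbox = Path(tmp) / "in"
        archive = Path(tmp) / "processed"
        inbox.mkdir()
        for name in ("a.csv", "b.csv", "notes.txt"):
            (inbox / name).write_text("x\n")
        moved = process_and_archive(inbox, archive, r".*\.csv", lambda p: p.read_text())
        left_behind = sorted(p.name for p in inbox.iterdir())
        return moved, left_behind
```

In this sketch the file that does not match the pattern stays in the input directory, and the two matching files end up in the directory for processed files.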

2) suppose solution #1 is not what you want, and you really want to hang on to your single directory that receives all files, then you can use a "get filenames" input step and compare the contents of the directory against filenames you stored in a control file using a "merge diff" step. You can use a switch/case step to take the appropriate path according to the value of the diff flag field: if the file from the dir is identical to what you found in the control file, the file was already processed and you do nothing. If the file from the dir does not occur in your file, it is new and it must be processed. Then the only thing you need to do is to append the newly processed file name to the control file.
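
The control-file logic of option 2 can also be sketched outside of Kettle. The Python below is a minimal illustration, assuming a plain text control file with one processed file name per line; the names `new_files`, `scan_once` and `handle` are hypothetical and not part of Kettle. The set difference plays the role of the comparison between the directory listing and the control file: names already in the control file are skipped, names that are missing from it are treated as new.

```python
import tempfile
from pathlib import Path


def new_files(inbox, control_file):
    """Return the names in `inbox` that are not yet listed in the control file."""
    seen = set(control_file.read_text().splitlines()) if control_file.exists() else set()
    present = sorted(p.name for p in inbox.iterdir() if p.is_file())
    return [name for name in present if name not in seen]


def scan_once(inbox, control_file, handle):
    """Process the new files and append each name to the control file afterwards."""
    todo = new_files(inbox, control_file)
    for name in todo:
        handle(inbox / name)  # stands in for the actual processing
        with control_file.open("a") as fh:
            fh.write(name + "\n")
    return todo


def demo_control_file():
    """Three scans over a throwaway directory (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        inbox = Path(tmp) / "in"
        inbox.mkdir()
        # The control file is kept outside the scanned directory.
        control = Path(tmp) / "processed.txt"
        (inbox / "a.csv").write_text("1\n")
        first = scan_once(inbox, control, lambda p: p.read_text())
        (inbox / "b.csv").write_text("2\n")
        second = scan_once(inbox, control, lambda p: p.read_text())
        third = scan_once(inbox, control, lambda p: p.read_text())
        return first, second, third
```

The files never leave the directory in this variant; only the control file grows, which is why it has to live somewhere other than the directory being scanned.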

Here's a picture of a transformation like that:

So, I hope this helps. Maybe you should do yourself a favor, and buy our book - I am sure you will find a lot of things in there that will help you be more productive with Kettle :)

Anonymous said...

From @Anonymous #1:

Thank you greatly for your insight. The first choice is the right one and I'll try to refactor all the five jobs into a single transform if possible. If so I'll be really happy...

In any case I will buy your new book as soon as the ink dries. I just wish it were here now!

rpbouman said...

@Anonymous #1: Cool, I'm glad this was helpful!

kind regards,


Anonymous said...

I hope you are able to include some specific information on ETL from DB2 tables. Also, a recommended reading list for those further than beginning but not yet expert.

This book is a welcome addition to your previous work.

Anonymous said...

From @Anonymous #1:

Please write a chapter that replaces perl with kettle!!!!

I hope that in the new book you do a complete section on file watching/moving/copying since that is a topic that is not ***CLEAR*** from the docs/forums. Worse than unclear, the solutions often do not work.

If you need use cases, I can supply them in droves!

Anonymous said...

From @Anonymous #1:

Make sure the book includes a section on Text File Input Error handling. In 3.2 there seems to be a problem getting all the messages sent to the files that describe the errors.

I assume that when a bad row comes in there should be a way to remove it from the stream but move it to a bad rows file?

rpbouman said...

Hi Anonymous #1!

Did you try configuring the "Error handling" tab page in the text file input step?

Anonymous said...

From @Anonymous #1:

Got it working but I was getting confused since those other "Error" and "Warning" folders were not getting filled as expected.

What I did was filter on an error count of zero and send those to a success process and then send the others to an error file. I do wish I could capture the malformed input row verbatim and write that out along with the error info thereby splitting the file out.

I cannot pass the bad data out since it causes the same format errors when it gets written back out to the same fields. Maybe we should get the row buffer somehow?
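
To illustrate the idea of keeping the malformed row verbatim, here is a minimal sketch in plain Python rather than Kettle. It assumes a hypothetical two-field `id;amount` layout, and the function name `split_rows` is made up for the example. The point is only that the rejected line is carried along as one untyped string next to the error message, so writing it out cannot trigger the same format errors again.

```python
def split_rows(lines):
    """Split raw lines into parsed rows and rejected rows.

    Rejected rows keep the line number, the error text and the raw line verbatim.
    """
    good, bad = [], []
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.rstrip("\n")
        try:
            # Hypothetical layout: an integer id and a decimal amount, separated by ';'.
            id_text, amount_text = raw.split(";")
            good.append((int(id_text), float(amount_text)))
        except ValueError as err:
            # Keep the original text as a single string instead of typed fields.
            bad.append((line_no, str(err), raw))
    return good, bad
```

Calling `split_rows` on the lines of a file gives one list for the success path and one list that can be written to a bad rows file as is.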

Caio Moreno de Souza said...

Really good!!!
Thanks a lot.

Anonymous said...

From Anonymous #1:

Will there be anything concerning the use of "Custom data sources", like Hibernate within Kettle or BI? I know I read somewhere that this is possible but I cannot find it in the forums. I'd like to use a special data source that has security built in...

okmich said...

I am glad this one is coming out. What is more gladdening is that it is gonna be complementary to Pentaho 3.2 Data Integration Beginner's Guide from Packt Publishing.

When I saw the title on amazon, I first thought it was gonna be a "repetition" of the one from Packt. But it is wow to see an outline so packed full of wonderful topics.

I am practically 2 months old in using Pentaho and so in love with it. My entire goal is to perfect what has been done, then I will quickly want to actively become a contributor to the community. I am working on trying out some deployment for some small size firms.

There is no end to possibility, I can see. This book has an entry in my budget.

Unknown said...

nice blog you have,,,

Vishwesh said...

awesome book and nice blog...

