Wednesday, October 15, 2008

Maturity of Open Source ETL / Why do people fail to do their homework?

I just read this post on Matt Casters' blog. Here, Matt describes why Element 61's Jan Claes is dead wrong in the way he assesses the maturity of open source ETL tools.

Well, I've just read Jan Claes' article in the "research and insights" area of the Element61 website, and frankly, it is pretty easy to see how unsubstantiated it is. Some may be tempted to classify the article as FUD, although typically 'real' FUD articles are more sophisticated and are usually not so blatantly sloppy with facts. I could get all worked up over it, but mostly it leaves me genuinely confused as to why a seemingly serious IT consultancy company would want to have their professionals post this kind of stuff on the company website and let it pass as "research" or even "insight".

Anyway, let's take a look at the article and I'll try and explain why it won't wash...
(Disclaimer - I can easily debunk Jan's article when I look at Kettle, a.k.a. Pentaho Data Integration. I do not have enough experience with the other open source ETL tools mentioned in his article, Talend and CloverETL, so it is entirely possible that the article does these tools a similar injustice).

The article starts off by explaining what ETL is, and correctly mentions that prerequisite tools and resources are generally considered expensive. Quite unsurprisingly, the article moves on to explore open source ETL tools in order to find a cost-reducing alternative. So far, so good - the notion that cost reduction is the primary motive for most businesses to start using open source software is far from controversial.

From here on the article wanders off, and starts talking about open source in general. Some sympathetic but unsubstantiated and unillustrated claims are made about the relation between "standards" and open source software.

The article then becomes more critical and downplays the viability of an open source ETL community as compared to the Linux community. According to Jan Claes, Linux is successful "...because the developers united in a 'quest' against Windows...". According to him, the ETL community lacks such a common motive. He has it that "...[ETL] is a specialist's world...", and that because of this, "...progress might be significantly slower than in the case of other -more successful- Open Source programs".

Just take a second to let that sink in. Has anybody met a Linux dev lately who was out to beat Windows? Last time I checked, Linux was meant to be a free (as in speech) UNIX system. Most Linux users I know couldn't care less about Windows, and they are certainly not 'united in a quest' to create something like it. I take it that by tagging ETL as belonging to the 'specialist's world', I am to understand that Linux kernel development is 'generalist' work.

If I am to believe Jan, everybody is so busy with their quest against Windows, doing all that 'general' hacking in C/C++ on the Linux kernel, that nobody is specialist enough to help that poor quailing open source ETL community develop, say, a MySQL bulk loader step, or a Regular Expression Matching step, or a Normalizing step. (Those who have been following the development of Kettle know that these and more have all been contributed by the community.)

To make sure he gets the point across, the article then says that "...The most advanced suppliers of Open Source ETL actually admit today that they are not really community-driven, and that each company works with their own developers...". No link. No date. No nothing. We should probably just take Jan's word for it.

Look, don't get me wrong - I don't want to deny that ETL is a specialist area of expertise. But in my opinion, this is partly because ETL tools (and BI tools in general) have been inaccessible to most people. Let me explain.

My take on it is that open source BI is now in the same boat that the open source databases were in a few years ago. While mocked and ignored by the 'real' experts and specialists, who would not deign to touch the open source 'toy' databases and would only lower themselves to work with traditional 'enterprise' databases, the open source databases caused an emancipation among application developers and web developers.

Databases are now a commodity, and quite a lot of developers know how to use one. Of course, apart from a large body of database users with basic skills, there are still 'real' experts, but some of them got there with little or none of the typical DBA background, thanks to the fact that open source databases disrupted and commoditized the market. I see the same thing happening for BI: because the tools are now accessible and available, BI is popping up where it didn't before. In stealth mode, it is becoming a commodity.

The article continues, and now starts discussing a number of products: CloverETL, Pentaho, and Talend. Jan claims that these ETL tools provide "...just enough transformations to make it a viable option".

Huh?! Say what?! Just enough?... Of course, no comparison is made with a closed source product, so we can only guess what he was expecting. But really, people: when I open Kettle 3.1, the latest GA version, I see:

  • Over 20 input steps, ranging from standard things like simple database tables, CSV files and Excel workbooks, to slightly more advanced things like XML and LDAP, to more exotic things like XBase and Web services

  • In addition to regular input steps, specialized bulk loaders for Oracle, MySQL, Postgres, Greenplum and MonetDB

  • Over 10 output steps (again ranging from standard stuff to more advanced things)

  • About 10 flavors of lookup, join and dimension maintenance steps

  • No less than 30 transformation steps, including normalization, denormalization, XSLT, XML injection

  • A bunch of flow control steps


And this is just for transformations! For Kettle jobs, there's another bucket of steps - there's just too much to mention.

The article continues and does a run-down of "when to use/when not to use" open source ETL tools, and it is in this section that we find most of the inaccuracies.

Here are the "when to use" cases according to Jan Claes:


No complex transformations

The straightforward transformations can be implemented very quickly, ... When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.


This is like, "so what"?! I mean, seriously - basically this complaint boils down to "when it's hard, it's hard". I dare anybody to come up with a single case where any tool provides an "out-of-the-box" solution for an arbitrarily complex problem. I suspect Jan has not tried, and if he did, why doesn't he mention what he tried? I mean, man, just one example of what was expected but not found - surely that is not too much to ask? Gosh...


Few data sources to connect to

Most reputed ETL-vendors provide an extensive amount of connections to all sorts of data sources. This is a problem with Open Source ETL tools: most of them are based on Java architecture and need JDBC to connect to a database.


Seriously, what crap is this?

First of all, it is completely untrue. I mean, I won't bore you with the big 5 databases, and I know you have all heard of the less well-known ones like Greenplum, InterSystems Caché and Netezza, but have you ever heard of "ExtenDB"? Or "Kingbase ES"? Or "Gupta"? No?! Well, Kettle has, and it ships support for all of them for free. Yup, that's right: no less than 32 different RDBMS systems, and for most of them there isn't just one driver but multiple drivers available - no additional installation or fee required.

Second, how does it make sense to whine about the underlying protocol or API used to connect to the data source? I have never heard anybody complain about MS SQL Server only offering ADO or ODBC connectivity, have you? In practice there is a JDBC driver available for virtually every database, and if there isn't, you can still fall back on ODBC connectivity. You are also not confined to the drivers that ship with the product: simply drop your own JDBC driver into Kettle's jdbc directory and connect away. If anything, Kettle's choice of JDBC has done everything but limit connectivity.
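To illustrate just how generic JDBC makes connectivity, here is a minimal sketch of talking to a database from Java through JDBC. The driver class, connection URL and credentials are hypothetical example values; any vendor's driver on the classpath is used through exactly the same java.sql interface, which is what Kettle builds on:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcSmokeTest {
        public static void main(String[] args) throws Exception {
            // Load the driver class (hypothetical example: PostgreSQL).
            // Swapping databases means swapping this name and the URL below.
            Class.forName("org.postgresql.Driver");

            // The URL is the only other vendor-specific part.
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
            try {
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT 1");
                while (rs.next()) {
                    System.out.println("connected: " + rs.getInt(1));
                }
            } finally {
                conn.close();
            }
        }
    }

The vendor-specific bits are two strings; everything else is the standard API. That is why dropping a new driver JAR into a directory is all it takes to reach a new database.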

Do your homework, dude. All it takes is one download and one unzip command.


Small scaled projects

Small scaled projects with one or two, very common sources and a limited number of target tables don't need an expensive and complicated ETL tool. An Open Source ETL tool can bring a cheap and easy to install solution, which even could run on a local PC.


Well, it is certainly true that you can run Kettle on a single PC with modest resource usage. But you can also think big and benefit from built-in threading to burn all the cores of your high-end server. Or you can cluster a bunch of machines and have them divide the work of a single transformation. You can run it in a cloud and use sophisticated features like parallel input, and you can use it to load shards ("database partitioning"). In fact, there are no obvious limits to the scale at which Kettle can be deployed.
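To give an idea of how lightweight the single-PC end of that spectrum is, here is a rough sketch of running a transformation in-process with the Kettle 3.x Java API. The file name demo.ktr is a hypothetical example, and the class and method names are quoted from memory of the 3.x embedding samples, so treat the details as assumptions:

    import org.pentaho.di.core.util.EnvUtil;
    import org.pentaho.di.trans.StepLoader;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            EnvUtil.environmentInit(); // read kettle.properties and friends
            StepLoader.init();         // register the built-in steps

            // Load the transformation definition (hypothetical file name).
            TransMeta transMeta = new TransMeta("demo.ktr");

            // Execute in-process; every step runs in its own thread.
            Trans trans = new Trans(transMeta);
            trans.execute(null);
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                System.err.println("Transformation finished with errors.");
            }
        }
    }

Each step running in its own thread is where the multi-core scaling comes from; clustering and partitioning are declared in the transformation itself, not in the calling code.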

So, as should be obvious by now, Jan has managed to write his "when to use" cases as if they were "when not to use" cases. The real "when not to use" cases follow:


Java & XML knowledge required for complex transformations

As all the Open Source ETL tools are Java and XML based, you need at least a basic understanding and knowledge of Java and XML to get the most out of them...[bla, bla and more bla]...the user has to develop the transformation rules, and therefore needs an advanced knowledge of Java.


This is a blatant lie. I won't deny that it is possible to muck around coding Java as a Kettle user, but technically that would make you a plugin writer. And the XML claim is completely off track - there is no requirement whatsoever for the end user to edit raw XML, ever. Really, I have just gone down the road describing the wealth of steps available to build your transformation, and really, you can trust me when I say: this is enough for about 95% of the cases, if not 99%. Who can blame the product for allowing you to write a Java plugin if you really want to?

Speaking of which, does Jan mean to imply that the "traditional" products he is writing about do not allow you to write plugins? If that is so, then it seems an easy choice for me: I'd rather have more options, not fewer. That said, I know that, for example, MS SQL Server Integration Services allows .NET plugins. How is this functionally different?
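For the curious, here is roughly what the heart of a Kettle step plugin looks like, based on my recollection of the 3.x step API - a sketch, not a complete plugin (a real one also needs a meta class and a dialog class, which is exactly why I would call its author a plugin writer rather than an end user). The upper-casing logic is just a hypothetical example:

    import org.pentaho.di.core.exception.KettleException;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;
    import org.pentaho.di.trans.step.BaseStep;
    import org.pentaho.di.trans.step.StepDataInterface;
    import org.pentaho.di.trans.step.StepInterface;
    import org.pentaho.di.trans.step.StepMeta;
    import org.pentaho.di.trans.step.StepMetaInterface;

    // The core of a step plugin: read a row, change it, pass it on.
    // init() and dispose() defaults are inherited from BaseStep.
    public class UpperCaseStep extends BaseStep implements StepInterface {

        public UpperCaseStep(StepMeta stepMeta, StepDataInterface stepDataInterface,
                int copyNr, TransMeta transMeta, Trans trans) {
            super(stepMeta, stepDataInterface, copyNr, transMeta, trans);
        }

        public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
                throws KettleException {
            Object[] row = getRow();        // fetch a row from the input stream
            if (row == null) {              // no more input: signal completion
                setOutputDone();
                return false;
            }
            // Hypothetical example logic: upper-case the first field.
            if (row[0] instanceof String) {
                row[0] = ((String) row[0]).toUpperCase();
            }
            putRow(getInputRowMeta(), row); // hand the row to the next step
            return true;                    // ask to be called again
        }
    }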


Connectivity to multiple & less common data sources required.



Heh ;) As I have argued before, this is nonsense. Elementary, Watson.


Lack of skills, knowledge & resources


Well, I guess this is one of the things I've got to give them. Clearly, certain skills, knowledge and resources are missing in Element61's outfit. Otherwise I can't explain why they'd advertise an article that is such an easy target.

Anyway, maybe you have thoughts on Open Source ETL that you want to share. If so, be my guest.

Thank you for bearing with me, and please leave a comment if you like.

6 comments:

Anonymous said...

Just read Jan's text and have to agree with you. Not sure what point he is trying to make. element61 has open source ETL in the list of skills of their consultants. At the same time Jan mentions there is little support for open source ETL, in which case it would be much smarter for him to highlight open source ETL viability and promote element61's portfolio. Anyway, great to see so much passion in the open source ETL community :-)

Anonymous said...

I had read Jan's article via Matt a while ago, but just caught your post. I agree with everything that you say. I want to point out one thing. I've been involved with data warehousing since 1996. We've NEVER had a project that didn't require custom transformations written in [usually] PL/SQL because they were beyond what Informatica could handle.

More power to KETTLE/Pentaho Data Integration, Talend and all the other ETL, EAI, EII, ESB & ReSTful OSS tools out there.

big data fanatic and bi guru said...

Open source ETL is becoming mature. In the 2011 edition of the ETL Tool Survey of Passionned Group, several open source ETL tools are evaluated. I've seen the report, and one of the best open source ETL tools is Talend Open Studio (it outscores Informatica on functionality!).

Sylnsr said...

Community support is only part of the issue, and not even a major one for people like me. As long as the ETL tools work for the most part, and are open source so that I can get in there and fix bugs on my own, I can be happy. At the end of the day, having access to the source of any ETL tool can save you months of lost time vs. using something proprietary.


John Gram said...

For simple ETL tasks I would recommend Scriptella ETL - an open source (Apache licensed!), fast and powerful ETL tool written in Java. It provides one-liner integration with Java and also supports Spring.

rpbouman said...

Hi John,

thanks for pointing me to Scriptella. Never heard of it before.

Frankly it looks pretty low-level. Maybe it has its use when you need to embed simple ETL.
