Percona Live MySQL User's Conference, San Francisco, April 10-12th, 2012 Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Friday, October 31, 2008

The Dutch are making money with open source software

I just want to take a moment to congratulate Stani Michiels, the creator of Stani's Python IDE.

Stani has won a contest to create the Architecture Fiver, a commemorative Dutch 5 euro coin:



The front side shows the portrait of Queen Beatrix emerging from a spiral of the names of Dutch architects, ordered by the number of internet hits. The back side shows a number of books (authored by Dutch architects), rising up as if they are buildings, their roofs forming the contours of a map of the Netherlands.

I think this is a truly creative and beautiful design, and I think the prize is well-deserved.

An extra tidbit that interests me is that the design was created exclusively by Open Source and Free software tools (predominantly Python). So literally, this is money made by open source software. Read all about it on his blog: SPE IDE - Stani's Python Editor: How to make money with free software....

Stani, kudos to you - when I get my hands on one of these I'll keep it for display. You can be sure the image of the coin will make its way to many presentations on open source business.

Read all about it on Stani's blog

Friday, October 24, 2008

What kind of life is that?!

Over a year ago, I wrote about a little bit of irony offered by the runtime error messages emitted from the Windows Vista ultimate testdrive site (which I just checked again, this time with IE6, and which now chokes on another bug).

Anyway, I just found another gem in this category. While searching for MS Visual Studio Express, I bumped into an ad for the "Defy All Challenges" campaign on the Dutch Microsoft site. The site shows a picture of a pirate captain that is titled "A Developer's Life for me", and below it is some text that invites other brave developers to "Defy all challenges" with Microsoft Visual Studio 2008. I took a snapshot for posterity:

whoops

Unfortunately, as soon as you visit the site (I'm using IE 6.0 on Windows XP SP2 right now), you get a messagebox informing you that a javascript runtime error has occurred ("Object Expected"). So I'm afraid it's a case of "A bug's life" rather than "A Developer's life".

Wednesday, October 15, 2008

Maturity of Open Source ETL / Why do people fail to do their homework?

I just read this post on Matt Casters' blog. Here, Matt describes why Element 61's Jan Claes is dead wrong in the way he assesses the maturity of open source ETL tools.

Well, I've just read Jan Claes' article in the "research and insights" area of the Element61 website, and frankly, it is pretty easy to see how unsubstantiated it is. Some may be tempted to classify the article as FUD, although typically 'real' FUD articles are more sophisticated and are usually not so blatantly sloppy with facts. I could get all worked up over it, but mostly it leaves me with a genuine confusion as to why a seemingly serious IT consultancy company would want to have their professionals post this kind of stuff on the company website and let it pass as "research" or even "insight".

Anyway, let's take a look at the article and I'll try and explain why it won't wash...
(Disclaimer - I can easily debunk Jan's article when I look at Kettle, a.k.a. Pentaho Data Integration. I do not have enough experience with the other open source ETL tools mentioned in his article, Talend and CloverETL, so it is entirely possible that the article does these tools a similar injustice).

The article starts off by explaining what ETL is, and correctly mentions that prerequisite tools and resources are generally considered expensive. Quite unsurprisingly, the article moves on to explore open source ETL tools in order to find a cost-reducing alternative. So far, so good - the notion that cost reduction is the primary motive for most businesses to start using open source software is far from controversial.

From here on the article wanders off, and starts talking about open source in general. Some sympathetic but unsubstantiated and unillustrated claims are made about the relation between "standards" and open source software.

The article then becomes more critical and downplays the viability of an ETL open source community as compared to the Linux community. According to Jan Claes, Linux is successful ...because the developers united in a “quest” against Windows.... According to him, a common motive is lacking in the ETL community. He has it that ...[ETL] is a specialist's world..., and because of that, ...progress might be significantly slower than in the case of other –more successful- Open Source programs.

Just take a second to let it sink in. Anybody met a Linux dev lately that was out to beat Windows? Last time I checked, Linux was meant to be a free (as in speech) UNIX system. Most Linux users I know couldn't care less about Windows, and they are certainly not 'united in a quest' to create something like it. I take it that by tagging ETL as belonging to the 'specialist's world', I am to understand that Linux kernel development is 'generalistic'.

If I am to believe Jan, everybody is so busy with their quest against Windows, doing all that 'general' hacking in C/C++ on the Linux kernel, that nobody is specialist enough to help that poor quailing open source ETL community develop, say, a MySQL bulk loader step, or a Regular Expression Matching step, or a Normalizing step. (Those that have been following the development of Kettle know that these and more were all contributed by the community).

To make sure he gets the point across, the article then says that ...The most advanced suppliers of Open Source ETL actually admit today that they are not really community-driven, and that each company works with their own developers.... No link. No date. No nothing, we should probably just take Jan's word for it.

Look, don't get me wrong - I don't want to deny that ETL is a specialist area of expertise. But in my opinion, this is in part due to the fact that ETL (and BI tools in general) have been inaccessible to most people. Let me explain that.

My take on it is that open source BI is now in the same boat the open source databases were in a few years ago. While mocked and ignored by the 'real' experts and specialists, who would not deign to touch the open source 'toy' databases and would only stoop to work with traditional 'enterprise' databases, the open source databases caused an emancipation among application developers and web developers.

Databases are now a commodity, and there are quite a lot of developers that know how to use one. Of course, apart from a large body of database users with basic skills, there are still 'real' experts, but some of them have become experts with little or no typical DBA background, thanks to the fact that open source databases disrupted the market and commoditized it. I see the same thing happening for BI: because the tools are now accessible and available, BI is popping up where it didn't before. In stealth mode, it is becoming a commodity.

The article continues, and now starts discussing a number of products: CloverETL, Pentaho, and Talend. Jan claims that these ETL tools provide ...just enough transformations to make it a viable option.

Huh?! Say what?! Just enough?... Of course, no comparison is made with a closed source product, so we can only guess what he was expecting. But really people, when I open Kettle 3.1, the latest GA version, I see:

  • Over 20 input steps, ranging from standard things like simple database tables, csv files and excel workbooks to slightly more advanced things like XML and LDAP to more exotic things like XBase and Web services

  • In addition to regular input steps, specialized bulk loaders for Oracle, MySQL, Postgres, Greenplum and MonetDB

  • Over 10 output steps (again ranging from standard stuff to more advanced things)

  • About 10 flavors of lookup, join and dimension maintenance steps

  • No less than 30 transformation steps, including normalization, denormalization, XSLT, XML injection

  • A bunch of flow control steps


And this is just for transformations! For Kettle jobs, there's another bucket of steps - there's just too much to mention.

The article continues and does a run-down of "when to use/when not to use" open source ETL tools, and it is in this section where we see most of the inaccuracies.

Here are the "when to use" cases according to Jan Claes:


No complex transformations

The straightforward transformations can be implemented very quickly, ... When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.


This is like, "so what"?! I mean, seriously - basically this complaint can be simplified to "when it's hard, it's hard". I dare anybody to come up with a single case where any tool provides an "out-of-the-box-solution" for an arbitrarily complex problem. I suspect Jan has not tried, and if he did, why doesn't he mention what he tried? I mean, man, just one example of what was expected but not found, surely that is not too much to ask? Gosh...


Few data sources to connect to

Most reputed ETL-vendors provide an extensive amount of connections to all sorts of data sources. This is a problem with Open Source ETL tools: most of them are based on Java architecture and need JDBC to connect to a database.


Seriously, what crap is this?

First of all, it is completely untrue. I mean, I won't bore you with the big 5 databases, and I know you have all heard of the less well known ones like Greenplum, Intersystems Cache and Netezza, but have you ever heard of "ExtenDB"? Or "Kingbase ES"? Or "Gupta"? No?! Well, Kettle has, and it has all of this shipped for free. Yup, that's right, No less than 32 different RDBMS systems, and for most of them, there isn't just one driver, but multiple drivers available - no additional installation or fee required.

Second, how does it make sense to whine about the underlying protocol or API used to connect to the data source? I have never heard anybody complain about MS SQL only offering AdoDB or ODBC connectivity, have you? In practice there is a JDBC driver available for every database, and if not you can still use ODBC connectivity. You are also not confined to the predelivered drivers: you can simply drop your own drivers in the Kettle jdbc directory and connect to other databases. If anything, Kettle's choice for JDBC seems to have done everything but limit connectivity.

Do your homework, dude. All it takes is one download and one unzip command.


Small scaled projects

Small scaled projects with one or two, very common sources and a limited number of target tables don't need an expensive and complicated ETL tool. An Open Source ETL tool can bring a cheap and easy to install solution, which even could run on a local PC.


Well, it is certainly true that you can run Kettle on a single PC with modest resource usage. But you can also think big, and benefit from built-in threading to burn all the cores of your high-end server. Or you can cluster a bunch of machines and have them divide the work to perform a single transformation. You can run it in a cloud and use sophisticated features like parallel input, and you can use it to load shards ("database partitioning"). In fact, there are no obvious limitations to Kettle's deployment scale.
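To make the "divide the work" idea a bit more concrete, here is a minimal sketch in Python of partitioned, parallel row processing. It is not Kettle code and does not reflect how Kettle implements clustering or partitioning; the function names (`partition_rows`, `transform`, `run_partitioned`) and the sample rows are made up for illustration, and the worker threads merely stand in for the separate cores or machines a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def partition_rows(rows, key, n_partitions):
    """Assign each row to a partition by hashing its key column."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        index = crc32(str(row[key]).encode("utf-8")) % n_partitions
        partitions[index].append(row)
    return partitions


def transform(row):
    """Hypothetical per-row transformation: upper-case the name field."""
    return {**row, "name": row["name"].upper()}


def run_partitioned(rows, key, n_partitions):
    """Transform every partition in its own worker, then merge the results."""
    partitions = partition_rows(rows, key, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(lambda part: [transform(r) for r in part], partitions)
    return [row for part in results for row in part]


# Hypothetical sample data: ten customer rows split over three partitions
rows = [{"id": i, "name": f"customer{i}"} for i in range(10)]
out = run_partitioned(rows, key="id", n_partitions=3)
```

Because a row's partition depends only on its key, the same key always lands in the same partition, which is the property that makes loading shards in parallel safe.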

So, as it should be obvious by now, Jan has managed to write his "when to use cases" as if they are "when not to use" cases. The real "when not to use" cases follow:


Java & XML knowledge required for complex transformations

As all the Open Source ETL tools are Java and XML based, you need at least a basic understanding and knowledge of Java and XML to get the most out of them...[bla, bla and more bla]...the user has to develop the transformation rules, and therefore needs an advanced knowledge of Java.


This is a blatant lie. I won't deny that it is possible to muck around coding Java as a Kettle user, but technically you would be a plugin writer. And the XML claim is completely off track - there is no requirement whatsoever for the end user to edit raw XML, ever. Really, I have just gone down the road describing the wealth of steps available to build your transformation, and really, you can trust me when I say: this is enough for about 95% of the cases, if not 99%. Who can blame the product for allowing you to write a Java plugin if you really want to?

Speaking of which, does Jan mean to imply that the "traditional" products he is writing about do not allow you to write plugins? If that is so, then it seems an easy choice for me. I'd rather have more options, not fewer. That said, I know that for example MS SQL IS allows .NET plugins. How is this functionally different?


Connectivity to multiple & less common data sources required.



Heh ;) As I have argued before, this is nonsense....Elementary, Watson.


Lack of skills, knowledge & resources


Well, I guess this is one of the things I got to give them. Clearly, certain skills, knowledge and resources are missing in Element61's outfit. Otherwise I can't explain why they'd be advertising such an easy target article.

Anyway, maybe you have thoughts on Open Source ETL that you want to share. If so, be my guest.

Thank you for bearing with me, and please leave a comment if necessary.

Saturday, October 04, 2008

Less is More - Happy Eee PC-ing

Ok - I admit - it was impulse more than utility that made me buy this Asus Eee PC 900. People that know me can confirm that I am not a gadgety type of guy, so I realize it may seem a bit strange that I would sucker into buying trendy hardware.

Well, the secret is: there is no secret. The Eee PC is not a gadget - it is a completely functional laptop that is extremely suitable for traveling. It's got 1G of RAM, and 16G of flash storage onboard. A 900MHz Intel may sound slow, but I haven't really noticed that it is; in fact, performance feels pretty peppy. The machine boots *fast* and shuts down even faster. It's light, and small, but still has a keyboard and display that are large enough to work on. The battery lasts me about 2.5 to 3 hours - I will try and find another battery. It cost 300 euros - this price was exactly right for me. If it had been more expensive I probably would not've bought it - even if it'd been a beefier machine.

The Eee PC 900 I bought comes with Xandros Linux. I feel they really did a remarkable job of creating a desktop Linux that is attractive, clear, easy to use and unwreckable by the non-technical user. The same model is available with Windows XP too, but in order to keep the same price they took out RAM and disk. Please dudes and dudettes - do yourself a favor and don't buy that one. Really, the pre-installed Linux is pretty darn good, and considering the clipped hardware specs, I frankly can't see how the same machine with Windows XP could still pass as anything else but a joke. Yeah, I'm biased - trust me, I'm right ;)

If you have some experience with Linux, it's still easy to set it up so you can use a full KDE desktop.

I used to work on my notebook in packed trains and in cramped planes, but I think I will use the Eee PC instead - it really makes my life easier.

Thursday, October 02, 2008

MySQL 5.1 Cluster Certification Guide online Preview!

Wow!

I just read Martin "MC" Brown's blog announcing that a number of chapters are now available online for preview.

Personally, I think this is a good move. I think the online chapters provide a nice insight into what the book is like. I can assure you that there is lots and lots more to see in the other chapters. I can honestly and without reservation recommend this book to everybody that wants to read a recent reference book on MySQL Cluster, and to those that want to pass the 5.1 Cluster DBA certification exam in particular.

Now it must be noted that I am a co-author for this book, and author of the MySQL 5.1 Cluster DBA exam so my recommendations are hardly objective. However, I don't receive any royalties for the book (or for the exams) - my contributions were just part of my time working with MySQL AB. So, when I say that MC Brown is right when he says that:
...there is only one guide that you need, the MySQL 5.1 Cluster DBA Certification Study Guide
then I mean that sincerely - this is the book you'll want. Pick up your copy here.