
Saturday, June 18, 2011

HPCC vs Hadoop at a glance

Update

Since this article was written, HPCC has undergone a number of significant changes and updates. These address some of the criticism voiced in this blog post, such as the license (updated from AGPL to Apache 2.0) and integration with other tools. For more information, refer to the comments by Flavio Villanustre and Azana Baksh.

The original article can be read unaltered below:

Yesterday I noticed a tweet by Andrei Savu, which prompted me to read the related GigaOM article and then check out the HPCC Systems website.

If you're too lazy to read the article or visit that website:
HPCC (High Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems. The platform is now Open Source!


HPCC Systems compares itself to Hadoop, which I think is completely justified in terms of functionality. Its product originated as a homegrown solution of LexisNexis Risk Solutions that allows its customers (banks, insurance companies, law enforcement and federal government) to quickly analyze billions of records, and as such it has been in use for a decade or so. It is now open sourced, and I have already seen an announcement that Pentaho is its major Business Intelligence partner.

Based on the limited information available, I made a quick analysis, which I emailed to the HPCC Systems CTO, Armando Escalante. My friend Jos van Dongen said it was a good analysis and told me I should post it. Now, I don't really have time to make a nice blog post out of it, but I figured it can't hurt to simply repeat what I said in my emails. So here goes:

Just going by the documentation, I see two real unique selling points in HPCC Systems as compared to Hadoop:

  • Real-time query performance (as opposed to only analytic jobs). HPCC offers two different setups, labelled Thor and Roxie. Functionality-wise, Thor should be compared to a Map/Reduce cluster like Hadoop: it's good for doing fairly long-running analyses on large volumes of data. Roxie is a different beast, designed to offer fast data access and support ad-hoc real-time queries.
  • Integrated toolset (as opposed to a hodgepodge of third-party tools). We're talking about an IDE, job monitoring, a code repository, a scheduler, a configuration manager, and whatnot. These really look like big productivity boosters, which may make Big Data processing a lot more accessible to companies that don't have the kind of development teams required to work with Hadoop.

(there may be many more benefits, but these are just the ones I could clearly distill from the press release and the website)

Especially for Business Intelligence, Roxie may be a big deal. If real-time Big Data queries can be integrated with Business Intelligence OLAP and reporting tools, then this is certainly significant. I can't disclose the details, but I have trustworthy information that integration with Pentaho's analysis engine, the Mondrian ROLAP engine, is underway and will be available as an Enterprise feature.

A few things that look different but which may not matter too much when looking at HPCC and Hadoop from a distance:
  • ECL, the "Enterprise Control Language", which is a declarative query language (as opposed to just Map/Reduce). This initially seems like a big difference, but Hadoop has tools like Pig, Sqoop and Hive. Now, it could be that ECL is vastly superior to these Hadoop tools, but my hunch is you'd have to be careful in how you position that. If you choose a head-on strategy in promoting ECL as opposed to Pig, then chances are that people will just spend their energy discovering the things that Pig can do and ECL cannot (not sure if those features actually exist, but that is what Hadoop fanboys will look for), and in addition, the Pig developers might simply clone the unique ECL features, so the leveling of that playing field will just be a matter of time. This does not mean you shouldn't promote ECL - on the contrary, if you feel it is a more productive language than Pig or any other Hadoop tool, then by all means let your customers and prospects know. Just be careful to avoid downplaying the Hadoop equivalents, because that strategy could backfire.

  • Windows support. It's really nice that HPCC Systems is available for Microsoft Windows, as that makes things a lot easier for Microsoft shops (and there are a lot of them). That said, customers that really have a big-data problem will solve it no matter what their internal software policies are, so they'd happily start running Hadoop on Linux if that solves their problem.
  • Maturity. On paper, HPCC looks more mature than Hadoop. It's hard to tell how much that matters, though, because Hadoop has all the momentum. People might choose Hadoop because they anticipate that the maturity will come thanks to the sheer number of developers committing to that platform.


The only areas where HPCC looks like it has a disadvantage as compared to Hadoop are adoption rate and licensing. I hope these will prove not to be significant hurdles for HPCC, but I think they might be bigger problems than they seem. Especially the AGPL licensing seems problematic to me.

The AGPL is not well regarded by anyone I know in the open source world. The general idea seems to be that, even more than the plain GPLv3, it restricts how the software may be used. If the goal of open sourcing HPCC is to gain mindshare and a developer community (something that Hadoop has done and is doing extremely well), then a more permissive license is really the way to go.

Look at products like MySQL, but also Pentaho: they are both very strongly corporately led products. They have a good number of users, but few contributions from outside the company, and this is probably due to a combination of GPL licensing and the additional requirement to hand over the copyright of any contributions to the company. Hence these products don't really benefit from an open source development model (or at least not as much as they could). For these companies, open source may help initially to gain a lot of users, but the majority of those are users that just want a free ride: conversion rates to enterprise edition customers are quite low. It might be enough to make a decent buck, but eventually you'll hit a cap on how far you can grow. I'm not saying this is bad - you only need to grow as much as you have to - but it is something to be aware of.

Contrast this with Hadoop. It has a permissive Apache 2.0 license, and this results in many individuals as well as companies contributing to the project. And there are still companies like Cloudera that manage to make a good living off of the services around their distribution of Hadoop. You don't lose the ability to develop add-ons with this model either - Apache 2.0 allows all that. The difference with the GPL (and AGPL), of course, is that it allows this to other users and companies as well. So the trick to staying on top in this model is to simply offer the best product (as opposed to being the sole holder of the copyright to the code).

Anyway - that is it for now - I hope this is helpful.

Tuesday, May 10, 2011

Managing kettle job configuration

Over time I've grown a habit of making a configuration file for my Kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, web service URLs and file system paths.

Variables


In the past, much has been written about using Kettle variables, parameters and arguments. Variables are the basic feature that provides the mechanism to configure transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm focusing on how to manage the configuration once you already use variable references inside your jobs and transformations.
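As a quick reminder of what such a variable reference looks like: in a database connection dialog, for example, you would enter something like the following instead of literal values (the variable names here are simply the ones from the example configuration shown later in this post):

Host Name:     ${STAGING_HOST}
Port Number:   ${STAGING_PORT}
Database Name: ${STAGING_DATABASE}
User Name:     ${STAGING_USER}
Password:      ${STAGING_PASSWORD}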

Managing configuration


To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to variables, so any subsequent jobs and transformations can access the configuration data through variable references. The main job has one parameter called ${CONFIG_DIR}, which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file.
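For example, when launching the main job from the command line with Kitchen, the caller could pass the parameter along these lines (just a sketch: the exact option syntax depends on your Kettle/PDI version, and the paths are hypothetical):

sh kitchen.sh -file=/jobs/main-job.kjb "-param:CONFIG_DIR=/etc/my-project/test"

When the main job is started from another job instead, the same parameter can be set in the job entry dialog of the calling job.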


Reading configuration properties


The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and the value is the value that should be assigned to it. The following snippet should give you an idea:
#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9
The set-variables.ktr transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:
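For the config.properties snippet above, that stream would look something like this - one row per property, with just two fields:

Key              | Value
-----------------+----------
STAGING_DATABASE | staging
STAGING_HOST     | localhost
STAGING_PORT     | 3351
STAGING_USER     | staging
STAGING_PASSWORD | $74g!n9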


Pivoting key/value pairs to use the "set variables" step


In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from selected fields in the incoming stream and assigning the field value to it. This means that you can't just feed the stream of key/value pairs from the property input step into the "Set variables" step: the stream coming out of the property input step contains multiple rows with just two fields, called "Key" and "Value". Feeding it directly into the "Set variables" step would just lead to creating two variables called Key and Value, and they would be assigned new values over and over, once for each key/value pair in the stream. So in order to meaningfully assign variables, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream, using the "Row Denormaliser" step:

As you can see in the screenshot, "Key" is the key field: the value of this field is scanned to determine in which output field to put the corresponding value. There are no fields that make up a grouping: rather, we want all key/value pairs to end up in one big row. Or, put another way, there is just one group comprising all key/value pairs. Finally, the grid below specifies, for each distinct value of the "Key" field, to which output field name it should be mapped, and in all cases we want the value of the "Value" field to be stored in those fields.
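The result is that the whole stream is collapsed into a single wide row, with one field per key. For the example configuration, that row would look roughly like this:

STAGING_DATABASE | STAGING_HOST | STAGING_PORT | STAGING_USER | STAGING_PASSWORD
-----------------+--------------+--------------+--------------+-----------------
staging          | localhost    | 3351         | staging      | $74g!n9

This single row can then be fed into the "Set variables" step, which creates one variable per field.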

Drawbacks


There are two important drawbacks to this approach:

  • The "Row Denormaliser" step uses the value of the keys to map the value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you manually need to keep the config.properties file and the "Row Denormaliser" step synchronized, and in practice it's very easy to make mistakes here.
  • Due to the fact that the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific for just one particular project.

Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variable names anyway, I was tempted to store the configuration values themselves inside the transformation as well (using a "Generate rows" or "Data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.

Solution: JavaScript


As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation now looks like this:



The actual variable assignment is done with Kettle's built-in setVariable(key, value, scope) JavaScript function. The key and value from the incoming stream are passed as arguments to the key and value arguments of the setVariable() function. The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide
  • "r" - up to the root job
  • "p" - up to the parent job of this transformation
  • "g" - up to the grandparent job of this transformation

For my purpose, I settle for "r".
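For reference, the entire script of the JavaScript step can then be as small as the following (a minimal sketch; Key and Value are the field names produced by the "Property Input" step, so check the exact capitalization in your own stream):

// For each incoming row, register the key/value pair as a Kettle variable.
// The "r" scope makes the variable visible up to the root job.
setVariable(Key, Value, "r");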

The bonus is that this set-variables.ktr is less complex than the previous one and is now even completely independent of the content of the configuration. It has become a reusable transformation that you can use over and over.

Thursday, July 05, 2007

Kettle Quick Tip: Data Driven Execution of Tasks

Currently I need to move a bit of data around. I like to use Kettle for this type of work rather than writing custom scripts for a number of reasons (which I won't discuss here).

Anyway here is a quick tip I want to share with whomever it may concern. It is not rocket science, and many people may go "duh!" but I hope it will still be useful to others.

Quite often, you need a batch task, like truncating a set of tables, deleting data, dropping constraints and so on. In Kettle, you might model this as a job. In that case, each separate action can be modelled as an entry of the job.

The following screenshot illustrates this approach:
TRUNCATE_TABLES_JOB
So here, each entry is just an SQL statement that performs exactly the task you want, and the entries are connected so that the tasks are performed sequentially. (You could also fork and start multiple entries in parallel.)

Anyway, in many cases you can do the same thing with a transformation. One of the major differences between jobs and transformations is that transformations are about data. Consequently, transformations are great if you know how to drive your task with data. In this particular case, the following transformation does exactly the same work as the job from the previous screenshot:
TRUNCATE_TABLES_TRANSFORMATION
Here, the first step fetches all the table names from the current database using the following query on the MySQL information_schema.TABLES table:

SELECT table_name
FROM information_schema.tables
WHERE table_schema = schema()

(In fact, we don't even need the information_schema to do this - the SHOW TABLES statement works just as well - Kettle doesn't care whether we use a genuine SQL statement or a SHOW statement, as long as it returns a result set.)

The table input step will now output rows containing a table_name field:
TRUNCATE_TABLES_TRANSFORMATION_STEP1
The next step can now be driven using the data from the table input step. The following screenshot may clarify this:
TRUNCATE_TABLES_TRANSFORMATION_STEP2
First, we need to check the "Execute for each row" checkbox of the SQL script step. For most steps in a transformation, this behaviour is implicit, but the SQL script step is a bit odd in this respect. The reason is that executing an SQL script is typically a task-oriented type of thing, and as such in the domain of jobs rather than transformations.

(Note that in itself, the SQL script step is not data oriented: it does not produce any data of its own, so it really is a task that just happens to be data driven. However, the incoming rows do pass through and are still available as output of the SQL script step.)

Second, we need to write our script so that it accepts the data as input. In this case, the script consists of just a TRUNCATE statement:

TRUNCATE TABLE ?

The question mark here serves as a placeholder for values received at the input of this step.

Third, we need to specify which fields from the input are to be pushed into the value placeholders in the script. In this case, we only need to specify the table_name field that stems from the output of the table input step.
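So for each incoming row, the question mark is replaced with the value of the table_name field, and one statement is executed per row. For a schema containing, say, the (hypothetical) tables customer_staging and order_staging, the step would effectively run something like:

TRUNCATE TABLE customer_staging
TRUNCATE TABLE order_staging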

That is all there is to it - I hope it was useful.
