Saturday, June 18, 2011

HPCC vs Hadoop at a glance


Since this article was written, HPCC has undergone a number of significant changes and updates. These address some of the criticism voiced in this blog post, such as the license (updated from AGPL to Apache 2.0) and integration with other tools. For more information, refer to the comments placed by Flavio Villanustre and Azana Baksh.

The original article can be read unaltered below:

Yesterday I noticed a tweet by Andrei Savu. This prompted me to read the related GigaOM article and then check out the HPCC Systems website.

If you're too lazy to read the article or visit that website:
HPCC (High Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems. The platform is now Open Source!

HPCC Systems compares itself to Hadoop, which I think is completely justified in terms of functionality. Its product originated as a homegrown solution of LexisNexis Risk Solutions, allowing its customers (banks, insurance companies, law enforcement and federal government) to quickly analyze billions of records, and as such it has been in use for a decade or so. It is now open sourced, and I already heard an announcement that Pentaho is its major Business Intelligence Partner.

Based on the limited information, I made a quick analysis, which I emailed to the HPCC Systems CTO, Armando Escalante. My friend Jos van Dongen said it was a good analysis and told me I should post it. Now, I don't really have time to make a nice blog post out of it, but I figured it can't hurt to just repeat what I said in my emails. So here goes:

Just going by the documentation, I see two real unique selling points in HPCC Systems as compared to Hadoop:

  • Real-time query performance (as opposed to only analytic jobs). HPCC offers two different setups, labelled Thor and Roxie. Functionality-wise, Thor should be compared to a Map/Reduce cluster like Hadoop: it's good for doing fairly long running analyses on large volumes of data. Roxie is a different beast, designed to offer fast data access and to support ad-hoc real-time queries.
  • Integrated toolset (as opposed to a hodgepodge of third party tools). We're talking about an IDE, job monitoring, code repository, scheduler, configuration manager, and whatnot. These really look like big productivity boosters, which may make Big Data processing a lot more accessible to companies that don't have the kind of development teams required to work with Hadoop.

(there may be many more benefits, but these are just the ones I could clearly distill from the press release and the website)
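
To make the Thor/Roxie distinction a bit more concrete, here is a minimal Python sketch of the two access patterns. It is not ECL and does not use any HPCC or Hadoop API; the data set and all function names are made up purely for illustration. The batch function scans every record (the Thor or Map/Reduce style of work), while the lookup function answers a single-key question from an index that was built beforehand (the kind of access Roxie is described as being designed for).

```python
from collections import defaultdict

# Hypothetical toy data set: (customer_id, amount) records.
RECORDS = [(i % 1000, i * 0.5) for i in range(100_000)]


def batch_total_per_customer(records):
    """Batch-style job (illustrative only): scan every record once."""
    totals = defaultdict(float)
    for customer_id, amount in records:
        totals[customer_id] += amount
    return dict(totals)


def build_index(records):
    """One-off index build: customer_id -> list of amounts."""
    index = defaultdict(list)
    for customer_id, amount in records:
        index[customer_id].append(amount)
    return dict(index)


def lookup_total(index, customer_id):
    """Query-style access (illustrative only): touch only one key's rows."""
    return sum(index.get(customer_id, []))


totals = batch_total_per_customer(RECORDS)  # full scan, answers all keys
index = build_index(RECORDS)                # built once, ahead of query time
one_customer = lookup_total(index, 42)      # cheap per-query work afterwards
```

Both paths give the same answer for any one customer; the difference is how much data has to be touched at query time, which is what matters for interactive use.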

Especially for Business Intelligence, Roxie may be a big thing. If real-time Big Data queries could be integrated with Business Intelligence OLAP and reporting tools, then this is certainly a big thing. I can't disclose the details, but I have trustworthy information that integration with Pentaho's Analysis Engine, the Mondrian ROLAP engine, is underway and will be available as an Enterprise feature.

A few things that look different but which may not matter too much when looking at HPCC and Hadoop from a distance:
  • ECL, the "Enterprise Control Language", which is a declarative query language (as opposed to just Map/Reduce). This initially seems like a big difference, but Hadoop has tools like Pig and Sqoop and Hive. Now, it could be that ECL is vastly superior to these Hadoop tools, but my hunch is you'd have to be careful in how you position that. If you choose a head-on strategy in promoting ECL as opposed to Pig, then the chances are that people will just spend their energy in discovering the things that Pig can do and ECL cannot (not sure if those features actually exist, but that is what Hadoop fanboys will look for). In addition, the Pig developers might simply clone the unique ECL features, and the leveling of that playing field will just be a matter of time. This does not mean you shouldn't promote ECL - on the contrary, if you feel it is a more productive language than Pig or any other Hadoop tool, then by all means let your customers and prospects know. Just be careful and avoid downplaying the Hadoop equivalents, because that strategy could backfire.

  • Windows support. It's really nice that HPCC Systems is available for Microsoft Windows; it makes adoption a lot easier for Microsoft shops (and there are a lot of them). That said, customers that really have a big-data problem will solve it no matter what their internal software policies are. So they'd happily start running Hadoop on Linux if that solves their problems.
  • Maturity. On paper HPCC looks more mature than Hadoop. It's hard to tell how much that matters though, because Hadoop has all the momentum. People might choose Hadoop because they anticipate that the maturity will come thanks to the sheer number of developers committing to that platform.

The only areas I can think of where HPCC looks like it has a disadvantage as compared to Hadoop are adoption rate and licensing. I hope these will prove not to be significant hurdles for HPCC, but I think that these might be bigger problems than they seem. Especially the AGPL licensing seems problematic to me.

The AGPL is not well regarded by anyone I know - not in the open source world. The general idea seems to be that even more than plain GPL3 it restricts how the software may be used. If the goal of open sourcing HPCC is to gain mindshare and a developer community (something that hadoop has done and is doing extremely well) then a more permissive license is really the way to go.

If you look at products like MySQL but also Pentaho - they are both very strongly corporately led products. They have a good number of users, but few contributions from outside the company, and this is probably due to a combination of GPL licensing and the additional requirement for handing over the copyright of any contributions to the company. Hence these products don't really benefit from an open source development model (or at least not as much as they could). For these companies, open source may help initially to gain a lot of users, but the majority of those are users that just want a free ride: conversion rates to enterprise edition customers are quite low. It might be enough to make a decent buck, but eventually you'll hit a cap on how far you can grow. I'm not saying this is bad - you only need to grow as much as you have to, but it is something to be aware of.

Contrast this to Hadoop. It has a permissive Apache 2.0 license, and this results in many individuals but also companies contributing to the project. And there are still companies like Cloudera that manage to make a good living off of the services around their distribution of Hadoop. You don't lose the ability to develop add-ons either with this model - Apache 2.0 allows all that. The difference with GPL (and AGPL) of course is that it allows this also to other users and companies. So the trick to stay on top in this model is to simply offer the best product (as opposed to being the sole holder of the copyright to the code).

Anyway - that is it for now - I hope this is helpful.


Mark Callaghan said...

The AGPL will prevent some companies from using it. But MongoDB uses AGPL and they appear to be growing. Given the use of GPL or AGPL, I assume they intend to write 99% of the code using MySQL and MongoDB as examples. It might take a few more years to understand whether that limits growth.

From a feature perspective, Roxie is the obvious advantage they have over Hadoop. I don't expect that to last long. When you spend a lot of money building large Hadoop/HDFS/Hive clusters it is a shame to only get at that data via batch queries. Something like Dremel for Hadoop will soon be here.

But I am wary of comparing products via feature check lists as that ignores quality and usability (MySQL has subqueries and views, right?). Maybe this product is rock solid and easy to use. I think that will determine whether it coexists with Hadoop.

I want to know whether HPCC clusters are as CPU bound as Hadoop clusters. Is this really a Java versus C++ issue? Or is the problem excessive bloat in the code path? And does this CPU overhead matter, given that we tend to have too many cores on modern servers?

Steve L said...

Where Hadoop is slow is in Map and Reduce startup; that's the cost of JVM creation. Also, because it uses the HDFS filestore for all passing of data, you take a performance hit in exchange for restartability, and it's writes that consume the most net bandwidth.

JVM performance could maybe be improved by having some pooled JVMs that then load in extra classes, but there's still the memory cost. Streaming could be addressed by doing some more direct process-to-process streaming. Yes, you take a hit on failures, but when you don't get a failure, your work is faster. For short-lived work, this may be the right tactic.

Regarding Windows support, Hadoop does it half-heartedly, because although it's nominally supported, nobody tests it on large clusters, which are all Linux based. Some bits of the Hadoop codebase are optimised for Linux (how it spawns code), and if more native code is written for process management, it will be focused on Linux. One nice feature about CentOS: no per-machine license, so it's good on large clusters. For small clusters where you want to do other things on the same machines, needs may be different.

Mike said...

One significant difference between the two systems is that "HPCC" (aka DAS, aka Seisint Supercomputer) is designed as part of a hosting and search system built around a public records web site, whereas Hadoop is designed as a storage and mining system to support web crawling. The former is a distributed search engine for millions of relatively small files that can be used for mining and analytics; the latter is a distributed mining tool for very large files.

Pentaho, BTW, does have some basic integration with Hadoop as an endpoint of their ETL process.

rengolin said...

Great analysis, and the comments are spot on.

I'd just be careful on the momentum analysis. In the days of NetBeans, it was the one with most momentum, and no other IDE could get near it in usage and extensibility.

However, when Eclipse came in, it was like child's play. Eclipse was robust and stable, something that NetBeans never was, and that was the big difference.

Eclipse had been private for years, and had all the legacy of a private system, which every one complained about at the time, but in the end, it prevailed.

I'm not saying that's what's going to happen, nor that it's good that only one platform prevail. I think that having Hadoop and HPCC together is even better than having just one, whatever that is. If they can interoperate, even better!

Java didn't kill C++, Python didn't kill Perl, Linux didn't kill Windows (or Mac). They now live in *ahem* harmony. ;)

Nagamallikarjuna said...

@Mark Callaghan: Cloudera's Hadoop distribution has RTD and RTQ features with its new product, Impala. Impala has features similar to those of Google Dremel. Another project, from the ASF, is Apache Drill, which has features similar to those of Roxie. These two products allow us to do real-time querying on top of 1 trillion rows in a few seconds...

Flavio Villanustre said...

I hate to exhume old blog posts, but I think it's useful to point out that the HPCC Systems platform is currently released under an Apache 2.0 license, which should eliminate the concerns around the AGPL restrictions, fostering further adoption.

And while Impala and Drill are good steps in the right direction, they are still quite far from the advanced capabilities found in Roxie: robust and well proven in critical production environments, compound keys, multi-keys, index smart stepping, full text searching, distributed hierarchical b-tree indices, efficient boolean searches, real-time external data linking, full debugger, etc.

And, in Roxie, there are also the benefits of the high level and declarative ECL programming language: code/data encapsulation, purity, higher abstraction paradigm, compiled for native execution, highly optimized, etc.

rpbouman said...

Flavio, thanks for the comment!

This is really valuable information. I hope it helps HPCC get more attention.

Azana Baksh said...

As an update, the HPCC Systems platform now integrates with BI tools like Pentaho and R and has Hadoop Data Integration which allows read/write access to HDFS data repositories. A recent blog was posted comparing the paradigms behind HPCC and Hadoop here:

rpbouman said...

Flavio, Azana:

thanks again for the updates.

I prefixed the article with a section that explains some information is outdated, and that people should take your comments into account.

Cheers, Roland

Anonymous said...

The deterrent to the widespread adoption of HPCC is that it doesn't have the brand name Apache attached to it as does Apache Hadoop. Marketing makes a difference, even to geeks. There are quite a few technical and practical work flow deterrents to adoption as well.

The Eclipse plug-in does not provide the common features one sees with Java, C/C++, PHP, Scala, and Groovy. HPCC's own IDE isn't really an integrated development environment in the modern sense. There are no unit testing or integration testing frameworks. All of the potential speed benefits of a higher level language are lost in the cumbersome aspect of development without these important work flow features.

The real core is that I haven't found any real data or test set-up information for the somewhat stale yet highly touted speed comparison with hadoop. It makes one wonder whether it was an unbiased test, since LexisNexis has no shortage of government funding and Quantlab is a financial organization, not a technology or research one. I don't see Apache refuting the speed claims, but that is probably because they do not see HPCC as a threat to their market share. What Google is doing with BigQuery is much more competitive.

With all these other options, I doubt whether many developers interested in starting a career in Big Data will take the risk with becoming an ECL programmer, especially since the language is not object oriented, providing no mechanism for inheritance or polymorphism, is functional but has few features to support actual functional development, is declarative but lacks generics as an end-to-end language feature, does not retain state without writing to a file or embedding C code, and cannot be compiled without compiling every dependency every time.
