Comments on Roland Bouman's blog: HPCC vs Hadoop at a glance

Comment by Anonymous, 2015-08-07 05:17:

The deterrent to the widespread adoption of HPCC is that it doesn't have the Apache brand name attached to it, as Apache Hadoop does. Marketing makes a difference, even to geeks. There are quite a few technical and practical workflow deterrents to adoption as well.

The Eclipse plug-in does not provide the common features one sees for Java, C/C++, PHP, Scala, and Groovy. HPCC's own IDE isn't really an integrated development environment in the modern sense. There are no unit testing or integration testing frameworks. All of the potential speed benefits of a higher-level language are lost in the cumbersome development experience that results from the absence of these important workflow features.

The real issue is that I haven't found any real data or test set-up information for the somewhat stale yet highly touted speed comparison with Hadoop. It makes one wonder whether it was an unbiased test, since LexisNexis has no shortage of government funding and Quantlab is a financial organization, not a technology or research one. I don't see Apache refuting the speed claims, but that is probably because they do not see HPCC as a threat to their market share.
What Google is doing with BigQuery is much more competitive.

With all these other options, I doubt that many developers interested in starting a career in Big Data will take the risk of becoming ECL programmers, especially since the language is not object oriented, providing no mechanism for inheritance or polymorphism; is functional, but has few features to support actual functional development; is declarative, but lacks generics as an end-to-end language feature; does not retain state without writing to a file or embedding C code; and cannot be compiled without compiling every dependency every time.

Comment by rpbouman (Roland Bouman), 2013-01-23 16:21:

Flavio, Azana:
thanks again for the updates.

I prefixed the article with a section that explains that some of the information is outdated, and that people should take your comments into account.

Cheers, Roland

Comment by Azana Baksh, 2013-01-23 16:05:

As an update, the HPCC Systems platform now integrates with BI tools like Pentaho and R, and has Hadoop Data Integration, which allows read/write access to HDFS data repositories. A recent blog post comparing the paradigms behind HPCC and Hadoop is here: http://hpccsystems.com/blog/hpcc-systems-hadoop-%E2%80%93-contrast-paradigms

Comment by rpbouman (Roland Bouman), 2013-01-23 14:22:

Flavio, thanks for the comment!
This is really valuable information. I hope it helps HPCC get more attention.

Comment by Flavio Villanustre, 2013-01-23 14:13:

I hate to exhume old blog posts, but I think it's useful to point out that the HPCC Systems platform is currently released under an Apache 2.0 license, which should eliminate the concerns around the AGPL restrictions, fostering further adoption.

And while Impala and Drill are good steps in the right direction, they are still quite far from the advanced capabilities found in Roxie: robust and well proven in critical production environments, compound keys, multi-keys, index smart stepping, full-text searching, distributed hierarchical B-tree indices, efficient Boolean searches, real-time external data linking, a full debugger, etc.

And in Roxie there are also the benefits of the high-level, declarative ECL programming language: code/data encapsulation, purity, a higher abstraction paradigm, compilation for native execution, heavy optimization, etc.

Comment by Nagamallikarjuna, 2012-12-13 10:30:

@Mark Callaghan: Cloudera's Hadoop distribution has RTD and RTQ features with their new product, Impala. Impala has the same features Google's Dremel has. One more project from the ASF is Apache Drill, which has the features Roxie has.
These two products allow us to do real-time querying on top of a trillion rows in a few seconds.

Comment by rengolin, 2011-08-12 19:10:

Great analysis, and the comments are spot on.
I'd just be careful with the momentum analysis. In the days of NetBeans, it was the IDE with the most momentum, and no other could get near it in usage and extensibility.

However, when Eclipse came in, it was like child's play. Eclipse was robust and stable, something NetBeans never was, and that was the big difference.

Eclipse had been proprietary for years, and had all the legacy of a proprietary system, which everyone complained about at the time, but in the end it prevailed.

I'm not saying that's what's going to happen, nor that it's good that only one platform prevails. I think that having Hadoop and HPCC together is even better than having just one, whichever that is. If they can interoperate, even better!

Java didn't kill C++, Python didn't kill Perl, Linux didn't kill Windows (or Mac). They now live in *ahem* harmony. ;)

Comment by Mike, 2011-08-11 04:45:

One significant difference between the two systems is that "HPCC" (aka DAS, aka the Seisint Supercomputer) was designed as part of a hosting and search system built around a public records web site, whereas Hadoop was designed as a storage and mining system to support web crawling. The former is a distributed search engine for millions of relatively small files that can also be used for mining and analytics; the latter is a distributed mining tool for very large files.
Pentaho, BTW, does have some basic integration with Hadoop as an endpoint of their ETL process.

Comment by Steve L, 2011-06-20 14:34:

Where Hadoop is slow is in Map and Reduce task startup; that's the cost of JVM creation. Also, because it uses the HDFS filestore for all passing of data, you take a performance hit in exchange for restartability, and it's the writes that consume the most network bandwidth.

JVM startup performance could perhaps be improved by having some pooled JVMs that then load in extra classes, but there's still the memory cost. Streaming could be addressed by doing more direct process-to-process streaming. Yes, you take a hit on failures, but when you don't get a failure, your work is faster; for short-lived jobs, this may be the right tactic.

Regarding Windows support, Hadoop does it half-heartedly: although Windows is nominally supported, nobody tests it on large clusters, which are all Linux based. Some bits of the Hadoop codebase are optimised for Linux (how it spawns processes), and if more native code is written for process management, it will be focused on Linux. One nice feature of CentOS: no per-machine license, so it's good on large clusters. For small clusters where you want to do other things on the same machines, needs may be different.

Comment by Mark Callaghan, 2011-06-19 17:41:

The AGPL will prevent some companies from using it. But MongoDB uses the AGPL and they appear to be growing. Given the use of the GPL or AGPL, I assume they intend to write 99% of the code themselves, with MySQL and MongoDB as the examples.
It might take a few more years to understand whether that limits growth.

From a feature perspective, Roxie is the obvious advantage they have over Hadoop. I don't expect that to last long. When you spend a lot of money building large Hadoop/HDFS/Hive clusters, it is a shame to only get at that data via batch queries. Something like Dremel for Hadoop will soon be here.

But I am wary of comparing products via feature checklists, as that ignores quality and usability (MySQL has subqueries and views, right?). Maybe this product is rock solid and easy to use. I think that will determine whether it coexists with Hadoop.

I want to know whether HPCC clusters are as CPU-bound as Hadoop clusters. Is this really a Java versus C++ issue? Or is the problem excessive bloat in the code path? And does this CPU overhead matter, given that we tend to have too many cores on modern servers?