Tuesday, October 27, 2009

Calpont opens up: InfiniDB Open Source Analytical Database (based on MySQL)

Open source business intelligence and data warehousing are on the rise!

If you kept up with the MySQL Performance Blog, you might have noticed a number of posts comparing the open source analytical databases Infobright, LucidDB, and MonetDB. LucidDB got some more news last week when Nick Goodman announced that the Dynamo Business Intelligence Corporation will be offering services around LucidDB, branding it as DynamoDB.

Now, to top if off, Calpont has just released InfiniDB, a GPLv2 open source version of its analytical database offering, which is based on the MySQL server.

So, let's take a quick look at InfiniDB. I haven't yet played around with it, but the features sure look interesting:

  • Column-oriented architecture (like all other analytical database products mentioned)

  • Transparent compression

  • Vertical and horizontal partitioning: on top of being column-oriented, data is also partitioned, potentially allowing for less IO to access data.

  • MVCC and support for high concurrency. It would be interesting to see how much benefit this gives when loading data, because this is usually one of the bottle necks for column-oriented databases

  • Support for ACID/Transactions

  • High performance bulkloader

  • No specialized hardware - InfiniDB is a pure software solution that can run on commidity hardware

  • MySQL compatible


The website sums up a few more features and benefits, but I think this covers the most important ones.

Calpont also offers a closed source enterprise edition, which differs from the open source by offering support for multi-node scale-out support. By that, they do not mean regular MySQL replication scale-out. Instead, the enterprise edition features a true distributed database architecture which allows you to divide incoming requests across a layer of so-called "user modules" (MySQL front ends) and "performance modules" (the actual workhorses that partition, retrieve and cache data). In this scenario, the user modules break the queries they recieve from client applications into pieces, and send them to one or more performance modules in a parallel fashion. The performance modules then retrieve the actual data from either their cache, or from the disk, and sends those back to the user modules which re-assemble the partial and intermediate results to the final resultset which is sent back to the client. (see picture)
shared-disk-arch-simple
Given the MySQL compatibility and otherwise similar features, I think it is fair to compare the open source InfiniDB offering to the Infobright community edition. Interesting differences are that InfiniDB supports all usual DML statements (INSERT, DELETE, UPDATE), and that InfiniDB offers the same bulkloader in both the community edition as well as the enterprise edition: Infobright community edition does not support DML, and offers a bulk loader that is less performant than the one included in its enterprise edition. I have not heard of an InfoBright multi-node option, so when comparing the enterprise edition featuresets, that seems like an advantage too in Calpont's offering.

Please understand that I am not endorsing one of these products over the other: I'm just doing a checkbox feature list comparison here. What it mostly boils down to, is that users that need an affordable analytical database now have even more choice than before. In addition, it adds a bit more competition for the vendors, and I expect them all to improve as a result of that. These are interesting times for the BI and data warehousing market :)

2 comments:

Justin Swanhart said...

Infobright and InfiniDB may both be column stores but they operate very differently.

InfiniDB takes a very proven strategy of "divide and conquer" using partitioning and a distributed hash join to do the "heavy lifting". They also appear to actually use the MySQL storage engine interface for storing data instead of using a custom loader pretending to be LOAD DATA INFILE.

Infobright packages data into packets and uses "rough set theory" to eliminate packets from inspection, and using preaggregated metadata from the "knowledge grid" to answer questions.

It never ceases to amaze me that IB's LDI doesn't even support the basic MySQL syntax for mapping column input, skipping input lines, etc.

I wonder if InfiniDB supports writing binary log files for DML? Given the FAQ on their website about Point-in-time recovery, it looks like they don't write transaction logs for bulk inserts so they probably don't write binlogs in that case either.

AFAIK Infobright's LDI doesn't populate binary logs for any calls.

Robin Schumacher said...

Justin - thanks for the comments; good observations. We don't formally support replication yet, but it's not going to be difficult for us to do so. And you're correct, our bulk loader doesn't write to the txn log, but standard DML statements are written.

When it comes to loading data, the fastest method is our cpimport utility (see FAQ on using it), but we also do LOAD DATA INFILE just fine.

Hope you guys at KF are doing well...

--Robin

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...