
Wednesday, January 26, 2011

NoSQL support lands in JasperSoft

JasperSoft, one of the leading open source BI suites, just announced that it is delivering connectors for a range of so-called NoSQL databases. The big names are all there: Cassandra, MongoDB, Riak, HBase, CouchDB, Neo4J, Infinispan, VoltDB and Redis.

I used to explain to people that the lack of SQL support in NoSQL databases poses a challenge for traditional Business Intelligence tools, because those all talk either SQL or MDX (and maybe some XQuery/XPath). With this development, this is no longer true, and I want to congratulate JasperSoft on spearheading this innovation.

I still have a number of reservations though. Although I personally value the ability to report on data in my NoSQL database, I think its usefulness will have a number of limitations that are worth considering.

Admittedly I am not an expert in the NoSQL database field, but as far as my knowledge goes, both the Dynamo-style key/value stores like Riak, and the Bigtable-style hashtable stores like HBase and Cassandra can basically do two types of read operations: fetch a single object by key, or scan everything. The fetched object can be complex and contain a lot of data, and it would certainly be nice if you could run a report on that. The scan everything operation doesn't seem that useful at the report level: for all but trivial cases, you need considerable logic to make this scan useful, and I don't think a report is the right place for this. Apart from that, if the NoSQL solution was put in place because of the large data volume, then the report itself would probably need to be executed on a cluster just to achieve acceptable response time. I may be wrong but I don't think JasperReports supports that.
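To make the two read paths concrete, here is a minimal conceptual sketch — not any particular product's API, just an in-memory stand-in with illustrative names:

```javascript
// Conceptual sketch of a key/value store's two read paths,
// modeled over a plain in-memory Map. All names are illustrative.
class KeyValueStore {
  constructor() { this.data = new Map(); }
  put(key, value) { this.data.set(key, value); }
  // Read path 1: fetch a single object by key -- cheap and direct.
  get(key) { return this.data.get(key); }
  // Read path 2: scan everything -- the caller must supply the logic
  // that makes the scan useful, which is why it arguably belongs in
  // a data integration job rather than in a report.
  *scan() { yield* this.data.entries(); }
}

const store = new KeyValueStore();
store.put("cust:1", { name: "Acme", country: "US" });
store.put("cust:2", { name: "Globex", country: "NL" });

console.log(store.get("cust:1").name);   // Acme

// A "useful" scan is a full pass plus caller-supplied filter logic:
const usCustomers = [...store.scan()]
  .filter(([, v]) => v.country === "US")
  .map(([key]) => key);
console.log(usCustomers);                // [ 'cust:1' ]
```

The point of the sketch is that the filtering logic lives entirely with the caller — the store itself offers no query language to push it into.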

So, for a full scan of those NoSQL databases, connectors at the data integration end seem more appropriate. I think the integration of Hadoop with Pentaho Data Integration (a.k.a. Kettle) is a step in the right direction, but of course only applicable if you're a Hadoop user.

Another point is data quality. Typically, reporting is done on a data warehouse or reporting environment where the data quality is kept in check by processing the raw data with data integration and data quality tools. Directly reporting on any operational database can be problematic because you skip those checks. Because the NoSQL databases offer virtually no constraints, those checks are even more important. So to me this seems like another reason why NoSQL connectivity is more useful in the data integration tools.

JasperSoft also offers connectivity for the MongoDB and CouchDB document stores. I think that for raw reporting on the actual source documents, the same reservations apply as I mentioned in relation to the Dynamo- and Bigtable-style solutions. But there may be a few more possibilities here, at least for CouchDB.

CouchDB has a feature called views, which allows you to "query" the raw documents using a map/reduce job. I can certainly see why it'd be useful to build a report on top of that. Of course, you would still have to implement the logic to do a useful scan, and you would still have to deal with data quality issues, but you can do it in the map/reduce job, which seems a more appropriate place to handle this than a report.
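As a sketch of what such a view looks like: a CouchDB view is a pair of JavaScript map and reduce functions. The document shape (`type`, `customer`, `total`) is made up for illustration, and the tiny harness below only simulates CouchDB's view engine (in real CouchDB, `emit` is a global provided by the engine, so the map function takes only `doc`) so the example is self-contained:

```javascript
// A CouchDB-style view: map emits (key, value) rows per document,
// reduce aggregates the values per key. Document shape is illustrative.
const map = function (doc, emit) {
  // emit() is passed in explicitly here so the function can run
  // outside CouchDB; in a real view it is provided by the engine.
  if (doc.type === "order") emit(doc.customer, doc.total);
};

const reduce = function (keys, values, rereduce) {
  // Sum order totals per customer. This also works under rereduce,
  // since partial sums are themselves summable.
  return values.reduce((a, b) => a + b, 0);
};

// Minimal stand-in for the view engine: map every document, then
// group the emitted rows by key and reduce each group.
function runView(docs) {
  const rows = [];
  docs.forEach(doc => map(doc, (k, v) => rows.push([k, v])));
  const grouped = {};
  for (const [k, v] of rows) (grouped[k] = grouped[k] || []).push(v);
  return Object.fromEntries(
    Object.entries(grouped).map(([k, vs]) => [k, reduce([k], vs, false)])
  );
}

const totals = runView([
  { type: "order", customer: "acme", total: 10 },
  { type: "order", customer: "acme", total: 5 },
  { type: "customer", name: "acme" }   // ignored by the map function
]);
console.log(totals);   // { acme: 15 }
```

Data quality logic (skipping malformed documents, normalizing values) can live in the map function, which is exactly the "more appropriate place" the paragraph above argues for.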

All in all, I think this is a promising development, and I should probably get my feet wet and try it out myself. But for now, I would recommend keeping it out of the wrecking tentacles of unaware business users :)

4 comments:

Matt Casters said...

Just a quick note to mention that adding support for the likes of MongoDB and so forth is quite easy to do in PDI/Kettle with our "User Defined Java Class" step.
It is however a bit unfortunate that there are no standards for communicating to this group of NoSQL "databases". It means you would have to do pretty much the same work over and over again for each dialect implementation of the key/value stores. Perhaps they should all agree to come up with some sort of minimal (without joins) SQL/JSON/Whatever dialect.

Or perhaps that is something we can do ourselves. Perhaps we should simply create a nosql4j interface standard. Any takers? Would anyone even care at this point?

mdahlman said...

Roland, very nice post. The largest volume of folks "commenting" on Jaspersoft's announcements simply re-tweeted it. Your comments, to the contrary, involve quite a bit more thought.

First, I'd like to correct your egregious errors. OK, that was easy: I don't see any.

But I would also like to add some precision to some of your comments and correct some items that don't seem quite right. "hashtable stores like HBase and Cassandra can basically do 2 types of read operations: fetch a single object by key, or scan everything." That's not right. For example, take a look at this Cassandra documentation: http://wiki.apache.org/cassandra/API. In particular, you can return a limited key range so that you are only getting a small subset of "rows".

"the report itself would probably need to be executed on a cluster" This isn't quite right. JasperReports Server can execute the report by sending the query to the clustered NoSQL database, but there's no need for JasperReports itself to be in the clustered environment. It's analogous to Oracle being in a cluster--there's no need for the SQL client to know about the cluster.

Clearly there will be cases when using ETL tools like Talend or Kettle will be better suited to solving the problem. Then the analysis and reporting tools can just use a traditional relational database. But I can see lots of cases where there will be big benefits to connecting directly to the NoSQL database.

"there may be a few more possibilities here, at least for CouchDB" I agree that CouchDB views are well suited to this. We are indeed using them. I'll also point out that MongoDB has at least as rich a set of capabilities for querying the data. We can run a query like this:
{
  'collectionName': 'accounts',
  'findFields': {
    'name': 1,
    'phone_office': 1,
    'billing_address_city': 1,
    'billing_address_street': 1,
    'billing_address_country': 1
  },
  'sort': {
    'billing_address_country': -1,
    'billing_address_city': 1
  },
  'findQueryRegEx': { 'name': '/^M /' }
}
The report query specifies the fields and sorting, and it also uses a regular expression to filter the data down to what is required.

Anyway, thanks for taking a look. I'll be interested to see how the Jaspersoft NoSQL connectors are received by you and others in the community.

Roland Bouman said...

Hi Matt and Matthew, thanks for leaving comments - I appreciate it :)

@Matt, yeah I think that the lack of standardization is a bit of a problem, at least for tool developers and vendors. As for a nosql4j standard, I think there are a number of challenges with that idea. NoSQL is a mixed bag with regard to data model/query language/data access API.

However, I was thinking that creating lightweight JDBC wrappers for the popular ones may be a big step forward. This may seem strange, but for many reporting and data integration tools you need to convert from the particular NoSQL flavor of data to something tabular anyway. Don't get me wrong - I am not suggesting we write SQL emulators for these beasts - we can simply use the native query language where one exists, or think of a very simple command language to control the key/scan semantics of those stores. Anyway, just an idea. I don't have a need for this myself, so I probably won't dive in in the short term.
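The "convert to something tabular" step could be sketched like this — a hypothetical projection function (the column names are made up for illustration) that turns documents from some key/value or document store into the uniform rows a JDBC-style wrapper would hand to a reporting tool:

```javascript
// Hypothetical sketch: project a fixed column list out of a set of
// documents, producing uniform rows. Missing fields become null,
// the way a SQL result set would surface NULL.
function toRows(docs, columns) {
  return docs.map(doc =>
    columns.map(col => (col in doc ? doc[col] : null))
  );
}

const docs = [
  { name: "Acme", city: "Boston", phone: "555-0100" },
  { name: "Globex", city: "Berlin" }   // missing column becomes null
];
const rows = toRows(docs, ["name", "city", "phone"]);
console.log(rows);
// [ [ 'Acme', 'Boston', '555-0100' ], [ 'Globex', 'Berlin', null ] ]
```

The hard part a real wrapper would add on top of this is the key/scan control the comment mentions: deciding which documents to fetch in the first place.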

@Matthew - thanks for the clarification! I appreciate it a lot. Yes, I agree that I may have overestimated the limitations. I should probably just download it and give it a try. My remark about running the report on a cluster was meant to imagine how it would work in case the report had to do a complete scan and filter out the interesting rows itself. So even though the database would execute the query, the report would still need to deal with all that data. But of course, with a range scan the working set may be so small that this is not a cause for concern any more.

Thanks again, and kind regards,

Roland

Greg said...
This comment has been removed by a blog administrator.