Wednesday, January 26, 2011

NoSQL support lands in JasperSoft

JasperSoft, one of the leading open source BI suites, just announced that it is delivering connectors for a range of so-called NoSQL databases. The big names are all there: Cassandra, MongoDB, Riak, HBase, CouchDB, Neo4J, Infinispan, VoltDB and Redis.

I used to explain to people that the lack of SQL support in NoSQL databases poses a challenge for traditional Business Intelligence tools, because those all talk either SQL or MDX (and maybe some XQuery/XPath). With this development, that is no longer true, and I want to congratulate JasperSoft on spearheading this innovation.

I still have a number of reservations, though. Although I personally value the ability to report on data in my NoSQL database, I think its usefulness will have a number of limitations that are worth considering.

Admittedly I am not an expert in the NoSQL database field, but as far as my knowledge goes, both the Dynamo-style key/value stores like Riak, and the Bigtable-style hashtable stores like HBase and Cassandra can basically do two types of read operations: fetch a single object by key, or scan everything. The fetched object can be complex and contain a lot of data, and it would certainly be nice if you could run a report on that. The scan-everything operation doesn't seem that useful at the report level: for all but trivial cases, you need considerable logic to make such a scan useful, and I don't think a report is the right place for this. Apart from that, if the NoSQL solution was put in place because of the large data volume, then the report itself would probably need to be executed on a cluster just to achieve acceptable response time. I may be wrong, but I don't think JasperReports supports that.

So, for a full scan of those NoSQL databases, connectors at the data integration end seem more appropriate. I think the integration of Hadoop with Pentaho Data Integration (a.k.a. Kettle) is a step in the right direction, but of course that is only applicable if you're a Hadoop user.

Another point is data quality. Typically, reporting is done on a data warehouse or reporting environment where the data quality is kept in check by processing the raw data with data integration and data quality tools. Directly reporting on any operational database can be problematic because you skip those checks. Because the NoSQL databases offer virtually no constraints, those checks are even more important. So to me this seems like another reason why NoSQL connectivity is more useful in the data integration tools.

JasperSoft also offers connectivity for the MongoDB and CouchDB document stores. I think that for raw reporting on the actual source documents, the same reservations apply as I mentioned in relation to the Dynamo and Bigtable style solutions. But there may be a few more possibilities here, at least for CouchDB.

CouchDB has a feature called views, which allows you to "query" the raw documents using a map/reduce job. I can certainly see why it'd be useful to build a report on top of that. Of course, you would still have to implement the logic to do a useful scan, and you would still have to deal with data quality issues, but you can do it in the map/reduce job, which seems a more appropriate place to handle this than a report.
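To make that a bit more concrete, here's a rough sketch of what such a view could look like. The design document, the "orders" database and its fields are all made up for the sake of illustration; only the emit() call and the map/reduce structure are CouchDB's own:

//map: run once for every document in a (hypothetical) orders database;
//emit() is provided by CouchDB's JavaScript view server
function(doc) {
  if (doc.type === 'order') {
    emit(doc.customer_id, doc.total);
  }
}

//reduce: aggregate the emitted totals, so the numbers a report needs are
//computed inside the database rather than inside the report
function(keys, values, rereduce) {
  var sum = 0;
  for (var i = 0; i < values.length; i++) {
    sum += values[i];
  }
  return sum;
}

The point is that the aggregation logic lives in the view; a report built on top of it only has to render numbers the database has already prepared.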

All in all, I think this is a promising development, and I should probably get my feet wet and try it out myself. But for now, I would recommend keeping it out of the wrecking tentacles of unaware business users :)

Friday, January 07, 2011

MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part III

This is the third article in a series providing background information to my talk for the MySQL User's conference, entitled MQL-to-SQL: a JSON-based Query Language for RDBMS Access from AJAX Applications.

In the first installment, I introduced Freebase, an open shared database of the world's knowledge, and its JSON-based query language, the Metaweb Query Language (MQL, pronounced Mickle). In addition, I discussed the JSON data format, its syntax, its relationship with the de-facto standard client-side browser scripting language JavaScript, and its increasing relevance for modern AJAX-based web applications.

The second installment provides a brief introduction to MQL as a database query language, and how it compares to the de-facto standard query language for relational database systems (RDBMS), the Structured Query Language (SQL). I argued that MQL has some advantages over SQL, in particular for programming modern web applications. I mentioned the following reasons:
  • Since MQL is JSON, and JSON is JavaScript, it's a more natural fit for modern AJAX applications
  • MQL is almost trivial to parse, making it much easier to write tools such as editors, but also to implement advanced authorization policies
  • MQL is easy to generate: the structure of MQL queries is mirrored by their results. A fragment of the result can be easily augmented with a subquery making it easy to subsequently drill down into the retrieved dataset
  • MQL is more declarative than SQL. Both attributes and relationships are represented as JSON object properties and one need not and cannot specify join conditions to combine data from different types of objects. The "cannot" is actually A Good Thing because it means one cannot make any mistakes.
  • MQL is more focused on just the data. It largely lacks functions to transform data retrieved from the database, forcing application developers to do data processing in the application or middleware layer, not in the database.
In this article, I want to discuss common practices in realizing data access for applications, especially web applications, and how database query languages like SQL and MQL fit in there.

Web Application Data Access Practices

After reading the introduction of this article, one might get the idea that I hate relational databases and SQL. I don't, I love them both! It's just that when developing database applications, especially for the web, SQL isn't helping much. Or rather, it's just one tiny cog in a massive clockwork that has to be set up again and again. Let me explain...

The Data Access Problem

It just happens to be the case that I'm an application developer. Typically, I develop rich internet and intranet applications and somewhere along the line, a database is involved for storing and retrieving data. So, I need to put "something" in place that allows the user to interact with a database via the web browser.
The way I put it, needing "something" so the user can interact with the database sounds like just one innocent little thing. In reality, "something" becomes a whole bunch of things that need to work together:
  • Browsers don't speak database wire protocols - they speak HTTP to back-end HTTP servers. No matter what, there is going to be some part that is accessible via HTTP that knows how to contact the database server. Examples of solutions to this part of the problem are Common Gateway Interface (CGI) programs, server-side scripting languages like PHP or Perl (which are often themselves implemented as a CGI program) or in the case of Java, specialized Servlet classes
  • The component at the HTTP server that mediates between the web browser and the database is going to require a protocol: a set of rules that determine how a URI, an HTTP method and parameters can lead to executing a database command. Examples of approaches to designing such a protocol are Remote Procedure Calls (RPC) and Representational State Transfer (REST)
  • There's the way back too: the application running in the web browser is going to have to understand the results coming back from the database. For data, a choice has to be made for a particular data exchange format, typically the eXtensible Markup Language (XML) or JSON
  • The user interface at the browser end needs to understand not only the protocol we invented for the data exchange; ideally it should also guide the user and be able to validate whatever data the user is going to feed it. In other words, the user interface needs to have the metadata concerning the data exchange interface.

The Webservice Solution

A typical way to tackle the data access problem is:
  • analyze the functionality of the application, categorizing it into a series of clear and isolated actions
  • identify which data flows from application to database and back for each action
  • decide on a data representation, and a scheme to identify actions and parameters
  • create one or more programs in Java, Perl, PHP, Python, Ruby or whatever fits your stack that can execute the appropriate tasks and process the associated data flows to implement the actions
For web applications, the program or programs developed in this way are typically webservices that run as part of the HTTP server process. The client gets to do an HTTP request, to a particular URI, using the right HTTP method, and the right parameters. The service gets to process the parameters, execute tasks such as accessing a database, and finally send back an HTTP response, which typically contains the data requested by the application.
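To make this concrete, here's a minimal sketch of one such action, written as a server-side JavaScript (Node.js) handler. The URL, the action and the getElementByName() lookup are made up for the purpose of illustration; in a typical stack this would more likely be a PHP script or a Java servlet in front of a real database:

//minimal sketch of an action-specific webservice (hypothetical names throughout)
//one URL per action: the client requests /element?name=Oxygen and gets JSON back
var http = require('http');
var url = require('url');

//stand-in for a real database lookup
function getElementByName(name, callback) {
  var fakeDb = {"Oxygen": {name: "Oxygen", symbol: "O", atomic_number: 8}};
  callback(null, fakeDb[name] || null);
}

http.createServer(function(request, response) {
  var parsed = url.parse(request.url, true);
  if (request.method === 'GET' && parsed.pathname === '/element') {
    getElementByName(parsed.query.name, function(err, element) {
      response.writeHead(200, {'Content-Type': 'application/json'});
      response.end(JSON.stringify(element));
    });
  } else {
    //any functionality that wasn't anticipated requires another action like the one above
    response.writeHead(404);
    response.end();
  }
}).listen(8080);

The last branch is the crux: every new piece of functionality the front-end needs translates into another hand-written action on the back-end.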

Development Dynamics

The problem with the webservice approach is that it isn't very flexible. It presumes the application's functionality and hence the actions are quite well-defined. Although this looks reasonable on paper, in reality development tends to be quite evolutionary.

Typically, the core functionality of applications is quite well defined, but often a lot of additional functionality is required. Although we can pretend this could all be known in advance if only we'd spend more time designing and planning up front, in practice it often isn't. It may seem sad, but in many cases, the best way to find out is simply to start developing, and find out along the way. Agile is the latest buzzword that captures some of these development dynamics, but there have been other buzzwords for it in the past, such as RAD (rapid application development) and DSDM (dynamic systems development method).

The problem with this approach is that it requires a lot of going back-and-forth between front- and back-end development tasks: whenever the front-end wants to develop a new feature, it is often dependent upon the back-end offering a service for it. Front-end and back-end developers often are not the same people, so what we get is front-end development cycles having to wait on back-end development cycles to complete. Or in case front-end and back-end developers are the same person, they are constantly switching between tool sets and development environments.

In part this is because front-end development is usually done in JavaScript, and although server-side JavaScript is gaining ground, the server side is still dominated mainly by Java, PHP, C++ and ASP.NET. But it's not just a programming language problem - developing a client, and especially a front-end for end users, requires a very different mindset than developing a back-end process. Front-end development should focus on usability and a quality user experience; back-end development should focus on robustness, reliability, availability, scalability and performance. Although some of these aspects influence each other, in practice, front-end development is simply a different cup of tea than back-end development.

A Simple Plan: Building a Query Service

There is a very simple solution that would largely solve the data access problem without the inefficiencies of the recurring development process: if we could build a single service that can accept any parameters, understand them, and somehow return an appropriate result, we would never have to add functionality to the service itself. Instead, the front-end application would somehow construct the right parameters to tell the service what it wants, whenever the need arises.

This sounds almost like magic, right? So we must be kidding, right? Well, we're not kidding, and it's not magic either; it's more like a cheap parlour trick.
As the title of this section suggests, a query service fits this bill.

It would be very easy to build a single service that accepts a query as a parameter, and returns its result as the response. And seriously, it's not that strange an idea: many people use between one and perhaps ten or twenty different services exactly like this every day, multiple times... and it's called a search engine.

Can't we use something like that to solve our database access problem? Well, we could. But actually, someone beat you to it already.

DBSlayer, a Webservice for SQL Queries

A couple of years ago, The New York Times released DBSlayer. DBSlayer (DataBase accesS layer) is best described as an HTTP server that acts as a database proxy. It accepts regular SQL queries via a parameter in a regular HTTP GET request, and sends back an HTTP response that contains the resulting data as JSON. It currently supports only MySQL databases, but announcements were made that support was planned for other database products too. DBSlayer is actually a bit more than just a database access layer, as it also supports simple failover and round-robin request distribution, which can be used to scale out database requests. But I mention it here because it implements exactly the kind of query service that would appear to solve all the aforementioned problems.

Or would it?

Every web developer and every database administrator should realize immediately that it's not a good idea. At least, not for internet-facing applications anyway. The DBSlayer developers documented that themselves quite clearly:
Access to the DBSlayer can be controlled via firewalls; the DBSlayer should never be exposed to the outside world.
... and ...
The account DBSlayer uses to access the MySQL database should not be allowed to execute dangerous operations like dropping tables or deleting rows. Ideally, the account would only be able to run selects and/or certain stored procedures.
So there's the rub: it may be very easy and convenient from the application development point of view, but it is a horrendous idea when you think about security. A general purpose SQL query service is simply too powerful.

If a web application accidentally allows arbitrary SQL to be executed, it would be called an SQL injection vulnerability, and it would be (or at least, should be) treated as a major breach of security. Creating a service that offers exactly that behavior as a feature doesn't lessen the security concerns at all.

What about a MQL query service

In this article I tried to explain the problems that must be solved in order to arrange and manage data access for web applications. The key message is that we need to create a service that provides data access. But in doing so, we have to balance between security, functionality and flexibility.

It is fairly easy to create a webservice that exactly fulfills a particular application requirement, thus ensuring security and manageability. However, this will usually be a very inflexible service, and it will need lots of maintenance to keep up with changes in application requirements. It's also easy to create a webservice that is at least as powerful as the underlying database: this would be a database proxy over HTTP, just like DBSlayer. Although it will likely never need to change, since it simply passes requests on to the back-end database, it is very hard to secure in a way that would allow external requests from possibly malicious users.

I believe that an MQL webservice actually does offer the best of both worlds, without suffering from the disadvantages. A MQL query service will be flexible enough for most web applications - MQL queries are only limited by the underlying data model, not by the set of application-specific actions designed for one particular purpose. At the same time, it will be relatively easy to efficiently analyze MQL queries and apply policies to prevent malicious use. For example, checking that a MQL query doesn't join more than X tables is quite easy.
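To give an impression of how simple such a policy check can be, here's a sketch in JavaScript. It measures how deeply objects and arrays are nested in a parsed MQL query, which roughly corresponds to the number of joins the query would cause in the underlying RDBMS; the limit of 3 is just an arbitrary example:

//sketch: reject MQL queries that nest deeper than some limit.
//nesting depth is a rough stand-in for the number of joins the query will require.
function mqlDepth(node) {
  if (node === null || typeof node !== 'object') {
    return 0; //scalar values and null placeholders
  }
  var max = 0;
  for (var key in node) {
    if (node.hasOwnProperty(key)) {
      var depth = mqlDepth(node[key]);
      if (depth > max) {
        max = depth;
      }
    }
  }
  return max + 1;
}

//example policy: allow at most 3 levels of nesting
var query = JSON.parse(
  '{"type": "/chemistry/chemical_element", "name": "Oxygen", "isotopes": [{"name": null}]}'
);
if (mqlDepth(query) > 3) { //this particular query has a depth of 3, so it passes
  throw new Error("Query too complex");
}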

In the forthcoming installment, I will explore the concept of an MQL webservice in more detail, and I will explain more about the MQL-to-SQL project. As always, I'm looking forward to your comments, suggestions and critique, so don't hesitate to leave a comment.

MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part II

This is the second article in a series to provide some background to my talk for the MySQL User's conference.
The conference will be held April 11-14 2011 in the Hyatt Regency hotel in Santa Clara, California.

Abstract: MQL is a JSON-based database query language that has some very interesting features as compared to SQL, especially for modern (AJAX) web-applications. MQL is not a standard database query language, and currently only natively supported by Freebase. However, with MQL-to-SQL, a project that provides a SQL adapter for MQL, you can run MQL queries against any RDBMS with a SQL interface.

In my previous post, I covered some background information on modern web applications, JavaScript and JSON.
The topic of this installment is the MQL database query language and how it compares to SQL. In a third article, I will discuss the mql-to-sql project,
which implements a MQL service for relational database systems.

MQL Queries

A MQL query is either a JSON object, or a JSON array of objects. In fact, the sample object we discussed in the previous article about JSON is nearly a valid MQL query. For now, consider the following JSON representation of the chemical element Oxygen:
{
  "type": "/chemistry/chemical_element",
  "name": "Oxygen",
  "symbol": "O",
  "atomic_number": 8,
  "ionization_energy": 13.6181,
  "melting_point": -2.1835e+2,
  "isotopes": []
}

The difference with the example from my previous post is that here, we added a key/value pair at the top, "type": "/chemistry/chemical_element", and we use an empty array ([]) as the value for the "isotopes" key.

So how is this a query? Didn't we claim JSON is a data format? So isn't this by definition data? And if this is data, how can it be a query?

Well, the answer is: yes, JSON is a data format, and so, yes: by definition, it must be data. But the paradox of how some data can also be a query is quite easily resolved once you realize that there are some pieces missing from the data. In this particular case, we left the "isotopes" array empty, whereas Oxygen has a bunch of isotopes in the real world.

This is how a JSON object can be a query:
  • By specifying values that describe known facts about an object, we define a filter or access path that can be used to find data in some database. For example, the JSON fragment above states that there exists an object of the type "/chemistry/chemical_element" which has the name "Oxygen", and the symbol "O", among other characteristics.
  • By specifying special placeholders like null or an empty array or object (like [] and {} respectively), we define what data should be retrieved from the database so it can be returned in the result.

Executing a query with the Freebase MQL query editor


The easiest way to execute the query above is by using the Freebase Query Editor. Here's a screenshot:

In the screenshot, the upper left side is the area where you can type or paste your MQL query. You can then press the Run button, which is located below the query area, on its right side.

When you hit the Run button, the query is sent to a special webservice called the mqlread service. The results appear in the results area on the right.

MQL Query Results

In the Text tab of the results area of the MQL query editor, you can see the raw response. It should be something like this:
{
  "code": "/api/status/ok",
  "result": {
    "atomic_number": 8,
    "ionization_energy": 13.6181,
    "isotopes": [
      "Oxygen-15",
      "Oxygen-16",
      "Oxygen-17",
      "Oxygen-18",
      "Oxygen-28",
      "Oxygen-19",
      "Oxygen-20",
      "Oxygen-23",
      "Oxygen-25",
      "Oxygen-24",
      "Oxygen-13",
      "Oxygen-22",
      "Oxygen-26",
      "Oxygen-21",
      "Oxygen-14",
      "Oxygen-27",
      "Oxygen-12"
    ],
    "melting_point": -218.35,
    "name": "Oxygen",
    "symbol": "O",
    "type": "/chemistry/chemical_element"
  },
  "status": "200 OK",
  "transaction_id": "cache;cache01.p01.sjc1:8101;2010-05-15T03:08:10Z;0004"
}
As you can see, the response is again a JSON object. The outermost object is the result envelope, which contains the actual query result assigned to the result property, as well as a few other fields with information about how the request was fulfilled. (Actually, when we hit the Run button, our MQL query was also first embedded into a query envelope before it was sent to the mqlread service, but let's worry about those details later.)

Query by Example, or filling in the blanks

If you compare the actual query result assigned to the "result" property of the result envelope with the original query, you will notice that they are largely the same: at least, the keys of the query object and the result object all match up, and so do the values, except the one for the "isotopes" key: whereas the "isotopes" property was an empty array in the query, in the result it is an array filled with elements, each representing a particular isotope of the Oxygen element. These array elements were retrieved from the Oxygen entry stored in Freebase, which was found by matching properties like "name": "Oxygen" and "atomic_number": 8.

This illustrates an important concept of MQL queries: the query result mirrors the structure of the query itself. Wherever the query contains a special placeholder value, such as an empty array ([]), a null value or an empty object ({}), the query engine will "fill in the blanks" with data that is retrieved from the database.

Another way of putting it is to say a MQL query is a query by example.

To hammer this concept down further, consider the following, similar query. In this case, it contains even more special placeholder values (namely, null values for all but the "name" property):
{
  "type": "/chemistry/chemical_element",
  "name": "Oxygen",
  "symbol": null,
  "atomic_number": null,
  "ionization_energy": null,
  "melting_point": null,
  "isotopes": []
}
If you execute this query, you get a result that is essentially the same as the previous one. Because chemical elements are identified by name (there are more properties that can identify a chemical element; name is just one of them), this query will match the same object in Freebase. This query specifies null for almost all other properties, and because null is a special placeholder for scalar values, the query engine responds by retrieving the values for those keys and returns them in the result.

Differences between MQL and SQL

Even this simple example reveals a lot about the differences between MQL and SQL.

Symmetry between Query and Result

In the previous example, we just saw that the query and its result have a similar structure. It's like the query is mirrored in the result. In SQL, this is very different. Let's see how different.

Assume for a minute that we have a relational database with a schema called "chemistry" and a table called "chemical_element". We could write an SQL query like this:
SELECT  name
, symbol
, atomic_number
, ionization_energy
, melting_point
FROM chemistry.chemical_element
WHERE name = 'Oxygen'
...and the result would look something like this:

+--------+--------+---------------+-------------------+---------------+
| name   | symbol | atomic_number | ionization_energy | melting_point |
+--------+--------+---------------+-------------------+---------------+
| Oxygen | O      |             8 |           13.6181 |       -218.35 |
+--------+--------+---------------+-------------------+---------------+
Even if we forget for a moment that this particular SQL query doesn't handle retrieving the isotopes, the difference between the query and the result is striking - the structure of the SQL query, which is basically a piece of text, has very little to do with the structure of the result, which is unmistakably tabular.

Everybody knows that SQL was designed for relational databases, so we probably shouldn't be too surprised that the result of the SQL query has a tabular form. But why does the query have to be text? Couldn't the query have been tabular too, leaving blanks where we'd expect to retrieve some data?

Lest I lose the proponents of SQL and the relational model here: don't get me wrong - I realize that an SQL query isn't just "a piece of text"; rather, SQL attempts to express facts about database state using a practical form of relational algebra, and the result represents the set that satisfies the facts stipulated by the query. And as we shall see later on, MQL actually has a few operator constructs that resemble the ones found in SQL. But for now the message is: MQL queries and their results look a lot like one another; SQL queries and their results do not.

Relational focus

I already mentioned that MQL is expressed in JSON, and we saw how both query and result are structured as objects. SQL results, by contrast, are always tabular. Superficially, that seems like the right thing to do for a query language that is supposed to work on and for relational database systems. But is it really that natural?

Many textbooks and courses about relational databases and SQL spend much time and effort on the topic of normalization, and rightly so! One of the great achievements of the relational model and normalization is that it helps to minimize or even eliminate data integrity problems that arise in non-relational storage structures and their inherent redundancy.

A typical textbook or course may start with a modelling exercise that takes some real-world, unnormalized (sub-1NF) data as input. The first mental step is to normalize that data to the first normal form (1NF), by splitting off multi-valued attributes (a single data item with a list of values) and their big brother, repeating groups (a "table-inside-a-table"), into separate tables, until all repeating groups and multi-valued attributes are eliminated. In the next phases of normalization, different forms of data redundancy are removed, moving up to ever higher normal forms, typically finishing at the third normal form (3NF) or the Boyce-Codd normal form (BCNF).

The result of this normalization process is an overview of all separate independent sets of data, with their keys and relationships. From there, it's usually a small step to a physical database design. For example, in our Oxygen example, the list of isotopes is a multi-valued attribute. In a proper relational design, the isotopes would typically end up in a separate table of their own, with a foreign key pointing to a table containing the chemical elements.

The textbook usually continues with a crash course in SQL, and chances are the SELECT syntax is the first item on the menu. Within a chapter or two, you'll learn that although normalized storage is great for data maintenance, it isn't great for data presentation. Because applications and end users care a lot about data presentation, you'll learn how to use the JOIN operator to combine the results from related database tables. Usually the goal of that exercise is to end up with a result set that is typically 1NF, or at least, has a lower normal form than the source tables.

Don't get me wrong: this is great stuff! The relational model isn't fighting redundant data: it's fighting data integrity issues, and the lack of control that results from storing redundant data. By first decomposing the data into independent sets for storage, and then using things like JOIN operations to re-combine them, SQL offers a good method to master data integrity issues, and to deliver consistent, reliable data to applications and end-users.

It's just a pity SQL forgot to finish what it started. Recall that the textbook modeling exercise started by eliminating repeating groups and multi-valued attributes to achieve 1NF. What SQL queries should we use to transform the data back into that original, nested format?

SQL turns out to be so single-mindedly focused on the relational model that it simply can't return sub-1NF, unnormalized data. Much like the square from Flatland can't comprehend the sphere from Spaceland, SQL simply hasn't got a clue about nested data structures, like multi-valued attributes and repeating groups.

Somewhere along the way, SQL forgot that the textbook course started with real-world, unnormalized data, full of repeating groups and multi-valued attributes.

This presents quite a few challenges for database application development. One class of software solutions that deals with this problem is the so-called object-relational mappers (ORMs). It would not do the ORMs enough credit to claim that their only purpose is to solve this problem, but it's definitely a major problem they take care of.

Parsing Queries

Everyone who has tried it knows that it isn't exactly trivial to write a fast yet fully functional SQL parser. Being able to parse SQL is a requirement for tools like query editors and report builders, but also for proxies and monitoring tools.

Parsing MQL on the other hand is almost trivially simple. At the application level, this would in theory make it quite easy to implement advanced access policies and limit the complexity of the queries on a per user or role basis.
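To illustrate: parsing an MQL query in JavaScript is a single call to JSON.parse, and after that the query is just an object you can walk. The little sketch below collects the properties a query asks the database to fill in (the null and [] placeholders), using the Oxygen query from earlier in this article:

//parsing MQL: JSON.parse does all the work; what remains is a simple walk
var queryText = '{"type": "/chemistry/chemical_element", "name": "Oxygen",' +
                ' "symbol": null, "atomic_number": null, "isotopes": []}';
var query = JSON.parse(queryText);

//collect the properties the query wants returned (the null and [] placeholders)
var requested = [];
for (var key in query) {
  var value = query[key];
  if (value === null || (value instanceof Array && value.length === 0)) {
    requested.push(key);
  }
}
console.log(requested); //["symbol", "atomic_number", "isotopes"]

A tool that wants to inspect or rewrite an SQL query has to implement (or embed) a full SQL parser before it can do anything comparable.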

Generating Queries

One of the things I like about MQL is that applications have to do a lot less work to formulate a query that drills down into some detail of a previously returned data set. For example, from the result of our query about Oxygen, we just learned that there is an isotope called "Oxygen-16". Suppose we want to know more about that particular isotope, say, its relative abundance, and whether it's stable or not.

With SQL, we would have to construct a new algebraic expression that somehow combines the set of chemical elements with the set of isotopes. Although we would certainly need some data from the tabular result obtained from the previous query, we have little hope of actually re-using the query itself - at the application level, the SQL query is most likely just a piece of text, and it's probably not worth it to use string manipulation to forge a new query out of it.

Here's an example which shows the parts that should be added to accommodate this requirement, just to show that such a change isn't localized to just one spot in the original query:
SELECT  e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable

FROM chemistry.chemical_element e
INNER JOIN chemistry.isotope i ON i.element_name = e.name
WHERE e.name = 'Oxygen'
AND i.name = 'Oxygen-16'
I won't let my head explode over the string manipulation code required to change the original query into this one. If you like, post clever solutions as a comment to this post :)

With MQL, this task is considerably easier. The query and the result have a high degree of correspondence. In our application, both would typically be represented as objects or structs. In most programming languages, it is trivial to go from the original Oxygen Query to an augmented form that retrieves the details about the isotope "Oxygen-16". This is especially true for JavaScript, where we'd simply write something like:
//execute the query and obtain the result.
//note: in a real application this would typically be an asynchronous request
//to the mqlread service which would accept and return JSON strings.
//For simplicity's sake we pretend we have a mqlRead function that can accept
//a regular JavaScript object literal, convert it into a JSON string, and
//call the mqlread service synchronously, parse the JSON query result
//into a JavaScript object and return that to the caller.

var queryResult = mqlRead({
  type: "/chemistry/chemical_element",
  name: "Oxygen",
  symbol: null,
  atomic_number: null,
  ionization_energy: null,
  melting_point: null,
  isotopes: []
});

//assign a subquery for "Oxygen-16" to the isotopes property of the queryResult.
//Remember, the queryResult has the same structure as the original query, just
//with the null's and the empty arrays filled with data.

queryResult.isotopes = {
  name: "Oxygen-16",
  natural_abundance: null,
  stable: null
};

//execute the modified query:

queryResult = mqlRead(queryResult);
(In the first example where we discussed the Oxygen query, you might've noticed that in the query result, the isotopes member was an array of strings, each representing a particular isotope. So naturally, you might be tempted to think that the value of the isotopes property is an array of strings. However, this is not quite the case. Rather, the isotopes property stands for the relationship between an element and its isotopes. Due to the form of the original query (having the empty array for the isotopes property), the MQL query engine responds by listing the default property of the related isotopes. In Freebase, name is a special property and typically that's used as the default property. So in that previous query result, the isotopes were represented only by their name. In the example above however, we assign a single object literal to the isotopes property which identifies one particular isotope. Because the isotopes property represents a relationship, the query should be read: "find me the oxygen element and the related isotope with the name Oxygen-16", and not "find me the oxygen element that has Oxygen-16 as its isotope".)

Of course, it's entirely possible to design data structures to hold all the data you need to generate your SQL queries (a query model). You could easily come up with something that would allow such a change to be made just as easily. But you need at least one extra step to generate the actual SQL string to send to the database. And of course, you need some extra steps again to extract data from the result set for forging more queries.

In MQL, the effort to go from query to result and back is about as minimal as it can get. This results in less application code, which tends to be easier to understand.

SQL is declarative, but MQL is more so

When discussing its merits as a programming language, it is often mentioned that SQL is declarative rather than procedural. Often this is presented as an advantage: with a declarative language like SQL, we get to focus on the results we want, whereas a procedural language would force us to code all kinds of details about how these results should be obtained. Or so the story goes.

I won't deny SQL is declarative. For example, I don't need to spell out any particular data access algorithm required to find the Oxygen element and its isotopes, I just write:
SELECT      e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable
FROM chemistry.chemical_element e
INNER JOIN chemistry.chemical_element_isotope i
ON e.atomic_number = i.atomic_number
WHERE e.name = 'Oxygen'
But still: in order to successfully relate the chemical_element and chemical_element_isotope tables, I need to spell out that the values in chemical_element_isotope's atomic_number column have to be equal to the value in the atomic_number column of chemical_element. Come to think of it, how can I know the relationship is built on atomic_number, and not on symbol or name? And heaven forbid we accidentally compare the wrong columns, or forget one of the join conditions...

Now compare it to the equivalent MQL query:
{
  "type": "/chemistry/chemical_element",
  "name": "Oxygen",
  "symbol": null,
  "atomic_number": null,
  "ionization_energy": null,
  "melting_point": null,
  "isotopes": [{
    "name": null,
    "natural_abundance": null,
    "stable": null
  }]
}

The SQL query may be declarative, but compared to the MQL query, it requires a lot more knowledge of the underlying data model. All we had to do in MQL is specify an isotopes property, and list whatever we want to retrieve from the corresponding isotope instances. The only way we can mess up the MQL query is by specifying the wrong property names, in which case our query would simply fail to execute. In the SQL query, we could've been mistaken about which columns to compare, and get no result at all, or worse, a rubbish result. And with just a bit of ill luck, we can accidentally cause a Cartesian product. Just for the hell of it, spot the error in the following SQL statement:
SELECT      e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable
FROM chemistry.chemical_element e
INNER JOIN chemistry.chemical_element_isotope i
ON e.atomic_number = e.atomic_number
WHERE e.name = 'Oxygen'

Computational Completeness


Everybody with some experience in SQL programming knows that SQL is much more than a database query language. Even standard SQL is chock-full of operators and functions that allow you to build complex expressions and calculations. In fact, most SQL dialects support so many functions and operators that the manual needs at least a separate chapter to cover them. Algebra, Encryption, String formatting, String matching, Trigonometry: these are just a few categories of functions you can find in almost any SQL dialect I heard of.

By contrast, MQL is all about the data. MQL defines a set of relational operators, but their function and scope is limited to finding objects, not doing calculations on them. There is exactly one construct in MQL that resembles a function, and it is used for counting the number of items in a result set.

Personally, I think MQL would be better if it had a few more statistical or aggregate constructs like count. But overall, my current thinking is that the fact that MQL lacks the function-jungle present in most RDBMS-es is actually A Good Thing(tm). At the very least, it ensures queries stay focused on the data, and nothing but the data.
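For completeness, here's a sketch of that counting construct, written as a JavaScript object literal in the style of the earlier mqlRead example. As far as I know the directive is "return": "count", which replaces the matching objects in the result with the number of matches; check the MQL reference for the exact semantics:

//sketch: counting related objects instead of listing them
var countQuery = {
  type: "/chemistry/chemical_element",
  name: "Oxygen",
  isotopes: [{
    "return": "count"
  }]
};
//executed via the hypothetical mqlRead helper from before, the result would
//carry the number of isotopes rather than their names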

Next Time


In this article I discussed the basics of the MQL query language, and I compared SQL and MQL on a number of accounts. In the next installment, I will discuss how this relates to developing data access services for web applications.

Thursday, January 06, 2011

MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part I

Yesterday, I wrote about how I think this year's MySQL conference will differ from prior editions. I also wrote that I will attend and that I will be speaking on MQL-to-SQL.

I promised I would explain a little bit more background about my talk, so here's the first installment.

Abstract: MQL is a JSON-based database query language that has some very interesting features as compared to SQL, especially for modern (AJAX) web-applications. MQL is not a standard database query language, and currently only natively supported by Freebase. However, with MQL-to-SQL, a project that provides a SQL adapter for MQL, you can run MQL queries against any RDBMS with a SQL interface.

This article covers mostly background information on modern web applications, JavaScript and JSON. This background information should help you understand why a JSON-based database query language is a good fit for modern web applications. A following installment will discuss the MQL database query language and how it relates to SQL. A third article will cover the mql-to-sql project itself.

MQL, a JSON-based database query language


MQL (pronounced as Mickle) is an abbreviation of Metaweb Query Language, which is the JSON-based database query language natively supported by Freebase. If you're unfamiliar with the terms Metaweb and Freebase, see this quote from Wikipedia about Metaweb:
Metaweb Technologies, Inc. is a United States company based in San Francisco that is developing Freebase, described as an "open, shared database of the world's knowledge".
...and this quote about Freebase:
On March 3, 2007 Metaweb publicly announced Freebase, described by the company as "an open shared database of the world's knowledge," and "a massive, collaboratively-edited database of cross-linked data." Often understood as a wikipedia-turned-database, Freebase provides an interface that allows non-programmers to fill in structured, or 'meta-data', of general information, and to categorize or connect data items in meaningful, or 'semantic' ways.
So, Freebase is a database of just about everything (think of it as a machine-readable version of Wikipedia), and MQL is its query language.

As a quick primer on the MQL query language, here's a simple MQL query (taken from the Freebase manual):
{
  "query": {
    "type": "/music/artist",
    "name": "The Police",
    "album": []
  }
}
...and here's the result:
{
  "status": "200 OK",
  "code": "/api/status/ok",
  "transaction_id": "cache;cache01.p01.sjc1:8101;2008-09-18T17:56:28Z;0029",
  "result": {
    "type": "/music/artist",
    "name": "The Police",
    "album": [
      "Outlandos d'Amour",
      "Reggatta de Blanc",
      "Zenyatta Mondatta",
      "Ghost in the Machine",
      "Synchronicity"
    ]
  }
}
I won't discuss these samples just yet - the MQL query language is the topic of the next installment. This is just a quick sample to give you an idea what MQL queries and their results look like. If you didn't recognize the syntax already - read on; the JSON syntax is covered in the next section of this article.

The collection of data accumulated in Freebase is impressive - you should check it out sometime. However, this post is not so much about Freebase - it's the MQL query language I want to discuss here.

JSON

MQL is based on JSON (pronounced as Jason), which is an abbreviation of JavaScript Object Notation. If you're not familiar with JSON, hang on and read all of this section - it provides just enough background to understand why it is highly relevant for modern web-applications, and it provides a good-enough description of the JSON syntax for you to read and write MQL queries. You might also want to check out the Wikipedia entry on JSON for an objective overview.

If you are already familiar with JSON, then you can safely skip through to the next section. But please do take a minute to review the JSON object code sample - it is used as basis for developing simple MQL queries in the next article in this series.

Now, back to JSON.

What is JSON?

JSON is a data exchange format. Syntactically, JSON is a proper subset of JavaScript, the de-facto web browser scripting language. The JSON JavaScript subset is defined in such a way that it can express JavaScript values and objects of arbitrary complexity. However, JSON cannot be used to create an executable program - all dynamic, executable elements (functions) have been stripped away.

Because JSON literally is JavaScript (just not the full set), JavaScript programs can easily read and write data expressed as JSON. For this reason, JSON is often characterized as a JavaScript object serialization format.
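A minimal illustration, using the native JSON object that modern browsers expose to scripts:

//serializing a JavaScript object to a JSON string, and parsing it back again
var element = {
  name: "Oxygen",
  symbol: "O",
  atomic_number: 8
};

var jsonText = JSON.stringify(element); //'{"name":"Oxygen","symbol":"O","atomic_number":8}'
var copy = JSON.parse(jsonText);        //a new object with the same structure and values
console.log(copy.name);                 //"Oxygen"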

An excellent description of the JSON syntax can be found at the JSON homepage. But if you don't feel like reading up on the details of JSON, no worries - you just need to remember a few JSON features to read and write MQL queries:
  • 3 scalar* data types:
    • strings - for example: "Oxygen" or "O"
    • numbers - like 8 (integer), 13.6181 (float) and -2.1835e+2 (float)
    • booleans - true and false are the only possible boolean values
  • 2 composite data types:
    • arrays - an ordered comma-separated list of values, enclosed in square brackets: ["Oxygen-16", "Oxygen-17", "Oxygen-18"]. The individual values in an array are usually referred to as elements
    • objects - an unordered comma-separated list of key/value pairs (which are uniquely named items), enclosed in curly braces:
      {
      "name": "Oxygen",
      "symbol": 'O',
      "atomic_number": 8,
      "ionization_energy": 13.6181,
      "melting_point": -2.1835e+2,
      "isotopes": ["Oxygen-16", "Oxygen-17", "Oxygen-18"]
      }
      The key/value pairs of an object are usually called properties or members. Note that name and value are separated by a colon (:). Keys are strings, and have to be quoted. Values can be of any of the data types described above, including arrays (like the "isotopes" property in the example above) and objects.
  • A special null-value (which actually constitutes a data type of its own)

(* I should point out that JavaScript is an object-oriented language. At runtime, all data types described above, including the array and "scalar" types, are in fact objects. Objects are by definition composite, and not scalar. However, the list above is about JSON's syntactical constructs, not about their runtime JavaScript representation.)

As you can see, JSON provides a few very simple rules. Yet it allows you to build data structures of arbitrary complexity: the object example above illustrates how you can bundle a collection of named values together to represent structures of a higher order. In this particular case, we denoted an object that represents the chemical element Oxygen. And, we needn't have stopped here: we could've added more key/value pairs to represent other complex properties of the element Oxygen, representing things like the person who first discovered it, a list of well known substances that contain Oxygen and so on and so forth.

AJAX and JSON


I already mentioned something about JSON in relation to JavaScript and modern web applications. By modern web applications, I mean web applications that offer a rich and highly interactive user interface based on AJAX (an acronym for Asynchronous JavaScript And XML) technology.

A key feature of AJAX is the usage of client-side JavaScript code for maintaining non-blocking background communication with the web server. This is typically done using a specialized object called the XMLHttpRequest. AJAX applications also tend to use JavaScript to dynamically change the contents or appearance of the page (a technique called dynamic HTML or DHTML). This allows web developers to create applications that can avoid a relatively slow page reload most of the time, making the application appear more responsive.

Going by the meaning of the AJAX acronym, you may be under the impression that AJAX applications are all about asynchronous XML data exchange using the XMLHttpRequest, and thus the relevance or need for JSON might be lost on you. I would say that, yes, you're not wrong: AJAX is often implemented by using the XMLHttpRequest object to communicate with the server using XML messages. But for several reasons, JSON is gaining popularity as a data exchange format instead of XML, and techniques like JSONP are used in addition to the XMLHttpRequest object. This technique actually has an advantage over the XMLHttpRequest because it can be used to send requests to services that reside on a different domain than the current web page, whereas this is not allowed with the XMLHttpRequest (at least, not by default and not without explicitly asking the user for confirmation). This makes JSONP a great tool for creating mash-up applications.
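As a sketch of how the JSONP technique works (the service URL and the callback parameter name below are made up; a real service documents its own):

//JSONP sketch: instead of using XMLHttpRequest, we add a <script> element whose
//src points at the remote service. The service wraps its JSON response in a call
//to the function named in the callback parameter, so when the script loads,
//our function receives the data - even if the service lives on another domain.
function handleData(data) {
  console.log(data);
}

var script = document.createElement("script");
script.src = "http://api.example.com/elements?name=Oxygen&callback=handleData"; //hypothetical URL
document.getElementsByTagName("head")[0].appendChild(script);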

I'm sure you're not surprised to hear from me that the internet offers many places where you can get your XML vs JSON brawl on - I'm not particularly interested in that discussion, and I don't feel anybody has to choose sides. The point I want to bring across is that for AJAX applications, JSON is a respected data exchange format. It is excellently supported by both web browsers as well as popular AJAX frameworks, and companies like Amazon, YAHOO! and Google deliver more and more webservices that use JSON as data exchange format.

Next time


Now that you've seen how JSON works, and how it relates to AJAX web applications, you're ready to take a look at MQL queries. This is the topic of the next blog post.

Speaking at the MySQL conference 2011

I just received a confirmation that my presentation proposal for the MySQL user conference 2011 was accepted! The title for my proposal is MQL-to-SQL: a JSON-based Query Language for RDBMS Access from AJAX Applications, and it covers pretty much everything implied by the title.

As always, the Hyatt Regency Hotel in Santa Clara, California serves as the venue. The conference will be held from April 11-14. Except for the venue and period, I think this year's conference will bear few similarities to previous editions. Let me try and explain.

This year's theme is "MySQL, the ecosystem and Beyond". This means that the conference is using MySQL as an anchor for a myriad of topics which are of interest to a large majority of MySQL users. This explicitly leaves room for subjects that may not be directly related to the MySQL product proper.

So, not only products with a direct link to MySQL, such as Drizzle and MariaDB, are covered; NoSQL databases like CouchDB, MongoDB and Cassandra are quite well represented, and the conference committee actively reached out to the PostgreSQL community to submit proposals. Traditional topics like scalability, performance and tuning remain strongly present, as do high availability, failover, and replication. As always, some of the world's experts in this field will be speaking. In addition, infrastructural topics like virtualization and cloud computing are well represented (especially, of course, with regard to database management). One of the things I'm thrilled about is the presence of developer- and application-centric topics like GIS, rapid application development, and object-relational mapping.

Just take a look at the full schedule to get a taste of what this event will be offering. Personally, I think it's a great setup, and I'm happy and honored to attend!

In a series of upcoming blog posts, I plan to explain some of the subject matter regarding my own talk. But for now, I just want to tell you that I think this is going to be a great conference! I'm looking forward to attending a lot of high quality sessions, and meeting world leading experts in the MySQL and open source database ecosystem. I hope to see you there!
