Friday, January 07, 2011

MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part II

This is the second article in a series to provide some background to my talk for the MySQL User's conference.
The conference will be held April 11-14 2011 in the Hyatt Regency hotel in Santa Clara, California.

Abstract: MQL is a JSON-based database query language that has some very interesting features as compared to SQL, especially for modern (AJAX) web-applications. MQL is not a standard database query language, and currently only natively supported by Freebase. However, with MQL-to-SQL, a project that provides a SQL adapter for MQL, you can run MQL queries against any RDBMS with a SQL interface.

The my previous post, I covered some background information on modern web applications, JavaScript and JSON.
The topic of this installment is the MQL database query language and how it compares to SQL. In a third article, I will discuss the mql-to-sql project,
which implements a MQL service for relational database systems.

MQL Queries

A MQL query is either a JSON object, or a JSON array of objects. In fact, the sample object we discussed in previous section about JSON is nearly a valid MQL query. For now, consider the following JSON representation of the chemical element Oxygen:
{
"type": "/chemistry/chemical_element",
"name": "Oxygen",
"symbol": 'O',
"atomic_number": 8,
"ionization_energy": 13.6181,
"melting_point": -2.1835e+2,
"isotopes": []
}

The difference with the example from my previous post is that here, we added a key/value pair at the top, "type": "/chemistry/chemical_element", and we use an empty array ([])as value for the "isotopes" key.

So how is this a query? Didn't we claim JSON is a data format? So isn't this by definition data? And if this is data, how can it be a query?

Well, the answer is: yes, JSON is a data format, and so, yes: by definition, it must be data. But the paradox how some data can also be a query is quite easily solved once you realize that there are some pieces missing from the data. In this particular case, we left the "isotopes" array empty, whereas Oxygen has a bunch of isotopes in the real world.

This is how a JSON object can be a query:
  • By specifying values that describe known facts about an object, we define a filter or access path that can be used to find data in some database. For example, the JSON fragment above states that there exists an object of the type "/chemistry/chemical_element" which has the name "Oxygen", and the symbol "O", among other characterisics
  • By specifying special placeholders like null or an empty array or object (like [] and {} respectively), we define what data should be retrieved from the database so it can be returned in the result.

Executing a query with the Freebase MQL query editor


The easiest way to execute the query above is by using the Freebase Query Editor. Here's a screenshot:

In the screenshot, the left upper side is the area where you can type or paste your MQL query. You can then press the Run button, which is located below the query area on its right side.

When you hit the Run button, the query is sent to a special webservice called the mqlread service. The results appear in the results area on the right.

MQL Query Results

In the Text tab of the results area of the MQL query editor, you can see the raw response. It should be something like this:
{
"code": "/api/status/ok",
"result": {
"atomic_number": 8,
"ionization_energy": 13.6181,
"isotopes": [
"Oxygen-15",
"Oxygen-16",
"Oxygen-17",
"Oxygen-18",
"Oxygen-28",
"Oxygen-19",
"Oxygen-20",
"Oxygen-23",
"Oxygen-25",
"Oxygen-24",
"Oxygen-13",
"Oxygen-22",
"Oxygen-26",
"Oxygen-21",
"Oxygen-14",
"Oxygen-27",
"Oxygen-12"
],
"melting_point": -218.35,
"name": "Oxygen",
"symbol": "O",
"type": "/chemistry/chemical_element"
},
"status": "200 OK",
"transaction_id": "cache;cache01.p01.sjc1:8101;2010-05-15T03:08:10Z;0004"
}
As you can see, the response is again a JSON object. The outermost object is the result envelope, which contains the actual query result assigned to the result property, as well as a few other fields with information about how the request was fulfilled. (Actually, when we hit the Run button, our MQL query was also first embedded into a query envelope before it was sent to the mqlread service, but let's worry about those details later.

Query by Example, or filling-in-the blanks

If you compare the actual query result assigned to the "result" property of the result envelope with the original query, you will notice that they are largely the same: at least, the keys of query object and result object all match up, and so do the values, except the one for the "isotopes" key: whereas the "isotopes" property was an empty array in the query, in the result it contains an array filled with elements, each representing a particular isotope of the Oxygen element. These array elements were retrieved from the Oxygen entry stored in Freebase, which was found by matching the properties like "name": "Oxygen" and "atomic_number": 8.

This illustrates an important concept of MQL queries: the query result mirrors the structure of the query itself. Wherever the query contains a special placeholder value, such as an empty array ([]), a null value or an empty object ({}), the query engine will "fill in the blanks" with data that is retrieved from the database.

Another way of putting it is to say a MQL query is a query by example.

To hammer this concept down further, consider the following similar query. In this case, it contains even more special placeholder values (namely, the null values for all but the "name" property:
{
"type": "/chemistry/chemical_element",
"name": "Oxygen",
"symbol": null,
"atomic_number": null,
"ionization_energy": null,
"melting_point": null,
"isotopes": []
}
If you execute this query, you get a result that is essentially the same as the previous one. Because chemcical elements are identified by name (there are more properties that can identify a chemical element, name is just one of them), this query will match the same object in Freebase. This query specifies null for almost all other properties, and because null is a special placeholder for scalar values, the query engine responds by retrieving the values for those keys and returns them in the result.

Differences between MQL and SQL

Even this simple example reveals a lot about the differences between MQL and SQL.

Symmetry between Query and Result

In the previous example, we just saw that the query and its result have a similar structure. It's like the query is mirrored in the result. In SQL, this is very different. Let's see how different.

Assume for a minute we would have a relational database with a schema called "chemistry" and a table called "chemical_element". We could write an SQL query like this:
SELECT  name
, symbol
, atomic_number
, ionization_energy
, melting_point
FROM chemistry.chemical_element
WHERE name = 'Oxygen'
...and the result would look something like this:

+--------+--------+---------------+-------------------+---------------+
| name | symbol | atomic_number | ionization_energy | melting_point |
+--------+--------+---------------+-------------------+---------------+
| Oxygen | O | 8 | 13.6181 | -218.35 |
+--------+--------+---------------+-------------------+---------------+
Even if we forget for a moment that this particular SQL query doesn't handle retrieving the isotopes, the difference between the query and the result is striking - the structure of the SQL query, which is basically a piece of text, has very little to do with the structure of the result, which is unmistakenly tabular.

Everybody knows that SQL was designed for relational databases, so we probably shouldn't be too surprised that the result of the SQL query has a tablular form. But why does the query have to be text? Couldn't the query have been tabular too, leaving blanks where we'd expect to retrieve some data?

Lest I lose the proponents of SQL and the relational model here: don't get me wrong - I realize that an SQL query isn't just "a piece of text; rather SQL attempts to express facts about database state using a practical form of relational algebra, and the result represents the set that satisfies the facts stipulated by the query. And as we shall see later on, MQL actually has a few operator constructs that resemble the ones found in SQL. But for now the message is: MQL queries and their results look a lot like one another; SQL queries and their results do not.

Relational focus

I already mentioned the MQL is expressed in JSON, and we saw how both query and result are structured as objects. SQL results are always tabular. Superficially, this seems like the right thing to do for a query language that is supposed to work on and for relational database systems. But is it really that natural?

Many textbooks and courses about relational databases and SQL spend much time and effort on the topic of normalization, and rightly so! One of the great achievements of the relational model and normalization is that it helps to minimize or even eliminate data integrity problems that arise in non-relational storage structures and their inherent redundancy.

A typical textbook or course may start with a modelling exercise with some real-world, unnormalized (sub 1NF) data as input. The first mental step is to normalize that data to the first normal form (1NF), by splitting off multivalued attributes (a single data item with a list of values) and their big brother, repeating groups (a "table-inside-a-table"), to separate tables, on and on until all repeating groups and multi-valued attributes are eliminated. In the next phases of normalization, different forms of data redundancy are removed, moving up to even higher normal forms, typically finishing at the third normal form (3NF) or the Boyce-Codd normal form (BCNF).

The result of this normalization process is an overview of all separate independent sets of data, with their keys and relationships. From there, it's usually a small step to a physical database design. For example, in our Oxygen example, the list of isotopes is a multi-valued attribute. In a proper relational design, the isotopes would typically end up in a separate table of their own, with a foreign key pointing to a table containing the chemical elements.

The textbook usually continues with a crash course in SQL, and chances are the SELECT syntax is the first item on the menu. Within a chapter or two, you'll learn that although normalized storage is great for data maintenance, it isn't great for data presentation. Because applications and end users care a lot about data presentation, you'll learn how to use the JOIN operator to combine the results from related database tables. Usually the goal of that exercise is to end up with a result set that is typically 1NF, or at least, has a lower normal form than the source tables.

Don't get me wrong: this is great stuff! The relational model isn't fighting redundant data: its fighting data integrity issues, and the lack of control that results from storing redundant data. By first decomposing the data into independent sets for storage, and then using things like JOIN operations to re-combine them, SQL offers a good method to master data integrity issues, and to deliver consistent, reliable data to applications and end-users.

It's just a pity SQL forgot to finish what it started. Recall that the text-book modeling exercise started by eliminating repeating groups and multi-valued attributes to achieve 1NF. What SQL queries should we use to transform the data back to that format?

SQL turns out to be so single-mindedly focused on the relational model that it simply can't return sub-1NF, unnormalized data. Much like the square from flatland can't comprehend the sphere from spaceland, SQL simply hasn't got a clue about nested data structures, like multi-valued attributes and repeating groups.

Somewhere along the way, SQL forgot that the text-book course started with real-world, unnormalized data, full of repeating groups and multivalued attributes.

This represents quite a share of challenges for database application development. One class of software solutions that deal with solving this problem are the so-called object-relational mappers (ORM). It would not do enough credit to the ORM's to claim that their only purpose is to solve this problem, but it's definitely a major problem they take care of.

Parsing Queries

Everyone that has tried it knows that it isn't exactly trivial to write a fast yet fully functional SQL parser. Being able to parse SQL is a requirement for tools like query editors and report builders, but also for proxies and monitoring tools.

Parsing MQL on the other hand is almost trivially simple. At the application level, this would in theory make it quite easy to implement advanced access policies and limit the complexity of the queries on a per user or role basis.

Generating Queries

One of the things I like about MQL is that applications have to do a lot less work to formulate a query that drills down into some detail of a previously returned data set. For example, from the the result of our query about Oxygen, we just learned that there is an isotope called "Oxygen-16". Suppose we want to know more about that particular isotope, say, its relative abundance, and whether it's stable or not.

With SQL, we would have to construct a new algebraic expression that somehow combines the set of chemical elements with the set of isotopes. Although we would certainly need some data from the tabular result obtained from the previous query, we have little hope of actually re-using the query itself - at the application level, the SQL query is most likely just a piece of text, and it's probably not worth it to use string manipulation to forge a new query out of it.

Here's an example which shows the parts that should be added to accommodate this requirement, just to show that such a change isn't localized to just one spot in the original query:
SELECT  e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable

FROM chemistry.chemical_element e
INNER JOIN chemistry.isotope i ON e.name = e.element_name
WHERE e.name = 'Oxygen'
AND i.name = 'Oxygen-16'
I won't let my head explode over the string manipulation code required to change the original query into this one. If you like, post clever solutions as a comment to this post :)

With MQL, this task is considerably easier. The query and the result have a high degree of correspondence. In our application, both would typically be represented as objects or structs. In most programming languages, it is trivial to go from the original Oxygen Query to an augmented form that retrieves the details about the isotope "Oxygen-16". This is especially true for JavaScript, where we'd simply write something like:
//execute the query and obtain the result.
//note: in a real application this would typically be an asynchronous request
//to the mqlread service which would accept and return JSON strings.
//For simplicity sake we pretend we have a mqlRead function that can accept
//a regular JavaScript object literal, convert it into a JSON string, and
//call the mqlread service synchronously, parse the the JSON query result
//into a JavaScript object and return that to the caller.

var queryResult = mqlRead({
type: "/chemistry/chemical_element",
name: "Oxygen",
symbol: null,
atomic_number: null,
ionization_energy: null,
melting_point: null,
isotopes: []
});

//assign a subquery for "Oxygen-16" to the isotopes property of the queryResult.
//Remember, the queryResult has the same structure as the original query, just
//with the null's and the empty arrays filled with data.

queryResult.isotopes = {
name: "Oxygen-16",
natural_abundance: null,
stable: null
};

//execute the modified query:

queryResult = mqlRead(queryResult);
(In the first example where we discussed the Oxygen query, you might've noticed that in the query result, the isotopes member was an array of strings, each representing a particular isotope. So naturally, you might be tempted to think that the value of the isotopes property is an array of strings. However, this is not quite the case. Rather, the isotopes property stands for the relationship between elements and its isotopes. Due to the form of the original query (having the empty array for the isotopes property), the MQL query engine responds by listing the default property of the related isotopes. In freebase, name is a special property and typically that's used as default property. So in that previous query result, the isotopes were merely represented only by their name. In the example above however, we assign a single object literal to the isotopes property which identifies one particular isotope. Because the isotopes property represents a relationship, the query should be read: "find me the oxygen element and the related isotope with the name Oxygen-16", and not "find me the oxygen element that has Oxygen-16 as its isotope".)

Of course, it's entirely possible to design data structures to hold all the data you need to generate your SQL queries (a query model). You can easily come up with something that would allow such a change to be made just as easy. But you need at least one extra step to generate the actual SQL string to send to the database. And of course, you need some extra steps again to extract data from the resultset for forging more queries.

In MQL, the effort to go from query to result and back are about as minimal as it can get. This results in less application code, which tends to be easier to understand.

SQL is declarative, but MQL is more so

When discussing its merits as a programming language, it is often mentioned that SQL is declarative rather than procedural. Often this is presented as an advantage: with a declarative language like SQL, we get to focus on the results we want, whereas a procedural language would force us to code all kinds of details about how these results should be obtained. Or so the story goes.

I won't deny SQL is declarative. For example, I don't need to spell out any particular data access algorithm required to find the Oxygen element and it's isotopes, I just write:
SELECT      e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable
FROM chemistry.chemical_element e
INNER JOIN chemistry.chemical_element_isotope i
ON e.atomic_number = i.atomic_number
WHERE e.name = 'Oxygen'
But still: in order to successfully relate the chemical_element and chemical_element_isotope tables, I need to spell out that the values in chemical_element_isotope's atomic_number column have to be equal to the value in the atomic_number column of chemical_element. Come to think of it, how can I know the relationship is built on atomic_number, and not on symbol or name? And heaven forbid we accidentally compare the wrong columns, or forget one of the join conditions...

Now compare it to the equivalent MQL query:
{
"type": "/chemistry/chemical_element",
"name": "Oxygen",
"symbol": null,
"atomic_number": null,
"ionization_energy": null,
"melting_point": null,
"isotopes": [{
"name": null,
"natural_abundance": null,
"stable": null
}]
}

The SQL query may be declarative, but compared to the MQL query, it requires a lot more knowledge of the underlying data model. All we had to do in MQL, is specify an isotopes property, and list whatever we want to retrieve from the corresponding isotope instances. The only way we can mess up the MQL query is when we specify the wrong property names, in which case our query would simply fail to execute. In the SQL query, we could've been mistaken about which columns to compare, and get no result at all, or worse, a rubbish result. And with just a bit of ill luck, we can accidentally cause a cartesian product. Just for the hell of it, spot the error in the following SQL statement:
SELECT      e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable
FROM chemistry.chemical_element e
INNER JOIN chemistry.chemical_element_isotope i
ON e.atomic_number = e.atomic_number
WHERE e.name = 'Oxygen'

Computational Completeness


Everybody with some experience in SQL programming knows that SQL is much more than a database query language. Even standard SQL is chock-full of operators and functions that allow you to build complex expressions and calculations. In fact, most SQL dialects support so many functions and operators that the manual needs at least a separate chapter to cover them. Algebra, Encryption, String formatting, String matching, Trigonometry: these are just a few categories of functions you can find in almost any SQL dialect I heard of.

By contrast, MQL is all about the data. MQL defines a set of relational operators, but their function and scope is limited to finding objects, not doing calculations on them. There is exactly one construct in MQL that resembles a function, and it is used for counting the number of items in a result set.

Personally, I think MQL would be better if it had a few more statistical or aggregate constructs like count. But overall, my current thinking is that the fact that MQL lacks the function-jungle present in most RDBMS-es is actually A Good Thing(tm). At the very least, it ensures queries stay focused on the data, and nothing but the data.

Next Time


In this article I discussed the basics of the MQL query language, and I compared SQL and MQL on a number of accounts. In the next installment, I will discuss how this relates to developing data access services for web applications.

5 comments:

Jos van Dongen said...

MQL? So confusing! Always thought it was a Pentaho thing: http://wiki.pentaho.com/display/ServerDoc2x/03.+Pentaho+Metadata+MQL+Schema

btw, great post of course ;-)

joeharris76 said...

Great post, Roland. Kudos for championing MQL.

The beauty of having the query and response in the same format is that you could compose a partially populated object, submit it to an MQL method and get the *same* object back with the additional info you asked for. We could get away from constantly composing requests and then parsing responses.

I think the Ruby community could really get on board with it. Just needs a good example library to kick things off.

Joe

rpbouman said...

Hi Jos,

Thanks for the comment! Glad you like it:)

Yes, there are a few MQL's around, Here are a few pointers:
* Mini QL: http://emdros.org/MQL-Programmers-Guide.pdf (Looks interesting btw, never heard of this before)
* Meta Quotes Langage, for the MetaTrader software http://en.wikipedia.org/wiki/MetaTrader_4
* Metadata Query Language, from Pentaho http://wiki.pentaho.com/display/ServerDoc1x/03.+Pentaho+Metadata+MQL+Schema
* Molecular Query Language, http://en.wikipedia.org/wiki/Molecular_Query_Language
* Metaweb Query Language, the JSON-based query language supported by freebase, and the topic of this series.
* Matrix Query Language, a proprietary language for Enovia MatrixOne

rpbouman said...

Hi Joe!

thanks for the kind words - I really appreciate it :)

Yes, I completely agree - the fill-in-the-blanks approach to querying has a lot of advantages for OLTP and object oriented data access patterns, which I feel are most prominent in typical web applications.

I do think MQL is less suitable for the typical set-based operations, and it's also not a very good analytical query language purposes. I think SQL and MDX (and in some cases R) respectively remain the best solution in those cases.

As for Ruby: the implementation I have is based on PHP, but I would very much welcome implementations in other languages. Things I considered myself are Java and JavaScript (both server-side and client-side), but I decided to first try and gather interest for the concept with a quick and dirty PHP solution (which hasn't got much going for it except for the fact that it works good enough to test if the idea is viable).

So, if you're seriously interested in collaborating, I'd very much welcome that initiative.

rpbouman said...

Oh dear, there's a "Mingle Query Language" too:

http://www.thoughtworks-studios.com/mingle/2.0/help/mql_reference.html

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...