Friday, January 07, 2011

MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part III

This is the third article in a series providing background information to my talk for the MySQL User's conference, entitled MQL-to-SQL: a JSON-based Query Language for RDBMS Access from AJAX Applications.

In the first installment, I introduced freebase, an open shared database of the world's knowledge and its JSON-based query language, the Metaweb Query Language (MQL, pronounced Mickle). In addition, I discussed the JSON data format, its syntax, its relationship with the de-facto standard client side browser scripting language JavaScript, and its increasing relevance for modern AJAX-based webapplications.

The second installment provides a brief introduction to MQL as database query language, and how it compares to the de-facto standard query language for relational database systems (RDBMS), the Structured Query Language (SQL). I argued that MQL has some advantages over SQL, in particular for programming modern webapplications. I mentioned the following reasons:
  • Since MQL is JSON, and JSON is JavaScript, it's a more natural fit for modern AJAX applications
  • MQL is almost trivial to parse, making it much easier to write tool such as editors, but also to implement advanced authorization policies
  • MQL is easy to generate: the structure of MQL queries is mirrored by their results. A fragment of the result can be easily augmented with a subquery making it easy to subsequently drill down into the retrieved dataset
  • MQL is more declarative than SQL. Both attributes and relationships are represented as JSON object properties and one need not and cannot specify join conditions to combine data from different types of objects. The "cannot" is actually A Good Thing because it means one cannot make any mistakes.
  • MQL is more focussed on just the data. It largely lacks functions to transform data retrieved from the database forcing application developers to do data processing in the application or middleware layer, not in the database.
In this article, I want to discuss common practices in realizing data access for applications, especially web applications, and how database query languages like SQL and MQL fit in there.

Web Application Data Access Practices

After reading the introduction of this article, one might get the idea that I hate relational databases and SQL. I don't, I love them both! It's just that when developing database applications, especially for the web, SQL isn't helping much. Or rather, it's just one tiny clog in a massive clockwork that has to be set up again and again. Let me explain...

The Data Access Problem

It just happens to be the case that I'm an application developer. Typically, I develop rich internet and intranet applications and somewhere along the line, a database is involved for storing and retrieving data. So, I need to put "something" in place that allows the user to interact with a database via the web browser.
The way I write that I need "something" so the user can interact with the database, it seems like it's just one innocent little thing. In reality, "something" becomes a whole bunch of things that need to work together:
  • Browsers don't speak database wire protocols - they speak HTTP to back-end HTTP servers. No matter what, there is going to be some part that is accessible via HTTP that knows how to contact the database server. Examples of solutions to this part of the problem are Common Gateway Interface (CGI) programs, server-side scripting languages like PHP or Perl (which are often themselves implemented as a CGI program) or in the case of Java, specialized Servlet classes
  • The component at the HTTP server that mediates between web browser and database is going to require a protocol: a set of rules that determine how a URI, a HTTP method and parameters can lead to executing a database command. Examples of approaches to design such a protocol are Remote Procedure Calls (RPC) and Representational State Transfer (REST)
  • There's the way back too: the application running in the web browser is going to have to understand the results coming back from the database. For data, a choice has to be made for a particular data exchange format, typically eXtensible Markup Language or JSON
  • The user interface at the browser end need not only understand the protocol we invented for the data exchange, Ideally it should also guide the user and be able to validate whatever data the user is going to feed it. In other words, the user interface needs to have the metadata concerning the data exchange interface.

The Webservice Solution

A typical way to tackle the data access problem is:
  • analyze the functionality of the application, categorizing it into a series of clear and isolated actions
  • identify which data flows from application to database and back for each action
  • decide on a data representation, and a scheme to identify actions and parameters
  • create one or more programs in Java, Perl, PHP, Python, Ruby or whatever fits your stack that can execute the appropriate tasks and process the associated data flows to implement the actions
For web applications, the program or programs developed in this way are typically webservices that run as part of the process of the HTTP server. The client gets to do a HTTP request, to a particular URI, using the right HTTP method, and the right parameters. The service gets to process the parameters, execute tasks such as accessing a database, and finally, sending back a HTTP response, which typically contains data requested by the application.

Development Dynamics

The problem with the webservice approach is that it isn't very flexible. It presumes the application's functionality and hence the actions are quite well-defined. Although this looks reasonable on paper, in reality development tends to be quite evolutionary.

Typically, the core functionality of applications is quite well defined, but often a lot of additional functionalities are required. Although we can pretend these could be known in advance if only we'd spend more time designing and planning in advance, in practice, they often aren't. It may seem sad, but in many cases, the best way to find out is simply to start developing, and find out along the way. Agile is the latest buzzword that captures some of these development dynamics, but there have been other buzzwords for it in the past, such as RAD (rapid application development) and DSDM (dynamic systems development method).

The problem with this approach is that it requires a lot of going back-and-forth between front- and back-end development tasks: whenever the front-end wants to develop a new feature, it is often dependent upon the back-end offering a service for it. Front-end and back-end developers often are not the same people, so what we get is front-end development cycles having to wait on back-end development cycles to complete. Or in case front-end and back-end developers are the same person, they are constantly switching between tool sets and development environments.

In part this is because front-end development is usually done in JavaScript, and although server-side JavaScript is gaining ground, the server-side is still dominated mainly by Java, PHP, C++ and ASP.NET. But it's not just a programming language problem - developing a client, and especially a frond-end for end-users, presumes a very different mindset than developing a back-end process. Front-end development should focus on usability and quality user-experience; back-end development should focus on robustness, reliability, availability, scalability and performance. Although some of these aspects influence each other, in practice, front-end development is simply a different cup of tea than back-end development.

A Simple Plan: Building a Query Service

There is a very simple solution that would largely solve the data access problem without dealing with the inefficiencies of the recurring development process: If you could build a single service that can accept any parameters, understand them, and somehow return an appropriate result, we would never have to add functionality to the service itself. Instead, the front end application would somehow have to construct the right parameters to tell the service what it wants whenever the need arises.

This sounds almost like magic, right? So we must be kidding, right? Well, we're not kidding, and it's not magic either; it's more like a cheap parlour trick.
As the title of this section suggests, a query service fits this bill.

It would be very easy to build a single service that accepts a query as a parameter, and returns its result as response. And seriously, it's not that strange an idea: many people use between one and perhaps ten or twenty different services exactly like this everyday, multiple times...and it's called a search engine.

Can't we use something like that to solve our database access problem? Well, we could. But actually, someone beat you to it already.

DBSlayer, a Webservice for SQL Queries

A couple of years ago, The New York Times released DBSlayer. DBSlayer (DataBase accesS layer) is best described as a HTTP server that acts as a database proxy. It accepts regular SQL queries via a parameter in a regular HTTP GET request, and sends a HTTP response that contains the resulting data as JSON. It currently supports only MySQL databases but announcements were made that support was planned for other database products too. DBSlayer is actually a bit more than just a database access layer, as it also supports simple failover and round-robin request distribution, which can be used to scale out database requests. But I mention it here, because it implements exactly the kind of query service that would appear to solve all the aforementioned problems.

Or would it?

Every web developer and every database administrator should realize immediately that it's not a good idea. At least, not for internet-facing applications anyway. The DBSlayer developers documented that themselves quite clearly:
Access to the DBSlayer can be controlled via firewalls; the DBSlayer should never be exposed to the outside world.
... and ...
The account DBSlayer uses to access the MySQL database should not be allowed to execute dangerous operations like dropping tables or deleting rows. Ideally, the account would only be able to run selects and/or certain stored procedures.
So there's the rub: it may be very easy and convenient from the application development point of view, but it is a horrendous idea when you think about security. A general purpose SQL query service is simply too powerful.

If a web application accidentally allows arbitrary SQL to be executed, it would be called an SQL injection vulnerability, and it would be (or at least, should be) treated as a major breach of security. Creating a service that offers exactly that behavior as a feature doesn't lessen the security concerns at all.

What about a MQL query service

In this article I tried to explain the problems that must be solved in order to arrange and manage data access for web applications. The key message is that we need to create a service that provides data access. But in doing so, we have to balance between security, functionality and flexibility.

It is fairly easy to create a webservice that exactly fulfills a particular application requirement, thus ensuring security and manageability. However, this will usually be a very inflexible service, and it will need lots of maintenance to keep up with change in application requirements. It's also easy to create a webservice that is at least as powerful as the underlying database: this would be a database proxy over HTTP, just like DBSlayer. Although it will likely never need to change since it simply passes requests on to the back-end database, it is very hard to secure it in a way that would allow external requests from possibly malignant users.

I believe that an MQL webservice actually does offer the best of both worlds, without suffering from the disadvantages. A MQL query service will be flexible enough for most web applications - MQL queries are only limited by the underlying data model, not by the set of application-specific actions designed for one particular purpose. At the same time, it will be relatively easy to efficiently analyze MQL queries and apply policies to prevent malicious use. For example, checking that a MQL query doesn't join more than X tables is quite easy.

In the forthcoming installment, I will explore the concept of a MQL webservice in more detail, and I will explain more about the MQL-to SQL project. As always, I'm looking forward to your comments, suggestions and critique so don't hesitate to leave a comment.

1 comment:

andry said...

json is a very interesting language to be used. very good tutorial and can hopefully help me in building json in the application that I created for this lecture. thank you

SAP HANA Trick: DISTINCT STRING_AGG

Nowadays, many SQL implementations offer some form of aggregate string concatenation function. Being an aggregate function, it has the effe...