Thursday, December 01, 2011

Common Schema: dependencies routines

Are you a MySQL DBA? Check out the common_schema project by Oracle Ace Shlomi Noach.

The common_schema is an open source MySQL schema that packs a number of utility views, functions and stored procedures. You can use these utilities to simplify MySQL database administration and development. Shlomi just released revision 178, and I'm happy and proud to be working together with Shlomi on this project.

Among the many cool features created by Shlomi, such as foreach, repeat_exec and exec_file, there are a few %_dependencies procedures I contributed:

  • get_event_dependencies(schema_name, event_name)

  • get_routine_dependencies(schema_name, routine_name)

  • get_sql_dependencies(sql, default_schema)

  • get_view_dependencies(schema_name, view_name)

All these procedures return a resultset that indicates which schema objects are used by the object identified by the input parameters. Here are a few examples that should give you an idea:

mysql> call common_schema.get_routine_dependencies('common_schema', 'get_routine_dependencies');
+---------------+----------------------+-------------+--------+
| schema_name   | object_name          | object_type | action |
+---------------+----------------------+-------------+--------+
| common_schema | get_sql_dependencies | procedure   | call   |
| mysql         | proc                 | table       | select |
+---------------+----------------------+-------------+--------+
2 rows in set (0.19 sec)

Query OK, 0 rows affected (0.19 sec)

mysql> call common_schema.get_routine_dependencies('common_schema', 'get_sql_dependencies');
+---------------+-------------------+-------------+--------+
| schema_name   | object_name       | object_type | action |
+---------------+-------------------+-------------+--------+
| common_schema | _get_sql_token    | procedure   | call   |
| common_schema | _sql_dependencies | table       | create |
| common_schema | _sql_dependencies | table       | drop   |
| common_schema | _sql_dependencies | table       | insert |
| common_schema | _sql_dependencies | table       | select |
+---------------+-------------------+-------------+--------+
5 rows in set (1.59 sec)
Of course, there's always a lot to be desired. The main shortcoming, as I see it now, is that the dependencies are listed only one level deep: that is, the dependencies are not recursively analyzed. Another problem is that there is currently nothing to calculate reverse dependencies (which would arguably be more useful).
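
To make "recursive" and "reverse" dependencies a bit more concrete, here is a minimal sketch in Python. It is not part of common_schema: the dictionary below is a hand-copied stand-in for the two result sets shown above, and the function names are made up for illustration. The idea is simply to keep following direct dependencies until nothing new turns up, and to invert the mapping to find out who uses what.

```python
from collections import deque

# Direct dependencies, copied by hand from the two result sets shown above.
# Keys and values are (schema_name, object_name) pairs. This dict is a
# hypothetical stand-in for calling get_routine_dependencies() per routine.
DIRECT = {
    ("common_schema", "get_routine_dependencies"): {
        ("common_schema", "get_sql_dependencies"),
        ("mysql", "proc"),
    },
    ("common_schema", "get_sql_dependencies"): {
        ("common_schema", "_get_sql_token"),
        ("common_schema", "_sql_dependencies"),
    },
}


def recursive_dependencies(start, direct=DIRECT):
    """Return every object reachable from start, at any depth."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dep in direct.get(current, ()):
            # The seen set also protects against cycles (mutual recursion).
            if dep not in seen and dep != start:
                seen.add(dep)
                queue.append(dep)
    return seen


def reverse_dependencies(direct=DIRECT):
    """Invert the mapping: for each object, which objects use it directly."""
    used_by = {}
    for user, deps in direct.items():
        for dep in deps:
            used_by.setdefault(dep, set()).add(user)
    return used_by


if __name__ == "__main__":
    for dep in sorted(recursive_dependencies(("common_schema", "get_routine_dependencies"))):
        print(dep)
```

A real implementation inside common_schema would have to do the same walk in SQL, calling the existing procedures for each newly discovered routine, view or event.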

The good news is, this is all open source, and your contributions are welcome! If you're interested in the source code of these routines, check out the common_schema project, and look in the common_schema/routines/dependencies directory.

If you'd like to add recursive dependencies, or reverse dependencies, then don't hesitate and contribute. If you have a one-off contribution that relates directly to these dependencies routines, then it's probably easiest if you email me directly, and I'll see what I can do to get it in. If you are interested in more long term contribution, it's probably best if you write Shlomi, as he is the owner of the common_schema project.

You can even contribute without implementing new features or fixing bugs. You can simply contribute by using the software and reporting bugs or offering suggestions to improve it. If you find a bug, or have an idea for an improvement or an entirely new feature, please use the issue tracker.

For now, enjoy, and until next time.

Friday, October 21, 2011

MySQL Hacks: Preventing deletion of specific rows

Recently, someone emailed me:
I have a requirement in MYSQL as follows:
we have a table EMP and we have to restrict the users not delete employees with DEPT_ID = 10. If user executes a DELETE statement without giving any WHERE condition all the rows should be deleted except those with DEPT_ID = 10.

We are trying to write a BEFORE DELETE trigger but we are not able to get this functionality.

I have seen your blog where you explained about Using an UDF to Raise Errors from inside MySQL Procedures and/or Triggers. Will it helps me to get this functionality? Could you suggest if we have any other alternatives to do this as well?
Frankly, I usually refer people that write me these things to a public forum, but this time I felt like giving it a go. I figured it would be nice to share my solution and I'm also curious if others found other solutions still.

(Oh, I should point out that I haven't asked what the underlying reasons are for this somewhat extraordinary requirement. I normally would do that if I were confronted with such a requirement in a professional setting. In this case I'm only interested in finding a crazy hack.)

Attempt 1: Re-insert deleted rows with a trigger

My first suggestion was:
Raising the error won't help you achieve your goal: as soon as you raise the error, the statement will either abort (in case of a non-transactional table) or rollback all row changes made up to raising the error (in case of a transactional table)

Although I find the requirement strange, here's a trick you could try:

write a AFTER DELETE FOR EACH ROW trigger that re-inserts the rows back into the table in case the condition DEPT_ID = 10 is met.

Hope this helps...

Alas, I should've actually tried it myself before replying, because it doesn't work. If you do try it, a DELETE results in this runtime error:
Can't update table 'emp' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.
This is also known as "the mutating table problem".

Attempt 2: Re-insert deleted rows into a FEDERATED table

As it turns out, there is a workaround that meets all of the original demands. The workaround relies on the FEDERATED storage engine, which we can use to trick MySQL into thinking we're manipulating a different table than the one that fires the trigger. My first attempt went something like this:

CREATE TABLE t (
    id INT AUTO_INCREMENT PRIMARY KEY,
    dept_id INT,
    INDEX(dept_id)
);

CREATE TABLE federated_t (
    id INT AUTO_INCREMENT PRIMARY KEY,
    dept_id INT,
    INDEX(dept_id)
)
ENGINE FEDERATED
CONNECTION = 'mysql://root@localhost:3306/test/t';

DELIMITER //

CREATE TRIGGER adr_t
AFTER DELETE ON t
FOR EACH ROW
    IF old.dept_id = 10 THEN
        INSERT INTO federated_t
        VALUES (old.id, old.dept_id);
    END IF;
//

DELIMITER ;
So the idea is to let the trigger re-insert the deleted rows back into the federated table, which in turn points to the original table that fired the trigger to fool MySQL into thinking it isn't touching the mutating table. Although this does prevent one from deleting any rows that satisfy the DEPT_ID = 10 condition, it does not work as intended:

mysql> INSERT INTO t VALUES (1,10), (2,20), (3,30);
Query OK, 3 rows affected (0.11 sec)

mysql> DELETE FROM t;
ERROR 1159 (08S01): Got timeout reading communication packets

mysql> SELECT * FROM t;
+----+---------+
| id | dept_id |
+----+---------+
|  1 |      10 |
|  2 |      20 |
|  3 |      30 |
+----+---------+
3 rows in set (0.00 sec)
At this point I can only make an educated guess about the actual underlying reason for this failure. It could be that the deletion is locking the rows or even the table, thereby blocking the insert into the federated table until we get a timeout. Or maybe MySQL enters into an infinite loop of deletions and insertions until we hit a timeout. I didn't investigate, so I don't know, but it seems clear this naive solution doesn't solve the problem.

Attempt 3: Deleting from the FEDERATED table and re-inserting into the underlying table

It turns out that we can solve it with a FEDERATED table by turning the problem around: Instead of manipulating the original table, we can INSERT and DELETE from the FEDERATED table, and have an AFTER DELETE trigger on the FEDERATED table re-insert the deleted rows back into the original table:

DROP TRIGGER adr_t;

DELIMITER //

CREATE TRIGGER adr_federated_t
AFTER DELETE ON federated_t
FOR EACH ROW
    IF old.dept_id = 10 THEN
        INSERT INTO t
        VALUES (old.id, old.dept_id);
    END IF;
//

DELIMITER ;
Now, the DELETE does work as intended:

mysql> DELETE FROM federated_t;
Query OK, 3 rows affected (0.14 sec)

mysql> SELECT * FROM federated_t;
+----+---------+
| id | dept_id |
+----+---------+
|  1 |      10 |
+----+---------+
1 row in set (0.00 sec)
Of course, to actually use this solution, one would grant applications access only to the federated table, and "hide" the underlying table so they can't bypass the trigger by deleting rows directly from the underlying table.

Now, even though this solution does seem to fit the original requirements, I would not recommend it for several reasons:
  • It uses the FEDERATED storage engine, which hasn't been well supported. For that reason, it isn't enabled by default, and you need access to the MySQL configuration to enable it, limiting the applicability of this solution. Also, you could run into some nasty performance problems with the FEDERATED storage engine.
  • The solution relies on a trigger. In MySQL, triggers can really limit performance.
  • Perhaps the most important reason is that this solution performs "magic" by altering the behaviour of SQL statements. Arguably, this is not so much the fault of the solution as it is of the original requirement.

An Alternative without relying on magic: a foreign key constraint

If I were to encounter the original requirement in a professional situation, I would argue that we should not desire to alter the semantics of SQL commands. If we tell the RDBMS to delete all rows from a table, it should either succeed and result in all rows being deleted, or it should fail and fail completely, leaving the data unchanged.

So how would we go about implementing a solution for this changed requirement?

We certainly could try the approach that was suggested in the original request: create a trigger that raises an exception whenever we find the row should not be deleted. However, this would still rely on a trigger (which is slow). And if you're not on MySQL 5.5 (or higher), you would have to use one of the ugly hacks to raise an exception.

As it turns out, there is a very simple solution that does not rely on triggers. We can create a "guard table" that references the table we want to protect using a foreign key constraint:

mysql> CREATE TABLE t_guard (
-> dept_id INT PRIMARY KEY,
-> FOREIGN KEY (dept_id)
-> REFERENCES t(dept_id)
-> );
Query OK, 0 rows affected (0.11 sec)

mysql> INSERT INTO t_guard values (10);
Query OK, 1 row affected (0.08 sec)

mysql> DELETE FROM t;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`test`.`t_guard`, CONSTRAINT `t_guard_ibfk_1` FOREIGN KEY (`dept_id`) REFERENCES `t` (`dept_id`))
mysql> DELETE FROM t WHERE dept_id != 10;
Query OK, 2 rows affected (0.05 sec)
(Like in the prior example with the federated table, the guard table would not be accessible to the application, and the "guard rows" would have to be inserted by a privileged user)

Finally: what a quirky foreign key constraint!

You might have noticed that there's something quite peculiar about the foreign key constraint: typically, foreign key constraints serve to relate "child" rows to their respective "parent" row. To do that, the foreign key would typically point to a column (or set of columns) that make up either the primary key or a unique constraint in the parent table. But in this case, the referenced column dept_id in the t table is contained only in an index which is not unique. Strange as it may seem, this is allowed by MySQL (or rather, InnoDB). In this particular case, this flexibility (or is it a bug?) serves us quite well, and it allows us to guard many rows in the t table with dept_id = 10 with just one single row in the guard table.

Friday, October 07, 2011

Fighting Spam: Word Verification

Hi All,

this is a quick note to let you know that from now on, commenters on this blog will need to complete a word verification (captcha) step.

Personally, I regret to have to take this measure. Let me explain why I'm doing it anyway.

For the past 3 months or so, moderating comments on this blog has become a real drag due to a surge in anonymous spam. While Blogger's spam detection is quite good, I still get notification mails prompting me to moderate. I feel this is consuming more of my time than it's worth.

Except for requiring word verification, other policies (or lack thereof) are still in effect: all comments are moderated, but anyone can comment, even anonymously. In practice, all real comments get published - even negative or derogatory ones (should I receive them).

Sorry for the inconvenience, but I hope you'll understand.

Wednesday, August 24, 2011

Re-implementing udf_init_error in MySQL 5.5 and up

To whom it may concern -

Today, I received an email from a user of the udf_init_error UDF (which resides in the lib_mysqludf_udf library). The purpose of this UDF is to generate an error condition, which can be used to abruptly terminate a trigger or stored procedure. As such it is a workaround for bug #11661. This is all described extensively in my now ancient article here.

The user wrote me because of a problem experienced in MySQL 5.5:
...calling
select udf_init_error('Transaction Cannot Be Done Because....');
will return user friendly error message:
Transaction Cannot Be Done Because....
. But in MySQL 5.5, it returns
Can't initialize function 'udf_init_error; Transaction Cannot Be Done Because....
The Can't initialize function 'udf_init_error; bit is so annoying! How can I get rid of that?
I explained that the UDF still works like it should; it's just that at some point during the 5.0 lifecycle, the format of the error message was changed. (I can't recall exactly which version that was, but I did file bug #38452 that describes this issue).

Anyway, I suggested to move away from using the udf_init_error() UDF, and port all dependent code to use the SIGNAL syntax instead, which was introduced in MySQL 5.5. (For a friendly introduction to using the SIGNAL syntax, please check out one of my prior articles).

Unfortunately, for this particular user this would not be an easy task:
The use of SIGNAL did come to my mind, but the implementation is not easy. I have thousands of stored routines to modify. Besides, I'm already satisfied with what the UDF does.
On the one hand, it makes me happy to hear the udf_init_error() UDF served him so well that he wrote so many routines that rely on it; on the other hand, I feel bad that this is holding him back from upgrading to MySQL 5.5.

For everybody that is in this same position, I'd like to suggest the following solution: simply re-implement udf_init_error() as a stored SQL function that uses the SIGNAL functionality instead. The error message returned to the client will not be exactly the same as in the olden MySQL 5.0 days, but at least there will not be an annoying complaint about a UDF that cannot be initialized.

Here's a very simple example that illustrates how to do it:
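DELIMITER //

CREATE FUNCTION udf_init_error(
    p_message VARCHAR(80)
)
RETURNS INTEGER
DETERMINISTIC
NO SQL
BEGIN
    -- SQLSTATE '45000' is the generic "unhandled user-defined exception" state
    DECLARE err CONDITION FOR SQLSTATE '45000';
    SIGNAL err SET MESSAGE_TEXT = p_message;
    -- never reached, but a stored function must contain a RETURN statement
    RETURN 1;
END //

DELIMITER ;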
I hope this helps.

Monday, August 15, 2011

Proposals for Codebits.EU

Codebits is an annual 3-day conference about software and, well, code. It's organized by SAPO and this year's edition is to be held on November 10 thru 12 at the Pavilhão Atlântico, Sala Tejo in Lisbon, Portugal.

I've never attended SAPO Codebits before, but I heard good things about it from Datacharmer Giuseppe Maxia. The interesting thing about the way this conference is organized is that all proposals are available to the public, which can also vote for the proposals. This year's proposals are looking very interesting already, with high quality proposals from Giuseppe about database replication with Tungsten replicator, Pentaho's chief of data integration Matt Casters about Kettle (aka Pentaho data integration), and Pedro Alves from webdetails who will be talking about "Big Data" analysis and dashboarding work he did for the Mozilla team.

There are many more interesting talks, and you should simply check out the proposals for yourself and give a thumbs up or a thumbs down according to whether you'd like to see a particular proposal at the conference. I decided to send in a few proposals as well:

So, if you like what you see here, take a minute to vote and shape this Codebits conference. I'm hoping to meet you there!

Friday, August 12, 2011

Regarding the MySQL Conference and Expo 2012

Last week, Baron Schwartz announced the Percona Live MySQL Conference and Expo 2012.

Percona has organized MySQL-related conferences and seminars before, reportedly with considerable success and to the satisfaction of their attendees, and there's one coming up in London in October 2011. But arguably, last week's announcement is quite different from the prior Percona conferences: it seeks to replace the annual O'Reilly MySQL Conference and Expo.

Everyone that has read the announcement will have no trouble recognizing it as a replacement, since it reads:
We all know that the entire MySQL community has been waiting to see if there will be a MySQL conference next year in the traditional date and location. To the best of our knowledge, no one else was planning one, so we decided to keep the tradition alive.
If you're still in doubt:

  • The conference title contains the phrase: "MySQL Conference and Expo".

  • It's to be held in the Hyatt Regency Hotel in Santa Clara, which has been the venue for the O'Reilly MySQL Conference and Expo since at least 2005.

  • It's scheduled to take place midway April, exactly like the O'Reilly conferences used to be.

  • The scope of the conference encompasses the entire "eco-system" - whatever that is: Developers and DBAs; tools and techniques; tutorials, talks and BOFs. It's about users, but also explicitly about companies and businesses.

Immediately following the announcement, bloggers from the MySQL community - all of whom I respect, and consider friends of mine - started posting their opinions:

(In this list, I tried to keep the affiliation of these bloggers as appropriate and relevant as possible. Please let me know if you feel I wrongly associated someone with a particular company or organizational body.)

Except for Henrik's post, all of these express a negative attitude towards Percona's announcement. The critique focuses on a few themes:

  • Giuseppe and Sheeri express similar thoughts. They recall how Baron Schwartz and Peter Zaitsev (both from Percona, and now initiating the 2012 conference) criticized the O'Reilly MySQL Conference and Expo 2008 edition for increasingly becoming an event focused on business and vendors, rather than on users (see here and here). There seems to be a hidden accusation that now, only a few years later, Baron and Peter are "guilty" of organizing a business-oriented conference themselves.

  • Also both Sheeri and Giuseppe's posts express the concern that Oracle might not allow any of its MySQL engineers and architects to speak at the conference. This would arguably make it a less interesting conference as Oracle is a major -if not the main- contributor to both MySQL and InnoDB.

  • All bloggers argue that organizing a conference of this scale should not be the effort of a single company. In particular Kaj Arnö and Monty Widenius allude to the possibility of O'Reilly organizing the conference again, just like the way things used to be. They both explicitly include a list of the major companies contributing to MySQL which they envision should help drive such a conference.

    The main concern here is that when a single company organizes this event, it will be their event. In other words, it will not be neutral. There is serious concern for unfair competition, as the organizing company gets to decide or exert greater influence on which talks are approved, and how talks are scheduled against each other.

Although I understand the critique, I do not agree with it. I hope I'm not offending any of my friends, but I think none of these seemingly sensible arguments against Percona organizing the MySQL Conference have true merit.

But before I explain, I think it's interesting to observe that nobody seems to assume it as a given that O'Reilly would be organizing another MySQL Conference and Expo. Monty comes closest to saying something about it:
The reason for my state of mind is that although there have been rumors about discontinuance of the O'Reilly arranged conference there hasn't been any announcement about this.

In fact, I have been working with O'Reilly to try to setup next year's O'Reilly MySQL conference with the intention of having it 'exactly like before', even if Oracle would not participate.
So basically, because O'Reilly didn't say they weren't going to do one, it might be possible, right :) I tend to look at it differently: it means exactly nothing when someone, O'Reilly included, doesn't announce something. The way I see it, O'Reilly has nothing to gain by announcing that they will not be organizing another MySQL Conference. Similarly, they've got nothing to lose by staying silent about it.

In the end, organizing conferences is one of O'Reilly's business activities. The mere fact that they've been organizing one during the previous years does not bestow any special responsibility upon them to inform potential attendees and sponsors that they are discontinuing such an activity.

It's interesting that Monty mentions he was working together with O'Reilly on it. I have no reason to doubt it, but I do suspect that whatever was in the works, it was probably not going to happen at the traditional location and at the traditional time window. Silicon Valley Conference centers are busy places, and need to be reserved well in advance - starting to work on it less than three quarters in advance can probably not be considered "well in advance" for an event of this scale.

Now, here are my arguments as to why I do not share the opinions I mentioned above:

  • If you read Peter and Baron's posts from way back in 2008 (I included the links already, but here and here they are again), you will notice that they were not in fact criticizing the O'Reilly MySQL Conference and Expo at all. They simply recognized there was a gap and felt there should also be a community-driven conference. In fact, Baron initiated such an event, the Open SQL Camp. That turned out to be such a great success that others started organizing OpenSQL Camps too.

    Now, if you read Baron's announcement for the Percona MySQL Conference and Expo, you'll notice that precisely because the tables are turned, they now feel the need to maintain a business-driven MySQL conference. They simply recognized that now there is the risk of a gap as far as a business-driven MySQL conference is concerned. In other words, there is no question of should this be a business-driven event or a community-driven event. Both kinds of events are needed, and the business one wasn't being taken care of, neither by O'Reilly, nor by Oracle.

  • The concern that Oracle might not allow its engineers to attend a conference that is organized by a competitor seems reasonable. But it assumes that they would allow it if it was a vendor-neutral conference, or at least a conference that could be perceived as such. To those that have been involved to some extent in the organization of the 2010 and 2011 editions of the MySQL Conference, it should be no secret that Oracle's participation hasn't been exactly eager. Just listen to Tim O'Reilly's own talk at the MySQL 2010 conference. If that doesn't convince you, look at the sponsor list for the 2011 edition: no Oracle. And if that still doesn't convince you - Last year it was very unclear whether Oracle was sending any delegation at all. Only at a very late stage did we receive proposals from Oracle.

    To be clear, I am not blaming Oracle for not wishing to participate in a particular conference. They have their own strategy and they are entitled to execute that however they see fit, even if that includes not sponsoring or speaking at a major MySQL conference. I'm just arguing that whether or not such a conference is organized by a vendor neutral party does not seem to be a part of Oracle's consideration. There is in my opinion absolutely no guarantee that Oracle would participate if things really would be like they used to, and O'Reilly and not Percona would be organizing another edition.

  • The final matter I'd like to discuss is the idea of all major MySQL contributing companies working together to organize a conference. Although I think that's a very sympathetic idea, it doesn't seem very realistic to me. Or at least, it doesn't seem realistic that this would lead to a Santa Clara conference in April 2012. So maybe this is something all involved parties should discuss for the years to come.

My final remarks are that in the end, I am mostly happy that at least one party is willing to take the up-front risk in securing the venue. Percona has announced that, just like O'Reilly, they want to set up a board of community members to drive the program. I think other companies have legitimate concerns over vendor neutrality, but I have no reason to doubt that Baron, and by extension, Percona are doing whatever they can to safeguard that.

What is left is Percona's company name in the conference title. While Percona is likely to use that to drive their own business, this is not really different from all pre-2010 MySQL conferences, where MySQL AB and then Sun used it to drive theirs. That does not mean it excludes competitors using the conference to their advantage to pursue their business interests. There has always been room for competing vendors, and arguably that is what made the event not just *a* MySQL conference, but *the* MySQL conference.

I understand that not everybody is happy about how things are going now, and it would be great if all companies that feel they have a stake here collaborate in the future. But for now, I'm really happy there will be a 2012 edition, and I thank Percona for organizing it. I will definitely send in proposals as soon as the call for papers is open, and I hope everybody that feels they have something to talk about or present will do the same. I really believe this can be a conference exactly like it was before, with the only difference that it's organized by Percona, and not O'Reilly.

Wednesday, June 22, 2011

Working with namespaces and namespace prefixes in XSLT 1.0

To whom it may concern -

I'm currently developing an XSLT 1.0 stylesheet for analysis of XML Schema documents. As part of this work, I developed a couple of templates for working with namespace names and prefixes, and I'd like to share them via this post. The code is not incredibly hard or advanced, but it gets the job done and it may save you some time if you need something similar.
  • get-local-name: Return the local name part of a given QName.
  • get-prefix: Return the prefix part of a given QName.
  • get-ns-name: Return the namespace name associated to the given prefix.
  • get-ns-prefix: Return a prefix that can be used to denote the given namespace name.
  • resolve-ns-identifier: Return the namespace name for a given QName prefix
Before I discuss the code, I want to make a few remarks:
  1. This is all about XSLT 1.0 and its query language, XPath 1.0. All these things can be solved much more conveniently in XPath 2.0, and hence in XSLT 2.0, because that builds on XPath 2.0 (and not on XPath 1.0, as XSLT 1.0 does).
  2. If you're planning to use this in a web browser, and you want to target Firefox, you're out of luck. Sorry. Firefox is a great browser, but unlike Chrome, Opera and even Internet Explorer, it doesn't care enough about XSLT to fix bug #94270, which has been living in their bug tracker since August 2001 (nope - I didn't mistype 2011, that's 2001 as in almost a decade ago).

get-local-name

Return the local name part of a given QName. This is functionally equivalent to the fn:local-name-from-QName function in XPath 2.0.
<!-- get the substring after the *last* colon (or the argument if no colon) -->
<xsl:template name="get-local-name">
    <xsl:param name="qname"/>
    <xsl:choose>
        <xsl:when test="contains($qname, ':')">
            <xsl:call-template name="get-local-name">
                <xsl:with-param name="qname" select="substring-after($qname, ':')"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$qname"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

get-prefix

Return the prefix part of a given QName. This is functionally equivalent to the fn:prefix-from-QName function in XPath 2.0.
<!-- get the substring before the *last* colon (or empty string if no colon) -->
<xsl:template name="get-prefix">
    <xsl:param name="qname"/>
    <xsl:param name="prefix" select="''"/>
    <xsl:choose>
        <xsl:when test="contains($qname, ':')">
            <xsl:call-template name="get-prefix">
                <xsl:with-param name="qname" select="substring-after($qname, ':')"/>
                <!-- re-insert the ':' separator once the accumulated prefix is non-empty, so 'a:b:c' yields 'a:b' rather than 'ab' -->
                <xsl:with-param name="prefix" select="concat($prefix, substring(':', 1, number(string-length($prefix) &gt; 0)), substring-before($qname, ':'))"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$prefix"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
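
For comparison, here is what get-local-name and get-prefix boil down to outside of XSLT 1.0. This is only a sketch in Python, and split_qname is a name I made up for illustration. The recursion in the templates exists only because XPath 1.0 has no "last index of" function; in a language that has one, both templates collapse into a single split at the last colon, as the template comments describe.

```python
def split_qname(qname):
    """Split a QName at its *last* colon and return (prefix, local_name).

    Mirrors the intent of the get-prefix and get-local-name templates:
    without a colon the prefix is the empty string and the local name is
    the whole argument.
    """
    # str.rpartition returns ('', '', qname) when the separator is absent,
    # which is exactly the "no colon" case of both templates.
    prefix, _sep, local_name = qname.rpartition(":")
    return prefix, local_name


if __name__ == "__main__":
    for example in ("xs:element", "element"):
        print(example, "->", split_qname(example))
```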

get-ns-name

Return the namespace name associated to the given prefix. This is functionally equivalent to the fn:namespace-uri-for-prefix function in XPath 2.0. The main difference is that this template does the lookup against the namespace definitions that are in effect in the current context, whereas the XPath 2.0 function allows the element which is used as context to be passed in as argument.
<!-- get the namespace uri for the namespace identified by the prefix in the parameter -->
<xsl:template name="get-ns-name">
    <xsl:param name="ns-prefix"/>
    <xsl:variable name="ns-node" select="namespace::node()[local-name()=$ns-prefix]"/>
    <xsl:value-of select="$ns-node"/>
</xsl:template>

get-ns-prefix

Return a prefix that can be used to denote the given namespace name. This template is complementary to the get-ns-name template. This template assumes only one prefix will be defined for each namespace. The namespace is resolved against the current context.
<!-- get the namespace prefix for the namespace name parameter -->
<xsl:template name="get-ns-prefix">
    <xsl:param name="ns-name"/>
    <xsl:variable name="ns-node" select="namespace::node()[.=$ns-name]"/>
    <xsl:value-of select="local-name($ns-node)"/>
</xsl:template>

resolve-ns-identifier

Return the namespace name for a given QName prefix (be it a namespace prefix or a namespace name). This template is useful to generically obtain a namespace name when feeding it the prefix part of a QName. If the prefix happens to be a namespace name, then that is returned, but if it happens to be a namespace prefix, then a lookup is performed to return the namespace name. This template also looks at the namespaces in effect in the current context.
<!-- return the namespace name -->
<xsl:template name="resolve-ns-identifier">
    <xsl:param name="ns-identifier"/>
    <xsl:choose>
        <xsl:when test="namespace::node()[.=$ns-identifier]">
            <xsl:value-of select="$ns-identifier"/>
        </xsl:when>
        <xsl:when test="namespace::node()[local-name()=$ns-identifier]">
            <xsl:value-of select="namespace::node()[local-name()=$ns-identifier]"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:message terminate="yes">
                Error: "<xsl:value-of select="$ns-identifier"/>" is neither a valid namespace prefix nor a valid namespace name.
            </xsl:message>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
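
If it helps to see the lookup logic of the last three templates without the namespace axis, here is a rough Python sketch. It assumes the namespaces in scope are available as a plain dictionary mapping prefix to namespace name; the dictionary contents and the function names are hypothetical and only serve as illustration.

```python
# Hypothetical in-scope namespaces for some context node: prefix -> namespace name.
IN_SCOPE = {
    "xs": "http://www.w3.org/2001/XMLSchema",
    "ex": "urn:example:demo",
}


def get_ns_name(prefix, in_scope):
    """Namespace name bound to prefix, or '' if the prefix is not in scope."""
    return in_scope.get(prefix, "")


def get_ns_prefix(ns_name, in_scope):
    """First prefix bound to ns_name, or '' if there is none."""
    return next((p for p, uri in in_scope.items() if uri == ns_name), "")


def resolve_ns_identifier(identifier, in_scope):
    """Accept either a namespace name or a prefix; always return the name."""
    if identifier in in_scope.values():
        # Already a namespace name: return it unchanged.
        return identifier
    if identifier in in_scope:
        # A prefix: look up the namespace name it is bound to.
        return in_scope[identifier]
    raise ValueError(
        f'"{identifier}" is neither a valid namespace prefix '
        "nor a valid namespace name."
    )


if __name__ == "__main__":
    print(resolve_ns_identifier("xs", IN_SCOPE))
    print(resolve_ns_identifier("urn:example:demo", IN_SCOPE))
```

Like the templates, the sketch checks for a namespace name first and a prefix second, and it gives up with an error when the identifier is neither.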

Saturday, June 18, 2011

HPCC vs Hadoop at a glance

Update

Since this article was written, HPCC has undergone a number of significant changes and updates. This addresses some of the critique voiced in this blog post, such as the license (updated from AGPL to Apache 2.0) and integration with other tools. For more information, refer to the comments placed by Flavio Villanustre and Azana Baksh.

The original article can be read unaltered below:

Yesterday I noticed a tweet by Andrei Savu. This prompted me to read the related GigaOM article and then check out the HPCC Systems website.

If you're too lazy to read the article or visit that website:
HPCC (High Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems. The platform is now Open Source!


HPCC Systems compares itself to Hadoop, which I think is completely justified in terms of functionality. Its product originated as a homegrown solution of LexisNexis Risk Solutions allowing its customers (banks, insurance companies, law enforcement and federal government) to quickly analyze billions of records, and as such it has been in use for a decade or so. It is now open sourced, and I already heard an announcement that Pentaho is its major Business Intelligence Partner.

Based on the limited information I made a quick analysis, which I emailed to the HPCC Systems CTO, Armando Escalante. My friend Jos van Dongen said it was a good analysis and told me I should post it. Now, I don't really have time to make a nice blog post out of it, but I figured it can't hurt to just repeat what I said in my emails. So here goes:

Just going by the documentation, I see two real unique selling points in HPCC Systems as compared to Hadoop:

  • Real-time query performance (as opposed to only analytic jobs). HPCC offers two different setups, labelled Thor and Roxie. Functionality-wise, Thor should be compared to a Map/Reduce cluster like Hadoop: it's good for doing fairly long running analyses on large volumes of data. Roxie is a different beast, designed to offer fast data access and to support ad-hoc real-time queries.
  • Integrated toolset (as opposed to a hodgepodge of third party tools). We're talking about an IDE, job monitoring, code repository, scheduler, configuration manager, and whatnot. These really look like big productivity boosters, which may make Big Data processing a lot more accessible to companies that don't have the kind of development teams required to work with Hadoop.

(there may be many more benefits, but these are just the ones I could clearly distill from the press release and the website)

Especially for Business Intelligence, Roxie may be a big thing. If real-time Big Data queries could be integrated with Business Intelligence OLAP and reporting tools, then this is certainly a big thing. I can't disclose the details but I have trustworthy information that integration with Pentaho's Analysis Engine, the Mondrian ROLAP engine, is underway and will be available as an Enterprise feature.

A few things that look different but which may not matter too much when looking at HPCC and Hadoop from a distance:
  • ECL, the "Enterprise Control Language", which is a declarative query language (as opposed to just Map/Reduce). This initially seems like a big difference but Hadoop has tools like pig and sqoop and hive. Now, it could be that ECL is vastly superior to these hadoop tools, but my hunch is you'd have to be careful in how you position that. If you choose a head-on strategy in promoting ECL as opposed to pig, then the chances are that people will just spend their energy in discovering the things that pig can do and ECL cannot (not sure if those features actually exist, but that is what hadoop fanboys will look for), and in addition, the pig developers might simply clone the unique ECL features and the leveling of that playing field will just be a matter of time. This does not mean you shouldn't promote ECL - on the contrary, if you feel it is a more productive language than pig or any other hadoop tool, then by all means let your customers and prospects know. Just be careful and avoid downplaying the hadoop equivalents because that strategy could backfire.

  • Windows support. It's really nice that HPCC Systems is available for Microsoft Windows: that makes adoption a lot easier for Microsoft shops (and there are a lot of them). That said, customers that really have a big-data problem will solve it no matter what their internal software policies are. So they'd happily start running hadoop on linux if that solves their problems.
  • Maturity. On paper HPCC looks more mature than hadoop. It's hard to tell how much that matters though because hadoop has all the momentum. People might choose hadoop because they anticipate that the maturity will come thanks to the sheer number of developers committing to that platform.


The only things I can think of where HPCC looks like it has a disadvantage as compared to Hadoop are adoption rate and licensing. I hope these will prove not to be significant hurdles for HPCC, but I think that these might be bigger problems than they seem. Especially the AGPL licensing seems problematic to me.

The AGPL is not well regarded by anyone I know - not in the open source world. The general idea seems to be that even more than plain GPL3 it restricts how the software may be used. If the goal of open sourcing HPCC is to gain mindshare and a developer community (something that hadoop has done and is doing extremely well) then a more permissive license is really the way to go.

If you look at products like MySQL but also Pentaho - they are both very strongly corporately led products. They have a good number of users, but few contributions from outside the company, and this is probably due to a combination of GPL licensing and the additional requirement for handing over the copyright of any contributions to the company. Hence these products don't really benefit from an open source development model (or at least not as much as they could). For these companies, open source may help initially to gain a lot of users, but the majority of those are users that just want a free ride: conversion rates to enterprise edition customers are quite low. It might be enough to make a decent buck, but eventually you'll hit a cap on how far you can grow. I'm not saying this is bad - you only need to grow as much as you have to, but it is something to be aware of.

Contrast this to Hadoop. They have a permissive Apache 2.0 license, and this results in many individuals but also companies contributing to the project. And there are still companies like Cloudera that manage to make a good living off of the services around their distribution of Hadoop. You don't lose the ability to develop add-ons either with this model - Apache 2.0 allows all that. The difference with GPL (and AGPL) of course is that it allows this also to other users and companies. So the trick to stay on top in this model is to simply offer the best product (as opposed to being the sole holder of the copyright to the code).

Anyway - that is it for now - I hope this is helpful.

Monday, May 30, 2011

Cleaning webpages with Pentaho Data Integration and JTidy

Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. Pentaho Data Integration (kettle) has lots of functionality on-board to make this an easy process, except one: it does not support reading data directly from HTML.

In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract their data using the conventional "Get data from XML" step. The solution hinges on two ingredients:

Standard Kettle tools for Webservices

Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in Pentaho Kettle Solutions). But when you're dealing with plain old HTML, things can get pretty hairy.

If you're lucky, the page may be in XHTML, and in that case it's worth trying the Get Data from XML step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like &nbsp; entities, which are valid in XHTML, but not in plain XML. And of course, more often than not, you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.

Workaround: JavaScript string manipulation

In the past, I usually worked around these issues. In practice, some quick and dirty string manipulation using the Modified Javascript Value step and the built-in indexOf(), substring() and replace() functions goes a long way.

In most cases I don't really need the entire web page, but only a <table>, <ul> or <ol> element in the <body>. Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular class attribute, we can use a JavaScript snippet like this:

//table we're looking for
var startHandle = "<table class=\"lvw\" cellpadding=0 cellspacing=0>";
var startPosition= html.indexOf(startHandle);
//look beyond the start tag to lose the invalid unquoted attributes
startPosition += startHandle.length;

//find where this table ends (lucky us, no nested table elements :)
var endHandle = "</table>";
var endPosition = html.indexOf(endHandle, startPosition);

//make a complete table fragment out of it again
var table = "<table>" + html.substring(startPosition, endPosition + endHandle.length);

//replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes
//note the * quantifier: without it, img tags that carry attributes would not match
table = table.replace(/&nbsp;|<img[^>]*>|nowrap/ig, "");


There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.

A better way: using JTidy

While the JavaScript workaround may just work for your particular case, it certainly has disadvantages. Sometimes it may just be not so simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow.

Fortunately, there is a better way.

Using a user-defined Java Class step we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's Get Data from XML step.

We need to do two things to make this work: first, you have to download JTidy, unzip it, and place the jtidy-r938.jar in the libext directory, which resides immediately in your kettle installation directory. (note that if you were running spoon, you need to restart it before it will be picked up). Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the Tidy class inside the jar. With some help from the pentaho wiki and the JTidy JavaDoc documentation, I came up with the following Java snippet to make it work:

import org.w3c.tidy.Tidy;
import java.io.StringReader;
import java.io.StringWriter;

protected Tidy tidy;

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
//create and configure a Tidy instance
tidy = new Tidy();
tidy.setXmlOut(true);
return parent.initImpl(stepMetaInterface, stepDataInterface);
}

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
Object[] r;
//Get row from incoming stream.
//Bail out if it's not there.
if ((r = getRow()) == null) {
setOutputDone();
return false;
}

//read the value of the html input field
//the html field happens to be the 5th field in the stream,
//because java arrays start at 0, we use index 4 to reference it
StringReader html = new StringReader((String)r[4]);

//use tidy to parse html to xml
StringWriter xml = new StringWriter();
tidy.parse(html, xml);

//assign the xml to the output row
//note we simply overwrite the original html field from the input row.
r[4] = xml.toString();

//push the output row to the outgoing stream.
putRow(data.outputRowMeta, r);
return true;
}

(Tip: for more examples and background information on the user-defined java class step, check out the excellent blog posts by Slawomir Chodnicki, Matt Casters and the video walk-through by Daniel Einspanjer)

The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.

Wednesday, May 18, 2011

Check out this excellent article by Shlomi Noach!!

Check out this excellent article by Shlomi Noach!!

Really - my life is much happier now, and as a bonus I got a free set of steak knives and even lost 20 pounds. (and MySQL!)

Tuesday, May 10, 2011

Managing kettle job configuration

Over time I've grown a habit of making a configuration file for my kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done but against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, webservice urls and file system paths.

Variables


In the past, much has been written about using kettle variables, parameters and arguments. Variables are the basic features that provide the mechanism to configure the transformation steps and job entries: instead of using literal configuration values, you use a variable reference. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references - instead I'm just focussing on how to manage the configuration once you already used variable references inside your jobs and transformations.

Managing configuration


To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to the variables so any subsequent jobs and transformations can access the configuration data through variable references. The main job has one parameter called ${CONFIG_DIR} which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file:


Reading configuration properties


The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and the value the appropriate value. The following snippet should give you an idea:
#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9
The set-variables.ktr transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:
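To make the shape of that stream concrete, here is a rough sketch in plain JavaScript, outside Kettle. The readProperties() function is a hypothetical stand-in for the "Property Input" step, not Kettle code, and it only handles simple key=value lines and # comments (the real properties format allows more than that):

```javascript
// Hypothetical stand-in for the "Property Input" step:
// turn the text of a properties file into rows with a Key and a Value field.
function readProperties(text) {
  var rows = [];
  text.split(/\r?\n/).forEach(function (line) {
    var trimmed = line.trim();
    // skip blank lines and comment lines
    if (trimmed === "" || trimmed.charAt(0) === "#") return;
    var pos = trimmed.indexOf("=");
    if (pos === -1) return; // not a key=value line, ignore it
    rows.push({
      Key: trimmed.substring(0, pos).trim(),
      Value: trimmed.substring(pos + 1).trim()
    });
  });
  return rows;
}

var sample = [
  "#staging database connection",
  "STAGING_DATABASE=staging",
  "STAGING_HOST=localhost",
  "STAGING_PORT=3351"
].join("\n");

// one row per property, each with just two fields
var rows = readProperties(sample);
console.log(rows);
```

Each property becomes one row, and every row has the same two fields, no matter how many properties the file contains.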


Pivoting key/value pairs to use the "set variables" step


In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from selected fields in the incoming stream and assigning the field value to it. This means that you can't just feed the stream of key/value pairs from the property input step into the set variables step: the stream coming out of the property input step contains multiple rows with just two fields called "Key" and "Value". Feeding it directly into the "Set variables" step would just lead to creating two variables called Key and Value, and they would be assigned values multiple times for all key/value pairs in the stream. So in order to meaningfully assign variables, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream using the "Row Denormaliser" step:

As you can see in the screenshot, "Key" is the key field: the value of this field is scanned to determine in which output fields to put the corresponding value. There are no fields that make up a grouping: rather, we want all key/value pairs to end up in one big row. Or put another way, there is just one group comprising all key/value pairs. Finally, the grid below specifies for each distinct value of the "Key" field to which output field name it should be mapped, and in all cases, we want the value of the "Value" field to be stored in those fields.
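The pivot can be sketched in plain JavaScript as follows. This only illustrates the idea and is not how Kettle implements the step: the denormalise() function and the mapping object are hypothetical, with the mapping playing the role of the grid in the step dialog.

```javascript
// Pivot a stream of {Key, Value} rows into one single row.
// "mapping" lists, for each expected key, the output field that receives the value.
function denormalise(rows, mapping) {
  var out = {};
  rows.forEach(function (row) {
    var field = mapping[row.Key];
    // in this sketch, keys that are not listed in the mapping are simply dropped
    if (field !== undefined) {
      out[field] = row.Value;
    }
  });
  return out;
}

var keyValueRows = [
  { Key: "STAGING_HOST", Value: "localhost" },
  { Key: "STAGING_PORT", Value: "3351" },
  { Key: "STAGING_USER", Value: "staging" }
];

// every key has to be spelled out here; STAGING_USER was forgotten
var mapping = {
  STAGING_HOST: "STAGING_HOST",
  STAGING_PORT: "STAGING_PORT"
};

var pivoted = denormalise(keyValueRows, mapping);
console.log(pivoted);
```

Note that the mapping has to name every key explicitly. In this sketch a key that is missing from the mapping never reaches the output row.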

Drawbacks


There are two important drawbacks to this approach:

  • The "Row Denormaliser" uses the value of the keys to map the value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you manually need to keep the config.properties file and the "Row Denormaliser" synchronized, and in practice it's very easy to make mistakes here.
  • Due to the fact that the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific for just one particular project.

Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variable names anyway, I was tempted to store the configuration values themselves also inside the transformation (using a "generate rows" or "data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.

Solution: Javascript


As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation now looks like this:



The actual variable assignment is done with Kettle's built-in setVariable(key, value, scope) function. The key and value from the incoming stream are passed as its key and value arguments. The third argument of the setVariable() function is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide

  • "r" - up to the root

  • "p" - up to the parent job of this transformation

  • "g" - up to the grandparent job of this transformation

For my purpose, I settle for "r".
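A minimal sketch of what the script in such a JavaScript step could look like, assuming the incoming fields are named Key and Value as described earlier. The setVariable() stub, the processRow() wrapper and the sample rows are stand-ins so the snippet can run outside Kettle; inside the step, only the setVariable(Key, Value, "r") call would be needed, because Kettle runs the script once for every incoming row.

```javascript
// Stub standing in for Kettle's built-in setVariable(); only needed outside Kettle.
var variables = {};
function setVariable(key, value, scope) {
  variables[key] = { value: value, scope: scope };
}

// Hypothetical wrapper: mimics the step running its script once per row,
// with the row's fields available under their field names.
function processRow(Key, Value) {
  // this single line is the actual body of the script
  setVariable(Key, Value, "r");
}

// sample stream of key/value pairs (assumed data)
[
  { Key: "STAGING_HOST", Value: "localhost" },
  { Key: "STAGING_PORT", Value: "3351" }
].forEach(function (row) {
  processRow(row.Key, row.Value);
});

console.log(variables);
```

The script never mentions a variable name: whatever keys arrive in the stream become variables.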

The bonus is that this set-variables.ktr is less complex than the previous one and is now even completely independent of the content of the configuration. It has become a reusable transformation that you can use over and over.

Wednesday, January 26, 2011

NoSQL support lands in JasperSoft

JasperSoft, one of the leading open source BI suites, just announced it is delivering connectors for a range of so-called NoSQL databases. The big names are all there: Cassandra, MongoDB, Riak, HBase, CouchDB, Neo4J, Infinispan, VoltDB and Redis.

I used to explain to people that the lack of SQL support in NoSQL databases poses a challenge for traditional Business Intelligence tools, because those all talk either SQL or MDX (and maybe some XQuery/XPath). With this development, this is no longer true, and I want to congratulate JasperSoft on spearheading this innovation.

I still have a number of reservations though. Although I personally value the ability to report on data in my NoSQL database, I think its usefulness will have a number of limitations that are worth consideration.

Admittedly I am not an expert in the NoSQL database field, but as far as my knowledge goes, both the dynamo-style key/value stores like Riak, and the Bigtable-style hashtable stores like HBase and Cassandra can basically do 2 types of read operations: fetch a single object by key, or scan everything. The fetched object can be complex and contain a lot of data, and it would certainly be nice if you could run a report on that. The scan everything operation doesn't seem that useful at the report level: for all but trivial cases, you need considerable logic to make this scan useful, and I don't think a report is the right place for this. Apart from that, if the NoSQL solution was put in place because of the large data volume, then the report itself would probably need to be executed on a cluster just to achieve acceptable response time. I may be wrong but I don't think JasperReports supports that.

So, for a full scan of those NoSQL databases, connectors at the data integration end seem more appropriate. I think the integration of Hadoop with Pentaho data integration (a.k.a Kettle) is a step in the right direction, but of course only applicable if you're a Hadoop user.

Another point is data quality. Typically reporting is done on a data warehouse or reporting environment where the data quality is kept in check by processing the raw data with data integration and quality tools. Directly reporting on any operational database can be problematic because you skip those checks. Because the NoSQL databases offer virtually no constraints, those checks are even more important. So to me this seems like another reason why NoSQL connectivity is more useful in the data integration tools.

JasperSoft also offers connectivity for the MongoDB and CouchDB document stores. I think that for raw reporting on the actual source documents, the same reservations apply as I mentioned in relation to the dynamo and Bigtable style solutions. But there may be a few more possibilities here, at least for CouchDB.

CouchDB has a feature called views, which allows you to "query" the raw documents using a map/reduce job. I can certainly see why it'd be useful to build a report on top of that. Of course, you would still have to implement the logic to do a useful scan, and you would still have to deal with data quality issues, but you can do it in the map/reduce job, which seems a more appropriate place to handle this than a report.

All in all, I think this is a promising development, and I should probably get my feet wet and try it out myself. But for now, I would recommend keeping it out of the wrecking tentacles of unaware business users :)

Friday, January 07, 2011

MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part III

This is the third article in a series providing background information to my talk for the MySQL User's conference, entitled MQL-to-SQL: a JSON-based Query Language for RDBMS Access from AJAX Applications.

In the first installment, I introduced freebase, an open shared database of the world's knowledge and its JSON-based query language, the Metaweb Query Language (MQL, pronounced Mickle). In addition, I discussed the JSON data format, its syntax, its relationship with the de-facto standard client side browser scripting language JavaScript, and its increasing relevance for modern AJAX-based webapplications.

The second installment provides a brief introduction to MQL as database query language, and how it compares to the de-facto standard query language for relational database systems (RDBMS), the Structured Query Language (SQL). I argued that MQL has some advantages over SQL, in particular for programming modern webapplications. I mentioned the following reasons:
  • Since MQL is JSON, and JSON is JavaScript, it's a more natural fit for modern AJAX applications
  • MQL is almost trivial to parse, making it much easier to write tools such as editors, but also to implement advanced authorization policies
  • MQL is easy to generate: the structure of MQL queries is mirrored by their results. A fragment of the result can be easily augmented with a subquery making it easy to subsequently drill down into the retrieved dataset
  • MQL is more declarative than SQL. Both attributes and relationships are represented as JSON object properties and one need not and cannot specify join conditions to combine data from different types of objects. The "cannot" is actually A Good Thing because it means one cannot make any mistakes.
  • MQL is more focussed on just the data. It largely lacks functions to transform data retrieved from the database forcing application developers to do data processing in the application or middleware layer, not in the database.
In this article, I want to discuss common practices in realizing data access for applications, especially web applications, and how database query languages like SQL and MQL fit in there.

Web Application Data Access Practices

After reading the introduction of this article, one might get the idea that I hate relational databases and SQL. I don't, I love them both! It's just that when developing database applications, especially for the web, SQL isn't helping much. Or rather, it's just one tiny cog in a massive clockwork that has to be set up again and again. Let me explain...

The Data Access Problem

It just happens to be the case that I'm an application developer. Typically, I develop rich internet and intranet applications and somewhere along the line, a database is involved for storing and retrieving data. So, I need to put "something" in place that allows the user to interact with a database via the web browser.
The way I write that I need "something" so the user can interact with the database, it seems like it's just one innocent little thing. In reality, "something" becomes a whole bunch of things that need to work together:
  • Browsers don't speak database wire protocols - they speak HTTP to back-end HTTP servers. No matter what, there is going to be some part that is accessible via HTTP that knows how to contact the database server. Examples of solutions to this part of the problem are Common Gateway Interface (CGI) programs, server-side scripting languages like PHP or Perl (which are often themselves implemented as a CGI program) or in the case of Java, specialized Servlet classes
  • The component at the HTTP server that mediates between web browser and database is going to require a protocol: a set of rules that determine how a URI, a HTTP method and parameters can lead to executing a database command. Examples of approaches to design such a protocol are Remote Procedure Calls (RPC) and Representational State Transfer (REST)
  • There's the way back too: the application running in the web browser is going to have to understand the results coming back from the database. For data, a choice has to be made for a particular data exchange format, typically eXtensible Markup Language or JSON
  • The user interface at the browser end must not only understand the protocol we invented for the data exchange; ideally it should also guide the user and be able to validate whatever data the user is going to feed it. In other words, the user interface needs to have the metadata concerning the data exchange interface.

The Webservice Solution

A typical way to tackle the data access problem is:
  • analyze the functionality of the application, categorizing it into a series of clear and isolated actions
  • identify which data flows from application to database and back for each action
  • decide on a data representation, and a scheme to identify actions and parameters
  • create one or more programs in Java, Perl, PHP, Python, Ruby or whatever fits your stack that can execute the appropriate tasks and process the associated data flows to implement the actions
For web applications, the program or programs developed in this way are typically webservices that run as part of the process of the HTTP server. The client gets to do a HTTP request, to a particular URI, using the right HTTP method, and the right parameters. The service gets to process the parameters, execute tasks such as accessing a database, and finally send back a HTTP response, which typically contains data requested by the application.
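As a rough illustration of this approach, the following JavaScript sketch maps each action to its own handler. Everything in it is made up for the example: the customers array stands in for a database table, and the paths, parameter names and handleRequest() function are hypothetical rather than part of any real framework.

```javascript
// In-memory stand-in for a database table (hypothetical data).
var customers = [
  { id: 1, name: "Alice", city: "Utrecht" },
  { id: 2, name: "Bob", city: "Leiden" }
];

// One handler per action; an action is identified by HTTP method plus path.
var actions = {
  "GET /customers": function (params) {
    // optional filter on city
    return customers.filter(function (c) {
      return !params.city || c.city === params.city;
    });
  },
  "GET /customer": function (params) {
    var id = Number(params.id);
    return customers.filter(function (c) { return c.id === id; })[0] || null;
  }
};

// Dispatch a request to its action and wrap the result in a JSON response.
function handleRequest(method, path, params) {
  var action = actions[method + " " + path];
  if (!action) {
    return { status: 404, body: JSON.stringify({ error: "unknown action" }) };
  }
  return { status: 200, body: JSON.stringify(action(params || {})) };
}

console.log(handleRequest("GET", "/customers", { city: "Leiden" }));
console.log(handleRequest("GET", "/orders", {}));
```

Only the actions that were anticipated have a handler. A request for anything else, such as the orders in the last line, cannot be served until someone adds code on the server.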

Development Dynamics

The problem with the webservice approach is that it isn't very flexible. It presumes the application's functionality and hence the actions are quite well-defined. Although this looks reasonable on paper, in reality development tends to be quite evolutionary.

Typically, the core functionality of applications is quite well defined, but often a lot of additional functionalities are required. Although we can pretend these could be known in advance if only we'd spend more time designing and planning in advance, in practice, they often aren't. It may seem sad, but in many cases, the best way to find out is simply to start developing, and find out along the way. Agile is the latest buzzword that captures some of these development dynamics, but there have been other buzzwords for it in the past, such as RAD (rapid application development) and DSDM (dynamic systems development method).

The problem with this approach is that it requires a lot of going back-and-forth between front- and back-end development tasks: whenever the front-end wants to develop a new feature, it is often dependent upon the back-end offering a service for it. Front-end and back-end developers often are not the same people, so what we get is front-end development cycles having to wait on back-end development cycles to complete. Or in case front-end and back-end developers are the same person, they are constantly switching between tool sets and development environments.

In part this is because front-end development is usually done in JavaScript, and although server-side JavaScript is gaining ground, the server-side is still dominated mainly by Java, PHP, C++ and ASP.NET. But it's not just a programming language problem - developing a client, and especially a front-end for end-users, presumes a very different mindset than developing a back-end process. Front-end development should focus on usability and quality user-experience; back-end development should focus on robustness, reliability, availability, scalability and performance. Although some of these aspects influence each other, in practice, front-end development is simply a different cup of tea than back-end development.

A Simple Plan: Building a Query Service

There is a very simple solution that would largely solve the data access problem without dealing with the inefficiencies of the recurring development process: If you could build a single service that can accept any parameters, understand them, and somehow return an appropriate result, we would never have to add functionality to the service itself. Instead, the front end application would somehow have to construct the right parameters to tell the service what it wants whenever the need arises.

This sounds almost like magic, right? So we must be kidding, right? Well, we're not kidding, and it's not magic either; it's more like a cheap parlour trick.
As the title of this section suggests, a query service fits this bill.

It would be very easy to build a single service that accepts a query as a parameter, and returns its result as response. And seriously, it's not that strange an idea: many people use between one and perhaps ten or twenty different services exactly like this every day, multiple times...and it's called a search engine.

Can't we use something like that to solve our database access problem? Well, we could. But actually, someone beat you to it already.

DBSlayer, a Webservice for SQL Queries

A couple of years ago, The New York Times released DBSlayer. DBSlayer (DataBase accesS layer) is best described as a HTTP server that acts as a database proxy. It accepts regular SQL queries via a parameter in a regular HTTP GET request, and sends a HTTP response that contains the resulting data as JSON. It currently supports only MySQL databases but announcements were made that support was planned for other database products too. DBSlayer is actually a bit more than just a database access layer, as it also supports simple failover and round-robin request distribution, which can be used to scale out database requests. But I mention it here, because it implements exactly the kind of query service that would appear to solve all the aforementioned problems.

Or would it?

Every web developer and every database administrator should realize immediately that it's not a good idea. At least, not for internet-facing applications anyway. The DBSlayer developers documented that themselves quite clearly:
"Access to the DBSlayer can be controlled via firewalls; the DBSlayer should never be exposed to the outside world."
... and ...
"The account DBSlayer uses to access the MySQL database should not be allowed to execute dangerous operations like dropping tables or deleting rows. Ideally, the account would only be able to run selects and/or certain stored procedures."
So there's the rub: it may be very easy and convenient from the application development point of view, but it is a horrendous idea when you think about security. A general purpose SQL query service is simply too powerful.

If a web application accidentally allows arbitrary SQL to be executed, it would be called an SQL injection vulnerability, and it would be (or at least, should be) treated as a major breach of security. Creating a service that offers exactly that behavior as a feature doesn't lessen the security concerns at all.

What about a MQL query service?

In this article I tried to explain the problems that must be solved in order to arrange and manage data access for web applications. The key message is that we need to create a service that provides data access. But in doing so, we have to balance between security, functionality and flexibility.

It is fairly easy to create a webservice that exactly fulfills a particular application requirement, thus ensuring security and manageability. However, this will usually be a very inflexible service, and it will need lots of maintenance to keep up with change in application requirements. It's also easy to create a webservice that is at least as powerful as the underlying database: this would be a database proxy over HTTP, just like DBSlayer. Although it will likely never need to change since it simply passes requests on to the back-end database, it is very hard to secure it in a way that would allow external requests from possibly malignant users.

I believe that an MQL webservice actually does offer the best of both worlds, without suffering from the disadvantages. A MQL query service will be flexible enough for most web applications - MQL queries are only limited by the underlying data model, not by the set of application-specific actions designed for one particular purpose. At the same time, it will be relatively easy to efficiently analyze MQL queries and apply policies to prevent malicious use. For example, checking that a MQL query doesn't join more than X tables is quite easy.
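To make the last point a bit more concrete, here is a minimal sketch of such a policy check. It rests on an assumption that is mine and not part of MQL itself: every nested object in a query is taken to stand for one relationship that must be traversed, which is only a crude proxy for a join. The function names countSubqueries and isWithinLimit are made up for this illustration.

```javascript
// Count the nested objects (subqueries) in a MQL query object.
// Assumption: each nested object stands for one relationship to traverse.
function countSubqueries(query) {
  var count = 0;
  Object.keys(query).forEach(function (key) {
    var value = query[key];
    // A subquery may be written as {...} or wrapped in an array: [{...}].
    var items = Array.isArray(value) ? value : [value];
    items.forEach(function (item) {
      if (item !== null && typeof item === "object") {
        count += 1 + countSubqueries(item);
      }
    });
  });
  return count;
}

// Hypothetical policy: accept the query only if it stays within the limit.
function isWithinLimit(query, maxSubqueries) {
  return countSubqueries(query) <= maxSubqueries;
}

var query = {
  type: "/chemistry/chemical_element",
  name: "Oxygen",
  isotopes: [{ name: null, stable: null }]
};

console.log(countSubqueries(query));  // 1
console.log(isWithinLimit(query, 0)); // false
```

Note that this sketch ignores empty arrays, which may also trigger a lookup of related objects; a real policy would need knowledge of the schema to decide that.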

In the forthcoming installment, I will explore the concept of a MQL webservice in more detail, and I will explain more about the MQL-to-SQL project. As always, I'm looking forward to your comments, suggestions and critique so don't hesitate to leave a comment.

MQL-to-SQL: A JSON-based query language for your favorite RDBMS - Part II

This is the second article in a series to provide some background to my talk for the MySQL User's conference.
The conference will be held April 11-14 2011 in the Hyatt Regency hotel in Santa Clara, California.

Abstract: MQL is a JSON-based database query language that has some very interesting features as compared to SQL, especially for modern (AJAX) web-applications. MQL is not a standard database query language, and currently only natively supported by Freebase. However, with MQL-to-SQL, a project that provides a SQL adapter for MQL, you can run MQL queries against any RDBMS with a SQL interface.

In my previous post, I covered some background information on modern web applications, JavaScript and JSON.
The topic of this installment is the MQL database query language and how it compares to SQL. In a third article, I will discuss the mql-to-sql project,
which implements a MQL service for relational database systems.

MQL Queries

A MQL query is either a JSON object, or a JSON array of objects. In fact, the sample object we discussed in the previous section about JSON is nearly a valid MQL query. For now, consider the following JSON representation of the chemical element Oxygen:
{
"type": "/chemistry/chemical_element",
"name": "Oxygen",
"symbol": "O",
"atomic_number": 8,
"ionization_energy": 13.6181,
"melting_point": -2.1835e+2,
"isotopes": []
}

The difference with the example from my previous post is that here, we added a key/value pair at the top, "type": "/chemistry/chemical_element", and we use an empty array ([]) as the value for the "isotopes" key.

So how is this a query? Didn't we claim JSON is a data format? So isn't this by definition data? And if this is data, how can it be a query?

Well, the answer is: yes, JSON is a data format, and so, yes: by definition, it must be data. But the paradox of how some data can also be a query is quite easily solved once you realize that there are some pieces missing from the data. In this particular case, we left the "isotopes" array empty, whereas Oxygen has a bunch of isotopes in the real world.

This is how a JSON object can be a query:
  • By specifying values that describe known facts about an object, we define a filter or access path that can be used to find data in some database. For example, the JSON fragment above states that there exists an object of the type "/chemistry/chemical_element" which has the name "Oxygen", and the symbol "O", among other characteristics.
  • By specifying special placeholders like null or an empty array or object (like [] and {} respectively), we define what data should be retrieved from the database so it can be returned in the result.
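The two roles listed above can be told apart mechanically. The following is a minimal sketch, not part of any MQL implementation: the helper names isPlaceholder and splitQuery are made up for this illustration, and the sketch only looks at the top level of the query object.

```javascript
// A value is a placeholder if it is null, an empty array or an empty object.
function isPlaceholder(value) {
  if (value === null) return true;
  if (Array.isArray(value)) return value.length === 0;
  return typeof value === "object" && Object.keys(value).length === 0;
}

// Split the keys of a query into those that filter and those that ask for data.
function splitQuery(query) {
  var parts = { filters: [], placeholders: [] };
  Object.keys(query).forEach(function (key) {
    if (isPlaceholder(query[key])) {
      parts.placeholders.push(key);
    } else {
      parts.filters.push(key);
    }
  });
  return parts;
}

var parts = splitQuery({
  type: "/chemistry/chemical_element",
  name: "Oxygen",
  symbol: "O",
  atomic_number: 8,
  isotopes: []
});

console.log(parts.filters);      // [ 'type', 'name', 'symbol', 'atomic_number' ]
console.log(parts.placeholders); // [ 'isotopes' ]
```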

Executing a query with the Freebase MQL query editor


The easiest way to execute the query above is by using the Freebase Query Editor. Here's a screenshot:

In the screenshot, the upper left side is the area where you can type or paste your MQL query. You can then press the Run button, which is located below the query area on its right side.

When you hit the Run button, the query is sent to a special webservice called the mqlread service. The results appear in the results area on the right.

MQL Query Results

In the Text tab of the results area of the MQL query editor, you can see the raw response. It should be something like this:
{
"code": "/api/status/ok",
"result": {
"atomic_number": 8,
"ionization_energy": 13.6181,
"isotopes": [
"Oxygen-15",
"Oxygen-16",
"Oxygen-17",
"Oxygen-18",
"Oxygen-28",
"Oxygen-19",
"Oxygen-20",
"Oxygen-23",
"Oxygen-25",
"Oxygen-24",
"Oxygen-13",
"Oxygen-22",
"Oxygen-26",
"Oxygen-21",
"Oxygen-14",
"Oxygen-27",
"Oxygen-12"
],
"melting_point": -218.35,
"name": "Oxygen",
"symbol": "O",
"type": "/chemistry/chemical_element"
},
"status": "200 OK",
"transaction_id": "cache;cache01.p01.sjc1:8101;2010-05-15T03:08:10Z;0004"
}
As you can see, the response is again a JSON object. The outermost object is the result envelope, which contains the actual query result assigned to the result property, as well as a few other fields with information about how the request was fulfilled. (Actually, when we hit the Run button, our MQL query was also first embedded into a query envelope before it was sent to the mqlread service, but let's worry about those details later.)
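An application will usually want to peel off the envelope before using the data. Here is a minimal sketch of how that could look. The "code" and "result" fields and the "/api/status/ok" value are taken from the response shown above; the function name unwrapResult and the failing status value used in the example are assumptions made only for illustration.

```javascript
// Return the actual query result from a result envelope,
// or throw if the envelope does not report success.
function unwrapResult(envelope) {
  if (envelope.code !== "/api/status/ok") {
    throw new Error("MQL request failed: " + envelope.code);
  }
  return envelope.result;
}

var envelope = {
  code: "/api/status/ok",
  result: { name: "Oxygen", symbol: "O" },
  status: "200 OK"
};

console.log(unwrapResult(envelope).symbol); // O
```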

Query by Example, or filling-in-the-blanks

If you compare the actual query result assigned to the "result" property of the result envelope with the original query, you will notice that they are largely the same: at least, the keys of query object and result object all match up, and so do the values, except the one for the "isotopes" key: whereas the "isotopes" property was an empty array in the query, in the result it contains an array filled with elements, each representing a particular isotope of the Oxygen element. These array elements were retrieved from the Oxygen entry stored in Freebase, which was found by matching the properties like "name": "Oxygen" and "atomic_number": 8.

This illustrates an important concept of MQL queries: the query result mirrors the structure of the query itself. Wherever the query contains a special placeholder value, such as an empty array ([]), a null value or an empty object ({}), the query engine will "fill in the blanks" with data that is retrieved from the database.

Another way of putting it is to say a MQL query is a query by example.
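To get a feel for the mechanics, here is a toy query-by-example function that works on a plain in-memory array. This is only a sketch under simplifying assumptions: the data layout is hypothetical and has nothing to do with how Freebase stores its data, it handles only flat queries (no nested objects), and the names database, isBlank and queryByExample are made up.

```javascript
// Toy "database": a plain array of records (hypothetical layout).
var database = [
  { type: "/chemistry/chemical_element", name: "Oxygen", symbol: "O",
    atomic_number: 8, isotopes: ["Oxygen-16", "Oxygen-17", "Oxygen-18"] },
  { type: "/chemistry/chemical_element", name: "Helium", symbol: "He",
    atomic_number: 2, isotopes: ["Helium-3", "Helium-4"] }
];

// In this sketch, null and the empty array are the only placeholders.
function isBlank(value) {
  return value === null || (Array.isArray(value) && value.length === 0);
}

function queryByExample(query, records) {
  var keys = Object.keys(query);
  // Find the first record that agrees with every known value in the query.
  var match = records.filter(function (record) {
    return keys.every(function (key) {
      return isBlank(query[key]) || record[key] === query[key];
    });
  })[0];
  if (!match) return null;
  // Mirror the structure of the query, filling in the blanks from the record.
  var result = {};
  keys.forEach(function (key) {
    result[key] = isBlank(query[key]) ? match[key] : query[key];
  });
  return result;
}

var result = queryByExample(
  { type: "/chemistry/chemical_element", name: "Oxygen", symbol: null, isotopes: [] },
  database
);
console.log(JSON.stringify(result));
```

The result has exactly the keys of the query, in the same order, which is the mirroring described above.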

To hammer this concept down further, consider the following similar query. In this case, it contains even more special placeholder values (namely, null values for all scalar properties except "type" and "name"):
{
"type": "/chemistry/chemical_element",
"name": "Oxygen",
"symbol": null,
"atomic_number": null,
"ionization_energy": null,
"melting_point": null,
"isotopes": []
}
If you execute this query, you get a result that is essentially the same as the previous one. Because chemical elements are identified by name (there are more properties that can identify a chemical element, name is just one of them), this query will match the same object in Freebase. This query specifies null for almost all other properties, and because null is a special placeholder for scalar values, the query engine responds by retrieving the values for those keys and returns them in the result.

Differences between MQL and SQL

Even this simple example reveals a lot about the differences between MQL and SQL.

Symmetry between Query and Result

In the previous example, we just saw that the query and its result have a similar structure. It's like the query is mirrored in the result. In SQL, this is very different. Let's see how different.

Assume for a minute we would have a relational database with a schema called "chemistry" and a table called "chemical_element". We could write an SQL query like this:
SELECT  name
, symbol
, atomic_number
, ionization_energy
, melting_point
FROM chemistry.chemical_element
WHERE name = 'Oxygen'
...and the result would look something like this:

+--------+--------+---------------+-------------------+---------------+
| name | symbol | atomic_number | ionization_energy | melting_point |
+--------+--------+---------------+-------------------+---------------+
| Oxygen | O | 8 | 13.6181 | -218.35 |
+--------+--------+---------------+-------------------+---------------+
Even if we forget for a moment that this particular SQL query doesn't handle retrieving the isotopes, the difference between the query and the result is striking - the structure of the SQL query, which is basically a piece of text, has very little to do with the structure of the result, which is unmistakably tabular.

Everybody knows that SQL was designed for relational databases, so we probably shouldn't be too surprised that the result of the SQL query has a tabular form. But why does the query have to be text? Couldn't the query have been tabular too, leaving blanks where we'd expect to retrieve some data?

Lest I lose the proponents of SQL and the relational model here: don't get me wrong - I realize that an SQL query isn't just "a piece of text"; rather SQL attempts to express facts about database state using a practical form of relational algebra, and the result represents the set that satisfies the facts stipulated by the query. And as we shall see later on, MQL actually has a few operator constructs that resemble the ones found in SQL. But for now the message is: MQL queries and their results look a lot like one another; SQL queries and their results do not.

Relational focus

I already mentioned that MQL is expressed in JSON, and we saw how both query and result are structured as objects. SQL results are always tabular. Superficially, this seems like the right thing to do for a query language that is supposed to work on and for relational database systems. But is it really that natural?

Many textbooks and courses about relational databases and SQL spend much time and effort on the topic of normalization, and rightly so! One of the great achievements of the relational model and normalization is that it helps to minimize or even eliminate data integrity problems that arise in non-relational storage structures and their inherent redundancy.

A typical textbook or course may start with a modelling exercise with some real-world, unnormalized (sub 1NF) data as input. The first mental step is to normalize that data to the first normal form (1NF), by splitting off multivalued attributes (a single data item with a list of values) and their big brother, repeating groups (a "table-inside-a-table"), to separate tables, on and on until all repeating groups and multi-valued attributes are eliminated. In the next phases of normalization, different forms of data redundancy are removed, moving up to even higher normal forms, typically finishing at the third normal form (3NF) or the Boyce-Codd normal form (BCNF).

The result of this normalization process is an overview of all separate independent sets of data, with their keys and relationships. From there, it's usually a small step to a physical database design. For example, in our Oxygen example, the list of isotopes is a multi-valued attribute. In a proper relational design, the isotopes would typically end up in a separate table of their own, with a foreign key pointing to a table containing the chemical elements.

The textbook usually continues with a crash course in SQL, and chances are the SELECT syntax is the first item on the menu. Within a chapter or two, you'll learn that although normalized storage is great for data maintenance, it isn't great for data presentation. Because applications and end users care a lot about data presentation, you'll learn how to use the JOIN operator to combine the results from related database tables. Usually the goal of that exercise is to end up with a result set that is typically 1NF, or at least, has a lower normal form than the source tables.

Don't get me wrong: this is great stuff! The relational model isn't fighting redundant data: it's fighting data integrity issues, and the lack of control that results from storing redundant data. By first decomposing the data into independent sets for storage, and then using things like JOIN operations to re-combine them, SQL offers a good method to master data integrity issues, and to deliver consistent, reliable data to applications and end-users.

It's just a pity SQL forgot to finish what it started. Recall that the text-book modeling exercise started by eliminating repeating groups and multi-valued attributes to achieve 1NF. What SQL queries should we use to transform the data back to that format?

SQL turns out to be so single-mindedly focused on the relational model that it simply can't return sub-1NF, unnormalized data. Much like the square from Flatland can't comprehend the sphere from Spaceland, SQL simply hasn't got a clue about nested data structures, like multi-valued attributes and repeating groups.

Somewhere along the way, SQL forgot that the text-book course started with real-world, unnormalized data, full of repeating groups and multivalued attributes.

This represents quite a share of challenges for database application development. One class of software solutions that deals with this problem is the so-called object-relational mappers (ORMs). It would not do enough credit to ORMs to claim that their only purpose is to solve this problem, but it's definitely a major problem they take care of.
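To illustrate the kind of work the application (or an ORM) has to do, here is a minimal sketch that folds the flat rows of a join back into a nested structure. The rows, the column alias isotope_name and the function name nestIsotopes are all hypothetical; the sketch only shows the general idea and is not how any particular ORM works.

```javascript
// Flat rows, as a join of elements and isotopes might return them.
var rows = [
  { name: "Oxygen", symbol: "O", isotope_name: "Oxygen-16" },
  { name: "Oxygen", symbol: "O", isotope_name: "Oxygen-17" },
  { name: "Oxygen", symbol: "O", isotope_name: "Oxygen-18" }
];

// Group the rows by element and collect the isotopes in an array,
// restoring the multi-valued attribute that normalization split off.
function nestIsotopes(rows) {
  var byName = {};
  var elements = [];
  rows.forEach(function (row) {
    var element = byName[row.name];
    if (!element) {
      element = { name: row.name, symbol: row.symbol, isotopes: [] };
      byName[row.name] = element;
      elements.push(element);
    }
    element.isotopes.push(row.isotope_name);
  });
  return elements;
}

console.log(JSON.stringify(nestIsotopes(rows)));
```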

Parsing Queries

Everyone who has tried it knows that it isn't exactly trivial to write a fast yet fully functional SQL parser. Being able to parse SQL is a requirement for tools like query editors and report builders, but also for proxies and monitoring tools.

Parsing MQL on the other hand is almost trivially simple. At the application level, this would in theory make it quite easy to implement advanced access policies and limit the complexity of the queries on a per user or role basis.
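Here is a minimal sketch of what that could look like: the whole "parser" is a call to JSON.parse, and the access policy is a lookup of the query's "type" in a per-role whitelist. The roles, the whitelist contents and the function name checkQuery are made up for this illustration, and the sketch only inspects the top level of the query.

```javascript
// Hypothetical policy: the types each role is allowed to query.
var allowedTypes = {
  guest: ["/chemistry/chemical_element"],
  admin: ["/chemistry/chemical_element", "/chemistry/isotope"]
};

function checkQuery(jsonText, role) {
  var query;
  try {
    query = JSON.parse(jsonText); // this is the entire parsing step
  } catch (e) {
    return { ok: false, reason: "not valid JSON" };
  }
  if (query === null || typeof query !== "object") {
    return { ok: false, reason: "not a query object" };
  }
  var permitted = Object.prototype.hasOwnProperty.call(allowedTypes, role)
    ? allowedTypes[role]
    : [];
  if (permitted.indexOf(query.type) === -1) {
    return { ok: false, reason: "type not allowed for role " + role };
  }
  return { ok: true, query: query };
}

console.log(checkQuery('{"type": "/chemistry/chemical_element", "name": "Oxygen"}', "guest").ok); // true
console.log(checkQuery('{"type": "/chemistry/isotope", "name": null}', "guest").ok);              // false
```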

Generating Queries

One of the things I like about MQL is that applications have to do a lot less work to formulate a query that drills down into some detail of a previously returned data set. For example, from the result of our query about Oxygen, we just learned that there is an isotope called "Oxygen-16". Suppose we want to know more about that particular isotope, say, its relative abundance, and whether it's stable or not.

With SQL, we would have to construct a new algebraic expression that somehow combines the set of chemical elements with the set of isotopes. Although we would certainly need some data from the tabular result obtained from the previous query, we have little hope of actually re-using the query itself - at the application level, the SQL query is most likely just a piece of text, and it's probably not worth it to use string manipulation to forge a new query out of it.

Here's an example which shows the parts that should be added to accommodate this requirement, just to show that such a change isn't localized to just one spot in the original query:
SELECT  e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable

FROM chemistry.chemical_element e
INNER JOIN chemistry.isotope i ON e.name = i.element_name
WHERE e.name = 'Oxygen'
AND i.name = 'Oxygen-16'
I won't let my head explode over the string manipulation code required to change the original query into this one. If you like, post clever solutions as a comment to this post :)

With MQL, this task is considerably easier. The query and the result have a high degree of correspondence. In our application, both would typically be represented as objects or structs. In most programming languages, it is trivial to go from the original Oxygen Query to an augmented form that retrieves the details about the isotope "Oxygen-16". This is especially true for JavaScript, where we'd simply write something like:
//execute the query and obtain the result.
//note: in a real application this would typically be an asynchronous request
//to the mqlread service which would accept and return JSON strings.
//For simplicity's sake we pretend we have a mqlRead function that can accept
//a regular JavaScript object literal, convert it into a JSON string, and
//call the mqlread service synchronously, parse the JSON query result
//into a JavaScript object and return that to the caller.

var queryResult = mqlRead({
type: "/chemistry/chemical_element",
name: "Oxygen",
symbol: null,
atomic_number: null,
ionization_energy: null,
melting_point: null,
isotopes: []
});

//assign a subquery for "Oxygen-16" to the isotopes property of the queryResult.
//Remember, the queryResult has the same structure as the original query, just
//with the null's and the empty arrays filled with data.

queryResult.isotopes = {
name: "Oxygen-16",
natural_abundance: null,
stable: null
};

//execute the modified query:

queryResult = mqlRead(queryResult);
(In the first example where we discussed the Oxygen query, you might've noticed that in the query result, the isotopes member was an array of strings, each representing a particular isotope. So naturally, you might be tempted to think that the value of the isotopes property is an array of strings. However, this is not quite the case. Rather, the isotopes property stands for the relationship between an element and its isotopes. Due to the form of the original query (having the empty array for the isotopes property), the MQL query engine responds by listing the default property of the related isotopes. In Freebase, name is a special property and typically that's used as default property. So in that previous query result, the isotopes were represented only by their name. In the example above however, we assign a single object literal to the isotopes property which identifies one particular isotope. Because the isotopes property represents a relationship, the query should be read: "find me the oxygen element and the related isotope with the name Oxygen-16", and not "find me the oxygen element that has Oxygen-16 as its isotope".)

Of course, it's entirely possible to design data structures to hold all the data you need to generate your SQL queries (a query model). You can easily come up with something that would allow such a change to be made just as easily. But you need at least one extra step to generate the actual SQL string to send to the database. And of course, you need some extra steps again to extract data from the resultset for forging more queries.
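To show what such a query model and the extra generation step might look like, here is a deliberately tiny sketch. The model layout and the function names quote and toSql are made up for this illustration. The literal quoting is naive and only there to keep the example self-contained; real code should use prepared statements with bound parameters.

```javascript
// Hypothetical query model: plain data that describes a simple SELECT.
var model = {
  columns: ["name", "symbol", "atomic_number"],
  table: "chemistry.chemical_element",
  where: { name: "Oxygen" }
};

// Naive literal quoting, for illustration only.
function quote(value) {
  if (typeof value === "number") return String(value);
  return "'" + String(value).replace(/'/g, "''") + "'";
}

// The extra step: turn the model into an SQL string.
function toSql(model) {
  var conditions = Object.keys(model.where).map(function (column) {
    return column + " = " + quote(model.where[column]);
  });
  var sql = "SELECT " + model.columns.join(", ") + " FROM " + model.table;
  if (conditions.length > 0) {
    sql += " WHERE " + conditions.join(" AND ");
  }
  return sql;
}

console.log(toSql(model));
// SELECT name, symbol, atomic_number FROM chemistry.chemical_element WHERE name = 'Oxygen'
```

Changing the query is now a matter of changing the model, for example with model.columns.push("melting_point"), but the string still has to be generated again before it can be sent to the database.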

In MQL, the effort to go from query to result and back is about as minimal as it can get. This results in less application code, which tends to be easier to understand.

SQL is declarative, but MQL is more so

When discussing its merits as a programming language, it is often mentioned that SQL is declarative rather than procedural. Often this is presented as an advantage: with a declarative language like SQL, we get to focus on the results we want, whereas a procedural language would force us to code all kinds of details about how these results should be obtained. Or so the story goes.

I won't deny SQL is declarative. For example, I don't need to spell out any particular data access algorithm required to find the Oxygen element and its isotopes, I just write:
SELECT      e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable
FROM chemistry.chemical_element e
INNER JOIN chemistry.chemical_element_isotope i
ON e.atomic_number = i.atomic_number
WHERE e.name = 'Oxygen'
But still: in order to successfully relate the chemical_element and chemical_element_isotope tables, I need to spell out that the values in chemical_element_isotope's atomic_number column have to be equal to the value in the atomic_number column of chemical_element. Come to think of it, how can I know the relationship is built on atomic_number, and not on symbol or name? And heaven forbid we accidentally compare the wrong columns, or forget one of the join conditions...

Now compare it to the equivalent MQL query:
{
"type": "/chemistry/chemical_element",
"name": "Oxygen",
"symbol": null,
"atomic_number": null,
"ionization_energy": null,
"melting_point": null,
"isotopes": [{
"name": null,
"natural_abundance": null,
"stable": null
}]
}

The SQL query may be declarative, but compared to the MQL query, it requires a lot more knowledge of the underlying data model. All we had to do in MQL is specify an isotopes property, and list whatever we want to retrieve from the corresponding isotope instances. The only way we can mess up the MQL query is when we specify the wrong property names, in which case our query would simply fail to execute. In the SQL query, we could've been mistaken about which columns to compare, and get no result at all, or worse, a rubbish result. And with just a bit of ill luck, we can accidentally cause a cartesian product. Just for the hell of it, spot the error in the following SQL statement:
SELECT      e.name
, e.symbol
, e.atomic_number
, e.ionization_energy
, e.melting_point
, i.name
, i.natural_abundance
, i.stable
FROM chemistry.chemical_element e
INNER JOIN chemistry.chemical_element_isotope i
ON e.atomic_number = e.atomic_number
WHERE e.name = 'Oxygen'

Computational Completeness


Everybody with some experience in SQL programming knows that SQL is much more than a database query language. Even standard SQL is chock-full of operators and functions that allow you to build complex expressions and calculations. In fact, most SQL dialects support so many functions and operators that the manual needs at least a separate chapter to cover them. Algebra, Encryption, String formatting, String matching, Trigonometry: these are just a few categories of functions you can find in almost any SQL dialect I have heard of.

By contrast, MQL is all about the data. MQL defines a set of relational operators, but their function and scope is limited to finding objects, not doing calculations on them. There is exactly one construct in MQL that resembles a function, and it is used for counting the number of items in a result set.

Personally, I think MQL would be better if it had a few more statistical or aggregate constructs like count. But overall, my current thinking is that the fact that MQL lacks the function-jungle present in most RDBMS-es is actually A Good Thing(tm). At the very least, it ensures queries stay focused on the data, and nothing but the data.

Next Time


In this article I discussed the basics of the MQL query language, and I compared SQL and MQL on a number of accounts. In the next installment, I will discuss how this relates to developing data access services for web applications.
