Wednesday, May 26, 2010

A small issue of SQL standards

From a functional perspective, the core SQL support in all major and minor RDBMS-es is reasonably similar. In this light, it's sometimes quite disturbing to find how some very basic things work so differently across different products. Consider this simple statement:
SELECT  'a' /* this is a comment */ 'b'
FROM onerow
What should the result be? (You can assume that onerow is an existing table that contains one row)

It turns out popular RDBMS-es mostly disagree with one another.

In Oracle XE, we get this:
SELECT  'a' /* comment */ 'b'
*
ERROR at line 1:
ORA-00923: FROM keyword not found where expected

PostgreSQL 8.4 also treats it as a syntax error, and thus seems compatible with Oracle's behavior:
ERROR:  syntax error at or near "'b'"
LINE 1: SELECT 'a' /* this is a comment */ 'b'

In Microsoft SQL Server 2008 we get:
b
-
a

(1 rows affected)
As you can see, MS SQL treats the query as SELECT 'a' AS b FROM onerow.

Finally, in MySQL 5.1, we get:
+----+
| a |
+----+
| ab |
+----+
1 row in set (0.00 sec)
So in MySQL, its as if the comment isn't there at all, and as if the string literals 'a' and 'b' are actually just one string literal 'ab'.

So what does the SQL standard say? In my copy of the 2003 edition, I find this (ISO/IEC 9075-2:2003 (E) 5.3 <literal>, page 145):
Syntax Rules
1) In a <character string literal> or <national character string literal>, the sequence:
<quote> <character representation>... <quote> <separator> <quote> <character representation>... <quote>
is equivalent to the sequence
<quote> <character representation>... <character representation>... <quote>
If we lookup the definition of <separator>, it reads:
<separator> ::= { <comment> | <white space> }...
So in this case, MySQL does the "right" thing, and basically ignores the comment, treating 'a' and 'b' as a single string constant 'ab'.
UPDATE 1: As always, the devil is in the details. And trust me, the SQL standard has many of them (details that is - I'll leave it up to the reader to decide for the devils, although I have a suspicion in a particular direction). Read on, and make sure to read Nick's comment on this post - it turns out PostgreSQL seems to behave exactly according to the standard in this case.

UPDATE 2: Serg also posted a comment citing yet another part of the standard that states that all comments implicitly count as a newline. This would mean that there doesn't have to be a literal newline character in or following the comment. In this case, my original remark that MySQL got it right would hold again.

I should state that I think very highly of both Nick and Serg, and as far as I am concerned, they're both right. I can't help but seeing this as yet more support for my statement that the SQL standard is so complex it is almost or perhaps completely impossible to get it right.

Do you find this too bold? If so, I'd really love to hear your thoughts on it. Please help us solve this paradox, I only want to understand what the standard really says.
If you try the same thing with a single line comment, all products mentioned react the same as with the initial query, except for PostgreSQL, which now treats the query according to the standard.

Now don't get me wrong. This post is not designed to bash or glorify any of the products mentioned. I think all of them are great in their own way. I am quite aware that although MySQL happens to adhere to the standard here, it violates it in other places. Finally, I should point out that I don't have a specific opinion on what the right behavior should be. I just want it to be the same on all platforms.

At the same time, I realize that for SQL it's probably too late - up to an extent, incompatibility is considered normal, and database professionals tend to be specialized in particular products anyway. So I'm not holding my breath for the grand unification of SQL dialects.

When I encountered this issue, I did have to think about that other rathole of incompatibilities I have to deal with professionally, which is web-browsers. An interesting development there is the HTML 5 specification, which actually defines an algorithm for parsing HTML - even invalid HTML. This is quite different from the approach taken by most standards, which typically define only an abstract grammar, but leave the implementation entirely up to the vendors. In theory, providing parsing instructions as detailed as done in HTML 5 should make it easier to create correct parsers, and hopefully this will contribute to a more robust web.

Anyway. That was all. Back to work...

UPDATE: I just heard that Sybase (unsurprisingly) behaves similar to MS SQL for this query (that is, query is valid, and returns 'a' in a column called b). I checked SQLite myself, which is also in that camp.

Nick also pointed out that LucidDB also provides a standard compliant implementation, in other words, it behaves exactly like PostgreSQL for this particular query. However, Julian, who was and is closely involved in LucidDB agrees with Serg that the comment should probably count as a newline, and filed a bug for LucidDB.

I checked Firebird 2.1.3, and they are in the Oracle camp: in both cases, the query gives a syntax error.

23 comments:

kabel said...

I think that you've got a good point -- it's probably too late. Not to be pessimistic, but with the NoSQL movement gaining attention, if not traction, it seems like the major SQL RDBMSes are more concerned with competition/proving their relevancy than with standards compliance (among other things). It's a shame, really.

Side note -- blog more! You're always a welcome sight in my RSS aggregator.

rpbouman said...

Hi kabel!

indeed - my remarks wasn't meant to be pessimistic - it's just the way it is, most of the time we can find a way to work around the incompatibilities.

The only reason I seized the opportunity now is that I happened to run into this particular thing, and because the feature seems so basic (I mean, string literals are about as basic as it gets, right?), and the differences so remarkable.

As for blogging: thanks for your kind words! I appreciate it a lot. I would blog more if I had the time - I used to do semi-articles but I find I have less and less time to sit down to write them. If all goes well I should have a bit more time in the not-too-distant future.

Thanks, and kind regards,

Roland.

Antony said...

Wow, this is interesting/useful to know, since I sometimes use the string concatenation literals. Since it has been in the standard for a long time, I had always assumed that the other databases supported it as well.

rpbouman said...

Antony, this is exactly why I hit this issue. I am looking for a way to write cross-RDBMS compatible SQL scripts that do a lot of string concatenation. It's mindblowing how incredibly hard this is:

0) CONCAT(), the solution that's most familiar to MySQL users won't do. It is not supported in PostgreSQL or MS SQL, and Oracle only supports a two-argument version of that function.
1) The SQL standard defines || as string concatenation operator, which is supported by Oracle and Postgres
2) In MySQL you can use || but only after including PIPES_AS_CONCAT in the sql_mode
3) In MS SQL, all hope is lost as they only support the + (plus) as string concatenation operator.

The standard SQL comments feature seems like a nice way out but as you can see it is not supportted well enough to rely on. Plus, it only works for string *literals*, not arbitrary sting expressions.

Anonymous said...

There are various standards over the years. This could be the new spec

rpbouman said...

@Anonymous: do you mean, the SQL standard has a couple of editions? Yes, that's true. I quoted from the 2003 edition. The same text is also in the 1999 and 1992 editions.

I'd be glad to look it up in both earlier and newer versions of the standard (anyone have links to share?) but I suspect something like this isn't likely to change much.

If you ask me, SQL is simply a classic case of failed standard. Many people point out that the problem is that the standard came after the implementations, but I don't think that is the real problem. The real problem is that it is almost impossible to grasp the standard in its entirety, you literally need to be an expert on it to wield it. Complex standards tend to not get implemented very well.

Unknown said...

Roland,

I tried the same on LucidDB and came up with the same behavior as Postgres.

-- ON ONE SINGLE LINE
SELECT 'a' /* this is a comment */ 'b' from onerow;
Error: From line 1, column 37 to line 1, column 39: String literal continued on same line (state=,code=0)
-- ON TWO LINES
SELECT 'a' /* this is a comment */
'b' from BENCH100;
+---------+
| EXPR$0 |
+---------+
| ab |
+---------+

I reviewed the Farrago/LucidDB validator, and there's a code comment in there that is very specific that says "// per the SQL std, each string fragment must be on a different line." So, my reading of your above led me to the same conclusion... but, like you so aptly point out, the standard is HUGE and COMPLEX. So I think Postgres, LucidDB and MySQL got the string concat piece you point out correct.

However, I think only Postgres and LucidDB are meeting ISO/IEC 9075-2:2003 (E) 5.3.4:

"In a <character string literal>, <national character string literal>, <Unicode character string literal>, or <binary string literal>, a <separator> shall contain a <newline>."

rpbouman said...

Nick, thanks for commenting!

I re-read my copy of the spec, and I think your interpretation is right. Thanks!

This is not the first time I misinterpreted the SQL spec. I am not ashamed to admit it, I think this neatly illustrates my point about complexity.

See, basically, we have a seemingly nice and reasonably unambiguous set of BNF rules that supposedly define the SQL grammar (if they don't do that, then what is the point of including them in the spec, right?).

But then, after pages and pages of BNF, they suddenly start adding plain English to define exceptions to what is implied by the BNF. And not once but twice: First, they add a rule to define a special case of the string literal to explicitly allow interjection of "separators", and then they add yet another exception to attenuate meaning of "separator", but only in case it appears inside a string literal.

It seems to me they should've fixed that immediately in the BNF itself, and if that was too complex, they should've made a compromise and not allow any exception at all.

Anyway - thanks again for clearing this up. I'll add an update to the blog to reflect this information.

Unknown said...

On the other hand, in 5.3, 12) of Syntax rules it says

11) SQL text containing one or more instances of is equivalent to the same SQL text with the replaced with .

So, every comment is a newline.

Arnold Daniels said...

Kabel: NoSQL doesn't have a standard of any kind. I don't think anybody will move to NoSQL because of a lack of SQL standards.

Pavel Stěhule said...

Hello

concat function for postgresql:

postgres=# CREATE OR REPLACE FUNCTION concat(VARIADIC text[])
RETURNS text AS $$
SELECT array_to_string($1,'')
$$ LANGUAGE sql;

postgres=# SELECT concat('a','b','c');
concat
--------
abc
(1 row)

rpbouman said...

Hi Serg!

Thanks for commenting - I really appreciate your insight in this matter too. Just like Nick, you mention a point that is in my opinion highly relevant.

I found your reference in ISO/IEC 9075-2:2003 (E) 5.2:

"
11) SQL text containing one or more instances of <comment> is equivalent to the same SQL text with the <comment> replaced with <newline>.
"

So indeed, it seems the clause mentioned by Nick would automatically be satisfied if this implied replacement was performed before parsing string literals.

Now, I already mentioned I have misinterpreted the standard many times before. Personally I wouldn't know what takes precedence, but either way, I think this is all highly confusing.

Until I find a way to unambiguously interpret all these rules and how they affect each other, I am very much tempted to add this as extra proof to my earlier statement that the SQL standard is needlessly complex, and therefore, error prone to implement.

rpbouman said...

Hi Arnold!

thanks for commenting. Yes, I think this is an excellent point - When I hit the hay last night, I mulled over kabels point and thought about something similar: NoSQL doesn't really many anything in particular, but one characteristic that these platforms happen so share is the virtual absence of any query language (Hadoop being a notable exception). Basically, these platforms expose an API. I haven't looked at those in detail yet, but I suspect they are all specific for the product. So indeed, hardly any standards at all.

Don't get me wrong - I don't think this is a bad thing. But I do think this is just a sign of how young it all is.

rpbouman said...

@Pavel Stěhule: thanks for contributing this, I really appreciate it!

For the string concatenation, I think the main problem really is MS SQL, I can use standard || in all products without requiring elevated privileges.

Anyway, thanks again :)

Anonymous said...

AFAIK Sybase behaves the same way as MS SQL because Microsoft bought Sybase engine and came up with MS SQL Server 4 years back. I don't think they would have worked on the SQL engine ever since ;-)

rpbouman said...

Hi Gopal!

Thanks for the info. I thought the Sybase legacy goes furhter back than 4 years, but indeed - I wrote that Sybase "unsurprisingly" exposes the same behaviour exactly because I expect that part of the code to be the same or at least compatible due to the legacy.

Julian Hyde said...

I second Nick's comments. If I recall correctly, I was the one who actually read the standard and made LucidDB deal with string literals correctly.

By the way, another important part of the standard that the big guys -- well, definitely Oracle -- have ignored is the VALUES clause. There are many solutions to the problem of how to evaluate a single-row expression, almost as many as database vendors.

In Oracle you have to write

SELECT 'a' || 'b' FROM dual

and this is a non-portable hack. (I heard that someone once accidentally inserted another row into DUAL and half of their SQL*Forms application broke. You'll also notice that DUMMY has a unique index on its DUMMY column; without it, the Oracle optimizer would generate some very dumb plans.)

In MySQL you can write

SELECT 'a' || 'b'

which is convenient but non-standard.

The provides VALUES, the table constructor. So you can write standard way is the best: you can write

VALUES 'a' || 'b'

or if you're feeling verbose

SELECT 'a' || 'b' FROM (VALUES (1))

VALUES also supports multiple rows and multiple columns elegantly:

VALUES (1, 'Foo'), (2, 'Bar')

Julian

Julian Hyde said...

Regarding your statement "I can't help but seeing this as yet more support for my statement that the SQL standard is so complex it is almost or perhaps completely impossible to get it right."

I agree that the SQL standard is big and complex; but I don't agree that it is impossible to implement. Writing a database would be complex even if there were no standard to adhere to. Having a standard actually makes the process easier, in a lot of ways.

What would you do to make the standard easier to understand? Add some examples? Remove some of the content? Translate it from standardese to plain English?

I actually wouldn't do any of these. It's a standard, and the intended audience are developers implementing the standard. Those people want the standard to be very, very precise. (For examples, I turn to Jim Melton's books such as "Understanding the new SQL.")

The one thing I would do is create a publicly accessible Test Compatibility Kit (TCK) that people can run against their chosen database. Vendors such as Oracle have claimed for years that they comply with SQL standards. If there was a public TCK, we could publish the results and shame them into action.

Julian

Julian Hyde said...

I agree with Serg about comments being equivalent to newlines. I have logged http://issues.eigenbase.org/browse/FRG-399 to fix LucidDB.

Unknown said...

Julian,

In my tests I first tested with (VALUES (1)) and then created the single record table to verify the same behavior. :)

Thanks for logging FRG-399!

rpbouman said...

Posting on behalf of Pekka (see below). Pekka, thanks, this is a very useful comment. Kind regards,

Roland.

Pekka:

Roland,

I tried to comment on your blog but there was some unexplained Blogger error.

You express surprise about the recent timing of the Microsoft Sybase code forking where g13n said... “with MS SQL Server (version) 4 (many) years back”. I have added the missing words that make his comment less ambiguous. The split took place when Microsoft diverged from the Sybase base with release 4.21 on Windows NT 3.1 in 1992.

Microsoft have indicated in at least SQL Server 2008 that the functionality you describe is deprecated and will be retired in a future release at http://msdn.microsoft.com/en-us/library/ms143729.aspx under the section Features Not Supported in the Next Version of SQL Server feature 184.

I suspect the construct does not report an error to provide backwards compatibility with historical code, which as you indicate goes back at least to before the SQL-92 standard.

Regards,
Pekka Pihlajasaari

Lazer epilasyon Adana said...

11) SQL text containing one or more instances of is equivalent to the same SQL text with the replaced with

rpbouman said...

Lazer epilasyon Adana, unfortunately some parts of your comment seem to have been filtered out by blogger. Also, it'd be good to include the source of this text.

thansk in advance,

Roland

SAP HANA Trick: DISTINCT STRING_AGG

Nowadays, many SQL implementations offer some form of aggregate string concatenation function. Being an aggregate function, it has the effe...