Wednesday, May 26, 2010

A small issue of SQL standards

From a functional perspective, the core SQL support in all major and minor RDBMS-es is reasonably similar. In this light, it's sometimes quite disturbing to find how some very basic things work so differently across different products. Consider this simple statement:
SELECT  'a' /* this is a comment */ 'b'
FROM onerow
What should the result be? (You can assume that onerow is an existing table that contains one row)

It turns out popular RDBMS-es mostly disagree with one another.

In Oracle XE, we get this:
SELECT  'a' /* comment */ 'b'
*
ERROR at line 1:
ORA-00923: FROM keyword not found where expected

PostgreSQL 8.4 also treats it as a syntax error, and thus seems compatible with Oracle's behavior:
ERROR:  syntax error at or near "'b'"
LINE 1: SELECT 'a' /* this is a comment */ 'b'

In Microsoft SQL Server 2008 we get:
b
-
a

(1 rows affected)
As you can see, MS SQL treats the query as SELECT 'a' AS b FROM onerow.

Finally, in MySQL 5.1, we get:
+----+
| a |
+----+
| ab |
+----+
1 row in set (0.00 sec)
So in MySQL, its as if the comment isn't there at all, and as if the string literals 'a' and 'b' are actually just one string literal 'ab'.

So what does the SQL standard say? In my copy of the 2003 edition, I find this (ISO/IEC 9075-2:2003 (E) 5.3 <literal>, page 145):
Syntax Rules
1) In a <character string literal> or <national character string literal>, the sequence:
<quote> <character representation>... <quote> <separator> <quote> <character representation>... <quote>
is equivalent to the sequence
<quote> <character representation>... <character representation>... <quote>
If we lookup the definition of <separator>, it reads:
<separator> ::= { <comment> | <white space> }...
So in this case, MySQL does the "right" thing, and basically ignores the comment, treating 'a' and 'b' as a single string constant 'ab'.
UPDATE 1: As always, the devil is in the details. And trust me, the SQL standard has many of them (details that is - I'll leave it up to the reader to decide for the devils, although I have a suspicion in a particular direction). Read on, and make sure to read Nick's comment on this post - it turns out PostgreSQL seems to behave exactly according to the standard in this case.

UPDATE 2: Serg also posted a comment citing yet another part of the standard that states that all comments implicitly count as a newline. This would mean that there doesn't have to be a literal newline character in or following the comment. In this case, my original remark that MySQL got it right would hold again.

I should state that I think very highly of both Nick and Serg, and as far as I am concerned, they're both right. I can't help but seeing this as yet more support for my statement that the SQL standard is so complex it is almost or perhaps completely impossible to get it right.

Do you find this too bold? If so, I'd really love to hear your thoughts on it. Please help us solve this paradox, I only want to understand what the standard really says.
If you try the same thing with a single line comment, all products mentioned react the same as with the initial query, except for PostgreSQL, which now treats the query according to the standard.

Now don't get me wrong. This post is not designed to bash or glorify any of the products mentioned. I think all of them are great in their own way. I am quite aware that although MySQL happens to adhere to the standard here, it violates it in other places. Finally, I should point out that I don't have a specific opinion on what the right behavior should be. I just want it to be the same on all platforms.

At the same time, I realize that for SQL it's probably too late - up to an extent, incompatibility is considered normal, and database professionals tend to be specialized in particular products anyway. So I'm not holding my breath for the grand unification of SQL dialects.

When I encountered this issue, I did have to think about that other rathole of incompatibilities I have to deal with professionally, which is web-browsers. An interesting development there is the HTML 5 specification, which actually defines an algorithm for parsing HTML - even invalid HTML. This is quite different from the approach taken by most standards, which typically define only an abstract grammar, but leave the implementation entirely up to the vendors. In theory, providing parsing instructions as detailed as done in HTML 5 should make it easier to create correct parsers, and hopefully this will contribute to a more robust web.

Anyway. That was all. Back to work...

UPDATE: I just heard that Sybase (unsurprisingly) behaves similar to MS SQL for this query (that is, query is valid, and returns 'a' in a column called b). I checked SQLite myself, which is also in that camp.

Nick also pointed out that LucidDB also provides a standard compliant implementation, in other words, it behaves exactly like PostgreSQL for this particular query. However, Julian, who was and is closely involved in LucidDB agrees with Serg that the comment should probably count as a newline, and filed a bug for LucidDB.

I checked Firebird 2.1.3, and they are in the Oracle camp: in both cases, the query gives a syntax error.

Tuesday, May 25, 2010

MySQL, Oracle and NoSQL: In the grand scheme...

...NoSQL is just larger than a fly's dropping, and MySQL and Oracle are more alike than either of their respective fanboys would like to admit.

Courtesy of Google trends:



I guess I won't be changing my career just yet.


UPDATE: I tried a few terms for "Microsoft SQL Server" before posting (SQL Server, MS SQL) but found none that came up with what I felt like was a realistic volume (they are all much, much lower than I expected). @MarkGStacey suggested trying "SQL 2008", "SQL 2005" and "SQL 2000", and those return much better results indeed (though still much lower than MySQL or Oracle). Anyway - I'd love to have some way of bunching up all those terms and have Google Trends show them as one trend, but I haven't figured out a way to do that. If you know how, please drop a line at let me know.

I'll adjust the blog if I find a more satisfactory solution.

Tuesday, May 04, 2010

MySQL: The maximum value of an integer

Did you ever have the need to find the maximum value of an integer in MySQL? Yeah, me neither. Anyway, some people seem to need this, and this is what I came up with:

SELECT ~0 as max_bigint_unsigned
, ~0 >> 32 AS max_int_unsigned
, ~0 >> 40 AS max_mediumint_unsigned
, ~0 >> 48 AS max_smallint_unsigned
, ~0 >> 56 AS max_tinyint_unsigned
, ~0 >> 1 AS max_bigint_signed
, ~0 >> 33 AS max_int_signed
, ~0 >> 41 AS max_mediumint_signed
, ~0 >> 49 AS max_smallint_signed
, ~0 >> 57 AS max_tinyint_signed
\G

*************************** 1. row ***************************
max_bigint_unsigned: 18446744073709551615
max_int_unsigned: 4294967295
max_mediumint_unsigned: 16777215
max_smallint_unsigned: 65535
max_tinyint_unsigned: 255
max_bigint_signed: 9223372036854775807
max_int_signed: 2147483647
max_mediumint_signed: 8388607
max_smallint_signed: 32767
max_tinyint_signed: 127
1 row in set (0.00 sec)

In case you're wondering how it works, read up on what the tilde (~) does: it peforms a bitwise negation. In other words, if flips bits that are 1 to 0, and vice versa. So, ~0 means, set all the bits to 1, because in the integer one 0, all the bits are a binary 0. Now, in MySQL, at runtime, there is only one integer type, which is an 8-byte integer value or a bigint. Therefore, ~0 is by definition the largest possible integer value.

MySQL defines a family of integer types for storage: bigint (8 bytes), int (4 bytes), mediumint (3 bytes), smallint (2 bytes) and tinyint (1 byte). To find the maximum values of those types, we can use the right-bitshift operator >> to push the most-significant bits at the left side of ~0 down to the right, for the appropriate number of bytes to get the maximum values of the other integer flavors. So,

int type: big int medium small tiny
bit #: 64 56 48 40 32 24 16 8 1
~0 = 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 = 18446744073709551615
~0 >> 32 = 00000000 00000000 00000000 00000000 11111111 11111111 11111111 11111111 = 4294967295
~0 >> 40 = 00000000 00000000 00000000 00000000 00000000 11111111 11111111 11111111 = 16777215
~0 >> 48 = 00000000 00000000 00000000 00000000 00000000 00000000 11111111 11111111 = 65535
~0 >> 56 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 11111111 = 255

Now, for each of the integer flavors, MySQL lets you define them to be either signed or unsigned. This is implemented using a so-called sign bit. The sign bit is the most significant bit (so, bit #64 in a bigint, bit #32 in an int, and so on and so forth). If the sign bit equals 0, the integer is positive and if it equals 1, the integer is negative. So, to get the maximum values for the signed integer flavors, we can use the same recipe, we just need to push the bits even one more bit to the right, like so:

int type: big int medium small tiny
bit #: 64 56 48 40 32 24 16 8 1
~0 >> 1 = 01111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 = 9223372036854775807
~0 >> 33 = 00000000 00000000 00000000 00000000 01111111 11111111 11111111 11111111 = 2147483647
~0 >> 41 = 00000000 00000000 00000000 00000000 00000000 01111111 11111111 11111111 = 8388607
~0 >> 49 = 00000000 00000000 00000000 00000000 00000000 00000000 01111111 11111111 = 32767
~0 >> 57 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 01111111 = 127

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...