Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Friday, November 09, 2007

Random RDBMS and SQL Myths debunked

A few times now, I've been wanting to write this down. I know: a lot of people will go *shrug*. Others may find me pedantic. Some of will say I'm being a smart-ass. Whatever...but I just got to write down a few of these common misconceptions that keep floating around.

None of these misconceptions are really harmful - in most cases, they do not lead to misunderstanding or miscommunication. However, when you are writing about these subjects, you'll often find that a sloppy definition you used in some place will bite you in the tail, and make it harder to explain something later on. So, that is why I from time to time get kind of obsessed with finding just the right words.

I'm not pretending I have the right words though. But there are a few informal ways of saying things that at a glance look right but are in fact wrong. Here's a random list of some of them:

PRIMARY KEY and UNIQUE constraints are unique indexes

Wrong - an index is just a convenient implementation to detect duplicate entries, and this is used by all RDBMS-es I am familiar with to implement PRIMARY KEY and UNIQUE constraints. However, the fact that there is a distinction is evident in for example the Oracle SQL syntax. For example, in ALTER TABLE ... DROP CONSTRAINT you can specify whether the associated index should be kept or also discarded.

Some people argue that it does not make sense to make the distinction in case the RDBMS does not maintain the constraint and index as separate objects. (This is the case in for example MySQL.)

Well, maybe...but I disagree. When I think about constraints, I'm thinking about business rules and guarding them to maintain database integrity. When talking about indexes, I'm thinking about performance and access paths. Quite different things, and in my opinion a shame to throw away the words to express the difference in my opinion.

A table consists of rows and columns

No - there is nothing wrong with an empty table. In other words, it does not consist of rows. It may or may not contain rows, but that is a different story.

A scalar subquery returns one column and one row

Wrong - first of all, a scalar subquery may return zero rows, in which case it evaluates to null, which is perfectly valid. But there is more to it.

Whether something is or is not a subquery is matter of syntaxis. The SQL grammer is defined so that if you encounter a query between parenthesis where a scalar value is appropriate, then that query (including the parentheses) will be parsed as a scalar subquery. In other words, the text satisfies the production rule for the non-terminal symbol "scalar subquery".

The parser will usually be smart enough to verify whether the subquery yields one column, but the number of rows returned is a runtime affair.

Suppose the query that makes up the scalar subquery would in fact return more than one row...would it suddenly not be a scalar subquery anymore? Of course not. It is still a scalar subquery - it just happens to be impossible to execute it. In other words, it violates the semantics of a scalar subquery and is therefore invalid. But the mere fact that we can conlcude that must imply that it is a scalar subquery.

A subquery is a SELECT statement that appears as a part of another SELECT statement

Wrong - For the same reasons as the previous issue. A statement is a syntactical construct. It has to do with discovering a pattern in a piece of text so that it satisfies a particular rule in the SQL grammer. That grammar does not have a rule that allows statements to be nested - not in pure SQL anyway (Of course, in stored procedures, one can have statement blocks like BEGIN...END, IF...END IF etc that really can contain other statements)

Of course, if we would take the SELECT that makes up the subquery and run it in isolation, it would be a SELECT-statement. Bit that is exactly the heart of the matter: because we are regarging it as part of another statement, it cannnot be a statement itself. This is simply a matter of definition of course - most people will immediately understand what is meant.

What would be better to say though is that a subquery is a query or query expression that appears as part of another SQL statement. However, this is also not correct: CREATE VIEW for example does contain a query expression, but this would most likely not be called a subquery. For this particular case, you can argue that there is nothing sub-ish about the query expression, because it is simply an essential part of the CREATE VIEW statement.

But what to think of CREATE TABLE...AS SELECT... and INSERT INTO...SELECT? The query expression is certainly not an essential part of CREATE TABLE and INSERT INTO, and in that sense, the query does look like it is subordinate to the statement it is part of.

You could argue that a query is a subquery if it appears inside another query. That seems sound, but what to think of UPDATE ... SET = (SELECT ...)? Personally I am reluctant to call an UPDATE statement a query - I tend to think of a query as a SELECT statement or sometimes a query expression.

I can think of only one thing that really is a defining characteristic of a subquery though - that is that the query expression must appear within parentheses. So, again, a matter of syntax more than a matter of semantics. I must admit I'm still not very satisfied with this though...What do you think?

NULL is the absence of a value

Variants of this statement go like "NULL is a missing value" or "NULL is not a value".

With slight doubt, I say: wrong. It certainly is true that many people use NULL to convey that something is not there or that something is not applicable. But this is a matter of choice, it does not change the meaning of NULL itself. If we use the same line of reasoning as we used for the subquery myth, we must conclude that NULL is certainly a valid value expression. It can legally appear anywhere where we can put a value. It is IMO also perfectly ok to say things like "...that expression evaluates to NULL".

So what does the SQL standard say? Well, here's a quote:

...the null value is neither equal to any other value nor not equal to any other value — it is unknown
whether or not it is equal to any given value....

So, I'm in that camp too: NULL is a value, and if we have a NULL in say, the integer domain, we just don't know which of all possible integers it is.

Foreign keys must reference a primary key

Wrong - a Unique constraint is mostly just as acceptable.

In MySQLs InnoDB it is even more relaxed - the foreign key only needs to reference the prefix of an index in the parent table, although this is so exotic, it should probably be ignored.

This table is not normalized - it still contains redundancy

Wrong - a table is normalized when it is in the first Normal form. There are a few different opinions what that means exactly, but it usually works to say that a table is not normalized when it contains repeating groups.

A slightly stronger statement is to say that a table is not normalized when it contains data that is not atomic. This is stronger because it does not cover only repeating groups, but also columns that, for a single row, do not contain a single value. For example, a first name/last name combination in one column is not atomic, and therefore, a table that contains such values is not normalized. (There are opinions that require even more than this, but for practical purposes the sense of atomic values works pretty well.)

The source of confusion is in what happens beyond the first normal form. Although a table maybe normalized, it can still contain redundancy. By removing redundancy, you can progressively achieve a higher normal form. In many cases, one would require at least third normal form or the Boyce-Codd normal form for building database schemas. Many people say "normalized" when they actually mean "in at least the third normal form".

So - what do you think? Pedantic? Have other myths? Maybe you have a good, satisfactory definition for subqueries? Or maybe you find an error in my debunkings?

Just drop me a comment on this post - thanks in advance.


rudy said...

pedantic? perhaps

but a valuable and interesting post nonetheless

thanks, roland


adaniels said...

Je bent lekker bezig om op elke slak zout te leggen.

David H said...

I agree with the difference between unique and primary indexes, a table only requiring columns to exist, and NULL being a value. However, there are a couple of other examples you gave that we use that bring up my statement below...

I think often we make these types of daily assumptions because:

1) We don't have time to say it otherwise. For example, we say that the sky is blue, when in reality we know that it is the reflection of particles in the air that gives a blue appearance;

2) It is the better of two alternatives that gives us insight into understanding how we should use that information. Example: What I consider a normalized table is indeed in the third normal form, but biz dev can barely understand normalization (and denormalization) for their reporting purposes let alone going to the next level of semantic accuracy. Thus, as long as our reporting personnel continue to see NULLs as the absence of a value as opposed to the presence of one, the programmers continue to program with the understanding that foreign indexes point to primary keys, and biz dev understands that their reporting tables are slightly denormalized for their best purposes, then the world will continue to be round and the sky will remain blue.



danzinger said...

Clever post imho :)

I dont think you should justify the motivation of the post ;) i agree with mostly of your arguments and ive explained some of these dozens of times to oracle developers on several projects.

I could add two if you let me :)

A FK HAS TO be indexed, if not, there will be block problems (maybe is too Oracle specific)
False. There will be blocks, yes, but only if this PK/UK referenced is updated or deleted!, if not, is not "mandatory" to index this constraint!

And well... NULL is not the same as "NULL" ;)


Roland Bouman said...


Rudy, thanks! I appreciate your kind words ;-)

Arnold: heh :-) true, I am nitpicking. However, I can't help playing what I call the substitution game all the time, both when writing an reading. It goes like this, you take a (technical) text, and spot definitions. Then afterwards, you consistently replace the words in the remainder of the text by its definition.

In too many cases, you end up with inconsisten non-sense. I think that in many cases it is possible to write texts that withstand the substitution game. However, you must be more careful in how you define things, *and* find a way to not overly complicate things.

That is hard - at least I think it is. But I feel it really does matter and really does add value.

Excellent cases where playing the substitution game should work are technicak specifications. Good specifications really do withstand the substitution game. Unfortunately, they usually are a lousy read. But I guess that is the challenge of a technical writer.


"We don't have time to say it otherwise."

Very true, I know what you mean ;-)

And I agree, in many cases we don't have the time. In normal conversation for example it would be ludicrous to constantly insist on complete correctness.

However, when writing something down, it may certainly cost a bit of time. And, in a technical context, it usually is relatively simple to be correct, at least as compared to your example.

What I mean is this. Your example involves the senses, human experience. Generally speaking it is impossible to provide a completely correct explanation for experience, simply because we can't be sure it is the same for everyone.

Then there is the notion of context. Your explanation of why the sky is blue would be too complex and non-informational in many contexts, but it might be exactly appropriate in a physics textbook.

The technical stuff I target here is quite different. It is known on beforehand how things should work. Take for example the "statement in a statement" thing. We simply know that a statement cannot be part of another statement, because the grammar was defined that way.

Surely, it is easy to be correct about such a thing- much more so than something intangible as the working of the human senses and experience. Getting that right the first time does not have to cost too much attention from the reader, and as a bonus, it will be easier to explain things later on.

If your definitions are sloppy from the start on, you need to patch and go like: "You might remember how we said there and there that this is this and that. Well, you can forget all about it because it is actually all quite different"

This leap may be excusable if the subject is extremely though, but in many cases it was a matter of the author not *taking* the time required to understand exactly what was going on. At least, this is my opinion ;-)

Danzinger: thanks! About your two myths - the first one, I get (actually, InnoDB *does* require an index on the foreign key constraint column, but I agree that the concept of foreign key does not have to be so bold as to require it)

'NULL is not the same as "NULL"'

I'm afraid I don't get this one. I mean, is this a common myth? And surely this is a correct statement, no? I mean, NULL is not equal to the string "NULL"? Maybe I am misunderstanding you. Please, explain.

kind regards


danzinger said...

Hi Roland,

Thanks for the InnoDB clarification ;)

The "NULL" issue is a common mistake that ive seen a couple of times, the first time i got myself crazy about why that record didnt showed up when i issued "WHERE column IS NULL".
well the answer was that the person who inserted that record, didnt know difference between NULL "antivalue" and "NULL" string!!!

Time after... it was funny to discover ;)


Patrick said...

Hello Roland,

Well, no more comment than 'this post is nice, because those myths are real !'
I won't discuss about clarification on "primary key / unique constraint" or other subjects you mentionned, I'm just certain I will reference this post in some rdbms training sessions
Thanks for giving us all those nice technical tips