
Tuesday, April 21, 2009

Not so random Slashdot comment Gem

Java 8 will replace String with String2, which will treat empty string and null the same.


By characterZer0 (138196), on Monday April 20, @08:44AM

Friday, July 04, 2008

Calculating the Nth percentile in MySQL

Yesterday, I was on the Freenode ##pentaho IRC channel when Andres Chaves asked me how to calculate the Nth percentile in MySQL. He saw a solution somewhere using subqueries, but wasn't too happy about it.

A while ago I wrote about calculating the median in MySQL, and it turns out the Nth percentile can be calculated using a similar, single-pass approach, not relying on subqueries, UDFs, or user-defined variables.

The percentile....


So, what is a percentile exactly? Here's what Wikipedia says:

A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

....and the median


Wikipedia continues and hints at the relationship between the Nth percentile and the median:

The 25th percentile is also known as the first quartile; the 50th percentile as the median.

Sidenote: as I understand it, this latter remark concerning the median is not entirely correct. The median is "the middle value": the number of observations with a higher value is equal to the number of observations with a lower value. For a series with an even number of observations, say {1,2,3,4}, there isn't one middle value; there are two, in this case 2 and 3. Typically, the median is computed by taking the arithmetic mean of the two middle values, which would be (2+3)/2 = 2.5 in this particular case. But for this example, the 50th percentile would be 3, not 2.5. However, in most practical cases the values are fairly evenly distributed and the number of observations sufficiently large, which means the difference between the median and the 50th percentile will be small or absent.

However, the median and percentile problems are quite similar: in both cases, a value is picked or computed that is higher than the values found in a particular portion of the total number of observations. For the median, the requirement is that this portion is equal to the number of observations that exceed it; for the Nth percentile, the requirement is that this portion constitutes N percent of the total number of observations.

The Solution


In the following example, we calculate the 90th percentile of film lengths:

SELECT SUBSTRING_INDEX(
        SUBSTRING_INDEX(
            GROUP_CONCAT(             -- 1) make a sorted list of values
                f.length
                ORDER BY f.length
                SEPARATOR ','
            )
        ,   ','                       -- 2) cut at the comma
        ,   90/100 * COUNT(*) + 1     -- at the position beyond the 90% portion
        )
    ,   ','                           -- 3) cut at the comma
    ,   -1                            -- right after the desired list entry
    ) AS `90th Percentile`
FROM sakila.film AS f

Here, the literal 90 represents which percentile we want, that is, "the 90th percentile".

(If you like, you can leave out the SEPARATOR ',' bit, as the default separator is a comma anyway. I just wanted a clear indication of the source of the ',' arguments in the SUBSTRING_INDEX() calls.)

Explanation


The median and Nth percentile problems can be solved using a similar approach: first, we use the string aggregate function GROUP_CONCAT() to create an ordered, comma-separated list of values. Then we use SUBSTRING_INDEX() to find and excise the list entry at the desired position.

The difference between the solutions for the median and the Nth percentile lies mainly in which entry we have to pick from the list. (See my prior article for the full story on the median.) For the Nth percentile, we first calculate the desired portion of the total number of observations. Because N is defined as a percentage, we divide by 100 to get the actual fraction of the total number of observations:

N / 100

Then, we multiply by the total number of observations to find the number of observations that make up the actual portion of observations within the specified percentile:

N / 100 * COUNT(*)

Because we want to find the observation for which the specified portion has a lower value, we need to look at the next entry instead of the last entry within the portion, so we add 1:

N / 100 * COUNT(*) + 1
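
For example, to compute the 25th percentile (the first quartile), we simply substitute 25 for the literal 90 in the original query:

SELECT SUBSTRING_INDEX(
        SUBSTRING_INDEX(
            GROUP_CONCAT(             -- sorted list of values
                f.length
                ORDER BY f.length
                SEPARATOR ','
            )
        ,   ','
        ,   25/100 * COUNT(*) + 1     -- position beyond the 25% portion
        )
    ,   ','
    ,   -1
    ) AS `25th Percentile`
FROM sakila.film AS f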

Caveats


There are a number of things to look out for:

Percentile value not unique


When we calculate the 90th percentile of the film lengths, we get 173:

+-----------------+
| 90th Percentile |
+-----------------+
| 173             |
+-----------------+

If we check the result by counting the portion of films with a length lower than 173 we see:

mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
    -> FROM film AS f;
+------------+
| Percentage |
+------------+
|    89.4212 |
+------------+

The reason that we do not get 90% is that there are multiple occurrences of films with length equal to 173:

mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title                |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS       |
| FIRE WOLVES          |
| GLADIATOR WESTWARD   |
| PIZZA JUMANJI        |
| TALENTED HOMICIDE    |
| VELVET TERMINATOR    |
+----------------------+
7 rows in set (0.01 sec)

So, even though we may have picked the entry at the right position, this may still be a value within the specified portion instead of beyond it. The definitions I found for the Nth percentile do not stipulate any mechanism to deal with this kind of ambiguity, whereas for the median, the correct value is always found by averaging the left and right median if necessary.

What about NULL


The second caveat is NULL values. Currently, this method does not work when the column for which you want to calculate percentile values is nullable. It is possible to work around this, though. If you can exclude the rows with NULL values entirely, you can simply add a WHERE clause. This is a good idea anyway, because it culls the number of rows to process.

It may not be acceptable to throw away the rows with NULL values, for example if there is another expression in your SELECT list that needs to do something with all rows. You can still work around it then, with some extra hassle: it involves tweaking the GROUP_CONCAT() call to ignore NULL values. This could be done like this:

GROUP_CONCAT(
    IF(<column> IS NULL
    ,   ''
    ,   <column>
    )
    ORDER BY <column>
    SEPARATOR ','
)

This ensures that GROUP_CONCAT() does not return NULL when a NULL value is present in the specified column. If there are NULL values, these will end up as a run of commas at the head of the result:

,,,,<non-null-value1>,...,<non-null-valueN> -- 4 NULL values

Assuming we know the number of NULLs (and we do), we can clean up our list easily with just SUBSTRING():

SUBSTRING(
    GROUP_CONCAT(
        IF(<column> IS NULL
        ,   ''
        ,   <column>
        )
        ORDER BY <column>
        SEPARATOR ','
    )
,   SUM(IF(<column> IS NULL, 1, 0)) + 1
)

Because we are ignoring the NULL values in our list, we must likewise ignore them in our calculation of the portion of rows. So, instead of COUNT(*), we should use COUNT(<column>), which does not count the NULL values.
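
Putting it all together, here's a minimal sketch of the complete calculation for a nullable column (the table t and column c are hypothetical placeholders; substitute your own):

SELECT SUBSTRING_INDEX(
        SUBSTRING_INDEX(
            SUBSTRING(                       -- strip the leading commas produced by the NULLs
                GROUP_CONCAT(
                    IF(t.c IS NULL, '', t.c)
                    ORDER BY t.c
                    SEPARATOR ','
                )
            ,   SUM(IF(t.c IS NULL, 1, 0)) + 1
            )
        ,   ','
        ,   90/100 * COUNT(t.c) + 1          -- COUNT(t.c) ignores NULLs
        )
    ,   ','
    ,   -1
    ) AS `90th Percentile`
FROM t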

group_concat_max_len


When using GROUP_CONCAT(), an issue that should always be on your radar is the maximum length of the GROUP_CONCAT() result. If the result value exceeds the maximum length, the GROUP_CONCAT() result will be truncated, and a warning will be issued:

1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------+
| Level   | Code | Message                              |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+

It is really important to be aware of any warnings, as a truncation of the GROUP_CONCAT() result messes up the entire calculation.

The maximum length of the GROUP_CONCAT() result is controlled through the group_concat_max_len system variable.

It can be set thus:

SET @@group_concat_max_len := <num-bytes>

The maximum practical value is the maximum packet size, which is available as the max_allowed_packet system variable. This means you can write:

SET @@group_concat_max_len := @@max_allowed_packet;
and you will never be bothered by this problem again. The GROUP_CONCAT() result can still be too large though (namely, larger than the maximum packet size) but in that case you will get a proper error instead of a truncated result.

You should realize that setting the group_concat_max_len to a high value may lead to memory problems, as each GROUP_CONCAT() invocation may individually reserve the specified amount of memory to deal with its result.

Finally...


I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the latest version.

Now, finally, I have a question for you. When I wrote about the median calculation, I mentioned that I thought it was an original method, and I asked whether someone could deny that claim. I did not get any reaction, so I'd like to repeat my question: Do you know of any book or text or blog that describes this technique? If so, let me know so I can provide proper accreditation.

TIA, Roland.

Friday, November 09, 2007

Random RDBMS and SQL Myths debunked

A few times now, I've been wanting to write this down. I know: a lot of people will go *shrug*. Others may find me pedantic. Some will say I'm being a smart-ass. Whatever... but I just have to write down a few of these common misconceptions that keep floating around.

None of these misconceptions are really harmful - in most cases, they do not lead to misunderstanding or miscommunication. However, when you are writing about these subjects, you'll often find that a sloppy definition you used in some place will come back to bite you, and make it harder to explain something later on. So, that is why, from time to time, I get kind of obsessed with finding just the right words.

I'm not pretending I have the right words though. But there are a few informal ways of saying things that at a glance look right but are in fact wrong. Here's a random list of some of them:

PRIMARY KEY and UNIQUE constraints are unique indexes

Wrong - an index is just a convenient implementation to detect duplicate entries, and it is used by all RDBMSes I am familiar with to implement PRIMARY KEY and UNIQUE constraints. However, the fact that there is a distinction is evident in, for example, the Oracle SQL syntax: in ALTER TABLE ... DROP CONSTRAINT you can specify whether the associated index should be kept or also discarded.
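
In Oracle, that choice looks like this (a sketch; the table and constraint names are made up):

-- drop the constraint, but keep the index that was used to enforce it
ALTER TABLE orders DROP CONSTRAINT orders_uk KEEP INDEX;

-- drop the constraint together with its index
ALTER TABLE orders DROP CONSTRAINT orders_uk DROP INDEX;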

Some people argue that it does not make sense to make the distinction in case the RDBMS does not maintain the constraint and index as separate objects. (This is the case in for example MySQL.)

Well, maybe... but I disagree. When I think about constraints, I'm thinking about business rules and guarding them to maintain database integrity. When talking about indexes, I'm thinking about performance and access paths. Quite different things, and in my opinion it would be a shame to throw away the words that express the difference.



A table consists of rows and columns

No - there is nothing wrong with an empty table, and an empty table does not consist of rows. A table may or may not contain rows, but that is a different story.


A scalar subquery returns one column and one row

Wrong - first of all, a scalar subquery may return zero rows, in which case it evaluates to null, which is perfectly valid. But there is more to it.

Whether something is or is not a subquery is a matter of syntax. The SQL grammar is defined so that if you encounter a query between parentheses where a scalar value is appropriate, then that query (including the parentheses) will be parsed as a scalar subquery. In other words, the text satisfies the production rule for the non-terminal symbol "scalar subquery".

The parser will usually be smart enough to verify whether the subquery yields one column, but the number of rows returned is a runtime affair.

Suppose the query that makes up the scalar subquery would in fact return more than one row... would it suddenly not be a scalar subquery anymore? Of course not. It is still a scalar subquery - it just happens to be impossible to execute it. In other words, it violates the semantics of a scalar subquery and is therefore invalid. But the mere fact that we can conclude that must imply that it is a scalar subquery.
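
Both situations are easy to demonstrate in MySQL (using the sakila schema from the percentile post above):

-- zero rows: the scalar subquery evaluates to NULL, which is perfectly valid
SELECT (SELECT title FROM sakila.film WHERE film_id = -1) AS title;

-- still a scalar subquery syntactically, but it fails at runtime:
-- ERROR 1242 (21000): Subquery returns more than 1 row
SELECT (SELECT title FROM sakila.film WHERE length = 173) AS title;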


A subquery is a SELECT statement that appears as a part of another SELECT statement

Wrong - for the same reasons as the previous issue. A statement is a syntactical construct. It has to do with discovering a pattern in a piece of text so that it satisfies a particular rule in the SQL grammar. That grammar does not have a rule that allows statements to be nested - not in pure SQL anyway. (Of course, in stored procedures, one can have statement blocks like BEGIN...END, IF...END IF, etc. that really can contain other statements.)

Of course, if we took the SELECT that makes up the subquery and ran it in isolation, it would be a SELECT statement. But that is exactly the heart of the matter: because we are regarding it as part of another statement, it cannot be a statement itself. This is simply a matter of definition, of course - most people will immediately understand what is meant.

What would be better to say though is that a subquery is a query or query expression that appears as part of another SQL statement. However, this is also not correct: CREATE VIEW for example does contain a query expression, but this would most likely not be called a subquery. For this particular case, you can argue that there is nothing sub-ish about the query expression, because it is simply an essential part of the CREATE VIEW statement.

But what to think of CREATE TABLE...AS SELECT... and INSERT INTO...SELECT? The query expression is certainly not an essential part of CREATE TABLE and INSERT INTO, and in that sense, the query does look like it is subordinate to the statement it is part of.

You could argue that a query is a subquery if it appears inside another query. That seems sound, but what to think of UPDATE ... SET = (SELECT ...)? Personally I am reluctant to call an UPDATE statement a query - I tend to think of a query as a SELECT statement or sometimes a query expression.

I can think of only one thing that really is a defining characteristic of a subquery though - that is that the query expression must appear within parentheses. So, again, a matter of syntax more than a matter of semantics. I must admit I'm still not very satisfied with this though...What do you think?


NULL is the absence of a value

Variants of this statement go like "NULL is a missing value" or "NULL is not a value".

With slight doubt, I say: wrong. It certainly is true that many people use NULL to convey that something is not there or that something is not applicable. But this is a matter of choice; it does not change the meaning of NULL itself. If we use the same line of reasoning as we did for the subquery myth, we must conclude that NULL is certainly a valid value expression. It can legally appear anywhere we can put a value. It is IMO also perfectly OK to say things like "...that expression evaluates to NULL".

So what does the SQL standard say? Well, here's a quote:

...the null value is neither equal to any other value nor not equal to any other value — it is unknown
whether or not it is equal to any given value....

So, I'm in that camp too: NULL is a value, and if we have a NULL in, say, the integer domain, we just don't know which of all possible integers it is.
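
This three-valued behavior is easy to see in MySQL:

-- both comparisons yield NULL ("unknown"), not TRUE or FALSE
SELECT 1 = NULL, 1 <> NULL, NULL = NULL;

-- MySQL's NULL-safe equality operator <=> does yield TRUE (1) here
SELECT NULL <=> NULL;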


Foreign keys must reference a primary key

Wrong - a UNIQUE constraint is usually just as acceptable.

In MySQL's InnoDB it is even more relaxed - the foreign key only needs to reference a prefix of an index in the parent table, although this is so exotic it should probably be ignored.
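
For example, in MySQL (a sketch with made-up tables), the foreign key below references a UNIQUE constraint rather than the primary key:

CREATE TABLE parent (
    id   INT PRIMARY KEY
,   code CHAR(3) NOT NULL UNIQUE      -- a candidate key, not the primary key
) ENGINE=InnoDB;

CREATE TABLE child (
    id          INT PRIMARY KEY
,   parent_code CHAR(3)
,   FOREIGN KEY (parent_code) REFERENCES parent (code)
) ENGINE=InnoDB;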


This table is not normalized - it still contains redundancy

Wrong - a table is normalized when it is in the first normal form. There are a few different opinions on what that means exactly, but it usually works to say that a table is not normalized when it contains repeating groups.

A slightly stronger statement is to say that a table is not normalized when it contains data that is not atomic. This is stronger because it covers not only repeating groups, but also columns that, for a single row, do not contain a single value. For example, a first name/last name combination in one column is not atomic, and therefore a table that contains such values is not normalized. (There are opinions that require even more than this, but for practical purposes the notion of atomic values works pretty well.)

The source of confusion is in what happens beyond the first normal form. Although a table may be normalized, it can still contain redundancy. By removing redundancy, you can progressively achieve a higher normal form. In many cases, one would require at least the third normal form or the Boyce-Codd normal form when building database schemas. Many people say "normalized" when they actually mean "in at least the third normal form".
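
To illustrate with made-up tables: the first design below is not normalized because the name column is not atomic; the second is in first normal form, yet still redundant:

-- not normalized: "name" packs a first and a last name into one column
CREATE TABLE employee_bad (
    id   INT PRIMARY KEY
,   name VARCHAR(100)                 -- e.g. 'John Doe' - not atomic
);

-- first normal form, but still redundant: department_name repeats for
-- every employee of the same department (removing that redundancy
-- leads to the higher normal forms)
CREATE TABLE employee (
    id              INT PRIMARY KEY
,   first_name      VARCHAR(50)
,   last_name       VARCHAR(50)
,   department_id   INT
,   department_name VARCHAR(50)
);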



So - what do you think? Pedantic? Have other myths? Maybe you have a good, satisfactory definition for subqueries? Or maybe you find an error in my debunkings?

Just drop me a comment on this post - thanks in advance.
