Yesterday, I was on the freenode ##pentaho irc channel when Andres Chaves asked me how to calculate the *N*^{th} percentile in MySQL. He saw a solution somewhere using subqueries, but wasn't too happy about it.

A while ago I wrote about calculating the median in MySQL, and it turns out the *N*^{th} percentile can be calculated using a similar, single-pass approach, not relying on subqueries, UDFs, or user-defined variables.

### The percentile....

So, what is a percentile exactly? Here's what the wikipedia says:

> A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

#### ....and the median

The wikipedia continues and hints at the relationship between the *N*^{th} percentile and the median:

> The 25th percentile is also known as the first quartile; the 50^{th} percentile as the median.

Sidenote: as I understand it, this latter remark concerning the median is not entirely correct. The median is "the middle value": the number of observations with a higher value is equal to the number of observations with a lower value. For a series with an even number of observations, say {1,2,3,4}, there isn't one middle value, there are two, in this case: 2 and 3. Typically, the median is computed by taking the arithmetic mean of the two middle values, which would be (2+3) / 2 = 2.5 in this particular case. But for this example, the 50th percentile would be 3 and not 2.5. However, in most practical cases the values are fairly evenly distributed and the number of observations sufficiently large, which means the difference between the median and the 50th percentile will be small or absent.
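The difference is easy to verify in a few lines of code. The Python below is only an illustration of the arithmetic, not part of the SQL solution:

```python
observations = [1, 2, 3, 4]

# Median of an even-length series: mean of the two middle values.
mid = len(observations) // 2
median = (observations[mid - 1] + observations[mid]) / 2  # (2 + 3) / 2 = 2.5

# 50th percentile: the entry at position 50/100 * COUNT(*) + 1
# in the sorted list (the formula this article uses).
pos = round(50 / 100 * len(observations) + 1)             # position 3
percentile_50 = sorted(observations)[pos - 1]             # = 3

print(median, percentile_50)  # 2.5 3
```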

However, the median and percentile problems are quite similar: in both cases, a value is picked or computed that is higher than the values found in a particular portion of the total number of observations. For the median, the requirement is that this portion is equal in size to the number of observations that exceed it; for the *N*^{th} percentile, the requirement is that this portion constitutes *N* percent of the total number of observations.

### The Solution

In the following example, we calculate the 90^{th} percentile of film lengths:

```sql
SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX(
           GROUP_CONCAT(           -- 1) make a sorted list of values
             f.length
             ORDER BY f.length
             SEPARATOR ','
           )
         , ','                     -- 2) cut at the comma
         , 90/100 * COUNT(*) + 1  --    at the position beyond the 90% portion
         )
       , ','                       -- 3) cut at the comma
       , -1                        --    right after the desired list entry
       ) AS `90th Percentile`
FROM   sakila.film AS f
```

Here, the literal `90` represents which percentile we want, that is, "the 90^{th} percentile".

(If you like, you can leave out the `SEPARATOR ','` bit, as the default separator is a comma anyway. I just wanted to have a clear indication of the source of the `','` arguments in the `SUBSTRING_INDEX()` calls.)

### Explanation

The median and *N*^{th} percentile problems can be solved using a similar approach: first, we use the string aggregate function `GROUP_CONCAT()` to create an ordered list of values. Then we use the substring variation `SUBSTRING_INDEX()` to find and excise a list entry at a particular desired position.

The difference between the solutions for the median and the *N*^{th} percentile lies mainly in which entry we have to pick from the list. (See my prior article for the full story on the median.) For the *N*^{th} percentile, we first calculate the desired portion of the total number of observations. Because *N* is defined as a percentage, we divide by 100 to get the actual fraction of the total number of observations:

*N* **/ 100**

Then, we multiply by the total number of observations to find the number of observations that make up the actual portion of observations within the specified percentile:

*N* / 100 *** COUNT(*)**

Because we want to find the observation for which the specified portion has a lower value, we need to look at the next entry instead of the last entry within the portion, so we add 1:

*N* / 100 * COUNT(*) **+ 1**
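As a sanity check, the whole list-and-cut mechanic can be mimicked outside MySQL. The Python sketch below (the function name is made up for illustration) builds the sorted comma-separated list, keeps everything up to the computed position, and takes the last entry of that head, mirroring the two `SUBSTRING_INDEX()` calls. Rounding the position to an integer is an assumption made here for the simulation:

```python
def percentile_via_list(values, n):
    # GROUP_CONCAT(... ORDER BY ...): one sorted, comma-separated string.
    concat = ",".join(str(v) for v in sorted(values))
    # The position just beyond the n% portion: n/100 * COUNT(*) + 1.
    pos = round(n / 100 * len(values) + 1)
    # SUBSTRING_INDEX(list, ',', pos): keep everything before the pos-th comma.
    head = ",".join(concat.split(",")[:pos])
    # SUBSTRING_INDEX(head, ',', -1): keep everything after the last comma.
    return head.split(",")[-1]

print(percentile_via_list(range(1, 11), 90))  # 10: 90% of the values lie below it
```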

### Caveats

There are a number of things to look out for:

#### Percentile value not unique

When we calculate the 90^{th} percentile of the film lengths, we get 173:

```
+-----------------+
| 90th Percentile |
+-----------------+
|             173 |
+-----------------+
```

If we check the result by counting the portion of films with a length lower than `173`, we see:

```
mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
    -> FROM film AS f;
+------------+
| Percentage |
+------------+
|    89.4212 |
+------------+
```

The reason that we do not get 90% is that there are multiple occurrences of films with length equal to 173:

```
mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title                |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS       |
| FIRE WOLVES          |
| GLADIATOR WESTWARD   |
| PIZZA JUMANJI        |
| TALENTED HOMICIDE    |
| VELVET TERMINATOR    |
+----------------------+
7 rows in set (0.01 sec)
```

So, even though we may have picked the entry at the right position, this may still be a value within the specified portion instead of beyond it. The definitions I found for the *N*^{th} percentile do not stipulate any mechanism to deal with this kind of ambiguity, whereas for the median, the correct value is always found by averaging the left and right median if necessary.
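The effect of duplicates is easy to reproduce with a toy series. In the Python illustration below, the entry at the computed position ties with values that sit inside the portion, pulling the "strictly below" fraction under the target percentage:

```python
# Eight out of ten observations share the value 3.
values = sorted([1, 2, 3, 3, 3, 3, 3, 3, 3, 3])

pos = round(50 / 100 * len(values) + 1)          # entry just past the 50% portion
pick = values[pos - 1]                           # = 3
below = sum(v < pick for v in values) / len(values)

print(pick, below)  # 3 0.2: only 20% fall strictly below the picked value
```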

#### What about NULL

The second caveat concerns `NULL` values. Currently this method does not work when the column for which you want to calculate percentile values is nullable. It is possible to work around this, though. If you can exclude the rows with the `NULL` value entirely, you can simply add a `WHERE` clause. This is also a good idea because it will cull the number of rows to process.

It may not be acceptable to throw away the rows with `NULL` values, for example if there is another expression in your `SELECT` list that needs to do something with all rows. You can then still work around it, with some extra hassle. It would involve tweaking the `GROUP_CONCAT()` to ignore `NULL` values. This could be done like this:

```sql
GROUP_CONCAT(
  IF(<column> IS NULL
  , ''
  , <column>)
  ORDER BY <column>
  SEPARATOR ','
)
```

This will ensure that `GROUP_CONCAT()` does not return `NULL` when a `NULL` value is present in the specified column. If there are `NULL` values, these will end up as a run of commas at the head of the result:

```
,,,,<non-null-value1>,...,<non-null-valueN>    -- 4 NULL values
```

Assuming we know the number of `NULL`s (and we do), we can clean up our list easily with just `SUBSTRING()`:

```sql
SUBSTRING(
  GROUP_CONCAT(
    IF(<column> IS NULL
    , ''
    , <column>)
    ORDER BY <column>
    SEPARATOR ','
  )
, SUM(IF(<column> IS NULL, 1, 0)) + 1
)
```

Because we are ignoring the `NULL` values in our list, we must likewise ignore them in our calculation of the portion of rows. So, instead of `COUNT(*)` we should use `COUNT(<column>)` in order to not count the `NULL` values.
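The combined workaround can again be mimicked in a few lines. In the Python sketch below (the function name is hypothetical), `NULL`s become empty entries at the head of the sorted list, one leading comma per `NULL` is stripped off, and only the non-`NULL` count enters the position formula:

```python
def percentile_skip_nulls(values, n):
    nulls = sum(v is None for v in values)
    non_null = sorted(v for v in values if v is not None)
    # GROUP_CONCAT(IF(col IS NULL, '', col) ORDER BY col):
    # NULLs become empty entries at the head of the list.
    concat = "," * nulls + ",".join(str(v) for v in non_null)
    # SUBSTRING(..., SUM(IF(col IS NULL, 1, 0)) + 1):
    # drop one leading comma per NULL.
    trimmed = concat[nulls:]
    # COUNT(col) instead of COUNT(*): NULLs don't enter the position.
    pos = round(n / 100 * len(non_null) + 1)
    return ",".join(trimmed.split(",")[:pos]).split(",")[-1]

print(percentile_skip_nulls([None, None] + list(range(1, 11)), 90))  # 10
```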

#### `group_concat_max_len`

When using `GROUP_CONCAT()`, an issue that should always be on your radar is the maximum length of the `GROUP_CONCAT()` result. If the result value exceeds the maximum length, the `GROUP_CONCAT()` result will be truncated, and a warning will be issued:

```
1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------+
| Level   | Code | Message                              |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+
```

It is really important to be aware of any warnings, as a truncation of the `GROUP_CONCAT()` result messes up the entire calculation.

The maximum length of the `GROUP_CONCAT()` result is controlled through the `group_concat_max_len` system variable. It can be set thus:

```sql
SET @@group_concat_max_len := <num-bytes>;
```

The maximum practical value is the maximum packet size, which is available as the `max_allowed_packet` system variable. This means you can write:

```sql
SET @@group_concat_max_len := @@max_allowed_packet;
```

and you will never be bothered by this problem again. The `GROUP_CONCAT()` result can still be too large, though (namely, larger than the maximum packet size), but in that case you will get a proper error instead of a truncated result.

You should realize that setting `group_concat_max_len` to a high value may lead to memory problems, as each `GROUP_CONCAT()` invocation may individually reserve the specified amount of memory to deal with its result.

### Finally...

I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the latest version.

Now, finally, I have a question for you. When I wrote about the median calculation, I mentioned that I thought it was an original method, and I asked whether someone could deny that claim. I did not get any reaction, so I'd like to repeat my question: Do you know of any book or text or blog that describes this technique? If so, let me know so I can provide proper accreditation.

TIA, Roland.