Comments on Roland Bouman's blog: Calculating the Nth percentile in MySQL

Thanks for the detailed responses!

2013-12-13T22:38:33.129+01:00

Thanks for the detailed responses!

Hi! whether the subquery is faster than the 2-ste...

2013-12-13T12:41:10.125+01:00

Hi!

whether the subquery is faster than the 2-step solution is not so easy to say in this case.

MySQL Subqueries often don't perform too good. Some of that is fixed in MySQL 5.7 and in MariaDB. And just like with all the other solutions that relyu on GROUP CONCAT it needs the memory to build the list of strings.

However, the 2-step solution with LIMIT has it's own problems (and these even hold for the first solution since that uses LIMIT too). Even though the LIMIT selects only one result row, it still needs to build a temporary table where it stores the sorted result at least up to the row of interest. Since that row is at the very tail of the entire set, this is not trivial and takes time as well.

I can't really make a general statement at this point, you're going to have to try and test with your data to see what performs best. That said, if you don't need the data instantly and you have the time to caclulate the result before hand, then I guess it shouldn't matter too much.

Oh yeah... One more question, doing the subquery (...

2013-12-12T21:13:57.322+01:00

Oh yeah... One more question, doing the subquery (your first example) is faster than doing two calls, correct?

I'm programming in php using mysqli object.

Thanks for the incredibly detailed reply! I shou...

2013-12-12T21:09:39.098+01:00

Thanks for the incredibly detailed reply!

I should have thought to use an embedded select, but I'm not strong in my SQL.

My application is not mission critical, as I'm just trying to get the Xth percentiles of certain data points on what people think is funny. I was grabbing anything that was better than AVG(), but its too much. I need to narrow down to a high percentile.

If I get a chance I may try both ways (using all the data and just subsets) to see how things change. Maybe one can pull funnier stuff out (but "funny" is subjective, so no REAL way to tell which is better).

Thanks again!

Hi Terry! The group concat function does not supp...

2013-12-12T01:36:16.390+01:00

Hi Terry!

The group concat function does not support a LIMIT clause (syntax here http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_group-concat) so I don't think your example could work.
What you can do is use LIMIT at the end of the SELECT expression, and then apply the aggregation.
However, LIMIT is applied at the very end of the operation, so you have to do it in 2 steps (otherwise, the LIMIT will be applied after the aggregation, which is not what you want)

Something like this:

SELECT ...GROUP_CONCAT(...)
FROM (
SELECT ...
FROM myTable
ORDER BY myDate DESC
LIMIT 0, 100
)

...and then this will caclulate it over the sample of the last (that's because of the DESC, with ASC it would have been the first) 100 values.

While the idea of using only a sample of the values to caculate the percentile is interesting, this particular approach has quite a bunch of problems. There's just no way to know whether the sample you're drawing is a good representation of your entire dataset.

There are several other approaches but each has their own problems. One popular but potentially unreliable method is to use user-defined variables. A clean, but slow approach is to either use a self join or subqueries.

It should be possible to use LIMIT to your advantage if it is acceptable to do the task in 2 steps.
Step 1 would be to calculate the ordinal of the desired percentile value
Step 2 would be to use LIMIT to actually fetch that value.

Example:

mysql> select count(amount)*.90 from sakila.payment ;
+-------------------+
| count(amount)*.90 |
+-------------------+
| 14444.10 |
+-------------------+
1 row in set (0.01 sec)

mysql> select amount from sakila.payment order by amount limit 14444, 1;
+--------+
| amount |
+--------+
| 6.99 |
+--------+
1 row in set (0.02 sec)

HTH,

Roland.

I'm having problems with the GROUP_CONCAT limi...

2013-12-11T23:06:58.301+01:00

I'm having problems with the GROUP_CONCAT limit and don't have access to change config files on the server.

However I have a field in the database for the date each data point was inserted.

Is there a way to order the data by the date and then only use the most recent 100 data points?

I'm not stong in my SQL but this is what I think I'd want if it were syntactically correct:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(
f.length
ORDER BY f.length, f.createdAt LIMIT 0,100
SEPARATOR ','
)
, ','
, 90/100 * COUNT(*) + 1
)
, ','
, -1
) AS `90th Percentile`
FROM myTable AS f

Dear @Anonymous, Thanks for your comment! First...

2012-02-09T12:53:06.754+01:00

Dear @Anonymous,

Thanks for your comment!

First, I should point out that there is absolutely nothing wrong with the given definition of the median as the "number of observations with a higher value is equal to the number of observations that has a lower value."

What is wrong is my example, or at the very least it is confusing. I wrote a plus there to indicate that both 2 and 3 are the middle values (if we disregard the two middle values 2 and 3 themselves). That was silly and confusing. The final figure, 2.5, is the mean of 2 and 3 which is normally accepted as the "right" approximation of the median in case there is an even number of observations. But as an example, this is really confusing since the mean of the entire series is also 2.5. And indeed, typically the arithmetic mean is used as the best measure for the average.

I will update the text to more clearly describe what I meant.

what you define as median is the average in fact: ...

2012-02-09T10:45:01.528+01:00

what you define as median is the average in fact:

"the number of observations with a higher value is equal to the number of observations that has a lower value. For the series {1,2,3,4} the median would be 2+3 = 2.5."

the median would be 2 or 3, the difference between median and avarage is that the median is included in the list of values, the one that has 50% of values being lower and 50% higher.

Thanks! this helped! Ilya.

2011-10-13T05:55:31.564+02:00

Thanks! this helped! Ilya.

Roland, thanks a lot for this. I did try some sim...

2009-12-14T15:53:58.991+01:00

Roland, thanks a lot for this.
I did try some simpler solutions, like ordering the dataset and limiting at the 90th percentile value, but unfortunately, passing variables to mysql LIMIT CLAUSE does not work...even now, after 4 years of people complaining about that.
I tried preparing the statements, and executing them, but as much as it worked in a mysql editor, it did not work in my Jasper Report, who totally refused executing a prepared statement.

This is how i got back to your solution, adapted it, and used it for doing the 90th percentile on a big set of response times

For those of you still looking for a customized query,working on numbers, here is the script i have used:
SELECT
cast(substring_index(substring_index((group_concat(cast(t as char) order by t) ),',',(round(0.9*count(t)+1))),',',-1) as signed int) as 90perc from testresults where testrun_id=2 group by http_request

In this case, t is the value of the response time, http_request is the request in the table, and testrun_id is just the id of the testcase i have run.

I do hope it will help. And it will of course, only help if used together with modifying the group_concat system variable limit.

Thanks again!
Alex

Dear anonymous, this is not a bug. In the very po...

2009-12-02T21:58:37.092+01:00

Dear anonymous,

this is not a bug. In the very post you are commenting on, this is mentioned beneath the header max_group_concat_len.

You seem to be the type that doesn't want to waste too much time on reading, So I will post the fix right here.

Simply set the max_group_concat_len server variable to the practical maximum value:

SET @@group_concat_max_len := @@max_allowed_packet;

You can also set this globally in your mysqld option file.

kind regards,

Roland

There is a bug with this solution when the concate...

2009-12-02T21:53:59.928+01:00

There is a bug with this solution when the concatenated string in the group_concat function is larger then 1024 bytes.

This string gets truncated.

Gody, thanks! (focussing on the user variable prob...

2008-07-22T00:16:00.000+02:00

Gody, thanks!

(focussing on the user variable problem)

Yeah it is true - most of the user variable solutions I posted aren't really safe. I kind of got lost doing them, forgetting to properly test the results.

The GROUP_CONCAT solution is safe though, as there is explicit ordering there (independent of an index) and no user variables.

I stumbled across your blog when I wanted to find ...

2008-07-21T22:56:00.000+02:00

I stumbled across your blog when I wanted to find ideas for good queries for calculating quantiles in a (very) big dataset.

Maybe I can help on some issues (or not) and ask a question:

For some different definitions/calculations of quantiles (which include percentiles) see f.e.
http://mathworld.wolfram.com/Quantile.html (excellent site in general including references )

I believe Q1 and Q5 mentioned there are most commonly used.
(Q5 should give the more common median defintion which may help resolve some issues for users should you implement it).

Regarding the median: For large draws from a distribtution
(whose quantiles you want to compute) the competing definitions you mention (and for Qx) will not matter since the values converge for large N. Again (see above)I believe the more common estimator is X(ceil(N/2)) for odd and
(X(floor(N/2)) + X(ceil(N/2)))/2 for even N.

Regarding the issue of equal ranks:
Since strictly speaking I believe quantiles need unique orders (no equal elements) the most common procedure -I think- is to pretend they were.
So rank 1 2 3 instead of 1 2 2
(2 and 3 being equal) to calculate the quantiles.

Regarding your (RB) solutions:
Is it possible for your queries with user-defined variables to run into problems when mysql does not process the numbers in the right order? e.g. if part of a more complicated query including joins
(e.g. as described in
http://dev.mysql.com/doc/refman/5.0/en/user-variables.html)
or when randomly ordered on disk (e.g. no index)

Gody

Vladimi, sure, see:lib_mysqludf_stathttp://www.mys...

2008-07-08T18:58:00.000+02:00

Vladimi,

sure, see:

lib_mysqludf_stat

http://www.mysqludf.org/lib_mysqludf_stat/index.php

Roland,I think it's a great idea. Btw, is there a ...

2008-07-08T17:30:00.000+02:00

Roland,

I think it's a great idea. Btw, is there a suitable UDF-package where it could be included?

Hi Vladimir, great you got it to work! Maybe we ca...

2008-07-07T23:50:00.000+02:00

Hi Vladimir, great you got it to work!

Maybe we can work together and create UDFs for this too? Just for the fun and of course to find out if a UDF can beat user variables?

Just let me know and we'll make it happen.

Roland,the following query worked for me and showe...

2008-07-07T22:30:00.000+02:00

Roland,
the following query worked for me and showed pretty good performance:

mysql> set @cnt1 := @cnt2 := 0; select c, p from (SELECT c, (@cnt1:= @cnt1 + 0 + COUNT(*)) p FROM tt GROUP BY c) x where p > (select count(*) from tt)*0.9 limit 1;
Query OK, 0 rows affected (0.00 sec)

+-------+---------+
| c | p |
+-------+---------+
| 90270 | 3600026 |
+-------+---------+
1 row in set (3.06 sec)

table size: 4M, distinct value count: 100803

the division didn't work properly, also I had to use sub-select as checking p in HAVING re-evaluated p and gave incorrect output.

Roland,the idea with user variables is excellent e...

2008-07-07T21:05:00.000+02:00

Roland,

the idea with user variables is excellent except it didn't work for me "as is" (I even updated the server to the 5.1.25-community). The form

@cum_cnt:=@cum_cnt + COUNT(*)

worked for me simply as COUNT(*) which was surprising as it obviously worked for you. I quickly surfed through the variables and SQL modes manual sections, but didn't find anything relevant. Anyway, after some time "hands-on keyboard" I found the way to make it work. Here's the log:

mysql> show create table a \G
*************************** 1. row ****
Table: a
Create Table: CREATE TABLE `a` (
`c` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> select * from a;
+------+
| c |
+------+
| 1 |
| 2 |
| 2 |
| 2 |
| 3 |
| 4 |
| 4 |
+------+
7 rows in set (0.00 sec)

mysql> set @num1 := @num2 := @num3 := 0; select count(*), @num1 := (@num1 + count(*)) c1, @num2 := (@num2 + count(*) + 0) c2, @num3 := (@num3 + 0 + cou
nt(*)) c3 from a group by c;
Query OK, 0 rows affected (0.00 sec)

+----------+------+------+------+
| count(*) | c1 | c2 | c3 |
+----------+------+------+------+
| 1 | 1 | 1 | 1 |
| 3 | 3 | 3 | 4 |
| 1 | 1 | 1 | 5 |
| 2 | 2 | 2 | 7 |
+----------+------+------+------+
4 rows in set (0.00 sec)

As you can see it works only in one specific form (c3). I will try to do the benchmark today...

Vladimir, your idea concerning the "accumulative" ...

2008-07-07T13:31:00.000+02:00

Vladimir,

your idea concerning the "accumulative" or "cumulative" aggregate functions is pretty interesting!

It's probably pretty trivial to create UDFs for these to prove the concept. I am quite sure your hunch that they will be pretty fast is a good one - I expect this to be the fastest solution, provided the group by can be executed efficiently.

Note that you can already do this trick with user variables:

-- allow batching
DELIMITER go statements

-- initialize
SELECT COUNT(*), 0
INTO @cnt, @cum_cnt
FROM sakila.film;

-- calculate percentile
SELECT length
, @cum_cnt:=@cum_cnt + COUNT(*)/@cnt as p
FROM film
GROUP BY length
HAVING p >= 0.9
LIMIT 1;

go

Can you repeat your benchmark with this trick and see what it gives? Thanks in advance,

Roland.

Roland,what I meant by "typical value" is exactly ...

2008-07-07T13:00:00.000+02:00

Roland,

what I meant by "typical value" is exactly what is described in the wikipedia article about the median http://en.wikipedia.org/wiki/Median . In the example with poormen and a richman in the room the "typical" value "on the table" is $5. I apologize if my explanations were not very clear...

About the INCREMENTAL thing: (again my apologies I was in a hurry, and also made a mistake in the pseudocode) the example would be as follows (I will imply the film table in the example):

SELECT COUNT(INC length), length FROM film GROUP BY length

here COUNT(INC length) would be the same as normal COUNT(length) except it wouldnt reset the function value after every group, for example if we have length values 1,1,2,3,4,4,4,5 then the query above would give us

COUNT(INC len) len
2 1
3 2
4 3
7 4
8 5

while the normal

SELECT COUNT(length), length FROM film GROUP BY length

would give

COUNT(len) len
2 1
1 2
1 3
3 4
1 5

the result above as you understand can be reached by a fairly trivial modification to the aggregate function code. Now having the cumulative result above we can continue our calculation:

SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length FROM film GROUP BY length

this query will return us all distinct "length" values from the table along with their percentile values. And finally we filter out what we [dont] need by e.g. adding

HAVING p >= 0.9 ORDER BY p LIMIT 1

Regards,
Vladimir

Hi Vladimir, thanks for coming back and taking the...

2008-07-07T12:22:00.000+02:00

Hi Vladimir,

thanks for coming back and taking the time to think about this.

"what I still remember from university times is that the median intuitively is the most "typical" value (or values) in the distribution. Just out of curiosity I would appreciate if you point me to some books/applications of this "mean" median."

Well, there are three 'similar but different' metrics to characterize the 'typical' value for a normal distribution:
* mean: sum(value)/count(*)
* mode: most abundant value
* median: 'middle' value

If you google for "statistical median" (don't use the quotes) you will get numerous articles, and all I have seen first explain that it is the middle value and that the mean of the two middle values is taken in case there is an even number of observations. I am sure you will find proper references to text books too if you look hard enough. (I've looked into a number of my Dutch statistics texbooks, and all mention the 'mean' trick, some also mention the 50th percentile. It just proves statistics is not that exact I guess.)

Concerning you alternative solution: This is great stuff, Thank you!
Clearly your solution performs much nicer for larger sets.

Concerning your INCREMENTAL proposal:
Maybe I don't understand completely. The statement you gave:

SELECT ... SUM(INCREMENTAL length)/(SELECT COUNT(*)) ... GROUP BY length ...

is not entirely complete so maybe something got lost in translation. What are you calculating exactly here? Where is the COUNT(*) coming from?

kind regards,

Roland,OK, not putting under a question the validi...

2008-07-07T10:03:00.000+02:00

Roland,

OK, not putting under a question the validity of your definition anymore I'm still fairly surprised, as what I still remember from university times is that the median intuitively is the most "typical" value (or values) in the distribution. Just out of curiosity I would appreciate if you point me to some books/applications of this "mean" median.

The calculation:
Of course your join examples will be very slow as you over join all the rows. However this is not necessary. What we need is the number of values that are "below" the given. So instead of using the whole table as the join predicate we can rather use something like:

(select count(*) c, length l from film group by l [order by l])

If the number of distinct values in the table is relatively small this will work fast regardless of the total number of rows in the table.

I've got the following query that worked for me:

select sql_no_cache
sum(g1.r) sr, g2.length l, sum(g1.r)/(select count(*) from film) p
from
(select count(*) r, length from film group by length) g1
join
(select count(*) r, length from film group by length) g2
on g1.length < g2.length
group by g2.length
having p > 0.9
order by p
limit 1

(of course it can be optimized to use temp tables; we might also play with "WITH ROLLUP" instead of select count(*) for the whole table)

I have created a MyISAM table and tried this query and the group_concat one (spec: my quite old notebook PM-2000, 2Gb, 4500rpm, sql_no_cache, default memory settings, except increased concat buffer to 14M, every query run 3 times, win xp):

4M int rows, 999 distinct values:

group_concat: 1 min 6.23 sec
group_concat: 1 min 6.31 sec
group_concat: 1 min 6.48 sec
groupby-join: 5.36 sec
groupby-join: 5.28 sec
groupby-join: 5.31 sec

1M int rows, 999 distinct values:

group_concat: 5.31 sec
group_concat: 5.30 sec
group_concat: 5.31 sec
groupby-join: 2.47 sec
groupby-join: 2.52 sec
groupby-join: 2.47 sec

100K int rows, 999 distinct values:

group_concat: 0.56 sec
group_concat: 0.56 sec
group_concat: 0.56 sec
groupby-join: 1.63 sec
groupby-join: 1.61 sec
groupby-join: 1.63 sec

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

But what I think is the right solution is having something like this on the server side:

SELECT ... SUM(INCREMENTAL length)/(SELECT COUNT(*)) ... GROUP BY length ...

I.e. a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN/MAX, AVG, COUNT and maybe some other aggregate functions. In this case the query would be very simple and fast and not consume any extra memory. (Maybe I will have time to write such a patch :) )

Regards,
Vladimir

Great ideaPavel

2008-07-06T13:34:00.000+02:00

Great idea

Pavel

Vladimir, thanks for your reply.Regarding the medi...

2008-07-04T20:36:00.000+02:00

Vladimir,

thanks for your reply.

Regarding the median definition: there are a number of definitions. I know wikipedia is not authoritative, but what I read there matches exactly what I find in several statistics textbooks:

In probability theory and statistics, a median is described as the number separating the higher half of a sample, a population, or a probability distribution, from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, the median is not unique, so one often takes the mean of the two middle values. At most half the population have values less than the median and at most half have values greater than the median. If both groups contain less than half the population, then some of the population is exactly equal to the median. For example, if a < b < c, then the median of the list {a, b, c} is b, and if a < b < c < d, then the median of the list {a, b, c, d} is the mean of b and c, i.e. it is (b + c)/2.

(from: wikipedia)

I believe the 'trick' with taking the mean is sometimes called the 'financial median' and I am sure it is know in other terms as well.

It maybe so that some sources define the median as a special case of a percentile (or some other quantile) However, which way you turn it, there will always be a problem whenever a single value appears multiple times and does not allow a clean cut along the quantile boundary.

Your remark that the median must be taken from the distribution is just yet another notion of the concept median, but the mean is just as accepted, although perhaps not for the same applications.

Your observation that all values are concatenated is correct. I agree that it may seem inefficient, and I am sure it is when the dataset grows very large. However, for this particular example of 2000 films (sakila database),

I get:

my solution: returns immediately with 173

A solution with a JOIN:

SELECT DISTINCT p.length
FROM film p
INNER JOIN film v
ON p.length > v.length
GROUP BY p.film_id
HAVING CAST(
COUNT(*) / (
SELECT COUNT (*)
FROM film
)
AS DECIMAL(3,2)
) = 0.90

takes about 0.50 sec, and returns 174

A solution with a subquery:

SELECT DISTINCT length
FROM film f
WHERE .90 =
CAST(
(SELECT COUNT(*)
FROM film p
WHERE p.length < f.length
)
/ (SELECT COUNT(*) FROM film)
AS DECIMAL(3,2)
)

takes 0.13 seconds, also returning 174.

So for this dataset, my method is by far the fastest. I tried also on a larger dataset of payments, calulating the 90th percentile of the amount. (16000 something rows).

There, my method returns in 0.09s or less, and the join and subquery solution just don't seem to want to return anytime soon - i killed them.

I would be very interested in hearing alternative solutions though. I mean, I don't claim my method is elegant or whatever - I would much rather use something that is not dependent upon group_concat_max_len, so if you know something, please post here and let us all know.

Thanks, and kind regards,

Roland