Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Monday, July 07, 2008

Calculating Percentiles with MySQL, Round 2

My previous post on calculating percentiles with MySQL generated some comments and good discussion. In particular, I got some very interesting comments from Vladimir.

Basically, Vladimir was doubtful whether the GROUP_CONCAT() solution would be optimal in comparison to a JOIN. His proposal is to solve it like this:

SELECT SUM(g1.r) sr
, g2.length l
, SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON g1.length < g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1

First, this query sets up two identical subqueries in the FROM list using GROUP BY and COUNT() to calculate the number of occurrences of each distinct value. Then, these are joined and GROUP BY is again applied to calculate the total number of rows having a lower value. Finally, HAVING is used to find the groups in the upper percentiles, and LIMIT and ORDER BY are used to single out the one desired percentile value.

As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values on varying number of rows. Here are his slightly rounded numbers:

#rows: group_concat: groupby-join:
4M 1 min 6 sec 5.3 sec
1M 5 sec 2.5 sec
100K 0.5 sec 1.6 sec

Although GROUP_CONCAT() seems to break down pretty soon, he also writes:

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

He concluded by saying:

But what I think is the right solution is having something like this on the server side:

SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length
FROM film
GROUP BY length
HAVING p >= 0.9 ORDER BY p LIMIT 1

Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN, MAX, AVG, COUNT and maybe some other aggregate functions.

So, COUNT(INC length) would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:

-- allow statement batching
DELIMITER go

-- initialize
SELECT 0, COUNT(*)
INTO @cum_cnt, @cnt
FROM sakila.payment;

-- calculate percentiles
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount
amount
FROM sakila.payment
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;

go

and this gets us the result:

Query OK, 1 row affected (0.01 sec)

+-------------+--------+
| p | amount |
+-------------+--------+
| 0.904542334 | 6.99 |
+-------------+--------+
1 row in set (0.03 sec)

Here is the equivalent GROUP_CONCAT solution:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(
p.amount
ORDER BY p.amount
SEPARATOR ','
)
, ','
, 90/100 * COUNT(*) + 1
)
, ','
, -1
) AS `90th Percentile`
FROM sakila.payment AS p;

...and it is considerably slower:

+-----------------+
| 90th Percentile |
+-----------------+
| 6.99 |
+-----------------+
1 row in set (0.08 sec)

(sakila.payment has 16049 rows and 19 distinct values for amount)

So, the sane thing to do would be to forget about that GROUP_CONCAT idea, and use this method. It does not have the nasty drawbacks of having to mess with group_concat_max_len and I am pretty sure Vladimir's method will be faster across the board anyway.

The one thing you could object about is the extra query to initialize the user-defined variables. You can get around that by initializing those in a single row subquery in the FROM clause, a technique described by Baron Schwartz (see for example this post)

SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
amount
FROM sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
, @cnt:=COUNT(*)
FROM sakila.payment) p
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;

This has the advantage that the variables are initialized in one go with the entire query, ensuring you are not accidentally working with uninitialized or garbage variables.

(BTW: If you want to do more with these user-defined variables, I can highly recommend more from Baron's site, see for example his article on advanced user defined variable techniques.)

30 comments:

Pavel Stehule said...

I am testing your ideas on PostgreSQL - where I can use relative unlimited arrays.

create or replace function nth_percentil(anyarray, int)
returns anyelement as $$
select $1[$2/100.0 * array_upper($1,1) + 1];
$$ language sql immutable strict;

pagila=# select nth_percentil(array(select length from film order by 1),90);
nth_percentil
---------------
173
(1 row)

There is array based solution mutch faster than others - 6ms versus 50ms (using derived tables). I belive so solution based on your idea has to be most fast, but MySQL lost lot of time with parsing of string.

Roland Bouman said...

Hi Pavel,

Thanks for chiming in.

Wow, that's a big difference - 6 vs 50.

I bet you are right the GROUP_CONCAT() solution loses time on both parsing as well as concatenating the string. For example, the group concat has to convert the numeric value to a string, copy to the result string etc.

I am considering to write the entire thing as a UDF too, just for the hell of it.

Anyway, I appreciate your findings, let me know if you find other ways ;)

kind regards,

Roland

Pavel Stehule said...

Arrays are really usefull - I cannot live without arrays :) - its important mainly for stored procedure's programming.

Using OFFSET and LIMIT is fast too:

SELECT DISTINCT amount
FROM payment
ORDER BY 1
OFFSET 0.9 * (SELECT count(DISTINCT amount) FROM payment)
LIMIT 1;

this query is only 1/3 slower than array based query.

Pavel

Precious said...

Hi, how can I add group to Vladimir's query, i.e. that it calculate the percentile per each group. I made it work with the GROUP_CONCAT() solution, but it fails for my big dataset...

any suggestions?

Sarah

Anonymous said...

Roland: Many thanks for your suggestion.

I try this query for calculation of a '90th percentile':

[code]
SELECT SUM(g1.r) sr,
g2.O2_PPM l,
SUM(g1.r)/(SELECT COUNT(*) FROM listfiles) p
FROM (SELECT COUNT(*) r, O2_PPM
FROM listfiles GROUP BY O2_PPM) g1
JOIN (SELECT COUNT(*) r, O2_PPM FROM listfiles GROUP BY O2_PPM) g2
ON g1.O2_PPM < g2.O2_PPM
GROUP BY g2.O2_PPM
HAVING p > 0.9
ORDER BY p
LIMIT 1[/code]

And the ouptus is:
[code]
sr l p
46 27787 0,92
[/code]

I need calculating the 90th percentile in my table mysql for this columns:
O2_PPM, N2_PPM, CO_PPM, CO2_PPM, H2_PPM, CH4_PPM, C2H6_PPM, C2H4_PPM, C2H2_PPM, C2H2_C2H4_PPM, CH4_H2_PPM, C2H4_C2H6_PPM, H20_PPM

I need 13 differents queries for calculating the '90th percentile' ?

And I have another problem:
If I use Microsoft excel for calculation of a '90th percentile' the output for the column O2_PPM is 22559,5... in mysql 27787, what's the difference?

Chevy

Roland Bouman said...

Hi anonymous,

yes, I think you'll need to set up the same logic 13 times. You may be able to write it into one monster query, but essentially you're going to have to repeat the logic.

I can't say based on this information why you're getting different results. You may have been bitten by the user-defined variables.

Another thing you could try is checkout the statistical language R. Personally I am not familiar with it but I hear many good things about it.

Good luck, Roland.

eco said...

Hi Roland, thanks a lot for your answer.

I try this query ( http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html, previous post on calculating percentiles with MySQL ) :

SELECT SUM(g1.r) sr,
g2.o2_PPM l,
SUM(g1.r)/(SELECT COUNT(*) FROM cm714.listfiles) p
FROM (SELECT COUNT(*) r, o2_PPM
FROM cm714.listfiles GROUP BY o2_PPM) g1
JOIN (SELECT COUNT(*) r, o2_PPM FROM cm714.listfiles GROUP BY o2_PPM) g2
ON g1.o2_PPM < g2.o2_PPM
GROUP BY g2.o2_PPM
HAVING p > 0.9
ORDER BY p
LIMIT 1

And I have for 90th Percentile 27787.

I try this query (http://rpbouman.blogspot.com/2008/07/calculating-percentiles-with-mysql.html):
SELECT
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(f.o2_PPM
ORDER BY f.o2_PPM
SEPARATOR ',')
, ',', 90/100 * COUNT(*) + 1),',',-1)
AS `90th Percentile`
FROM cm714.listfiles AS f

And I have for 90th Percentile 23995.

If I use Microsoft excel the output for the column O2_PPM is 22559,5.

Which one of this options is right ?

Please my name is Chevy... not anonymous... thanks

Roland Bouman said...

Hi Chevy,

based on this information, I can't say which one is right. For the method that depends on user-defined variables, it's possible that you got bitten because that method actually abuses user-defined variables. You may get a different result if you add an index on the column that you're calculating the percentile for.

The method using GROUP_CONCAT should be right as long as the group_concat_max_len value is large enough to hold the entire set. If it's not you should see a warning in the query result, and if you do see that you're going to have to up the @@group_concat_max_len value so it can hold all of your set.

Excel may or may not (please check that documentation) use an average of two values in case the percentile boundary falls between two values. None of my methods do that.

I can't possibly figure out which of these things apply to your problem. I hope you can find out yourself though, based on what I just wrote. If not, you can try and post again, and I can try to find the time to look deeper into it.

kind regards,

Roland.

eco said...

Hi Roland, thanks a lot for new answer.

I send in this link the archive sql and xls:
http://free.7host06.com/popigu/_roland.zip

Thanks so much for your help.

Best regards,

Chevy

Roland Bouman said...

Chevy, this is a bit quick.

Have you tried to follow up on my suggestions to find out yourself what the reason is for the different results? You really should put some effort into it yourself first, and preferably let me know if you find some problem in my code. Hope this is clear now.

good luck,

Roland.

eco said...

Roland: this is the solution to problem-
Now match the `90th Percentile` in mysql and excel.
Thanks for you excellent post and help.
Kind regards
Chevy

SELECT SUM(g1.r) sr,
g2.O2_PPM l,
SUM(g1.r)/(SELECT COUNT(*) FROM cm714.listfiles) p
FROM (SELECT COUNT(*) r, O2_PPM
FROM cm714.listfiles GROUP BY O2_PPM) g1
JOIN (SELECT COUNT(*) r, O2_PPM FROM cm714.listfiles GROUP BY O2_PPM) g2
ON g1.O2_PPM <= g2.O2_PPM
GROUP BY g2.O2_PPM
HAVING p >= 0.9
ORDER BY p
LIMIT 1

Roland Bouman said...

Hi Chevy,

not sure what the intention is of your last post. I hope it means you solved your problem.

Best regards,

Roland

jamma said...

Roland:
Having trouble trying to adapt this method for data that has groups. I can't figure out how to reinitialize the variable for each new group. For example, my data set has Customer, Period, Sales. I want to calculate the sales percentile of each customer for each period. Any ideas?

Roland Bouman said...

Jamma,

yes when using user-defined variables you're essentially making the query dependent upon execution order. There are some tricks to control it, but it is generally inadvisable to do so.

Take a look at the previous post, that describes a safe but slower method to do it:

http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html

Turbo said...

This doesn't work for me for some reason on mysql version 5.0.84
The cumulative count doesn't seem to work with GROUP BY clause (works otherwise).
The expression @cum_cnt := @cum_cnt + COUNT(*) evaluates to COUNT(*) as always. In other words, the value of @cum_cnt on RHS is always 0, for every row.

An example with a simpler table:-
> CREATE TABLE mytest (x INT);
> INSERT INTO mytest VALUES (2),(3),(4),(1),(2),(4),(1),(5),(6),(4),(5),(1),(3),(2);
Query below works as expected:-
> SELECT x, @cum_cnt:=@cum_cnt+x FROM mytest, (SELECT @cum_cnt:=0) t;
+------+----------------------+
| x | @cum_cnt:=@cum_cnt+x |
+------+----------------------+
| 2 | 2 |
| 3 | 5 |
| 4 | 9 |
| 1 | 10 |
| 2 | 12 |
| 4 | 16 |
| 1 | 17 |
| 5 | 22 |
| 6 | 28 |
| 4 | 32 |
| 5 | 37 |
| 1 | 38 |
| 3 | 41 |
| 2 | 43 |
+------+----------------------+

But this one does not:-
> SELECT x, COUNT(*), @cum_cnt:=@cum_cnt+COUNT(*) FROM mytest, (SELECT @cum_cnt:=0) t GROUP BY x;
+------+----------+-----------------------------+
| x | COUNT(*) | @cum_cnt:=@cum_cnt+COUNT(*) |
+------+----------+-----------------------------+
| 1 | 3 | 3 |
| 2 | 3 | 3 |
| 3 | 2 | 2 |
| 4 | 3 | 3 |
| 5 | 2 | 2 |
| 6 | 1 | 1 |
+------+----------+-----------------------------+

Same results if I take out the initialization of @cum_cnt from the select clause and use set just before returning the query:-

> SET @cum_cnt:=0; SELECT x, COUNT(*), @cum_cnt:=@cum_cnt+COUNT(*) FROM mytest GROUP BY x;
Query OK, 0 rows affected (0.00 sec)

+------+----------+-----------------------------+
| x | COUNT(*) | @cum_cnt:=@cum_cnt+COUNT(*) |
+------+----------+-----------------------------+
| 1 | 3 | 3 |
| 2 | 3 | 3 |
| 3 | 2 | 2 |
| 4 | 3 | 3 |
| 5 | 2 | 2 |
| 6 | 1 | 1 |
+------+----------+-----------------------------+

Turbo said...

Also, have you had any luck converting this into a UDF?

Roland Bouman said...

@Turbo, the variables are a trap. Please check out my previous post http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html, it describes another (slower) method that does at least compute the result reliably.

Roland Bouman said...

@Turbo luck? I haven't tried.

Turbo said...

BTW, I came up with this solution which seems to work for me, and is hopefully faster than join solution. But I am skeptical there must be something wrong because no one else suggested it.

For the table mytest posted in the comment above, check out this query:-
SELECT @p, x
FROM
(SELECT x FROM mytest ORDER BY x ) t1, -- Sort x
(SELECT @n:=0, @p:=0.9, -- initialize some variables.
@p_row:=CEIL(@p * (SELECT COUNT(*) FROM mytest )) -- index that lies at 90th percentile of the sorted rows
) t3
WHERE (@n:=@n+1) = @p_row -- where is used so that we can spit element at @p_row'th row. A hack cause limit doesn't accept variables.
LIMIT 1;

Turbo said...

I didn't do extensive benchmarks, but on my table of 15K rows:-
1. Join method proposed by Vladimir took 1.053s
2. My method took 0.234s

Roland Bouman said...

@turbo: and the group_concat trick I proposed?

CRIngrow said...

Percentiles produce different results compared to EXCEL
Great article(s) however the results from the formula give different results
to those returned by Excel percentile.inc and percentile.exc functions. This appears to be because this function returns a value from the sample where as Excel does an interpolation. SI anyone aware of a solution that also does the interpolation ?
Many Thanks - Chris

Christopher said...

Not a terribly helpful comment, but an opinion nonetheless:

Y'ALL! THANK YOU FOR SHARING YOUR KNOWLEDGE WITH THE WORLD!

The original post and ensuing dialogue has been immensely helpful. Thank you, thank you.

Joe said...

I am a newbie and struggling to calculate percentiles "across groups". I tried to tweak some of the codes above, but failed. Does anyone have a solution in hand?

Joe said...

I am a newbie and struggling to calculate percentiles "across groups". I tried to tweak some of the codes above, but failed. Does anyone have a solution in hand?

Roland Bouman said...

Joe,

can you post some sample input data and a sample of the desired result?

Joe said...

Thanks Roland,
input data:
DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
`grade_id` int(11) NOT NULL auto_increment,
`student_name` varchar(32) character set utf8 default NULL,
`grade` int(10) unsigned default NULL,
`course` varchar(1) character set utf8 default NULL,
`year` int(2) unsigned default NULL,
PRIMARY KEY (`grade_id`)
) ENGINE=MyISAM AUTO_INCREMENT=6 DEFAULT CHARSET=latin1;

INSERT INTO `test` VALUES (1,'Wallace',95,'a',90),(2,'Gromit',97,'a',90),(3,'Shaun',85,'b',90),(4,'McGraw',92,'b',90),(5,'Preston',92,'b',90)
, (6,'Wallace',95,'b',91),(7,'Gromit',97,'b',91),(8,'Shaun',85,'b',91),(9,'McGraw',92,'a',91),(10,'Preston',92,'a',91);

/*Groups are defined by course & year*/

/*result*/
student_name grade course year percentile
Wallace 95 a 90 0.5
Gromit 97 a 90 1.0
Shaun 85 b 90 0.3
McGraw 92 b 90 0.7
Preston 92 b 90 1.0
McGraw 92 a 91 0.5
Preston 92 a 91 1.0
Shaun 85 b 91 0.3
Wallace 95 b 91 0.7
Gromit 97 b 91 1.0

Roland Bouman said...

Joe, I will get back at this. First, allow me to post your additional comment here (I hit the wrong button and accidentally deleted your original one):

"Just to clarify the steps involved in calculating percentile in the example above: 1. order 'grade' in ascending order, 2. create a series from 1 to n for each category, defined by 'course' and 'year'(n is the total number of records in each category), and 3. divide this series by the total number of records in the respective categories (i.e., n).

My ultimate goal is to find out minimum grade required to be in the top 5%, top 3% etc. Once I have the percentile column I have this flexibility. "

Joe said...

Thank you, Roland.

Anonymous said...

According to section 9.3 of the MySQL Reference Manual :
"Note: In a SELECT statement, each expression is evaluated only when sent to the client. This means that in a HAVING, GROUP BY, or ORDER BY clause, you cannot refer to an expression that involves variables that are set in the SELECT list."