Roland Bouman's blog: Search results for percentile

Showing posts sorted by relevance for query percentile. Sort by date Show all posts

Friday, July 04, 2008

Calculating the Nth percentile in MySQL

Yesterday, I was on the freenode ##pentaho irc channel when Andres Chaves asked me how to calculate the N^th percentile in MySQL. He saw a solution somewhere using subqueries, but wasn't too happy about it.

A while ago I wrote about calulating the median in MySQL, and it turns out the N^th percentile can be calculated using a similar, single-pass approach, not relying on subqueries, UDFs, or user-defined variables.

The percentile....

So, what is a percentile exactly? Here's what the wikipedia says:

A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

....and the median

The wikipedia continues and hints at the relationship between the N^th percentile and the median:

The 25th percentile is also known as the first quartile; the 50^th percentile as the median.

Sidenote: as I understand it, this latter remark concerning the median is not entirely correct. The median is "the middle value": the number of observations with a higher value is equal to the number of observations that has a lower value. For a series with an even number of observations, say {1,2,3,4}, there isn't one middle value, there are two, in this case: 2 and 3. Typically, the median is computed by taking the arithmic mean of the two middle values, which would be (2+3) / 2 = 2.5 in this particular case. But for this example, the 50th percentile would be 3 and not 2.5. However, in most practical cases the values are fairly evenly distributed and sufficiently large, which means the difference between the median and 50th percentile will be small or absent.

However, the median and percentile problems are quite similar: in both cases, a value is picked or computed that is higher than the value found in a particular portion of the total number of observations. For the median, the requirement is that that particular portion is equal to the number of observations that exceeds it; for the N^th percentile the requirement is that that portion constitutes N percent of the total number of observations.

The Solution

In the following example, we calculate the 90^th percentile of film lengths:


SELECT  SUBSTRING_INDEX(
            SUBSTRING_INDEX(
                GROUP_CONCAT(                 -- 1) make a sorted list of values
                    f.length
                    ORDER BY f.length
                    SEPARATOR ','
                )
            ,   ','                           -- 2) cut at the comma
            ,   90/100 * COUNT(*) + 1         --    at the position beyond the 90% portion
            )
        ,   ','                               -- 3) cut at the comma
        ,   -1                                --    right after the desired list entry
        )                 AS `90th Percentile`
FROM    sakila.film AS f

Here, the literal 90 represents which percentile we want, that is, "the 90^th percentile".

(If you like, you can leave out the SEPARATOR ',' bit, as the default separator is a comma anyway. I just wanted to have a clear indication for the source of the ',' arguments in the SUBSTRING_INDEX() calls)

Explanation

The median and N^th percentile problem can be solved using a similar apporoach: first, we use the string aggregate function GROUP_CONCAT() to create an ordered list of values. Then we use the substring variation SUBSTRING_INDEX() to find and excise a list entry at a particular desired position.

The differences between the solutions for the median and the N^th is mainly in which entry we have to pick from the list. (See my prior article for the full story on median). For the N^th percentile, we first calculate the desired portion of the total number of observations. Because N is defined as a percentage, we divide by 100 to get the actual fraction of the total number of observations:


    N / 100

Then, we multiply by the total number of observations to find the number of observations that make up the actual portion of observations within the specified percentile:


    N / 100 * COUNT(*)

Because we want to find the observation for which the specified portion has a lower value, we need to look at the next entry instead of the last entry within the portion, so we add 1:


    N / 100 * COUNT(*) + 1

Caveats

There are a number of things to look out for:

Percentile value not unique

When we calculate the 90^th percentile of the film lengths, we get 173:


+-----------------+
| 90th Percentile |
+-----------------+
| 173             |
+-----------------+

If we check the result by counting the portion of films with a length lower than 173 we see:


mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
    -> FROM   film AS f;
+------------+
| Percentage |
+------------+
|    89.4212 |
+------------+

The reason that we do not get 90% is that there are multiple occurrences of films with length equal to 173:


mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title                |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS       |
| FIRE WOLVES          |
| GLADIATOR WESTWARD   |
| PIZZA JUMANJI        |
| TALENTED HOMICIDE    |
| VELVET TERMINATOR    |
+----------------------+
7 rows in set (0.01 sec)

So, even though we may have picked the entry at the right position, this may still be a value within the specified portion instead of beyond. The definitions I found for the N^th percentile do not stipulate any mechanism to deal with this kind of ambiguity, whereas for the median, the correct value is always found by averaging the left and right median if necessary.

What about NULL

The second caveat are NULL values. Currently this method does not work when the column for which you want to caculate percentile values is nullable. It is possible to work around this though. If you can exclude the rows with the NULL value entirely, you can simply add a WHERE clause. This is a good idea also because it will cull the number of rows to process.

It may not be acceptable to throw away the rows with NULL values, for example if there is another expression in your SELECT list that needs to do something with all rows. You can then still work around it, with some extra hassle. It would involve tweaking the GROUP_CONCAT to ignore NULL values. This could be done like this:


GROUP_CONCAT(
    IF(<column> IS NULL
    ,   ''
    , <column>) 
    ORDER BY <column>
    SEPARATOR ','
)

This will ensure that GROUP_CONCAT() does not also return NULL when a NULL value is present in the specified column. If there are NULL values, these will end up as a list of comma's in the head of the result:


,,,,<non-null-value1>,...,<non-null-valueN> -- 4 NULL values

Assuming we know the number of NULL's (and we do) we can clean up our list easily with just SUBSTRING():


SUBSTRING(
    GROUP_CONCAT(
        IF(<column> IS NULL
        ,   ''
        , <column>) 
        ORDER BY <column>
        SEPARATOR ','
    )
,   SUM(IF(<column> IS NULL, 1, 0)) + 1
)

Because we are ignoring the NULL values in our list, we must likewise ignore them in our calculation of the portion of rows. So, instead if COUNT(*) we should use COUNT(<column>) in order to not count the NULL values.

`group_concat_max_len`

When using GROUP_CONCAT(), an issue that should always be on your radar is the maximum length of the GROUP_CONCAT() result. If the result value exceeds the maximum length, the GROUP_CONCAT() result will be truncated, and a warning will be issued:


1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------+
| Level   | Code | Message                              |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+

It is really important to be aware of any warnings, as a truncation of the GROUP_CONCAT() result messes up the entire calculation.

The maximum length of the GROUP_CONCAT() result is controlled through the group_concat_max_len system variable.

It can be set thus:


SET @@group_concat_max_len := <num-bytes>

The maximum practical value is the maximum packet size, which is available as the max_allowed_packet system variable. This means you can write:


SET @@group_concat_max_len := @@max_allowed_packet;

and you will never be bothered by this problem again. The GROUP_CONCAT() result can still be too large though (namely, larger than the maximum packet size) but in that case you will get a proper error instead of a truncated result.

You should realize that setting the group_concat_max_len to a high value may lead to memory problems, as each GROUP_CONCAT() invocation may individually reserve the specified amount of memory to deal with its result.

Finally...

I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the latest version.

Now, finally, I have a question for you. When I wrote about the median calculation, I mentioned that I thought it was an original method, and I asked whether someone could deny that claim. I did not get any reaction, so I'd like to repeat my question: Do you know of any book or text or blog that describes this technique? If so, let me know so I can provide proper accreditation.

TIA, Roland.

Monday, July 07, 2008

Calculating Percentiles with MySQL, Round 2

My previous post on calculating percentiles with MySQL generated some comments and good discussion. In particular, I got some very interesting comments from Vladimir.

Basically, Vladimir was doubtful whether the GROUP_CONCAT() solution would be optimal in comparison to a JOIN. His proposal is to solve it like this:


SELECT   SUM(g1.r) sr
,        g2.length l
,        SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM    (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN    (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON       g1.length < g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1

First, this query sets up two identical subqueries in the FROM list using GROUP BY and COUNT() to calculate the number of occurrences of each distinct value. Then, these are joined and GROUP BY is again applied to calculate the total number of rows having a lower value. Finally, HAVING is used to find the groups in the upper percentiles, and LIMIT and ORDER BY are used to single out the one desired percentile value.

As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values on varying number of rows. Here are his slightly rounded numbers:


#rows:  group_concat:  groupby-join: 
4M      1 min 6    sec    5.3 sec
1M            5    sec    2.5 sec
100K          0.5  sec    1.6 sec

Although GROUP_CONCAT() seems to break down pretty soon, he also writes:

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

He concluded by saying:

But what I think is the right solution is having something like this on the server side:
SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length 
FROM film 
GROUP BY length
HAVING p >= 0.9 ORDER BY p LIMIT 1 
Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN, MAX, AVG, COUNT and maybe some other aggregate functions.

So, COUNT(INC length) would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:


-- allow statement batching
DELIMITER go                

-- initialize
SELECT      0, COUNT(*)     
INTO        @cum_cnt, @cnt
FROM        sakila.payment;

-- calculate percentiles
SELECT      @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,  -- running fraction of #rows per distinct amount
            amount 
FROM        sakila.payment
GROUP BY    amount 
HAVING p >= 0.9 
LIMIT 1;

go

and this gets us the result:


Query OK, 1 row affected (0.01 sec)

+-------------+--------+
| p           | amount |
+-------------+--------+
| 0.904542334 |   6.99 |
+-------------+--------+
1 row in set (0.03 sec)

Here is the equivalent GROUP_CONCAT solution:


SELECT  SUBSTRING_INDEX(
             SUBSTRING_INDEX(
                 GROUP_CONCAT(
                     p.amount
                     ORDER BY p.amount
                     SEPARATOR ','
                 )
             ,   ','
             ,   90/100 * COUNT(*) + 1
             )
         ,   ','
         ,   -1
         )   AS `90th Percentile`
FROM    sakila.payment AS p;

...and it is considerably slower:


+-----------------+
| 90th Percentile |
+-----------------+
| 6.99            |
+-----------------+
1 row in set (0.08 sec)

(sakila.payment has 16049 rows and 19 distinct values for amount)

So, the sane thing to do would be to forget about that GROUP_CONCAT idea, and use this method. It does not have the nasty drawbacks of having to mess with group_concat_max_len and I am pretty sure Vladimir's method will be faster across the board anyway.

The one thing you could object about is the extra query to initialize the user-defined variables. You can get around that by initializing those in a single row subquery in the FROM clause, a technique described by Baron Schwartz (see for example this post)


SELECT      @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, 
            amount 
FROM        sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
            ,      @cnt:=COUNT(*) 
            FROM   sakila.payment) p
GROUP BY    amount 
HAVING p >= 0.9 
LIMIT 1;

This has the advantage that the variables are initialized in one go with the entire query, ensuring you are not accidentally working with uninitialized or garbage variables.

(BTW: If you want to do more with these user-defined variables, I can highly recommend more from Baron's site, see for example his article on advanced user defined variable techniques.)

Tuesday, September 15, 2009

MySQL: Another Ranking trick

I just read SQL: Ranking without self join, in which Shlomi Noach shares a nice MySQL-specific trick based on user-defined variables to compute rankings.

Shlomi's trick reminds me somewhat of the trick I came across little over a year ago to caclulate percentiles. At that time, several people pointed out to me too that using user-defined variables in this way can be unreliable.

The problem with user-defined variables

So what is the problem exaclty? Well, whenever a query assigns to a variable, and that same variable is read in another part of the query, you're on thin ice. That's because the result of the read is likely to differ depending on whether the assignment took place before or after the read. Not surprising when you think about it - the whole point of variable assignment is to change its value, which by definition causes a different result when subsequently reading the variable (unless you assigned the already assigned value of course, duh...).

Now watch that previous statement clearly - the word subsequently is all-important.

See, that's the problem. The semantics of a SQL SELECT statement is to obtain a (tabular) resultset - not specifying an algorithm to construct that resultset. It is the job of the RDBMS to figure out an algorithm and thus, you can't be sure in what order individual expressions (including variable evaluation and assignment) are executed.

The MySQL manual states it like this:

The order of evaluation for user variables is undefined and may change based on the elements contained within a given query. In SELECT @a, @a := @a+1 ..., you might think that MySQL will evaluate @a first and then do an assignment second, but changing the query (for example, by adding a GROUP BY, HAVING, or ORDER BY clause) may change the order of evaluation.

The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed.

So what good are these variables anyway?

On the one hand, this looks really lame: can't MySQL just figure out the correct order of doing the calulations? Well, that is one way of looking at it. But there is an equally valid reason not to do that. If the calculations would influence execution order, it would drastically lessen the number of ways that are available to optimize the statement.

This begs the question: Why is it possible at all to assign values to the user-defined variables? The answer is quite simple: you can use it to pass values between statetments. My hunch is the variables were created in the olden days to overcome some limitations resulting from the lack of support for subqueries. Having variables at least enables you to execute a query and assign the result temporarily for use in a subsequent statement. For example, to find the student with the highest score, you can do:


mysql> select @score:=max(score) from score;
+--------------------+
| @score:=max(score) |
+--------------------+
|                 97 |
+--------------------+
1 row in set (0.00 sec)

mysql> select * from score where score = @score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
|        2 | Gromit       |    97 |
+----------+--------------+-------+
1 row in set (0.03 sec)

There is nothing wrong with this approach - problems start arising only when reading and writing the same variable in one and the same statement.

Another way - serializing the set with `GROUP_CONCAT`

Anyway, the percentile post I just linked to contains another solution for that problem that relies on GROUP_CONCAT. It turns out we can use the same trick here.

(Some people may like to point out that using GROUP_CONCAT is not without issues either, because it may truncate the list in case the pre-assigned string buffer is not large enough. I wrote about dealing with that limitation in several places and I remain recommending to set the group_concat_max_len server variable to the value set for the max_packet_size server variable like so:

SET @@group_concat_max_len := @@max_allowed_packet;

)

The best way to understand how it works is to think of the problem in a few steps. First, we make an ordered list of all the values we want to rank. We can do this with GROUP_CONCAT like this:


mysql> SELECT  GROUP_CONCAT(
    ->             DISTINCT score
    ->             ORDER BY score  DESC
    ->         )                   AS scores
    -> FROM    score
    -> ;
+-------------+
| scores      |
+-------------+
| 97,95,92,85 |
+-------------+
1 row in set (0.00 sec)

Now that we have this list, we can use the FIND_IN_SET function to look up the position of any particlar value contained in the list. Because the list is ordered in descending order (due to the ORDER BY ... DESC), and contains only unique values (due to the DISTINCT), this position is in fact the rank number. For example, if we want to know the rank of all scores with the value 92, we can do:


mysql> SELECT FIND_IN_SET(92, '97,95,92,85')
+--------------------------------+
| FIND_IN_SET(92, '97,95,92,85') |
+--------------------------------+
|                              3 |
+--------------------------------+
1 row in set (0.00 sec)

So, the answer is 3 because 92 is the third entry in the list.

(If you're wondering how it's possible that we can pass the integer 92 as first argument for FIND_IN_SET: the function expects string arguments, and automatically converts whichever non-string typed value we pass to a string. In the case of the integer 92, it is silently converted to the string '92')

Of course, we are't really interested in looking up ranks for individual numbers one at a time; rather, we'd like to combine this with a query on the scores table that does it for us. Likewise, we don't really want to manually supply the list of values as a string constant, we want to substitute that with the query we wrote to generate that list.
So, we get:


mysql> SELECT score_id, student_name, score
    -> ,      FIND_IN_SET(
    ->            score
    ->        ,  (SELECT  GROUP_CONCAT(
    ->                        DISTINCT score
    ->                        ORDER BY score  DESC
    ->                    )
    ->            FROM    score)
    ->        ) as rank
    -> FROM   score;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        1 | Wallace      |    95 |    2 |
|        2 | Gromit       |    97 |    1 |
|        3 | Shaun        |    85 |    4 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
+----------+--------------+-------+------+
5 rows in set (0.00 sec)

Alternatively, if you think that subqueries are for the devil, you can rewrite this to a CROSS JOIN like so:


SELECT      score_id, student_name, score
,           FIND_IN_SET(
                score
            ,   scores
            ) AS rank
FROM        score
CROSS JOIN (SELECT  GROUP_CONCAT(
                       DISTINCT score
                       ORDER BY score  DESC
                    ) AS scores
            FROM    score) scores

Now that we have a solutions, lets see how it compares to Shlomi's original method. To do this, I am using the payment table from the sakila sample database.

First, Shlomi's method:


mysql> SELECT   payment_id
    -> ,        amount
    -> ,        @prev := @curr
    -> ,        @curr := amount
    -> ,        @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
    -> FROM     sakila.payment
    -> ,       (SELECT @curr := null, @prev := null, @rank := 0) sel1
    -> ORDER BY amount DESC;
+------------+--------+----------------+-----------------+------+
| payment_id | amount | @prev := @curr | @curr := amount | rank |
+------------+--------+----------------+-----------------+------+
|        342 |  11.99 |           NULL |           11.99 |    1 |
.        ... .  ..... .          ..... .           ..... .    . .
|      15456 |   0.00 |           0.00 |            0.00 |   19 |
+------------+--------+----------------+-----------------+------+
16049 rows in set (0.09 sec)

Wow! It sure is fast :) Now, the GROUP_CONCAT solution, using a subquery:


mysql> SELECT payment_id, amount
    -> ,      FIND_IN_SET(
    ->            amount
    ->        ,  (SELECT  GROUP_CONCAT(
    ->                        DISTINCT amount
    ->                        ORDER BY amount  DESC
    ->                    )
    ->            FROM    sakila.payment)
    ->        ) as rank
    -> FROM   sakila.payment
+------------+--------+------+
| payment_id | amount | rank |
+------------+--------+------+
|          1 |   2.99 |   15 |
.          . .   .... .   .. .
|      16049 |   2.99 |   15 |
+------------+--------+------+
16049 rows in set (0.14 sec)

(In case you're wondering why the results are different, this is because the result set for Shlomi's solution is necessarily ordered by ascending rank (or descending amount - same difference. To obtain the identical result, you need to add an ORDER BY clause to my query. But since the point was to calculate the ranks, I didn't bother. Of course, adding an ORDER BY could slow things down even more.)

Quite a bit slower, bummer. But at leastt we can't run into nasties with the user variables anymore. For this data set, I get about the same performance with the CROSS JOIN, but I should warn that I did not do a real benchmark.

Conclusion

Don't fall into the trap of reading and writing the same user-defined variable in the same statement. Although it seems like a great device and can give you very good performance, you cannot really control the order of reads and writes. Even if you can, you must check it again whenever you have reason to believe the query will be solved differently by the server. This is of course the case whenever you upgrade the server. But also seemingly harmless changes like adding an index to a table may change the order of execution.

Almost all cases where people want to read and write to the same user variables within the same query, they are dealing with a kind of serialization problem. They are trying to maintain state in a variable in order to use it across rows. In many cases, the right way to do that is to use a self-join. But this may not always be feasible, as pointed out in Shlomi's original post. For example, rewriting the payment rank query using a self join is not going to make you happy.

Often, there is a way out. You can use GROUP_CONCAT to serialize a set of rows. Granted, you need at least one pass for that, and another one to do something useful with the result, but this still a lot better than dealing with semi-cartesian self join issues.

Friday, March 19, 2010

Greatest N per group: top 3 with GROUP_CONCAT()

In my opinion, one of the best things that happened to Planet MySQL lately, is Explain Extended, a blog by Alex Bolenok (also known as Quassnoi on Stackoverflow).

I never had the pleasure of meeting Alex in person, but his articles are always interesting and of high quality, and the SQL wizardry he pulls off is downright inspiring. I really feel humbled by the creativity of some of his solutions and his apparent experience with multiple RDBMS products.

Alex' most recent post is about aggregation, and finding a top 3 based on the aggregate:

In MySQL I have a table called meanings with three columns: word, meaning, person. word has 16 possible values, meaning has 26. A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2. There will be thousands of persons. I need to find the top three meanings for each of the 16 words, with their frequencies. Is it possible to solve this with a single MySQL query?

Alex presents a solution that uses GROUP_CONCAT basically as a poor man's windowing function, a technique I have described on several occasions in the past for ranking, median and percentile solutions in MySQL.

Now, Alex' solution is very clever and there are some elements that I think are very creative. That said, I think his solution can be improved still. Normally I wouldn't write a blog about it, and simply leave a comment on his blog, but his blog supports comments only for general articles, which is why I present it here:


SELECT  word
,       CONCAT(
            SUBSTRING_INDEX(
                GROUP_CONCAT(meaning ORDER BY num DESC), ',', 1
            )
        ,   ' ('
        ,   SUBSTRING_INDEX(
                GROUP_CONCAT(num ORDER BY num DESC), ',', 1
            ) / SUM(num) * 100
        ,   '%)'
        ) rank1
,       CONCAT(
            SUBSTRING_INDEX(
                SUBSTRING_INDEX(
                    GROUP_CONCAT(meaning ORDER BY num DESC), ',', 2
                ), ',', -1
            )
        ,   ' ('
        ,   SUBSTRING_INDEX(
                SUBSTRING_INDEX(
                    GROUP_CONCAT(num ORDER BY num DESC), ',', 2
                ), ',', -1
            ) / SUM(num) * 100
        ,   '%)'
        ) rank2
,       CONCAT(
            SUBSTRING_INDEX(
                SUBSTRING_INDEX(
                    GROUP_CONCAT(meaning ORDER BY num DESC), ',', 3
            ), ',', -1)
        ,   ' ('
        ,   SUBSTRING_INDEX(
                SUBSTRING_INDEX(
                    GROUP_CONCAT(num ORDER BY num DESC), ',', 3
                ), ',', -1
            ) / SUM(num) * 100
        ,   '%)'
        ) rank3
FROM    (
        SELECT      word, meaning, COUNT(*) num
        FROM        t_meaning m
        GROUP BY    word,meaning
        ) a
GROUP BY word

This gives me output like this:


+------+--------------+--------------+--------------+
| word | rank1        | rank2        | rank3        |
+------+--------------+--------------+--------------+
|    1 | 16 (3.9728%) | 17 (3.9648%) | 12 (3.9632%) |
|    2 | 9 (3.9792%)  | 10 (3.9632%) | 20 (3.9328%) |
|    3 | 20 (3.9744%) | 13 (3.968%)  | 1 (3.9648%)  |
|    4 | 26 (3.952%)  | 7 (3.9456%)  | 17 (3.9424%) |
|    5 | 9 (4.008%)   | 21 (3.9824%) | 20 (3.936%)  |
|    6 | 19 (3.9504%) | 10 (3.9488%) | 13 (3.9408%) |
|    7 | 23 (4.0464%) | 12 (3.976%)  | 19 (3.9648%) |
|    8 | 23 (4.0112%) | 3 (4.0096%)  | 8 (3.9328%)  |
|    9 | 10 (4.016%)  | 19 (3.984%)  | 15 (3.9616%) |
|   10 | 10 (4.0304%) | 14 (3.9344%) | 11 (3.9312%) |
|   11 | 16 (3.9584%) | 6 (3.9296%)  | 19 (3.9232%) |
|   12 | 7 (3.9968%)  | 1 (3.9392%)  | 26 (3.9264%) |
|   13 | 8 (4.048%)   | 25 (3.9712%) | 23 (3.9616%) |
|   14 | 16 (3.9936%) | 26 (3.9632%) | 4 (3.9536%)  |
|   15 | 22 (4.0608%) | 12 (4.0048%) | 1 (3.9632%)  |
|   16 | 14 (4.0032%) | 18 (3.9712%) | 4 (3.9488%)  |
+------+--------------+--------------+--------------+
16 rows in set (0.63 sec)

On my laptop, my solution is about 30% faster than the one presented by Alex. Personally I think mine is easier to understand too, but that is a matter of taste.

Anyway, I'm just posting this to share my solution - I do not intend to downplay the one presented by Alex. Instead, I invite everyone interested in SQL, MySQL and PostgreSQL to keep an eye on Alex' blog as well as his excellent answers on Stackoverflow. He's an SQL jedi master in my book :)

Of course, if you have a better solution to crack this problem in MySQL, please leave a comment. I'd love to hear what other people are doing to cope with these kinds of queries.

Tuesday, July 08, 2008

MySQL Percentile aftermath: Calculating all quantiles

Are you getting fed up yet with my posts on calculating percentiles? Well, I'm sorry but I'm not quite finished.

Here's a simple, fast method to calculate the specified number of quantiles:


-- set the number of quantiles, for exmple:
--   quartiles: 4
--   deciles: 10
--   percentiles: 100

SET @quantiles:=4;  -- select quartiles

-- calculate all quantiles
-- 
--
SELECT     amount                     AS metric
,          @n DIV (@c DIV @quantiles) AS quantile
,          @n                         AS N
FROM       sakila.payment
CROSS JOIN (
            SELECT @n:=0                           -- rownumber
            ,      @c:=COUNT(*)                    -- need this to calculate quantile partitions
            FROM   sakila.payment
           ) c
WHERE      NOT (                                   -- modulo zero (=false), we are at the quantile
               (@n:=@n+1) % (@c DIV @quantiles)    -- rownumber equal to the quantile partition? 
           ) 
ORDER BY   amount;                                 -- need ordered partitions

You can find this snippet on MySQL Forge.