Percona Live MySQL User's Conference, San Francisco, April 10-12th, 2012 Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Friday, July 04, 2008

Calculating the Nth percentile in MySQL

Yesterday, I was on the freenode ##pentaho irc channel when Andres Chaves asked me how to calculate the Nth percentile in MySQL. He saw a solution somewhere using subqueries, but wasn't too happy about it.

A while ago I wrote about calulating the median in MySQL, and it turns out the Nth percentile can be calculated using a similar, single-pass approach, not relying on subqueries, UDFs, or user-defined variables.

The percentile....


So, what is a percentile exactly? Here's what the wikipedia says:

A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

....and the median


The wikipedia continues and hints at the relationship between the Nth percentile and the median:

The 25th percentile is also known as the first quartile; the 50th percentile as the median.

Sidenote: as I understand it, this latter remark concerning the median is not entirely correct. The median is "the middle value": the number of observations with a higher value is equal to the number of observations that has a lower value. For a series with an even number of observations, say {1,2,3,4}, there isn't one middle value, there are two, in this case: 2 and 3. Typically, the median is computed by taking the arithmic mean of the two middle values, which would be (2+3) / 2 = 2.5 in this particular case. But for this example, the 50th percentile would be 3 and not 2.5. However, in most practical cases the values are fairly evenly distributed and sufficiently large, which means the difference between the median and 50th percentile will be small or absent.

However, the median and percentile problems are quite similar: in both cases, a value is picked or computed that is higher than the value found in a particular portion of the total number of observations. For the median, the requirement is that that particular portion is equal to the number of observations that exceeds it; for the Nth percentile the requirement is that that portion constitutes N percent of the total number of observations.

The Solution


In the following example, we calculate the 90th percentile of film lengths:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT( -- 1) make a sorted list of values
f.length
ORDER BY f.length
SEPARATOR ','
)
, ',' -- 2) cut at the comma
, 90/100 * COUNT(*) + 1 -- at the position beyond the 90% portion
)
, ',' -- 3) cut at the comma
, -1 -- right after the desired list entry
) AS `90th Percentile`
FROM sakila.film AS f

Here, the literal 90 represents which percentile we want, that is, "the 90th percentile".

(If you like, you can leave out the SEPARATOR ',' bit, as the default separator is a comma anyway. I just wanted to have a clear indication for the source of the ',' arguments in the SUBSTRING_INDEX() calls)

Explanation


The median and Nth percentile problem can be solved using a similar apporoach: first, we use the string aggregate function GROUP_CONCAT() to create an ordered list of values. Then we use the substring variation SUBSTRING_INDEX() to find and excise a list entry at a particular desired position.

The differences between the solutions for the median and the Nth is mainly in which entry we have to pick from the list. (See my prior article for the full story on median). For the Nth percentile, we first calculate the desired portion of the total number of observations. Because N is defined as a percentage, we divide by 100 to get the actual fraction of the total number of observations:

N / 100

Then, we multiply by the total number of observations to find the number of observations that make up the actual portion of observations within the specified percentile:

N / 100 * COUNT(*)

Because we want to find the observation for which the specified portion has a lower value, we need to look at the next entry instead of the last entry within the portion, so we add 1:

N / 100 * COUNT(*) + 1

Caveats


There are a number of things to look out for:

Percentile value not unique


When we calculate the 90th percentile of the film lengths, we get 173:

+-----------------+
| 90th Percentile |
+-----------------+
| 173 |
+-----------------+

If we check the result by counting the portion of films with a length lower than 173 we see:

mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
-> FROM film AS f;
+------------+
| Percentage |
+------------+
| 89.4212 |
+------------+

The reason that we do not get 90% is that there are multiple occurrences of films with length equal to 173:

mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS |
| FIRE WOLVES |
| GLADIATOR WESTWARD |
| PIZZA JUMANJI |
| TALENTED HOMICIDE |
| VELVET TERMINATOR |
+----------------------+
7 rows in set (0.01 sec)

So, even though we may have picked the entry at the right position, this may still be a value within the specified portion instead of beyond. The definitions I found for the Nth percentile do not stipulate any mechanism to deal with this kind of ambiguity, whereas for the median, the correct value is always found by averaging the left and right median if necessary.

What about NULL


The second caveat are NULL values. Currently this method does not work when the column for which you want to caculate percentile values is nullable. It is possible to work around this though. If you can exclude the rows with the NULL value entirely, you can simply add a WHERE clause. This is a good idea also because it will cull the number of rows to process.

It may not be acceptable to throw away the rows with NULL values, for example if there is another expression in your SELECT list that needs to do something with all rows. You can then still work around it, with some extra hassle. It would involve tweaking the GROUP_CONCAT to ignore NULL values. This could be done like this:

GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)

This will ensure that GROUP_CONCAT() does not also return NULL when a NULL value is present in the specified column. If there are NULL values, these will end up as a list of comma's in the head of the result:

,,,,<non-null-value1>,...,<non-null-valueN> -- 4 NULL values

Assuming we know the number of NULL's (and we do) we can clean up our list easily with just SUBSTRING():

SUBSTRING(
GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)
, SUM(IF(<column> IS NULL, 1, 0)) + 1
)

Because we are ignoring the NULL values in our list, we must likewise ignore them in our calculation of the portion of rows. So, instead if COUNT(*) we should use COUNT(<column>) in order to not count the NULL values.

group_concat_max_len


When using GROUP_CONCAT(), an issue that should always be on your radar is the maximum length of the GROUP_CONCAT() result. If the result value exceeds the maximum length, the GROUP_CONCAT() result will be truncated, and a warning will be issued:

1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+

It is really important to be aware of any warnings, as a truncation of the GROUP_CONCAT() result messes up the entire calculation.

The maximum length of the GROUP_CONCAT() result is controlled through the group_concat_max_len system variable.

It can be set thus:

SET @@group_concat_max_len := <num-bytes>

The maximum practical value is the maximum packet size, which is available as the max_allowed_packet system variable. This means you can write:

SET @@group_concat_max_len := @@max_allowed_packet;
and you will never be bothered by this problem again. The GROUP_CONCAT() result can still be too large though (namely, larger than the maximum packet size) but in that case you will get a proper error instead of a truncated result.

You should realize that setting the group_concat_max_len to a high value may lead to memory problems, as each GROUP_CONCAT() invocation may individually reserve the specified amount of memory to deal with its result.

Finally...


I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the latest version.

Now, finally, I have a question for you. When I wrote about the median calculation, I mentioned that I thought it was an original method, and I asked whether someone could deny that claim. I did not get any reaction, so I'd like to repeat my question: Do you know of any book or text or blog that describes this technique? If so, let me know so I can provide proper accreditation.

TIA, Roland.

30 comments:

gmax said...

Excellent post, my friend!
It's always a pleasure reading your blog!

Giuseppe

Roland Bouman said...

Giuseppe,

thanks for you kind words ;)

Baron said...

Great hack!

Is group_concat_max_length in characters, or bytes? Documentation doesn't say -- must use the source. I'm too lazy.

Roland Bouman said...

Baron,

thanks! Note that I haven't benchmarked it - I should, really.

Anyway - darn, you're right: http://bugs.mysql.com/bug.php?id=37874

My results indicate it is bytes though:

mysql> select @@group_concat_max_len;
+------------------------+
| @@group_concat_max_len |
+------------------------+
| 10 |
+------------------------+
1 row in set (0.00 sec)

+--------------------+---------------+
| char_length('中') | length('中') |
+--------------------+---------------+
| 1 | 3 |
+--------------------+---------------+
1 row in set (0.00 sec)

mysql> select group_concat('中') from (select 1 union select 2 union select 3) a;
+---------------------+
| group_concat('中') |
+---------------------+
| 中,中, |
+---------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+
1 row in set (0.00 sec)

Vladimir said...

Roland,

quite interesting. My 0.02 UAH... First regarding the median definition - it is (according to offline sources) the 50 percentile - so that lower 50% will have a value not more than median, and upper - not less, so there seems to be a problem with percentile definition (so you may need to fix your script). Also "For the series {1,2,3,4} the median would be 2+3 = 2.5." is not quite correct as median is a value from the destribution, in this case it can be 2 or 3 but not "2,5". Next, regarding the implementation - as I understood you concat the whole table. Is this correct? I mean why it is better than subqueries?

Roland Bouman said...

Vladimir,

thanks for your reply.

Regarding the median definition: there are a number of definitions. I know wikipedia is not authoritative, but what I read there matches exactly what I find in several statistics textbooks:

In probability theory and statistics, a median is described as the number separating the higher half of a sample, a population, or a probability distribution, from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, the median is not unique, so one often takes the mean of the two middle values. At most half the population have values less than the median and at most half have values greater than the median. If both groups contain less than half the population, then some of the population is exactly equal to the median. For example, if a < b < c, then the median of the list {a, b, c} is b, and if a < b < c < d, then the median of the list {a, b, c, d} is the mean of b and c, i.e. it is (b + c)/2.

(from: wikipedia)

I believe the 'trick' with taking the mean is sometimes called the 'financial median' and I am sure it is know in other terms as well.

It maybe so that some sources define the median as a special case of a percentile (or some other quantile) However, which way you turn it, there will always be a problem whenever a single value appears multiple times and does not allow a clean cut along the quantile boundary.

Your remark that the median must be taken from the distribution is just yet another notion of the concept median, but the mean is just as accepted, although perhaps not for the same applications.

Your observation that all values are concatenated is correct. I agree that it may seem inefficient, and I am sure it is when the dataset grows very large. However, for this particular example of 2000 films (sakila database),

I get:

my solution: returns immediately with 173

A solution with a JOIN:

SELECT DISTINCT p.length
FROM film p
INNER JOIN film v
ON p.length > v.length
GROUP BY p.film_id
HAVING CAST(
COUNT(*) / (
SELECT COUNT (*)
FROM film
)
AS DECIMAL(3,2)
) = 0.90

takes about 0.50 sec, and returns 174

A solution with a subquery:

SELECT DISTINCT length
FROM film f
WHERE .90 =
CAST(
(SELECT COUNT(*)
FROM film p
WHERE p.length < f.length
)
/ (SELECT COUNT(*) FROM film)
AS DECIMAL(3,2)
)

takes 0.13 seconds, also returning 174.

So for this dataset, my method is by far the fastest. I tried also on a larger dataset of payments, calulating the 90th percentile of the amount. (16000 something rows).

There, my method returns in 0.09s or less, and the join and subquery solution just don't seem to want to return anytime soon - i killed them.

I would be very interested in hearing alternative solutions though. I mean, I don't claim my method is elegant or whatever - I would much rather use something that is not dependent upon group_concat_max_len, so if you know something, please post here and let us all know.

Thanks, and kind regards,

Roland

Pavel Stehule said...

Great idea

Pavel

Vladimir said...

Roland,

OK, not putting under a question the validity of your definition anymore I'm still fairly surprised, as what I still remember from university times is that the median intuitively is the most "typical" value (or values) in the distribution. Just out of curiosity I would appreciate if you point me to some books/applications of this "mean" median.

The calculation:
Of course your join examples will be very slow as you over join all the rows. However this is not necessary. What we need is the number of values that are "below" the given. So instead of using the whole table as the join predicate we can rather use something like:

(select count(*) c, length l from film group by l [order by l])

If the number of distinct values in the table is relatively small this will work fast regardless of the total number of rows in the table.

I've got the following query that worked for me:

select sql_no_cache
sum(g1.r) sr, g2.length l, sum(g1.r)/(select count(*) from film) p
from
(select count(*) r, length from film group by length) g1
join
(select count(*) r, length from film group by length) g2
on g1.length < g2.length
group by g2.length
having p > 0.9
order by p
limit 1

(of course it can be optimized to use temp tables; we might also play with "WITH ROLLUP" instead of select count(*) for the whole table)

I have created a MyISAM table and tried this query and the group_concat one (spec: my quite old notebook PM-2000, 2Gb, 4500rpm, sql_no_cache, default memory settings, except increased concat buffer to 14M, every query run 3 times, win xp):

4M int rows, 999 distinct values:

group_concat: 1 min 6.23 sec
group_concat: 1 min 6.31 sec
group_concat: 1 min 6.48 sec
groupby-join: 5.36 sec
groupby-join: 5.28 sec
groupby-join: 5.31 sec

1M int rows, 999 distinct values:

group_concat: 5.31 sec
group_concat: 5.30 sec
group_concat: 5.31 sec
groupby-join: 2.47 sec
groupby-join: 2.52 sec
groupby-join: 2.47 sec

100K int rows, 999 distinct values:

group_concat: 0.56 sec
group_concat: 0.56 sec
group_concat: 0.56 sec
groupby-join: 1.63 sec
groupby-join: 1.61 sec
groupby-join: 1.63 sec

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

But what I think is the right solution is having something like this on the server side:

SELECT ... SUM(INCREMENTAL length)/(SELECT COUNT(*)) ... GROUP BY length ...

I.e. a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN/MAX, AVG, COUNT and maybe some other aggregate functions. In this case the query would be very simple and fast and not consume any extra memory. (Maybe I will have time to write such a patch :) )

Regards,
Vladimir

Roland Bouman said...

Hi Vladimir,

thanks for coming back and taking the time to think about this.

"what I still remember from university times is that the median intuitively is the most "typical" value (or values) in the distribution. Just out of curiosity I would appreciate if you point me to some books/applications of this "mean" median."

Well, there are three 'similar but different' metrics to characterize the 'typical' value for a normal distribution:
* mean: sum(value)/count(*)
* mode: most abundant value
* median: 'middle' value

If you google for "statistical median" (don't use the quotes) you will get numerous articles, and all I have seen first explain that it is the middle value and that the mean of the two middle values is taken in case there is an even number of observations. I am sure you will find proper references to text books too if you look hard enough. (I've looked into a number of my Dutch statistics texbooks, and all mention the 'mean' trick, some also mention the 50th percentile. It just proves statistics is not that exact I guess.)

Concerning you alternative solution: This is great stuff, Thank you!
Clearly your solution performs much nicer for larger sets.

Concerning your INCREMENTAL proposal:
Maybe I don't understand completely. The statement you gave:

SELECT ... SUM(INCREMENTAL length)/(SELECT COUNT(*)) ... GROUP BY length ...

is not entirely complete so maybe something got lost in translation. What are you calculating exactly here? Where is the COUNT(*) coming from?

kind regards,

Vladimir said...

Roland,

what I meant by "typical value" is exactly what is described in the wikipedia article about the median http://en.wikipedia.org/wiki/Median . In the example with poormen and a richman in the room the "typical" value "on the table" is $5. I apologize if my explanations were not very clear...

About the INCREMENTAL thing: (again my apologies I was in a hurry, and also made a mistake in the pseudocode) the example would be as follows (I will imply the film table in the example):

SELECT COUNT(INC length), length FROM film GROUP BY length

here COUNT(INC length) would be the same as normal COUNT(length) except it wouldnt reset the function value after every group, for example if we have length values 1,1,2,3,4,4,4,5 then the query above would give us

COUNT(INC len) len
2 1
3 2
4 3
7 4
8 5

while the normal

SELECT COUNT(length), length FROM film GROUP BY length

would give

COUNT(len) len
2 1
1 2
1 3
3 4
1 5

the result above as you understand can be reached by a fairly trivial modification to the aggregate function code. Now having the cumulative result above we can continue our calculation:

SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length FROM film GROUP BY length

this query will return us all distinct "length" values from the table along with their percentile values. And finally we filter out what we [dont] need by e.g. adding

HAVING p >= 0.9 ORDER BY p LIMIT 1

Regards,
Vladimir

Roland Bouman said...

Vladimir,

your idea concerning the "accumulative" or "cumulative" aggregate functions is pretty interesting!

It's probably pretty trivial to create UDFs for these to prove the concept. I am quite sure your hunch that they will be pretty fast is a good one - I expect this to be the fastest solution, provided the group by can be executed efficiently.

Note that you can already do this trick with user variables:

-- allow batching
DELIMITER go statements

-- initialize
SELECT COUNT(*), 0
INTO @cnt, @cum_cnt
FROM sakila.film;

-- calculate percentile
SELECT length
, @cum_cnt:=@cum_cnt + COUNT(*)/@cnt as p
FROM film
GROUP BY length
HAVING p >= 0.9
LIMIT 1;

go

Can you repeat your benchmark with this trick and see what it gives? Thanks in advance,

Roland.

Vladimir said...

Roland,

the idea with user variables is excellent except it didn't work for me "as is" (I even updated the server to the 5.1.25-community). The form

@cum_cnt:=@cum_cnt + COUNT(*)

worked for me simply as COUNT(*) which was surprising as it obviously worked for you. I quickly surfed through the variables and SQL modes manual sections, but didn't find anything relevant. Anyway, after some time "hands-on keyboard" I found the way to make it work. Here's the log:

mysql> show create table a \G
*************************** 1. row ****
Table: a
Create Table: CREATE TABLE `a` (
`c` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> select * from a;
+------+
| c |
+------+
| 1 |
| 2 |
| 2 |
| 2 |
| 3 |
| 4 |
| 4 |
+------+
7 rows in set (0.00 sec)

mysql> set @num1 := @num2 := @num3 := 0; select count(*), @num1 := (@num1 + count(*)) c1, @num2 := (@num2 + count(*) + 0) c2, @num3 := (@num3 + 0 + cou
nt(*)) c3 from a group by c;
Query OK, 0 rows affected (0.00 sec)

+----------+------+------+------+
| count(*) | c1 | c2 | c3 |
+----------+------+------+------+
| 1 | 1 | 1 | 1 |
| 3 | 3 | 3 | 4 |
| 1 | 1 | 1 | 5 |
| 2 | 2 | 2 | 7 |
+----------+------+------+------+
4 rows in set (0.00 sec)

As you can see it works only in one specific form (c3). I will try to do the benchmark today...

Vladimir said...

Roland,
the following query worked for me and showed pretty good performance:

mysql> set @cnt1 := @cnt2 := 0; select c, p from (SELECT c, (@cnt1:= @cnt1 + 0 + COUNT(*)) p FROM tt GROUP BY c) x where p > (select count(*) from tt)*0.9 limit 1;
Query OK, 0 rows affected (0.00 sec)

+-------+---------+
| c | p |
+-------+---------+
| 90270 | 3600026 |
+-------+---------+
1 row in set (3.06 sec)

table size: 4M, distinct value count: 100803

the division didn't work properly, also I had to use sub-select as checking p in HAVING re-evaluated p and gave incorrect output.

Roland Bouman said...

Hi Vladimir, great you got it to work!

Maybe we can work together and create UDFs for this too? Just for the fun and of course to find out if a UDF can beat user variables?

Just let me know and we'll make it happen.

Vladimir said...

Roland,

I think it's a great idea. Btw, is there a suitable UDF-package where it could be included?

Roland Bouman said...

Vladimi,

sure, see:

lib_mysqludf_stat

http://www.mysqludf.org/lib_mysqludf_stat/index.php

Anonymous said...

I stumbled across your blog when I wanted to find ideas for good queries for calculating quantiles in a (very) big dataset.

Maybe I can help on some issues (or not) and ask a question:

For some different definitions/calculations of quantiles (which include percentiles) see f.e.
http://mathworld.wolfram.com/Quantile.html (excellent site in general including references )

I believe Q1 and Q5 mentioned there are most commonly used.
(Q5 should give the more common median defintion which may help resolve some issues for users should you implement it).

Regarding the median: For large draws from a distribtution
(whose quantiles you want to compute) the competing definitions you mention (and for Qx) will not matter since the values converge for large N. Again (see above)I believe the more common estimator is X(ceil(N/2)) for odd and
(X(floor(N/2)) + X(ceil(N/2)))/2 for even N.

Regarding the issue of equal ranks:
Since strictly speaking I believe quantiles need unique orders (no equal elements) the most common procedure -I think- is to pretend they were.
So rank 1 2 3 instead of 1 2 2
(2 and 3 being equal) to calculate the quantiles.

Regarding your (RB) solutions:
Is it possible for your queries with user-defined variables to run into problems when mysql does not process the numbers in the right order? e.g. if part of a more complicated query including joins
(e.g. as described in
http://dev.mysql.com/doc/refman/5.0/en/user-variables.html)
or when randomly ordered on disk (e.g. no index)

Gody

Roland Bouman said...

Gody, thanks!

(focussing on the user variable problem)

Yeah it is true - most of the user variable solutions I posted aren't really safe. I kind of got lost doing them, forgetting to properly test the results.

The GROUP_CONCAT solution is safe though, as there is explicit ordering there (independent of an index) and no user variables.

Anonymous said...

There is a bug with this solution when the concatenated string in the group_concat function is larger then 1024 bytes.

This string gets truncated.

Roland Bouman said...

Dear anonymous,

this is not a bug. In the very post you are commenting on, this is mentioned beneath the header max_group_concat_len.

You seem to be the type that doesn't want to waste too much time on reading, So I will post the fix right here.

Simply set the max_group_concat_len server variable to the practical maximum value:

SET @@group_concat_max_len := @@max_allowed_packet;

You can also set this globally in your mysqld option file.

kind regards,

Roland

qants said...

Roland, thanks a lot for this.
I did try some simpler solutions, like ordering the dataset and limiting at the 90th percentile value, but unfortunately, passing variables to mysql LIMIT CLAUSE does not work...even now, after 4 years of people complaining about that.
I tried preparing the statements, and executing them, but as much as it worked in a mysql editor, it did not work in my Jasper Report, who totally refused executing a prepared statement.

This is how i got back to your solution, adapted it, and used it for doing the 90th percentile on a big set of response times

For those of you still looking for a customized query,working on numbers, here is the script i have used:
SELECT
cast(substring_index(substring_index((group_concat(cast(t as char) order by t) ),',',(round(0.9*count(t)+1))),',',-1) as signed int) as 90perc from testresults where testrun_id=2 group by http_request

In this case, t is the value of the response time, http_request is the request in the table, and testrun_id is just the id of the testcase i have run.

I do hope it will help. And it will of course, only help if used together with modifying the group_concat system variable limit.

Thanks again!
Alex

Anonymous said...

Thanks! this helped! Ilya.

Anonymous said...

what you define as median is the average in fact:

"the number of observations with a higher value is equal to the number of observations that has a lower value. For the series {1,2,3,4} the median would be 2+3 = 2.5."

the median would be 2 or 3, the difference between median and avarage is that the median is included in the list of values, the one that has 50% of values being lower and 50% higher.

Roland Bouman said...

Dear @Anonymous,

Thanks for your comment!

First, I should point out that there is absolutely nothing wrong with the given definition of the median as the "number of observations with a higher value is equal to the number of observations that has a lower value."

What is wrong is my example, or at the very least it is confusing. I wrote a plus there to indicate that both 2 and 3 are the middle values (if we disregard the two middle values 2 and 3 themselves). That was silly and confusing. The final figure, 2.5, is the mean of 2 and 3 which is normally accepted as the "right" approximation of the median in case there is an even number of observations. But as an example, this is really confusing since the mean of the entire series is also 2.5. And indeed, typically the arithmetic mean is used as the best measure for the average.

I will update the text to more clearly describe what I meant.

Anonymous said...

I'm having problems with the GROUP_CONCAT limit and don't have access to change config files on the server.

However I have a field in the database for the date each data point was inserted.

Is there a way to order the data by the date and then only use the most recent 100 data points?

I'm not stong in my SQL but this is what I think I'd want if it were syntactically correct:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(
f.length
ORDER BY f.length, f.createdAt LIMIT 0,100
SEPARATOR ','
)
, ','
, 90/100 * COUNT(*) + 1
)
, ','
, -1
) AS `90th Percentile`
FROM myTable AS f

Roland Bouman said...

Hi Terry!

The group concat function does not support a LIMIT clause (syntax here http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_group-concat) so I don't think your example could work.
What you can do is use LIMIT at the end of the SELECT expression, and then apply the aggregation.
However, LIMIT is applied at the very end of the operation, so you have to do it in 2 steps (otherwise, the LIMIT will be applied after the aggregation, which is not what you want)

Something like this:

SELECT ...GROUP_CONCAT(...)
FROM (
SELECT ...
FROM myTable
ORDER BY myDate DESC
LIMIT 0, 100
)

...and then this will caclulate it over the sample of the last (that's because of the DESC, with ASC it would have been the first) 100 values.

While the idea of using only a sample of the values to caculate the percentile is interesting, this particular approach has quite a bunch of problems. There's just no way to know whether the sample you're drawing is a good representation of your entire dataset.

There are several other approaches but each has their own problems. One popular but potentially unreliable method is to use user-defined variables. A clean, but slow approach is to either use a self join or subqueries.

It should be possible to use LIMIT to your advantage if it is acceptable to do the task in 2 steps.
Step 1 would be to calculate the ordinal of the desired percentile value
Step 2 would be to use LIMIT to actually fetch that value.

Example:

mysql> select count(amount)*.90 from sakila.payment ;
+-------------------+
| count(amount)*.90 |
+-------------------+
| 14444.10 |
+-------------------+
1 row in set (0.01 sec)

mysql> select amount from sakila.payment order by amount limit 14444, 1;
+--------+
| amount |
+--------+
| 6.99 |
+--------+
1 row in set (0.02 sec)

HTH,

Roland.

Anonymous said...

Thanks for the incredibly detailed reply!

I should have thought to use an embedded select, but I'm not strong in my SQL.

My application is not mission critical, as I'm just trying to get the Xth percentiles of certain data points on what people think is funny. I was grabbing anything that was better than AVG(), but its too much. I need to narrow down to a high percentile.

If I get a chance I may try both ways (using all the data and just subsets) to see how things change. Maybe one can pull funnier stuff out (but "funny" is subjective, so no REAL way to tell which is better).

Thanks again!

Anonymous said...

Oh yeah... One more question, doing the subquery (your first example) is faster than doing two calls, correct?

I'm programming in php using mysqli object.

Roland Bouman said...

Hi!

whether the subquery is faster than the 2-step solution is not so easy to say in this case.

MySQL Subqueries often don't perform too good. Some of that is fixed in MySQL 5.7 and in MariaDB. And just like with all the other solutions that relyu on GROUP CONCAT it needs the memory to build the list of strings.

However, the 2-step solution with LIMIT has it's own problems (and these even hold for the first solution since that uses LIMIT too). Even though the LIMIT selects only one result row, it still needs to build a temporary table where it stores the sorted result at least up to the row of interest. Since that row is at the very tail of the entire set, this is not trivial and takes time as well.

I can't really make a general statement at this point, you're going to have to try and test with your data to see what performs best. That said, if you don't need the data instantly and you have the time to caclulate the result before hand, then I guess it shouldn't matter too much.

Anonymous said...

Thanks for the detailed responses!