Tuesday, July 08, 2008

MySQL Percentile aftermath: Calculating all quantiles

Are you getting fed up yet with my posts on calculating percentiles? Well, I'm sorry but I'm not quite finished.

Here's a simple, fast method to calculate the specified number of quantiles:

-- set the number of quantiles, for example:
-- quartiles:     4
-- deciles:      10
-- percentiles: 100

SET @quantiles := 4;  -- select quartiles

-- calculate all quantiles
SELECT amount AS metric
,      @n DIV (@c DIV @quantiles) AS quantile
,      @n AS N
FROM   sakila.payment
CROSS JOIN (
       SELECT @n := 0        -- rownumber
       ,      @c := COUNT(*) -- needed to calculate the quantile partitions
       FROM   sakila.payment
) c
WHERE  NOT (                                  -- modulo zero (= FALSE): we are at a quantile boundary
       (@n := @n + 1) % (@c DIV @quantiles)   -- is the rownumber a multiple of the partition size?
)
ORDER BY amount;  -- need ordered partitions

You can find this snippet on MySQL Forge.
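The snippet can also be wrapped as a derived table to pull out a single boundary, for example the second quartile (roughly the median, given the crude partitioning above). A sketch:

SET @quantiles := 4;

SELECT metric AS median
FROM   (
       SELECT amount AS metric
       ,      @n DIV (@c DIV @quantiles) AS quantile
       FROM   sakila.payment
       CROSS JOIN (
              SELECT @n := 0, @c := COUNT(*)
              FROM   sakila.payment
       ) c
       WHERE  NOT ((@n := @n + 1) % (@c DIV @quantiles))
       ORDER BY amount
       ) q
WHERE  quantile = 2;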

29 comments:

Minerva's priest said...

Thanks, the query works well. BTW, I could not make it work in MySQL Query Browser; as a matter of fact, the error is due to a malfunction of the SET statement in the browser. It works just fine in the mysql client.

rpbouman said...

Hi!

Glad to hear it is working for you ;)

The reason that it might not work, or work differently, in the Query Browser is that it creates a new session for each separate statement (or script) you execute.

The second solution that initializes the variables inside the statement should work fine if you also initialize the @quantiles variable there.
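For example, something like this (a sketch based on the snippet in the post, with @quantiles initialized inside the statement itself):

SELECT amount AS metric
,      @n DIV (@c DIV @quantiles) AS quantile
,      @n AS N
FROM   sakila.payment
CROSS JOIN (
       SELECT @n := 0          -- rownumber
       ,      @c := COUNT(*)   -- total number of rows
       ,      @quantiles := 4  -- initialize here instead of with a separate SET
       FROM   sakila.payment
) c
WHERE  NOT ((@n := @n + 1) % (@c DIV @quantiles))
ORDER BY amount;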

Alternatively, you can open a script tab, paste all the code there, and execute it as a script.

I hope this helps,

Kind regards,

Roland

Minerva's priest said...

Thanks for the reply. That makes sense. BTW, I was too quick to say that the query works. Here is the output from the query:

select count(*) from table1;
+----------+
| count(*) |
+----------+
|   237247 |
+----------+

output from the quantile query:
+---------------------+----------+--------+
| metric              | quantile | N      |
+---------------------+----------+--------+
| -0.4282900000000001 |        4 | 237247 |
|  0.0327670000000000 |        4 | 237247 |
|  0.3726560000000000 |        4 | 237247 |
|  0.6718811000000000 |        4 | 237247 |
+---------------------+----------+--------+

Notice the quantile and the N fields. They are the same for all quantiles.

rpbouman said...

Hi!

Can you post your code and, if possible, example data, so we can see what is happening?

Thanks!

Minerva's priest said...

-- I first create a dummy dataset with 100 observations in a spreadsheet
-- and load it into MySQL
create table example (rowid int NOT NULL, metric REAL);

LOAD DATA LOCAL INFILE '~/example.csv'
INTO TABLE example
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(rowid, metric);

set @row=0;
create table quantile as SELECT metric
, @n DIV (@c DIV @quantiles) AS quantile
, @n AS N,
-- I added row since the quantile variable was misbehaving
@row:=@row+1 as quant
FROM example
CROSS JOIN (
SELECT @n:=0 -- rownumber
, @c:=COUNT(*) -- need this to calculate quantile partitions
FROM example
) c
WHERE NOT ( -- modulo zero (=false), we are at the quantile
(@n:=@n+1) % (@c DIV @quantiles) -- rownumber equal to the quantile partition?
)
ORDER BY metric;

+-------------------+----------+------+-------+
| metric            | quantile | N    | quant |
+-------------------+----------+------+-------+
| 0.411637636786402 |        4 |  100 |     1 |
| 0.623982239490814 |        4 |  100 |     2 |
| 0.863134303529804 |        4 |  100 |     3 |
| 0.913423039292731 |        4 |  100 |     4 |
+-------------------+----------+------+-------+


-- More diagnostics
select count(*), b.quant from example as a, quantile as b where a.metric<=b.metric group by b.metric;
+----------+-------+
| count(*) | quant |
+----------+-------+
|       44 |     1 |
|       60 |     2 |
|       81 |     3 |
|       86 |     4 |
+----------+-------+

rpbouman said...

Hi!

I think I may have found a cause.

Can you please see what happens if you add an index for the metric column?

ALTER TABLE example ADD INDEX(metric);

Minerva's priest said...

Thanks, it works now! It would be great if you could write up what was going wrong.

Daniele Medri said...

Ciao Roland,

for more info about quantile(), take a look at the R code. There are many ways to calculate these values, and one notable application is finding outliers:

IQR * 1.5 (Tukey's rule)

where

IQR is the Inter-Quartile Range.

Following this rule, everything lower than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR is an outlier.
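Expressed in MySQL terms, the fences could be sketched like this for sakila.payment (a sketch only; the offsets 4012 and 12036 are COUNT(*) DIV 4 and 3 * (COUNT(*) DIV 4), and Q1/Q3 here are just the crude quartile boundaries):

SELECT q1.amount                                 AS Q1
,      q3.amount                                 AS Q3
,      q3.amount - q1.amount                     AS IQR
,      q1.amount - 1.5 * (q3.amount - q1.amount) AS lower_fence
,      q3.amount + 1.5 * (q3.amount - q1.amount) AS upper_fence
FROM   (SELECT amount FROM sakila.payment ORDER BY amount LIMIT  4012, 1) q1
CROSS JOIN
       (SELECT amount FROM sakila.payment ORDER BY amount LIMIT 12036, 1) q3;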

Howard Chong said...

I got it to work. Thanks. Running:


+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| protocol_version        | 10                           |
| version                 | 5.1.36-community-log         |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | ia32                         |
| version_compile_os      | Win32                        |
+-------------------------+------------------------------+

MikeB said...

Hi Roland,

I’m trying to build a database of our past project results, and one of the things I’d like to gain access to is quartile numbers.

I’ve been reviewing the code from above, and I think I’ve managed to get it working, but I’m still having a few odd issues; I was hoping you might be able to help me out.

Here is how my process works, and then I’ll explain what seems odd.

First I create a table from a query. I call the table quartileTable, and it has the original ID and a data value; the columns are called ID and metric (I used names from a post above).

I then create an index on metric (again, from your previous suggestion), which got things working much better than I had them originally.

I run your sample code, and get the following output:

Metric  quantile  N
29      1         58
54      2         116
76      3         174
99      4         232


The weird thing is that there are 233 records in the initial table; any idea why it is only returning 232?

Also, if I take the initial table data and dump it to Excel, my numbers are slightly different (N=57 in the first quartile, and the third quartile is 77). This may be partly due to the 232 instead of 233. The min and max in the table are also 1 and 100, so shouldn’t Q4 be 100?

Thanks,

Mike

rpbouman said...

Hi Michael!

The short answer to your question is that your number of cases, 233, cannot be divided perfectly into 4 even quartiles. There is no universal agreement on how to calculate the quartiles in this case.

My method takes about the crudest way to deal with this, and in your particular case, it results in the 233rd row being ignored completely. Whether this is acceptable or not depends.
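To illustrate with your numbers (a quick sketch of the arithmetic):

-- with 233 rows and 4 quartiles, the partition size is 233 DIV 4 = 58;
-- the WHERE clause keeps only row numbers that are exact multiples of 58,
-- i.e. rows 58, 116, 174 and 232, and row 233 never satisfies it
SELECT 233 DIV 4           AS partition_size -- 58
,      4 * (233 DIV 4)     AS last_row_kept  -- 232
,      233 MOD (233 DIV 4) AS rows_ignored   -- 1
;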

My assumption has always been that if the distribution of data is reasonably even, and the number of cases is sufficiently large, there is virtually no difference in the final quartile result.

The numbers you quote from Excel are interesting. I am completely baffled why the first quartile in that case has only 57 rows: in my mind, 233 / 4 is a little over 58, so it seems strange to have a quartile with fewer than 58 rows.

However, it could make sense to have one quartile that contains the excess row, I mean the row that would otherwise be ignored in my crude method.

I hope this helps for now.

Roland

CintiJohn said...

Hi Roland:

This code rocks, but I do have a question about whether or not it requires a "minimum" number of cases.

I use the code in a stored procedure whereby an individual can enter no fewer than 30 IDs and the code generates decile breaks, but I noticed that if 33 IDs are entered I'll get back 11 breaks rather than 10 (see below).

ID | Metric | nthtile break
---+--------+--------------
28 | 7.7    |  1
 9 | 9.5949 |  2
10 | 11     |  3
26 | 12.1   |  4
 6 | 13.3   |  5
12 | 13.4   |  6
20 | 14.3   |  7
 2 | 15.151 |  8
23 | 15.715 |  9
24 | 18     | 10
 7 | 25.8   | 11

Any thoughts would be greatly appreciated.

mortee said...

May I suggest replacing the WHERE clause with

ceil((@n := @n + 1) / @c * @quantiles) * @c / @quantiles - @n < 1

This checks whether @n is the last row in any whole Q'th of the complete set. Thus, not every quantile will contain the same number of observations (they may be off by one), but it won't round the complete set down to a multiple of Q either; instead it processes the whole thing.
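Plugged into the original query, that would look something like this (a sketch, with the quantile label also adjusted so that the last, possibly shorter, partition still gets the right number; the same user-variable caveats apply):

SET @quantiles := 4;

SELECT amount AS metric
,      CEIL(@n / @c * @quantiles) AS quantile
,      @n AS N
FROM   sakila.payment
CROSS JOIN (
       SELECT @n := 0
       ,      @c := COUNT(*)
       FROM   sakila.payment
) c
WHERE  CEIL((@n := @n + 1) / @c * @quantiles) * @c / @quantiles - @n < 1
ORDER BY amount;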

spencer7593 said...

There is no guaranteed behavior of MySQL user variables when used in this way (the MySQL documentation is careful to point this out.) In order to get predictable behavior, it is imperative that the order of operations in the execution plan be specified. Because MySQL always materializes inline views (unlike Oracle), we can use inline views to specify the order of operations. (We need to be aware that this behavior of MySQL regarding inline views is subject to change in future releases of MySQL.)

This query specifies the order of operations, so that we can be assured that the rownumber (n) is assigned to the correct rows.

SELECT r.amount AS metric
,      r.n AS n
,      r.n DIV (r.cnt DIV r.quantiles) AS quantile
FROM   (
        SELECT @n := @n + 1 AS n      -- rownumber over the ordered set
        ,      q.quantiles
        ,      q.cnt
        ,      q.amount
        FROM   (SELECT @n := 0) i     -- initialize the rownumber
        CROSS JOIN (
               SELECT v.quantiles
               ,      c.cnt
               ,      p.amount
               FROM   (SELECT @quantiles AS quantiles) v
               JOIN   (SELECT COUNT(1) AS cnt FROM sakila.payment) c
               JOIN   sakila.payment p
               ORDER BY p.amount      -- force the ordering before numbering
               ) q
        ) r
WHERE  r.n MOD (r.cnt DIV r.quantiles) = 0
ORDER BY r.n;

Note this query includes the same quantile derivation as your original query, which (as you explain) is rudimentary, and works for a number of rows that is an exact multiple of the number of quantiles. (The expression to accommodate a non-multiple number of rows is more involved.)

The ONLY place that we actually need a user variable (@n) is where we use it to generate a rownumber (n) for an ordered set of rows. (I do reference the @quantiles user variable in one place in the statement, which will be evaluated in the first step in the execution. This could just as easily be a literal constant.)

The point is that we need to be VERY careful that the order of execution in the plan is actually specified, and that this will not be changed by the number of rows in the table or added indexes. We are relying on the behavior of MySQL that causes all inline views to be materialized, to guarantee that the user variables "work" correctly.

(The expression to more accurately derive which rows are the quantile rows for a given number of rows (r.cnt) and a given number of quantiles (r.quantiles) is more involved, but it is easy to see where those expressions get "plugged" into the SELECT list and WHERE clause of the outermost query.)

rpbouman said...

Hi @spencer7593,

thanks for the in-depth comment - much appreciated. Yes, I agree that the order of operations can to some extent be forced - I know a few people who have explored this thoroughly.

Regarding the inline view materialization: I recently learnt that MariaDB has done some work on the optimizer that could change this.

If you think about it, the whole premise of a fixed, predictable evaluation order is at odds with the declarative nature of SQL: guaranteeing a particular evaluation order potentially limits the query planner's ability to come up with the best plan. So I would still recommend staying away from it (unless there is a very good reason to do otherwise).

spencer7593 said...

I agree with you that user variables can be dangerous. And if (when?) the MySQL optimizer stops forcing inline views to be materialized, the whole "controlling the order of operations" approach goes out the window.


Of course, user variables will still be supported. What goes out the window are tricks like the "@n := @n + 1" we use to assign rownumbers.

But then again, we can also hope that MySQL will add an Oracle-like ROWNUM pseudo-column, add analytic functions (like Oracle and SQL Server have), and that the optimizer will be able to "push" predicates from an outer query down into an inline view.
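For what it's worth, with such analytic functions the whole exercise would reduce to something like this (a sketch in the Oracle / SQL Server dialect; it does not run on the MySQL versions discussed here):

SELECT amount AS metric
,      ROW_NUMBER() OVER (ORDER BY amount) AS n
,      NTILE(4) OVER (ORDER BY amount)     AS quantile  -- 4 = quartiles
FROM   sakila.payment;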

rpbouman said...

Hi @spencer7593,

yes, agreed. User variables that store a value to parameterize a subsequent query will always work; it's just the assignments within one multi-row statement that are problematic.

I would also like to see analytical functions. But I doubt they will be implemented any time soon. It's just not on the typical MySQL user's radar.

vak said...

looks just brilliant, thank you!

Sadly, there is no WITH...AS (aka CTE) in MySQL. In my case "sakila.payment" is the result of huge JOINs. So it looks like one has to textually repeat those JOINs twice, and MySQL will then also query them twice... Any chance to avoid this? (My access to the DB is read-only, so there are no temp tables and no VIEWs for me...)

regards,
Valery

rpbouman said...

@vak, how many values are you handling? There is a way to do this with GROUP_CONCAT, provided you can set group_concat_max_len to a sufficiently large value (at the session level, no admin privileges required).
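For example, something along these lines (a sketch, here for the median of sakila.payment.amount; the factor 0.50 can be replaced by any quantile fraction, and group_concat_max_len must be large enough to hold all values):

SET SESSION group_concat_max_len = 1024 * 1024;  -- session level, no admin privileges needed

SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX(
           GROUP_CONCAT(amount ORDER BY amount SEPARATOR ',')
         , ','
         , CEIL(0.50 * COUNT(*))  -- 0.50 = median; use 0.25 for Q1, and so on
         )
       , ','
       , -1
       ) + 0 AS median
FROM   sakila.payment;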

vak said...

oh, Roland, your blog is still alive, thank heavens :)

on the dev box I've got 20k IRR values, but on the live one it will be a huge thing...

vak said...

Also, although the 'metric' column looks more or less plausible to me, the quantile values next to the metric values do not look like they correspond...

+---------+----------+------+
| metric  | quantile | N    |
+---------+----------+------+
|  -18.40 |        2 | 4970 |
|    0.00 |        3 | 7455 |
|    0.00 |        4 | 9940 |
| 5000.00 |        1 | 2485 |
+---------+----------+------+

not sure if I am getting things right.

P.S. Sorry for the mess; I originally posted this comment to the other topic (not sure how it happened).

rpbouman said...

@vak, yeah, your result does look wrong; the values should go up along with the quantile value.

The solution is based on user variables, and this can lead to unstable results.

That said, I think there may be a way around it that fixes all these problems, as long as you can afford to do it in multiple queries. Is that acceptable?

If so, you can do:

1) do a SELECT COUNT(*) query to determine the number of observations you're dealing with

2) perform one query of the form SELECT value FROM table ORDER BY value LIMIT offset, 1

You would have to execute query #2 for each quantile, and you'd have to calculate the offset as #quantile * (count(*) / quantiles)

Unfortunately LIMIT cannot be parameterized so you'll have to calculate the exact value before hand.

sakila.payment quartiles example:

SELECT COUNT(*) div 4 FROM sakila.payment

yields: 4012

(it's div 4 because we have 4 quartiles)

SELECT amount FROM payment ORDER BY amount LIMIT 4012, 1;
UNION ALL
SELECT amount FROM payment ORDER BY amount LIMIT 8024, 1;
UNION ALL
SELECT amount FROM payment ORDER BY amount LIMIT 12036, 1;
UNION ALL
SELECT MAX(amount) FROM payment;

vak said...

Thank you, Roland, in my case it will work out, indeed :)

rpbouman said...

@vak, oh darn, it seems my syntax was not entirely correct. Here is the corrected version:

(SELECT amount FROM payment ORDER BY amount LIMIT 4012, 1)
UNION ALL
(SELECT amount FROM payment ORDER BY amount LIMIT 8024, 1)
UNION ALL
(SELECT amount FROM payment ORDER BY amount LIMIT 12036, 1)
;

vak said...

I don't use the UNION at all; I just use the idea you meant and do all the rest on the client side -- this time it is PHP.

rpbouman said...

@vak, I filed a feature request for it:

http://bugs.mysql.com/bug.php?id=72892

You could consider adding a comment to voice your support. That said, these things tend to take time. As in, years and years. So don't hold your breath.

Another possibility would be to ask your DBA to give you a schema of your own where you are allowed to create stored routines. Because then you could open a CURSOR, and run through the recordset just once, storing only those rows that you need. That would be faster than multiple queries.
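A rough sketch of what such a routine could look like (hypothetical procedure name; it collects the boundary values in a string, so it does not need a temporary table):

DELIMITER //

CREATE PROCEDURE quantile_scan(IN p_quantiles INT)
BEGIN
  DECLARE v_total, v_step, v_n INT DEFAULT 0;
  DECLARE v_amount DECIMAL(10,2);
  DECLARE v_result TEXT DEFAULT '';
  DECLARE v_done   BOOL DEFAULT FALSE;
  DECLARE cur CURSOR FOR
    SELECT amount FROM sakila.payment ORDER BY amount;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_done := TRUE;

  SELECT COUNT(*) INTO v_total FROM sakila.payment;
  SET v_step := v_total DIV p_quantiles;

  OPEN cur;
  scan: LOOP
    FETCH cur INTO v_amount;
    IF v_done THEN
      LEAVE scan;
    END IF;
    SET v_n := v_n + 1;
    IF v_n MOD v_step = 0 THEN  -- keep only the quantile boundary rows
      SET v_result := CONCAT(v_result, IF(v_result = '', '', ','), v_amount);
    END IF;
  END LOOP;
  CLOSE cur;

  SELECT v_result AS quantile_boundaries;  -- a single scan, a single round trip
END //

DELIMITER ;

CALL quantile_scan(4);  -- quartiles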

vak said...

Hi Roland, I took a look at your feature request and have my 2 cents, if I dare:

1. If LIMIT is not parameterizable -- BTW, I was really surprised by your (reasonable) comment about this -- then this feature request will be in vain. Thus, one needs parameterization.

2. Doing ORDER BY multiple times is indeed expensive, but this is just a particular case of unwanted redundancy... caused by not having WITH...AS (aka CTE).

3. If one goes further and considers statistical sampling, Monte Carlo methods, etc., then we quickly arrive at arbitrary access to rows specified by their ordinal number. A natural way to express this would be something like: SELECT ... LIMIT (SELECT row_number FROM desired_row_numbers);

I don't have anywhere near your experience, so you certainly see this much more deeply.

regards,
Valery

rpbouman said...

Vak, all your points are rational and reasonable.

The reason why I voiced my feature request so modestly is that MySQL has a history of avoiding changes to the parser and to a lesser extent the optimizer. (Actually, things have improved under Oracle)

So my thinking was it would be better to ask for what seems like a relatively simple feature that solves the main problem, i.e. the multiple scans.

IMO, parameterization, or even using a SELECT statement, is "just" a convenience; one can live without these features as long as one is capable of generating statements dynamically (which one usually is).

What we cannot do from within the application is avoid multiple table scans. Well, we can, but we'd need to pull the entire dataset.
