Here's a simple, fast method to calculate the specified number of quantiles:

-- set the number of quantiles, for exmple:

-- quartiles: 4

-- deciles: 10

-- percentiles: 100

SET @quantiles:=4; -- select quartiles

-- calculate all quantiles

--

--

SELECT amount AS metric

, @n DIV (@c DIV @quantiles) AS quantile

, @n AS N

FROM sakila.payment

CROSS JOIN (

SELECT @n:=0 -- rownumber

, @c:=COUNT(*) -- need this to calculate quantile partitions

FROM sakila.payment

) c

WHERE NOT ( -- modulo zero (=false), we are at the quantile

(@n:=@n+1) % (@c DIV @quantiles) -- rownumber equal to the quantile partition?

)

ORDER BY amount; -- need ordered partitions

You can find this snippet on MySQL Forge.

## 17 comments:

Thanks the query works well. BTW I could not make it work in mysql-query-browser. As a matter of fact the error is due to malfunction of SET statement in the browser. Works just fine on mysql.

Hi!

Glad to hear it is working for you ;)

The reason that it might not work, or work differently in the query browser is that it creates a new session for each separate statement (or script) you execute.

The second solution that initializes the variables inside the statement should work fine if you also initialize the @quantiles variable there.

ALternatively, you can open a script tab , and paste all code there and execute it as a script.

I hope this helps,

Kind regards,

Roland

Thanks for the reply. That makes sense, btw I was too fast to say that the query works. Here is the output from the query,

select count(*) from table1;

+----------+

| count(*) |

+----------+

| 237247 |

+----------+

output from the quantile query

+---------------------+----------+--------+

| metric | quantile | N |

+---------------------+----------+--------+

| -0.4282900000000001 | 4 | 237247 |

| 0.0327670000000000 | 4 | 237247 |

| 0.3726560000000000 | 4 | 237247 |

| 0.6718811000000000 | 4 | 237247 |

+---------------------+----------+--------+

Notice the quantile and the N fields. They are the same for all quantiles.

Hi!

can you post your code, and if possible, example data so we can see what is happening?

Thanks!

--I first create a dummy dataset with 100 observations in a spreadsheet

--Load it in mysql

create table example (rowid int NOT NULL, metric REAL);

LOAD DATA LOCAL INFILE '~/example.csv'

INTO TABLE example

FIELDS TERMINATED BY ','

LINES TERMINATED BY '\n'

(rowid, metric);

set @row=0;

create table quantile as SELECT metric

, @n DIV (@c DIV @quantiles) AS quantile

, @n AS N,

-- I added row since the quantile variable was misbehaving

@row:=@row+1 as quant

FROM example

CROSS JOIN (

SELECT @n:=0 -- rownumber

, @c:=COUNT(*) -- need this to calculate quantile partitions

FROM example

) c

WHERE NOT ( -- modulo zero (=false), we are at the quantile

(@n:=@n+1) % (@c DIV @quantiles) -- rownumber equal to the quantile partition?

)

ORDER BY metric;

+-------------------+----------+------+-------+

| metric | quantile | N | quant |

+-------------------+----------+------+-------+

| 0.411637636786402 | 4 | 100 | 1 |

| 0.623982239490814 | 4 | 100 | 2 |

| 0.863134303529804 | 4 | 100 | 3 |

| 0.913423039292731 | 4 | 100 | 4 |

+-------------------+----------+------+-------+

--More dignostic

select count(*), b.quant from example as a, quantile as b where a.metric<=b.metric group by b.metric;

+----------+-------+

| count(*) | quant |

+----------+-------+

| 44 | 1 |

| 60 | 2 |

| 81 | 3 |

| 86 | 4 |

+----------+-------+

Hi!

I think I may have found a cause.

Can you please see what happens if you add an index for the metric column?

ALTER TABLE example ADD INDEX(metric);

Thanks, it works now! It will be great if you could write what was going wrong.

Ciao Roland,

for more info about quantile() take a look to R code. Theare many way to calculate these values and one considerable choice refer to find outliers.

IQR*1.5 (tukey rule)

where

IQR is Inter-Quantile-Range

Following this rule, everything lower or greater Q1 and Q4 is an outlier.

I got it to work. Thanks. Running:

+-------------------------+------------------------------+

| Variable_name | Value |

+-------------------------+------------------------------+

| protocol_version | 10 |

| version | 5.1.36-community-log |

| version_comment | MySQL Community Server (GPL) |

| version_compile_machine | ia32 |

| version_compile_os | Win32 |

+-------------------------+------------------------------+

Hi Roland,

I’m trying to build a database of our past project results, and one of the things I’d like to gain access to is quartile numbers.

I’ve been reviewing the code from above, and I think I’ve managed to get it working, but still having a few odd issues – I was hoping you might be able to help me out.

Here is how my process works, and then I’ll explain what seems odd.

First I create a table from a query, I call the table quartileTable and it has the original ID, and a data value, the columns are called ID and metric (used names from a post above).

I then create an index on metric (again, from your prev. suggestion) which got things working much better than I had them originally.

I run your sample code, and get the following output:

Metric quantile N

29 1 58

54 2 116

76 3 174

99 4 232

The weird thing is that there are 233 records in the initial table, any idea why it is only returning 232?

Also, if I take the initial tabledata and dump it to excel, my numbers are slightly different (N=57 in the first quartile, third quartile is 77). This may be partly due to the 232 instead of 233. The min and max in the table is also 1 and 100, so shouldn’t q4 be 100?

Thanks,

Mike

Hi Michael!

The short answer to your question is that your number of cases, 233,

cannot be divided perfectly in 4 even quartiles. There is no universal agreement on how to calculate the quartiles in this case.

My method takes about the crudest way to deal with this, and in your particular case, it results in the 233th row to be ignored completely. Whether this is acceptable or not depends.

My assumption has always been that

if the distribution of data is reasonably even, and the number of

cases is sufficiently large, there is virtually no difference in the

final quartile result.

The numbers you quote from excel are interesting. I am completely

baffled why the first quartile in that case has only 57 rows: in my

mind, 233 / 4 is a little over 58, so it seems strange to have a

quartile with less than 58 rows.

However, it could make sense to have one quartile that contains the excess row, I mean the row that would otherwise be ignored in my crude method.

I hope this helps for now.

Roland

Hi Roland:

This code rocks but I do have a question on whether or not it requires a "minumum" number of cases.

I use the code in a stored procedure whereby an individual can enter no less than 30 ID's and the code generates decile breaks but I noticed that if 33 ID's are entered I'll get back 11 breaks rather than 10 (see below).

ID|Metric|nthtile break

-----------------------

28|7.7 |1

9 |9.5949|2

10|11 |3

26|12.1 |4

6 |13.3 |5

12|13.4 |6

20|14.3 |7

2 |15.151|8

23|15.715|9

24|18 |10

7 |25.8 |11

Any thoughts would be greatly appreciated.

May I suggest to replace the WHERE clause by

ceil((@n := @n + 1) / @c * @quantiles) * @c / @quantiles - @n < 1

This checks if @n is the last one in any whole Q'th of the complete set. Thus, not each quantile will find the same amount of observations (may be off by one), but it won't round the complete set to Q either, instead process the whole thing.

There is no guaranteed behavior of MySQL user variables when used in this way (the MySQL documentation is careful to point this out.) In order to get predictable behavior, it is imperative that the order of operations in the execution plan be specified. Because MySQL always materializes inline views (unlike Oracle), we can use inline views to specify the order of operations. (We need to be aware that this behavior of MySQL regarding inline views is subject to change in future releases of MySQL.)

This query specifies the order of operations, so that we can be assured that the rownumber (n) is assigned to the correct rows.

SELECT r.amount AS metric

, r.n AS n

, r.n DIV (r.cnt DIV r.quantiles) AS quantile

FROM (SELECT @n := @n + 1 AS n

, q.quantiles

, q.amount

FROM (SELECT v.quantiles

, c.cnt

, p.amount

FROM (SELECT @quantiles AS quantiles) v

JOIN (SELECT COUNT(1) AS cnt FROM sakila.payment) c

JOIN sakila.payment p

ORDER BY p.amount

) q

) r

WHERE r.n MOD (r.cnt DIV r.quantiles) = 0

ORDER BY r.n

Note this query includes the same quantile derivation as your original query, which (as you explain) is rudimentary), and works for a number of rows that is an exact multiple of quantiles. (The expression to accommodate a non-multiple number of rows is more involved.

The ONLY place that we actually need a user variable (@n) is where we use it to generate a rownumber (n) for an ordered set of rows. (I do reference the @quantiles user variable in one place in the statement, which will be evaluated in the first step in the execution. This could just as easily be a literal constant.)

The point is that we need to be VERY careful that the order of execution in the plan is actually specified, and that this will not be changed by the number of rows in the table or added indexes. We are relying on the behavior of MySQL that causes all inline views to be materialized, to guarantee that the user variables "work" correctly.

(The expression to more accurately derive which rows are the quantile rows for a given number of rows (r.cnt) and a given number of quantiles (r.quantiles) is more involved, but it is easy to see where those expressions get "plugged" into the SELECT list and WHERE clause of the outermost query.)

Hi @spencer7593,

thanks for the in-depth comment - much appreciated. Yes, I agree that the order of operations can to some extent be forced - I know a few people that have explored this thoroughly.

Regarding the inline view materialization: I recently learnt that mariadb has done some work on the optimizer that could change this.

If you think about it, the whole premisse of a fixed, predictable evaluation order is at odds with the declarative nature of SQL: guaranteeing a particular evaluation order potentially limits the query planner to come up with the best plan. So I would still recommend to stay away from it (unless there is a very good reason not to)

I agree with you that user variables can be dangerous. And if (when?) the MySQL optimizer stops forcing inline views to be materialized, the whole " controlling the order of operations" goes out the window.

Of course, user variables will still be supported. What goes out the window are the tricks like the "@n := @n +1" we use to assign rownumbers.

But then again, we can also hope that MySQL will add an Oracle-like ROWNUM pseudo-column, add analytic functions (like Oracle and SQL Server have), and that the optimizer will be able to "push" predicates from an outer query down into an inline view.

Hi @spencer7593,

yes agreed. user variables to store a value to parameterize a subsequent query, that will always work, just not assignments within one multi-row statement.

I would also like to see analytical functions. But I doubt they will be implemented any time soon. It's just not on the typical MySQL user's radar.

Post a Comment