Monday, December 17, 2007

Calculating the Financial Median in MySQL

I believe I found a new method to calculate the median in MySQL. I would not be surprised if this method has been figured out by somebody else already. However, I can't seem to find any resources on the internet describing this method, so for now I flatter myself by assuming the method is original.

(Please do post your comments to this blog to correct me on that should I be wrong so I have a chance to rectify.)

The method I'm describing is a one-pass, pure SQL method. It does not require subqueries, cursors or user variables. However, it does rely on the MySQL specific functions GROUP_CONCAT() and SUBSTRING_INDEX()

I'll be maintaining a snippet for this method at MySQL Forge.
If you want to know what the median is, and how my snippet works, read on.

Some background


Like the mean and the mode, the median is an important metric to characterize the distribution of values in a collection. If we have a ordered collection of (numerical) values, the median is the value for which the number of entries that has a value that is higher than the median is exactly equal to the number of entries that has a value that is lower than the median. If there is an odd number of entries in the collection, the value of the median corresponds to the value of the entry that lies exactly in the middle of the list. If there is an even number of entries, the median is calculated as the mean of the two middle values.

MySQL offers a number of aggregate functions. Unfortunately, MySQL does not offer a function to calculate the median.

Even though MySQL does not support a MEDIAN() function natively, it is still possible to calculate it. You can:

The snippet

Here's a snippet that shows how to calculate the median replacement cost for a film in the sakila sample database:

select
(
substring_index( -- left median: max value in lower half:
substring_index(
group_concat( -- list all values in ascending order
f.replacement_cost
order by f.replacement_cost
)
, ','
, ceiling(count(*)/2) -- left half of the list
)
, ','
, -1 -- keep only the last value in list
)
+ substring_index( -- right median: min value in upper half:
substring_index(
group_concat( -- list all values in ascending order
f.replacement_cost
order by f.replacement_cost
)
, ','
, -ceiling(count(*)/2) -- right half of the list
)
, ','
, 1 -- keep only the first value in list
)
) / 2 -- average of left and right medians
as median
from sakila.film f;
(For the latest version, refer to MySQL Forge)

So, how does this snippet work? In the remainder of this post, I'll explain the inner workings of this method in a top-down fashion.

The mean of the left and right median

The method I'm describing always takes the mean of the 'left' and 'right' median.

select
(
left-median(f.replacement_cost)
+ right-median(f.replacement_cost)
) / 2 -- average of left and right medians
as median
from sakila.film f;
(Note that the usage of left-median() and right-median() is just an explanation of the structure - in reality there are not two distinct functions by that name)

The terms 'left median' and 'right median' are not common so they need an explanation.

Let's visualize the process to determine the median. We can do this by imagining that we have an ordered list of values and that we point our left index finger to the lowest value in the list and our right index finger to the highest value in the list.

Now, we look at our fingers. If there is more than one entry between our right and left finger, we move or left finger one entry to the right and our right finger one entry to the left, and we keep doing that until there are no more entries between our left and right finger. Once we're there, the value of the entry that is pointed to by our left finger is the 'left median' and the value of the entry pointed to by our right finger is the 'right median'.

If we had an even number of entries, then the left and right median each correspond to distinct entries - if there was an odd number of entries then the left and right median correspond to one and the same entry.

At any rate, once we found the left and right median, it is clear that their mean is the true median. If we have an even number of entries, we have to calculate the mean of the two middle values anyway, and if there is an odd number of entries, taking the mean of two identical values results of course in that same value which does thus result in a correct value for the median.

GROUP_CONCAT: an ordered list of values

In the example, we use GROUP_CONCAT to generate a list of values in ascending order:

GROUP_CONCAT( -- list all values in ascending order
f.replacement_cost
ORDER BY f.replacement_cost
)

This gets us a string consisting of concatenated replacement_cost values in ascending order, separated by the default separator, which is a comma (',').

We use the same list in both the calculation of the left and the right median.

Note that the length of the concatenation result returned by GROUP_CONCAT() is limited. By default, it is as small as 1024 bytes. Personally I think this is way too small so I have it configured to be 64K by default. You can set the length at runtime too like this:

SET group_concat_max_len := 65535
You can specify larger values than 65535 too, and I suspect that the maximum packet size is the practical maximum:

SET group_concat_max_len := @@max_allowed_packet
To inspect the current value, you can do this:

SELECT @@group_concat_max_len

Getting the 'left' half of the list

Once we have the list of values, we can split it in two halves with little effort. We do this using the SUBSTRING_INDEX() function.

The SUBSTRING_INDEX() function processes a string argument and gets a substring based on the position of a particular occurrence of another substring.

In this case, the comma ',' is the substring that separates the values in our ordered list. But of all the commas in the list, which occurrence of the comma do we need to find?

Suppose our list contains 4 values. Then, these values are separated by three comma's:

values: '1,2,3,4'
commas: ^ ^ ^
1 2 3
If we want to divide this list in two equal halves, then the second comma is the divisor between the left and right halves of our list. With SUBSTRING_INDEX() this expression would get us the left half of the this list:

SUBSTRING_INDEX('1,2,3,4', ',', 2) -- the substring up to the 2nd occurrence of ','
and the result will be:

'1,2'
So now we have the first half of the list, and by definition, the last entry in that list, '2' is the left median of the original list '1,2,3,4'.

Now what if we would've had an odd number of entries in our list? Supose our list would've been like this:

values: '1,2,3'
commas: ^ ^
1 2
In this case too, we need the second comma to end up with a left substring that has the left median as last entry in the list (which also happens to be the proper median because this is a list with an odd number of entries).

As it turns out, we can conveniently generalize the SUBSTRING_INDEX expression like this:

SUBSTRING_INDEX(list, separator, CEILING(#entries/2))
In other words, if we divide the number of entries in our list by two, and then round to the nearest higher integer, this gives us the particular occurrence of the separator what we are looking for to halve our list as required. Of course, calculating the number of entries is simply a matter of using the COUNT() aggregate function.

Excising the 'left' median

To actually obtain the left median itself, we just need to excise the last value from the left half of the list of values. We do this by applying SUBSTRING_INDEX() again.

Again, we need to make a substring in terms of the occurrence of the comma that separates the values in our list. This time, we need to get the substring found directly after the last comma in the list. With SUBSTRING_INDEX() we can conveniently express this in the following manner:

SUBSTRING_INDEX(list, separator, -1)
This means: search the list from right to left and find the first occurrence of the separator. Return the substring that appears after the separator (that is, the substring appearing on the right hand of the separator).

Getting the 'right' median

The process to obtain the right median is a mirror of obtaining the left median: instead of obtaining the last value in the left half of the ordered list of values, we now need to obtain the first value of the right half of the list. This is actually as simple as reversing the sign of the occurrence argument in the SUBSTRING_INDEX() calls:

left half: SUBSTRING_INDEX(list, separator, CEILING(COUNT(*)/2))
right half: SUBSTRING_INDEX(list, separator, -CEILING(COUNT(*)/2))

last entry: SUBSTRING_INDEX(list, separator, -1)
first entry: SUBSTRING_INDEX(list, separator, 1)

A few remarks

I think that in many cases, this can be a fair method to calculate the median. The advantage of this method is that it is relatively fast because the query itself is relatively simple.

It would be interesting to see how this method behaves when handling millions of rows. Maybe I will run some benchmarks on that later on.

In the mean while, feel free to post your thoughts, suggstions or critique on this blog.

6 comments:

Anonymous said...

Roland,
Thanks for that, and your detailed explanation. I am an SQL Newbie and you have helped me very much.

Elegant coding.

Steve

--CELKO-- said...

Another Standard SQL trick:

SELECT AVG(x)
FROM (SELECT x,
ROW_NUMBER()OVER(ORDER BY x ASC) AS up,
ROW_NUMBER()OVER(ORDER BY x DESC) AS dn
FROM Foobar) AS X(x, up, dn)
WHERE up IN (dn, dn+1, dn-1);

This is easier to see with a number line:
1 2 3 4 5 6 8 9 = up
9 8 6 5 4 3 2 1 = dn

the middle subset is {4,5} and the average is 4.5 or the median.

Unknown said...

Roland:
I have been puzzling over how to bring a median calculation into some standard queries I have. Is there a way of using your methods or pure sql methods in a stored function or proceedure and bring the results into my other existing queries.

I have been using the udf median function for a while, but I am trying to move my database to a webserver that won't let me add user defined functions

Thanks.

rpbouman said...

Hi Josh!

Unfortunately, MySQL does not let you build aggregrate functions with the stored routine language. So I don't think it's possible to come up with a good solution for this :(

Anonymous said...

Your a monster!!!! Superb!!!!
TRILLION THANKS!!!!

Anonymous said...

Thanks for the code. It was a lifesaver. Dealing with large result sets, I found a function flavor to be much less unwieldy and simpler to performance tune.

-------------------------------

DELIMITER $$
CREATE FUNCTION `median`( commaDelimitedData varchar(255)) RETURNS decimal(17,3) DETERMINISTIC
BEGIN
declare eleCount int;

set eleCount = (length(commaDelimitedData) - length(replace(commaDelimitedData,',','')))+1;
if ((eleCount=1) and length(trim(commaDelimitedData))=0) then
set eleCount=0;
end if;

return (substring_index( -- left median: max value in lower half:
substring_index(
commaDelimitedData
, ','
, ceiling(eleCount/2) -- left half of the list
)
, ','
, -1 -- keep only the last value in list
)
+ substring_index( -- right median: min value in upper half:
substring_index(
commaDelimitedData
, ','
, -ceiling(eleCount/2) -- right half of the list
)
, ','
, 1 -- keep only the first value in list
)
) / 2; -- average of left and right medians
END

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...