(Please do post your comments to this blog to correct me on that should I be wrong so I have a chance to rectify.)
The method I'm describing is a one-pass, pure SQL method. It does not require subqueries, cursors or user variables. However, it does rely on the MySQL specific functions
GROUP_CONCAT()
and SUBSTRING_INDEX()
I'll be maintaining a snippet for this method at MySQL Forge.
If you want to know what the median is, and how my snippet works, read on.
Some background
Like the mean and the mode, the median is an important metric to characterize the distribution of values in a collection. If we have a ordered collection of (numerical) values, the median is the value for which the number of entries that has a value that is higher than the median is exactly equal to the number of entries that has a value that is lower than the median. If there is an odd number of entries in the collection, the value of the median corresponds to the value of the entry that lies exactly in the middle of the list. If there is an even number of entries, the median is calculated as the mean of the two middle values.
MySQL offers a number of aggregate functions. Unfortunately, MySQL does not offer a function to calculate the median.
Even though MySQL does not support a
MEDIAN()
function natively, it is still possible to calculate it. You can:- install one of the numerous UDF's floating around on the internet
- Use pure SQL like suggested in Chapter 23 of Joe Celko's SQL for Smarties
The snippet
Here's a snippet that shows how to calculate the median replacement cost for a film in the sakila sample database:(For the latest version, refer to MySQL Forge)
select
(
substring_index( -- left median: max value in lower half:
substring_index(
group_concat( -- list all values in ascending order
f.replacement_cost
order by f.replacement_cost
)
, ','
, ceiling(count(*)/2) -- left half of the list
)
, ','
, -1 -- keep only the last value in list
)
+ substring_index( -- right median: min value in upper half:
substring_index(
group_concat( -- list all values in ascending order
f.replacement_cost
order by f.replacement_cost
)
, ','
, -ceiling(count(*)/2) -- right half of the list
)
, ','
, 1 -- keep only the first value in list
)
) / 2 -- average of left and right medians
as median
from sakila.film f;
So, how does this snippet work? In the remainder of this post, I'll explain the inner workings of this method in a top-down fashion.
The mean of the left and right median
The method I'm describing always takes the mean of the 'left' and 'right' median.(Note that the usage of
select
(
left-median(f.replacement_cost)
+ right-median(f.replacement_cost)
) / 2 -- average of left and right medians
as median
from sakila.film f;
left-median()
and right-median()
is just an explanation of the structure - in reality there are not two distinct functions by that name)The terms 'left median' and 'right median' are not common so they need an explanation.
Let's visualize the process to determine the median. We can do this by imagining that we have an ordered list of values and that we point our left index finger to the lowest value in the list and our right index finger to the highest value in the list.
Now, we look at our fingers. If there is more than one entry between our right and left finger, we move or left finger one entry to the right and our right finger one entry to the left, and we keep doing that until there are no more entries between our left and right finger. Once we're there, the value of the entry that is pointed to by our left finger is the 'left median' and the value of the entry pointed to by our right finger is the 'right median'.
If we had an even number of entries, then the left and right median each correspond to distinct entries - if there was an odd number of entries then the left and right median correspond to one and the same entry.
At any rate, once we found the left and right median, it is clear that their mean is the true median. If we have an even number of entries, we have to calculate the mean of the two middle values anyway, and if there is an odd number of entries, taking the mean of two identical values results of course in that same value which does thus result in a correct value for the median.
GROUP_CONCAT
: an ordered list of values
In the example, we use GROUP_CONCAT
to generate a list of values in ascending order:
GROUP_CONCAT( -- list all values in ascending order
f.replacement_cost
ORDER BY f.replacement_cost
)
This gets us a string consisting of concatenated
replacement_cost
values in ascending order, separated by the default separator, which is a comma (','
).We use the same list in both the calculation of the left and the right median.
Note that the length of the concatenation result returned by
GROUP_CONCAT()
is limited. By default, it is as small as 1024 bytes. Personally I think this is way too small so I have it configured to be 64K by default. You can set the length at runtime too like this:You can specify larger values than 65535 too, and I suspect that the maximum packet size is the practical maximum:
SET group_concat_max_len := 65535
To inspect the current value, you can do this:
SET group_concat_max_len := @@max_allowed_packet
SELECT @@group_concat_max_len
Getting the 'left' half of the list
Once we have the list of values, we can split it in two halves with little effort. We do this using theSUBSTRING_INDEX()
function.The
SUBSTRING_INDEX()
function processes a string argument and gets a substring based on the position of a particular occurrence of another substring. In this case, the comma
','
is the substring that separates the values in our ordered list. But of all the commas in the list, which occurrence of the comma do we need to find? Suppose our list contains 4 values. Then, these values are separated by three comma's:
If we want to divide this list in two equal halves, then the second comma is the divisor between the left and right halves of our list. With
values: '1,2,3,4'
commas: ^ ^ ^
1 2 3
SUBSTRING_INDEX()
this expression would get us the left half of the this list:and the result will be:
SUBSTRING_INDEX('1,2,3,4', ',', 2) -- the substring up to the 2nd occurrence of ','
So now we have the first half of the list, and by definition, the last entry in that list,
'1,2'
'2'
is the left median of the original list '1,2,3,4'
. Now what if we would've had an odd number of entries in our list? Supose our list would've been like this:
In this case too, we need the second comma to end up with a left substring that has the left median as last entry in the list (which also happens to be the proper median because this is a list with an odd number of entries).
values: '1,2,3'
commas: ^ ^
1 2
As it turns out, we can conveniently generalize the
SUBSTRING_INDEX
expression like this:In other words, if we divide the number of entries in our list by two, and then round to the nearest higher integer, this gives us the particular occurrence of the separator what we are looking for to halve our list as required. Of course, calculating the number of entries is simply a matter of using the
SUBSTRING_INDEX(list, separator, CEILING(#entries/2))
COUNT()
aggregate function.Excising the 'left' median
To actually obtain the left median itself, we just need to excise the last value from the left half of the list of values. We do this by applyingSUBSTRING_INDEX()
again. Again, we need to make a substring in terms of the occurrence of the comma that separates the values in our list. This time, we need to get the substring found directly after the last comma in the list. With
SUBSTRING_INDEX()
we can conveniently express this in the following manner:This means: search the list from right to left and find the first occurrence of the separator. Return the substring that appears after the separator (that is, the substring appearing on the right hand of the separator).
SUBSTRING_INDEX(list, separator, -1)
Getting the 'right' median
The process to obtain the right median is a mirror of obtaining the left median: instead of obtaining the last value in the left half of the ordered list of values, we now need to obtain the first value of the right half of the list. This is actually as simple as reversing the sign of the occurrence argument in theSUBSTRING_INDEX()
calls:
left half: SUBSTRING_INDEX(list, separator, CEILING(COUNT(*)/2))
right half: SUBSTRING_INDEX(list, separator, -CEILING(COUNT(*)/2))
last entry: SUBSTRING_INDEX(list, separator, -1)
first entry: SUBSTRING_INDEX(list, separator, 1)
A few remarks
I think that in many cases, this can be a fair method to calculate the median. The advantage of this method is that it is relatively fast because the query itself is relatively simple.It would be interesting to see how this method behaves when handling millions of rows. Maybe I will run some benchmarks on that later on.
In the mean while, feel free to post your thoughts, suggstions or critique on this blog.