## Wednesday, July 09, 2008

### A fast, single pass method to calculate the median in MySQL

After stepping off of the `GROUP_CONCAT()` solution for calculating quantiles I figured it would be nice to find a better way to calculate the median too.

### Solution

I previously wrote on how to calculate the median using `GROUP_CONCAT()`, but I think that this is a better way:
`SELECT     AVG(length) AS median                 -- take the average of left and right median,          MIN(length) AS left_median            -- ,          MAX(length) AS right_median           --,          @l          AS left_median_position   --,          @r          AS right_median_position  --FROM (     SELECT     @n, length                       -- @n is just here to facilitate debug     FROM       film     CROSS JOIN (                SELECT @n:=0                ,      @r := COUNT(*) DIV 2 + 1  -- right median or true median                ,      @l := COUNT(*) DIV 2      -- left median, or true medain                           + IF(                               COUNT(*) % 2      -- even or odd?                              , 1                 -- odd, need next value                             , 0                 -- even, need true left median                             )                FROM   film                ) ``WHERE      (@n:=@n+1)                                   -- row numberBETWEEN    @l AND @r                                    -- select two middle onesORDER BY   length                                       -- need to sort to get middle values) ```

Apart from not relying on setting the buffer size for `group_concat_max_len`, this solution is also faster than the `GROUP_CONCAT()` solution. For example, calculating the median `amount` from the `payment` table in the `sakila` database takes some 0.12 seconds using the `GROUP_CONCAT()` solution and about 0.03 seconds with this method.

### Explanation

At the heart of the solution are the user-defined variables:

• `@n`: the row number

• `@l`: the position of the 'left median' that is the row number with the highest value in the lower half of all rows

• `@r`: the position of the 'right median', that is the row number with the lowest value in the higher half of all rows

These are initialized in the inmost subquery:
`SELECT @n:=0,      @r := COUNT(*) DIV 2 + 1  -- right median or true median,      @l := COUNT(*) DIV 2      -- left median, or true medain           + IF(               COUNT(*) % 2      -- even or odd?              , 1                 -- odd, need next value             , 0                 -- even, need true left median             )FROM   film`

Note that this yields one row, initializing the row number `@n` to zero. We can calculate the position for the right median immediately by doing
`COUNT(*) DIV 2 + 1`

You see, first we divide the total number of rows by two. If there is an even number of rows, `COUNT(*) DIV 2` will give us the left median, and adding `1` is then by definition the right median. If `COUNT(*)` is an odd number, it still holds: the integer division rounds down to the nearest integer value, and adding `1` then gives us the position of the true median in that case.

The calculation of the left median is along the same lines:
`@l := COUNT(*) DIV 2      -- left median, or true medain    + IF(        COUNT(*) % 2      -- even or odd?       , 1                 -- odd, need next value      , 0                 -- even, need true left median      )`

We just have to take care that in case we do have an odd number of rows, we need to pick the true median position here too.

To do that we first find out if there is an odd number of rows using `COUNT(*) % 2`. This calculates the remainder of dividing the total number of rows by two. In case of an odd number of rows, the remainder is `1` which is considered to be `TRUE` by the `IF` function. In this case, `1` is added, effectively making `@l` equal to `@r` (that is, both hold the position of the true median). In case of an even number of rows, we add `0` as `COUNT(*) % 2` is already the desired position of the left median.

Now that we have these values, we can use `@n` to generate a rownumber, which we can compare against the calculated `@l` and `@r` values:
`WHERE      (@n:=@n+1)                                   -- row numberBETWEEN    @l AND @r                                    -- select two middle onesORDER BY   length                                       -- need to sort to get middle values`

And this will give us at most two rows in case of an even number of rows, and one row for an odd number of rows.

The final step is to average the values from the rows, which is required to calculate the median in case of an even number of rows:
`SELECT     AVG(length) AS median                 -- take the average of left and right median...FROM (...) ```

Anonymous said...

Creative.

But very fragile.

It depends a lot on how MySQL executes the query and may break even in the next minor version.

First - assigning and using user variables in the same query - it's a big no-no.

Second - using ORDER BY in the derived table. This makes no sense, because a derived table is a still table, and a table is an unordered set of rows. MySQL has full freedom to ignore your ORDER BY clause (although it doesn't do it at the moment). SQL standard doesn't allow ORDER BY to appear at that place at all.

rpbouman said...

Hi Sergei,

In the mean while I have seen a number of cases where the user variables don't count as I intended, confirming your objections.

For example, a similar query will work when the metric is indexed (and that index is used) but will break otherwise.

Concerning the ORDER BY in the derived table: yeah, I agree. It does make sense for this particular trick, but indeed there is no reason why an optimizer wouldn't ignore it.

That said, is there anything general you can say about user-defined variables an their order of evaluation? I am particularly interested in hearing which things are safe.

kind regards, and thanks again

Roland

Anonymous said...

From the manual:
The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed.

rpbouman said...

Thank Justin,

yeah, I knew this was in the doc.

I mean, yes it is not good practice to ignore this warning, but I have seen the initialization like done in this way work stable.

The real killer for me though is the order of processing. The presence of an index can make it or break it, and that is just too much hassle to take care of.

So, I have to rethink this I guess...

But thanks for the doc ref - it's good to have it in this thread.

### DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...