Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Tuesday, September 15, 2009

MySQL: Another Ranking trick

I just read SQL: Ranking without self join, in which Shlomi Noach shares a nice MySQL-specific trick based on user-defined variables to compute rankings.

Shlomi's trick reminds me somewhat of the trick I came across little over a year ago to caclulate percentiles. At that time, several people pointed out to me too that using user-defined variables in this way can be unreliable.

The problem with user-defined variables

So what is the problem exaclty? Well, whenever a query assigns to a variable, and that same variable is read in another part of the query, you're on thin ice. That's because the result of the read is likely to differ depending on whether the assignment took place before or after the read. Not surprising when you think about it - the whole point of variable assignment is to change its value, which by definition causes a different result when subsequently reading the variable (unless you assigned the already assigned value of course, duh...).

Now watch that previous statement clearly - the word subsequently is all-important.

See, that's the problem. The semantics of a SQL SELECT statement is to obtain a (tabular) resultset - not specifying an algorithm to construct that resultset. It is the job of the RDBMS to figure out an algorithm and thus, you can't be sure in what order individual expressions (including variable evaluation and assignment) are executed.

The MySQL manual states it like this:

The order of evaluation for user variables is undefined and may change based on the elements contained within a given query. In SELECT @a, @a := @a+1 ..., you might think that MySQL will evaluate @a first and then do an assignment second, but changing the query (for example, by adding a GROUP BY, HAVING, or ORDER BY clause) may change the order of evaluation.

The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed.

So what good are these variables anyway?

On the one hand, this looks really lame: can't MySQL just figure out the correct order of doing the calulations? Well, that is one way of looking at it. But there is an equally valid reason not to do that. If the calculations would influence execution order, it would drastically lessen the number of ways that are available to optimize the statement.

This begs the question: Why is it possible at all to assign values to the user-defined variables? The answer is quite simple: you can use it to pass values between statetments. My hunch is the variables were created in the olden days to overcome some limitations resulting from the lack of support for subqueries. Having variables at least enables you to execute a query and assign the result temporarily for use in a subsequent statement. For example, to find the student with the highest score, you can do:

mysql> select @score:=max(score) from score;
+--------------------+
| @score:=max(score) |
+--------------------+
| 97 |
+--------------------+
1 row in set (0.00 sec)

mysql> select * from score where score = @score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
| 2 | Gromit | 97 |
+----------+--------------+-------+
1 row in set (0.03 sec)
There is nothing wrong with this approach - problems start arising only when reading and writing the same variable in one and the same statement.

Another way - serializing the set with GROUP_CONCAT


Anyway, the percentile post I just linked to contains another solution for that problem that relies on GROUP_CONCAT. It turns out we can use the same trick here.

(Some people may like to point out that using GROUP_CONCAT is not without issues either, because it may truncate the list in case the pre-assigned string buffer is not large enough. I wrote about dealing with that limitation in several places and I remain recommending to set the group_concat_max_len server variable to the value set for the max_packet_size server variable like so:
SET @@group_concat_max_len := @@max_allowed_packet;
)

The best way to understand how it works is to think of the problem in a few steps. First, we make an ordered list of all the values we want to rank. We can do this with GROUP_CONCAT like this:

mysql> SELECT GROUP_CONCAT(
-> DISTINCT score
-> ORDER BY score DESC
-> ) AS scores
-> FROM score
-> ;
+-------------+
| scores |
+-------------+
| 97,95,92,85 |
+-------------+
1 row in set (0.00 sec)

Now that we have this list, we can use the FIND_IN_SET function to look up the position of any particlar value contained in the list. Because the list is ordered in descending order (due to the ORDER BY ... DESC), and contains only unique values (due to the DISTINCT), this position is in fact the rank number. For example, if we want to know the rank of all scores with the value 92, we can do:

mysql> SELECT FIND_IN_SET(92, '97,95,92,85')
+--------------------------------+
| FIND_IN_SET(92, '97,95,92,85') |
+--------------------------------+
| 3 |
+--------------------------------+
1 row in set (0.00 sec)
So, the answer is 3 because 92 is the third entry in the list.

(If you're wondering how it's possible that we can pass the integer 92 as first argument for FIND_IN_SET: the function expects string arguments, and automatically converts whichever non-string typed value we pass to a string. In the case of the integer 92, it is silently converted to the string '92')

Of course, we are't really interested in looking up ranks for individual numbers one at a time; rather, we'd like to combine this with a query on the scores table that does it for us. Likewise, we don't really want to manually supply the list of values as a string constant, we want to substitute that with the query we wrote to generate that list.
So, we get:

mysql> SELECT score_id, student_name, score
-> , FIND_IN_SET(
-> score
-> , (SELECT GROUP_CONCAT(
-> DISTINCT score
-> ORDER BY score DESC
-> )
-> FROM score)
-> ) as rank
-> FROM score;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
| 1 | Wallace | 95 | 2 |
| 2 | Gromit | 97 | 1 |
| 3 | Shaun | 85 | 4 |
| 4 | McGraw | 92 | 3 |
| 5 | Preston | 92 | 3 |
+----------+--------------+-------+------+
5 rows in set (0.00 sec)

Alternatively, if you think that subqueries are for the devil, you can rewrite this to a CROSS JOIN like so:

SELECT score_id, student_name, score
, FIND_IN_SET(
score
, scores
) AS rank
FROM score
CROSS JOIN (SELECT GROUP_CONCAT(
DISTINCT score
ORDER BY score DESC
) AS scores
FROM score) scores

Now that we have a solutions, lets see how it compares to Shlomi's original method. To do this, I am using the payment table from the sakila sample database.

First, Shlomi's method:

mysql> SELECT payment_id
-> , amount
-> , @prev := @curr
-> , @curr := amount
-> , @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
-> FROM sakila.payment
-> , (SELECT @curr := null, @prev := null, @rank := 0) sel1
-> ORDER BY amount DESC;
+------------+--------+----------------+-----------------+------+
| payment_id | amount | @prev := @curr | @curr := amount | rank |
+------------+--------+----------------+-----------------+------+
| 342 | 11.99 | NULL | 11.99 | 1 |
. ... . ..... . ..... . ..... . . .
| 15456 | 0.00 | 0.00 | 0.00 | 19 |
+------------+--------+----------------+-----------------+------+
16049 rows in set (0.09 sec)

Wow! It sure is fast :) Now, the GROUP_CONCAT solution, using a subquery:

mysql> SELECT payment_id, amount
-> , FIND_IN_SET(
-> amount
-> , (SELECT GROUP_CONCAT(
-> DISTINCT amount
-> ORDER BY amount DESC
-> )
-> FROM sakila.payment)
-> ) as rank
-> FROM sakila.payment
+------------+--------+------+
| payment_id | amount | rank |
+------------+--------+------+
| 1 | 2.99 | 15 |
. . . .... . .. .
| 16049 | 2.99 | 15 |
+------------+--------+------+
16049 rows in set (0.14 sec)


(In case you're wondering why the results are different, this is because the result set for Shlomi's solution is necessarily ordered by ascending rank (or descending amount - same difference. To obtain the identical result, you need to add an ORDER BY clause to my query. But since the point was to calculate the ranks, I didn't bother. Of course, adding an ORDER BY could slow things down even more.)

Quite a bit slower, bummer. But at leastt we can't run into nasties with the user variables anymore. For this data set, I get about the same performance with the CROSS JOIN, but I should warn that I did not do a real benchmark.

Conclusion

Don't fall into the trap of reading and writing the same user-defined variable in the same statement. Although it seems like a great device and can give you very good performance, you cannot really control the order of reads and writes. Even if you can, you must check it again whenever you have reason to believe the query will be solved differently by the server. This is of course the case whenever you upgrade the server. But also seemingly harmless changes like adding an index to a table may change the order of execution.

Almost all cases where people want to read and write to the same user variables within the same query, they are dealing with a kind of serialization problem. They are trying to maintain state in a variable in order to use it across rows. In many cases, the right way to do that is to use a self-join. But this may not always be feasible, as pointed out in Shlomi's original post. For example, rewriting the payment rank query using a self join is not going to make you happy.

Often, there is a way out. You can use GROUP_CONCAT to serialize a set of rows. Granted, you need at least one pass for that, and another one to do something useful with the result, but this still a lot better than dealing with semi-cartesian self join issues.

42 comments:

datacharmer said...

Way to go Roland! Keep them coming!

water outbreaks said...

Oooh! That's the problem I had with user variables. I had written a query that used user variables to do a version of limit by, but for a calculated value rather than number of rows. It seemed to work great. I entrhusastically benchmarked it against a big table and it stopped working correctly. This is why.

Shlomi N. said...

Roland,

Great work!
Indeed, after being pointed out at the issue with user variables, I realized I was quite aware of it a few weeks ago, doing my SQL charting code; which made me write a lot of sub-sub-sub queries to manage the order.

However, I've noticed a place for improvement in your solution: please note that you re-evaluate the GROUP_CONCAT per row; this can be evaluated just once for the entire query:
[pre]
SELECT
payment_id,
amount,
FIND_IN_SET(amount, @values_set) as rank
FROM
sakila.payment,
(SELECT @values_set := GROUP_CONCAT(
DISTINCT amount
ORDER BY amount DESC)
FROM sakila.payment) AS sel1
;
[/pre]

I've found this to reduce query time from approx. 0.10-0.11 seconds to 0.5-0.8 seconds, which is up to 50% time savings.
I'm surprised (for the better) that FIND_IN_SET works so quickly. I'll remember that.

Regards,
Shlomi

Shlomi N. said...

Updating my previous comment, it's equal to the CROSS JOIN solution you've offered, which is superior in processing efforts to the subquery solution (the latter being recalculated over and over).

Regards,
Shlomi

Roland Bouman said...

Hi all! thanks for commenting.

@datacharmer: thanks Giuseppe!

@water outbreaks: you aren't the first one that gets bitten. It is actually pretty counterintuitive, and I wish we'd get a warning from MySQL. Anyway, now you know why.

@Shlomi: thanks! Well, thank you for the inspiration. It turns out once can hack up quite a good deal of these analyticish queries.

Anyway, nice you found out that the JOIN is indeed faster. Perhaps it will be the same in 5.4.

I would like to point out one thing concerning your solution. I know that in this case, it is probably safe to do the assignement and the reading in the same query. But if you really want to use the user variable, and be on the safe side, you really should use two statements.

Now, the nice thing is, you do not have to suffer the cost of an extra rountrip, because you can batch the statements. It works like this:

DELIMITER //

SET @amounts := (
SELECT GROUP_CONCAT(
DISTINCT amount
ORDER BY amount DESC) amount
FROM sakila.payment
);

SELECT payment_id, amount
, FIND_IN_SET(amount, @amounts) as rank
FROM sakila.payment;
//

DELIMITER ;

I wrote about that earlier here http://rpbouman.blogspot.com/2008/02/most-misunderstood-character-in-mysqls.html

Cheers,
Roland.

Shlomi N. said...

Roland,

The JOIN is faster because the subquery needs to be recalculated over and over. As I understand it, this is perfectly correct; so I'm not sure 5.4 or 6.0 should or will change this. I may be wrong on this.

Shlomi

Roland Bouman said...

Shlomi,

the subquery is scalar and in no way dependent upon the outer query. The optimizer does not have to be particularly smart to deduce that the result simply cannot change. It is kinda like a deterministic function with static arguments. It should be evaluated just once, and the cached result should be recycled.

I thought MySQL 5.4 was introducing more clever subquery handling, and I was hoping this was on the menu. I could be wrong about that, but I am pretty sure there is no logical reason why the subquery should evaluate the query for each row of the outer query.

Log Buffer said...

"Roland Bouman responded with another ranking trick [...]"

Log Buffer #162

Anonymous said...

Why doesn't MySQL just implement the SQL2003 standard and support analytic functions?

Roland Bouman said...

Hi Anonymous,

As I am sure you're aware, questions of the form

"Why doens't company X add feature Y to product Z"

hardly ever get a satisfying answer, less so even from individuals that are not formally bound to said company, feature and product.

I hope you find this answer satisfactory.

kind regards,

Roland

Sean V said...

Roland:
How would you go about ranking 5 different fields simultaneously? More specifically, I want to rank sales by rep across 5 different periods (last 60 days, last 30, last 14, last 7, and last 1). Can this method be used to do such a thing efficiently?

-- Sean

Roland Bouman said...

Sean, please post a few lines of table data, and a few lines of desired output. I am having trouble understanding what you mean exactly, but perhaps I can give it a go if you give me a concrete example. TIA, roland.

Sean V said...

Hi Roland:
Here are a few lines from the table. The fields of interest are simply user_id, amount, and a datestamp.

[pre]
user_id amount created_at
23546670 6.24 2010-01-01 00:00:08
20800766 10.71 2010-01-01 00:00:08
22159897 3.68 2010-01-01 00:00:16
9509898 9.95 2010-01-01 00:00:16
22159897 2.29 2010-01-01 00:00:16
11975765 6.27 2010-01-01 00:00:16
6860935 3.76 2010-01-01 00:00:17
24078245 20.00 2010-01-01 00:00:42
24092397 0.37 2010-01-01 00:00:48
15674071 43.65 2010-01-01 00:00:48
24092397 67.78 2010-01-01 00:00:48
20251912 0.98 2010-01-01 00:00:55
20251912 2.00 2010-01-01 00:00:55
20822749 1.81 2010-01-01 00:00:55
24092397 69.00 2010-01-01 00:00:58
9947968 3.60 2010-01-01 00:00:59
9947968 23.79 2010-01-01 00:00:59
11568829 17.63 2010-01-01 00:00:59
14489884 1.70 2010-01-01 00:01:07
23436345 1.11 2010-01-01 00:01:07
14489884 0.28 2010-01-01 00:01:07
19172178 6.00 2010-01-01 00:01:21
12333245 1.40 2010-01-01 00:01:55
17530334 2.00 2010-01-01 00:01:55
19023884 14.00 2010-01-01 00:02:43
[/pre]

The idea is to see which reps rank highest in sales ("amount") over different time windows. If a rep with a high 60 day rank has a low 14 day rank, they are "slipping", whereas a low 60 day rank and a high 14 day rank points to a "rising star".

I am currently calculating the sales amounts for each window in a possibly ham-fisted way, like so:

[pre]
SELECT user_id,
SUM(CASE WHEN created_at between date_sub(curdate(),interval 60 day) and curdate() THEN 1 ELSE 0 END * amount) AS amount_L60D,
SUM(CASE WHEN created_at between date_sub(curdate(),interval 30 day) and curdate() THEN 1 ELSE 0 END * amount) AS amount_L30D,
SUM(CASE WHEN created_at between date_sub(curdate(),interval 14 day) and curdate() THEN 1 ELSE 0 END * amount) AS amount_L14D,
SUM(CASE WHEN created_at between date_sub(curdate(),interval 7 day) and curdate() THEN 1 ELSE 0 END * amount) AS amount_L7D,
SUM(CASE WHEN created_at between date_sub(curdate(),interval 1 day) and curdate() THEN 1 ELSE 0 END * amount) AS amount_L1D
FROM transactions
WHERE created_at between date_sub(curdate(),interval 60 day) and curdate()
GROUP BY user_id
[/pre]

I can stuff the result into a temp table and then run your ranking code (iteratively) over it, but would love a perspective on a better way to do it. Analytical queries like this seem to be your power alley, after all.

The ultimate output should look something like this:

[pre]user_id amount_L60D amount_L30D amount_L14D amount_L7D amount_L1D rank_L60D rank_L30D rank_L14D rank_L7D rank_L1D
13108175 $17,311.79 $10,526.27 $4,925.22 $2,374.15 $0.00 1 1 1 2 1351
12448416 $16,092.72 $8,835.80 $4,104.98 $1,686.23 $459.95 2 2 2 10 2
20709796 $15,546.22 $5,253.88 $2,732.97 $1,058.96 $112.43 3 12 9 31 90
18535661 $15,223.41 $6,041.95 $2,715.30 $1,566.04 $18.68 4 6 10 11 592
12303435 $13,874.91 $7,537.55 $3,443.29 $1,810.60 $425.32 5 3 6 7 3
7265582 $11,606.16 $4,948.26 $1,964.96 $1,421.00 $90.94 6 19 30 14 136

[/pre]

Thanks for having a look!

Keymann said...

I made my own ranking system and i did not have to store the ranks in the database:
http://stackoverflow.com/questions/2941898/building-ranking-system-with-data-from-the-database/2944786#2944786

Anonymous said...

So this is great and totally helpful.

However, my team wants the Rank to work more like it does in Golf or the Olympics, where you actually skip rank values for people who are tied. In your version above, two folks are tied with Rank=3 and another person follows with Rank=4.
What we need are for those 2 people to be tied with Rank=3 and the next person would have Rank=5. (There would be no Rank = 4).
Make sense?
I'm thinking the only way to pull this off might be those user variables...
Any thoughts?

Roland Bouman said...

Hi Anonymous,

yes - good point!
Fortunately, this is very easy to do also with the GROUP_CONCAT approach - instead of using

GROUP_CONCAT(DISTINCT col ORDER BY col)

use


GROUP_CONCAT(col ORDER BY col)

In other words, omit the DISTINCT keyword.

Try it, it just works.

kind regards,

Roland.

Anonymous said...

Thanks again.

This is working as you describe, however I ran into a minor problem to be aware of.

GROUP_CONCAT by default cuts off at 1024 characters, so if you have a lot of data it doesn't work. The lower ranked values will return "0" if they are not found.

You can change the cutoff length using:
SET [GLOBAL | SESSION] group_concat_max_len = val;

Also here: http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html

Roland Bouman said...

@Anonymous,

Yes, I am well aware of the fact that GROUP_CONCAT has a fixed length buffer. In fact, I *always* mention that whenever I recommend GROUP_CONCAT, and this article is no exception.

If you'll re-read this article, you'll see it mentioned, and you'll also learn my solution to the problem "yes I can set the length, but what length should I set it to?"

Anonymous said...

This is good but what if you are trying to rank from a formula!
e.g. (SUM(late) / SUM(deliveries)

Roland Bouman said...

Anonymous,

if the formula does not rely on any aggregates, you can simply ORDER BY in the GROUP_CONCAT on that formula.

Before I can answer the question with your particular formula, I would need to see your entire query as you would write it without any ranking.

regards, Roland.

Anonymous said...

My solution outperfom all of yours : I'm using a variables to compare with the previous score in the dataset and than increase the rank or not (in case of same score)

SSET @rank=0;
SET @previous_score=0;
SELECT
CASE
WHEN @previous_score<>score THEN @rank:=@rank+1 END AS temp1,
CASE
WHEN @previous_score<>score THEN @previous_score:=score
ELSE @previous_score:=score END AS temp2,

@rank as rank, id_user, score
FROM users ORDER BY score DESC;

I'm pleased to share this with you.

Vince

Roland Bouman said...

Hi Vince,

thanks for sharing this. If you read the first part of the article, the whole point of this article was to develop a method that does not rely on user-defined variables.

I am well aware of the performance gains that can be achieved if you do use user-defined variables, and I have written several articles on this blog that do use this device. However, as I outlined in this article, and as is also documented in the MySQL Reference Manual, user-defined variables are not reliable if you read and assign them within the same statement.

I hope that clears it up.

kind regards,

Roland.

Anonymous said...

Thank you Roland for your answer.
My bad, I didn't read the hole article...
I did some testing and the results I get are valid.
Do you think that this method isn't reliable because MySQL documentation state it ? Or do you have a proof that it is really not reliable?

Cheers

Vince

Roland Bouman said...

Vince,

the problem with user-defined variables is that there is no way to explicitly control the order in which the expressions that change the values of the user-defined variables are executed.

In your particular example, you define an explicit ORDER BY clause, and you may think this takes care of the problem. However, the ORDER BY clause only requires that the result is returned in order, not that the actual operations are executed in order too. If you add/drop indexes that change the access pattern, you may find unexpected results.

Anonymous said...

wow great work... i just tried this and it's like a magic... hehehe i'll be using it for my tabulation... thank you very much for this trick...

Paul Fenley said...

Along the same lines... Here's my problem you may be able to help me with...

The premise [I manage a web site http://SaveOregonWrestling.org, I'm adding a page (Fight Ladder) to list donors who specify their donation for a specific weight class... NCAA wrestling has 10 weight classes... - I want the top 3 donors listed in each weight class, ranked by their donation amount.] I am trying to implement a MySql query on a single table:

table = FightLadder
columns = ID, PublishName, WeightClass, Total

Below is the current PHP code I'm using which works as long as there's only 30 rows in the table. The moment I add row 31, the results are wack. e.g., a donor who donated in weight class 149 gets moved up to weight class 141 in the list!! : ( I also get a stream of "MySql WARNING error messages about moving to the next row", unless I add the @ in front of my var code, e.g. $champ1a = @mysql_result($result, 0,0);

[code]

$query="SELECT c.PublishName, c.State, c.Total, c.WeightClass, d.ranknum
FROM FightLadder AS c
INNER JOIN (
SELECT a.ID, COUNT(*) AS ranknum
FROM FightLadder AS a
INNER JOIN FightLadder AS b ON (a.WeightClass = b.WeightClass) AND (a.Total <= b.Total)
GROUP BY a.ID
HAVING COUNT(*) <= 3
) AS d ON (c.ID = d.ID)
ORDER BY c.WeightClass, c.Total DESC, d.ranknum";

$result = mysql_query($query)

[/code]

Here's where I got the above code (for SQL Server - Mine's MySql)

http://rickosborne.org/blog/index.php/2008/01/07/sql-getting-top-n-rows-for-a-grouped-query/

Can you help me with the correct code to produce a [$result = mysql_query($query)] where only the top 3 donors in each weight class are listed?

Thanks in advance for your valuable time.

-Paul, IT Volunteer
Save Oregon Wrestling Foundation
800-553-0135 ext. 105

Anonymous said...

Hi,
I'm quite new in MySQL but I have to create a ranking based on a specific SELECT using joined tables. My original select is this:

SELECT
l.cod_loja,
l.nome_loj,
SUM(c.vlrbase) AS soma
FROM
tab_comissao c INNER JOIN tab_prot p ON (c.ade=p.ade)
INNER JOIN tab_lojas l ON (l.cod_loja=p.cod_loja)
WHERE c.ade = p.ade AND p.cod_loja = l.cod_loja
AND c.datalanc>='2011-01-04'
AND c.datalanc<='2011-02-02'
GROUP BY l.cod_loja
ORDER BY soma DESC

Where gives me this result:

3 store-3 345874.54
7 store-7 213022.86
6 store-6 209934.94
...
...
...
57 store-57 2827.18

So, how I could to this in order to show me the correct ranking?

Thanks in advance,

Kleyber Derick

Robert said...

Good day sir, I have a problem regarding my ranking script, I already apply your tutorial but unfortunately if there is a tie with the average it will look to other columns to compare for tie breaker, Thank you very much in advance

Rank Name ave C1 C2
1 Mark Gil Guevarra 26.80 07.95 08.25 <-winner
2 Jess Menes 24.25 07.35 07.50 <-winner
3 Jef Llares 24.15 07.20 07.20 <-this will be change to rank 4
3 Noriel Arguelles 24.15 07.20 07.30 <-winner
4 James Bandiez 21.40 07.05 06.45 <-and this to rank 5
4 RObert Tribiana 21.40 07.05 05.50 <-and this to rank 5
5 Julius Kahugan 17.50 05.85 05.25 <-and this to rank 6

Anonymous said...

Does anyone know how I'd make this an actual update statement. I want to take the rand-id and insert that value to my eventrank filed.

SELECT event, totallength
, FIND_IN_SET(
totallength
, (SELECT GROUP_CONCAT(
DISTINCT totallength
ORDER BY totallength DESC
)
FROM eresults)
) as rank
FROM eresults where event='Med Ball' and status !='DNC';

The above gives me what I'd expect. I've tried the below with no luck.

UPDATE eresults JOIN(
SELECT event, totallength
, FIND_IN_SET(
totallength
, (SELECT GROUP_CONCAT(
DISTINCT totallength
ORDER BY totallength DESC
)
FROM eresults)
) as rank
FROM eresults where event='Med Ball' and status !='DNC')
ranks ON (ranks.id = eresults.id)
SET eventrank = ranks.rank;

Anonymous said...

I really liked the article, and the very cool blog

Keith said...

I implemented this strategy for ranking today and it works great! Unfortunately, once I tried it with our Live dataset I noticed some warnings during execution. As mentioned by yourself and Shlomi there is a memory limit, and I seem to be hitting that limit even after setting group_concat_max_len to max_allowed_packets, which is 16M.

Warning | 1260 | 9 line(s) were cut by GROUP_CONCAT()

It makes sense this would happen because we are creating a giant string in memory that is very dependent on the size of the dataset. I'm afraid this solution may not be very scalable :/ As a gauge, our dataset was a little over 250,000.

The solution we ended up with uses a rownum technique. The query takes 1.5-2 seconds on our dataset of 250,000.

insert into player_stat_rankings SELECT t.player_id, t.score, @rownum:=@rownum+1 AS rank FROM (SELECT @rownum:=0) r, player_stat t order by score desc;

Roland Bouman said...

@Keith, yes, when you're in a situation where you can use this trick, then that is the fastest solution. But you always have to be aware of this documented limitation:

"
As a general rule, you should never assign a value to a user variable and read the value within the same statement. You might get the results you expect, but this is not guaranteed. The order of evaluation for expressions involving user variables is undefined and may change based on the elements contained within a given statement; in addition, this order is not guaranteed to be the same between releases of the MySQL Server. In SELECT @a, @a:=@a+1, ..., you might think that MySQL will evaluate @a first and then do an assignment second. However, changing the statement (for example, by adding a GROUP BY, HAVING, or ORDER BY clause) may cause MySQL to select an execution plan with a different order of evaluation.
"

http://dev.mysql.com/doc/refman/5.5/en/user-variables.html

In my experience, the ranking result can become unstable after adding an index, joining to other tables, adding an order by or adding a specific group by clause. Technically it's possible that a server upgrade could change the results as well.

Keith said...

Right, which is the point of your post. I also realized this weekend that the rownum solution does not give the same ranks for the same scores :/

Roland Bouman said...

Keith,

maybe you can still use the rownum trick if you force execution order. For instance you could try:

SELECT @rank:=@rank+1, ..columns...
FROM (
... original query incl. ORDER BY clause...
)
INNER JOIN (SELECT @rank:=0)

Keith said...

I cannot get the INNER JOIN to work as it tells me I have a mysterious syntax error at the INNER JOIN. I can, however, set the @rank variable before the ranking query. Any issues with doing it this way?

SET @rank:=0;
SELECT @rank:=@rank+1, ...

Roland Bouman said...

Keith,

initializing the var before the actual query is fine too - it shouldn't matter for the result. the join is just a trick to make it self-contained.

Keith said...

Thanks for a good discussion Roland :)

Roland Bouman said...

No problem Keith. HtH :)

Anonymous said...

I just learned a lot from your article. Thank you.

Andrea

Sylvain Goumy said...

Hi,

Since this article is from 2009, I wonder wether mysql has now evolved on one or the other ranking technic?

I mean, is there a scalable and reliable way to get rank number now?

Best regards

Roland Bouman said...

Sylvain,

no. MySQL 5.6 does not have window functions or any other technique that I know of that would make this easier or faster.

Josh said...

I'm trying to wrap my head around this. I'm not sure if I'm misunderstanding it, or a few different problems are getting mixed up. I think the problem that the MySQL docs refer to is not the same as the problem that you run into with ranking.

@a:=@a+1 is a single statement and not vulnerable to execution order changes. However, if you have that, and then do an if() comparison, you might not get what you expect, and that is what the MySQL doc is referring to.

The problem that you run into when ranking, is that the field selection (SELECT @a:=@a+1) does not need to happen row-by-row after it's been ordered. In one example I have, if I order a specific column it works correctly, but if I try to use a function it fails (probably doing the calculation before the ordering).

As Roland points out, if you use a subquery for the order statement, and do the count in an outer query, I think you should be safe. So far I haven't been able to see a problem, but I would welcome further testing or info about what might mess it up.