Locations of visitors to this page Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Friday, March 19, 2010

Greatest N per group: top 3 with GROUP_CONCAT()

In my opinion, one of the best things that happened to Planet MySQL lately, is Explain Extended, a blog by Alex Bolenok (also known as Quassnoi on Stackoverflow).

I never had the pleasure of meeting Alex in person, but his articles are always interesting and of high quality, and the SQL wizardry he pulls off is downright inspiring. I really feel humbled by the creativity of some of his solutions and his apparent experience with multiple RDBMS products.

Alex' most recent post is about aggregation, and finding a top 3 based on the aggregate:

In MySQL I have a table called meanings with three columns: word, meaning, person. word has 16 possible values, meaning has 26. A person assigns one or more meanings to each word. In the sample above, person 1 assigned two meanings to word 2. There will be thousands of persons. I need to find the top three meanings for each of the 16 words, with their frequencies. Is it possible to solve this with a single MySQL query?


Alex presents a solution that uses GROUP_CONCAT basically as a poor man's windowing function, a technique I have described on several occasions in the past for ranking, median and percentile solutions in MySQL.

Now, Alex' solution is very clever and there are some elements that I think are very creative. That said, I think his solution can be improved still. Normally I wouldn't write a blog about it, and simply leave a comment on his blog, but his blog supports comments only for general articles, which is why I present it here:

SELECT word
, CONCAT(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 1
)
, ' ('
, SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 1
) / SUM(num) * 100
, '%)'
) rank1
, CONCAT(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 2
), ',', -1
)
, ' ('
, SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 2
), ',', -1
) / SUM(num) * 100
, '%)'
) rank2
, CONCAT(
SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(meaning ORDER BY num DESC), ',', 3
), ',', -1)
, ' ('
, SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(num ORDER BY num DESC), ',', 3
), ',', -1
) / SUM(num) * 100
, '%)'
) rank3
FROM (
SELECT word, meaning, COUNT(*) num
FROM t_meaning m
GROUP BY word,meaning
) a
GROUP BY word

This gives me output like this:

+------+--------------+--------------+--------------+
| word | rank1 | rank2 | rank3 |
+------+--------------+--------------+--------------+
| 1 | 16 (3.9728%) | 17 (3.9648%) | 12 (3.9632%) |
| 2 | 9 (3.9792%) | 10 (3.9632%) | 20 (3.9328%) |
| 3 | 20 (3.9744%) | 13 (3.968%) | 1 (3.9648%) |
| 4 | 26 (3.952%) | 7 (3.9456%) | 17 (3.9424%) |
| 5 | 9 (4.008%) | 21 (3.9824%) | 20 (3.936%) |
| 6 | 19 (3.9504%) | 10 (3.9488%) | 13 (3.9408%) |
| 7 | 23 (4.0464%) | 12 (3.976%) | 19 (3.9648%) |
| 8 | 23 (4.0112%) | 3 (4.0096%) | 8 (3.9328%) |
| 9 | 10 (4.016%) | 19 (3.984%) | 15 (3.9616%) |
| 10 | 10 (4.0304%) | 14 (3.9344%) | 11 (3.9312%) |
| 11 | 16 (3.9584%) | 6 (3.9296%) | 19 (3.9232%) |
| 12 | 7 (3.9968%) | 1 (3.9392%) | 26 (3.9264%) |
| 13 | 8 (4.048%) | 25 (3.9712%) | 23 (3.9616%) |
| 14 | 16 (3.9936%) | 26 (3.9632%) | 4 (3.9536%) |
| 15 | 22 (4.0608%) | 12 (4.0048%) | 1 (3.9632%) |
| 16 | 14 (4.0032%) | 18 (3.9712%) | 4 (3.9488%) |
+------+--------------+--------------+--------------+
16 rows in set (0.63 sec)

On my laptop, my solution is about 30% faster than the one presented by Alex. Personally I think mine is easier to understand too, but that is a matter of taste.

Anyway, I'm just posting this to share my solution - I do not intend to downplay the one presented by Alex. Instead, I invite everyone interested in SQL, MySQL and PostgreSQL to keep an eye on Alex' blog as well as his excellent answers on Stackoverflow. He's an SQL jedi master in my book :)

Of course, if you have a better solution to crack this problem in MySQL, please leave a comment. I'd love to hear what other people are doing to cope with these kinds of queries.

1 comment:

shlominoach said...

Hi Roland,

Have just came upon this post.

A more general solution, allowing for top-n rows per group:
http://code.openark.org/blog/mysql/sql-selecting-top-n-records-per-group