Tuesday, September 15, 2009

MySQL: Another Ranking trick

I just read SQL: Ranking without self join, in which Shlomi Noach shares a nice MySQL-specific trick based on user-defined variables to compute rankings.

Shlomi's trick reminds me somewhat of the trick I came across little over a year ago to caclulate percentiles. At that time, several people pointed out to me too that using user-defined variables in this way can be unreliable.

The problem with user-defined variables

So what is the problem exaclty? Well, whenever a query assigns to a variable, and that same variable is read in another part of the query, you're on thin ice. That's because the result of the read is likely to differ depending on whether the assignment took place before or after the read. Not surprising when you think about it - the whole point of variable assignment is to change its value, which by definition causes a different result when subsequently reading the variable (unless you assigned the already assigned value of course, duh...).

Now watch that previous statement clearly - the word subsequently is all-important.

See, that's the problem. The semantics of a SQL SELECT statement is to obtain a (tabular) resultset - not specifying an algorithm to construct that resultset. It is the job of the RDBMS to figure out an algorithm and thus, you can't be sure in what order individual expressions (including variable evaluation and assignment) are executed.

The MySQL manual states it like this:

The order of evaluation for user variables is undefined and may change based on the elements contained within a given query. In SELECT @a, @a := @a+1 ..., you might think that MySQL will evaluate @a first and then do an assignment second, but changing the query (for example, by adding a GROUP BY, HAVING, or ORDER BY clause) may change the order of evaluation.

The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed.

So what good are these variables anyway?

On the one hand, this looks really lame: can't MySQL just figure out the correct order of doing the calulations? Well, that is one way of looking at it. But there is an equally valid reason not to do that. If the calculations would influence execution order, it would drastically lessen the number of ways that are available to optimize the statement.

This begs the question: Why is it possible at all to assign values to the user-defined variables? The answer is quite simple: you can use it to pass values between statetments. My hunch is the variables were created in the olden days to overcome some limitations resulting from the lack of support for subqueries. Having variables at least enables you to execute a query and assign the result temporarily for use in a subsequent statement. For example, to find the student with the highest score, you can do:

mysql> select @score:=max(score) from score;
+--------------------+
| @score:=max(score) |
+--------------------+
| 97 |
+--------------------+
1 row in set (0.00 sec)

mysql> select * from score where score = @score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
| 2 | Gromit | 97 |
+----------+--------------+-------+
1 row in set (0.03 sec)
There is nothing wrong with this approach - problems start arising only when reading and writing the same variable in one and the same statement.

Another way - serializing the set with GROUP_CONCAT


Anyway, the percentile post I just linked to contains another solution for that problem that relies on GROUP_CONCAT. It turns out we can use the same trick here.

(Some people may like to point out that using GROUP_CONCAT is not without issues either, because it may truncate the list in case the pre-assigned string buffer is not large enough. I wrote about dealing with that limitation in several places and I remain recommending to set the group_concat_max_len server variable to the value set for the max_packet_size server variable like so:
SET @@group_concat_max_len := @@max_allowed_packet;
)

The best way to understand how it works is to think of the problem in a few steps. First, we make an ordered list of all the values we want to rank. We can do this with GROUP_CONCAT like this:

mysql> SELECT GROUP_CONCAT(
-> DISTINCT score
-> ORDER BY score DESC
-> ) AS scores
-> FROM score
-> ;
+-------------+
| scores |
+-------------+
| 97,95,92,85 |
+-------------+
1 row in set (0.00 sec)

Now that we have this list, we can use the FIND_IN_SET function to look up the position of any particlar value contained in the list. Because the list is ordered in descending order (due to the ORDER BY ... DESC), and contains only unique values (due to the DISTINCT), this position is in fact the rank number. For example, if we want to know the rank of all scores with the value 92, we can do:

mysql> SELECT FIND_IN_SET(92, '97,95,92,85')
+--------------------------------+
| FIND_IN_SET(92, '97,95,92,85') |
+--------------------------------+
| 3 |
+--------------------------------+
1 row in set (0.00 sec)
So, the answer is 3 because 92 is the third entry in the list.

(If you're wondering how it's possible that we can pass the integer 92 as first argument for FIND_IN_SET: the function expects string arguments, and automatically converts whichever non-string typed value we pass to a string. In the case of the integer 92, it is silently converted to the string '92')

Of course, we are't really interested in looking up ranks for individual numbers one at a time; rather, we'd like to combine this with a query on the scores table that does it for us. Likewise, we don't really want to manually supply the list of values as a string constant, we want to substitute that with the query we wrote to generate that list.
So, we get:

mysql> SELECT score_id, student_name, score
-> , FIND_IN_SET(
-> score
-> , (SELECT GROUP_CONCAT(
-> DISTINCT score
-> ORDER BY score DESC
-> )
-> FROM score)
-> ) as rank
-> FROM score;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
| 1 | Wallace | 95 | 2 |
| 2 | Gromit | 97 | 1 |
| 3 | Shaun | 85 | 4 |
| 4 | McGraw | 92 | 3 |
| 5 | Preston | 92 | 3 |
+----------+--------------+-------+------+
5 rows in set (0.00 sec)

Alternatively, if you think that subqueries are for the devil, you can rewrite this to a CROSS JOIN like so:

SELECT score_id, student_name, score
, FIND_IN_SET(
score
, scores
) AS rank
FROM score
CROSS JOIN (SELECT GROUP_CONCAT(
DISTINCT score
ORDER BY score DESC
) AS scores
FROM score) scores

Now that we have a solutions, lets see how it compares to Shlomi's original method. To do this, I am using the payment table from the sakila sample database.

First, Shlomi's method:

mysql> SELECT payment_id
-> , amount
-> , @prev := @curr
-> , @curr := amount
-> , @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
-> FROM sakila.payment
-> , (SELECT @curr := null, @prev := null, @rank := 0) sel1
-> ORDER BY amount DESC;
+------------+--------+----------------+-----------------+------+
| payment_id | amount | @prev := @curr | @curr := amount | rank |
+------------+--------+----------------+-----------------+------+
| 342 | 11.99 | NULL | 11.99 | 1 |
. ... . ..... . ..... . ..... . . .
| 15456 | 0.00 | 0.00 | 0.00 | 19 |
+------------+--------+----------------+-----------------+------+
16049 rows in set (0.09 sec)

Wow! It sure is fast :) Now, the GROUP_CONCAT solution, using a subquery:

mysql> SELECT payment_id, amount
-> , FIND_IN_SET(
-> amount
-> , (SELECT GROUP_CONCAT(
-> DISTINCT amount
-> ORDER BY amount DESC
-> )
-> FROM sakila.payment)
-> ) as rank
-> FROM sakila.payment
+------------+--------+------+
| payment_id | amount | rank |
+------------+--------+------+
| 1 | 2.99 | 15 |
. . . .... . .. .
| 16049 | 2.99 | 15 |
+------------+--------+------+
16049 rows in set (0.14 sec)


(In case you're wondering why the results are different, this is because the result set for Shlomi's solution is necessarily ordered by ascending rank (or descending amount - same difference. To obtain the identical result, you need to add an ORDER BY clause to my query. But since the point was to calculate the ranks, I didn't bother. Of course, adding an ORDER BY could slow things down even more.)

Quite a bit slower, bummer. But at leastt we can't run into nasties with the user variables anymore. For this data set, I get about the same performance with the CROSS JOIN, but I should warn that I did not do a real benchmark.

Conclusion

Don't fall into the trap of reading and writing the same user-defined variable in the same statement. Although it seems like a great device and can give you very good performance, you cannot really control the order of reads and writes. Even if you can, you must check it again whenever you have reason to believe the query will be solved differently by the server. This is of course the case whenever you upgrade the server. But also seemingly harmless changes like adding an index to a table may change the order of execution.

Almost all cases where people want to read and write to the same user variables within the same query, they are dealing with a kind of serialization problem. They are trying to maintain state in a variable in order to use it across rows. In many cases, the right way to do that is to use a self-join. But this may not always be feasible, as pointed out in Shlomi's original post. For example, rewriting the payment rank query using a self join is not going to make you happy.

Often, there is a way out. You can use GROUP_CONCAT to serialize a set of rows. Granted, you need at least one pass for that, and another one to do something useful with the result, but this still a lot better than dealing with semi-cartesian self join issues.

Saturday, September 12, 2009

EU Should Protect MySQL-based Special Purpose Database Vendors

In my recent post on the EU antitrust regulators' probe into the Oracle Sun merger I did not mention an important class of stakeholders: the MySQL-based special purpose database startups. By these I mean:

I think it's safe to say the first three are comparable in the sense that they are all analytical databases: they are designed for data warehousing and business intelligence applications. ScaleDB might be a good fit for those applications, but I think it's architecture is sufficiently different from the first three to not call it an analytical database.

For Kickfire and Infobright, the selling point is that they are offering a relatively cheap solution to build large data warehouses and responsive business intelligence applications. (I can't really find enough information on Calpoint pricing, although they do mention low total cost of ownership.) An extra selling point is that they are MySQL compatible, which may make some difference for some customers. But that compatibility is in my opinion not as important as the availability of a serious data warehousing solution at a really sharp price.

Now, in my previous post, I mentioned that the MySQL and Oracle RDBMS products are very different, and I do not perceive them as competing. Instead of trying to kill the plain MySQL database server product, Oracle should take advantage of a huge opportunity to help shape the web by being a good steward, leading ongoing MySQL development, and in addition, enable their current Oracle Enterprise customers to build cheap LAMP-based websites (with the possibility of adding value by offering Oracle to MySQL data integration).

For these analytical database solutions, things may be different though.

I think these MySQL based analytical databases really are competitive to Oracle's Exadata analytical appliance. Oracle could form a serious threat to these MySQL-based analytical database vendors. After the merger, Oracle would certainly be in a position to hamper these vendors by resticting the non-GPL licensed usage of MySQL.
In a recent ad, Oracle vouched to increase investments in developing Sun's hardware and operating system technology. And this would eventually put them in an even better position to create appliances like Exadata, allowing them to ditch an external hardware partner like HP (which is their Exadata hardware partner).

So, all in all, in my opinion the EU should definitely take a serious look at the dynamics of the analytical database market and decide how much impact the Oracle / Sun merger could have on this particular class of MySQL OEM customers. The rise of these relatvely cheap MySQL-based analytical databases is a very interesting development for the business intelligence and data warehousing space in general, and means a big win for customers that need affordable datawarhousing / business intelligence. It would be a shame if it would be curtailed by Oracle. After the merger, Oracle sure would have the means and the motive, so if someone needs protection, I think it would be these MySQL-based vendors of analytical databases.

As always, these are just my musing and opinions - speculation is free. Feel free to correct me, add applause or point out my ignorance :)

Thursday, September 03, 2009

MySQL a factor in EU's decision

I just read Björn Schotte's post on the activities of the European Union antitrust regulators concerning the intended takeover of Sun Microsystems by Oracle.

Björn mentions a news article that cites EU Competition Commissioner Neelie Kroes saying that the commission has the obligation to protect the customers from reduced choice, higher costs or both. But to me, this bit is not the most interesting. Later on the article reads:


The Commission said it was concerned that the open source nature of Sun's MySQL database might not eliminate fully the potential for anti-competitive effects.

With both Oracle's databases and MySQL competing directly in many sectors of the database market, MySQL is widely expected to represent a greater competitive constraint as it becomes increasingly functional, the EU executive said.


In other words, the commission is working to protect the MySQL users :)

Personally, I (and many other MySQL community members) don't fear for the future of MySQL as a product. But I do think it is justified to worry about customers that are now paying Sun for some licensed usage of MySQL, most notably OEM customers and a bunch of Enterprise users.

Ever since the news was disclosed concerning the intention of Oracle to acquire Sun, it has been speculated that Oracle my try to "upsell" the Oracle RDBMS to current MySQL enterprise users. However I don't think that that would be the brightest of moves. I did a bit of speculation myself back in April in response to questions put forward in the SSWUG newsletter.

I maintain the opinions I stated there:

  • MySQL / Oracle are completely different beasts and customers realize this, and most likely Oracle does so too. People running MySQL for web related applications won't move to Oracle. Period. Oracle may be able to grab some customers that use MySQL for data warehousing, but I think that even in these cases a choice for Infobright or Kickfire makes more sense.

  • Not all problems are database problems - if Oracle does a decent job of supporting and developing MySQL, they may become a respectable enough partner for current (larger) MySQL users to help them solve other problems such as systems integration.

  • Instead of looking at the benefits for MySQL customers of using Oracle, look at the benefits for Oracle customers using MySQL. Suddenly Oracle can offer support for the most popular webstack in the world - Now all these enterprise customers running expensive Oracle installations can finally build cheap websites based on MySQL and even get support from Oracle on connecting their backend Enterprise Oracle instances to the MySQL web front ends.

  • It's not all about the products. Open Source adds a whole new dynamic to the development process. I'm not just talking about outside developers that offer new features and code patches, as this does not happen too often. There's more to it than code though

    In all successful open source projects I know there is a very lively culture of users engaging with developers and voicing their opinion on what is good and what is not so good. There is a very real chance for the user to influence the direction of the development process (although this does not mean everybody gets what they want in equal amounts). Conversely this provides a great opportunity for the development organization to learn about what the users really need and wish for.

    In short, Oracle may want to use Sun/MySQL to learn how to do better business with more empowered users.


Of course, its all just my opinion - speculation is free. So you should feel free too to post your ideas on the matter. Go ahead and leave a comment ;)

Roland Bouman's blog goes i18n (Powered by Google Translate)

Now that Pentaho Solutions is in print, and the first few copies are finding its way towards the readers, I felt like doing something completely unrelated. So, I hacked up a little page translation widget, based on the Google Language API. You can see the result in the top of the left sidebar of my blog right now:

translator

Using it is very simple: just pick the language of choice, and the page (text and some attributes like alt and title) will be translated. Pick the first entry in the list to see the original language again.

This all happens inline by dynamic DOM manipulation, without having to reload the page. I tested it on Chrome 2, Firefox 3.5, Opera 10, Safari 4 and Internet Explorer 6 and 8. So far, it seems to work for all these browsers.

Personally, I feel that the user experience you get with this widget is superior to what you would get with the google translation gadget. In addition, it is pretty easy to to configure the Translator class .

The code to add this to your page is in my opinion reasonably simple:

<!-- add a placeholder for the user interface -->
<div id="toolbar"><div>

<!-- Include script that defines the Translator class -->
<script type="text/javascript" src="Translator-min.js"></script>
<!-- Instantiate a translator, have it create its gui and render to placeholder -->
<script type="text/javascript">
var translator = new Translator();
var gui = translator.createGUI(null, "Language");
document.getElementById("toolbar").appendChild(gui);
</script>


This really is all the code you need - there are no dependencies on external Javascript frameworks. If you don't need or like the gui, you can of course skip the gui placeholder code as well as the second script and interract with the Translator object programmatically.

The minified javascript file is about 7k, which is not too bad in my opinion. I haven't worried too much about optimizations, and I think it should be possible to cut down on codesize.

Another thing I haven't focused on just now is integration with frameworks - on the contrary I made sure you can use it standalone. But in order to do that, I had to write a few methods to facilitate DOM manipulation and JSON parsing, and its almost certain you will find functions like that are already in your framework.

Anyway, readers, I'd like to hear from you...is this auseful feature on this blog? Would you like to use it on your own blog? If there's enough people that want it, I will make it available on google code or something like that.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...