
Tuesday, July 29, 2008

MySQL aggregate UDF Performance

At the MySQL User's conference 2008 I did a tutorial on MySQL User-defined functions. At the same conference, Brian Miezejewski and Mark Atwood did a talk on aggregate UDFs.

In relation to their upcoming talk, Mark wrote a blog post reporting that MySQL aggregate UDFs are three times slower than equivalent native functions:
I just proved to myself via testing and benchmarking that user defined aggregate functions are about three times slower than the call interface to the built-in functions.
Later on he posted an update, explaining that it looks like this result is probably "a MacOS thing".
It turns out to be a MacOS thing. On Linux, the performance penalty is less than 10%. Further research will surely be done.


I just tested it on Linux and Windows. For a simple aggregate UDF that is equivalent to the built-in COUNT() function, I found that the UDF is about 5% slower than the built-in function. Personally I think that's not too bad.

If you are interested in trying this on your own system, download these instructions. It would be cool if you could leave a comment here with your results. TIA

Thursday, July 24, 2008

Inspect the Query Cache using MySQL Information Schema Plug-ins

A while ago I wrote about MySQL 5.1 information schema plug-ins.

At the time, I wrote a plug-in to report the contents of the query cache, but for all kinds of reasons, I never found the time to write a decent article about it, nor to release the code.

I am not sure if I'll ever find the time to write that article, but I just tidied up the code, and installed it in a new MySQL 5.1.26-rc server. It seems to work there, so I put the code up on the web.

Inside the source file, there are instructions to build and deploy it. If all goes well, you can do things like:

mysql> select * from information_schema.mysql_cached_queries\G
*************************** 1. row ***************************
STATEMENT_ID: 1
SCHEMA_NAME: test
STATEMENT_TEXT: select count(*) from world.city
RESULT_BLOCKS_COUNT: 1
RESULT_BLOCKS_SIZE: 512
RESULT_BLOCKS_SIZE_USED: 106
1 row in set (0.00 sec)

Are you interested in this plug-in? Do you have any suggestions? Download it, and put your comments here.

Building MySQL from Source: There's a fine line...


There's a fine line between love and hate, you see

Can't wait to late, but baby I'm on it


(From: "Liberation" by Outkast)

Suppose you need to build a MySQL server with the same options as a pre-built one, distributed through the MySQL download site.

Sounds strange?

Well, maybe, but that's what you need if you want to compile certain MySQL 5.1 plug-ins so that they'll play nice with such a pre-built server.

Some plug-ins depend on nothing more than the plugin.h header file, but storage engine plug-ins, for example, require things way beyond that. If you want to compile those yourself, you are required to first build a MySQL server from source, even though you will install the plug-in in another, pre-built, server.

(If you want to see how this works in practice, check out the excellent documentation for building and installing the PBXT storage engine plug-in.)

Now, the snag is, the server you need to build must ideally be built in exactly the same way as the deployment server. So, the question is, how do you find out how your server was built?

As it turns out, there is a very simple method to obtain the original compile line. You can find it in the mysqlbug script. On line 28 you see the following line:

CONFIGURE_LINE="./configure '--prefix=/usr/local/mysql' '--localstatedir=/usr/local/mysql/data' '--libexecdir=/usr/local/mysql/bin' '--with-comment=MySQL Community Server (GPL)' '--with-server-suffix=' '--enable-thread-safe-client' '--enable-local-infile' '--enable-assembler' '--with-pic' '--with-fast-mutexes' '--with-client-ldflags=-static' '--with-mysqld-ldflags=-static' '--with-zlib-dir=bundled' '--with-big-tables' '--with-ssl' '--with-readline' '--with-embedded-server' '--with-partition' '--with-innodb' '--without-ndbcluster' '--with-archive-storage-engine' '--with-blackhole-storage-engine' '--with-csv-storage-engine' '--without-example-storage-engine' '--with-federated-storage-engine' '--with-extra-charsets=complex' 'CC=ccache gcc -static-libgcc' 'CFLAGS=-g -O3' 'CXXFLAGS=-g -O3' 'CXX=ccache gcc -static-libgcc'"


Are you curious what will happen if you compile your plug-in against a server that is not built the same as the deployment server? Well, the plug-in won't be "playing nice" with your server, and it is likely that something nasty will happen, such as a crash, or worse: the plugin may unintentionally change the behavior of the server, with unwelcome results such as data corruption.

So, find the mysqlbug script (it's normally located in the MySQL bin directory) and find that fine line...

Friday, July 11, 2008

MySQL: DIVide and Conquer

Everybody that has had to do some numeric calculations in SQL will have encountered this simple problem: divide an integer by another integer and round down the outcome to the nearest integer. In many cases people write it like this:

FLOOR(a/b)

Simple enough, right? First we do the division a/b, then we round down using the function FLOOR().


Update: My claim that TRUNCATE(a/b, 0) is equivalent to FLOOR(a/b) is false! It may be true when the outcome of the division is a positive number, but in case of a negative number, TRUNCATE() will only lose the decimals (resulting in a number closer to zero) and FLOOR() will still round down.

Thanks Kai!
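
To make the difference concrete, here is a minimal sketch in Python rather than SQL. The helper names floor_div and truncate_div are made up for this illustration; they only model the rounding behaviour of FLOOR(a/b) and TRUNCATE(a/b, 0), they do not call MySQL.

```python
import math

def floor_div(a, b):
    """Model of FLOOR(a/b): round toward negative infinity."""
    return math.floor(a / b)

def truncate_div(a, b):
    """Model of TRUNCATE(a/b, 0): drop the decimals, i.e. round toward zero."""
    return math.trunc(a / b)

# Positive outcome: both agree. Negative outcome: they differ by one.
for a, b in [(7, 2), (-7, 2)]:
    print(a, b, floor_div(a, b), truncate_div(a, b))
```

For 7 and 2 both helpers give 3; for -7 and 2 the floor is -4 while the truncation is -3. As far as I know, DIV also simply discards the fractional part, so for negative outcomes it lines up with TRUNCATE() rather than with FLOOR().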


However, there is a better way.

We can use the integer division operator DIV instead of the ordinary division operator /. Because this is an integer division, there is no need for an extra function to lose the decimals, and the expression is simply:

a DIV b

This approach has a number of advantages:

  • It is explicit. By looking at the expression we know immediately that the result will be an integer, and that a and b are meant to be integers too.

  • It is easier to read. Because we don't need another function and parentheses, this expression is easier on the eyes, something that you will appreciate if the expression is not simply FLOOR(a/b) but something like FLOOR(SUM(a)/SUM(IFNULL(b,0)))

  • It is fast! The DIV operation does not have to deal with complex floating point math, and will be much faster on most microprocessors


To prove the last point, take a look at the results of a simple benchmark. I simply used the BENCHMARK() function and executed:

mysql> SELECT BENCHMARK(10000000,1234567 DIV 7) ;
+-----------------------------------+
| BENCHMARK(10000000,1234567 DIV 7) |
+-----------------------------------+
| 0 |
+-----------------------------------+
1 row in set (0.83 sec)

mysql> SELECT BENCHMARK(10000000,1234567 / 7) ;
+---------------------------------+
| BENCHMARK(10000000,1234567 / 7) |
+---------------------------------+
| 0 |
+---------------------------------+
1 row in set (7.26 sec)

mysql> SELECT BENCHMARK(10000000,FLOOR(1234567 / 7)) ;
+----------------------------------------+
| BENCHMARK(10000000,FLOOR(1234567 / 7)) |
+----------------------------------------+
| 0 |
+----------------------------------------+
1 row in set (8.80 sec)

I repeated this two more times and averaged the time spent, and then made this little graph of the results:
(Graph: DIVideAndConquer)

The results show that DIV is about 9 to 10 times faster than the ordinary division operator, and that adding the FLOOR() function makes the entire expression another 10% slower.

Now, I don't think the performance benefit is of much practical significance. You may see a slight improvement for large datasets using multiple division operations, but in many cases the ordinary query processing will probably have a much larger part in the total time spent. But still, DIV is faster, easier to read and more explicit if you want to solve this type of problem.

Wednesday, July 09, 2008

A fast, single pass method to calculate the median in MySQL

After moving away from the GROUP_CONCAT() solution for calculating quantiles, I figured it would be nice to find a better way to calculate the median too.

Solution


I previously wrote on how to calculate the median using GROUP_CONCAT(), but I think that this is a better way:

SELECT AVG(length) AS median -- take the average of left and right median
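, MIN(length) AS left_median -- lower of the two middle values
, MAX(length) AS right_median -- higher of the two middle values
, @l AS left_median_position -- row number of the left median
, @r AS right_median_position -- row number of the right median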
FROM (
SELECT @n, length -- @n is just here to facilitate debug
FROM film
CROSS JOIN (
SELECT @n:=0
, @r := COUNT(*) DIV 2 + 1 -- right median or true median
, @l := COUNT(*) DIV 2 -- left median, or true median
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)
FROM film
) ``
WHERE (@n:=@n+1) -- row number
BETWEEN @l AND @r -- select two middle ones
ORDER BY length -- need to sort to get middle values
) ``

Apart from not relying on setting the buffer size for group_concat_max_len, this solution is also faster than the GROUP_CONCAT() solution. For example, calculating the median amount from the payment table in the sakila database takes some 0.12 seconds using the GROUP_CONCAT() solution and about 0.03 seconds with this method.

Explanation


At the heart of the solution are the user-defined variables:

  • @n: the row number

  • @l: the position of the 'left median' that is the row number with the highest value in the lower half of all rows

  • @r: the position of the 'right median', that is the row number with the lowest value in the higher half of all rows


These are initialized in the innermost subquery:

SELECT @n:=0
, @r := COUNT(*) DIV 2 + 1 -- right median or true median
, @l := COUNT(*) DIV 2 -- left median, or true median
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)
FROM film

Note that this yields one row, initializing the row number @n to zero. We can calculate the position for the right median immediately by doing

COUNT(*) DIV 2 + 1

You see, first we divide the total number of rows by two. If there is an even number of rows, COUNT(*) DIV 2 will give us the left median, and adding 1 is then by definition the right median. If COUNT(*) is an odd number, it still holds: the integer division rounds down to the nearest integer value, and adding 1 then gives us the position of the true median in that case.

The calculation of the left median is along the same lines:

@l := COUNT(*) DIV 2 -- left median, or true median
+ IF(
COUNT(*) % 2 -- even or odd?
, 1 -- odd, need next value
, 0 -- even, need true left median
)

We just have to take care that in case we do have an odd number of rows, we need to pick the true median position here too.

To do that we first find out if there is an odd number of rows using COUNT(*) % 2. This calculates the remainder of dividing the total number of rows by two. In case of an odd number of rows, the remainder is 1, which is considered to be TRUE by the IF function. In this case, 1 is added, effectively making @l equal to @r (that is, both hold the position of the true median). In case of an even number of rows, we add 0, as COUNT(*) DIV 2 is already the desired position of the left median.
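
The position arithmetic can be checked outside MySQL with a small Python sketch. The function names median_positions and median are invented for this illustration, and the sketch assumes a non-empty list of values; it mirrors the calculation of @l and @r, not the query itself.

```python
def median_positions(count):
    """Return (left, right) 1-based row positions, mirroring @l and @r."""
    right = count // 2 + 1                       # COUNT(*) DIV 2 + 1
    left = count // 2 + (1 if count % 2 else 0)  # COUNT(*) DIV 2 + IF(COUNT(*) % 2, 1, 0)
    return left, right

def median(values):
    """Average the rows whose row number lies between left and right."""
    ordered = sorted(values)
    left, right = median_positions(len(ordered))
    middle = ordered[left - 1:right]             # row number BETWEEN left AND right
    return sum(middle) / len(middle)

print(median_positions(4), median([1, 2, 3, 4]))
print(median_positions(5), median([5, 1, 3, 2, 4]))
```

For four rows the positions are 2 and 3 and the two middle values are averaged; for five rows both positions are 3 and the single middle value is returned.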

Now that we have these values, we can use @n to generate a row number, which we can compare against the calculated @l and @r values:

WHERE (@n:=@n+1) -- row number
BETWEEN @l AND @r -- select two middle ones
ORDER BY length -- need to sort to get middle values

And this will give us at most two rows in case of an even number of rows, and one row for an odd number of rows.

The final step is to average the values from the rows, which is required to calculate the median in case of an even number of rows:

SELECT AVG(length) AS median -- take the average of left and right median
...
FROM (
...
) ``

Tuesday, July 08, 2008

MySQL Percentile aftermath: Calculating all quantiles

Are you getting fed up yet with my posts on calculating percentiles? Well, I'm sorry but I'm not quite finished.

Here's a simple, fast method to calculate the specified number of quantiles:

-- set the number of quantiles, for example:
-- quartiles: 4
-- deciles: 10
-- percentiles: 100

SET @quantiles:=4; -- select quartiles

-- calculate all quantiles
--
--
SELECT amount AS metric
, @n DIV (@c DIV @quantiles) AS quantile
, @n AS N
FROM sakila.payment
CROSS JOIN (
SELECT @n:=0 -- rownumber
, @c:=COUNT(*) -- need this to calculate quantile partitions
FROM sakila.payment
) c
WHERE NOT ( -- modulo zero (=false), we are at the quantile
(@n:=@n+1) % (@c DIV @quantiles) -- rownumber equal to the quantile partition?
)
ORDER BY amount; -- need ordered partitions

You can find this snippet on MySQL Forge.
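
For those who want to play with the row-number arithmetic without a database, here is a rough Python model of it. The function name quantile_rows is made up, and the sketch assumes the rows are visited in ascending order of the metric.

```python
def quantile_rows(values, quantiles):
    """Return (metric, quantile, row number) for rows on a quantile boundary."""
    ordered = sorted(values)
    step = len(ordered) // quantiles          # @c DIV @quantiles
    if step == 0:
        raise ValueError("need at least as many rows as quantiles")
    picked = []
    for n, metric in enumerate(ordered, start=1):   # n plays the role of @n
        if n % step == 0:                            # WHERE NOT (n % step)
            picked.append((metric, n // step, n))    # quantile = @n DIV step
    return picked

for row in quantile_rows([10, 20, 30, 40, 50, 60, 70, 80], 4):
    print(row)
```

With eight rows and four quantiles the step is 2, so rows 2, 4, 6 and 8 are picked. Note that in this sketch a row count that is not a multiple of the number of quantiles can yield an extra row (10 rows and 4 quantiles give a step of 2 and therefore 5 picked rows); I would expect the query to show the same effect.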

Monday, July 07, 2008

Calculating Percentiles with MySQL, Round 2

My previous post on calculating percentiles with MySQL generated some comments and good discussion. In particular, I got some very interesting comments from Vladimir.

Basically, Vladimir was doubtful whether the GROUP_CONCAT() solution would be optimal in comparison to a JOIN. His proposal is to solve it like this:

SELECT SUM(g1.r) sr
, g2.length l
, SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON g1.length < g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1

First, this query sets up two identical subqueries in the FROM list using GROUP BY and COUNT() to calculate the number of occurrences of each distinct value. Then, these are joined and GROUP BY is again applied to calculate the total number of rows having a lower value. Finally, HAVING is used to find the groups in the upper percentiles, and LIMIT and ORDER BY are used to single out the one desired percentile value.
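
The same steps can be sketched in Python to see what each part contributes. This is only a model of the logic under my own naming (percentile_by_counts is not part of Vladimir's proposal), and it makes no attempt to reproduce the performance characteristics.

```python
from collections import Counter

def percentile_by_counts(values, fraction):
    """For each distinct value, sum the counts of all lower values (the join),
    then keep the first value whose share of lower rows exceeds the fraction."""
    counts = Counter(values)                  # SELECT COUNT(*) r, length ... GROUP BY length
    total = len(values)                       # SELECT COUNT(*) FROM film
    candidates = []
    for value in counts:                      # g2
        lower = sum(c for v, c in counts.items() if v < value)  # g1.length < g2.length
        p = lower / total
        if p > fraction:                      # HAVING p > fraction
            candidates.append((p, value))
    return min(candidates)[1] if candidates else None   # ORDER BY p LIMIT 1

print(percentile_by_counts([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], 0.5))
```

In this toy data set, 4 is the first value with more than half of the rows below it.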

As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values on varying number of rows. Here are his slightly rounded numbers:

#rows   group_concat   groupby-join
4M      1 min 6 sec    5.3 sec
1M      5 sec          2.5 sec
100K    0.5 sec        1.6 sec

Although GROUP_CONCAT() seems to break down pretty soon, he also writes:

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

He concluded by saying:

But what I think is the right solution is having something like this on the server side:

SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length
FROM film
GROUP BY length
HAVING p >= 0.9 ORDER BY p LIMIT 1

Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN, MAX, AVG, COUNT and maybe some other aggregate functions.

So, COUNT(INC length) would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:

-- allow statement batching
DELIMITER go

-- initialize
SELECT 0, COUNT(*)
INTO @cum_cnt, @cnt
FROM sakila.payment;

-- calculate percentiles
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount
amount
FROM sakila.payment
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;

go

and this gets us the result:

Query OK, 1 row affected (0.01 sec)

+-------------+--------+
| p | amount |
+-------------+--------+
| 0.904542334 | 6.99 |
+-------------+--------+
1 row in set (0.03 sec)
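
The running fraction itself is easy to model in a few lines of Python. The function name percentile_running is invented here, and two things are assumptions of the sketch rather than properties of the query: the groups are visited in ascending order, and the running total is kept as an integer count that is divided at the end, instead of accumulating fractions as the query does.

```python
from collections import Counter

def percentile_running(values, fraction):
    """Walk the distinct values in order and stop at the first one whose
    cumulative share of rows reaches the requested fraction."""
    counts = Counter(values)           # COUNT(*) per distinct value
    total = len(values)                # @cnt
    cumulative = 0                     # @cum_cnt, kept as a row count here
    for value in sorted(counts):       # assumed: groups arrive in ascending order
        cumulative += counts[value]
        if cumulative / total >= fraction:   # HAVING p >= fraction ... LIMIT 1
            return cumulative / total, value
    return None

print(percentile_running([0.99] * 5 + [2.99] * 3 + [4.99] * 2, 0.9))
```

Here the cumulative shares are 0.5, 0.8 and 1.0, so the third distinct value is the first to reach 0.9.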

Here is the equivalent GROUP_CONCAT solution:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT(
p.amount
ORDER BY p.amount
SEPARATOR ','
)
, ','
, 90/100 * COUNT(*) + 1
)
, ','
, -1
) AS `90th Percentile`
FROM sakila.payment AS p;

...and it is considerably slower:

+-----------------+
| 90th Percentile |
+-----------------+
| 6.99 |
+-----------------+
1 row in set (0.08 sec)

(sakila.payment has 16049 rows and 19 distinct values for amount)

So, the sane thing to do would be to forget about that GROUP_CONCAT idea, and use this method. It does not have the nasty drawbacks of having to mess with group_concat_max_len and I am pretty sure Vladimir's method will be faster across the board anyway.

The one thing you could object to is the extra query to initialize the user-defined variables. You can get around that by initializing those in a single-row subquery in the FROM clause, a technique described by Baron Schwartz (see for example this post):

SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
amount
FROM sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
, @cnt:=COUNT(*)
FROM sakila.payment) p
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;

This has the advantage that the variables are initialized in one go with the entire query, ensuring you are not accidentally working with uninitialized or garbage variables.

(BTW: If you want to do more with these user-defined variables, I can highly recommend more from Baron's site, see for example his article on advanced user defined variable techniques.)

Writing to the MySQL error log

In almost all application development situations, one needs to log debug information now and then. In almost all production systems, one needs to log serious error events somewhere too.

So, what can you do? Create a log table? Sprinkle your code with SELECT 'Debug: ...' statements?

At the UDF Repository for MySQL, we now have a solution to log messages to the MySQL error log: a user-defined function called log_error().

Currently it is all very crude: the log_error function takes one argument and writes it to the MySQL error log, appending a line terminator.

Please try it out, and let us know if you have comments or suggestions to improve. Thanks in advance,

Roland

Friday, July 04, 2008

Calculating the Nth percentile in MySQL

Yesterday, I was on the freenode ##pentaho irc channel when Andres Chaves asked me how to calculate the Nth percentile in MySQL. He saw a solution somewhere using subqueries, but wasn't too happy about it.

A while ago I wrote about calculating the median in MySQL, and it turns out the Nth percentile can be calculated using a similar, single-pass approach, not relying on subqueries, UDFs, or user-defined variables.

The percentile....


So, what is a percentile exactly? Here's what the wikipedia says:

A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.

....and the median


The wikipedia continues and hints at the relationship between the Nth percentile and the median:

The 25th percentile is also known as the first quartile; the 50th percentile as the median.

Sidenote: as I understand it, this latter remark concerning the median is not entirely correct. The median is "the middle value": the number of observations with a higher value is equal to the number of observations that have a lower value. For a series with an even number of observations, say {1,2,3,4}, there isn't one middle value, there are two, in this case: 2 and 3. Typically, the median is computed by taking the arithmetic mean of the two middle values, which would be (2+3) / 2 = 2.5 in this particular case. But for this example, the 50th percentile would be 3 and not 2.5. However, in most practical cases the values are fairly evenly distributed and the number of observations is sufficiently large, which means the difference between the median and 50th percentile will be small or absent.
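
The {1,2,3,4} example can be verified with a few lines of Python. This is just a sketch: it uses Python's statistics module for the median, and for the 50th percentile it applies the position rule N / 100 * COUNT(*) + 1 that is derived in the explanation later in this article, with integer division standing in for whatever rounding the database applies.

```python
import statistics

observations = [1, 2, 3, 4]

# Median: the mean of the two middle values for an even number of observations.
median = statistics.median(observations)

# 50th percentile: pick the entry right after the first 50% of the sorted list.
position = 50 * len(observations) // 100 + 1   # 1-based position
fiftieth = sorted(observations)[position - 1]

print(median, fiftieth)
```

The median comes out as 2.5, while the entry picked for the 50th percentile is 3.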

However, the median and percentile problems are quite similar: in both cases, a value is picked or computed that is higher than the value found in a particular portion of the total number of observations. For the median, the requirement is that that particular portion is equal to the number of observations that exceeds it; for the Nth percentile the requirement is that that portion constitutes N percent of the total number of observations.

The Solution


In the following example, we calculate the 90th percentile of film lengths:

SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT( -- 1) make a sorted list of values
f.length
ORDER BY f.length
SEPARATOR ','
)
, ',' -- 2) cut at the comma
, 90/100 * COUNT(*) + 1 -- at the position beyond the 90% portion
)
, ',' -- 3) cut at the comma
, -1 -- right after the desired list entry
) AS `90th Percentile`
FROM sakila.film AS f

Here, the literal 90 represents which percentile we want, that is, "the 90th percentile".

(If you like, you can leave out the SEPARATOR ',' bit, as the default separator is a comma anyway. I just wanted to have a clear indication for the source of the ',' arguments in the SUBSTRING_INDEX() calls)

Explanation


The median and Nth percentile problems can be solved using a similar approach: first, we use the string aggregate function GROUP_CONCAT() to create an ordered list of values. Then we use the substring variation SUBSTRING_INDEX() to find and excise a list entry at a particular desired position.

The difference between the solutions for the median and the Nth percentile is mainly in which entry we have to pick from the list. (See my prior article for the full story on the median.) For the Nth percentile, we first calculate the desired portion of the total number of observations. Because N is defined as a percentage, we divide by 100 to get the actual fraction of the total number of observations:

N / 100

Then, we multiply by the total number of observations to find the number of observations that make up the actual portion of observations within the specified percentile:

N / 100 * COUNT(*)

Because we want to find the observation for which the specified portion has a lower value, we need to look at the next entry instead of the last entry within the portion, so we add 1:

N / 100 * COUNT(*) + 1
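
Putting the pieces together, the list-cutting technique can be imitated in Python. The helpers substring_index and nth_percentile are names I made up: the first is only a rough model of SUBSTRING_INDEX() for a non-zero count, and the sketch uses integer division for the position, so it does not model how MySQL converts a fractional position to an integer.

```python
def substring_index(text, delim, count):
    """Rough model of SUBSTRING_INDEX() for a non-zero count."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    return delim.join(parts[count:])       # everything after the count-th delimiter from the right

def nth_percentile(values, n):
    """Sorted list, cut after the desired position, keep the last entry."""
    ordered = ",".join(str(v) for v in sorted(values))   # 1) GROUP_CONCAT ... ORDER BY
    position = n * len(values) // 100 + 1                # N / 100 * COUNT(*) + 1
    head = substring_index(ordered, ",", position)       # 2) cut beyond the N% portion
    return substring_index(head, ",", -1)                # 3) keep the last list entry

print(nth_percentile(range(1, 11), 90))
```

For the values 1 to 10, the 90% portion holds nine entries, so the entry at position 10 is returned. Like the SQL version, the result is a string, not a number.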

Caveats


There are a number of things to look out for:

Percentile value not unique


When we calculate the 90th percentile of the film lengths, we get 173:

+-----------------+
| 90th Percentile |
+-----------------+
| 173 |
+-----------------+

If we check the result by counting the portion of films with a length lower than 173 we see:

mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
-> FROM film AS f;
+------------+
| Percentage |
+------------+
| 89.4212 |
+------------+

The reason that we do not get 90% is that there are multiple occurrences of films with length equal to 173:

mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS |
| FIRE WOLVES |
| GLADIATOR WESTWARD |
| PIZZA JUMANJI |
| TALENTED HOMICIDE |
| VELVET TERMINATOR |
+----------------------+
7 rows in set (0.01 sec)

So, even though we may have picked the entry at the right position, this may still be a value within the specified portion instead of beyond. The definitions I found for the Nth percentile do not stipulate any mechanism to deal with this kind of ambiguity, whereas for the median, the correct value is always found by averaging the left and right median if necessary.

What about NULL


The second caveat concerns NULL values. Currently this method does not work when the column for which you want to calculate percentile values is nullable. It is possible to work around this though. If you can exclude the rows with the NULL value entirely, you can simply add a WHERE clause. This is a good idea also because it will cull the number of rows to process.

It may not be acceptable to throw away the rows with NULL values, for example if there is another expression in your SELECT list that needs to do something with all rows. You can then still work around it, with some extra hassle. It would involve tweaking the GROUP_CONCAT to ignore NULL values. This could be done like this:

GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)

This will ensure that GROUP_CONCAT() does not also return NULL when a NULL value is present in the specified column. If there are NULL values, these will end up as a run of commas at the head of the result:

,,,,<non-null-value1>,...,<non-null-valueN> -- 4 NULL values

Assuming we know the number of NULLs (and we do), we can clean up our list easily with just SUBSTRING():

SUBSTRING(
GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)
, SUM(IF(<column> IS NULL, 1, 0)) + 1
)

Because we are ignoring the NULL values in our list, we must likewise ignore them in our calculation of the portion of rows. So, instead of COUNT(*) we should use COUNT(<column>) in order to not count the NULL values.
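
The clean-up can be tried out with a small Python model, where None stands in for NULL. The function name concat_skipping_nulls is made up for this sketch, and it assumes that NULL values sort before all other values in ascending order.

```python
def concat_skipping_nulls(column):
    """Build the comma-separated list with NULLs as empty strings, then strip
    the leading commas and count only the non-NULL rows."""
    # Assumed ordering: NULLs first, then the values in ascending order.
    ordered = sorted(column, key=lambda v: (v is not None, 0 if v is None else v))
    joined = ",".join("" if v is None else str(v) for v in ordered)  # IF(col IS NULL, '', col)
    nulls = sum(1 for v in column if v is None)   # SUM(IF(col IS NULL, 1, 0))
    cleaned = joined[nulls:]                      # SUBSTRING(list, nulls + 1)
    non_null_count = len(column) - nulls          # COUNT(col) instead of COUNT(*)
    return cleaned, non_null_count

print(concat_skipping_nulls([30, None, 10, None, 20]))
```

With two NULL values the raw list is ",,10,20,30"; skipping two characters leaves "10,20,30", and the row count to use in the position calculation is 3.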

group_concat_max_len


When using GROUP_CONCAT(), an issue that should always be on your radar is the maximum length of the GROUP_CONCAT() result. If the result value exceeds the maximum length, the GROUP_CONCAT() result will be truncated, and a warning will be issued:

1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+

It is really important to be aware of any warnings, as a truncation of the GROUP_CONCAT() result messes up the entire calculation.

The maximum length of the GROUP_CONCAT() result is controlled through the group_concat_max_len system variable.

It can be set thus:

SET @@group_concat_max_len := <num-bytes>

The maximum practical value is the maximum packet size, which is available as the max_allowed_packet system variable. This means you can write:

SET @@group_concat_max_len := @@max_allowed_packet;

and you will never be bothered by this problem again. The GROUP_CONCAT() result can still be too large though (namely, larger than the maximum packet size) but in that case you will get a proper error instead of a truncated result.

You should realize that setting the group_concat_max_len to a high value may lead to memory problems, as each GROUP_CONCAT() invocation may individually reserve the specified amount of memory to deal with its result.

Finally...


I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the latest version.

Now, finally, I have a question for you. When I wrote about the median calculation, I mentioned that I thought it was an original method, and I asked whether someone could refute that claim. I did not get any reaction, so I'd like to repeat my question: Do you know of any book or text or blog that describes this technique? If so, let me know so I can provide proper attribution.

TIA, Roland.