Thursday, March 26, 2009

A Faster MySQL Database Size Google Chart

Abstract - As described by Walter Heck, MySQL database size can be visualized using Google Charts. With a minor code improvement the URL for the chart can be obtained twice as fast. With some more modification, the number of lines can be cut down resulting in a function that is half as long.

Hi!

It's been a while since I posted - I admit I'm struggling for a bit to balance time and attention to the day job, writing a book, preparing my talks for the MySQL user's conference and of course family life.

A month ago or so I read a couple of posts about using the Google chart API to visualize database size. Although I personally would not consider using Google Charts (in its current form) for serious application monitoring applications, I am quite charmed by its ease of use and availability. Time to give it a try myself.

I inspected the PL/SQL code provided by Alex Gorbachev, and the MySQL code by Walter Heck, as well as the improved implementation by Ruturaj Vartak.

Although I applaud both Walter and Ruturaj's efforts in porting this code, I think their code can be improved still. In this short article I'd like to illustrate how a minor modification can double performance.

Code Analysis

Let's take a brief moment to analyze Walter's original code. I will cite a few fragments of his code. (I have made some simplifications and removed some distractions to make it as easy as possible to understand the logic of the code. If you find an error in my citation, please consider the possibility that it is my doing, not Walter's).

His program takes the form of a MySQL stored function. It accepts a few parameters to configure the chart (chart type and size) and outputs a fully functional URL which can be used to retrieve a .png image that shows the size (in MB) of all databases managed by the MySQL server. This is the function signature:
CREATE FUNCTION FNC_GOOGRAPH_DB_SIZE(
p_chart_type CHAR,
p_height INT,
p_width INT
) RETURNS varchar(255)
The stored function is implemented by retrieving size metadata from the information_schema.TABLES table. The size of each table's data and indexes are then added and summed per database. The results are traversed using a cursor loop.

Here's the cursor declaration:
DECLARE c_schema_sizes cursor FOR
SELECT t.table_schema
, SUM(t.data_length + t.index_length) / 1024 / 1024
FROM information_schema.tables t
GROUP BY t.table_schema
Note that the figure is immediately represented as megabytes by dividing twice by 1024. (In other words, divide by 1024 to make kilobytes of the raw bytes, then divide by 1024 again to make megabytes out of kilobytes.)

During iteration, a list of database names and a list of database sizes are built through string concatenation. These are required to supply the data series and labels to the chart.

Only the code for constructing the data series is shown here:
/* Get the percentage of the total size as the graph's data */
IF v_chart_data = '' THEN
SET v_chart_data =
ROUND(v_data_length_sum / v_data_length_total, 2) * 100;
ELSE
SET v_chart_data = CONCAT(v_chart_data,',',
ROUND(v_data_length_sum / v_data_length_total, 2) * 100);
END IF;
Note that instead of making a list of actual sizes, a percentage is taken. Keep that in mind, we'll discuss this in more detail in the next section.

Finally, the actual URL is built, using the list of database sizes as the data series and the list of database names as data labels. Here's the URL construction code:
SET v_url = 'http://chart.apis.google.com/chart?';
SET v_url = CONCAT(v_url, 'cht=', p_chart_type);
SET v_url = CONCAT(v_url, '&chs=', p_width , 'x', p_height);
SET v_url = CONCAT(v_url, '&chtt=Database Sizes (MB)');
SET v_url = CONCAT(v_url, '&chl=', v_chart_labels);
SET v_url = CONCAT(v_url, '&chd=t:', v_chart_data);
SET v_url = CONCAT(v_url, '&chdl=', v_legend_labels);

About Google Chart Data Formats

We just mentioned that the data points are not simply concatenated - rather, data size per database is expressed as a percentage of the total size of all databases. This is not just some arbitrary choice - it is in fact required by the Google Chart API. Here's a quote from the Overview of data formats:
Before you can create a chart you must encode your data into a form that is understood by the Chart API. Use one of the following formats:
  • Text encoding uses a string of positive floating point numbers from zero to one hundred.
  • ...
For this reason, the code divides everything by v_data_length_total and multiplies by 100. In our code analysis, we did not discuss v_data_length_total because it was not really essential for the logic of the program. This is how it is computed:
SELECT  ROUND(SUM(t.data_length + t.index_length) / 1024 / 1024) 
INTO v_data_length_total
FROM information_schema.tables t
Note that this query is very similar to the query used in the cursor declaration. In both cases, the information_schema.TABLES table is queried to calculate the combined size of table data and indexes. The main difference is grouping: in the cursor declaration a GROUP BY table_schema was used to calculate the size per database. In this case, grouping is absent and the size is calculated across databases, i.e. the combined size of all tables and indexes in the entire server.

Text encoding with Data Scaling


Now, as it turns out, the text-endoding used by this code comes in two flavours:
Text encoding with data scaling uses a string of positive and negative floating point numbers in combination with a scaling parameter.
So, as an alternative to 'pre-scaling' the values in the data series to a 100-point scale, we can also opt to let Google do the work, provided we pass this scale factor.

Because the code to calculate the scale factor is already present, we can simply remove the scaling computation inside the loop, and add the scale factor to the URL like so:

-- inside loop: remove percentage calulation:
IF v_chart_data = '' THEN
SET v_chart_data = ROUND(v_data_length_sum,2);
ELSE
SET v_chart_data = concat(v_chart_data, ',', ROUND(v_data_length_sum, 2));
END IF;

-- outside loop: add scaling parameter to URL:
SET v_url = CONCAT(v_url, '&chds=0,', v_data_length_total);
However, we can do better than that.

An alternative method to calculate the scaling factor

The problem with the current solution is that it has to do a separate query to obtain the total size. We already noticed that both used queries are quite similar, save for the grouping. Because we are already looping through all results, we can try to calculate the total ourselves. This would allow us to avoid doing the SELECT...INTO statement altogether.

We would need to add proper initialization for the v_data_length_total variable and add the following line inside the loop:
SET v_data_length_total := v_data_length_total + v_data_length_sum;
When comparing performance before and after this change, we can observe a dramatic change. On my laptop, the original function takes anywhere between 45 and 50 seconds. After incorporating these changes, the time is slashed by two and the function takes 'only' 22 to 24 seconds.

The fact that the second solution is twice as fast (~23 seconds instead of ~46) is not a coincidence: it's because we're doing half the number of queries (1 instead of 2). Don't get me wrong - this is still very slow. But this comes all down to poor information_schema performance. In this particular case, there is not much we can do to improve the code further to gain performance.

More performance improvements?

I am very much convinced the bottleneck in database size chart function is the query on the information_schema.TABLES table. To be more exact, it's the fact that we're accessing the DATA_LENGTH and INDEX_LENGTH columns. Just watch this:
mysql> select count(*) from information_schema.tables;
+----------+
| count(*) |
+----------+
| 1576 |
+----------+
1 row in set (0.08 sec)

mysql> select count(data_length) from information_schema.tables;
+--------------------+
| count(data_length) |
+--------------------+
| 1563 |
+--------------------+
1 row in set (25.98 sec)

mysql> select count(index_length) from information_schema.tables;
+---------------------+
| count(index_length) |
+---------------------+
| 1563 |
+---------------------+
1 row in set (25.41 sec)
As you can see, simply accessing DATA_LENGTH and/or INDEX_LENGTH causes the query to be slow.

Notice also that the time spent to execute this stand-alone query is about the same as it takes for the improved function to complete. Basically, this tells us the function spends all its time performing the query - timewise, the contribution of the remainder of the function, such as cursor traversal, creating the value lists and building the URL is simply negligible.

This means that we simply can't improve the performance of the database size chart function unless we can improve the performance of this query. Because we have no other general way of obtaining index and data size, this is the end of the line.

Other Improvements

Even though we probably can't improve the performance of Walter's function anymore, I still think it's possible to improve the code.

I guess this is kind of a pet peeve of mine, but I dislike using MySQL cursors. There's a lot of syntax involved to set them up. Usually, they can be avoided anyway.

MySQL Cursors are also pretty slow. This is something you really start to notice when traversing tens of thousands of rows. Not that that really matters for building Google Chart URLs: Most likely, you will hit a limitation in either chart size or URL length, so you can't really gain a lot of performance by eliminating cursors for this particular purpose.

Still, I think that eliminating the cursor will help to make the code less complex, so lets try anyway.

Building Lists: GROUP_CONCAT()

We just explained how the cursor was used to build various lists of values. As it happens, MySQL supports the GROUP_CONCAT() function, which was designed especially for that purpose. GROUP_CONCAT() is an aggregate function, just like COUNT(), MIN() and MAX().

Like other aggregate functions, GROUP_CONCAT() can produce a single summary result on a group of rows. It does so by first concatenating its arguments for each row in the group, and then concatenating the per-row result for the entire group of rows, optionally separating row results using some separator.

Now, there is an important limitation with this function that causes many people to avoid GROUP_CONCAT altogether. Here's the relevant text from the manual:
The result is truncated to the maximum length that is given by the group_concat_max_len system variable, which has a default value of 1024.
Basically, this says that if you don't take proper precautions, the lists you generate with GROUP_CONCAT may be truncated. This is obviously bad news! However, there is a very simple workaround, which is also hinted at in the documentation:
The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet.
So, the workaround is simple: By assigning the value of max_allowed_packet to group_concat_max_len wil allow the longest possible list of values. You may argue that this might still not long enough. However, that is a mooit point. Just read up on max_allowed_packet:
The maximum size of one packet or any generated/intermediate string.
In other words, no MySQL string will ever exceed its length beyond max_allowed_packet - the same limit holds for any other method of concatenating strings within MySQL, including cursor loops.

A cursor-less database size chart function

Without further ado, this is how I would write the database size chart function:
CREATE FUNCTION  f_dbsize_google_chart(
p_chart_type ENUM('bhs','p')
, p_width MEDIUMINT UNSIGNED
, p_height MEDIUMINT UNSIGNED
)
RETURNS LONGTEXT
READS SQL DATA
BEGIN
DECLARE v_sum_size, v_max_size DOUBLE;
DECLARE v_size_series, v_size_labels LONGTEXT;

-- store current group_concat_max_len so we can reset
DECLARE v_group_concat_max_len BIGINT UNSIGNED DEFAULT @@group_concat_max_len;

-- ensure group_concat capacity
SET @@group_concat_max_len := @@max_allowed_packet;

-- get the database size
SELECT ROUND(SUM(size),2), MAX(size)
, GROUP_CONCAT(
ROUND(size,2)
ORDER BY size, table_schema
)
, GROUP_CONCAT(
table_schema
ORDER BY size, table_schema
SEPARATOR '|'
)
INTO v_sum_size, v_max_size
, v_size_series, v_size_labels
FROM (
SELECT table_schema
, SUM(data_length + index_length) / 1024 / 1024 size
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
GROUP BY table_schema
) a;

-- restore original group_concat_max_len
SET @@group_concat_max_len := v_group_concat_max_len;

-- build URL
RETURN CONCAT(
'http://chart.apis.google.com/chart'
, '?cht=' , p_chart_type
, '&chs=' , p_width, 'x', p_height
, '&chds=0,', v_max_size
, '&chd=t:' , v_size_series
, '&chdl=' , v_size_labels
, '&chl=' , replace(v_size_series, ',', '|')
, '&chtt=MySQL Database Size (', v_sum_size, 'MB)'
);
END;
Here's a quick summary that points out some differences with regard to the original code:
  • Instead of accepting CHAR(1) for the chart type, an ENUM('bhs','p') is used to restrict the value to a listed type.
  • Instead returning a varchar(3000), this function returns LONGTEXT, effectively leaving it to the Google chart API to report a URL length limitation
  • Instead of a cursor, we use a single SELECT..INTO statement.
  • The query in the SELECT...INTO statement uses a subquery in the FROM clause which is functionally equivalent to the actual cursor in the original code. A small but important improvement is the addition of a condition to restrict the result only to base tables. This automatically excludes the information_schema and databases that contain only views
  • The outer query is functionally equivalent to the cursor loop and uses one GROUP_CONCAT() expression to obtain a list of values, and one GROUP_CONCAT() expression to obtain a list of labels. Here, we also calculate the scaling factor by taking the MAX of the database size.
  • To work around the truncation problem with GROUP_CONCAT, we set group_concat_max_len the the maximum practical value. We don't just set max_group_concat_len and leave it at that. We restore its original value at the end of the stored function.

4 comments:

Walter Heck said...

Very nice work, thanks for the explanations! I learned a lot from reading this. When I have some time i'll add reference to your post here from the original :)

Walter

rpbouman said...

Walter, thank *you*!

I didn't know about the google charts until I read your post. I found it an original application and I enjoyed reading it.

So, thanks, and thanks for commenting here.

kind regards,

Roland

Anonymous said...
This comment has been removed by a blog administrator.
Anonymous said...

hey ! that is some coding and explanation. I just wish somehow/someone could tweak mysql-administrator code and these graphs could be imported into its UI.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...