Abstract - As
described by Walter Heck, MySQL database size can be visualized using
Google Charts. With a minor code improvement the URL for the chart can be obtained twice as fast. With some more modification, the number of lines can be cut down resulting in a function that is half as long.
Hi!
It's been a while since I posted - I admit I'm struggling for a bit to balance time and attention to the day job,
writing a book, preparing
my talks for the
MySQL user's conference and of course family life.
A month ago or so I
read a
couple of
posts about using the
Google chart API to visualize database size. Although I personally would not consider using Google Charts (in its current form) for serious application monitoring applications, I am quite charmed by its ease of use and availability. Time to give it a try myself.
I inspected the PL/SQL code provided by
Alex Gorbachev, and the MySQL code by
Walter Heck, as well as the improved implementation by
Ruturaj Vartak.
Although I applaud both Walter and Ruturaj's efforts in porting this code, I think their code can be improved still. In this short article I'd like to illustrate how a minor modification can double performance.
Code Analysis
Let's take a brief moment to analyze Walter's original code. I will cite a few fragments of his code. (I have made some simplifications and removed some distractions to make it as easy as possible to understand the logic of the code. If you find an error in my citation, please consider the possibility that it is my doing, not Walter's).
His program takes the form of a
MySQL stored function. It accepts a few parameters to configure the chart (chart type and size) and outputs a fully functional URL which can be used to retrieve a .png image that shows the size (in MB) of all databases managed by the MySQL server. This is the function signature:
CREATE FUNCTION FNC_GOOGRAPH_DB_SIZE(
p_chart_type CHAR,
p_height INT,
p_width INT
) RETURNS varchar(255)
The stored function is implemented by retrieving size metadata from the
information_schema.TABLES
table. The size of each table's data and indexes are then added and summed per database. The results are traversed using a
cursor loop.
Here's the cursor declaration:
DECLARE c_schema_sizes cursor FOR
SELECT t.table_schema
, SUM(t.data_length + t.index_length) / 1024 / 1024
FROM information_schema.tables t
GROUP BY t.table_schema
Note that the figure is immediately represented as megabytes by dividing twice by
1024
. (In other words, divide by
1024
to make kilobytes of the raw bytes, then divide by
1024
again to make megabytes out of kilobytes.)
During iteration, a list of database names and a list of database sizes are built through string concatenation. These are required to supply the data series and labels to the chart.
Only the code for constructing the data series is shown here:
/* Get the percentage of the total size as the graph's data */
IF v_chart_data = '' THEN
SET v_chart_data =
ROUND(v_data_length_sum / v_data_length_total, 2) * 100;
ELSE
SET v_chart_data = CONCAT(v_chart_data,',',
ROUND(v_data_length_sum / v_data_length_total, 2) * 100);
END IF;
Note that instead of making a list of actual sizes, a percentage is taken. Keep that in mind, we'll discuss this in more detail in the next section.
Finally, the actual URL is built, using the list of database sizes as the data series and the list of database names as data labels. Here's the URL construction code:
SET v_url = 'http://chart.apis.google.com/chart?';
SET v_url = CONCAT(v_url, 'cht=', p_chart_type);
SET v_url = CONCAT(v_url, '&chs=', p_width , 'x', p_height);
SET v_url = CONCAT(v_url, '&chtt=Database Sizes (MB)');
SET v_url = CONCAT(v_url, '&chl=', v_chart_labels);
SET v_url = CONCAT(v_url, '&chd=t:', v_chart_data);
SET v_url = CONCAT(v_url, '&chdl=', v_legend_labels);
About Google Chart Data Formats
We just mentioned that the data points are not simply concatenated - rather, data size per database is expressed as a
percentage of the total size of all databases. This is not just some arbitrary choice - it is in fact required by the Google Chart API. Here's a quote from the
Overview of data formats:
Before you can create a chart you must encode your data into a form that is understood by the Chart API. Use one of the following formats:- Text encoding uses a string of positive floating point numbers from zero to one hundred.
- ...
For this reason, the code divides everything by
v_data_length_total
and multiplies by
100
. In our code analysis, we did not discuss
v_data_length_total
because it was not really essential for the logic of the program. This is how it is computed:
SELECT ROUND(SUM(t.data_length + t.index_length) / 1024 / 1024)
INTO v_data_length_total
FROM information_schema.tables t
Note that this query is very similar to the query used in the cursor declaration. In both cases, the
information_schema.TABLES
table is queried to calculate the combined size of table data and indexes. The main difference is grouping: in the cursor declaration a
GROUP BY table_schema
was used to calculate the size
per database. In this case, grouping is absent and the size is calculated across databases, i.e. the combined size of all tables and indexes in the entire server.
Text encoding with Data Scaling
Now, as it turns out, the text-endoding used by this code comes in two flavours:
Text encoding with data scaling uses a string of positive and negative floating point numbers in combination with a scaling parameter.
So, as an alternative to 'pre-scaling' the values in the data series to a 100-point scale, we can also opt to let Google do the work, provided we pass this scale factor.
Because the code to calculate the scale factor is already present, we can simply remove the scaling computation inside the loop, and add the scale factor to the URL like so:
-- inside loop: remove percentage calulation:
IF v_chart_data = '' THEN
SET v_chart_data = ROUND(v_data_length_sum,2);
ELSE
SET v_chart_data = concat(v_chart_data, ',', ROUND(v_data_length_sum, 2));
END IF;
-- outside loop: add scaling parameter to URL:
SET v_url = CONCAT(v_url, '&chds=0,', v_data_length_total);
However, we can do better than that.
An alternative method to calculate the scaling factor
The problem with the current solution is that it has to do a separate query to obtain the total size. We already noticed that both used queries are quite similar, save for the grouping. Because we are already looping through all results, we can try to calculate the total ourselves. This would allow us to avoid doing the
SELECT...INTO
statement altogether.
We would need to add proper initialization for the
v_data_length_total
variable and add the following line inside the loop:
SET v_data_length_total := v_data_length_total + v_data_length_sum;
When comparing performance before and after this change, we can observe a dramatic change. On my laptop, the original function takes anywhere between 45 and 50 seconds. After incorporating these changes, the time is slashed by two and the function takes 'only' 22 to 24 seconds.
The fact that the second solution is twice as fast (~23 seconds instead of ~46) is not a coincidence: it's because we're doing half the number of queries (1 instead of 2). Don't get me wrong - this is still very slow. But this comes all down to poor
information_schema
performance. In this particular case, there is not much we can do to improve the code further to gain performance.
More performance improvements?
I am very much convinced the bottleneck in database size chart function is the query on the
information_schema.TABLES
table. To be more exact, it's the fact that we're accessing the
DATA_LENGTH
and
INDEX_LENGTH
columns. Just watch this:
mysql> select count(*) from information_schema.tables;
+----------+
| count(*) |
+----------+
| 1576 |
+----------+
1 row in set (0.08 sec)
mysql> select count(data_length) from information_schema.tables;
+--------------------+
| count(data_length) |
+--------------------+
| 1563 |
+--------------------+
1 row in set (25.98 sec)
mysql> select count(index_length) from information_schema.tables;
+---------------------+
| count(index_length) |
+---------------------+
| 1563 |
+---------------------+
1 row in set (25.41 sec)
As you can see, simply accessing
DATA_LENGTH
and/or
INDEX_LENGTH
causes the query to be slow.
Notice also that the time spent to execute this stand-alone query is about the same as it takes for the improved function to complete. Basically, this tells us the function spends all its time performing the query - timewise, the contribution of the remainder of the function, such as cursor traversal, creating the value lists and building the URL is simply negligible.
This means that we simply can't improve the performance of the database size chart function unless we can improve the performance of this query. Because we have no other general way of obtaining index and data size, this is the end of the line.
Other Improvements
Even though we probably can't improve the performance of Walter's function anymore, I still think it's possible to improve the code.
I guess this is kind of a pet peeve of mine, but I dislike using MySQL cursors. There's a lot of
syntax involved to set them up. Usually, they
can be avoided anyway.
MySQL Cursors are also pretty slow. This is something you really start to notice when traversing tens of thousands of rows. Not that that really matters for building Google Chart URLs: Most likely, you will hit a limitation in either
chart size or
URL length, so you can't really gain a lot of performance by eliminating cursors for this particular purpose.
Still, I think that eliminating the cursor will help to make the code less complex, so lets try anyway.
Building Lists: GROUP_CONCAT()
We just explained how the cursor was used to build various lists of values. As it happens, MySQL supports the
GROUP_CONCAT()
function, which was designed especially for that purpose.
GROUP_CONCAT()
is an aggregate function, just like
COUNT()
,
MIN()
and
MAX()
.
Like other aggregate functions,
GROUP_CONCAT()
can produce a single summary result on a group of rows. It does so by first concatenating its arguments for each row in the group, and then concatenating the per-row result for the entire group of rows, optionally separating row results using some separator.
Now, there is an important limitation with this function that causes many people to avoid
GROUP_CONCAT
altogether. Here's the relevant text from the manual:
The result is truncated to the maximum length that is given by the group_concat_max_len
system variable, which has a default value of 1024.
Basically, this says that if you don't take proper precautions, the lists you generate with
GROUP_CONCAT
may be truncated. This is obviously bad news! However, there is a very simple workaround, which is also hinted at in the documentation:
The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet
.
So, the workaround is simple: By assigning the value of
max_allowed_packet
to
group_concat_max_len
wil allow the longest possible list of values. You may argue that this might still not long enough. However, that is a mooit point. Just read up on
max_allowed_packet
:
The maximum size of one packet or any generated/intermediate string.
In other words, no MySQL string will ever exceed its length beyond
max_allowed_packet
- the same limit holds for any other method of concatenating strings within MySQL, including cursor loops.
A cursor-less database size chart function
Without further ado, this is how I would write the database size chart function:
CREATE FUNCTION f_dbsize_google_chart(
p_chart_type ENUM('bhs','p')
, p_width MEDIUMINT UNSIGNED
, p_height MEDIUMINT UNSIGNED
)
RETURNS LONGTEXT
READS SQL DATA
BEGIN
DECLARE v_sum_size, v_max_size DOUBLE;
DECLARE v_size_series, v_size_labels LONGTEXT;
-- store current group_concat_max_len so we can reset
DECLARE v_group_concat_max_len BIGINT UNSIGNED DEFAULT @@group_concat_max_len;
-- ensure group_concat capacity
SET @@group_concat_max_len := @@max_allowed_packet;
-- get the database size
SELECT ROUND(SUM(size),2), MAX(size)
, GROUP_CONCAT(
ROUND(size,2)
ORDER BY size, table_schema
)
, GROUP_CONCAT(
table_schema
ORDER BY size, table_schema
SEPARATOR '|'
)
INTO v_sum_size, v_max_size
, v_size_series, v_size_labels
FROM (
SELECT table_schema
, SUM(data_length + index_length) / 1024 / 1024 size
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
GROUP BY table_schema
) a;
-- restore original group_concat_max_len
SET @@group_concat_max_len := v_group_concat_max_len;
-- build URL
RETURN CONCAT(
'http://chart.apis.google.com/chart'
, '?cht=' , p_chart_type
, '&chs=' , p_width, 'x', p_height
, '&chds=0,', v_max_size
, '&chd=t:' , v_size_series
, '&chdl=' , v_size_labels
, '&chl=' , replace(v_size_series, ',', '|')
, '&chtt=MySQL Database Size (', v_sum_size, 'MB)'
);
END;
Here's a quick summary that points out some differences with regard to the original code:
- Instead of accepting
CHAR(1)
for the chart type, an ENUM('bhs','p')
is used to restrict the value to a listed type. - Instead returning a
varchar(3000)
, this function returns LONGTEXT
, effectively leaving it to the Google chart API to report a URL length limitation - Instead of a cursor, we use a single
SELECT..INTO
statement. - The query in the
SELECT...INTO
statement uses a subquery in the FROM
clause which is functionally equivalent to the actual cursor in the original code. A small but important improvement is the addition of a condition to restrict the result only to base tables. This automatically excludes the information_schema
and databases that contain only views - The outer query is functionally equivalent to the cursor loop and uses one
GROUP_CONCAT()
expression to obtain a list of values, and one GROUP_CONCAT()
expression to obtain a list of labels. Here, we also calculate the scaling factor by taking the MAX
of the database size. - To work around the truncation problem with
GROUP_CONCAT
, we set group_concat_max_len
the the maximum practical value. We don't just set max_group_concat_len
and leave it at that. We restore its original value at the end of the stored function.