
Friday, June 13, 2014

MySQL: Extracting timestamp and MAC address from UUIDs

To whom it may concern.

Surrogate keys: auto-increment or UUID?

I recently overheard a discussion about whether to use auto-incrementing id's (i.e., a sequence managed by the RDBMS) or universally unique identifiers (UUIDs) as a method for generating surrogate key values.

Leakiness

Much has been written about this subject with regard to storage space, query performance and so on, but in this particular case the main consideration was leakiness. Leakiness here means that key values convey information about the state of the system that we didn't intend to disclose.

Auto-incrementing id's are leaky

For example, suppose you subscribe to a new social media site, and you get assigned a personal profile page with a URL that looks like this:
http://social.media.site/user/67638
Suppose that 67638 is the auto-incrementing key value that was uniquely assigned to the profile. If that were the case, then we could wait a day and create a new profile. We could then compare the key values and use the difference to estimate how many new profiles were created during that day. This might not necessarily be very sensitive information, but the point here is that by exposing the key values, the system exposes information that it didn't intend to disclose (or at least not in that way).

Are UUIDs leaky?

So clearly, auto-incrementing keys are leaky. The question is, are UUIDs less leaky? Because if that's the case, then that might weigh in on your consideration to choose a UUID surrogate key. As it turns out, this question can be answered with the universal but always unsatisfactory answer that "it depends". Not all UUIDs are created equal, and Wikipedia lists five different variants. This is not an exhaustive list, since vendors can (and probably do) invent their own variants.

MySQL UUIDs

In this article I want to focus on MySQL's implementation. MySQL has two different functions that generate UUIDs: UUID() and UUID_SHORT().

Are MySQL UUIDs leaky?

If you follow the links and read the documentation, the answer is easy to give, and it is definitive: yes, MySQL UUIDs are leaky. UUID() values leak the time at which they were generated and, on some platforms, the MAC address of the generating host; UUID_SHORT() values leak (the lowest bits of) the server id and the server startup time. It is not my role to judge whether this leakiness is better or worse than the leakiness of auto-incrementing keys; I'm just providing the information so you can decide whether it affects you or not.

Hacking MySQL UUID values

Now, on to the fun bit. Let's hack MySQL UUIDs and extract meaningful information. Because we can.

Credit where credit's due: although the documentation and the MySQL source code contain all the information, I had a lot of benefit from the inconspicuous-looking but otherwise excellent website of the Kruithof family. It provides a neat recipe for extracting the timestamp and MAC address from type 1 UUIDs. Thanks!

Here's a graphical representation of the recipe:

Without further ado, here come the hacks:

Extracting the timestamp from a MySQL UUID

Here's how:
select  uid                           AS uid
,       from_unixtime(
          (conv(                      
            concat(                   -- step 1: reconstruct hexadecimal timestamp
              substring(uid, 16, 3)
            , substring(uid, 10, 4)
            , substring(uid, 1, 8)
            ), 16, 10)                -- step 2: convert hexadecimal to decimal
            div 10 div 1000 div 1000  -- step 3: go from 100-nanosecond units to seconds
          ) - (141427 * 24 * 60 * 60) -- step 4: subtract the offset between October 15, 1582 and the unix epoch
        )                             AS uuid_to_timestamp
,       current_timestamp()           AS timestamp
from    (select uuid() uid)           AS alias;
Here's an example result:
+--------------------------------------+---------------------+---------------------+
| uid                                  | uuid_to_timestamp   | timestamp           |
+--------------------------------------+---------------------+---------------------+
| a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28 | 2014-06-13 13:20:00 | 2014-06-13 13:20:00 |
+--------------------------------------+---------------------+---------------------+
1 row in set (0.01 sec)
The query works by first obtaining a value from UUID(). I use a subquery in the from clause for that, which aliases the UUID() function call to uid. This allows other expressions to manipulate the same uid value; you cannot simply call UUID() multiple times, since it generates a new unique value on each call. The raw value of uid is shown as well: a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28. Most people will recognize this as 5 hexadecimal fields, separated by dashes. The first step is to extract and re-order parts of the uid to reconstruct a valid timestamp:
  • Characters 16-18 form the most significant part of the timestamp. In our example that's 1e3; the last 3 characters of the third field in the uid.
  • Characters 10-13 form the middle part of the timestamp. In our example that's f2ec; this corresponds to the second field of the uid.
  • Characters 1-8 form the least significant part of the timestamp. In our example that's a89e6d7b; this is the first field of the uid.

Extracting the parts is easy enough with SUBSTRING(), and we can use CONCAT() to glue the parts into the right order; that is, putting the most to least significant parts of the timestamp in a left-to-right order. The hexadecimal timestamp is now 1e3f2eca89e6d7b.
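
To see step 1 in isolation, you can run it on the literal example value (just an illustration; the real query above does the same on a freshly generated uid):
select  concat(
          substring(uid, 16, 3)   -- '1e3'
        , substring(uid, 10, 4)   -- 'f2ec'
        , substring(uid, 1, 8)    -- 'a89e6d7b'
        )                             AS hex_timestamp  -- '1e3f2eca89e6d7b'
from    (select 'a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28' AS uid) AS alias;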

The second step is to convert the hexadecimal timestamp to a decimal value. We can do that using CONV(hextimestamp, 16, 10), where 16 represents the number base of the hexadecimal input timestamp, and 10 represents the number base of output value.

We now have a timestamp, but it is in a 100-nanosecond resolution. So the third step is to divide so that we get back to seconds resolution. We can safely use a DIV integer division. First we divide by 10 to go from 100-nanosecond resolution to microseconds; then by 1000 to go to milliseconds, and then again by 1000 to go from milliseconds to seconds.

We now have a timestamp expressed as the number of seconds since the date of Gregorian reform to the Christian calendar, which is set at October 15, 1582. We can easily convert this to unix time by subtracting the number of seconds between that date and January 1, 1970 (i.e. the start date for unix time). I suppose there are nicer ways to express that, but 141427 * 24 * 60 * 60 is the value we need to do the conversion.
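
If you want to double-check that magic number, you can let MySQL compute the day count itself; this quick sanity check works because MySQL uses a proleptic Gregorian calendar, so the 1582 cut-over doesn't get in the way:
-- number of days between the UUID epoch (October 15, 1582) and the unix epoch
-- (should return 141427)
select datediff('1970-01-01', '1582-10-15') AS days;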

We now have a unix timestamp, and MySQL offers the FROM_UNIXTIME() function to go from unix time to a MySQL timestamp value.

Extracting the MAC address from a MySQL UUID

The last field of type 1 UUID's is the so-called node id. On BSD and Linux platforms, MySQL uses the MAC address to create the node id. The following query extracts the MAC address in the familiar colon-separated representation:
select  uid                           AS uid
,       concat(
                substring(uid, 25,2)
        , ':',  substring(uid, 27,2)
        , ':',  substring(uid, 29,2)
        , ':',  substring(uid, 31,2)
        , ':',  substring(uid, 33,2)
        , ':',  substring(uid, 35,2)
        )                             AS uuid_to_mac
from    (select uuid() uid)           AS alias;
Here's the result:
+--------------------------------------+-------------------+
| uid                                  | uuid_to_mac       |
+--------------------------------------+-------------------+
| 78e5e7c0-f2f5-11e3-bcfb-5c514fe65f28 | 5c:51:4f:e6:5f:28 |
+--------------------------------------+-------------------+
1 row in set (0.01 sec)
I checked on Ubuntu with ifconfig and found that this actually works.

What about UUID_SHORT()?

The UUID_SHORT() function is implemented thus:
(server_id & 255) << 56
+ (server_startup_time_in_seconds << 24)
+ incremented_variable++;
This suggests we can apply right bit-shifting to extract the server id and the server startup time.

Since the server_id can be (much) larger than 255, we cannot reliably extract it: only its lowest 8 bits end up in the UUID_SHORT() value. Still, it may be worth a try; many MySQL replication setups have fewer than 255 nodes, and admins often use a simple incrementing numbering scheme for the server id.

The startup time is also easy to extract with a bit shift. Feel free to post queries for that in the comments; a sketch to get you started follows below.
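
As a starting point, here is an untested sketch. It assumes the startup time is recorded as a plain unix timestamp (which the implementation suggests) and that the 24-bit counter has not yet overflowed into the timestamp bits:
select  uid                                      AS uid
,       uid >> 56                                AS server_id_low_byte  -- only the lowest 8 bits of server_id
,       from_unixtime((uid >> 24) & 0xFFFFFFFF)  AS server_startup_time
from    (select uuid_short() uid)                AS alias;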

Conclusions

I do not pretend to present any novel insights here; this is just a summary of well-known principles. The most important take-away is that you should strive not to expose system implementation details. Surrogate key values are implementation details, so they should never have been exposed in the first place. If you cannot meet that requirement (or you need to compromise because of some other requirement), then you, as system or application designer, should be aware of the leakiness of your keys. To achieve that awareness, you must understand, at the implementation level, how the keys are generated. Then you should be able to explain, in simple human language, to other engineers, product managers and users, which bits of information are leaking, and what the worst possible scenario of abuse of that information would be. Without that analysis, you cannot responsibly decide to expose the keys and hope for the best.

Wednesday, June 11, 2014

When kettle's "Get data From XML" is bombed by the BOM

To whom it may concern...
I just ran into a problem with Pentaho Data Integration, and I figured it may save others some time if I document it here.
The case is very straightforward: read a fairly small XML document directly from a URL, and parse out interesting data using the Get data from XML step.
Typically, this step works quite well for me, but I just ran into a case where it doesn't behave as expected. I got an error when I tried it on this URL:
http://api.worldbank.org/en/countries?page=1
If you follow the URL you'll find it returns a normal looking document:
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="260" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN">Latin America &amp; Caribbean (all income levels)</wb:region>
    <wb:adminregion id="" />
    <wb:incomeLevel id="NOC">High income: nonOECD</wb:incomeLevel>
    <wb:lendingType id="LNX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  ...
</wb:countries>

The error: Content is not allowed in prolog

The error I got was:
Content is not allowed in prolog.
You can encounter this error in any context where the step tries to retrieve the document from the URL, for example when you hit the "Get XPath nodes" or "Preview" buttons, as well as when you're actually running the step.

Using the w3c XML validator

The error message indicates that the XML document is in some way not valid. So I ran the URL through the w3c validator:
http://validator.w3.org/check?uri=http%3A%2F%2Fapi.worldbank.org%2Fen%2Fcountries%3Fpage%3D1&charset=%28detect+automatically%29&doctype=Inline&group=0
Interestingly, this indicated that the document is valid XML.

A rather dismal workaround

Then I tried a few things in kettle in an attempt to work around it. I won't bother you with everything I tried. Eventually, I did find a viable workaround: by retrieving the document with the HTTP Client step, and then saving it to file using a simple Text file output step (omitting the header, separators, and quotes), I could successfully open and parse that file with the "Get data from XML" step (from within a second transformation). This was of course a bit annoying, since it involved a second transformation, which complicates things considerably. However, all attempts to skip the "Text file output" step brought me back to where I was and gave me the dreaded "Content is not allowed in prolog." error. So something was happening to the document between saving it to and loading it from disk that somehow fixed it.

Mind w3c validator Warnings!

I decided to investigate a bit more. What I didn't notice at first when I validated the XML document is that, despite passing validation, it does yield 2 warnings:
  • No DOCTYPE found! Checking XML syntax only.
  • Byte-Order Mark found in UTF-8 File.
As it turns out, this second warning conveys a very important tidbit of information.

UTF-8, the BOM, and java don't play nice together

I knew what a BOM was, but I didn't quite understand its implications, in particular for java and java-based applications. Here's a quick list of things you need to know to understand the problem:
  • The byte-order mark (BOM) is a special unicode character that indicates several details of the encoding of an input stream.
  • The BOM is optional, and for UTF-8 it is actually recommended against. But apparently, that does not mean it's never there, or even non-standard!
  • The particular combination of a BOM occurring in a UTF-8 stream is not supported by java. There are bug reports about it here and here.
Maybe the Get data from XML step should be more forgiving, and take care of the BOM for us if it does occur. It sure would have saved me time. Anyway, it currently doesn't, and I came up with the following solution that is reasonably straightforward and does solve the problem:

A better workaround

We first retrieve the document with the "HTTP Client" step, then remove the BOM if it is present, and then process the document using the Get data from XML step. The transformation below illustrates that setup. The "HTTP Client" step retrieves the XML text into the document field, and the User Defined Java Expression step simply finds the first occurrence of the less-than character (<), which demarcates either the start of the xml declaration or the document element. The code for that expression is rather straightforward:
document.substring(document.indexOf("<"))
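
If you want to be a bit more conservative, here is an untested sketch of an alternative expression (using the same document field) that strips only a leading BOM character and leaves everything else alone; it assumes the HTTP Client step has already decoded the response into a Java string, so the BOM shows up as the single character \uFEFF:
document != null && document.length() > 0 && document.charAt(0) == '\uFEFF' ? document.substring(1) : document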
All in all, not very pretty, but it does the job. I hope this was of some use to you.
UPDATE1: I created PDI-12410 pertaining to this case.
UPDATE2: Apart from the BOM, there seems to be a separate, independent problem when the XML is acquired from a URL and the server uses gzip compression.
UPDATE3: I have a commit here that solves both the BOM and the gzip issues: https://github.com/rpbouman/pentaho-kettle/commit/6cf28b5e4e88022dbf356ccad01c3b949bed4731.

Tuesday, March 20, 2012

A Generic Normalizer step for Kettle

Abstract

Kettle (a.k.a. Pentaho Data Integration) offers the standard Row Normalizer step to "unpivot" columns to rows. However, this step requires configuration that presumes the structure of its input stream is static and known in advance. In this post, I explain how to construct a simple User Defined Java Class step that implements a generic Row Normalizer, one that can unpivot an arbitrary input stream without manual configuration.

The Row Normalizer step


Kettle (a.k.a. Pentaho Data Integration) offers a standard step to "unpivot" columns to rows. This step is called the Row Normalizer. You can see it in action in the screenshot below:

In the screenshot, the input is a table of columns id, first name, and last name. The output is a table of columns id, fieldname, and value. The id column is preserved, but for each row coming from the input stream, two rows are created in the output stream: 1 for the first name field, and 1 for the last name field.

Essentially, the Row Normalizer step in this example is configured to treat the first name and last name fields as a repeating group. The repeating group is untangled by dumping all values for either field in the value column. The fieldname column is used to mark the kind of value: some values are of the "first name field" kind (in case they came from the original first name input field), some are of the "last name field" kind (when they derive from the last name input field).

There are several use cases for the operation performed by the Row normalizer step. It could be used to break down a genuine repeating group in order to create a more normalized dataset. Or you might need to convert a relational dataset into a graph consisting of subject-predicate-object tuples for loading a triple store. Or maybe you want to turn a table into a fine-grained stream of changes for auditing.

The problem

The Row normalizer step works great for streams whose structure is known in advance. The structure needs to be known in advance in order to specify, in the step configuration, which fields are to be considered as a repeating group, so they can be broken out into separate kinds.

Sometimes, you don't know the structure of the input stream in advance, or it is just too inconvenient to specify it manually. In these cases, you'd really wish you could somehow unpivot any field that happens to be part of the input stream. In other words, you'd need a generic Row Normalizer step.

The Solution

In Kettle, there's always a solution, and often more than one. Here, I'd like to present a solution to dynamically unpivot an arbitrary input stream using a User Defined Java Class step.

Below is a screenshot of the step configuration:

This configuration allows the step to take an arbitrary input stream and normalize it into a collection of triples consisting of:
  1. A rownum column. This column delivers generated integer values; each distinct value uniquely identifies a single input row.
  2. A fieldnum column. This is a generated integer value that uniquely identifies a field within each input row.
  3. A value column. This is a string column that contains the value that appears in the field identified by the fieldnum column, within the row identified by the rownum value.

The Row Normalizer versus the UDJC generic Normalizer

For the input data set mentioned in the initial example, the output generated by this UDJC step is shown below:
There are a few differences with regard to the output of kettle's Row Normalizer step:
  1. One obvious difference is that the Row Normalizer step has the ability to attach names to the values, whereas the UDJC step only delivers a generated field position. On the one hand, it's really nice to have field names. On the other hand, this is also one of the weaknesses of the Row Normalizer step, because providing the names must be done manually.
  2. Another difference is that the UDJC step delivers 3 output rows for each input row, instead of the 2 rows delivered by the Row Normalizer step. The "extra" row is due to the id column. Because the id column is the key of the original input data, the Row Normalizer step was configured to only unpivot the first name and last name fields, keeping the id field unscathed: this allows any downstream steps to see which fields belong to which row. The UDJC step however does not know which field or fields form the key of the input stream. Instead, it generates its own key, the rownum field, and the id field is simply treated like any other field and unpivoted, just like the first name and last name fields. So the main difference is that the downstream steps need to use the generated rownum field to see which fields belong to which row.

The Code

The code and comments are pretty straightforward:
static long rownum = 0;
static RowMetaInterface inputRowMeta;
static long numFields;
static String[] fieldNames;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  // get the current row
  Object[] r = getRow();

  // If the row object is null, we are done processing.
  if (r == null) {
    setOutputDone();
    return false;
  }

  // If this is the first row, cache some metadata.
  // We will reuse this metadata in processing the rest of the rows.
  if (first) {
    // make sure we cache the metadata only once
    first = false;
    inputRowMeta = getInputRowMeta();
    fieldNames = inputRowMeta.getFieldNames();
    numFields = fieldNames.length;
  }

  // Generate a new id number for the current row.
  rownum += 1;

  // Generate one output row for each field in the input stream.
  int fieldnum;
  for (fieldnum = 0; fieldnum < numFields; fieldnum++) {
    Object[] outputRow = new Object[3];
    // Assign the row id.
    outputRow[0] = rownum;
    // Assign the field id. Note that we need to cast to long to match Kettle's type system.
    outputRow[1] = (long)fieldnum + 1;
    // Assign the field value, converted to string.
    outputRow[2] = inputRowMeta.getString(r, fieldnum);
    putRow(data.outputRowMeta, outputRow);
  }

  return true;
}

Getting Field information

So the UDJC step only generates a number to identify the field. For some purposes it may be useful to pull in other information about the fields, like their name, data type or data format. While we could also do this directly in the UDJC step by writing more java code, it is easier and more flexible to use some of Kettle's built-in steps:
  1. The Get Metadata Structure step. This step takes an input stream, and generates one row for each distinct field. Each of these rows has a number of columns that describe the field from the input stream. One of these columns is a Position field, which uniquely identifies each field from the input stream using a generated integer, just like the fieldnum field from our UDJC step does.
  2. The Stream lookup step. This step allows us to combine the output stream of our UDJC step with the output of the Get Metadata Structure step. By matching the Position field of the Get Metadata Structure step with the fieldnum field of the UDJC step, we can look up any metadata fields that we happen to find useful.


Below is a screenshot that shows how all these steps work together:
And here endeth the lesson.

Running Pentaho BI Server behind a proxy

To whom it may concern - a quick hands-on guide for running the Pentaho BI server behind a proxy

Prerequisites


This post assumes you're running Ubuntu linux (or at least a Debian-based distribution) and that you have both the Apache HTTP server and the Pentaho BI server installed.

Apache HTTP Server


If you haven't got apache installed, this is your line:
$ sudo apt-get install apache2

You can then control the apache2 HTTP server using the apache2ctl script. For instance, to start it, do:
$ sudo apache2ctl start

Once it's started you can navigate to its homepage to verify that it is running:
http://localhost/

You can stop it by running
$ sudo apache2ctl stop

If you're changing apache's configuration, you need to restart it to allow changes to take effect using this command:
$ sudo apache2ctl restart

Java


Pentaho relies on Java. If not installed already you can get it like this:
$ sudo apt-get install openjdk-6-jdk

Pentaho BI Server


If you haven't got the Pentaho BI Server, download the latest version from sourceforge, and unpack the archive in some location you find convenient. (For development purposes I simply keep and run it in a subdirectory of my home directory)

You can start the pentaho BI Server by cd-ing into the biserver-ce directory and then run:
$ ./start-pentaho.sh

You can then navigate to its homepage:
http://localhost:8080/pentaho/Home

(Simply navigating to http://localhost:8080 will automatically redirect you there too).

It can be useful to monitor the log while it's running:
$ tail -f tomcat/logs/catalina.out

If you want to change something in Pentaho's configuration, you need to stop the server and then restart it. This is done by running:
$ ./stop-pentaho.sh

Configuring Proxy support for Apache


Boris Kuzmanovic wrote an excellent post about setting up proxy support for Apache. My summary (and adjustment) follows below.

First, change the apache configuration to load the required proxy modules:
$ sudo a2enmod proxy
$ sudo a2enmod proxy_http

Then, edit any site definitions to use the proxy. I just modified the default site definition:
$ sudo geany /etc/apache2/sites-enabled/000-default


Inside the <VirtualHost> section, I added these snippets immediately above the </VirtualHost> that ends the section:

<Location /pentaho/>
ProxyPass http://localhost:8080/pentaho/
ProxyPassReverse http://localhost:8080/pentaho/
SetEnv proxy-chain-auth
</Location>

<Location /pentaho-style/>
ProxyPass http://localhost:8080/pentaho-style/
ProxyPassReverse http://localhost:8080/pentaho-style/
SetEnv proxy-chain-auth
</Location>


After making these changes, we need to restart apache:
$ sudo apache2ctl restart

Requests for these two locations are now effectively tunneled to the respective locations on the Pentaho BI Server, and vice versa, the response is passed back.
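
To verify that the proxy works, you can request the Pentaho home page through apache on port 80 instead of tomcat's port 8080, for example with curl (assuming you have curl installed):
$ curl -I http://localhost/pentaho/Home

A 200 response (or a redirect to the login page) indicates the request was proxied through to tomcat.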

Using mod_proxy_ajp instead of proxy_http


While the regular HTTP proxy simply works, there is a better, more tightly integrated solution. The regular HTTP proxy handles HTTP requests received by the Apache Httpd server by sending a new, equivalent HTTP request through to the tomcat server. Likewise, Tomcat's HTTP response is then sent back as a new, equivalent HTTP response to the source of the original request.

So, that's twice a transport over HTTP.

Things can be improved by routing the incoming HTTP request to the tomcat server using a binary protocol: the Apache JServ Protocol (AJP). (For details, see this excellent comparison between HTTP/HTTPS and AJP.)

Fortunately, the steps to setup an AJP proxy are almost identical to those for setting up a regular HTTP proxy. First, enable the ajp proxy module:
$ sudo a2enmod proxy
$ sudo a2enmod proxy_ajp

(Note that the proxy module was already enabled as part of setting up the regular HTTP proxy; the line is repeated here for completeness, but it is not necessary if you already completed those steps. You can enable either or both of the proxy_http and proxy_ajp modules, and both require the proxy module.)

Then we edit the site configuration again to use the AJP proxy. Since the locations /pentaho/ and /pentaho-style/ were already used, we first comment those out:

#<Location /pentaho/>
# ProxyPass http://localhost:8080/pentaho/
# ProxyPassReverse http://localhost:8080/pentaho/
# SetEnv proxy-chain-auth
#</Location>

#<Location /pentaho-style/>
# ProxyPass http://localhost:8080/pentaho-style/
# ProxyPassReverse http://localhost:8080/pentaho-style/
# SetEnv proxy-chain-auth
#</Location>


Then we add equivalent lines going via the AJP proxy:

ProxyPass /pentaho ajp://localhost:8009/pentaho
ProxyPassReverse /pentaho ajp://localhost:8009/pentaho

ProxyPass /pentaho-style ajp://localhost:8009/pentaho-style
ProxyPassReverse /pentaho-style ajp://localhost:8009/pentaho-style

(The bit that goes ajp://localhost:8009 refers to the ajp service that is running on port 8009 of tomcat by default.)
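
If the AJP connection is refused, check that tomcat's AJP connector is actually enabled. The stock server.xml shipped with the BI server's tomcat (tomcat/conf/server.xml) normally contains a connector definition along these lines; the exact attributes may vary per version:

<!-- Define an AJP 1.3 Connector on port 8009 -->
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />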

Again we have to restart the apache service for the changes to take effect:
$ sudo apache2ctl restart

Acknowledgements


Thanks to Paul Stöllberger, Pedro Alves and Tom Barber for valuable feedback and background information regarding AJP.

Thursday, December 01, 2011

Common Schema: dependencies routines

Are you a MySQL DBA? Check out the common_schema project by Oracle ACE Shlomi Noach.

The common_schema is an open source MySQL schema that packs a number of utility views, functions and stored procedures. You can use these utilities to simplify MySQL database administration and development. Shlomi just released revision 178, and I'm happy and proud to be working together with Shlomi on this project.

Among the many cool features created by Shlomi, such as foreach, repeat_exec and exec_file, there are a few %_dependencies procedures I contributed:

  • get_event_dependencies(schema_name, event_name)

  • get_routine_dependencies(schema_name, routine_name)

  • get_sql_dependencies(sql, default_schema)

  • get_view_dependencies(schema_name, view_name)

All these procedures return a resultset that indicates which schema objects are used by the object identified by the input parameters. Here are a few examples that should give you an idea:

mysql> call common_schema.get_routine_dependencies('common_schema', 'get_routine_dependencies');
+---------------+----------------------+-------------+--------+
| schema_name   | object_name          | object_type | action |
+---------------+----------------------+-------------+--------+
| common_schema | get_sql_dependencies | procedure   | call   |
| mysql         | proc                 | table       | select |
+---------------+----------------------+-------------+--------+
2 rows in set (0.19 sec)

Query OK, 0 rows affected (0.19 sec)

mysql> call common_schema.get_routine_dependencies('common_schema', 'get_sql_dependencies');
+---------------+-------------------+-------------+--------+
| schema_name   | object_name       | object_type | action |
+---------------+-------------------+-------------+--------+
| common_schema | _get_sql_token    | procedure   | call   |
| common_schema | _sql_dependencies | table       | create |
| common_schema | _sql_dependencies | table       | drop   |
| common_schema | _sql_dependencies | table       | insert |
| common_schema | _sql_dependencies | table       | select |
+---------------+-------------------+-------------+--------+
5 rows in set (1.59 sec)
Of course, there's always a lot to be desired. The main shortcoming as I see it now is that the dependencies are listed only one level deep: that is, the dependencies are not recursively analyzed. Another problem is that there is currently nothing to calculate reverse dependencies (which would arguably be even more useful).

The good news is, this is all open source, and your contributions are welcome! If you're interested in the source code of these routines, checkout the common_schema project, and look in the common_schema/routines/dependencies directory.

If you'd like to add recursive dependencies, or reverse dependencies, then don't hesitate and contribute. If you have a one-off contribution that relates directly to these dependencies routines, then it's probably easiest if you email me directly, and I'll see what I can do to get it in. If you are interested in more long term contribution, it's probably best if you write Shlomi, as he is the owner of the common_schema project.

You can even contribute without implementing new features or fixing bugs. You can simply contribute by using the software and find bugs or offer suggestions to improve it. If you found a bug, or have an idea for an improvement or an entirely new feature, please use the issue tracker.

For now, enjoy, and until next time.

Friday, October 21, 2011

MySQL Hacks: Preventing deletion of specific rows

Recently, someone emailed me:
I have a requirement in MYSQL as follows:
we have a table EMP and we have to restrict the users not delete employees with DEPT_ID = 10. If user executes a DELETE statement without giving any WHERE condition all the rows should be deleted except those with DEPT_ID = 10.

We are trying to write a BEFORE DELETE trigger but we are not able to get this functionality.

I have seen your blog where you explained about Using an UDF to Raise Errors from inside MySQL Procedures and/or Triggers. Will it helps me to get this functionality? Could you suggest if we have any other alternatives to do this as well?
Frankly, I usually refer people that write me these things to a public forum, but this time I felt like giving it a go. I figured it would be nice to share my solution, and I'm also curious whether others have found different solutions.

(Oh, I should point out that I haven't asked what the underlying reasons are for this somewhat extraordinary requirement. I normally would do that if I were confronted with such a requirement in a professional setting. In this case I'm only interested in finding a crazy hack.)

Attempt 1: Re-insert deleted rows with a trigger

My first suggestion was:
Raising the error won't help you achieve your goal: as soon as you raise the error, the statement will either abort (in case of a non-transactional table) or roll back all row changes made up to raising the error (in case of a transactional table).

Although I find the requirement strange, here's a trick you could try:

write an AFTER DELETE FOR EACH ROW trigger that re-inserts the rows back into the table in case the condition DEPT_ID = 10 is met.

Hope this helps...

Alas, I should've actually tried it myself before replying, because it doesn't work. If you do try it, a DELETE results in this runtime error:
Can't update table 'emp' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.
This is also known as "the mutating table problem".

Attempt 2: Re-insert deleted rows into a FEDERATED table

As it turns out, there is a workaround that meets all of the original demands. The workaround relies on the FEDERATED storage engine, which we can use to trick MySQL into thinking we're manipulating a different table than the one that fires the trigger. My first attempt went something like this:

CREATE TABLE t (
id INT AUTO_INCREMENT PRIMARY KEY,
dept_id INT,
INDEX(dept_id)
);

CREATE TABLE federated_t (
id INT AUTO_INCREMENT PRIMARY KEY,
dept_id INT,
INDEX(dept_id)
)
ENGINE FEDERATED
CONNECTION = 'mysql://root@localhost:3306/test/t';

DELIMITER //

CREATE TRIGGER adr_t
AFTER DELETE ON t
FOR EACH ROW
IF old.dept_id = 10 THEN
INSERT INTO federated_t
VALUES (old.id, old.dept_id);
END IF;
//

DELIMITER ;
So the idea is to let the trigger re-insert the deleted rows back into the federated table, which in turn points to the original table that fired the trigger to fool MySQL into thinking it isn't touching the mutating table. Although this does prevent one from deleting any rows that satisfy the DEPT_ID = 10 condition, it does not work as intended:

mysql> INSERT INTO t VALUES (1,10), (2,20), (3,30);
Query OK, 3 rows affected (0.11 sec)

mysql> DELETE FROM t;
ERROR 1159 (08S01): Got timeout reading communication packets

mysql> SELECT * FROM t;
+----+---------+
| id | dept_id |
+----+---------+
|  1 |      10 |
|  2 |      20 |
|  3 |      30 |
+----+---------+
3 rows in set (0.00 sec)
At this point I can only make an educated guess about the actual underlying reason for this failure. It could be that the deletion locks the rows or even the table, thereby blocking the insert into the federated table until we hit a timeout. Or maybe MySQL enters into an infinite loop of deletions and insertions until we hit a timeout. I didn't investigate, so I don't know, but it seems clear this naive solution doesn't solve the problem.

Attempt 3: Deleting from the FEDERATED table and re-inserting into the underlying table

It turns out that we can solve it with a FEDERATED table by turning the problem around: Instead of manipulating the original table, we can INSERT and DELETE from the FEDERATED table, and have an AFTER DELETE trigger on the FEDERATED table re-insert the deleted rows back into the original table:

DROP TRIGGER adr_t;

DELIMITER //

CREATE TRIGGER adr_federated_t
AFTER DELETE ON federated_t
FOR EACH ROW
IF old.dept_id = 10 THEN
INSERT INTO t
VALUES (old.id, old.dept_id);
END IF;
//

DELIMITER ;
Now, the DELETE does work as intended:

mysql> DELETE FROM federated_t;
Query OK, 3 rows affected (0.14 sec)

mysql> SELECT * FROM federated_t;
+----+---------+
| id | dept_id |
+----+---------+
|  1 |      10 |
+----+---------+
1 row in set (0.00 sec)
Of course, to actually use this solution, one would grant applications access only to the federated table, and "hide" the underlying table so they can't bypass the trigger by deleting rows directly from the underlying table.
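
For example, a hypothetical application account (the name app and the test schema are just placeholders here) would get privileges on the federated table only, so the trigger cannot be bypassed:

CREATE USER 'app'@'%' IDENTIFIED BY 'secret';
GRANT SELECT, INSERT, UPDATE, DELETE ON test.federated_t TO 'app'@'%';
-- note: no privileges at all on the underlying table test.t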

Now, even though this solution does seem to fit the original requirements, I would not recommend it for several reasons:
  • It uses the FEDERATED storage engine, which hasn't been well supported. For that reason, it isn't enabled by default, and you need access to the MySQL configuration to enable it, limiting the applicability of this solution. Also, you could run into some nasty performance problems with the FEDERATED storage engine.
  • The solution relies on a trigger. In MySQL, triggers can really limit performance.
  • Perhaps the most important reason is that this solution performs "magic" by altering the behaviour of SQL statements. Arguably, this is not so much the fault of the solution as it is of the original requirement.

An Alternative without relying on magic: a foreign key constraint

If I were to encounter the original requirement in a professional situation, I would argue that we should not desire to alter the semantics of SQL commands. If we tell the RDBMS to delete all rows from a table, it should either succeed and result in all rows being deleted, or it should fail and fail completely, leaving the data unchanged.

So how would we go about implementing a solution for this changed requirement?

We certainly could try the approach that was suggested in the original request: create a trigger that raises an exception whenever we find the row should not be deleted. However, this would still rely on a trigger (which is slow). And if you're not on MySQL 5.5 (or higher), you would have to use one of the ugly hacks to raise an exception.

As it turns out, there is a very simple solution that does not rely on triggers. We can create a "guard table" that references the table we want to protect using a foreign key constraint:

mysql> CREATE TABLE t_guard (
-> dept_id INT PRIMARY KEY,
-> FOREIGN KEY (dept_id)
-> REFERENCES t(dept_id)
-> );
Query OK, 0 rows affected (0.11 sec)

mysql> INSERT INTO t_guard values (10);
Query OK, 1 row affected (0.08 sec)

mysql> DELETE FROM t;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`test`.`t_guard`, CONSTRAINT `t_guard_ibfk_1` FOREIGN KEY (`dept_id`) REFERENCES `t` (`dept_id`))
mysql> DELETE FROM t WHERE dept_id != 10;
Query OK, 2 rows affected (0.05 sec)
(Like in the prior example with the federated table, the guard table would not be accessible to the application, and the "guard rows" would have to be inserted by a privileged user)

Finally: what a quirky foreign key constraint!

You might have noticed that there's something quite peculiar about the foreign key constraint: typically, foreign key constraints serve to relate "child" rows to their respective "parent" row. To do that, the foreign key would typically point to a column (or set of columns) that make up either the primary key or a unique constraint in the parent table. But in this case, the referenced column dept_id in the t table is contained only in an index which is not unique. Strange as it may seem, this is allowed by MySQL (or rather, InnoDB). In this particular case, this flexibility (or is it a bug?) serves us quite well, and it allows us to guard many rows in the t table with dept_id = 10 with just one single row in the guard table.

Friday, October 07, 2011

Fighting Spam: Word Verification

Hi All,

this is a quick note to let you know that from now on, commenters on this blog will need to complete a word verification (captcha) step.

Personally, I regret to have to take this measure. Let me explain why I'm doing it anyway.

For the past three months or so, moderating comments on this blog has become a real drag due to a surge in anonymous spam. While Blogger's spam detection is quite good, I still get notification mails prompting me to moderate. I feel this is consuming more of my time than it's worth.

Except for requiring word verification, other policies (or lack thereof) are still in effect: all comments are moderated, but anyone can comment, even anonymously. In practice, all real comments get published - even negative or derogatory ones (should I receive them).

Sorry for the inconvenience, but I hope you'll understand.

Wednesday, August 24, 2011

Re-implementing udf_init_error in MySQL 5.5 and up

To whom it may concern -

Today, I received an email from a user of the udf_init_error UDF (which resides in the lib_mysqludf_udf library). The purpose of this UDF is to generate an error condition, which can be used to abruptly terminate a trigger or stored procedure. As such it is a workaround for bug #11661. This is all described extensively in my now ancient article here.

The user wrote me because of a problem experienced in MySQL 5.5:
...calling
select udf_init_error('Transaction Cannot Be Done Because....');
will return user friendly error message:
Transaction Cannot Be Done Because....
. But in MySQL 5.5, it returns
Can't initialize function 'udf_init_error; Transaction Cannot Be Done Because....
The Can't initialize function 'udf_init_error; bit is so annoying! How can I get rid of that?
I explained that the UDF still works like it should; it's just that at some point during the 5.0 lifecycle, the format of the error message was changed. (I can't recall exactly which version that was, but I did file bug #38452 that describes this issue).

Anyway, I suggested moving away from the udf_init_error() UDF and porting all dependent code to use the SIGNAL syntax instead, which was introduced in MySQL 5.5. (For a friendly introduction to using the SIGNAL syntax, please check out one of my prior articles.)

Unfortunately, for this particular user this would not be an easy task:
The use of SIGNAL did come to my mind, but the implementation is not easy. I have thousands of stored routines to modify. Besides, I'm already satisfied with what the UDF does.
On the one hand, it makes me happy to hear the udf_init_error() UDF served him so well that he wrote so many routines that rely on it; on the other hand, I feel bad that this is holding him back from upgrading to MySQL 5.5.

For everybody that is in this same position, I'd like to suggest the following solution: simply re-implement udf_init_error() as a stored SQL function that uses the SIGNAL functionality instead. The error message returned to the client will not be exactly the same as in the olden MySQL 5.0 days, but at least there will not be an annoying complaint about a UDF that cannot be initialized.

Here's a very simple example that illustrates how to do it:
DELIMITER //

CREATE FUNCTION udf_init_error(
  p_message VARCHAR(80)
)
RETURNS INTEGER
DETERMINISTIC
NO SQL
BEGIN
  DECLARE err CONDITION FOR SQLSTATE '45000';
  SIGNAL err SET MESSAGE_TEXT = p_message;
  RETURN 1;
END;
//

DELIMITER ;
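
Calling the re-implemented function raises the condition much like the old UDF did. Something along these lines is what you can expect (the exact error number may vary per version):

mysql> SELECT udf_init_error('Transaction Cannot Be Done Because....');
ERROR 1644 (45000): Transaction Cannot Be Done Because....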
I hope this helps.