Friday, June 13, 2014

MySQL: Extracting timestamp and MAC address from UUIDs

To whom it may concern.

Surrogate keys: auto-increment or UUID?

I recently overheard a discussion about whether to use auto-incrementing ids (i.e., a sequence managed by the RDBMS) or universally unique identifiers (UUIDs) as the method for generating surrogate key values.

Leakiness

Much has been written about this subject with regard to storage space, query performance and so on, but in this particular case the main consideration was leakiness. Leakiness in this case means that key values convey information about the state of the system that we didn't intend to disclose.

Auto-incrementing id's are leaky

For example, suppose you sign up for a new social media site, and you get assigned a personal profile page with a URL that looks like this:
http://social.media.site/user/67638
Suppose that 67638 is the auto-incrementing key value that was uniquely assigned to the profile. If that is the case, then we could wait a day and create another profile. We could then compare the two key values and use the difference to estimate how many new profiles were created during that day. This might not be very sensitive information, but the point is that by exposing the key values, the system exposes information that it didn't intend to disclose (or at least not in that way).

Are UUIDs leaky?

So clearly, auto-incrementing keys are leaky. The question is: are UUIDs less leaky? Because if they are, that might weigh in on your decision to choose a UUID surrogate key. As it turns out, this question can be answered with the universal but always unsatisfactory answer that "it depends". Not all UUIDs are created equal, and Wikipedia lists 5 different variants. Nor is that an exhaustive list, since vendors can (and probably will) invent their own variants.

MySQL UUIDs

In this article I want to focus on MySQL's implementation. MySQL has two different functions that generate UUIDs: UUID() and UUID_SHORT().
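
For reference, this is what calling the two functions looks like. The result values in the comments are only illustrative (the second one is the example value from the MySQL manual); actual values differ per server and per invocation:
select  uuid()        AS uuid_value         -- e.g. 'a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28'
,       uuid_short()  AS uuid_short_value;  -- e.g. 92395783831158784 (a 64-bit unsigned integer)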

Are MySQL UUIDs leaky?

If you follow the links and read the documentation, the answer is easy and definitive: yes, MySQL UUIDs are leaky. UUID() values expose the time at which they were generated as well as the MAC address of the generating host, and UUID_SHORT() values expose (part of) the server id and the server startup time. It is not my role to judge whether this leakiness is better or worse than the leakiness of auto-incrementing keys; I'm just providing the information so you can decide whether it affects you or not.

Hacking MySQL UUID values

Now, on to the fun bit. Let's hack MySQL UUIDs and extract meaningful information. Because we can.

Credit where credit's due: although the documentation and the MySQL source code contain all the information, I benefited a lot from the inconspicuous-looking but otherwise excellent website of the Kruithof family. It provides a neat recipe for extracting the timestamp and MAC address from type 1 UUIDs. Thanks!

Without further ado, here come the hacks:

Extracting the timestamp from a MySQL UUID

Here's how:
select  uid                           AS uid
,       from_unixtime(
          (conv(                      
            concat(                   -- step 1: reconstruct hexadecimal timestamp
              substring(uid, 16, 3)
            , substring(uid, 10, 4)
            , substring(uid, 1, 8)
            ), 16, 10)                -- step 2: convert hexadecimal to decimal
            div 10 div 1000 div 1000  -- step 3: go from 100-nanosecond intervals to seconds
          ) - (141427 * 24 * 60 * 60) -- step 4: subtract the offset between October 15, 1582 and January 1, 1970 (in seconds)
        )                             AS uuid_to_timestamp
,       current_timestamp()           AS timestamp
from    (select uuid() uid)           AS alias;
Here's an example result:
+--------------------------------------+---------------------+---------------------+
| uid                                  | uuid_to_timestamp   | timestamp           |
+--------------------------------------+---------------------+---------------------+
| a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28 | 2014-06-13 13:20:00 | 2014-06-13 13:20:00 |
+--------------------------------------+---------------------+---------------------+
1 row in set (0.01 sec)
The query works by first obtaining the value from UUID(). I use a subquery in the from clause for that, which aliases the UUID() function call to uid. This allows other expressions to manipulate the same uid value; you cannot simply call UUID() multiple times, since it generates a new unique value on each call. The raw value of uid is shown as well, which is: a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28. Most people will recognize this as 5 hexadecimal fields, separated by dashes. The first step is to extract and re-order parts of the uid to reconstruct a valid timestamp:
  • Characters 16-18 form the most significant part of the timestamp. In our example that's 1e3; the last 3 characters of the third field in the uid.
  • Characters 10-13 form the middle part of the timestamp. In our example that's f2ec; this corresponds to the second field of the uid.
  • Characters 1-8 form the least significant part of the timestamp. In our example that's a89e6d7b; this is the first field of the uid.

Extracting the parts is easy enough with SUBSTRING(), and we can use CONCAT() to glue the parts into the right order; that is, putting the most to least significant parts of the timestamp in a left-to-right order. The hexadecimal timestamp is now 1e3f2eca89e6d7b.
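
To see this step in isolation, here is a small standalone snippet that applies it to the example uid shown above (the uid literal is hard-coded purely for illustration):
select  concat(
          substring('a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28', 16, 3)   -- '1e3'
        , substring('a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28', 10, 4)   -- 'f2ec'
        , substring('a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28', 1, 8)    -- 'a89e6d7b'
        )                             AS hex_timestamp;              -- '1e3f2eca89e6d7b'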

The second step is to convert the hexadecimal timestamp to a decimal value. We can do that using CONV(hextimestamp, 16, 10), where 16 represents the number base of the hexadecimal input timestamp, and 10 represents the number base of output value.
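
Applied to the hexadecimal timestamp we just reconstructed, that conversion looks like this (the result is the number of 100-nanosecond intervals since October 15, 1582):
select  conv('1e3f2eca89e6d7b', 16, 10) AS decimal_timestamp;  -- 136219512006405499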

We now have a timestamp, but it is in 100-nanosecond resolution, so the third step is to divide until we are back at seconds resolution. We can safely use the integer division operator DIV: first we divide by 10 to go from 100-nanosecond resolution to microseconds, then by 1000 to go to milliseconds, and then by 1000 again to go from milliseconds to seconds.
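
Continuing with the example value:
select  conv('1e3f2eca89e6d7b', 16, 10)
        div 10 div 1000 div 1000        AS seconds_since_gregorian_reform;  -- 13621951200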

We now have a timestamp expressed as the number of seconds since the date of Gregorian reform to the Christian calendar, which is set at October 15, 1582. We can easily convert this to unix time by subtracting the number of seconds between that date and January 1, 1970 (i.e. the start date for unix time). I suppose there are nicer ways to express that, but 141427 * 24 * 60 * 60 is the value we need to do the conversion.
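
A quick way to double-check that constant is to let MySQL itself compute the number of days between the two dates:
select  datediff('1970-01-01', '1582-10-15')                AS days_offset     -- 141427
,       datediff('1970-01-01', '1582-10-15') * 24 * 60 * 60 AS seconds_offset; -- 12219292800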

We now have a unix timestamp, and MySQL offers the FROM_UNIXTIME() function to go from unix time to a MySQL timestamp value.

Extracting the MAC address from a MySQL UUID

The last field of type 1 UUIDs is the so-called node id. On BSD and Linux platforms, MySQL uses the MAC address to create the node id. The following query extracts the MAC address in the familiar colon-separated representation:
select  uid                           AS uid
,       concat(
                substring(uid, 25,2)
        , ':',  substring(uid, 27,2)
        , ':',  substring(uid, 29,2)
        , ':',  substring(uid, 31,2)
        , ':',  substring(uid, 33,2)
        , ':',  substring(uid, 35,2)
        )                             AS uuid_to_mac
from    (select uuid() uid)           AS alias;
Here's the result:
+--------------------------------------+-------------------+
| uid                                  | uuid_to_mac       |
+--------------------------------------+-------------------+
| 78e5e7c0-f2f5-11e3-bcfb-5c514fe65f28 | 5c:51:4f:e6:5f:28 |
+--------------------------------------+-------------------+
1 row in set (0.01 sec)
I checked on Ubuntu with ifconfig and found that this actually works.

What about UUID_SHORT()?

The UUID_SHORT() function is implemented thus:
(server_id & 255) << 56
+ (server_startup_time_in_seconds << 24)
+ incremented_variable++;
This suggests we could apply right bitshifts to extract the server id and the server startup time.

Since server_id can be (much) larger than 255, only its lowest byte ends up in the value, so we cannot reliably extract it. Still, many MySQL replication clusters have fewer than 255 nodes, and admins often use a simple incrementing numbering scheme for the server id, so it may be worth a try.

The startup time is also easy to extract with a bitshift; a sketch follows below. Feel free to post improved queries in the comments.
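
As a starting point, here is a minimal sketch of those bitshifts. It assumes the layout shown above; the column aliases are mine, and remember that only the lowest byte of server_id is present in the value:
select  uid                                       AS uid
,       uid >> 56                                 AS server_id_low_byte   -- only the lowest 8 bits of server_id
,       from_unixtime((uid >> 24) & 0xFFFFFFFF)   AS server_startup_time
from    (select uuid_short() uid)                 AS alias;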

Conclusions

I do not pretend to present any novel insights here; this is just a summary of well-known principles. The most important take-away is that you should strive not to expose system implementation details. Surrogate key values are implementation details, so they should never have been exposed in the first place. If you cannot meet that requirement (or you need to compromise because of some other requirement), then you, as system or application designer, should be aware of the leakiness of your keys. To achieve that awareness, you need insight at the implementation level into how the keys are generated. You should then be able to explain, in plain human language, to other engineers, product managers and users, which bits of information are leaking, and what the worst possible scenario of abuse of that information would be. Without that analysis, you simply cannot decide to expose the keys and hope for the best.

Wednesday, June 11, 2014

When kettle's "Get data From XML" is bombed by the BOM

To whom it may concern...
I just ran into a problem with Pentaho Data Integration, and I figured it may save others some time if I document it here.
The case is very straightforward: read a fairly small XML document directly from a URL, and parse out interesting data using the Get data from XML step.
Typically, this step works quite well for me. But I just ran into a case where it doesn't work as expected: I got an error when I tried it on this URL:
http://api.worldbank.org/en/countries?page=1
If you follow the URL you'll find it returns a normal looking document:
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="260" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN">Latin America &amp; Caribbean (all income levels)</wb:region>
    <wb:adminregion id="" />
    <wb:incomeLevel id="NOC">High income: nonOECD</wb:incomeLevel>
    <wb:lendingType id="LNX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  ...
</wb:countries>

The error: Content is not allowed in prolog

The error I got was:
Content is not allowed in prolog.
You can encounter this error in any context where the step tries to retrieve the document from the URL, for example when you hit the "Get XPath nodes" or "Preview" buttons, as well as when you're actually running the step.

Using the w3c XML validator

The error message indicates that the XML document is in some way not valid. So I ran the URL through the w3c validator:
http://validator.w3.org/check?uri=http%3A%2F%2Fapi.worldbank.org%2Fen%2Fcountries%3Fpage%3D1&charset=%28detect+automatically%29&doctype=Inline&group=0
Interestingly, this indicated that the document is valid XML.

A rather dismal workaround

Then I tried a few things in kettle in an attempt to work around it. I won't bother you with everything I tried. Eventually, I did find a viable workaround: by retrieving the document with the HTTP Client step, and then saving it to file using a simple Text file output step (omitting the header, separators, and quotes), I could successfully open and parse that file with the "Get data from XML" step (from within a second transformation). This was of course a bit annoying, since it involved a second transformation, which complicates things considerably. However, all attempts to skip the "Text file output" step brought me back to where I started: the dreaded Content is not allowed in prolog. error. So something was happening to the document between saving it to and loading it from disk that somehow fixed it.

Mind w3c validator Warnings!

I decided to investigate a bit more. What I didn't notice at first when I validated the XML document is that, despite passing validation, it does yield 2 warnings:
  • No DOCTYPE found! Checking XML syntax only.
  • Byte-Order Mark found in UTF-8 File.
As it turns out, this second warning conveys a very important tidbit of information.

UTF-8, the BOM, and java don't play nice together

I knew what a BOM was, but I didn't quite understand its implications, in particular for java and java-based applications. Here's a quick list of things you need to know to understand the problem:
  • The byte-order mark (BOM) is a special Unicode character that indicates several details of the encoding of an input stream.
  • The BOM is optional, and for UTF-8 it is actually recommended against. But, apparently, this does not mean it's never there, or that it's non-standard!
  • The particular combination of a BOM occurring in a UTF-8 stream is not supported by java. There are bug reports about it here and here.
Maybe the Get data from XML step should be more forgiving, and take care of the BOM for us if it does occur. It sure would have saved me time. Anyway, it currently doesn't, and I came up with the following solution that is reasonably straightforward and does solve the problem:

A better workaround

We can first retrieve the document with the HTTP Client step, then remove the BOM if it is present, and then process the document using the Get data from XML step. The transformation below illustrates that setup: the HTTP Client step retrieves the XML text into the document field, and the User Defined Java Expression step simply finds the first occurrence of the less-than character (<), which marks the start of either the XML declaration or the document element. The code for that expression is rather straightforward:
document.substring(document.indexOf("<"))
All in all, not very pretty, but it does the job. I hope this was of some use to you.
UPDATE1: I created PDI-12410 pertaining to this case.
UPDATE2: Apart from the BOM, there seems to be a separate, independent problem when the XML is acquired from a URL and the server uses gzip compression.
UPDATE3: I have a commit here that solves both the BOM and the gzip issues: https://github.com/rpbouman/pentaho-kettle/commit/6cf28b5e4e88022dbf356ccad01c3b949bed4731.
