Surrogate keys: auto-increment or UUID?
I recently overheard a discussion about whether to use auto-incrementing ids (i.e., a sequence managed by the RDBMS) or universally unique identifiers (UUIDs) as the method for generating surrogate key values.
Leakiness
Much has been written about this subject with regard to storage space, query performance and so on, but in this particular case the main consideration was leakiness. Leakiness in this case means that key values convey information about the state of the system that we didn't intend to disclose.
Auto-incrementing ids are leaky
For example, suppose you sign up for a new social media site, and you get assigned a personal profile page which looks like this:
http://social.media.site/user/67638
Suppose that 67638 is the auto-incrementing key value that was uniquely assigned to the profile. If that were the case, then we could wait a day and create a new profile. We could then compare the two key values and use the difference to estimate how many new profiles were created during that day. This might not necessarily be very sensitive information, but the point here is that by exposing the key values, the system exposes information that it didn't intend to disclose (or at least not in that way).
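The estimate described above is simple arithmetic. Here is a minimal Python sketch of it; the function name and the second id (68912) are made up for illustration, and only 67638 comes from the example URL. Because auto-increment sequences can contain gaps (rolled-back inserts, deleted rows), the result is best read as an upper bound.

```python
def estimate_new_profiles_per_day(first_id: int, second_id: int, days_between: float) -> float:
    """Estimate how many rows were created per day from two observed auto-increment ids."""
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    return (second_id - first_id) / days_between


# 67638 is the id from the example URL; 68912 is a hypothetical id
# obtained by creating a second profile one day later.
print(estimate_new_profiles_per_day(67638, 68912, 1))  # 1274.0
```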
Are UUIDs leaky?
So clearly, auto-incrementing keys are leaky. The question is, are UUIDs less leaky? Because if that's the case, then that might weigh in favor of choosing a UUID surrogate key. As it turns out, this question can be answered with the universal but always unsatisfactory answer that "it depends". Not all UUIDs are created equal, and Wikipedia lists 5 different versions. This is not an exhaustive list, since vendors can (and so, probably will) invent their own variants.
MySQL UUIDs
In this article I want to focus on MySQL's implementation. MySQL has two different functions that generate UUIDs: UUID() and UUID_SHORT().
Are MySQL UUIDs leaky?
If you read the documentation for these functions, then we can easily give a definitive answer, which is: yes, MySQL UUIDs are leaky:
- UUID() implements a version 1 UUID, which is generated according to "DCE 1.1: Remote Procedure Call" (Appendix A) CAE (Common Applications Environment). Type 1 UUIDs are also described by RFC 4122, "A Universally Unique IDentifier (UUID) URN Namespace". In short, it consists of a timestamp and a MAC address, plus some additional data to ensure uniqueness. If you want to check the MySQL source code, look for the function String *Item_func_uuid::val_str(String *str) in item_strfunc.cc.
- UUID_SHORT() doesn't seem to conform to any particular external standard, but it contains the server's id as well as its start time, plus some extra data to ensure uniqueness. The MySQL source code for this is longlong Item_func_uuid_short::val_int() in item_func.cc.
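As a quick sanity check outside of MySQL, the version of a UUID can be read from the value itself. The sketch below uses Python's standard uuid module on the example value that appears later in this article; it only inspects the value and says nothing about how MySQL generated it.

```python
import uuid

# Example value taken from the query output later in this article.
u = uuid.UUID("a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28")

# The first hex digit of the third field ("11e3") is the version number.
print(u.version)                   # 1
# The top bits of the fourth field ("bcfb") mark the RFC 4122 layout.
print(u.variant == uuid.RFC_4122)  # True
```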
Hacking MySQL UUID values
Now, on to the fun bit. Let's hack MySQL UUIDs and extract meaningful information. Because we can.
Credit where credit's due: although the documentation and MySQL source code contain all the information, I benefited a lot from the inconspicuous-looking but otherwise excellent website of the Kruithof family. It provides a neat recipe for extracting the timestamp and MAC address from type 1 UUIDs. Thanks!
[Figure: a graphical representation of the recipe]
Without further ado, here come the hacks:
Extracting the timestamp from a MySQL UUID
Here's how:
select  uid AS uid
,       from_unixtime(
          (conv(
            concat(                      -- step 1: reconstruct hexadecimal timestamp
              substring(uid, 16, 3)
            , substring(uid, 10, 4)
            , substring(uid, 1, 8)
            ), 16, 10)                   -- step 2: convert hexadecimal to decimal
          div 10 div 1000 div 1000       -- step 3: go from 100-nanosecond units to seconds
          ) - (141427 * 24 * 60 * 60)    -- step 4: subtract timestamp offset (October 15, 1582)
        ) AS uuid_to_timestamp
,       current_timestamp() AS timestamp
from    (select uuid() uid) AS alias;
Here's an example result:
+--------------------------------------+---------------------+---------------------+
| uid                                  | uuid_to_timestamp   | timestamp           |
+--------------------------------------+---------------------+---------------------+
| a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28 | 2014-06-13 13:20:00 | 2014-06-13 13:20:00 |
+--------------------------------------+---------------------+---------------------+
1 row in set (0.01 sec)
The query works by first obtaining the value from UUID(). I use a subquery in the from clause for that, which aliases the UUID() function call to uid. This allows other expressions to manipulate the same uid value. You cannot simply call UUID() multiple times, since it generates a new unique value each time you call it. The raw value of uid is shown as well, which is a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28. Most people will recognize this as 5 hexadecimal fields, separated by dashes.
The first step is to extract and re-order parts of the uid to reconstruct a valid timestamp:
- Characters 16-18 form the most significant part of the timestamp. In our example that's 1e3; the last 3 characters of the third field in the uid.
- Characters 10-13 form the middle part of the timestamp. In our example that's f2ec; this corresponds to the second field.
- Characters 1-8 form the least significant part of the timestamp. In our example that's a89e6d7b; this is the first field of the uid.
Extracting the parts is easy enough with SUBSTRING(), and we can use CONCAT() to glue the parts into the right order; that is, putting the most to least significant parts of the timestamp in a left-to-right order. The hexadecimal timestamp is now 1e3f2eca89e6d7b.
The second step is to convert the hexadecimal timestamp to a decimal value. We can do that using CONV(hextimestamp, 16, 10), where 16 represents the number base of the hexadecimal input timestamp, and 10 represents the number base of the output value.
We now have a timestamp, but it is in a 100-nanosecond resolution. So the third step is to divide so that we get back to seconds resolution. We can safely use DIV integer division. First we divide by 10 to go from 100-nanosecond resolution to microseconds; then by 1000 to go to milliseconds, and then again by 1000 to go from milliseconds to seconds.
We now have a timestamp expressed as the number of seconds since the date of the Gregorian reform to the Christian calendar, which is set at October 15, 1582. We can easily convert this to unix time by subtracting the number of seconds between that date and January 1, 1970 (i.e. the start date for unix time). I suppose there are nicer ways to express that, but 141427 * 24 * 60 * 60 is the value we need to do the conversion.
We now have a unix timestamp, and MySQL offers the FROM_UNIXTIME() function to go from unix time to a MySQL timestamp value.
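The four steps can be cross-checked outside of MySQL. The following Python sketch mirrors them on the example value from the result above; the function and constant names are my own, and the comparison against Python's uuid module is only there as a sanity check. It prints the time in UTC, whereas FROM_UNIXTIME() uses the session time zone, so the 13:20:00 shown earlier would correspond to a server running two hours ahead of UTC (an assumption, but a consistent one).

```python
import uuid
from datetime import datetime, timezone

# Seconds between October 15, 1582 and January 1, 1970 (141427 days).
UUID_EPOCH_OFFSET_SECONDS = 141427 * 24 * 60 * 60


def uuid1_to_unix_seconds(uid: str) -> int:
    """Mirror the four steps of the SQL query for a version 1 UUID string."""
    # Step 1: reorder the fields, most significant part first
    # (SQL characters 16-18, 10-13 and 1-8 are 1-based; Python slices are 0-based).
    hex_timestamp = uid[15:18] + uid[9:13] + uid[0:8]
    # Step 2: hexadecimal to decimal.
    ticks = int(hex_timestamp, 16)
    # Step 3: 100-nanosecond units to whole seconds, like DIV in the query.
    seconds = ticks // 10 // 1000 // 1000
    # Step 4: shift from the UUID epoch to the unix epoch.
    return seconds - UUID_EPOCH_OFFSET_SECONDS


uid = "a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28"
unix_seconds = uuid1_to_unix_seconds(uid)

# Sanity check: the reordered hex string is the 60-bit timestamp
# that Python's uuid module reports for the same value.
assert int(uid[15:18] + uid[9:13] + uid[0:8], 16) == uuid.UUID(uid).time

print(unix_seconds)                                            # 1402658400
print(datetime.fromtimestamp(unix_seconds, tz=timezone.utc))   # 2014-06-13 11:20:00+00:00
```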
Extracting the MAC address from a MySQL UUID
The last field of type 1 UUIDs is the so-called node id. On BSD and Linux platforms, MySQL uses the MAC address to create the node id. The following query extracts the MAC address in the familiar colon-separated representation:
select  uid AS uid
,       concat(
          substring(uid, 25, 2)
        , ':', substring(uid, 27, 2)
        , ':', substring(uid, 29, 2)
        , ':', substring(uid, 31, 2)
        , ':', substring(uid, 33, 2)
        , ':', substring(uid, 35, 2)
        ) AS uuid_to_mac
from    (select uuid() uid) AS alias;
Here's the result:
+--------------------------------------+-------------------+
| uid                                  | uuid_to_mac       |
+--------------------------------------+-------------------+
| 78e5e7c0-f2f5-11e3-bcfb-5c514fe65f28 | 5c:51:4f:e6:5f:28 |
+--------------------------------------+-------------------+
1 row in set (0.01 sec)
I checked on Ubuntu with ifconfig and found that this actually works.
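The same extraction can be sketched in Python. The uuid module exposes the last field as the integer attribute node; the helper name uuid1_to_mac below is my own. As in the SQL version, this only reformats the node id; whether it is a real MAC address depends on how the UUID was generated.

```python
import uuid


def uuid1_to_mac(uid: str) -> str:
    """Format the 48-bit node id of a UUID as a colon-separated MAC address."""
    node_hex = f"{uuid.UUID(uid).node:012x}"  # zero-padded to 12 hex digits
    return ":".join(node_hex[i:i + 2] for i in range(0, 12, 2))


print(uuid1_to_mac("78e5e7c0-f2f5-11e3-bcfb-5c514fe65f28"))  # 5c:51:4f:e6:5f:28
```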
What about UUID_SHORT()?
The UUID_SHORT() function is implemented thus:
(server_id & 255) << 56 + (server_startup_time_in_seconds << 24) + incremented_variable++;
This indicates we could try and apply right bitshifting to extract the server id and start time.
Since the server_id can be larger (much larger) than 255, we cannot reliably extract it. However, you can give it a try, assuming that there are many MySQL replication clusters with fewer than 255 nodes, and that admins will often use a simple incrementing number scheme for the server id.
The start time is also easy to extract with bitshift. Feel free to post queries for that in the comments.
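To illustrate the bit layout implied by that formula, here is a minimal Python sketch that composes a value and takes it apart again with shifts and masks. The function names and the sample inputs (server id 3, a start time of 1402658400, counter 5) are made up for illustration; no real UUID_SHORT() output is used. It assumes the counter still fits in the lower 24 bits; once it grows beyond that, it carries into the start time field and the decoded start time drifts.

```python
def compose_uuid_short(server_id: int, startup_time: int, counter: int) -> int:
    """Build a value following the layout of the formula quoted above."""
    return ((server_id & 255) << 56) + (startup_time << 24) + counter


def decompose_uuid_short(value: int) -> dict:
    """Recover the three fields, assuming the counter has not overflowed 24 bits."""
    return {
        "server_id_low_byte": (value >> 56) & 0xFF,       # only the lowest 8 bits survive
        "server_startup_time": (value >> 24) & 0xFFFFFFFF,  # 32-bit unix time in seconds
        "counter": value & 0xFFFFFF,                        # lowest 24 bits
    }


# Hypothetical inputs, chosen only to show the round trip.
value = compose_uuid_short(server_id=3, startup_time=1402658400, counter=5)
print(decompose_uuid_short(value))

# A server id above 255 loses its upper bits: 259 and 3 look the same.
print(decompose_uuid_short(compose_uuid_short(259, 1402658400, 5))["server_id_low_byte"])  # 3
```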
8 comments:
Please do not use UUIDs of the form you have described as primary keys of InnoDB tables unless you don't care about performance.
The problem is that this UUID form places a rapidly changing part of the timestamp at the leftmost positions in the primary key. InnoDB orders data by primary key. What this UUID will do is cause the working set size to become spread all over the whole of the table, making it as large as possible. That's a great way to destroy server performance by making the buffer pool caching as inefficient as possible.
If you must use a UUID-like value, place at least some of the date parts at the start so that you keep the working set size to a tolerable level. Just moving around a few bits will take care of this.
UUID-like values have some performance advantages because a client can know the PK in advance and that's very useful in some cases, compared to having to ask the server what autoincrementing value was generated. Before 5.5 and even in some cases in 5.5 it's possible for autoincrement value generation to become a bottleneck at high insert rates. The 5.5 changes are described in part at http://dev.mysql.com/doc/refman/5.5/en/innodb-auto-increment-handling.html .
If the UUID is generated at the client that may disclose information about the client, typically a web server. Since a MAC address reveals the network equipment vendor it might be useful to use fake MAC addresses on the network cards when it's desirable for security reasons to conceal possible equipment vulnerabilities. Same for the server MAC address.
James Day, MySQL Senior Principal Support Engineer, Oracle
James,
thanks for your insightful comment.
I knew about the problem UUID pks pose to innodb tables, but I couldn't quite express it as eloquently as you did. So, thanks for pointing that out!
I personally have never felt the need to use UUIDs, but when I heard the angle about leakiness it just seemed like a fun thing to write about and so I did.
I'm wondering about your remark that the client can know the UUID in advance: surely that scenario does not come into play here? At least, it seems to me that if the clients generate their own GUIDs, there is no way to guarantee uniqueness, unless the clients are somehow connected and coordinate the generation of the GUIDs?
All real MAC addresses are guaranteed to be globally unique, so that ensures that each client will be using a different series of values if they use a UID form that includes a real MAC address. NIC vendors are allocated unique vendor IDs and add the rest of the uniqueness when making the card so that no two cards that they make will have the same value.
The rest of the parts of this UID version then try to ensure that even UIDs on the same computer are likely to be unique, by incorporating a fairly fine-grained timestamp, 100ns.
GUIDs and UUIDs only offer high probability uniqueness, not guaranteed. If a client doesn't handle the collision case the client application has a bug.
If a client application wants a guarantee then it can use a service process on the same computer to guarantee that multiple instances on the same box will not collide.
Of course here we're writing mainly about the version 1 UUID form that includes the MAC address. There are a range of alternative forms. Most of those have the same adverse performance effect if used with a database but some are designed to do better.
James Day
James, all clear now. Thank you, much appreciated!
We've encountered some of these same problems, particularly with using UUIDs as primary keys in a sharded environment, and wrote the following to mitigate some of the issues: https://github.com/groupon/locality-uuid.java. I hope it's useful!
- Peter Bakkum
Peter Bakkum,
very cool! Thanks for commenting :) It is much appreciated!
Your query drops sub second information from the timestamp because it uses "div" instead of "/".
Here is a fixed query:
SET @uuid = uuid();
SET @TS = CONV(CONCAT(SUBSTR(@uuid, 16, 3), SUBSTR(@uuid, 10, 4), SUBSTR(@uuid, 1, 8)), 16, 10) / 10000000 - (141427 * 24 * 60 * 60);
SELECT @uuid `UUID()`, SUBSTR(@uuid, 25) `MAC`, @TS `TS`, FROM_UNIXTIME(@TS);