Friday, May 12, 2017

Do we still need to talk about Data Vault 2.0 Hash keys?

A few days ago, I ran into the article "Hash Keys In The Data Vault", published recently (2017-04-28) on the the Scalefree Company blog. Scalefree is a company, founded by Dan Linstedt and Michael Olschminke. Linstedt is the inventor of Data Vault, which is a method to model and implement enterprise data warehouses.

The article focuses on the use of hash-functions in Data Vault data warehousing. To be precise, it explains how Data Vault 2.0 differs from Data Vault 1.0 by using hash-functions rather than sequences to generate surrogate key values for business keys. In addition, hash-functions are suggested as a tool to detect change of non-business key attributes to track how their values change over time.

Abstract

First I will analyze and comment on the Scalefree article and DV 2.0, and explain a number of tenets of DV thinking along the way. Critical comments will be made about using hash values as primary keys in DV 2.0, and the apparent lack of progress made in DV thinking regarding this matter. A discussion follows about the birthday problem and how it relates to DV 2.0 usage of hash functions. Using the Square approximation method it will be demonstrated how we can make informed and accurate decisions about the risk of a collision with regard to the data volume and choice of hash function. Different scenarios regarding the impact of a collision on data integrity will then be explored. Finally, a practical proposal is made to detect hash collisions and to prevent them from introducing data integrity violations into our data warehouse.

A Summary of the Scalefree article

I encourage you to first read the original article. My summary is here below.
  1. The business key is what the business users use to identify a business object.
  2. To identify business objects in the data warehouse, we should use a surrogate key that fits in one column instead of the (possibly composite) business key.
  3. The reason to use a single-column surrogate key is that business keys can be large (many bytes per key field, multiple key fields) which makes them slow - in particular for join operations.
  4. In DV 1.0, sequences are used to generate values for surrogate keys.
  5. Using sequences to generate surrogate key values implies a loading process consisting of at least 2 phases that have to be executed in order.

    The previous point requires some context regarding the architecture of the Data Vault model. Without pretending to completely represent all tenets of DV, I believe the explanation below is fair and complete enough to grasp the point:

    DV stores the data that makes up a business object in 3 types of tables:

    • The business key and its corresponding surrogate key are stored in hub-tables.
    • The change history of descriptive attributes are stored in satellite-tables
    • Relationships between business objects are stored in link-tables.

    So, each distinct type of business object corresponds to one hub-table, and at least one, but possibly multiple satellite-tables. The satellite-tables refer to their respective hub-table via the surrogate key. Likewise, link-tables maintain relationships between multiple business objects by storing a combination of surrogate keys referring to the hub-tables that participate in the relationship.

    This means that for a single new business object, any of its satellite-tables can be loaded no sooner than when it is known which surrogate key value belongs to its business key. Since the new sequence value is drawn when loading the new business key into its designated hub-table, this means that the new business object must be loaded into its hub-table prior to loading its satellite-tables. Likewise, any relationships that the business object may have with other business objects can be loaded into link-tables only after *all* business objects that are related through the link have been loaded into their respective hub-tables.

    In practice this means that in DV 1.0, you'd first have to load all hub-tables, drawing new surrogate key values from the sequences as you encounter new business keys. Only in a subsequent phase would you be able to load the satellite- and link-tables (possibly in parallel), using the business key to look up the previously generated surrogate key stored in the hub-tables.

  6. In DV 2.0, hash-functions are used to generate values for surrogate keys.
  7. When using hash-functions to generate surrogate key values, hub-, satellite- and link-tables can all be loaded in parallel.

    The previous point needs some explanation.

    There is no strict, formal definition of a hash-function. Rather, there are a number of aspects that many hash-functions share and on which DV 2.0 relies:
    • Deterministic: a given set of input values will always yield the same output value. In a DV context, deterministic just means one business key maps to exactly one surrogate key.

      To some extent, a solution based on a sequence as generator of surrogate keys also appears to be deterministic, at least within the confines of one physical system.
    • Stateless: While a solution based on a sequence appears to be deterministic, its values are generated by incrementing the previous value, and by "remembering" the mapping from business key to its corresponding surrogate key by storing them together in the hub. Remembering the mapping by storing it in the hub is what makes it deterministic, because only then do we have the ability to look up the surrogate key based on the business key (and vice versa). But if we'd have two separate but otherwise identical systems, and load the same set of business objects into both, but each in a different order, then the mapping from business key to surrogate key will be different, depending on which object was loaded first in that particular system, because the one loaded earlier will draw a lower sequence number.
    • Fixed output size: DV expects a chosen hash-function to always return a fixed-length value output. (And typically, the length of the hash key should be smaller, ideally, much smaller than its corresponding business key)
    • Uniformity: This means that inputs of the hash-function yields results that are very evenly divided across the output domain. If that is the case, the chance that different inputs (i.e., two different business keys) yield the same output value is very small. The phenomenon where two calls to a hash-function, using different arguments, yield the same result value is called a collision. We will have much to say about collisions later on in this article.

    Basically the idea is that a hash-function can be used to generate a surrogate key value that is typically much smaller (and thus, "faster", in particular for join operations) than its corresponding business key, and it can do so without an actual lookup to the set of existing surrogate key values. Ideally a hash-function calculates its output based soley based on its input, and nothing else.

    It is in this latter aspect that makes it different from a sequence, which generates values that have no relationship at all with the values that make up business key. In the case of a sequence, the relationship between surrogate key and business key is maintained by storing it in the hub, and looking it up from there when it is needed.

  8. When using hash-functions there's a risk of collision. A collision occurs when two different inputs to the hash-function generate identical output.
  9. The risk of collision is very small. In a database with more than one trillion hash-values, the probability that you will get a collision is like the odds of a meteor landing on your data center.
  10. The MD5 hash-function is recommended for DV 2.0. As compared to other popular hash-functions (like MD6, SHA1 etc), MD5 storage requirements are relatively modest (128 bits), while the chance of a hash-collision is 'decently low'. It's also almost ubiquitously available across platforms and databases.

Comments

Many tenets of Data Vault thinking make good sense:
  • The focus on business keys, rather than blindly using the primary key of the source systems feeding into the EDW, ensures that the result will serve the needs of the business.
  • While the use of surrogate keys in transactional systems is still often a source of debate, this practice is not controversial in data warehousing.

    That said, I think the technical reasons for introducing a surrogate key as mentioned by the article (make keys smaller and more wieldy to improve join performance) are not as important as the functional requirement of any data warehouse to integrate data from multiple datasources, and ensuring identity there is resilient to change of the source systems.

  • There's a lot to be said in favor of maintaining attribute history in satellite-tables. In particular, the ability to have multiple satellite-tables is appealing, since it allows you to create and maintain groups of attributes that somehow belong together, and maintain them as a unit. For example, grouping attributes based on their rate of change, or according to whether they tend to appear together in queries sounds like a very useful thing.
  • Link-tables also sound like a good idea. Decoupling direct links from the business objects and maintaining them in completely separate tables makes the EDW very resilient to changes in the data sources. In addition it allows you to add extra relationships that may not be directly present in the source systems but which make sense from the point of view of data integration and/or business intelligence.
  • Once we accept the benefits of satellite- and link-tables and surrogate keys, then we must also embrace hub-tables. The model wouldn't make any sense without it, since we need an integration point anyway to maintain the mapping between business and surrogate key (which would of course be the hub-table).
  • Once we accept the modeling entities of DV, we also need to accept any constraints that surrogate key generation may have on the loading process. Of course, a dependency itself is something we would like to avoid, but the fact that a known limitation is recognized and anticipated is a good thing.

Flashback a few years ago

When I read the Scalefree article, I experienced a bit of a flashback that took me back some two, three years to this question ("Hash Key Collisions") by Charles Choi on the DataVault discussions group on linkedin. In case you cannot access linkedin, I'm reprinting the question below:
According to Dan's paper "DV2.0 and Hash Keys", the chances of having a hash key collision are nearly nonexistent (1/2^128). But I still worry.

What if a collision actually does occur that results in the business making a catastrophic decision? Can we really say we have 100% confidence in the "system of fact" when we choose to accept the risk of a collision? What is our course of action from a technology point of view if a collision did actually occur?
The paper "DV2.0 and Hash Keys" that Charles refers to is publicly available though not freely. (You can purchase it via Amazon.) Here's the relevant statement from the article:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128)) which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

In reality, you would have to produce 6 billion new business keys per second per hub for 100 years to reach a 50% chance of getting a collision. Not very likely to happen in our lifetime.
A similar question was asked by Ray OBrien in response to a post by Linstedt (see: #datavault 2.0 hashes versus natural keys):
collisions are real and MD5 not a good choice. but generally the smaller the input key domain and the larger the Hash output size, the less chance of collision, BUT it is always there.. so I would like to see some comments on the verification steps needed and cost to load of Collision ManagementI. If Integrity of data is important, then this is important.
In Linstedt's answer, you can find a similar expression of this probability, which goes:
because we split business keys across multiple hubs, the chances of collision (even with MD5) are next to none.

Yes, they do exist – but you would have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place.

Objections

Regardless of the merits of DV (which I believe I stated fairly in the previous section) I have a few doubts and objections on the writing and thinking by DV 2.0 advocates I have observed so far; all around the topic of hash-key collisions. My objections are:
  • The wording around the probability of hash-collisions is not helpful to understand the risk. As such, it does not help to decide whether to use DV 2.0, let alone choose a suitable hash-function for a concrete use-case.
  • The actual numbers regarding probability of hash-collisions are stated flat-out wrong on more than just one occasion.
  • The probability doesn't matter if you can't afford to lose any data
  • DV 2.0 does not discuss the consequences of a hash-collision, and no concrete advice is given on how to detect hash-collisions, let alone handle them.
I believe the questions by Charles Choi and Ray OBrien show that I am not alone. At the time they voiced their doubts, DV 2.0 was relatively new and I can understand that maybe at the time these tenets of DV 2.0 would still need to mature.

A couple of years have passed since, and after reading the article on the scalefree company blog, I am sad to observe that, apparently, no progress has been made in DV 2.0 thinking. At least, if such progress has been made, the scalefree company blog article doesn't seem to offer any new views on the matter. Instead, it comes up with an - equally unhelpful - restatement of the probability of hash-collision by comparing it to the chance of being hit by a meteor.

In the remainder of the article I will explain my objections and attempt to offer some thoughts that may help advance these matters.

The DV 2.0 wording around the probability of hash-collisions is not helpful

First, let's try and analyze the wording around probabilities, and what message it conveys.

Distracting Rhetorics

What does it really mean when someone says that "the probability [...] is like the odds of a meteor landing on your data center"? What does it mean, really, when someone says that the probability is "1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456"?

Well obviously, they are saying the chance is very small. Maybe it's just me, but I also sense a level of rhetoric in the wording that seems to be intended to dwarf he reader with Big Serious Numbers.

It's almost as if they're saying: this won't happen, so you shouldn't worry. You're not worrying about meteors hitting your data center all the time, so why worry about a hash-collision? Right?

You can observe that the rhetoric is working too, just look at how Charles Choi voiced his question: "According to Dan (Linstedt) [...] the chances [...] are nearly nonexistent (1/2^128). But I still worry.".

It's as if Charles is apologizing in advance for worrying.

Probabilities are not absolute

There is a more fundamental problem with the probability wording of the previous 2 examples, and that's that they project probability as an absolute.

To be fair, if you read them in context, then you'll notice that both statements are about MD5 collisions. Obviously, this matters, since not all hash-functions have equal probability of collisions. For hash-functions that have a fixed-length output, the chance surely has to have a relationship with the length of the output, since that puts a hard limitation on the number of unique values it could possibly encode.

However, apart from the output length of the hash-function and the algorithm it uses, there is at least one other factor which determines the probability, and that is your data volume.

Intuitively, this is easy to understand: if you have an empty set, the probability of the first hash-value causing a collision is exactly zero, since there is nothing to collide with. At the other extreme end of possibilities, if the set contains as many items as the total number of unique values the hash-function is capable of generating, then the probability of a collision is exactly one, since the entire keyspace has been "used up" already. In between these extremes, we have a growing number of existing entries that could collide with a new entry, so the probability increases from zero to one as the actual number of items (i.e, the number of rows in the hub - the data volume) increases.

While this observation seems trivial, it is important to mention it in this discussion because the aspect of volume is, for some reason, often not touched at all by DV 2.0 advocates. It's a mystery why that should be so, because if we would know the relationship between probability of a collision, maximum possible number of unique values, and the maximum volume of data, then we could reason about these variables sensibly. Like:
  • Given the maximum risk that I am willing to take to lose data due to a collision, what is the maximum volume I can store if I use a 128-bit hash-function?
  • Given the maximum risk of collision that I am willing to take, and given the maximum number of rows I need to store, what would be the mimumum output length of the hash-function I should look for?
  • Given my current data volume, and my current choice of hash-function, what is the risk I am running now of losing data due to collision?

The Birthday Problem

Interestingly, Linstedt does provide one statement that at least takes the data volume into account: "you'd have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place". Let's see what that means exactly.

Apart from the probability, this statement includes the other two variables: 128 bits keylength of MD5; a data volume of 6 bio rows per second for a 100 years. But if you look at the probability (50%) you'll notice how completely useless this wording is, that is, apart from its rhetorical power. Who in their right mind is interested in a system that can only accept half of the data you're trying to store in it? How can you possibly apply this piece of knowledge in any practical sense to a system you have to actually build?

I spend a while thinking about how this statement could have come about, and I have to come to believe that it is actually a restatement of the classical birthday paradox:
The birthday paradox, also known as the birthday problem, states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.
(As compared to Linstedt's statement, the 50% probability stays the same; the 6 billion rows per second for a 100 years is equivalent to the number of people gathered, and the number of possible unique values in the key corresponds to the number of days in a year.)

Whether or not the original birthday problem statement is actually what made Linstedt's word his statement like he did, I think it's clear that a 50% probability of a collision has no practical bearing on building any kind of database. To me, it just sounds like more rhetoric to convince that hash-collisions are really rare.

Probabilities are stated flat out wrong

The discussion of the birthday problem brings us to an actual method to calculate the probability as a function of data volume and key length. The wikipedia article I linked to provides an exact method, as well as many useful approximations, which are much easier to calculate. The birthday problem wikipedia article explains it far better than I ever could do, and also provides this really useful rule-of-the-thumb called the square approximation:
A good rule of thumb which can be used for mental calculation is the relation

p(n) ≈n2
/
2m

which can also be written as

n ≈ (2m * p(n))

which works well for probabilities less than or equal to 0.5
with
n
the actual size of the keyset - i.e, the number of rows you need to fit into the hub
p(n)
the probability of a collision given n.
m
the theoretical maximum size of the keyset, i.e. the maximum number of unique values that your hash-function can encode.
Given a fixed length hash-function, such as MD5, then m can be calculated by raising it to the power of the keylength (expressed as the number of bits):


m = 2 ^ bitlength (or 2bitlength)

(In case this needs explaining: if your key would be just one bit long, then it can old only two values - 0 or 1 or 21. If the key would be 2 bits long, it could hold 2 * 2, or 22 = 4 unique values. With 3 bits, 2 * 2 * 2 or 23 = 8 and so on)

From the discussion above as well as the previous section, we can now conclude that at least one statement of the probability:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128))
is simply flat-out wrong, since it does not take the data volume into account. Rather, since 2^128 is the number of possible unique values that MD5 can cover, 1 / 2 ^ 128 is the chance that the second row you put into your hub will collide with the first one.

So how big or small are the odds really of running into a MD5 collision in case we're handling a volume of 6 billion rows per second for a 100 years? Using square approximation, we get:

p(n) ≈(6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days * 100 years)2 ≈ 0.526
/
2 * (2 ^ 128 bits in a MD5 hash-value)


which is in fact closer to 53%. To figure out how many years we would need to insert 6 billion rows per second to achieve the 50% chance of running into a collision, we can use the second form of the formula:

n ≈ (2 * (2 ^ 128 bits in MD5 hash-value) * 50% probability) ≈ 1.84467e+19 rows in the hub

Dividing by 6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days will give you 97 and slightly less than a half year.

The point of doing these calculations here is obviously not to prove Linstedt wrong by showing you'd already arrive at a 50% after only 97 years and some instead of after a 100. Nor is it to determine that after 100 years, the chance is actually closer to 53%. Besides the fact that I'm using an approximation, neither makes any sense anyway, because the probability of 50% is already way, way beyond any definition of a working system.

The point I am trying to make is that it is perfectly possible to reason about large numbers and to clearly and transparently demonstrate how they are calculated. Using square approximation, you have a tool to calculate the value of the third variable once you have the value of the other two, allowing you to reason about it from three different angles.

I think we can all agree that's a much better position than getting stumped by Really Seriously Big Numbers.

Probabilities add up for each keyset

So far, we've just looked at the probability for encountering a collision while loading a single hub. But the probability increases as you have more hubs.

Intuitively this is clear because each hubs could encounter a collision independently. So, the chance of suffering a collision in either one of them grows as you have more hubs to maintain, and is quite a bit more than the chance of suffering a collision in just one particular hub.

If we'd like to compute the chance of getting a collision in at least one hub we can apply the following calculation:
1 - ((1 - Ph1) * (1 - Ph2) * ... * (1 - PhN))
with:
Ph1
Probability of a collision in the first hub
Ph2
Probability of a collision in the second hub
...
PhX
Probability of a collision in the last hub
The rationale behind this is that if the probibility of a collision is P(n), then the chance of not having a collision is 1 - P(n). To calculate the chance of not having a collision in any of the hubs, we have to multiply the individual chances of not having a collision in one particular hub with each other. If that number is the chance of not having a collision in any of the hubs, then all remaining probability must mean there is a collision in one or more hubs. So the chance of having at least one collision is obtained by substracting the probability from having no collision at all from 1.

So the chance that someone ever will encounter a collision could be quite a bit larger than you'd expect if you're focussing on just one hub.

No matter how slight a Probability, I can't afford to lose data

Now we arrive at a more fundamental objection regarding the matter of using hash-values as keys in your database.

If you re-read Charles question, you'd notice that he is politely explaining that, although he understands and appreciates that a MD5 hash-collision maybe really rare, he simply doesn't ever want to lose any data because of it. Ray OBrien raises the exact same point, even mentioning data integrity as the reason why he cares.

When this issue is put forward in DV 2.0 disucssions, it usually means the end of any meaningful discussion. The answers that DV 2.0 advocates usually give in response typically match one of the following:

"Look, do you realize how rare a hash-collision is?"
Yes thank you. You just stumped me with some Really Seriously Big Numbers, and I get it. Super rare. I just don't want to lose data though.
"You don't have to use 128-bits MD5, you can use a hash-function that returns larger values, like 160-bits SHA1. Collisions will then be, you know, even more rare."
Perfect thanks. Did I mention I can't afford to lose data?
"We use 2 hash-functions as key and it works for us."
Ah, I get it now. You made collisions Super-duper-rare, how clever. So will you never lose data now?
"I have built hundreds of Terabyte-sized data warehouses, and I never encountered a hash-collision."
Well let me guess. Might that be because they are very rare?
"Teradata is using hashing to solve MPP data distribution, and Hadoop uses hashing in HDFS. If it works for them, then why wouldn't it work in DV 2.0"
So you're saying Teradata and Hadoop use hashing for some purpose, and DV2.0 is using hashing for a completely different purpose, and now you want me to explain why it works in one use-case but not for a completely different use-case? That's...interesting.

How about: Teradata and Hadoop are not using hash values as primary keys, and DV 2.0 is?
"Look, why are you so worried about hash-collisions? You're not worrying all the time about a meteor hitting your data center, are you?"
Actually, I do. That's why our database is geographically distributed across data centers.
"Ha! Gotcha now. What about two meteors? Wouldn't that be comparable to using two hashes?"
I suppose it would. Difference is, I can't help meteors falling on my data centers. But I can choose to stick to sequences instead of hash-functions.
"But I just explained, sequences are a bottleneck and prevent you from parallel loading your EDW!"
Yes I heard. And I asked my customer: they are pretty sure about how they feel regarding the possibility of losing data, and they clearly told me they'd rather wait around for the data to be loaded as compared to being able to report super-quickly on wrong data.
"We use hash-keys to store tweets for sentiment analysis, and our results are pretty accurate, even if we lose data sometimes."
I'm sure you are- good for you! But we manage a monetary transaction log and we feel that the risk of losing one $1,000,000,000 transaction just doesn't justify loading 1,000,000,000 transactions worth $1 super-fast in a parallel fashion. Silly us eh?


And that's really all there is to it: Probabilities don't mean a thing if you're really sure you don't want to lose any data. When it happens, it is no comfort that you were the one to have had such extraordinarily bad luck experiencing it.

Another thing people may overlook is that the probability also doesn't tell you when it will happen. The only real guarantee you have is that inserting the first key in an empty hub will always succeed. But already the 2nd row might collide. It probably won't, and the odds are really slim. But it might. If you're sure you don't want that, then don't use hash values as keys. It's really that simple.

Sure, there might be other risks that could make us lose data. For example, the probability of disk corruption might be larger than that of a hash-collision. But it doesn't follow that we should be setting ourselves a trap if we can avoid it, especially if you know how to avoid it.

We cannot control disaster like disk corruption or meteor impact. If we could though, we would! Whether to use hash keys or to stick to sequences is a conscious choice, so let's be sure we make it based on information and requirements, and rather not based on some analogy that is chosen with the express purpose of making you feel a little bit ridiculous for being so averse to taking a risk.

If you're building a database for someone else, and you're considering to use hashes as keys for your data, then be prepared to ask you customer: "How many data can you afford to lose?" or "In the event that we cannot load some data due to hash-collisions, how much time and effort can we spend to take it offline so we can fix it?"

Another way to think about it is this: Suppose you would, in fact, lose data because of a collision. Then how comfortable are you to admit to your customer that you constructed a solution, that, by design, could end up giving you wrong results, while there was an alternative that guarantees correctness, at least to the extent of things you can control? And suppose you would get wrong results, did you anticipate just how wrong those results could be?

I truly feel that considerations like this are not on the database/data warehousing professional - they are on the customer. It's their data. Please, respect that.

What if we do have a collision?

I get that there are use cases where you might want to accept the small risk of a collision. But you cannot really, truly make that assessment if you haven't considered and anticipated it as if it is a real event actually hitting you. I think DV 2.0 falls short in nourishing healthy discussion regarding the anticipation of such events.

So, what will happen if you have your hash-keys in place, and you encounter a collision? We can try and anticipate a few concrete scenarios.

First of all, will you even detect a collision when it happens? The Scalefree article has a few flow diagrams showing how raw data is staged, and then loaded into a hub. In that flow, the row is dropped when the hash already exists. So the question now is, why did the hash already exist?

Obviously, it's possible that we already loaded this business object, and we're merely seeing it again. In that case, we're fine and we'll simply be loading newer data into the satellites and links for that business object. But it's also possible that this is an entirely different business object that happened to yield the exact same hash-value as that of a different business object loaded earlier. In other words, you now have a collision. For the hub, it will pass by unnoticed, but the satellites and links that point to that business object will now store data pertaining to more than one distinct business objects.

So in this case, we're not losing data, but compromising the integrity of the business object that arrived earlier. Our database integrity is now violated and our queries will return wrong results. You won't know though, because you didn't attempt to detect a collision. To your data vault, the distinction between multiple different business objects has ceased to exist.

I don't know about you but this does not feel like a happy place to me - especially if this an inevitable and avoidable consequence of the design. (And a whole bunch of bad luck of course).

Alternatively, we build our solution in such a way that we can at least detect collisions. Once we detect it, we can maybe prevent loading associated data for the satellite-tables and link-tables for the colliding business object. This means we will be completely ignoring the later arriving business object, as if it isn't there.

We have now lost data. That is not a good thing, but at least this allows the earlier arriving business object to maintain its integrity. Our query results won't be wrong, they will just be incomplete. To me this is a slightly happier place, but the fact that it's a matter of fate which one of the objects made it into the database, and which one was rejected still makes me feel that the solution has failed.

But suppose we do want to go for that treatment (after of course getting confirmation from the business that this is really what they want) - how can we implement it? Well, at the very least, the load process for either the hub or the staging area would need to compare the hash value as well as the business key. Only if both are equal can we consider the objects equal.

Making the comparison is not hard but it will of course be slower than only calculating the hash, because in this case you need a lookup on the hub, just like you did with the DV 1.0 solution based on sequences.

But what if we detect the collision in this way? How can we then use this information to prevent loading the associated satellite- and link-table data?

Introducing a collision table

We could store collisions in a special collision table. We'd get one such table for each hub. The collision table would have the same layout as the hub table to which it corresponds, and you'd use it to store the colliding hash, as well as the business key for which the collision was met. The key of the collision table would have to be made up of both the hash key as well as the business key, so that we can handle multiple collisions for the same hash key.

Once the collision table is in place, and the process for loading the hub-tables is modified to detect and store collisions, we can think about the process for loading the satellite- and link-tables. I can see two options
  • Have the load process for satellite- and link-tables do a lookup to the collision table to see if you need to discard data for business objects with collisions
  • Check collision tables after the load and run a clean-up process to restore integrity after detecting new collisions.

Collision table lookup

This solution relies on the process that loads the satellite- and link-tables to make a lookup to the collision table, using both the calculated hash and the business key. If collisions really are as rare as they should be, then that lookup should be really fast, because the collision tables will be pretty much empty.

If all goes well, then the lookup will fail. This means we have no collision and we can proceed loading the satellite- and link-tables. In the rare event that the lookup succeeds there is apparently a hash-collision, and we must not load the satellite- and link-tables to prevent violating the integrity of our data. You are now in a position to either discard the data, or to store it someplace else in case you have a clever idea of reconciling the data later on.

However, we now have reintroduced the constraint on the loading process, because we now rely on the process that loads the hubs to also detect and store collisions in the collision table.

So, with this solution, we lose the ability to load hub-tables in parallel with satellite- and link-tables. What may be of some comfort in comparison with the DV 1.0 sequence based solution is that the collision table lookup will be much faster than a hub-lookup to find the value generated by a sequence, because the collision table will be pretty much empty. So, the burden of the constraint and loading dependency should be much lighter than in the case of a DV 1.0 sequence based solution.

Another important drawback is that if your solution spans multiple systems, you need to maintain one set of collision tables somewhere, and all loading processes will be dependent upon them. In other words, the solution is not stateless anymore.

Clean-up after load in case of new collissions

Alternatively, we keep loading hub-, satellite- and link-tables in parallel, and we check the collision tables after each load to see if the last load introduced any new collisions. If we find that it did, we need to perform clean-up after the load.

The way clean-up would work is as follows: our load process should have been logged and our satellite- and link-tables should have metadata identifying the load process that put its contents in the data warehouse. The load identifier would also be stored in the collision table. Using that information, we can identify which new collision our last load introduced. We now have the load identifier as well as the hash key of the new collision, and we can then use that to delete all satellite- and link-table rows that have the load identifier of our latest load, as well as the hash key of the colliding business object.

After clean-up, we have restored data integrity for those business objects that encountered a collision, up to the point prior to the last load.

Now, we know that our last load brought us a business object that had a hash collision with some business object that was already in our data warehouse, and we made the conscious decision to reject that data for now, or maybe store it somplace else untill we know how to reconcile it. But the last load might also have brought us data that actually belonged to the business object that already existed in our data warehouse. We would really like to reload that part of the data for the existing business object. Our clean-up process had no way of distinguishing between satellite- and link- data for the one or the other business object, because it only knows about their colliding hash keys. In other words, our clean-up process might have removed data that actually did belong to the already existing business object, and now we need to put that back.

The solution would be to have an alternative load process especially for this -hopefully exceptional- case.

The alternative load process would be similar to the collision table lookup solution described above. It would only load satellite- and link-tables, and it would include a lookup to the collision table. If the lookup fails, we're dealing with data belonging to the existing business object and we can load it. If the lookup succeeds then this is data that belongs to a new business object that caused a collision and we should discard it or store someplace else for later reconciliation.

If things go the way they should, then the clean-up-and-reload process should occur seldom. And if we need to run it, it would probably be quite fast since it deals with only a few business objects - typically only one.

The only drawback now is that our data warehouse lived through a short period where integrity was compromised during the load process. But at least, we can repair integrity for all existing business objects, and selectively discard only data for those business objects that suffer from hash collion with an already existing business object.

While this approach still relies on keeping collision tables around, we regain our ability to load hub-, satellite- and link-tables in parallel. We can even do loads spanning multiple systems; we just need to take care to clean those up as well in the case we do encounter a collision.

Other solutions?

I don't want't to pretend this is an exhaustive list - I'm hoping there are more options and I just can't think of them right now.

How to load the colliding business objects?

What these scenario's do not solve is loading the later arriving business object. We only managed to prevent these objects from entering our data warehouse, but have not found a solution to load that data as well.

And, We can't - not unless we change the key.

On the other hand, changing the key might just be doable: you could decide to try another hash-function for at least that hub. You would need to update all satellites and links pointing to that hub (and of course, the hub itself) and rehash every row that points to it.

Of course, since you're still relying on hash-keys, just - hopefully - larger ones, you haven't solved the problem, you've just increased the odds. And you might even run into new collision while you're doing the rehashing operation. But that's just the life you've chosen. At least we now have something that resembles dealing with the problem rather than praying it won't happen.

I guess my main point here is - make sure every stakeholder is actually accepting the risk and make sure the procedures for dealing with a collision are specified and tested. The fact that the chance you'll need it may be next to neligible is not a license to pretend you do not need to be prepared to do these tasks. If it is decided that you'll be taking the risk, then actually do take the risk, and take it seriously.

Can't we have our cake and eat it too?

Isn't there some way we can benefit from hashes and still, magically immunize ourselves against hash-collisions? It turns out we can.

To recap:
  • DV, like other data warehousing methods - suggests using surrogate keys, because business keys are largish and unwieldy, and slow down join operations.
  • DV 2.0 suggests using hash-keys to avoid sequences, which make it impossible to load hubs, satellites and links all in parallel, and which slow down the load due to a lookup in a large hub.
This maybe a longshot, but it has as advantage the guarantee that it will work. As should be amply clear from the discussions in this article, hash-keys always have the chance of a collision, and it doesn't matter how small the risk is if you already know you do not want to accept losing data or giving up integrity. So, what is clear is that if that is the requirement, you cannot use hashes as keys. Period.

But that does not mean we cannot benefit from hashes.

Especially if the hash-key is small in comparison with the business key, then we could build up our keys as a composite of the hash-code, followed by the field or fields that make up the business key. If our database uses B-tree indexing, then any joins will be able to resolve the join in the very vast majority of cases only based on the hash-key column, which should be the first column in the key definition. You would still keep using the fields of the business key in your join conditions, to ensure that, in case of a hash-collision, the query will still return the correct result. Since the collisions are so rare, the database will end up with very few rows after it has resolved the join over the first field in the key, so the overhead of such a large key should be minimal.

Of course, this solution does not help in cutting down the amount of data you need to store - that will be much more in this scenario since you keep dragging the business keys literally everywhere: every satellite and every link table that points to this hub will inherit the column for the hash, as well as all columns that make up the business key. But if it helps - it should just add to the storage requirements, and not that much in terms of join processing.

The benefits of the -relatively- fast join will break down though when the database already uses hash-joins. In those cases, it will not be able to use the first column of the keys as prefix.

Finally

I hope you enjoyed this article. I am super curious to hear from DV practitioners if they have scenarios we can learn from. Drop a line in the comments! I'm looking forward to it.

14 comments:

Unknown said...

Great article Roland, congratulations!

... on top of the discussion around probability we should also keep in mind that MD5 is not random. It's a mapping of one large space of possibilities to a smaller one. That means that depending on your data peculiarities you could very well have massive amounts of hash collisions. MD5 tries to avoid it and in fact in general it's best practice to pick the right hashing algorithm based on your data. For a generic joining key generator I would suggest that it is a major problem to pre-choose MD5.

Tom Breur said...

Brilliant article, Roland! I love the fact that you nailed the math, and turned the discussion back squarely where it belongs: the customer. In all fairness, I think you would then ALSO educate your customer that the benefit of using the hash function imho isn't so much the lookup performance for the Hub, as it is that you can insert straight into the Satellite, instead of having to perform an additional join via the Hub.

As to your question: I utterly enjoyed reading this!

Hans Michiels said...

Hi Roland, what an excellent post! I really enjoyed it.
All you say makes a lot of sense.
To pick out some things:
a) If a collision occurs, you want to know. Silently colliding keys is just not acceptable (at least for data that must be 100% correct, like financial transactions; for “people also bought” lists shown on a web shop you might want to accept collisions as the business impact would be small)
b) If a collision occurs you need a plan to bring back the Data Vault in an integer state.
c) If a collision occurs you need a plan to change the hashing algorithm.
d) Later arriving business objects need special attention in this area.

And I would like to add:

e) Distributed systems need special attention.

My thoughts on c, d and e.

c) If a collision occurs you need a plan to change the hashing algorithm.
Supposed that your ELT code is designed to detect hash collisions, what can you do if one occurs? Obviously changing the hashing algorithm to a more secure version (for instance MD5 to SHA1) is an option. However then you have to do massive reengineering of your Data Vault, because a SHA1 value needs 20 bytes (40 when converted to CHAR) which does not fit in the CHAR(32) columns used for MD5. I would say there are at least three other options.

c1) Why not make a list of “Hash collision mitigating alternatives” for computing the hash key and store in metadata for each hub if an alternative was used? (this metadata will not be a massive list and could be replicated to distributed platforms as well).
Alternatives can be: hash a “rearranged” business key, for instance, when the business key is 1234:
Alternative 1: 1234|1234 (concatenate the value twice)
Alternative 2: 1234|1234|1234 (concatenate the value three times)
Alternative 3: 1|1234|4 (first character, entire business key, last character)
Alternative 4: 4321 (reverse the value)
Alternative 5: 4321|1234 (reverse + concat)
Alternative 6: 1234|4321 (concat + reverse)
Etcetera, use your imagination, but keep in mind the technical possibilities and performance of platforms (scripting or expression languages) where hashes need to be calculated.
Then rehash the entire hub (still using MD5) and all links where the Hashkey is also used with an alternative until the collision is gone.

c2) Alternatively, but only if you accept data loss, you could assign a business key that causes a collision to be the victim, and leave its data out of the Data Vault. But then this should be stored somewhere (e.g. a “collisions” hub, with three columns: target hub, hash key and business key, besides technical audit columns), and for every business key used to compute a hash key in every hub, satellite and link you would need to consult this “collisions” hub, to check if the business key is in it.

c3) Elaborating further on the second option, if you create a “collisions” hub, you could also extend it with a ReplacingHashkey column! As the Hashkey column is a CHAR(32) column, you could put anything in it, not necessarily a hash value. It would even be wise to make it recognizable as a replacing (hash) key, for instance by using the original colliding hash key, in which some character(s) is/are replaced by a character outside the character set used for hash values.
E.g. original hash key:
52328012bc595d26df28b8bdee2fae7f
Replacing hash key:
52328012bc595d26df28b8bdee2fae!!
As the chance on collisions is very small, picking a value manually when this occurs, would be an acceptable practice I think. And as the replacing hash key (which is not a real hash) can only collide within the “collisions” hub itself, it is easy to check that the replacing hash key is not a duplicate itself.

(to be continued)

Hans Michiels said...

(continued)
d) Later arriving business objects need special attention in this area.
If the ELT code is designed to detect collisions when adding new rows to a hub, why not add a hub row when the business key is first seen, so in this case when “fact” rows (sorry for the dimensional jargon) are earlier than the describing context? If loading the hub is solid, a collision will be detected.

e) Distributed systems need special attention.
Without central management of hash keys, distributed systems (relational, Hadoop or NoSQL) will still have a risk on collision. If you are facing this situation, you’ll have to make a decision to accept this (small) risk or not. Alternatives are
- store the original business key permanently along with the hashkey (but this weakens the advantage of using hashkeys) or
- store the original business key temporarily along with the hashkey and mark the hashkey as “unconfirmed” until it is confirmed to be non-colliding by checking it against the central hash keys store (yes eventually central management would IMHO be the only way to 100% guarantee that you can at least detect collisions after they have occurred).

rpbouman said...

Hi @Matt, thank you!

Also, very good point about MD5. When I did some background reading on hash functions, I learned about the concept of Continuity https://en.wikipedia.org/wiki/Hash_function#Continuity

It seems to me that in many cases this is something you might want when using hash functions as keys.

That said, I don't see any way around the fundamental problem that a hash-value is simply a poor choice as key if data integrity is important

rpbouman said...

Hi @Tom Breur!

thanks for the kind words - glad you appreciate the article. Regarding informing the customer: I agree, I just felt that there was already quite enough information about the benefits of the approach. I was just hoping to bring back some balance.

rpbouman said...

@Hans Michiels,

thank you very much for your elaborate comments. I noticed the comments only later on, and in the mean while I wrote some more explaination of my thought around dealing with collisions. Indeed, I too arrive at the concept of a collision table, which will basically be similar to a hub, but with hash key + business key as primary key.

I too have this concept of choosing a victim to at least keep integrity for existing objects.

Any possible solutions for re-engineering the DV to accomodate rejected data, I'm not sure yet.

I see that rehashing could work, but it pains me that it doesn't actually solve the fundamental problem. That said I do see it could be a viable solution in many cases.

I don't mind the overhaul too much - it shouldn't happen often, and as long as it is clear that this a limitation of the architecture, and processes are in place to actually perform such an operation, I guess this is simply what we have to accept.

X said...

Educational, thank you for taking the time to explain so thoroughly and accessibly!

Unknown said...

Hi Roland, great article and a joy to read. I totally agree that we have to take our customers serious if they have any doubt or reasons to worry around their data. And that does not mean that we tell them not to worry because of big numbers and maybe small odds as you rightfully wrote down.
Next to the collision part there is another thing to consider before using hash instead of SID's : How to ensure that there is a corresponding Business Key in the Hub for a relation or context? With a SID this is guaranteed due to the need to lookup first and then load, hence 2 synchronization points in the load (or 3 if you have LSat's for effectivity). If you loose the dependency and load all in parallel you might end up with context or relations without corresponding business key.
Again there might be people coming up with the same big numbers and comforting talks but we have to be clear that this is a risk also. It is a good thing to be aware of and make a calculated decision.

The motto is: think, think again and decide if you want to use hash or not!
Be aware of risks, take concerns serious and decide. Don't do anything because someone will tell you you have to or it does not fit the standards! Remember DataVault comes in many flavours and styles, pick the one or create the one which is the best fit, or look for an alternative ensemble like Focal Point, Anchor, 2G, etc.

Thanks again for your post Roland.

Unknown said...

Excellent analysis, Roland.

As for a fix, I think that's not too easy. There are a few issues I'd like to comment on though.

1) Hashing key size. I think that the second your business key is shorter than the hash size, you're better off just using the business key as "hash". So Hash(business key) = (business key). That way you have zero chance of collisions. With compound keys the function would be Hash(BK1, BK2, BK3) = BK1+BK2+BK3 where the + is a suitable character not occurring in the BKs themselves. That only leaves us with BKs that are larger than a hash.

2) Your collision table looks a lot like the Key Satellites I've used on occassion. KeySats just store the alternative key(s) for a given BK. For instance, if I have surrogate keys in a source and want to retain the ability to map the BK to a source system surrogate key, I store the SKs in a key-satellite. Potentially, multiple SK's have the same BK (*cough*SAP SLcM*cough*). This seems to me the exact same scenario, for which solutions already exist. It does mean some additional work to fix things later on. For satellites this is trivial, detection and reporting on collisions is trivial as well, especially since this would be quite rare. The only problem is the link but as long as you maintain your existence satellite (yes, another type of satellite that's very useful) on a link, with auditing trail, you never lose anything and can easily reconstruct the intended link.

Rehashing the collision could be a problem, but for those cases we could just use a special bit (on the Hashkey that is used for collisions only. If we do get 3 collisions on different business keys, perhaps it is time to reconsider the hashing function or complain to the vendor of the hashing implementation.

All in all I agree that your analysis is correct, and hashing is not as "care-free" as it is presented sometimes. Then again, I usually build my solutions with KeySats and existence Sats in DV1.0 already, to mitigate the collisions between SK's on the BK's. Extending that solution to hashed keys seems fairly trivial. But if you have to retrofit it, it will be an unpleasant exercise.

rpbouman said...

Hi Ronald, thanks for your reply and for sharing your experience and insights! Much obliged :)

I have to read up a little on keysats and the other concepts you wrote about. It's good to hear there are multiple solutions. I wonder why some sort of hash key collision detection and recommendation to hancle them are not part of the DV 2.0 spec proper.

Best regards,

Roland.

Anindya Mitra said...

Excellent article Ronald. Your coverage on the potential problems is really helpful to prepare ourselves for unpleasant surprises in production.

One more area I would like to highlight here is Data Masking (De-identification) -
Sensitive business keys such as customer no, account no, card no etc, when used instead of synthetic key may give rise to other issues. Data masking/De-identification which tries to UPDATE the primary key columns is catastrophic on data platforms especially MPP systems. "Create Table as Select" is better option.
In the same context, using business/natural key instead will cause migration/replication of the same in link & satellite tables too. Assuming these tables have significant high volume of data compared to Hub tables, the masking effort is significant.I would discount collision tables as data volume is negligible.
This is one downside of using raw business keys as the hash keys, of course if the organization cares about sensitive data. Using synthetic keys ensures sensitive data is restricted to Hubs only thus reducing the effort manifold.

Thanks

Unknown said...

Hi Roland,
Thanks for the clear, well-thought, well-organized article! Just read it.

I agree that the use of DV2 HashKeys is an area that requires more analysis and possibly complementary or even replacing solutions as discussed above with some concrete examples.

My own view on this is that this is a design item and needs to be further evaluated per implementation case. I believe the use of DV2 may well be justified in some implementations, thus accepting the very low risk of collision and claiming its benefits, while may definitely necessitate a fully robust solution in some critical implementation contexts (the above example of the risk of losing just one $1,000,000,000 transaction clearly explains this)

Similarly, I do think that just the use of surrogate keys may well be justified in some implementations too, where the lack of the capability of parallel loading of all hubs/satellite/links may not be a serious issue, considering the requirements, special data volumes, infrastructure, load speeds etc. Thus better not go blindly with a hash-key implementation without evaluating.

In my opinion, another item that need to be considered when using hash-keys: the B-tree indexes used may be more skewed than when the sequence based surrogate keys are used and this may have a performance affect too..

rpbouman said...

Thanks Ali Aytekin Keskin, your comment and thoughts are much appreciated.

Indeed, we have to be practical and I do think there is a lot to be said in favor of DV in general, including DV 2.0. May qualms are mainly that these issues seem to be always belittled and avoided by a certain circle of DV practitioners and I just wanted to get that out of my system.

Your remark about B-Tree indexes is very on point - it was mentioned to me earlier by other people as well but it's worth repeating that this indeed offers another challenge for implementing DV, but on the implementation level.

Thanks, and best regards,

Roland.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...