Friday, May 12, 2017

Do we still need to talk about Data Vault 2.0 Hash keys?

A few days ago, I ran into the article "Hash Keys In The Data Vault", published recently (2017-04-28) on the the Scalefree Company blog. Scalefree is a company, founded by Dan Linstedt and Michael Olschminke. Linstedt is the inventor of Data Vault, which is a method to model and implement enterprise data warehouses.

The article focuses on the use of hash-functions in Data Vault data warehousing. To be precise, it explains how Data Vault 2.0 differs from Data Vault 1.0 by using hash-functions rather than sequences to generate surrogate key values for business keys. In addition, hash-functions are suggested as a tool to detect change of non-business key attributes to track how their values change over time.

Abstract

First I will analyze and comment on the Scalefree article and DV 2.0, and explain a number of tenets of DV thinking along the way. Critical comments will be made about using hash values as primary keys in DV 2.0, and the apparent lack of progress made in DV thinking regarding this matter. A discussion follows about the birthday problem and how it relates to DV 2.0 usage of hash functions. Using the Square approximation method it will be demonstrated how we can make informed and accurate decisions about the risk of a collision with regard to the data volume and choice of hash function. Different scenarios regarding the impact of a collision on data integrity will then be explored. Finally, a practical proposal is made to detect hash collisions and to prevent them from introducing data integrity violations into our data warehouse.

A Summary of the Scalefree article

I encourage you to first read the original article. My summary is here below.
  1. The business key is what the business users use to identify a business object.
  2. To identify business objects in the data warehouse, we should use a surrogate key that fits in one column instead of the (possibly composite) business key.
  3. The reason to use a single-column surrogate key is that business keys can be large (many bytes per key field, multiple key fields) which makes them slow - in particular for join operations.
  4. In DV 1.0, sequences are used to generate values for surrogate keys.
  5. Using sequences to generate surrogate key values implies a loading process consisting of at least 2 phases that have to be executed in order.

    The previous point requires some context regarding the architecture of the Data Vault model. Without pretending to completely represent all tenets of DV, I believe the explanation below is fair and complete enough to grasp the point:

    DV stores the data that makes up a business object in 3 types of tables:

    • The business key and its corresponding surrogate key are stored in hub-tables.
    • The change history of descriptive attributes are stored in satellite-tables
    • Relationships between business objects are stored in link-tables.

    So, each distinct type of business object corresponds to one hub-table, and at least one, but possibly multiple satellite-tables. The satellite-tables refer to their respective hub-table via the surrogate key. Likewise, link-tables maintain relationships between multiple business objects by storing a combination of surrogate keys referring to the hub-tables that participate in the relationship.

    This means that for a single new business object, any of its satellite-tables can be loaded no sooner than when it is known which surrogate key value belongs to its business key. Since the new sequence value is drawn when loading the new business key into its designated hub-table, this means that the new business object must be loaded into its hub-table prior to loading its satellite-tables. Likewise, any relationships that the business object may have with other business objects can be loaded into link-tables only after *all* business objects that are related through the link have been loaded into their respective hub-tables.

    In practice this means that in DV 1.0, you'd first have to load all hub-tables, drawing new surrogate key values from the sequences as you encounter new business keys. Only in a subsequent phase would you be able to load the satellite- and link-tables (possibly in parallel), using the business key to look up the previously generated surrogate key stored in the hub-tables.

  6. In DV 2.0, hash-functions are used to generate values for surrogate keys.
  7. When using hash-functions to generate surrogate key values, hub-, satellite- and link-tables can all be loaded in parallel.

    The previous point needs some explanation.

    There is no strict, formal definition of a hash-function. Rather, there are a number of aspects that many hash-functions share and on which DV 2.0 relies:
    • Deterministic: a given set of input values will always yield the same output value. In a DV context, deterministic just means one business key maps to exactly one surrogate key.

      To some extent, a solution based on a sequence as generator of surrogate keys also appears to be deterministic, at least within the confines of one physical system.
    • Stateless: While a solution based on a sequence appears to be deterministic, its values are generated by incrementing the previous value, and by "remembering" the mapping from business key to its corresponding surrogate key by storing them together in the hub. Remembering the mapping by storing it in the hub is what makes it deterministic, because only then do we have the ability to look up the surrogate key based on the business key (and vice versa). But if we'd have two separate but otherwise identical systems, and load the same set of business objects into both, but each in a different order, then the mapping from business key to surrogate key will be different, depending on which object was loaded first in that particular system, because the one loaded earlier will draw a lower sequence number.
    • Fixed output size: DV expects a chosen hash-function to always return a fixed-length value output. (And typically, the length of the hash key should be smaller, ideally, much smaller than its corresponding business key)
    • Uniformity: This means that inputs of the hash-function yields results that are very evenly divided across the output domain. If that is the case, the chance that different inputs (i.e., two different business keys) yield the same output value is very small. The phenomenon where two calls to a hash-function, using different arguments, yield the same result value is called a collision. We will have much to say about collisions later on in this article.

    Basically the idea is that a hash-function can be used to generate a surrogate key value that is typically much smaller (and thus, "faster", in particular for join operations) than its corresponding business key, and it can do so without an actual lookup to the set of existing surrogate key values. Ideally a hash-function calculates its output based soley based on its input, and nothing else.

    It is in this latter aspect that makes it different from a sequence, which generates values that have no relationship at all with the values that make up business key. In the case of a sequence, the relationship between surrogate key and business key is maintained by storing it in the hub, and looking it up from there when it is needed.

  8. When using hash-functions there's a risk of collision. A collision occurs when two different inputs to the hash-function generate identical output.
  9. The risk of collision is very small. In a database with more than one trillion hash-values, the probability that you will get a collision is like the odds of a meteor landing on your data center.
  10. The MD5 hash-function is recommended for DV 2.0. As compared to other popular hash-functions (like MD6, SHA1 etc), MD5 storage requirements are relatively modest (128 bits), while the chance of a hash-collision is 'decently low'. It's also almost ubiquitously available across platforms and databases.

Comments

Many tenets of Data Vault thinking make good sense:
  • The focus on business keys, rather than blindly using the primary key of the source systems feeding into the EDW, ensures that the result will serve the needs of the business.
  • While the use of surrogate keys in transactional systems is still often a source of debate, this practice is not controversial in data warehousing.

    That said, I think the technical reasons for introducing a surrogate key as mentioned by the article (make keys smaller and more wieldy to improve join performance) are not as important as the functional requirement of any data warehouse to integrate data from multiple datasources, and ensuring identity there is resilient to change of the source systems.

  • There's a lot to be said in favor of maintaining attribute history in satellite-tables. In particular, the ability to have multiple satellite-tables is appealing, since it allows you to create and maintain groups of attributes that somehow belong together, and maintain them as a unit. For example, grouping attributes based on their rate of change, or according to whether they tend to appear together in queries sounds like a very useful thing.
  • Link-tables also sound like a good idea. Decoupling direct links from the business objects and maintaining them in completely separate tables makes the EDW very resilient to changes in the data sources. In addition it allows you to add extra relationships that may not be directly present in the source systems but which make sense from the point of view of data integration and/or business intelligence.
  • Once we accept the benefits of satellite- and link-tables and surrogate keys, then we must also embrace hub-tables. The model wouldn't make any sense without it, since we need an integration point anyway to maintain the mapping between business and surrogate key (which would of course be the hub-table).
  • Once we accept the modeling entities of DV, we also need to accept any constraints that surrogate key generation may have on the loading process. Of course, a dependency itself is something we would like to avoid, but the fact that a known limitation is recognized and anticipated is a good thing.

Flashback a few years ago

When I read the Scalefree article, I experienced a bit of a flashback that took me back some two, three years to this question ("Hash Key Collisions") by Charles Choi on the DataVault discussions group on linkedin. In case you cannot access linkedin, I'm reprinting the question below:
According to Dan's paper "DV2.0 and Hash Keys", the chances of having a hash key collision are nearly nonexistent (1/2^128). But I still worry.

What if a collision actually does occur that results in the business making a catastrophic decision? Can we really say we have 100% confidence in the "system of fact" when we choose to accept the risk of a collision? What is our course of action from a technology point of view if a collision did actually occur?
The paper "DV2.0 and Hash Keys" that Charles refers to is publicly available though not freely. (You can purchase it via Amazon.) Here's the relevant statement from the article:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128)) which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

In reality, you would have to produce 6 billion new business keys per second per hub for 100 years to reach a 50% chance of getting a collision. Not very likely to happen in our lifetime.
A similar question was asked by Ray OBrien in response to a post by Linstedt (see: #datavault 2.0 hashes versus natural keys):
collisions are real and MD5 not a good choice. but generally the smaller the input key domain and the larger the Hash output size, the less chance of collision, BUT it is always there.. so I would like to see some comments on the verification steps needed and cost to load of Collision ManagementI. If Integrity of data is important, then this is important.
In Linstedt's answer, you can find a similar expression of this probability, which goes:
because we split business keys across multiple hubs, the chances of collision (even with MD5) are next to none.

Yes, they do exist – but you would have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place.

Objections

Regardless of the merits of DV (which I believe I stated fairly in the previous section) I have a few doubts and objections on the writing and thinking by DV 2.0 advocates I have observed so far; all around the topic of hash-key collisions. My objections are:
  • The wording around the probability of hash-collisions is not helpful to understand the risk. As such, it does not help to decide whether to use DV 2.0, let alone choose a suitable hash-function for a concrete use-case.
  • The actual numbers regarding probability of hash-collisions are stated flat-out wrong on more than just one occasion.
  • The probability doesn't matter if you can't afford to lose any data
  • DV 2.0 does not discuss the consequences of a hash-collision, and no concrete advice is given on how to detect hash-collisions, let alone handle them.
I believe the questions by Charles Choi and Ray OBrien show that I am not alone. At the time they voiced their doubts, DV 2.0 was relatively new and I can understand that maybe at the time these tenets of DV 2.0 would still need to mature.

A couple of years have passed since, and after reading the article on the scalefree company blog, I am sad to observe that, apparently, no progress has been made in DV 2.0 thinking. At least, if such progress has been made, the scalefree company blog article doesn't seem to offer any new views on the matter. Instead, it comes up with an - equally unhelpful - restatement of the probability of hash-collision by comparing it to the chance of being hit by a meteor.

In the remainder of the article I will explain my objections and attempt to offer some thoughts that may help advance these matters.

The DV 2.0 wording around the probability of hash-collisions is not helpful

First, let's try and analyze the wording around probabilities, and what message it conveys.

Distracting Rhetorics

What does it really mean when someone says that "the probability [...] is like the odds of a meteor landing on your data center"? What does it mean, really, when someone says that the probability is "1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456"?

Well obviously, they are saying the chance is very small. Maybe it's just me, but I also sense a level of rhetoric in the wording that seems to be intended to dwarf he reader with Big Serious Numbers.

It's almost as if they're saying: this won't happen, so you shouldn't worry. You're not worrying about meteors hitting your data center all the time, so why worry about a hash-collision? Right?

You can observe that the rhetoric is working too, just look at how Charles Choi voiced his question: "According to Dan (Linstedt) [...] the chances [...] are nearly nonexistent (1/2^128). But I still worry.".

It's as if Charles is apologizing in advance for worrying.

Probabilities are not absolute

There is a more fundamental problem with the probability wording of the previous 2 examples, and that's that they project probability as an absolute.

To be fair, if you read them in context, then you'll notice that both statements are about MD5 collisions. Obviously, this matters, since not all hash-functions have equal probability of collisions. For hash-functions that have a fixed-length output, the chance surely has to have a relationship with the length of the output, since that puts a hard limitation on the number of unique values it could possibly encode.

However, apart from the output length of the hash-function and the algorithm it uses, there is at least one other factor which determines the probability, and that is your data volume.

Intuitively, this is easy to understand: if you have an empty set, the probability of the first hash-value causing a collision is exactly zero, since there is nothing to collide with. At the other extreme end of possibilities, if the set contains as many items as the total number of unique values the hash-function is capable of generating, then the probability of a collision is exactly one, since the entire keyspace has been "used up" already. In between these extremes, we have a growing number of existing entries that could collide with a new entry, so the probability increases from zero to one as the actual number of items (i.e, the number of rows in the hub - the data volume) increases.

While this observation seems trivial, it is important to mention it in this discussion because the aspect of volume is, for some reason, often not touched at all by DV 2.0 advocates. It's a mystery why that should be so, because if we would know the relationship between probability of a collision, maximum possible number of unique values, and the maximum volume of data, then we could reason about these variables sensibly. Like:
  • Given the maximum risk that I am willing to take to lose data due to a collision, what is the maximum volume I can store if I use a 128-bit hash-function?
  • Given the maximum risk of collision that I am willing to take, and given the maximum number of rows I need to store, what would be the mimumum output length of the hash-function I should look for?
  • Given my current data volume, and my current choice of hash-function, what is the risk I am running now of losing data due to collision?

The Birthday Problem

Interestingly, Linstedt does provide one statement that at least takes the data volume into account: "you'd have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place". Let's see what that means exactly.

Apart from the probability, this statement includes the other two variables: 128 bits keylength of MD5; a data volume of 6 bio rows per second for a 100 years. But if you look at the probability (50%) you'll notice how completely useless this wording is, that is, apart from its rhetorical power. Who in their right mind is interested in a system that can only accept half of the data you're trying to store in it? How can you possibly apply this piece of knowledge in any practical sense to a system you have to actually build?

I spend a while thinking about how this statement could have come about, and I have to come to believe that it is actually a restatement of the classical birthday paradox:
The birthday paradox, also known as the birthday problem, states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.
(As compared to Linstedt's statement, the 50% probability stays the same; the 6 billion rows per second for a 100 years is equivalent to the number of people gathered, and the number of possible unique values in the key corresponds to the number of days in a year.)

Whether or not the original birthday problem statement is actually what made Linstedt's word his statement like he did, I think it's clear that a 50% probability of a collision has no practical bearing on building any kind of database. To me, it just sounds like more rhetoric to convince that hash-collisions are really rare.

Probabilities are stated flat out wrong

The discussion of the birthday problem brings us to an actual method to calculate the probability as a function of data volume and key length. The wikipedia article I linked to provides an exact method, as well as many useful approximations, which are much easier to calculate. The birthday problem wikipedia article explains it far better than I ever could do, and also provides this really useful rule-of-the-thumb called the square approximation:
A good rule of thumb which can be used for mental calculation is the relation

p(n) ≈n2
/
2m

which can also be written as

n ≈ (2m * p(n))

which works well for probabilities less than or equal to 0.5
with
n
the actual size of the keyset - i.e, the number of rows you need to fit into the hub
p(n)
the probability of a collision given n.
m
the theoretical maximum size of the keyset, i.e. the maximum number of unique values that your hash-function can encode.
Given a fixed length hash-function, such as MD5, then m can be calculated by raising it to the power of the keylength (expressed as the number of bits):


m = 2 ^ bitlength (or 2bitlength)

(In case this needs explaining: if your key would be just one bit long, then it can old only two values - 0 or 1 or 21. If the key would be 2 bits long, it could hold 2 * 2, or 22 = 4 unique values. With 3 bits, 2 * 2 * 2 or 23 = 8 and so on)

From the discussion above as well as the previous section, we can now conclude that at least one statement of the probability:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128))
is simply flat-out wrong, since it does not take the data volume into account. Rather, since 2^128 is the number of possible unique values that MD5 can cover, 1 / 2 ^ 128 is the chance that the second row you put into your hub will collide with the first one.

So how big or small are the odds really of running into a MD5 collision in case we're handling a volume of 6 billion rows per second for a 100 years? Using square approximation, we get:

p(n) ≈(6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days * 100 years)2 ≈ 0.526
/
2 * (2 ^ 128 bits in a MD5 hash-value)


which is in fact closer to 53%. To figure out how many years we would need to insert 6 billion rows per second to achieve the 50% chance of running into a collision, we can use the second form of the formula:

n ≈ (2 * (2 ^ 128 bits in MD5 hash-value) * 50% probability) ≈ 1.84467e+19 rows in the hub

Dividing by 6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days will give you 97 and slightly less than a half year.

The point of doing these calculations here is obviously not to prove Linstedt wrong by showing you'd already arrive at a 50% after only 97 years and some instead of after a 100. Nor is it to determine that after 100 years, the chance is actually closer to 53%. Besides the fact that I'm using an approximation, neither makes any sense anyway, because the probability of 50% is already way, way beyond any definition of a working system.

The point I am trying to make is that it is perfectly possible to reason about large numbers and to clearly and transparently demonstrate how they are calculated. Using square approximation, you have a tool to calculate the value of the third variable once you have the value of the other two, allowing you to reason about it from three different angles.

I think we can all agree that's a much better position than getting stumped by Really Seriously Big Numbers.

Probabilities add up for each keyset

So far, we've just looked at the probability for encountering a collision while loading a single hub. But the probability increases as you have more hubs.

Intuitively this is clear because each hubs could encounter a collision independently. So, the chance of suffering a collision in either one of them grows as you have more hubs to maintain, and is quite a bit more than the chance of suffering a collision in just one particular hub.

If we'd like to compute the chance of getting a collision in at least one hub we can apply the following calculation:
1 - ((1 - Ph1) * (1 - Ph2) * ... * (1 - PhN))
with:
Ph1
Probability of a collision in the first hub
Ph2
Probability of a collision in the second hub
...
PhX
Probability of a collision in the last hub
The rationale behind this is that if the probibility of a collision is P(n), then the chance of not having a collision is 1 - P(n). To calculate the chance of not having a collision in any of the hubs, we have to multiply the individual chances of not having a collision in one particular hub with each other. If that number is the chance of not having a collision in any of the hubs, then all remaining probability must mean there is a collision in one or more hubs. So the chance of having at least one collision is obtained by substracting the probability from having no collision at all from 1.

So the chance that someone ever will encounter a collision could be quite a bit larger than you'd expect if you're focussing on just one hub.

No matter how slight a Probability, I can't afford to lose data

Now we arrive at a more fundamental objection regarding the matter of using hash-values as keys in your database.

If you re-read Charles question, you'd notice that he is politely explaining that, although he understands and appreciates that a MD5 hash-collision maybe really rare, he simply doesn't ever want to lose any data because of it. Ray OBrien raises the exact same point, even mentioning data integrity as the reason why he cares.

When this issue is put forward in DV 2.0 disucssions, it usually means the end of any meaningful discussion. The answers that DV 2.0 advocates usually give in response typically match one of the following:

"Look, do you realize how rare a hash-collision is?"
Yes thank you. You just stumped me with some Really Seriously Big Numbers, and I get it. Super rare. I just don't want to lose data though.
"You don't have to use 128-bits MD5, you can use a hash-function that returns larger values, like 160-bits SHA1. Collisions will then be, you know, even more rare."
Perfect thanks. Did I mention I can't afford to lose data?
"We use 2 hash-functions as key and it works for us."
Ah, I get it now. You made collisions Super-duper-rare, how clever. So will you never lose data now?
"I have built hundreds of Terabyte-sized data warehouses, and I never encountered a hash-collision."
Well let me guess. Might that be because they are very rare?
"Teradata is using hashing to solve MPP data distribution, and Hadoop uses hashing in HDFS. If it works for them, then why wouldn't it work in DV 2.0"
So you're saying Teradata and Hadoop use hashing for some purpose, and DV2.0 is using hashing for a completely different purpose, and now you want me to explain why it works in one use-case but not for a completely different use-case? That's...interesting.

How about: Teradata and Hadoop are not using hash values as primary keys, and DV 2.0 is?
"Look, why are you so worried about hash-collisions? You're not worrying all the time about a meteor hitting your data center, are you?"
Actually, I do. That's why our database is geographically distributed across data centers.
"Ha! Gotcha now. What about two meteors? Wouldn't that be comparable to using two hashes?"
I suppose it would. Difference is, I can't help meteors falling on my data centers. But I can choose to stick to sequences instead of hash-functions.
"But I just explained, sequences are a bottleneck and prevent you from parallel loading your EDW!"
Yes I heard. And I asked my customer: they are pretty sure about how they feel regarding the possibility of losing data, and they clearly told me they'd rather wait around for the data to be loaded as compared to being able to report super-quickly on wrong data.
"We use hash-keys to store tweets for sentiment analysis, and our results are pretty accurate, even if we lose data sometimes."
I'm sure you are- good for you! But we manage a monetary transaction log and we feel that the risk of losing one $1,000,000,000 transaction just doesn't justify loading 1,000,000,000 transactions worth $1 super-fast in a parallel fashion. Silly us eh?


And that's really all there is to it: Probabilities don't mean a thing if you're really sure you don't want to lose any data. When it happens, it is no comfort that you were the one to have had such extraordinarily bad luck experiencing it.

Another thing people may overlook is that the probability also doesn't tell you when it will happen. The only real guarantee you have is that inserting the first key in an empty hub will always succeed. But already the 2nd row might collide. It probably won't, and the odds are really slim. But it might. If you're sure you don't want that, then don't use hash values as keys. It's really that simple.

Sure, there might be other risks that could make us lose data. For example, the probability of disk corruption might be larger than that of a hash-collision. But it doesn't follow that we should be setting ourselves a trap if we can avoid it, especially if you know how to avoid it.

We cannot control disaster like disk corruption or meteor impact. If we could though, we would! Whether to use hash keys or to stick to sequences is a conscious choice, so let's be sure we make it based on information and requirements, and rather not based on some analogy that is chosen with the express purpose of making you feel a little bit ridiculous for being so averse to taking a risk.

If you're building a database for someone else, and you're considering to use hashes as keys for your data, then be prepared to ask you customer: "How many data can you afford to lose?" or "In the event that we cannot load some data due to hash-collisions, how much time and effort can we spend to take it offline so we can fix it?"

Another way to think about it is this: Suppose you would, in fact, lose data because of a collision. Then how comfortable are you to admit to your customer that you constructed a solution, that, by design, could end up giving you wrong results, while there was an alternative that guarantees correctness, at least to the extent of things you can control? And suppose you would get wrong results, did you anticipate just how wrong those results could be?

I truly feel that considerations like this are not on the database/data warehousing professional - they are on the customer. It's their data. Please, respect that.

What if we do have a collision?

I get that there are use cases where you might want to accept the small risk of a collision. But you cannot really, truly make that assessment if you haven't considered and anticipated it as if it is a real event actually hitting you. I think DV 2.0 falls short in nourishing healthy discussion regarding the anticipation of such events.

So, what will happen if you have your hash-keys in place, and you encounter a collision? We can try and anticipate a few concrete scenarios.

First of all, will you even detect a collision when it happens? The Scalefree article has a few flow diagrams showing how raw data is staged, and then loaded into a hub. In that flow, the row is dropped when the hash already exists. So the question now is, why did the hash already exist?

Obviously, it's possible that we already loaded this business object, and we're merely seeing it again. In that case, we're fine and we'll simply be loading newer data into the satellites and links for that business object. But it's also possible that this is an entirely different business object that happened to yield the exact same hash-value as that of a different business object loaded earlier. In other words, you now have a collision. For the hub, it will pass by unnoticed, but the satellites and links that point to that business object will now store data pertaining to more than one distinct business objects.

So in this case, we're not losing data, but compromising the integrity of the business object that arrived earlier. Our database integrity is now violated and our queries will return wrong results. You won't know though, because you didn't attempt to detect a collision. To your data vault, the distinction between multiple different business objects has ceased to exist.

I don't know about you but this does not feel like a happy place to me - especially if this an inevitable and avoidable consequence of the design. (And a whole bunch of bad luck of course).

Alternatively, we build our solution in such a way that we can at least detect collisions. Once we detect it, we can maybe prevent loading associated data for the satellite-tables and link-tables for the colliding business object. This means we will be completely ignoring the later arriving business object, as if it isn't there.

We have now lost data. That is not a good thing, but at least this allows the earlier arriving business object to maintain its integrity. Our query results won't be wrong, they will just be incomplete. To me this is a slightly happier place, but the fact that it's a matter of fate which one of the objects made it into the database, and which one was rejected still makes me feel that the solution has failed.

But suppose we do want to go for that treatment (after of course getting confirmation from the business that this is really what they want) - how can we implement it? Well, at the very least, the load process for either the hub or the staging area would need to compare the hash value as well as the business key. Only if both are equal can we consider the objects equal.

Making the comparison is not hard but it will of course be slower than only calculating the hash, because in this case you need a lookup on the hub, just like you did with the DV 1.0 solution based on sequences.

But what if we detect the collision in this way? How can we then use this information to prevent loading the associated satellite- and link-table data?

Introducing a collision table

We could store collisions in a special collision table. We'd get one such table for each hub. The collision table would have the same layout as the hub table to which it corresponds, and you'd use it to store the colliding hash, as well as the business key for which the collision was met. The key of the collision table would have to be made up of both the hash key as well as the business key, so that we can handle multiple collisions for the same hash key.

Once the collision table is in place, and the process for loading the hub-tables is modified to detect and store collisions, we can think about the process for loading the satellite- and link-tables. I can see two options
  • Have the load process for satellite- and link-tables do a lookup to the collision table to see if you need to discard data for business objects with collisions
  • Check collision tables after the load and run a clean-up process to restore integrity after detecting new collisions.

Collision table lookup

This solution relies on the process that loads the satellite- and link-tables to make a lookup to the collision table, using both the calculated hash and the business key. If collisions really are as rare as they should be, then that lookup should be really fast, because the collision tables will be pretty much empty.

If all goes well, then the lookup will fail. This means we have no collision and we can proceed loading the satellite- and link-tables. In the rare event that the lookup succeeds there is apparently a hash-collision, and we must not load the satellite- and link-tables to prevent violating the integrity of our data. You are now in a position to either discard the data, or to store it someplace else in case you have a clever idea of reconciling the data later on.

However, we now have reintroduced the constraint on the loading process, because we now rely on the process that loads the hubs to also detect and store collisions in the collision table.

So, with this solution, we lose the ability to load hub-tables in parallel with satellite- and link-tables. What may be of some comfort in comparison with the DV 1.0 sequence based solution is that the collision table lookup will be much faster than a hub-lookup to find the value generated by a sequence, because the collision table will be pretty much empty. So, the burden of the constraint and loading dependency should be much lighter than in the case of a DV 1.0 sequence based solution.

Another important drawback is that if your solution spans multiple systems, you need to maintain one set of collision tables somewhere, and all loading processes will be dependent upon them. In other words, the solution is not stateless anymore.

Clean-up after load in case of new collissions

Alternatively, we keep loading hub-, satellite- and link-tables in parallel, and we check the collision tables after each load to see if the last load introduced any new collisions. If we find that it did, we need to perform clean-up after the load.

The way clean-up would work is as follows: our load process should have been logged and our satellite- and link-tables should have metadata identifying the load process that put its contents in the data warehouse. The load identifier would also be stored in the collision table. Using that information, we can identify which new collision our last load introduced. We now have the load identifier as well as the hash key of the new collision, and we can then use that to delete all satellite- and link-table rows that have the load identifier of our latest load, as well as the hash key of the colliding business object.

After clean-up, we have restored data integrity for those business objects that encountered a collision, up to the point prior to the last load.

Now, we know that our last load brought us a business object that had a hash collision with some business object that was already in our data warehouse, and we made the conscious decision to reject that data for now, or maybe store it somplace else untill we know how to reconcile it. But the last load might also have brought us data that actually belonged to the business object that already existed in our data warehouse. We would really like to reload that part of the data for the existing business object. Our clean-up process had no way of distinguishing between satellite- and link- data for the one or the other business object, because it only knows about their colliding hash keys. In other words, our clean-up process might have removed data that actually did belong to the already existing business object, and now we need to put that back.

The solution would be to have an alternative load process especially for this -hopefully exceptional- case.

The alternative load process would be similar to the collision table lookup solution described above. It would only load satellite- and link-tables, and it would include a lookup to the collision table. If the lookup fails, we're dealing with data belonging to the existing business object and we can load it. If the lookup succeeds then this is data that belongs to a new business object that caused a collision and we should discard it or store someplace else for later reconciliation.

If things go the way they should, then the clean-up-and-reload process should occur seldom. And if we need to run it, it would probably be quite fast since it deals with only a few business objects - typically only one.

The only drawback now is that our data warehouse lived through a short period where integrity was compromised during the load process. But at least, we can repair integrity for all existing business objects, and selectively discard only data for those business objects that suffer from hash collion with an already existing business object.

While this approach still relies on keeping collision tables around, we regain our ability to load hub-, satellite- and link-tables in parallel. We can even do loads spanning multiple systems; we just need to take care to clean those up as well in the case we do encounter a collision.

Other solutions?

I don't want't to pretend this is an exhaustive list - I'm hoping there are more options and I just can't think of them right now.

How to load the colliding business objects?

What these scenario's do not solve is loading the later arriving business object. We only managed to prevent these objects from entering our data warehouse, but have not found a solution to load that data as well.

And, We can't - not unless we change the key.

On the other hand, changing the key might just be doable: you could decide to try another hash-function for at least that hub. You would need to update all satellites and links pointing to that hub (and of course, the hub itself) and rehash every row that points to it.

Of course, since you're still relying on hash-keys, just - hopefully - larger ones, you haven't solved the problem, you've just increased the odds. And you might even run into new collision while you're doing the rehashing operation. But that's just the life you've chosen. At least we now have something that resembles dealing with the problem rather than praying it won't happen.

I guess my main point here is - make sure every stakeholder is actually accepting the risk and make sure the procedures for dealing with a collision are specified and tested. The fact that the chance you'll need it may be next to neligible is not a license to pretend you do not need to be prepared to do these tasks. If it is decided that you'll be taking the risk, then actually do take the risk, and take it seriously.

Can't we have our cake and eat it too?

Isn't there some way we can benefit from hashes and still, magically immunize ourselves against hash-collisions? It turns out we can.

To recap:
  • DV, like other data warehousing methods - suggests using surrogate keys, because business keys are largish and unwieldy, and slow down join operations.
  • DV 2.0 suggests using hash-keys to avoid sequences, which make it impossible to load hubs, satellites and links all in parallel, and which slow down the load due to a lookup in a large hub.
This maybe a longshot, but it has as advantage the guarantee that it will work. As should be amply clear from the discussions in this article, hash-keys always have the chance of a collision, and it doesn't matter how small the risk is if you already know you do not want to accept losing data or giving up integrity. So, what is clear is that if that is the requirement, you cannot use hashes as keys. Period.

But that does not mean we cannot benefit from hashes.

Especially if the hash-key is small in comparison with the business key, then we could build up our keys as a composite of the hash-code, followed by the field or fields that make up the business key. If our database uses B-tree indexing, then any joins will be able to resolve the join in the very vast majority of cases only based on the hash-key column, which should be the first column in the key definition. You would still keep using the fields of the business key in your join conditions, to ensure that, in case of a hash-collision, the query will still return the correct result. Since the collisions are so rare, the database will end up with very few rows after it has resolved the join over the first field in the key, so the overhead of such a large key should be minimal.

Of course, this solution does not help in cutting down the amount of data you need to store - that will be much more in this scenario since you keep dragging the business keys literally everywhere: every satellite and every link table that points to this hub will inherit the column for the hash, as well as all columns that make up the business key. But if it helps - it should just add to the storage requirements, and not that much in terms of join processing.

The benefits of the -relatively- fast join will break down though when the database already uses hash-joins. In those cases, it will not be able to use the first column of the keys as prefix.

Finally

I hope you enjoyed this article. I am super curious to hear from DV practitioners if they have scenarios we can learn from. Drop a line in the comments! I'm looking forward to it.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...