Sunday, June 18, 2017

UI5: per-view Internationalization - What about inheritance?

I started to learn UI5 in earnest in september 2016. Quite soon after that I wrote two blog posts about per-view internationalization:

(In case you're wondering: per-view internationalization is the ability to maintain i18n.properties files, which contain translations for human readable texts together with the view code wherein these texts appear, rather maintaining just one giant i18.properties file that contains the translations for any and all internationalized texts that appear throughout the application. I think the benefits of this idea are clear from a developer's point of view. If you're interested in more information about this concept, then please check out the two blog posts I wrote prior)

Per-view i18n and inheritance

The reason for me to revisit the topic is that I found that the method I use to figure out which i18n.properties files to load, doesn't work very well when you're using inheritance. In UI5, inheritance is achieved by calling the extend-method on the constructor you want to inherit from.

To understand why it doesn't work well, we should first define the scope and the desired behavior.

In my case, I'm using inheritence to build controllers for views that are a lot like other existing views, but with some additional features. Most if not all of the behavior (and texts) of the existing view should be copied, and I'm managing that copy by extending the controller of the existing view. It's just that we need some extra things in the view, and these extra things might bring their own internationalization texts along.

So, what we really need is to include not only the i18n files for the extended controller, we also need to load any i18n files that might be created for the views managed by the superclasses of the extended controller. We need to mind the order too: the extended view might choose to override some of the texts defined by a superclass, so the i18n texts that are closer in the chain of inheritance should be given precedence.

A Per-view i18n solution that works with inheritance

The following code will solve this problem:
    _initI18n: function(){
      //first, check if we already constructed the i18n model for this class
      if (this.constructor._i18Model) {
        //we did! Don't do all that work again, just use the existing one.
        this.setModel(this.constructor._i18Model, i18n);
        return;
      }

      //check if the view and controller are in the same directory.
      //if they are not, then we need to take the possibility into account 
      //that the view and the controller might both have their own i18n files.
      var stack = [];
      var controllerClass = this.getMetadata();
      var viewName = this.getView().getViewName();
      if (controllerClass.getName() !== viewName) {
        //if the view name is different from controller name, 
        //then we assume the view may have its own i18n 
        //that overrides those of the controller 
        stack.push(viewName);
      }

      //walk the chain of inheritance up to sap.ui.core.mvc.Controller
      //store each superclass at the front of the stack
      var className, rootControllerClassName = "sap.ui.core.mvc.Controller";
      while (true) {
        className = controllerClass.getName();
        if (className === rootControllerClassName) {
          break;
        }
        stack.unshift(className);
        controllerClass = controllerClass.getParent();
      }

      //walk the stack and create a resourcebundle for each class
      //use it to enhance this class' i18n model.
      stack.forEach(function(className){
        var bundleData;
        if (window[className] && window[className]._i18Model) {
          bundleData = window[className]._i18Model.getResourceBundle();
        }
        else {
          className = className.split(".");
          //snip off the local classname to get the directory name
          className.pop();  
          //add i18n to make the i18n directory name 
          className.push(i18n);
          //add i18n to point to the i18n.properties file(s) 
          className.push(i18n);
          bundleData = {bundleName: className.join(".")};
        }
        
        var i18nModel = this.getModel(i18n);
        if (i18nModel) {
          i18nModel.enhance(bundleData);
        } 
        else {
          i18nModel = new ResourceModel(bundleData);
          this.setModel(i18nModel, i18n);
        }
      }.bind(this));        

      //cache the i18n model for new instances of this class.
      this.constructor._i18Model = this.getModel(i18n);
    }
Note that this code replaces the _initI18n-method that appeared in my prior blog posts on this topic. It is also assumed that this method sits in some abstract base controller, which you'll extend to create actual concrete controllers for your views.

Here are a couple of highlights that explain the new and improved _initI18N-method:
  • Examining the entire chain of inheritance, up until sap.ui.core.mvc.Controller. This is achieved with this snippet:
          var stack = [];
          ...
          var className, rootControllerClassName = "sap.ui.core.mvc.Controller";
          while (true) {
            className = controllerClass.getName();
            if (className === rootControllerClassName) {
              break;
            }
            stack.unshift(className);
            controllerClass = controllerClass.getParent();
          }
    
    The subsequenct forEach-iteration of the stack then constructs the resource bundle to enhance the i18n model in the usual way.

    Note that names of superclasses that are "higher up" in the hierarchy (or put another way: more basal) are stacked in front of subclasses. This way, the forEach-array method will encounter the class names in the desired order, allowing the subclasess to override texts added by superclasses.
  • Distinguish between texts defined by the view and the controller.

    Admittedly this scenario is quite rare, but if the controller and view each define their own i18n files, then we'd like to enhance our i18n model with files from both resourcebundles. I somehwat arbitrarily decided that in this case, the view's texts should probably override those of the controller.

    I achieve this with this piece of code:
          var stack = [];
          var controllerClass = this.getMetadata();
          var viewName = this.getView().getViewName();
          if (controllerClass.getName() !== viewName) {
            //if the view name is different from controller name, 
            //then we assume the view may have its own i18n 
            //that overrides those of the controller 
            stack.push(viewName);
          }
    
    In other words, the view name is placed as very last item of the stack; if it differs from the controller name, then the controller name appears directly before it, making the view resource bundle the last to enhance our i18n model.
  • Caching the i18n model at the class level, so that every instance may reuse it.

    While I fixed the inheritance issue, it occurred to me that all instances of the controller would go through their own cycle of building the i18n model. Since the i18nmodel deals almost only with static texts, it seemed wasteful to repeat all that work for each instance. We can simply store the i18n model as a property of the constructor, and retrieve it any time we're creating a new instance.

    This is achieved with the very first and last bits of the _initI18n-method:
          //first, check if we already constructed the i18n model for this class
          if (this.constructor._i18Model) {
            //we did! Don't do all that work again, just use the existing one.
            this.setModel(this.constructor._i18Model, i18n);
            return;
          }
          ...
          ...
          ...
          //cache the i18n model for new instances of this class.
          this.constructor._i18Model = this.getModel(i18n);
    

Finally

I hope you enjoyed this post. Let me know and drop a line!

Saturday, June 17, 2017

Team Just-BI wins 2nd Prize at Dutch Accountability Hack 2017!

To whom it may concern,

one week ago, Friday, 9th of June 2017 I was at the Dutch house of representatives ("de tweede kamer") to participate in the 2017 Accountability Hackathon.

The Accountability Hack Event

The event was organized and sponsored by a number of Dutch ministeries, the Court of Audit ("Algemene Rekenkamer"), the Central Agency for Statistics ("Centraal Bureau voor de Statistiek") and the Open State Foundation. Goal of the event was to invite programmers, developers, data analysts, journalists and so on to come together and create applications that use one or more of the numerous open data sources published by the Dutch government to create insights in the performance or the spending of Dutch governmental or publicly subsidized organisations.

This assignment alone might need some clarification. In Dutch democracy there has always been a push towards more transparency. But in the last decade in particular, there has been an increasing demand to provide this transparency by publishing openly accessible data sets. The idea is that publishing records and metadata contributes to an environment where civilians can answer any question about how their government is functioning themselves, by querying and combining their data. Now, obviously, not everybody is capable of working with raw data sets, so there is also a demand for tools, applications and people with know-how bridge the technical gap and truly making all this data available on a functional level.

This is where events like the Accountability Hackathon come in: it is a direct attempt to stimulate individuals, but also commercial companies to apply their expertise to create applications and tools that provide meaningful information and insights, based on open data.

Team Just-BI

I participated on behalf of my company Just-Business Intelligence. Just-BI provides end-to-end Business Intelligence consultancy. I'm in the custom development branch, which creates web and mobile applications in the realm of self-service and operational Business Intelligence.

Just BI has a policy of assigning consultants to billable projects for at most 80% of their working time; The remaining 20% is meant to be invested in knowledge development. We try to align agendas and meet each other every friday at our office in Rijswijk.

This arrangement made it possible for me to attend an event like the Accountability Hackathon - in fact, Just-BI stimulates its consultants to reach out and participate in events like these.

Submission: Jubilant

The Just-BI submission is a generic OData query and exploration tool called Jubilant (short for Just Business Intelligence Analysis Tool).



Jubilant is an Open UI5 web application that provides a plugin architecture that makes it easy for developers to write their own data visualisations based on OData services. Jubilant provides rich metadata about OData services, as well as a number of reusable components that make it easy to quickly build a query editor/designer.

The Jubilant concept allows a plugin developer to focus on making a cool visualisation, without having having to invest time and effort to provide the user with a query builder. During the hackathon I managed to create two plugins - one simple table visualisation, which simply renders raw data in a data grid, and a OData Metadata Graph visualiser, which plots the structure, entity types and relationships exposed by the OData service as a graph.

OData and Open Data

The connection to open data and the assignment for the Accountability Hack is that a number of key open data API's use the OData protocol. A good example is the Dutch Parliament API.

Interestingly, there are relatively few OData query tools available, and none of them are particularly affordable. In fact, during the accountability hackathon a few teams tried to work with these OData APIs and discovered they didn't quite know how to access and process them. I don't know if this finding influenced the jury in any way, but it certainly highlighted the need for a tool like Jubilant.

For Just-BI OData is a key protocol as well, since it happens to be the standard way of exposing data by many SAP products, like SAP/HANA. While Just-BI is a general end-to-end Business Intelligence shop, many of our customer engagements have a strong focus on SAP products. This is also reflected in the Open UI5 framework, which has rather good support for OData.

Result: 2nd Prize!

I was surprised, but obviously very happy to have been awarded the 2nd prize, which is good for 1.500 EUR. It is a honour and a privilege to be in a position to work on stuff I like and maybe contribute something to the transparency of the Dutch democracy. And, frankly, I just had a great time hacking!

Since this was basically just a working day for me, I decided to donate 500 EUR of the prize money to Just-Care, which is a charity supported by Just-BI.

Where to get Jubilant?

At Just-BI, we're currently working out the exact details around a release of Jubilant. I will write an update as soon as I can disclose more, but I can already say that all the work I did for the accountability hack will become available as Open Source Software in the very near future.

In the mean while, if you're interested in Jubilant and OData-based self-service BI, don't hesitate to contact me. Or contact Just-BI.

Friday, May 12, 2017

Do we still need to talk about Data Vault 2.0 Hash keys?

A few days ago, I ran into the article "Hash Keys In The Data Vault", published recently (2017-04-28) on the the Scalefree Company blog. Scalefree is a company, founded by Dan Linstedt and Michael Olschminke. Linstedt is the inventor of Data Vault, which is a method to model and implement enterprise data warehouses.

The article focuses on the use of hash-functions in Data Vault data warehousing. To be precise, it explains how Data Vault 2.0 differs from Data Vault 1.0 by using hash-functions rather than sequences to generate surrogate key values for business keys. In addition, hash-functions are suggested as a tool to detect change of non-business key attributes to track how their values change over time.

Abstract

First I will analyze and comment on the Scalefree article and DV 2.0, and explain a number of tenets of DV thinking along the way. Critical comments will be made about using hash values as primary keys in DV 2.0, and the apparent lack of progress made in DV thinking regarding this matter. A discussion follows about the birthday problem and how it relates to DV 2.0 usage of hash functions. Using the Square approximation method it will be demonstrated how we can make informed and accurate decisions about the risk of a collision with regard to the data volume and choice of hash function. Different scenarios regarding the impact of a collision on data integrity will then be explored. Finally, a practical proposal is made to detect hash collisions and to prevent them from introducing data integrity violations into our data warehouse.

A Summary of the Scalefree article

I encourage you to first read the original article. My summary is here below.
  1. The business key is what the business users use to identify a business object.
  2. To identify business objects in the data warehouse, we should use a surrogate key that fits in one column instead of the (possibly composite) business key.
  3. The reason to use a single-column surrogate key is that business keys can be large (many bytes per key field, multiple key fields) which makes them slow - in particular for join operations.
  4. In DV 1.0, sequences are used to generate values for surrogate keys.
  5. Using sequences to generate surrogate key values implies a loading process consisting of at least 2 phases that have to be executed in order.

    The previous point requires some context regarding the architecture of the Data Vault model. Without pretending to completely represent all tenets of DV, I believe the explanation below is fair and complete enough to grasp the point:

    DV stores the data that makes up a business object in 3 types of tables:

    • The business key and its corresponding surrogate key are stored in hub-tables.
    • The change history of descriptive attributes are stored in satellite-tables
    • Relationships between business objects are stored in link-tables.

    So, each distinct type of business object corresponds to one hub-table, and at least one, but possibly multiple satellite-tables. The satellite-tables refer to their respective hub-table via the surrogate key. Likewise, link-tables maintain relationships between multiple business objects by storing a combination of surrogate keys referring to the hub-tables that participate in the relationship.

    This means that for a single new business object, any of its satellite-tables can be loaded no sooner than when it is known which surrogate key value belongs to its business key. Since the new sequence value is drawn when loading the new business key into its designated hub-table, this means that the new business object must be loaded into its hub-table prior to loading its satellite-tables. Likewise, any relationships that the business object may have with other business objects can be loaded into link-tables only after *all* business objects that are related through the link have been loaded into their respective hub-tables.

    In practice this means that in DV 1.0, you'd first have to load all hub-tables, drawing new surrogate key values from the sequences as you encounter new business keys. Only in a subsequent phase would you be able to load the satellite- and link-tables (possibly in parallel), using the business key to look up the previously generated surrogate key stored in the hub-tables.

  6. In DV 2.0, hash-functions are used to generate values for surrogate keys.
  7. When using hash-functions to generate surrogate key values, hub-, satellite- and link-tables can all be loaded in parallel.

    The previous point needs some explanation.

    There is no strict, formal definition of a hash-function. Rather, there are a number of aspects that many hash-functions share and on which DV 2.0 relies:
    • Deterministic: a given set of input values will always yield the same output value. In a DV context, deterministic just means one business key maps to exactly one surrogate key.

      To some extent, a solution based on a sequence as generator of surrogate keys also appears to be deterministic, at least within the confines of one physical system.
    • Stateless: While a solution based on a sequence appears to be deterministic, its values are generated by incrementing the previous value, and by "remembering" the mapping from business key to its corresponding surrogate key by storing them together in the hub. Remembering the mapping by storing it in the hub is what makes it deterministic, because only then do we have the ability to look up the surrogate key based on the business key (and vice versa). But if we'd have two separate but otherwise identical systems, and load the same set of business objects into both, but each in a different order, then the mapping from business key to surrogate key will be different, depending on which object was loaded first in that particular system, because the one loaded earlier will draw a lower sequence number.
    • Fixed output size: DV expects a chosen hash-function to always return a fixed-length value output. (And typically, the length of the hash key should be smaller, ideally, much smaller than its corresponding business key)
    • Uniformity: This means that inputs of the hash-function yields results that are very evenly divided across the output domain. If that is the case, the chance that different inputs (i.e., two different business keys) yield the same output value is very small. The phenomenon where two calls to a hash-function, using different arguments, yield the same result value is called a collision. We will have much to say about collisions later on in this article.

    Basically the idea is that a hash-function can be used to generate a surrogate key value that is typically much smaller (and thus, "faster", in particular for join operations) than its corresponding business key, and it can do so without an actual lookup to the set of existing surrogate key values. Ideally a hash-function calculates its output based soley based on its input, and nothing else.

    It is in this latter aspect that makes it different from a sequence, which generates values that have no relationship at all with the values that make up business key. In the case of a sequence, the relationship between surrogate key and business key is maintained by storing it in the hub, and looking it up from there when it is needed.

  8. When using hash-functions there's a risk of collision. A collision occurs when two different inputs to the hash-function generate identical output.
  9. The risk of collision is very small. In a database with more than one trillion hash-values, the probability that you will get a collision is like the odds of a meteor landing on your data center.
  10. The MD5 hash-function is recommended for DV 2.0. As compared to other popular hash-functions (like MD6, SHA1 etc), MD5 storage requirements are relatively modest (128 bits), while the chance of a hash-collision is 'decently low'. It's also almost ubiquitously available across platforms and databases.

Comments

Many tenets of Data Vault thinking make good sense:
  • The focus on business keys, rather than blindly using the primary key of the source systems feeding into the EDW, ensures that the result will serve the needs of the business.
  • While the use of surrogate keys in transactional systems is still often a source of debate, this practice is not controversial in data warehousing.

    That said, I think the technical reasons for introducing a surrogate key as mentioned by the article (make keys smaller and more wieldy to improve join performance) are not as important as the functional requirement of any data warehouse to integrate data from multiple datasources, and ensuring identity there is resilient to change of the source systems.

  • There's a lot to be said in favor of maintaining attribute history in satellite-tables. In particular, the ability to have multiple satellite-tables is appealing, since it allows you to create and maintain groups of attributes that somehow belong together, and maintain them as a unit. For example, grouping attributes based on their rate of change, or according to whether they tend to appear together in queries sounds like a very useful thing.
  • Link-tables also sound like a good idea. Decoupling direct links from the business objects and maintaining them in completely separate tables makes the EDW very resilient to changes in the data sources. In addition it allows you to add extra relationships that may not be directly present in the source systems but which make sense from the point of view of data integration and/or business intelligence.
  • Once we accept the benefits of satellite- and link-tables and surrogate keys, then we must also embrace hub-tables. The model wouldn't make any sense without it, since we need an integration point anyway to maintain the mapping between business and surrogate key (which would of course be the hub-table).
  • Once we accept the modeling entities of DV, we also need to accept any constraints that surrogate key generation may have on the loading process. Of course, a dependency itself is something we would like to avoid, but the fact that a known limitation is recognized and anticipated is a good thing.

Flashback a few years ago

When I read the Scalefree article, I experienced a bit of a flashback that took me back some two, three years to this question ("Hash Key Collisions") by Charles Choi on the DataVault discussions group on linkedin. In case you cannot access linkedin, I'm reprinting the question below:
According to Dan's paper "DV2.0 and Hash Keys", the chances of having a hash key collision are nearly nonexistent (1/2^128). But I still worry.

What if a collision actually does occur that results in the business making a catastrophic decision? Can we really say we have 100% confidence in the "system of fact" when we choose to accept the risk of a collision? What is our course of action from a technology point of view if a collision did actually occur?
The paper "DV2.0 and Hash Keys" that Charles refers to is publicly available though not freely. (You can purchase it via Amazon.) Here's the relevant statement from the article:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128)) which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456.

In reality, you would have to produce 6 billion new business keys per second per hub for 100 years to reach a 50% chance of getting a collision. Not very likely to happen in our lifetime.
A similar question was asked by Ray OBrien in response to a post by Linstedt (see: #datavault 2.0 hashes versus natural keys):
collisions are real and MD5 not a good choice. but generally the smaller the input key domain and the larger the Hash output size, the less chance of collision, BUT it is always there.. so I would like to see some comments on the verification steps needed and cost to load of Collision ManagementI. If Integrity of data is important, then this is important.
In Linstedt's answer, you can find a similar expression of this probability, which goes:
because we split business keys across multiple hubs, the chances of collision (even with MD5) are next to none.

Yes, they do exist – but you would have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place.

Objections

Regardless of the merits of DV (which I believe I stated fairly in the previous section) I have a few doubts and objections on the writing and thinking by DV 2.0 advocates I have observed so far; all around the topic of hash-key collisions. My objections are:
  • The wording around the probability of hash-collisions is not helpful to understand the risk. As such, it does not help to decide whether to use DV 2.0, let alone choose a suitable hash-function for a concrete use-case.
  • The actual numbers regarding probability of hash-collisions are stated flat-out wrong on more than just one occasion.
  • The probability doesn't matter if you can't afford to lose any data
  • DV 2.0 does not discuss the consequences of a hash-collision, and no concrete advice is given on how to detect hash-collisions, let alone handle them.
I believe the questions by Charles Choi and Ray OBrien show that I am not alone. At the time they voiced their doubts, DV 2.0 was relatively new and I can understand that maybe at the time these tenets of DV 2.0 would still need to mature.

A couple of years have passed since, and after reading the article on the scalefree company blog, I am sad to observe that, apparently, no progress has been made in DV 2.0 thinking. At least, if such progress has been made, the scalefree company blog article doesn't seem to offer any new views on the matter. Instead, it comes up with an - equally unhelpful - restatement of the probability of hash-collision by comparing it to the chance of being hit by a meteor.

In the remainder of the article I will explain my objections and attempt to offer some thoughts that may help advance these matters.

The DV 2.0 wording around the probability of hash-collisions is not helpful

First, let's try and analyze the wording around probabilities, and what message it conveys.

Distracting Rhetorics

What does it really mean when someone says that "the probability [...] is like the odds of a meteor landing on your data center"? What does it mean, really, when someone says that the probability is "1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456"?

Well obviously, they are saying the chance is very small. Maybe it's just me, but I also sense a level of rhetoric in the wording that seems to be intended to dwarf he reader with Big Serious Numbers.

It's almost as if they're saying: this won't happen, so you shouldn't worry. You're not worrying about meteors hitting your data center all the time, so why worry about a hash-collision? Right?

You can observe that the rhetoric is working too, just look at how Charles Choi voiced his question: "According to Dan (Linstedt) [...] the chances [...] are nearly nonexistent (1/2^128). But I still worry.".

It's as if Charles is apologizing in advance for worrying.

Probabilities are not absolute

There is a more fundamental problem with the probability wording of the previous 2 examples, and that's that they project probability as an absolute.

To be fair, if you read them in context, then you'll notice that both statements are about MD5 collisions. Obviously, this matters, since not all hash-functions have equal probability of collisions. For hash-functions that have a fixed-length output, the chance surely has to have a relationship with the length of the output, since that puts a hard limitation on the number of unique values it could possibly encode.

However, apart from the output length of the hash-function and the algorithm it uses, there is at least one other factor which determines the probability, and that is your data volume.

Intuitively, this is easy to understand: if you have an empty set, the probability of the first hash-value causing a collision is exactly zero, since there is nothing to collide with. At the other extreme end of possibilities, if the set contains as many items as the total number of unique values the hash-function is capable of generating, then the probability of a collision is exactly one, since the entire keyspace has been "used up" already. In between these extremes, we have a growing number of existing entries that could collide with a new entry, so the probability increases from zero to one as the actual number of items (i.e, the number of rows in the hub - the data volume) increases.

While this observation seems trivial, it is important to mention it in this discussion because the aspect of volume is, for some reason, often not touched at all by DV 2.0 advocates. It's a mystery why that should be so, because if we would know the relationship between probability of a collision, maximum possible number of unique values, and the maximum volume of data, then we could reason about these variables sensibly. Like:
  • Given the maximum risk that I am willing to take to lose data due to a collision, what is the maximum volume I can store if I use a 128-bit hash-function?
  • Given the maximum risk of collision that I am willing to take, and given the maximum number of rows I need to store, what would be the mimumum output length of the hash-function I should look for?
  • Given my current data volume, and my current choice of hash-function, what is the risk I am running now of losing data due to collision?

The Birthday Problem

Interestingly, Linstedt does provide one statement that at least takes the data volume into account: "you'd have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place". Let's see what that means exactly.

Apart from the probability, this statement includes the other two variables: 128 bits keylength of MD5; a data volume of 6 bio rows per second for a 100 years. But if you look at the probability (50%) you'll notice how completely useless this wording is, that is, apart from its rhetorical power. Who in their right mind is interested in a system that can only accept half of the data you're trying to store in it? How can you possibly apply this piece of knowledge in any practical sense to a system you have to actually build?

I spend a while thinking about how this statement could have come about, and I have to come to believe that it is actually a restatement of the classical birthday paradox:
The birthday paradox, also known as the birthday problem, states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.
(As compared to Linstedt's statement, the 50% probability stays the same; the 6 billion rows per second for a 100 years is equivalent to the number of people gathered, and the number of possible unique values in the key corresponds to the number of days in a year.)

Whether or not the original birthday problem statement is actually what made Linstedt's word his statement like he did, I think it's clear that a 50% probability of a collision has no practical bearing on building any kind of database. To me, it just sounds like more rhetoric to convince that hash-collisions are really rare.

Probabilities are stated flat out wrong

The discussion of the birthday problem brings us to an actual method to calculate the probability as a function of data volume and key length. The wikipedia article I linked to provides an exact method, as well as many useful approximations, which are much easier to calculate. The birthday problem wikipedia article explains it far better than I ever could do, and also provides this really useful rule-of-the-thumb called the square approximation:
A good rule of thumb which can be used for mental calculation is the relation

p(n) ≈n2
/
2m

which can also be written as

n ≈ (2m * p(n))

which works well for probabilities less than or equal to 0.5
with
n
the actual size of the keyset - i.e, the number of rows you need to fit into the hub
p(n)
the probability of a collision given n.
m
the theoretical maximum size of the keyset, i.e. the maximum number of unique values that your hash-function can encode.
Given a fixed length hash-function, such as MD5, then m can be calculated by raising it to the power of the keylength (expressed as the number of bits):


m = 2 ^ bitlength (or 2bitlength)

(In case this needs explaining: if your key would be just one bit long, then it can old only two values - 0 or 1 or 21. If the key would be 2 bits long, it could hold 2 * 2, or 22 = 4 unique values. With 3 bits, 2 * 2 * 2 or 23 = 8 and so on)

From the discussion above as well as the previous section, we can now conclude that at least one statement of the probability:
The mathematical chances of a collision as a result of using MD5 are (1 / (2^128))
is simply flat-out wrong, since it does not take the data volume into account. Rather, since 2^128 is the number of possible unique values that MD5 can cover, 1 / 2 ^ 128 is the chance that the second row you put into your hub will collide with the first one.

So how big or small are the odds really of running into a MD5 collision in case we're handling a volume of 6 billion rows per second for a 100 years? Using square approximation, we get:

p(n) ≈(6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days * 100 years)2 ≈ 0.526
/
2 * (2 ^ 128 bits in a MD5 hash-value)


which is in fact closer to 53%. To figure out how many years we would need to insert 6 billion rows per second to achieve the 50% chance of running into a collision, we can use the second form of the formula:

n ≈ (2 * (2 ^ 128 bits in MD5 hash-value) * 50% probability) ≈ 1.84467e+19 rows in the hub

Dividing by 6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days will give you 97 and slightly less than a half year.

The point of doing these calculations here is obviously not to prove Linstedt wrong by showing you'd already arrive at a 50% after only 97 years and some instead of after a 100. Nor is it to determine that after 100 years, the chance is actually closer to 53%. Besides the fact that I'm using an approximation, neither makes any sense anyway, because the probability of 50% is already way, way beyond any definition of a working system.

The point I am trying to make is that it is perfectly possible to reason about large numbers and to clearly and transparently demonstrate how they are calculated. Using square approximation, you have a tool to calculate the value of the third variable once you have the value of the other two, allowing you to reason about it from three different angles.

I think we can all agree that's a much better position than getting stumped by Really Seriously Big Numbers.

Probabilities add up for each keyset

So far, we've just looked at the probability for encountering a collision while loading a single hub. But the probability increases as you have more hubs.

Intuitively this is clear because each hubs could encounter a collision independently. So, the chance of suffering a collision in either one of them grows as you have more hubs to maintain, and is quite a bit more than the chance of suffering a collision in just one particular hub.

If we'd like to compute the chance of getting a collision in at least one hub we can apply the following calculation:
1 - ((1 - Ph1) * (1 - Ph2) * ... * (1 - PhN))
with:
Ph1
Probability of a collision in the first hub
Ph2
Probability of a collision in the second hub
...
PhX
Probability of a collision in the last hub
The rationale behind this is that if the probibility of a collision is P(n), then the chance of not having a collision is 1 - P(n). To calculate the chance of not having a collision in any of the hubs, we have to multiply the individual chances of not having a collision in one particular hub with each other. If that number is the chance of not having a collision in any of the hubs, then all remaining probability must mean there is a collision in one or more hubs. So the chance of having at least one collision is obtained by substracting the probability from having no collision at all from 1.

So the chance that someone ever will encounter a collision could be quite a bit larger than you'd expect if you're focussing on just one hub.

No matter how slight a Probability, I can't afford to lose data

Now we arrive at a more fundamental objection regarding the matter of using hash-values as keys in your database.

If you re-read Charles question, you'd notice that he is politely explaining that, although he understands and appreciates that a MD5 hash-collision maybe really rare, he simply doesn't ever want to lose any data because of it. Ray OBrien raises the exact same point, even mentioning data integrity as the reason why he cares.

When this issue is put forward in DV 2.0 disucssions, it usually means the end of any meaningful discussion. The answers that DV 2.0 advocates usually give in response typically match one of the following:

"Look, do you realize how rare a hash-collision is?"
Yes thank you. You just stumped me with some Really Seriously Big Numbers, and I get it. Super rare. I just don't want to lose data though.
"You don't have to use 128-bits MD5, you can use a hash-function that returns larger values, like 160-bits SHA1. Collisions will then be, you know, even more rare."
Perfect thanks. Did I mention I can't afford to lose data?
"We use 2 hash-functions as key and it works for us."
Ah, I get it now. You made collisions Super-duper-rare, how clever. So will you never lose data now?
"I have built hundreds of Terabyte-sized data warehouses, and I never encountered a hash-collision."
Well let me guess. Might that be because they are very rare?
"Teradata is using hashing to solve MPP data distribution, and Hadoop uses hashing in HDFS. If it works for them, then why wouldn't it work in DV 2.0"
So you're saying Teradata and Hadoop use hashing for some purpose, and DV2.0 is using hashing for a completely different purpose, and now you want me to explain why it works in one use-case but not for a completely different use-case? That's...interesting.

How about: Teradata and Hadoop are not using hash values as primary keys, and DV 2.0 is?
"Look, why are you so worried about hash-collisions? You're not worrying all the time about a meteor hitting your data center, are you?"
Actually, I do. That's why our database is geographically distributed across data centers.
"Ha! Gotcha now. What about two meteors? Wouldn't that be comparable to using two hashes?"
I suppose it would. Difference is, I can't help meteors falling on my data centers. But I can choose to stick to sequences instead of hash-functions.
"But I just explained, sequences are a bottleneck and prevent you from parallel loading your EDW!"
Yes I heard. And I asked my customer: they are pretty sure about how they feel regarding the possibility of losing data, and they clearly told me they'd rather wait around for the data to be loaded as compared to being able to report super-quickly on wrong data.
"We use hash-keys to store tweets for sentiment analysis, and our results are pretty accurate, even if we lose data sometimes."
I'm sure you are- good for you! But we manage a monetary transaction log and we feel that the risk of losing one $1,000,000,000 transaction just doesn't justify loading 1,000,000,000 transactions worth $1 super-fast in a parallel fashion. Silly us eh?


And that's really all there is to it: Probabilities don't mean a thing if you're really sure you don't want to lose any data. When it happens, it is no comfort that you were the one to have had such extraordinarily bad luck experiencing it.

Another thing people may overlook is that the probability also doesn't tell you when it will happen. The only real guarantee you have is that inserting the first key in an empty hub will always succeed. But already the 2nd row might collide. It probably won't, and the odds are really slim. But it might. If you're sure you don't want that, then don't use hash values as keys. It's really that simple.

Sure, there might be other risks that could make us lose data. For example, the probability of disk corruption might be larger than that of a hash-collision. But it doesn't follow that we should be setting ourselves a trap if we can avoid it, especially if you know how to avoid it.

We cannot control disaster like disk corruption or meteor impact. If we could though, we would! Whether to use hash keys or to stick to sequences is a conscious choice, so let's be sure we make it based on information and requirements, and rather not based on some analogy that is chosen with the express purpose of making you feel a little bit ridiculous for being so averse to taking a risk.

If you're building a database for someone else, and you're considering to use hashes as keys for your data, then be prepared to ask you customer: "How many data can you afford to lose?" or "In the event that we cannot load some data due to hash-collisions, how much time and effort can we spend to take it offline so we can fix it?"

Another way to think about it is this: Suppose you would, in fact, lose data because of a collision. Then how comfortable are you to admit to your customer that you constructed a solution, that, by design, could end up giving you wrong results, while there was an alternative that guarantees correctness, at least to the extent of things you can control? And suppose you would get wrong results, did you anticipate just how wrong those results could be?

I truly feel that considerations like this are not on the database/data warehousing professional - they are on the customer. It's their data. Please, respect that.

What if we do have a collision?

I get that there are use cases where you might want to accept the small risk of a collision. But you cannot really, truly make that assessment if you haven't considered and anticipated it as if it is a real event actually hitting you. I think DV 2.0 falls short in nourishing healthy discussion regarding the anticipation of such events.

So, what will happen if you have your hash-keys in place, and you encounter a collision? We can try and anticipate a few concrete scenarios.

First of all, will you even detect a collision when it happens? The Scalefree article has a few flow diagrams showing how raw data is staged, and then loaded into a hub. In that flow, the row is dropped when the hash already exists. So the question now is, why did the hash already exist?

Obviously, it's possible that we already loaded this business object, and we're merely seeing it again. In that case, we're fine and we'll simply be loading newer data into the satellites and links for that business object. But it's also possible that this is an entirely different business object that happened to yield the exact same hash-value as that of a different business object loaded earlier. In other words, you now have a collision. For the hub, it will pass by unnoticed, but the satellites and links that point to that business object will now store data pertaining to more than one distinct business objects.

So in this case, we're not losing data, but compromising the integrity of the business object that arrived earlier. Our database integrity is now violated and our queries will return wrong results. You won't know though, because you didn't attempt to detect a collision. To your data vault, the distinction between multiple different business objects has ceased to exist.

I don't know about you but this does not feel like a happy place to me - especially if this an inevitable and avoidable consequence of the design. (And a whole bunch of bad luck of course).

Alternatively, we build our solution in such a way that we can at least detect collisions. Once we detect it, we can maybe prevent loading associated data for the satellite-tables and link-tables for the colliding business object. This means we will be completely ignoring the later arriving business object, as if it isn't there.

We have now lost data. That is not a good thing, but at least this allows the earlier arriving business object to maintain its integrity. Our query results won't be wrong, they will just be incomplete. To me this is a slightly happier place, but the fact that it's a matter of fate which one of the objects made it into the database, and which one was rejected still makes me feel that the solution has failed.

But suppose we do want to go for that treatment (after of course getting confirmation from the business that this is really what they want) - how can we implement it? Well, at the very least, the load process for either the hub or the staging area would need to compare the hash value as well as the business key. Only if both are equal can we consider the objects equal.

Making the comparison is not hard but it will of course be slower than only calculating the hash, because in this case you need a lookup on the hub, just like you did with the DV 1.0 solution based on sequences.

But what if we detect the collision in this way? How can we then use this information to prevent loading the associated satellite- and link-table data?

Introducing a collision table

We could store collisions in a special collision table. We'd get one such table for each hub. The collision table would have the same layout as the hub table to which it corresponds, and you'd use it to store the colliding hash, as well as the business key for which the collision was met. The key of the collision table would have to be made up of both the hash key as well as the business key, so that we can handle multiple collisions for the same hash key.

Once the collision table is in place, and the process for loading the hub-tables is modified to detect and store collisions, we can think about the process for loading the satellite- and link-tables. I can see two options
  • Have the load process for satellite- and link-tables do a lookup to the collision table to see if you need to discard data for business objects with collisions
  • Check collision tables after the load and run a clean-up process to restore integrity after detecting new collisions.

Collision table lookup

This solution relies on the process that loads the satellite- and link-tables to make a lookup to the collision table, using both the calculated hash and the business key. If collisions really are as rare as they should be, then that lookup should be really fast, because the collision tables will be pretty much empty.

If all goes well, then the lookup will fail. This means we have no collision and we can proceed loading the satellite- and link-tables. In the rare event that the lookup succeeds there is apparently a hash-collision, and we must not load the satellite- and link-tables to prevent violating the integrity of our data. You are now in a position to either discard the data, or to store it someplace else in case you have a clever idea of reconciling the data later on.

However, we now have reintroduced the constraint on the loading process, because we now rely on the process that loads the hubs to also detect and store collisions in the collision table.

So, with this solution, we lose the ability to load hub-tables in parallel with satellite- and link-tables. What may be of some comfort in comparison with the DV 1.0 sequence based solution is that the collision table lookup will be much faster than a hub-lookup to find the value generated by a sequence, because the collision table will be pretty much empty. So, the burden of the constraint and loading dependency should be much lighter than in the case of a DV 1.0 sequence based solution.

Another important drawback is that if your solution spans multiple systems, you need to maintain one set of collision tables somewhere, and all loading processes will be dependent upon them. In other words, the solution is not stateless anymore.

Clean-up after load in case of new collissions

Alternatively, we keep loading hub-, satellite- and link-tables in parallel, and we check the collision tables after each load to see if the last load introduced any new collisions. If we find that it did, we need to perform clean-up after the load.

The way clean-up would work is as follows: our load process should have been logged and our satellite- and link-tables should have metadata identifying the load process that put its contents in the data warehouse. The load identifier would also be stored in the collision table. Using that information, we can identify which new collision our last load introduced. We now have the load identifier as well as the hash key of the new collision, and we can then use that to delete all satellite- and link-table rows that have the load identifier of our latest load, as well as the hash key of the colliding business object.

After clean-up, we have restored data integrity for those business objects that encountered a collision, up to the point prior to the last load.

Now, we know that our last load brought us a business object that had a hash collision with some business object that was already in our data warehouse, and we made the conscious decision to reject that data for now, or maybe store it somplace else untill we know how to reconcile it. But the last load might also have brought us data that actually belonged to the business object that already existed in our data warehouse. We would really like to reload that part of the data for the existing business object. Our clean-up process had no way of distinguishing between satellite- and link- data for the one or the other business object, because it only knows about their colliding hash keys. In other words, our clean-up process might have removed data that actually did belong to the already existing business object, and now we need to put that back.

The solution would be to have an alternative load process especially for this -hopefully exceptional- case.

The alternative load process would be similar to the collision table lookup solution described above. It would only load satellite- and link-tables, and it would include a lookup to the collision table. If the lookup fails, we're dealing with data belonging to the existing business object and we can load it. If the lookup succeeds then this is data that belongs to a new business object that caused a collision and we should discard it or store someplace else for later reconciliation.

If things go the way they should, then the clean-up-and-reload process should occur seldom. And if we need to run it, it would probably be quite fast since it deals with only a few business objects - typically only one.

The only drawback now is that our data warehouse lived through a short period where integrity was compromised during the load process. But at least, we can repair integrity for all existing business objects, and selectively discard only data for those business objects that suffer from hash collion with an already existing business object.

While this approach still relies on keeping collision tables around, we regain our ability to load hub-, satellite- and link-tables in parallel. We can even do loads spanning multiple systems; we just need to take care to clean those up as well in the case we do encounter a collision.

Other solutions?

I don't want't to pretend this is an exhaustive list - I'm hoping there are more options and I just can't think of them right now.

How to load the colliding business objects?

What these scenario's do not solve is loading the later arriving business object. We only managed to prevent these objects from entering our data warehouse, but have not found a solution to load that data as well.

And, We can't - not unless we change the key.

On the other hand, changing the key might just be doable: you could decide to try another hash-function for at least that hub. You would need to update all satellites and links pointing to that hub (and of course, the hub itself) and rehash every row that points to it.

Of course, since you're still relying on hash-keys, just - hopefully - larger ones, you haven't solved the problem, you've just increased the odds. And you might even run into new collision while you're doing the rehashing operation. But that's just the life you've chosen. At least we now have something that resembles dealing with the problem rather than praying it won't happen.

I guess my main point here is - make sure every stakeholder is actually accepting the risk and make sure the procedures for dealing with a collision are specified and tested. The fact that the chance you'll need it may be next to neligible is not a license to pretend you do not need to be prepared to do these tasks. If it is decided that you'll be taking the risk, then actually do take the risk, and take it seriously.

Can't we have our cake and eat it too?

Isn't there some way we can benefit from hashes and still, magically immunize ourselves against hash-collisions? It turns out we can.

To recap:
  • DV, like other data warehousing methods - suggests using surrogate keys, because business keys are largish and unwieldy, and slow down join operations.
  • DV 2.0 suggests using hash-keys to avoid sequences, which make it impossible to load hubs, satellites and links all in parallel, and which slow down the load due to a lookup in a large hub.
This maybe a longshot, but it has as advantage the guarantee that it will work. As should be amply clear from the discussions in this article, hash-keys always have the chance of a collision, and it doesn't matter how small the risk is if you already know you do not want to accept losing data or giving up integrity. So, what is clear is that if that is the requirement, you cannot use hashes as keys. Period.

But that does not mean we cannot benefit from hashes.

Especially if the hash-key is small in comparison with the business key, then we could build up our keys as a composite of the hash-code, followed by the field or fields that make up the business key. If our database uses B-tree indexing, then any joins will be able to resolve the join in the very vast majority of cases only based on the hash-key column, which should be the first column in the key definition. You would still keep using the fields of the business key in your join conditions, to ensure that, in case of a hash-collision, the query will still return the correct result. Since the collisions are so rare, the database will end up with very few rows after it has resolved the join over the first field in the key, so the overhead of such a large key should be minimal.

Of course, this solution does not help in cutting down the amount of data you need to store - that will be much more in this scenario since you keep dragging the business keys literally everywhere: every satellite and every link table that points to this hub will inherit the column for the hash, as well as all columns that make up the business key. But if it helps - it should just add to the storage requirements, and not that much in terms of join processing.

The benefits of the -relatively- fast join will break down though when the database already uses hash-joins. In those cases, it will not be able to use the first column of the keys as prefix.

Finally

I hope you enjoyed this article. I am super curious to hear from DV practitioners if they have scenarios we can learn from. Drop a line in the comments! I'm looking forward to it.

Thursday, March 30, 2017

Announcing Shivers - Visualizing SAP/HANA Information View Dependencies

Dear SAP HANA DBa's, developers, architects etc, we at Just BI released Shivers.

What is Shivers?

Shivers stands for SAP/HANA Information Viewers. It is a tool to analyze and visualize dependencies between SAP/HANA information views (that is, Analytic Views, Attribute Views, and Calculation Views) and catalog objects (such as tables and views). Below is a screenshot of Shivers so you can get an impression of what it looks like:

Why Shivers?

In our work as SAP- and Business Intelligence consultants, SAP/HANA information views are a key tool in delivering Business Intelligence and Custom App Development Solutions to our customers.

In some cases, information views can get quite complex. Visualization and documentation features offered by development tools, like HANA Studio, do not always provide the kind of insight we need to create or maintain our solutions.

One very particular and concrete case is creating technical documentation, or preparing presentation materials for knowledge transfer sessions to handover to the customer support organization. In the past, some of our consultants would manually piece together overview slides of the data objects (information views, tables) that make up our solutions. I decided to try to whip up a solution that would make tasks like this a little simpler.

How do I install and run Shivers?

The simplest option is to download a shivers.zip archive from github. You can then unzip the archive, and open the index.html file inside in a modern webbrowser (tested with chrome and IE11 but should work in all modern browsers).

Note that you do not need to install shivers on a web server or HANA XS server. It runs directly from your local disk in your browser. Of course, you can install Shivers on a webserver or HANA XS server if you like, and then you can open Shivers by navigating to the url corresponding to the place where you installed Shivers. But this won't affect how Shivers works.

How do I load information views into Shivers so I can visualize them?

On the Shivers toolbar in the top of the application, you'll find a file chooser button. When you click that, a file browser will open. Use it to browse to whatever location on your local disk where you keep your information view source files.

I for example have my HANA Studio workspace stored in C:\Users\rbouman\hana_work, so I browse to that location and then find the appropriate path in the subdirectories to the HANA System and package I'm interested in. When I found a directory containing information view source files, I selected them, and confirm the file chooser.

Shivers will then prompt for a package name. It needs to do this to know how other views could refer to this set of views. Since the package name is itself not stored in the information view source files, the user has to supply that information. Later, when analyzing dependencies, Shivers will use the entered package name and match it with pacakge names it encounters inside the information view source files, when that information view refers to other information views.

After these steps, Shivers will populate the treeview on the left-hand side of the screen with the package structure and entries for the information views you loaded.

Note that shivers has a log tab. The log tab will report about the loading process, and it will also report it whenever it loads and information view that depends on other information views that are currently not loaded into Shivers. If you want to make complete and exhaustive graphs, you should then also load whichever file is reported in the log tab as an omitted dependency.

When the loaded files are visible in the treeview, you can click on any of the treenodes representing an information view. When you select one, a tab will open inside the Shivers window, showing a dependency graph of the selected information view.

Shivers currently does not do any clever layout of your dependency graphs. But you can manually drag and place items inside the graph to make it look good.

When you are done editing your graph, you can right click it, and choose "Export Image" to export the visualization to a png file, which you can then use in your technical documentation or knowledge transfer presentation slide deck.

What about HADES?

In my previous blogpost (http://rpbouman.blogspot.nl/2016/10/sap-hana-on-which-base-columns-do-my.html) I wrote about HADES, which is a bunch of (open source) utilities to report on information view metadata.

In a way, HADES and SHIVERS complement each other.

HADES are server-side tools (so far, only stored routines) that help analysis of information views. HADES is data-oriented, and requires access to a HANA SQL client and allows you to extract just about any information you'll ever want to know about your information views. But, HADES is in a way also very low-level, and using it effectively requires SQL skills.

Shivers is a graphically oriented client-side tool. It is implemented as a web page, that you start in your browser, directly from your file system. Shivers does not require any connection to a HANA Server. Rather, you use checked out information view source files (.analyticview, .attributeview and .calculationview files), load them into shivers, and then the tool does static code and dependency analysis. So far, possibilities to extend or influence reporting of Shivers are very limited. Shivers draws dependency graphs of your information views, and for now - that is it.

Why is Shivers an offline client-side tool?

Implementing Shivers as non-server, client side tool was a very deliberate decision. At our customer's systems, we cannot assume that we are allowed to install our tools on the HANA Server for just any purpose. So we really must have a solution that works regardless of that. What we can assume is that we have access to the information view source files, since we mostly work with HANA Studio and check out information view source files from the HANA repository all the time.

What are the terms and conditions for using Shivers?

Shivers is open source software under the Apache 2.0 License. This should give you all the freedom to copy, use, modify and distribute Shivers, free of charge, provided you respect the Apache License.

How can I get support Shivers?

We at Just BI are always happy to help if you run into any issue, but Shivers is not currently a for-profit solution. If necessary we can always negotiate professional support should you require it.

We Want your Feedback!

Please check out Shivers and let us know what you think. You can either drop me a line at this blog. Or you can post issues in the github change tracker.

Don't be shy - bugs, new feature proposals, critique - everything is welcome. Just let us know :)

You can also fork the shivers project, and contribute your work back via a pull request.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...