Monday, December 02, 2019

Building a UI5 Demo for SAP HANA Text Analysis: Part 4

This is the last in a series of blog posts describing a simple web front-end tool to explore SAP HANA's Text Analysis features on documents uploaded by the user. In the previous post we presented the sample application, explained its functionality, and concluded by pointing to the github repository and the installation instructions so that you can run the application on your own HANA system.

In this post, we'll explain in detail how the upload functionality works from the UI5 side of things.

Uploading Files and Binary Data to HANA .xsodata services using UI5

In the first installment of the series, options and concerns were discussed on the topic of loading the (binary) document content into the SAP HANA database. We chose to use an OData service. In this installment, we'll go into fairly deep detail about how to implement a file upload feature backed by a HANA .xsodata service using the UI5 front-end framework.

Some notes on the UI5 Application Implementation

Before we discuss any particular details of the implementation of the UI5 application, it is necessary to point out that this particular application is demoware. Many typical patterns of UI5 application development were omitted here: there is no internationalization, and there are no modules or dependency injection with sap.ui.define(). There is not even an MVC architecture, so no XML views or controllers; no Component configuration, and no application descriptor (manifest.json).

Instead, the application consists of just a single index.html, which contains 2 <script> tags:
<script 
  src="https://sapui5.hana.ondemand.com/1.71.5/resources/sap-ui-core.js"
  id="sap-ui-bootstrap"
  data-sap-ui-libs="
    sap.ui.core,
    sap.ui.layout,
    sap.ui.unified,
    sap.ui.table,
    sap.ui.commons, 
    sap.m
  "
  data-sap-ui-theme="sap_bluecrystal"
>
</script>
<script src="index.js" type="text/javascript"></script>
The first one bootstraps UI5, and the second one loads index.js, which contains the implementation.

The main reason for this rather spartan approach is that the primary goal for my colleagues Arjen and Mitchell and me was to quickly come up with a functional prototype demonstrating the file upload feature. Although I have grown used to a more orthodox UI5 boilerplate, it was a distraction when it came to just quickly illustrating an upload feature. Once we built the upload feature, I wanted to see how easy it would be to augment it into a somewhat useful application, and I was curious to experience what it would be like to carry on with this unorthodox, pure-JavaScript approach.

There's much more that could be said about this approach, but that's another topic. So for now: if you're new to UI5 and want to learn more, don't take this application as an example - it's atypical. And if you are an experienced UI5 developer: now you have the background, so let's move on to the essence.

Communication with the backend using an OData Model

Before we get to the topic of building and controlling the upload feature, a couple of words should be said about how UI5 applications can communicate with their (xsodata) backend.

In UI5, we needn't worry about the exact details of making the backend call directly. Rather, UI5 offers an object that provides JavaScript methods that take care of this: the model. In our application, the model is an instance of sap.ui.model.odata.v2.ODataModel, which we instantiate near the top of our index.js:
var pkg = document.location.pathname.split('/').slice(1, -2);
var odataService = [].concat(pkg, ['service', 'ta']);  

/**
 * OData
 */
var modelName = 'data';
var model = new sap.ui.model.odata.v2.ODataModel('/' + odataService.join('/') + '.xsodata', {
  disableHeadRequestForToken: true
});
It's not necessary to go over the model instantiation in detail - for now it is enough to know that upon instantiation, the model is passed the URI of the .xsodata service we already built. We obtain the URL in the code preceding the model instantiation by taking the URL of the current webpage and building a path to service/ta.xsodata relative to that location:
var pkg = document.location.pathname.split('/').slice(1, -2);
var odataService = [].concat(pkg, ['service', 'ta']);  
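To make this concrete: assuming the application is installed in the package used throughout this series and the page is served at /system-local/public/rbouman/ta/web/index.html, the derivation works out as follows (a worked example, not extra code in the app):
// document.location.pathname: '/system-local/public/rbouman/ta/web/index.html'
// .split('/')   yields: ['', 'system-local', 'public', 'rbouman', 'ta', 'web', 'index.html']
// .slice(1, -2) yields: ['system-local', 'public', 'rbouman', 'ta']
// odataService  yields: ['system-local', 'public', 'rbouman', 'ta', 'service', 'ta']
// resulting model URI: '/system-local/public/rbouman/ta/service/ta.xsodata'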

Uploading a file: high level client-side tasks

From a functional point of view, there are two distinct tasks to be considered on the web app (client) side:
  • Building the user interface so the user can select the file to upload.
  • Loading the file contents into the database.
The first high-level task is strictly a matter of user interaction and is more or less independent of how the second high-level task is implemented. For the second high-level task, we already have the backend in place - this is the OData service we built in the second installment of this blog post series. What remains is how to do this from within UI5.

But already, we can break this task down into two subtasks:
  • Extracting the content from the chosen file. Once the user has chosen a file, they have only identified the thing they want to upload. The web app does not need to parse or understand the file content, but it does need to extract the data (file content) so it can send it to the server.
  • Sending the right request to the backend. The request will somehow include the contents extracted from the file, and it will have such a form that the server understands what to do with those contents - in this case, store them in a table for text analysis.

A UI5 File Upload control

For the file upload user interface, we settled on the sap.ui.unified.FileUploader control. Here's the relevant code from index.js that instantiates the control:
var fileUploader = new sap.ui.unified.FileUploader({
  buttonText: 'Browse File...',
  change: onFileToUploadChanged,
  busyIndicatorDelay: 0
});
The sap.ui.unified.FileUploader control is presented to the user as an input field with a button to open a file chooser. This lets the user browse and pick a file from their client device.

In addition, the sap.ui.unified.FileUploader control provides events, configuration options and methods to validate the user's choice, and to send the file off to a server. For example, you can set the uploadUrl property to specify where to send the file to, and there's an upload() method to let the control do the request.

As it turns out, most of this additional functionality did not prove very useful for the task at hand, because the request we need to make is quite specific, and we didn't really find a clear way of configuring the control to send just the right request. Perhaps it is possible, and if so, we would be most obliged to learn how.

What we ended up doing instead is to only use the file choosing capabilities of the sap.ui.unified.FileUploader control. To keep track of the user's choice, we configured a handler for the change event, which gets called whenever the user chooses a file, or cancels the choice.

The handler does a couple of things:
  • Determine whether a file was chosen. If not, the Upload confirmation button gets disabled so the user can only either retry choosing a file, or close the upload dialog.
  • If a file is chosen, a request is sent to the backend to figure out if the file already exists.
  • Depending upon whether the file already exists, the state of the upload dialog is set to inform the user of what action will be taken if they confirm the upload.
Let's go over these tasks in detail. First, validating the user's choice by checking if the user did in fact choose a file:
var fileToUpload;
var fileToUploadExists;
function onFileToUploadChanged(event){
  fileToUpload = null;
  fileToUploadExists = false;
  var files = event.getParameter('files');
  if (files.length === 0) {
    initFileUploadDialog();
    return;
  }
  fileToUpload  = files[0];
  
  ...more code here...
}
Note that we set up the fileToUpload variable to keep track of the user's choice. We need to keep track of it somewhere, since choosing the file and uploading it are separate tasks with regards to the UI: choosing the file happens when the user hits the Browse button provided by the sap.ui.unified.FileUploader control, whereas the upload is triggered by hitting the confirm button of the upload dialog.

When the user is done choosing the file, the sap.ui.unified.FileUploader will fire the change event, and our handler onFileToUploadChanged() gets called and passed the event as an argument. This event provides access to the FileList object associated with the file chooser:
  var files = event.getParameter('files');
Note: the FileList is not part of UI5. Rather, it is one of a number of browser built-in objects which together form the Web File API. We would have loved to obtain the FileList or the File object from our sap.ui.unified.FileUploader control directly, using a getter or something like that, but at the time we found no such method, and settled for a handler on the change event.

Once we have the FileList, we can check whether the user selected any files, and either disable the upload confirmation button (if no file was selected), or assign the chosen file to our fileToUpload variable so we can refer to it when the upload is confirmed:
function onFileToUploadChanged(event){

  ...

  if (files.length === 0) {
    initFileUploadDialog();
    return;
  }
  fileToUpload = files[0];

  ....

}
If we pass the check, our variable fileToUpload will now contain the File object reflecting the user's choice. (Note that this object, too, is not a UI5 object; it's also part of the Web File API.)

Note that in theory, the list of files associated with the sap.ui.unified.FileUploader could contain more than one file. But the default behavior is to let the user choose only one file. You can override that behavior by setting the sap.ui.unified.FileUploader's multiple property to true. Because we know that in this case there can be at most one file, we only need to check whether there is a file or not - there's no need to consider multiple files.

Checking whether the File was already Uploaded

Once we know for sure the user has chosen a file, it remains to be determined what should be done with it should the user decide to confirm the upload. To help the user decide whether they should confirm, we send a request to the backend to find out if the file was already uploaded:
function onFileToUploadChanged(event){

  ...

  fileToUpload  = files[0];
  fileUploader.setBusy(true);
  model.read('/' + filesEntityName, {
    filters: [new sap.ui.model.Filter({
      path: fileNamePath,
      operator: sap.ui.model.FilterOperator.EQ,
      value1: fileToUpload.name
    })],
    urlParameters: {
      $select: [fileNamePath, 'FILE_LAST_MODIFIED']
    },            
    success: function(data){

      ...update state depending upon whether the file exists...

    },
    error: function(error){

      ...update state to inform the user of an error...

    }
  });

  ....

}
The model provides a read() method which can be used to query the backend OData service. The first argument to the read() method is the so-called path, which identifies the OData EntitySet we want to query. In this case, we are interested in the Files EntitySet, as this corresponds to the CT_FILE database table in our backend. Because we use the name of the Files EntitySet in a lot of places, we stored it in the filesEntityName variable. So, our path becomes:
'/' + filesEntityName
Apart from the path, the read() method takes a second argument, which is an object of query options. We'll highlight the few we need here.

Because we only want to know whether the backend already has a file with the same name as the one the user just selected, we add a parameter to restrict the search. This is done with the filters option:
    filters: [new sap.ui.model.Filter({
      path: fileNamePath,
      operator: sap.ui.model.FilterOperator.EQ,
      value1: fileToUpload.name
    })],
The filters option takes an array of sap.ui.model.Filter objects. When we instantiate the sap.ui.model.Filter object, we pass an object with the following configuration options:
  • path - this should get a value that refers to a property defined by the OData entity type of this Entity Set. It corresponds to the name of a column of our database table. In this case, it is set to fileNamePath, which is a variable we initialized with 'FILE_NAME', i.e., the name of the column in the CT_FILE table that holds the name of our files.
  • value1 - this should be the literal value that we want to use in our filter. In this case, we want to look for files with the same name as the file chosen by the user, so we set it to the name property of the File object that the user selected - fileToUpload.name
  • operator - this should be one of the values defined by the sap.ui.model.FilterOperator object, which defines how the given filter value should be compared to the value of the column. In this case the operator is sap.ui.model.FilterOperator.EQ, which stands for an equals comparison. By using this operator, we demand that the value of the column should be exactly the same as the name of the chosen file.
There is one other option specified that affects the request:
    urlParameters: {
      $select: [fileNamePath, 'FILE_LAST_MODIFIED']
    },            
This specifies for which columns we want to retrieve values from the backend. It may be omitted, but in that case, all columns would be returned. Often this is not a problem, but in this case, we really want to prevent the server from returning values for the FILE_CONTENT column. Always retrieving the file contents would be an unnecessary burden on both the front- and the backend, so we actively suppress the default behavior. The only columns requested here are FILE_NAME and FILE_LAST_MODIFIED. The latter is currently unused, but might come in handy to provide even more information to the user so they can better decide whether they want to re-upload an existing file.
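Putting the path, the filter and the $select option together, the read() call results in a GET request along these lines (assuming a chosen file named mydoc.pdf - a hypothetical name - and with URL encoding omitted for readability):
GET /system-local/public/rbouman/ta/service/ta.xsodata/Files?$filter=FILE_NAME eq 'mydoc.pdf'&$select=FILE_NAME,FILE_LAST_MODIFIED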

The remaining options in the call to the model's read() method have nothing to do with the request itself, but are callback functions for handling its result. The error callback gets called if there is some kind of issue with the request itself - maybe the backend has gone away, or maybe the structure of the service changed. The success callback is called when the read request executes normally, and any results are then passed to it as an argument. This is true even if no results are found - the callback then simply gets passed an empty list of results.
In our example, the main purpose of the success callback is to flag whether the file already exists, and to update the state of the file uploader accordingly to inform the user. The existence of the file is flagged by assigning the fileToUploadExists variable, and we will see its significance in the next section where we discuss the implementation of the upload of the file contents.

Handling the Upload

We've just seen exactly how the UI5 application lets the user choose a file, and we even used our model to check whether the chosen file is already present in the backend. Once these steps are done, we have successfully initialized two variables, fileToUpload and fileToUploadExists. This is all we need to handle the upload.

In the application, the user initiates the upload by clicking the Confirmation Button of the uploadDialog. This then triggers the button's press event, to which we've attached the function uploadFile as handler.

So, in this handler, we must examine the value of the fileToUploadExists variable and take the appropriate action:
  • If fileToUploadExists is false, we should tell our model to add a new item. This is done by calling the createEntry() method.
  • If fileToUploadExists is true, we should tell our model to update the existing item. This is done by calling the update() method.

The path argument

Both methods take a path as first argument to indicate where the new item should be added, or which item to update.

When adding a new item, the path is simply the path of the Files Entity Set within the model:
'/' + filesEntityName
(Note: this is exactly the same as the path we used in the read() call to figure out whether the file already exists.)

The path for updating an existing item also starts with the path of the Entity Set, but includes the key to identify the item that is to be updated. Lucky for us, the sap.ui.model.odata.v2.ODataModel model provides the createKey() method which constructs such a path, including the key part, based on the values of the properties that make up the key. So, the code to construct the path for the update method becomes:
'/' + model.createKey(filesEntityName, {"FILE_NAME": fileToUpload.name})
(For more detailed information about how OData keys and paths work, see OData URI Conventions, in particular the section on "Addressing Entries".)
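For example, assuming the user chose a file named mydoc.pdf (a hypothetical name), the call works out like this:
model.createKey(filesEntityName, {"FILE_NAME": fileToUpload.name})
// returns the entity set name with the key predicate: "Files('mydoc.pdf')"
// so the path passed to update() becomes: "/Files('mydoc.pdf')"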

The payload

In addition to the path, we also need to pass the data, which is sometimes referred to as the payload. While the path tells the model where to add or update an item, the payload specifies what should be added or updated.

Now, even though the UI5 documentation is not very specific about how to construct the payload, we have used the createEntry() and update() methods of the sap.ui.model.odata.v2.ODataModel in the past without any problems. It is normally quite intuitive and hassle-free: you simply create an Object whose keys match the property names of the target entity set, and assign plain JavaScript values, just as-is. So, if we disregard the FILE_CONTENT field for a moment, the payload for the Files entity set could be something like this:
var payload = {
  "FILE_NAME": fileToUpload.name,
  "FILE_TYPE": fileToUpload.type,
  "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
  "FILE_SIZE": fileToUpload.size,
  "FILE_LAST_UPLOADED": new Date(Date.now())
};
Let's compare this to the data types of the corresponding properties in the entity type of the entity set:
<EntityType Name="FilesType">
  <Key>
    <PropertyRef Name="FILE_NAME"/>
  </Key>
  <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
  <Property Name="FILE_TYPE" Type="Edm.String" Nullable="false" MaxLength="256"/>
  <Property Name="FILE_LAST_MODIFIED" Type="Edm.DateTime" Nullable="false"/>
  <Property Name="FILE_SIZE" Type="Edm.Int32" Nullable="false"/>
  <Property Name="FILE_CONTENT" Type="Edm.Binary" Nullable="false"/>
  <Property Name="FILE_LAST_UPLOADED" Type="Edm.DateTime" Nullable="false"/>
</EntityType>
(Note: this entity type is taken from the $metadata document of our service.)

So, in short - there is a pretty straightforward mapping between the JavaScript runtime values and the Edm Type System (see: "6. Primitive Data Types") used by OData: JavaScript Strings may be assigned to Edm.Strings, JavaScript Date objects may be assigned to Edm.DateTimes, and JavaScript Numbers may be assigned to Edm.Int32s.

This is less trivial than one might think when one considers what happens here: on the one hand, we have the types as declared by the OData service, which are Edm types. Then, we have to consider the content type used to transport the payload in the HTTP request: OData services may support several content types, and by default OData supports both application/atom+xml and application/json. So, when starting with a payload as a JavaScript runtime object, this first needs to be converted to an equivalent representation in one of these content types (UI5 uses the JSON representation) by the client before it can be sent off to the OData service in an HTTP request.

It bears repeating that this is not a simple, standard JSON serialization, since the type system used by the JSON standard only knows how to represent JavaScript Strings, Numbers, and Booleans (and arrays and objects containing values of those types). The native JSON type system is simply too minimal to represent all the types in the Edm type system used by OData, hence the need for an extra JSON representation format. The sap.ui.model.odata.v2.ODataModel does a pretty remarkable job of hiding all this complexity and making sure things work relatively painlessly.
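To illustrate, the JSON representation that ends up on the wire for a payload like the one above would look roughly as follows (hypothetical values; note how the Edm.DateTime values are encoded as "\/Date(milliseconds)\/" strings, since JSON has no native date type):
{
  "FILE_NAME": "mydoc.pdf",
  "FILE_TYPE": "application/pdf",
  "FILE_LAST_MODIFIED": "\/Date(1575280800000)\/",
  "FILE_SIZE": 52346,
  "FILE_LAST_UPLOADED": "\/Date(1575290000000)\/"
}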

Representing FILE_CONTENT in the payload

Now for the FILE_CONTENT property. In the entity type, we notice that the data type is Edm.Binary. What would be the proper JavaScript runtime type to construct the payload?

We just mentioned that the mapping from JavaScript runtime types is usually taken care of by the sap.ui.model.odata.v2.ODataModel. So we might be tempted to simply pass the File object itself directly as the value for the FILE_CONTENT property. But when we call either the createEntry or update method with a payload like this:
var payload = {
  "FILE_NAME": fileToUpload.name,
  "FILE_TYPE": fileToUpload.type,
  "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
  "FILE_SIZE": fileToUpload.size,
  "FILE_LAST_UPLOADED": new Date(Date.now()),
  "FILE_CONTENT": fileToUpload
};
we get an error in the response:
The serialized resource has an invalid value in member 'FILE_CONTENT'.
So clearly, the sap.ui.model.odata.v2.ODataModel needs some help here.

One might assume that the problem has to do with the File object being a little too specific for UI5 - after all, a File object is not just some binary value, but a subclass of the Blob object, which has all kinds of file-specific properties of its own. However, assigning a proper, plain Blob object in the payload yields exactly the same result, so that's not it either.

Instead of continuing to experiment with different values and types, we took a step back and consulted the OData specification to see if we could learn a bit more about the Edm.Binary type. In the part about the JSON representation (see: "4. Primitive Types") we found this:
Base64 encoded value of an EDM.Binary value represented as a JSON string
It seems to suggest that whatever represents the Edm.Binary value needs to be Base64 encoded, which yields a string value at runtime, and this string may then be serialized to a JSON string. So, if we can make a Base64 encoded string value of our binary value, we can assign that in the payload. (We already saw that sap.ui.model.odata.v2.ODataModel will turn JavaScript String values into their JSON representation, so we don't have to do that step ourselves.)

Fortunately, it's easy to create Base64 encoded values. The browser built-in function btoa() does this for us.
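For example, btoa() takes a binary string and returns its Base64 encoding:
btoa('abc'); // returns "YWJj"
atob('YWJj'); // the inverse: returns "abc"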

However, we're not there yet, as the spec starts with a binary value, and JavaScript does not have a binary type (and hence, no binary values).

We then took a look at the specification to find out exactly what a Edm.Binary value is. We found something in the section about Primitive Datatypes on how to create literal Edm.Binary Values:
binary'[A-Fa-f0-9][A-Fa-f0-9]*' OR X '[A-Fa-f0-9][A-Fa-f0-9]*' 
NOTE: X and binary are case sensitive. 
Spaces are not allowed between binary and the quoted portion. 
Spaces are not allowed between X and the quoted portion. 
Odd pairs of hex digits are not allowed.

Example 1: X'23AB' 
Example 2: binary'23ABFF'
At this point the thinking was that we could take the bytes that make up the binary value, convert them to their hexadecimal string representation, single-quote the resulting hex string, and finally prepend either X or binary to it. At runtime, this would then be a JavaScript string value representing an Edm.Binary literal, which we could then turn into its Base64 encoded value and assign to the payload.

When we went this route, the error message went away, and sure enough, documents started to show up in our backend table. Unfortunately, the documents ended up there as Edm.Binary literals - that is, as strings that are an accurate Edm.Binary literal representation of the document, but otherwise useless.

At this point the solution was clear though - just leave out the intermediate step of converting the original value to an Edm.Binary literal.

The uploadFile function

Remember, at this point we have the File object stored in the fileToUpload variable, and the flag fileToUploadExists set to true or false depending upon whether the file is already stored in the backend table. This is the code we ended up with for uploading the file:
function uploadFile(){
  var fileReader = new FileReader();
  fileReader.onload = function(event){
    // the result is a binary string: one character per byte of the file
    var binaryString = event.target.result;
    var payload = {
      "FILE_NAME": fileToUpload.name,
      "FILE_TYPE": fileToUpload.type,
      "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
      "FILE_SIZE": fileToUpload.size,
      "FILE_CONTENT": btoa(binaryString),  // Base64 encode the binary string
      "FILE_LAST_UPLOADED": new Date(Date.now())
    };
    if (fileToUploadExists) {
      // the file is already in the backend: update the existing entry
      model.update(
        '/' + model.createKey(filesEntityName, {
          "FILE_NAME": fileToUpload.name
        }), 
        payload
      );
    }
    else {
      // the file is not in the backend yet: create a new entry
      model.createEntry('/' + filesEntityName, {
        properties: payload
      });
    }
    model.submitChanges({
      success: function(){
        closeUploadDialog();
      }
    });
  };
  fileReader.readAsBinaryString(fileToUpload);
}
As explained earlier, uploading the file breaks down into 2 subtasks, and this handler takes care of both:
  • First, we use the FileReader to read the contents of the File object
  • Then, we send it to the backend. To do that, we construct the path and the payload, and call either the createEntry or the update method, depending on whether the file already exists, passing the path and the payload.

Using the FileReader to read the contents of a File object

First, we need to read the contents of the file. We do that using a FileReader, which is also part of the Web File API. To get the contents of a File object, we can call one of the FileReader's read methods.

The FileReader's read methods do not return the contents of the file directly: the Web File API is mostly asynchronous. Instead, we have to attach an event handler to the FileReader which can respond to the FileReader's events. In this case we assigned the FileReader's onload handler, which gets called when the FileReader is done reading a File. (Instead of the assignment, we could also have attached a handler with addEventListener, but it really doesn't matter too much how the handler is attached.)
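For completeness, the addEventListener variant would look like this:
fileReader.addEventListener('load', function(event){
  var binaryString = event.target.result;
  // ...same handling as in the onload assignment...
});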

Once set up, we can now call a read() method and wait for the reader to call our onload() handler.

So the general structure to read the file is as follows:
function uploadFile(){
  var fileReader = new FileReader();
  fileReader.onload = function(event){

    var binaryString = event.target.result;

    ...do something with the file contents...

  };
  fileReader.readAsBinaryString(fileToUpload);
}


We already mentioned that the FileReader provides a number of different read methods, and the chosen method determines the type of the value that will be available in event.target.result by the time the load handler is called.

To figure out which method we should use, we should consider how our backend expects to receive the data. Or rather, how our sap.ui.model.odata.v2.ODataModel wants us to pass the data so it can make the appropriate call to the backend. In a previous section we already explained our struggle to figure out how to represent an Edm.Binary value in the payload, and based on those findings, readAsBinaryString() is the appropriate method. With this read method, the FileReader turns each individual byte of the file contents into a JavaScript character, much like the String object's fromCharCode() method would do. The resulting value is a JavaScript binary string: each character represents one byte.

Note that this is very different from what the readAsText() method would do: that would attempt to read the bytes as if they were characters in UTF-8 encoding; in other words, it would result in a character string, not a binary string.
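A small example may help (assuming a file containing the two bytes 0xC3 0xA9, which happen to be the UTF-8 encoding of the character 'é'):
// readAsText():         result is 'é'  - one character, the bytes decoded as UTF-8
// readAsBinaryString(): result is 'Ã©' - two characters, '\u00C3' and '\u00A9',
//                       one character per byte, preserving the raw bytes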

After obtaining the file contents as binary string, we can apply the Base64 encoding and assign it to the payload:
    var payload = {
      "FILE_NAME": fileToUpload.name,
      "FILE_TYPE": fileToUpload.type,
      "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
      "FILE_SIZE": fileToUpload.size,
      "FILE_CONTENT": btoa(binaryString),
      "FILE_LAST_UPLOADED": new Date(Date.now())
    };

Summary

This concludes the last installment in this blog series. In this post we learned how to use the sap.ui.unified.FileUploader control, the browser's Web File API (the File, FileList and FileReader objects), and the read(), createEntry() and update() methods of the sap.ui.model.odata.v2.ODataModel. By putting these elements together we created a file upload feature for a UI5 application, backed by a HANA OData service.

Odds and Ends

It's nice that we finally found out how to write the file contents to our OData service. But something does not feel quite right. Although we happened to find a way to write the binary data that satisfies both the sap.ui.model.odata.v2.ODataModel and the SAP HANA .xsodata service that backs it, we still haven't found any official documentation, either from the OData specification or from SAP, that confirms that this is really the correct way. We would hope that SAP HANA's .xsodata implementation is a faithful implementation of the standard, but for the Edm.Binary type, I'm just not 100% sure. If anybody could chime in and confirm this, and preferably point me to something in the OData specification that confirms it, I would be most grateful.

Sunday, December 01, 2019

Building a UI5 Demo for SAP HANA Text Analysis: Part 3

We now continue our series on building a simple web application for exploring SAP HANA Text Analysis features. In the previous blog post, we built a backend for our SAP HANA Text Analysis application. In this blog post I present a simple web application which lets end users upload documents to the backend and inspect the SAP HANA Text Analysis results.

The application is a very simple UI5 application. The code, along with the back-end code, is available on github, and instructions to install the application on your own SAP HANA system are provided as well.

The UI5 Application: Functional Overview

Here's an overview of the UI5 demo application:



The application features a single page, which is split vertically. On the left-hand side of the splitter is the list of uploaded files, showing all rows from the CT_FILE database table. On the right-hand side of the splitter is the list of text analysis results, showing rows from the $TA_ database table.

In the screenshot, only the FILE_NAME column is visible, but you can reveal the other columns by choosing them from the column header menu, which pops up when you right-click a column header:



Since we haven't uploaded any files yet, both lists are currently empty. So, let's upload a file to see it in action! To upload a file, hit the button on the top left side of the application toolbar (1):



After clicking the "Upload File for Text Analysis" toolbar button, a dialog appears that lets you browse files so you can upload them. Hit the "Browse File..." button in the dialog to open a File explorer (2). Use the file explorer to choose a file (3). Note that this demo project's github repository provides a number of sample files in the sample-docs folder.

After choosing a file in the File explorer, the file name appears in the dialog:



To actually upload the chosen file, confirm the dialog by clicking the "Upload" button at the bottom of the dialog. The file will then appear in the file list left of the splitter, and is then selected.

Whenever the selection in the file list changes, the text analysis results in the list on the right of the splitter are updated to match the selected item. As we mentioned in the previous post, collection of text analysis results is ASYNCHRONOUS, so after uploading a new file, there is a possibility that the text analysis results have not yet arrived. Unfortunately, there is not much that can be done about that at this point.



You can now browse, filter, and sort the list of analysis results to explore the results of the text analysis. Obviously, by itself this is not very useful, but the point of this app is to make it very easy to inspect the actual raw text analysis results. Hopefully, it will give you some ideas on how you could use this type of information to build actual real-world applications.

Once you're done with a particular file, you can remove it using this application too: in the File list, simply hit the trashbin icon to remove that particular file. A dialog will appear where you need to confirm the deletion of that file. When you confirm the dialog, the file will be deleted from the CT_FILE table. Note that any corresponding analysis results in the $TA_ table will not be removed by this demo application, unless you manually added a foreign key constraint on the $TA_ table that cascades the deletes from the CT_FILE table.

Installing this application on your own HANA System

Front-end and back-end code for this application is available on github and licensed as open source software under the terms and conditions of the Apache 2.0 software license. The remainder of this post provides the installation instructions.

Obtaining the source and placing it in a destination package on your HANA system

  • Create a package with your favorite IDE for SAP HANA (Web IDE, SAP HANA Studio, Eclipse with SAP HANA Developer Tools)
  • Download an archive of the github repository
  • Unzip the archive and transfer its contents to the HANA package you just created.

Updating Package and Schema names

  • With db/CT_FILE.hdbdd:
    • update the namespace: change the package identifier from "system-local"."public"."rbouman"."ta" to the name of the package you just created.
    • modify the @Schema from 'RBOUMAN' to whatever schema you want to use. (Create a schema yourself if you don't already have one)
    • Activate db/CT_FILE.hdbdd. In the database catalog, you should now have this table. HANA should have created a corresponding $TA_ table as well.
  • With service/ta.xsodata:
    • In the first entity definition, update the table repository object identifier "system-local.public.rbouman.ta.db::CT_FILE" so it matches the location of the table on your system.
    • In the second entity definition, update the catalog table identifier from "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE" so it matches the database schema and catalog table name on your system.
    • Activate service/ta.xsodata.

Activation

You can now activate the package you created to activate all remaining objects, such as the .xsapp and .xsaccess files, as well as the web subpackage and all its contents.

Running the application

After installation, you should be able to open the web application. You can do this by navigating to:
http://yourhanahost:yourxsport/path/to/your/package/web/index.html
where:
  • yourhanahost is the hostname or IP address of your SAP HANA system
  • yourxsport is the port where your HANA's xs engine is running. Typically this is 80 followed by your HANA instance number.
  • path/to/your/package is the name of the package where you installed the app, but using slashes (/) instead of dots (.) as the separator character (see the example below)
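For example, with HANA instance number 10 and the package used throughout this series, the URL would be (hostname hypothetical):
http://myhanahost:8010/system-local/public/rbouman/ta/web/index.html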

Summary

In this blog post we finally got to use the backend we built previously by installing and running the UI5 app. You can use the app to explore SAP HANA Text Analysis results and to experiment with different document formats.

If you're also interested in how the actual upload process works and how it is implemented in the UI5 app, you can read all about it in the next and final installment of this series.

Building a UI5 Demo for SAP HANA Text Analysis: Part 2

In the previous blog post, I explained some of the prerequisites for building a SAP HANA Text Analysis application, and dedicated some thoughts to how to expose these features to an end-user facing web application. In this blog post, these considerations are put into practice and a basic but functional backend is created to support such an application.

Building the HANA Text Analysis Backend

We'll quickly go through the setup so you can try this out yourself. This assumes you have access to a HANA system and development tools (the SAP Web IDE, HANA Studio, Eclipse IDE with SAP HANA Development Tools, or whatever - it doesn't really matter).

Package Structure

There is a lot that could be said about the proper package structure for HANA applications, but we won't go there now. It's enough to have just enough structure to keep the different responsibilities of the application apart. We settled on a simple base package called ta, which acts as the root package of the demo project, with 3 subpackages:
  • db for anything related to the physical database structure
  • service for the OData service, that exposes the database to our web-application
  • web for the HTML5/UI5 application - i.e. the stuff that is served from HANA to run inside the client's web-browser
Apart from these 3 subpackages, the ta package also contains 2 files, .xsapp and .xsaccess, which are necessary to expose it as a web application. Note that with this setup, all of our subpackages are exposed, whereas only the service and web subpackages actually need to be exposed. An actual, serious application would only expose whatever is minimally required, and would not expose any packages related to the physical database structure.

The CT_FILE table

We created a HANA table to hold our uploaded files by creating a file called CT_FILE.hdbdd in the db package. This allows you to maintain the table definition as a repository object, which makes it transportable.

The CT_FILE.hdbdd file has the following contents:
namespace "system-local"."public"."rbouman"."ta"."db";

@Schema: 'RBOUMAN'
@Catalog.tableType: #COLUMN
Entity CT_FILE {
  Key "FILE_NAME"                     : String(256)  not null;
      "FILE_TYPE"                     : String(256)  not null;
      "FILE_LAST_MODIFIED"            : UTCTimestamp not null;
      "FILE_SIZE"                     : Integer      not null;
      "FILE_CONTENT"                  : LargeBinary  not null;
      "FILE_LAST_UPLOADED"            : UTCTimestamp not null ;
} 
technical configuration {
    FULLTEXT INDEX "FT_IDX_CT_FILE" ON ("FILE_CONTENT")
        ASYNCHRONOUS
        LANGUAGE DETECTION ('en')
        MIME TYPE COLUMN "FILE_TYPE"
        FUZZY SEARCH INDEX off
        PHRASE INDEX RATIO 0.721
        SEARCH ONLY OFF
        FAST PREPROCESS OFF
        TEXT ANALYSIS ON
        CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS';
};
The important feature here is the definition of the FILE_CONTENT column as a LargeBinary, and the FULLTEXT INDEX definition on that column. The particular syntax to define the fulltext index in a .hdbdd table definition is described in the SAP HANA Core Data Services (CDS) Reference, whereas the actual options that are applicable for FULLTEXT INDEXes are described in the SAP HANA SQL and System Views Reference. Finally, guidance on the meaning and functionality of the text analysis configurations can be found in the SAP HANA Text Analysis Developer Guide.

The short of it is that with this configuration, the (binary) content of documents stored in the FILE_CONTENT column will be analyzed automatically. The results of the analysis will be stored in a separate $TA_ table called $TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE. This table is created and maintained by the HANA system. The structure of the $TA_ table is described in the SAP HANA Text Analysis Developer Guide.
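Once documents have been loaded and analyzed, the analysis results can be inspected with a simple query, for example (schema and table name as used in this project):
SELECT "FILE_NAME", "TA_RULE", "TA_COUNTER", "TA_TOKEN", "TA_TYPE", "TA_NORMALIZED"
FROM "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
ORDER BY "FILE_NAME", "TA_RULE", "TA_COUNTER";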

ASYNCHRONOUS

I just mentioned that the analysis results will be stored in the $TA_ table automatically. While this is true, the analysis does not occur immediately. This is because the FULLTEXT INDEX is created with the ASYNCHRONOUS option. This allows HANA to store documents in the CT_FILE table without having to wait for the text analysis process to finish.

We could debate the advantages and drawbacks of the ASYNCHRONOUS option, and whether it would make more sense to specify SYNCHRONOUS instead, or to leave the option out altogether (in which case SYNCHRONOUS would be implied). However, there is a very simple reason why it is currently specified as ASYNCHRONOUS: if a FULLTEXT INDEX specifies a CONFIGURATION option, then it must be specified as ASYNCHRONOUS, or else the following error occurs upon activation:
CONFIGURATION not supported with synchronous processing
For actual analysis, we really do need the CONFIGURATION option, as it offers all the truly interesting properties of text analysis. So, it seems there's just no way around it - text analysis results are collected in the background, and finish at some point after our document is stored in the table. And there seems to be no way of finding out whether the analysis is finished, or even whether it is still busy. For instance, this makes it impossible to determine whether a recently uploaded document is still being analyzed, or whether the document was not eligible for text analysis at all: in both cases, the analysis results will remain absent.

That said, even though the FULLTEXT INDEX is specified as ASYNCHRONOUS, HANA will let you specify when the analysis results should be updated. At least, according to the SAP HANA SQL and System Views Reference, it is possible to specify a FLUSH [QUEUE] <flush_queue_elem> clause right after the ASYNCHRONOUS option, with <flush_queue_elem> indicating either a time interval (expressed as a number of minutes) or a number of documents. So, in theory, it would be possible to write:
ASYNCHRONOUS FLUSH QUEUE AFTER 1 DOCUMENTS
which would indicate that update of the analysis would kick in as soon as a new document has been loaded.

Unfortunately, on the HANA System I have access to, this results in the following error upon activation:
Flush based on documents/minutes not yet supported
The same error message occurs when I try ASYNCHRONOUS FLUSH QUEUE EVERY 1 MINUTES instead.

So, it looks like we'll just have to live with this for now. I did some checks and noticed that analysis kicks in after a couple of seconds, but this is on a system that is not very heavily used. So for the purpose of exploration it's not too bad, but it does seem like this could become a problem for real-time applications (like chatbots).

The Key

Another thing worth mentioning here is the key of the CT_FILE table. For this very simple demo application, we chose to make only the FILE_NAME column the primary key of the table. The choice of key will depend on what kind of application you're building. In many practical cases you might not care about the physical name of the uploaded file at all, and a name given by the uploader might be a better choice. Or maybe you don't care about names at all, only about whether the content of the document may be considered unique, in which case some hash of the file contents may be a suitable choice.

No matter what key you choose for the FULLTEXT INDEXed table, the column definitions that make up the key are copied to the corresponding $TA_ table in order to maintain the relationship between the analysis results and the original source document, thus using those columns as a foreign key. Note however that HANA does not automatically create a FOREIGN KEY constraint to enforce referential integrity. But you may add such a constraint yourself, as sketched below. This may be useful in particular to cascade deletes on the document table to the text analysis results table. (Adding such a constraint manually is suggested in the introduction of the SAP HANA Text Analysis Developer Guide.)
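A minimal sketch of what such a constraint could look like, using the schema and table names from this project (treat the exact syntax as an assumption and verify it against the SAP HANA SQL Reference for your HANA version):
ALTER TABLE "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
  ADD FOREIGN KEY ("FILE_NAME")
  REFERENCES "RBOUMAN"."system-local.public.rbouman.ta.db::CT_FILE" ("FILE_NAME")
  ON DELETE CASCADE;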

The primary key of the $TA_ table consists of the key columns from the document table, plus two additional columns that identify an individual analysis result: TA_RULE and TA_COUNTER, where TA_RULE is the type of analysis that yielded the result, and TA_COUNTER is an integer that identifies, in order, each analysis result within a document and within a particular analysis type.

The .xsodata Service Definition

We expose both the CT_FILE table and the $TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE table via an OData service. The OData service is created with a .xsodata service definition file called ta.xsodata in the service subpackage.

The contents of ta.xsodata service definition file are shown below:
service {

    entity "system-local.public.rbouman.ta.db::CT_FILE" as "Files";
    
    entity "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE" as "TextAnalysis";
}
annotations {
  enable OData4SAP;
}
settings {
  support null;
}
This creates an OData service and maps our two tables, CT_FILE and the $TA_ table, to the OData EntitySets Files and TextAnalysis respectively. The OData service will be available at a URL whose path corresponds to the fully qualified package name of the .xsodata file.

Note that the syntax for mapping the tables to EntitySets is slightly different, depending upon whether the table is created as a repository object or as a database catalog object:
  • for CT_FILE, it is the package name containing the table's .hdbdd file, followed by 2 colons, and then followed by the local table name.
  • for the $TA_ table, it is the (quoted) database schema name, followed by a dot, and then followed by the quoted table name.
The reason for the difference is that we only maintain the CT_FILE table as a repository object. There is no corresponding repository object for the $TA_ table, since HANA creates it autonomously as a result of the full text index on CT_FILE. And since the $TA_ table is created automatically, the entire setup is transportable as it is, as long as we make sure we maintain our document table as a repository object, and refer to it in the .xsodata file using its repository object name.

Activation and Verification

Now that we have all these artifacts, we should try and activate our package and test the service. You can either attempt to activate the entire package, or activate each file individually. For the latter, you need to make sure to activate the .hdbdd file before attempting to activate the .xsodata file, because the .xsodata file is dependent upon the existence of the tables in the database catalog.

After successful activation, you can visit the service by navigating to its service document or metadata document using your web browser. These documents should be available at the following URLs:
  • Service Document: http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata
  • Metadata Document: http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata/$metadata
where:
  • yourhanahost is the hostname or IP address of your SAP HANA system
  • yourxsport is the port of your XS engine. This is normally 80 followed by the HANA instance number. For example, if the instance number is 10, the port will be 8010
  • path/to/your/package is the path you get when you take the package identifier where you put the db, service and web subpackages and replace the dot (.) that separates the individual package names with a slash (/).

Inspect the $metadata document of the service

If all is well, your $metadata document should look something like this:

<edmx:Edmx 
  xmlns:edmx="http://schemas.microsoft.com/ado/2007/06/edmx"
  xmlns:sap="http://www.sap.com/Protocols/SAPData" 
  Version="1.0"
>
  <edmx:DataServices 
    xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" 
    m:DataServiceVersion="2.0"
  >
    <Schema 
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://schemas.microsoft.com/ado/2008/09/edm"
      Namespace="system-local.public.rbouman.ta.service.ta"
    >
      <EntityType Name="FilesType">
        <Key>
          <PropertyRef Name="FILE_NAME"/>
        </Key>
        <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
        <Property Name="FILE_TYPE" Type="Edm.String" Nullable="false" MaxLength="256"/>
        <Property Name="FILE_LAST_MODIFIED" Type="Edm.DateTime" Nullable="false"/>
        <Property Name="FILE_SIZE" Type="Edm.Int32" Nullable="false"/>
        <Property Name="FILE_CONTENT" Type="Edm.Binary" Nullable="false"/>
        <Property Name="FILE_LAST_UPLOADED" Type="Edm.DateTime" Nullable="false"/>
      </EntityType>
      <EntityType Name="TextAnalysisType">
        <Key>
          <PropertyRef Name="FILE_NAME"/>
          <PropertyRef Name="TA_RULE"/>
          <PropertyRef Name="TA_COUNTER"/>
        </Key>
        <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
        <Property Name="TA_RULE" Type="Edm.String" Nullable="false" MaxLength="200"/>
        <Property Name="TA_COUNTER" Type="Edm.Int64" Nullable="false"/>
        <Property Name="TA_TOKEN" Type="Edm.String" MaxLength="5000"/>
        <Property Name="TA_LANGUAGE" Type="Edm.String" MaxLength="2"/>
        <Property Name="TA_TYPE" Type="Edm.String" MaxLength="100"/>
        <Property Name="TA_NORMALIZED" Type="Edm.String" MaxLength="5000"/>
        <Property Name="TA_STEM" Type="Edm.String" MaxLength="5000"/>
        <Property Name="TA_PARAGRAPH" Type="Edm.Int32"/>
        <Property Name="TA_SENTENCE" Type="Edm.Int32"/>
        <Property Name="TA_CREATED_AT" Type="Edm.DateTime"/>
        <Property Name="TA_OFFSET" Type="Edm.Int64"/>
        <Property Name="TA_PARENT" Type="Edm.Int64"/>
      </EntityType>
      <EntityContainer 
        Name="ta" 
        m:IsDefaultEntityContainer="true"
      >
        <EntitySet Name="Files" EntityType="system-local.public.rbouman.ta.service.ta.FilesType"/>
        <EntitySet Name="TextAnalysis" EntityType="system-local.public.rbouman.ta.service.ta.TextAnalysisType"/>
      </EntityContainer>
    </Schema>
  </edmx:DataServices>
</edmx:Edmx>

Summary

In this installment, we executed the plan formulated in part 1 of this series. We should now have a functional back-end which we may use to support our front-end application.

In the next installment we will present a front-end application, and explain how you can obtain it and install it yourself on your own SAP HANA System.

Building a UI5 Demo for SAP HANA Text Analysis: Part 1

Last week, my Just-BI co-workers Arjen Koot and Mitchell Beekink and I had a bit of a rumble with HANA (1.0) and the UI5 toolkit. In the process, we made a few observations and found out a few things which we figured might be worth sharing in a couple of blog posts. (Even though this was all done on HANA 1.0, many of these things should still work on HANA 2.0 as well, using XS Classic.)

Exploring HANA Text Analysis

The main use case of our concern is SAP HANA Text Analysis. SAP HANA's Text Analysis features let you extract tokens and semantics from various sources. Text analysis is not limited to plaintext but is also supported for binary documents in various formats, such as PDF and Microsoft Office documents (Word documents, Excel workbooks, and so on). After analysis, the analysis results may then be used for further processing.

Business cases that might utilize text analysis features include automated classification of invoices or reimbursement requests, matching CVs from employees or job applicants to vacancies, and plagiarism detection, to name just a few.

The hard work of converting the documents and performing the actual text analysis is handled fully by HANA, which is great. In our specific case, this process includes the conversion of binary documents in PDF format to text.

What Document Types can HANA handle?

To find out which types and formats your HANA instance can handle, run a query on "SYS"."M_TEXT_ANALYSIS_MIME_TYPES":
SELECT * 
FROM "SYS"."M_TEXT_ANALYSIS_MIME_TYPES";

+---------------------------------------------------------------------------+--------------------------------------------+
|MIME_TYPE_NAME                                                             | MIME_TYPE_DESCRIPTION                      |
+---------------------------------------------------------------------------+--------------------------------------------+
| text/plain                                                                | Plain Text                                 |
| text/html                                                                 | HyperText Markup Language                  |
| text/xml                                                                  | Extensible Markup Language                 |
| application/x-cscompr                                                     | SAP compression format                     |
| application/x-abap-rawstring                                              | ABAP rawstring format                      |
| application/msword                                                        | Microsoft Word                             |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document   | Microsoft Word                             |
| application/vnd.ms-powerpoint                                             | Microsoft PowerPoint                       |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | Microsoft PowerPoint                       |
| application/vnd.ms-excel                                                  | Microsoft Excel                            |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet         | Microsoft Excel                            |
| application/rtf                                                           | Rich Text Format                           |
| application/vnd.ms-outlook                                                | Microsoft Outlook e-mail (".msg") messages |
| message/rfc822                                                            | Generic e-mail (".eml") messages           |
| application/vnd.oasis.opendocument.text                                   | Open Document Text                         |
| application/vnd.oasis.opendocument.spreadsheet                            | Open Document Spreadsheet                  |
| application/vnd.oasis.opendocument.presentation                           | Open Document Presentation                 |
| application/vnd.wordperfect                                               | WordPerfect                                |
| application/pdf                                                           | Portable Document Format                   |
+---------------------------------------------------------------------------+--------------------------------------------+

HANA Text Analysis Applications: pre-requisites

To use the text analysis features, you need to
  • Create a database table with a column having the BLOB data type. SAP HANA's text analysis features also work with plaintext, but for our specific use case we are interested in analyzing PDF documents. From the point of view of the application and database storage, these documents are binary files, which is why we end up with a BLOB.
  • Create a FULLTEXT INDEX on that BLOB column, which configures all text analysis features we need, such as tokenization, stemming, and semantic extraction.
Once this is in place, we only need to store our documents (binary PDF files) in the BLOB column, and SAP HANA will do the rest (more or less automatically). The text analysis results can then be collected from the $TA_ table and used for further, application-specific processing.
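To give an idea of the shape of things, here is a minimal sketch in plain SQL, with hypothetical schema, table and column names (the actual, repository-based setup used in this project follows in the next installment):
CREATE COLUMN TABLE "MYSCHEMA"."DOCS" (
  "DOC_NAME"    NVARCHAR(256) PRIMARY KEY,
  "DOC_CONTENT" BLOB
);

CREATE FULLTEXT INDEX "FT_IDX_DOCS" ON "MYSCHEMA"."DOCS" ("DOC_CONTENT")
  ASYNC
  TEXT ANALYSIS ON
  CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS';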

Uploading Document Content

Now, as humble a task as it may seem, storing binary document content into a BLOB column is a bit of a challenge.

Various resources published by SAP (like this one) focus on the text analysis features themselves. They offer only a simple and pragmatic suggestion for loading the data into the table, which relies on a (client-side) Python script that reads the file from the client, and then uses a plain SQL INSERT statement to upload the file contents to the BLOB column.

This approach is fine for exploring just the text analysis features, of course, but it's not of much use if you want to create an end-user facing application. What we would like to offer instead is a web application or mobile app (for example, based on UI5) which allows end users to upload their own documents to the database through an easy-to-use graphical user interface.

What about a .xsjs script?

Now, it is entirely possible to come up with a solution that is somewhat similar to the client-side Python script. For example, we could write an .xsjs script that runs on the HANA server. This script would then handle an HTTP POST request and receive the raw document bytes in the request body.

Then, the .xsjs script would run a similar kind of SQL statement, and store the received data in the BLOB column. (An example .xsjs script using SQL can be found here.)
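To make the idea a bit more tangible, here is a rough sketch of what the core of such a script could look like, reusing the hypothetical DOCS table from the previous section (the exact XSJS API calls are given from memory and should be treated as an assumption, not as tested code):
// hypothetical .xsjs upload handler (sketch)
var content = $.request.body.asArrayBuffer(); // raw document bytes from the POST body
var conn = $.db.getConnection();
var stmt = conn.prepareStatement(
  'INSERT INTO "MYSCHEMA"."DOCS" ("DOC_NAME", "DOC_CONTENT") VALUES (?, ?)'
);
stmt.setString(1, $.request.parameters.get('name'));
stmt.setBlob(2, content); // store the bytes in the BLOB column
stmt.execute();
conn.commit();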

Drawbacks of a .xsjs approach

However, there are a couple of drawbacks to this approach.

First of all, it is unlikely that a user-facing application would only want to upload the binary content of the document. Most likely, some application-specific metadata will need to be stored as well.

Another problem is that we would need to write specific code to handle a very specific kind of request: uploading data to one particular table - that is, the unstructured document content itself, plus whatever structured data the application needs to associate with it. The script will need to refer to the table and its columns by name (or perhaps via a stored procedure that does the actual loading - same difference), and when the name of the table or one of its columns changes, the script must also be changed.

This same issue bites back when you need to transport such a solution to another HANA system. Since there is no formal dependency between the .xsjs script and the database catalog objects it references, this poses a bit of a challenge for the HANA transport system which is typically used for these kinds of tasks. At the least, such a dependency is not registered anywhere and thus needs to be managed manually.

Another thing to keep in mind is that if an application can write data somewhere, it generally also needs to read data from that source. For example, the user may want to search some of the metadata fields to see if a particular file was already uploaded, or to see when a particular file was last uploaded, or to update a previously uploaded file.

Now, we could certainly write the .xsjs script so that it does those things as well. But the point is: even though we minimally need only a service that allows us to upload the document, that interface is incomplete from the application's point of view, so the actual script will need to implement many more data operations than only the upload. And so, what seemed like a small task - writing a simple script to do one simple thing - becomes a pretty serious task: writing a fully featured service which does a whole bunch of things. And even if writing that would seem okay, it then needs to be maintained and documented, and so on.

Finally, writing a server-side script that executes SQL could introduce a security risk. Even though it might be written in a way that is safe, one actually needs to verify that it is indeed written safely, and this needs to be done for every piece of functionality that the service offers.

So, in summary - even though it may seem like a simple task, a realistic, practical solution that is safe, maintainable and fully featured is not trivial at all. Not with .xsjs anyway! It will require substantial effort to write, document and maintain. The time and effort would be much better spent on designing and building actual business features.

Using OData instead

HANA also offers an OData implementation. HANA OData services are defined in .xsodata service definitions and they pretty much solve all drawbacks mentioned above:
  • The WEB API is generic, and works the same for just about any database table-like object you want to expose
  • The API is complete and well defined, and supports all CRUD data operations you need in practice
  • HANA registers the dependency between the .xsodata file and the repository or database catalog objects it references. This is essential to create transportable solutions.
  • Typically, changes in table structure will be automatically picked up by the .xsodata service, making this a low-maintenance task. Writing an .xsjs service in a way that achieves the same implies another level of complexity, which is certainly doable but increases development and maintenance effort.
So: what's not to like about OData? We'll certainly get back on that topic, especially in the last part of the series! But for now let's just be happy with the benefits of .xsodata over .xsjs.

Summary

We learned that we must create a table with a BLOB column to hold our documents, and a FULLTEXT INDEX so HANA knows to analyze the contents. After some consideration, we decided to try whether we can use HANA's OData implementation to upload document content to such a table.

In the next installment, we will explain in more detail how to build these backend objects.

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...