Sunday, December 01, 2019

Building a UI5 Demo for SAP HANA Text Analysis: Part 1

Last week, my Just-BI co-workers Arjen Koot and Mitchell Beekink and I had a bit of a rumble with HANA (1.0) and the UI5 toolkit. In the process, we made a few observations and found out a few things that we figured might be worth sharing in a couple of blog posts. (Even though this was all done on HANA 1.0, most of it should still work on HANA 2.0 as well, using XS Classic.)

Exploring HANA Text Analysis

The main use case we're concerned with is SAP HANA Text Analysis. SAP HANA's Text Analysis features let you extract tokens and semantics from various sources. Text analysis is not limited to plaintext: it is also supported for binary documents in various formats, such as PDF and Microsoft Office documents like Word documents and Excel workbooks. The analysis result may then be used for further processing.

Business cases that might utilize text analysis features include automated classification of invoices or reimbursement requests, matching CVs from employees or job applicants to vacancies, and plagiarism detection, to name just a few.

The hard work of converting the documents and performing the actual text analysis is handled entirely by HANA, which is great. In our specific case, this process includes the conversion of binary documents in PDF format to text.

What Document Types can HANA handle?

To find out which document types and formats your HANA instance can handle, run a query on "SYS"."M_TEXT_ANALYSIS_MIME_TYPES":
SELECT * 
FROM "SYS"."M_TEXT_ANALYSIS_MIME_TYPES";

+---------------------------------------------------------------------------+--------------------------------------------+
|MIME_TYPE_NAME                                                             | MIME_TYPE_DESCRIPTION                      |
+---------------------------------------------------------------------------+--------------------------------------------+
| text/plain                                                                | Plain Text                                 |
| text/html                                                                 | HyperText Markup Language                  |
| text/xml                                                                  | Extensible Markup Language                 |
| application/x-cscompr                                                     | SAP compression format                     |
| application/x-abap-rawstring                                              | ABAP rawstring format                      |
| application/msword                                                        | Microsoft Word                             |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document   | Microsoft Word                             |
| application/vnd.ms-powerpoint                                             | Microsoft PowerPoint                       |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | Microsoft PowerPoint                       |
| application/vnd.ms-excel                                                  | Microsoft Excel                            |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet         | Microsoft Excel                            |
| application/rtf                                                           | Rich Text Format                           |
| application/vnd.ms-outlook                                                | Microsoft Outlook e-mail (".msg") messages |
| message/rfc822                                                            | Generic e-mail (".eml") messages           |
| application/vnd.oasis.opendocument.text                                   | Open Document Text                         |
| application/vnd.oasis.opendocument.spreadsheet                            | Open Document Spreadsheet                  |
| application/vnd.oasis.opendocument.presentation                           | Open Document Presentation                 |
| application/vnd.wordperfect                                               | WordPerfect                                |
| application/pdf                                                           | Portable Document Format                   |
+---------------------------------------------------------------------------+--------------------------------------------+

HANA Text Analysis Applications: prerequisites

To use the text analysis features, you need to:
  • Create a database table with a column having the BLOB data type. SAP HANA's text analysis features also work with plaintext, but for our specific use case we are interested in analyzing PDF documents. From the point of view of the application and of database storage, these documents are binary files, which is why we end up with a BLOB.
  • Create a FULLTEXT INDEX on that BLOB column, which configures all the text analysis features we need, such as tokenization, stemming, and semantic extraction. (A sketch of both objects follows right after this list.)
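
To make this a bit more concrete, here is a minimal sketch of what these two objects could look like. The schema, table, column and index names are invented for this example, and the text analysis CONFIGURATION is just one possible choice - which one you pick depends on what you want to extract:

-- Hypothetical table to hold the uploaded documents.
-- All names in this sketch are made up for illustration purposes.
CREATE COLUMN TABLE "MYSCHEMA"."DOCUMENTS" (
  "FILE_NAME"    NVARCHAR(256) PRIMARY KEY,
  "MIME_TYPE"    NVARCHAR(128),
  "UPLOADED_AT"  TIMESTAMP,
  "FILE_CONTENT" BLOB
);

-- Full-text index on the BLOB column.
-- TEXT ANALYSIS ON tells HANA to run text analysis on the column contents
-- and to store the results in a generated "$TA" table.
CREATE FULLTEXT INDEX "DOCUMENTS_CONTENT_IDX"
ON "MYSCHEMA"."DOCUMENTS" ("FILE_CONTENT")
CONFIGURATION 'EXTRACTION_CORE'
TEXT ANALYSIS ON;
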
Once this is in place, we only need to store our documents (binary PDF files) in the BLOB column, and SAP HANA will do the rest (more or less automatically). The text analysis results can then be collected from the generated $TA table and used for further, application-specific processing.
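
For example, after loading a couple of documents, the extracted tokens can be inspected with a query along these lines. It uses the hypothetical names from the previous sketch, and the TA_* columns shown are just a few of the ones you will typically find in a $TA table:

-- Peek at the text analysis results HANA generates for the documents.
-- The $TA table lives next to the base table and is named after the index.
SELECT "FILE_NAME"   -- key column(s) of the base table are repeated here
,      "TA_RULE"     -- the analysis rule that produced the row
,      "TA_COUNTER"  -- position of the token within the document
,      "TA_TOKEN"    -- the extracted token itself
,      "TA_TYPE"     -- the token or entity type
FROM   "MYSCHEMA"."$TA_DOCUMENTS_CONTENT_IDX"
ORDER BY "FILE_NAME"
,        "TA_COUNTER";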

Uploading Document Content

Now, as humble a task as it may seem, storing binary document content in a BLOB column is a bit of a challenge.

Various resources published by SAP (like this one) focus on the text analysis features themselves. They only offer a simple and pragmatic suggestion for loading the data into the table: a (client-side) Python script that reads the file on the client and then uses a plain SQL INSERT statement to upload the file contents to the BLOB column.
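
Stripped of the Python plumbing, the statement such a script executes essentially boils down to a parameterized INSERT like the one below. It uses the hypothetical table and column names from the earlier sketch; the ? placeholders are bound by the client, with the raw file bytes going into the BLOB parameter:

-- Core of the upload: the client binds the file name, mime type and
-- the raw file bytes to the parameter placeholders.
INSERT INTO "MYSCHEMA"."DOCUMENTS" (
  "FILE_NAME"
, "MIME_TYPE"
, "UPLOADED_AT"
, "FILE_CONTENT"
)
VALUES (?, ?, CURRENT_TIMESTAMP, ?);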

This approach is fine for exploring just the text analysis features, of course, but it is not of much use if you want to create an end-user-facing application. What we would like to offer instead is a web application or mobile app (for example, based on UI5) that allows end users to upload their own documents to the database through an easy-to-use graphical user interface.

What about an .xsjs script?

Now, it is entirely possible to come up with a solution that is somewhat similar to the client-side Python script. For example, we could write an .xsjs script that runs on the HANA server. This script would then handle an HTTP POST request and receive the raw document bytes in the request body.

Then, the .xsjs script would run a similar kind of SQL statement, and store the received data in the BLOB column. (An example .xsjs script using SQL can be found here.)

Drawbacks of an .xsjs approach

However, there are a couple of drawbacks to this approach.

First of all, it is unlikely that a user-facing application would only want to upload the binary content of the document. Most likely, some application-specific metadata will need to be stored as well.

Another problem is that we would need to write specific code to handle a very specific kind of request: uploading data to one particular table - that is, the unstructured document content itself, plus whatever structured data the application needs to associate with it. The script will need to refer to the table and its columns by name (or perhaps via a stored procedure that does the actual loading - same difference), and when the name of the table or one of its columns changes, the script must be changed as well.

This same issue bites back when you need to transport such a solution to another HANA system. Since there is no formal dependency between the .xsjs script and the database catalog objects it references, this poses a bit of a challenge for the HANA transport system, which is typically used for these kinds of tasks. At the very least, such a dependency is not registered anywhere and thus needs to be managed manually.

Another thing to keep in mind is that if an application can write data somewhere, it generally also needs to read data from that source. For example, the user may want to search some of the metadata fields to see if a particular file was already uploaded, or to see when a particular file was last uploaded, or to update a previously uploaded file.

Now, we could certainly write the .xsjs script so that it does those things as well. But the point is: even though we minimally need only a service that allows us to upload the document, that interface is incomplete from the application's point of view, so the actual script will need to implement many more data operations than just the upload. And so, what seemed like a small task - writing a simple script that does one simple thing - becomes a pretty serious task: writing a fairly full-featured service that does a whole bunch of things. And even if writing it would seem okay, it then needs to be maintained and documented, and so on.

Finally, writing a server-side script that executes SQL could introduce a security risk. Even though it might be written in a way that is safe, one actually needs to make sure that it is indeed written safely, and this needs to be verified for every piece of functionality that the service offers.

So, in summary - even though it may seem like a simple task, a realistic, practical solution that is safe, maintainable and fully featured is not so trivial at all. Not with .xsjs anyway! It will require substantial effort to write, document and maintain. That time and effort would be much better spent on designing and building actual business features.

Using OData instead

HANA also offers an OData implementation. HANA OData services are defined in .xsodata service definitions, and they pretty much solve all the drawbacks mentioned above:
  • The web API is generic, and works the same for just about any database table-like object you want to expose
  • The API is complete and well defined, and supports all CRUD data operations you need in practice
  • HANA registers the dependency between the .xsodata file and the repository or database catalog objects it references. This is essential to create transportable solutions.
  • Typically, changes in table structure are picked up automatically by the .xsodata service, which makes this a low-maintenance task. Achieving the same with an .xsjs service implies another level of complexity, which is certainly doable but increases development and maintenance effort.
So: what's not to like about OData? We'll certainly get back to that topic, especially in the last part of the series! But for now, let's just be happy with the benefits of .xsodata over .xsjs.

Summary

We learned that we must create a table with a BLOB column to hold our documents, and a FULLTEXT INDEX so HANA knows to analyze their contents. After some consideration, we decided to try whether we can use HANA's OData implementation to upload document content to such a table.

In the next installment, we will explain in more detail how to build these backend objects.
