- Part 1 - an Overview: SAP HANA Text Analysis on Documents uploaded by an end-user
- Part 2 - Hands on: Building the backend for a SAP HANA Text Analysis application
- Part 3 - Presenting: A UI5 front-end to upload documents and explore SAP HANA Text Analytics features
- Part 4 - Deep dive: How to upload documents with OData in a UI5 Application
Exploring HANA Text Analysis
The main use case of our concern is SAP HANA Text Analytics. SAP HANA's Text Analysis features let you extract tokens and semantics from various sources. Text analysis is not limited to plaintext but is also supported for binary documents in various formats, such as PDF and Microsoft Office documents such Word documents and Excel workbooks. After analysis, the analysis result may then be used for further processing.Business cases that might utilize text analysis features include automated classification of invoices or reimbursment requests, matching CV's from employees or job applicants to vacancies, and detection of plagiary, to name just a few.
The hard work of converting the documents, and performing the actual text analysis is all handled fully by HANA, which is great. In our specific case, this process includes the conversion of binary documents in PDF format to text.
What Document Types can HANA handle?
To find out which types and formats your HANA instance can handle, run a query on"SYS"."M_TEXT_ANALYSIS_MIME_TYPES"
:
SELECT *
FROM "SYS"."M_TEXT_ANALYSIS_MIME_TYPES";
+---------------------------------------------------------------------------+--------------------------------------------+
|MIME_TYPE_NAME | MIME_TYPE_DESCRIPTION |
+---------------------------------------------------------------------------+--------------------------------------------+
| text/plain | Plain Text |
| text/html | HyperText Markup Language |
| text/xml | Extensible Markup Language |
| application/x-cscompr | SAP compression format |
| application/x-abap-rawstring | ABAP rawstring format |
| application/msword | Microsoft Word |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Microsoft Word |
| application/vnd.ms-powerpoint | Microsoft PowerPoint |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | Microsoft PowerPoint |
| application/vnd.ms-excel | Microsoft Excel |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Microsoft Excel |
| application/rtf | Rich Text Format |
| application/vnd.ms-outlook | Microsoft Outlook e-mail (".msg") messages |
| message/rfc822 | Generic e-mail (".eml") messages |
| application/vnd.oasis.opendocument.text | Open Document Text |
| application/vnd.oasis.opendocument.spreadsheet | Open Document Spreadsheet |
| application/vnd.oasis.opendocument.presentation | Open Document Presentation |
| application/vnd.wordperfect | WordPerfect |
| application/pdf | Portable Document Format |
+---------------------------------------------------------------------------+--------------------------------------------+
HANA Text Analysis Applications: pre-requisites
To use the text analysis features, you need to- Create a database table with a column having the
BLOB
data type. SAP HANA's text analysis features also work with plaintext, but for our specific use case we are interested in analyzing PDF documents. From the point of view of the application and database storage, these documents are binary files, which is why end up with aBLOB
. - Create a
FULLTEXT INDEX
on thatBLOB
column, which configures all text analysis features we need, such as tokenization, stemming, and semantic extraction.
BLOB
column, and SAP HANA will do the rest (more or less automatically). The text analysis results can then be collected from a $TA
table and used for further, application specific processing.
Uploading Document Content
Now, as humble a task as it may seem, storing binary document content into aBLOB
column is a bit of a challenge.
Various resources published by SAP (like this one) focus on the text analysis features itself. They only offer a simple and pragmatic suggestion for loading the data into the table, which relies on a (client-side) Python script that reads the file from the client, and then uses a plain SQL
INSERT
-statement to upload the file contents to the BLOB
column.
This approach is fine for exploring just the text analysis features of course, but it's not of much use if you want to create a end user facing application. What we would like to do for them instead, is to offer a web application or mobile app (for example, based on UI5), which would allow end users to upload their own documents to the database with an easy-to use graphical user-interface.
What about a .xsjs
script?
Now, it is entirely possible to come up with a solution that is somewhat similar to the client-side Python script.
For example, we could write an .xsjs
script that runs on the HANA server.
This script would then handle a HTTP POST request and receive the raw document bytes in the request body.
Then, the
.xsjs
script would run a similar kind of SQL statement, and store the received data in the BLOB
column.
(An example .xsjs
script using SQL can be found here.)
Drawbacks of a .xsjs
approach
However, there are a couple of drawbacks to this approach.
First of all, it is unlikely that a user-facing application would only want to upload the binary content of the document. Most likely, some application-specific metadata will need to be stored as well.
Another problem is that we would need to write specifc code to handle a very specific kind of request: uploading data to one particular table - that is: the unstructured document content itself, plus whatever structured data the application needs to associate with it. The script will need to refer to the table and its columns by name (or perhaps via a stored procedure that does the actual loading - same difference), and when the name of the table or one of its columns changes, the script must also be changed.
This same issue bites back when you need to transport such a solution to another HANA system. Since there is no formal dependency between the
.xsjs
script and the database catalog object it references, this poses a bit of a challenge for the HANA transport system which is typically used for these kinds of tasks. At least, such a dependency is not registered anywhere and needs thus be managed manually.
Another thing to keep in mind is that if an application can write data somewhere, it generally also needs to read data from that source. For example, the user may want to search some of the metadata fields to see if a particular file was already uploaded, or to see when a particular file was last uploaded, or to update a previously uploaded file.
Now, we could certainly write the
.xsjs
script so that it does those things as well. But the point is, even though we only minimally need a service that allows us to upload the document, this interface is incomplete from the application's point of view, so the actual script will need to implement much more data operations than only the upload. And so, what seemed like a small task to write a, simple script to do one simple thing, becomes a pretty serious task to write a pretty full featured service which does a whole bunch of things. And even if writing that would seem okay, it then needs to be maintained and documented and so on.
Finally, writing a server side script that executes SQL could introduce a security risk. Even though it might be written in a way that is safe, one actually needs to make sure that it is indeed written safely, and this needs to be implemented for all functionalities that the service offers.
So, in summary - even though it may seem like a simple task, a realistic, practical solution that is safe, maintainable and fully featured is not so trivial at all. Not with .xsjs anyway! It will require substantial effort to write, document and maintain. The time and effort would be much better spent when dedicated to desiging or building actual business features.
Using OData instead
HANA also offers an OData implementation. HANA OData services are defined in.xsodata
service definitions and they pretty much solve all drawbacks mentioned above:- The WEB API is generic, and works the same for just about any database table-like object you want to expose
- The API is complete and well defined, and supports all CRUD data operations you need in practice
- HANA registers the dependency between the
.xsodata
file and the repository or database catalog objects it references. This is essential to create transportable solutions. - Typically, changes in table structure will be automatically picked up by the
.xsodata
service and is a low maintenance task. Writing the.xsjs
service in that way implies another level of complexity which is certainly doable but increases development and maintenance effort.
.xsodata
over .xsjs
.
Summary
We learned that we must create a table with aBLOB
column to hold our documents, and a FULLTEXT INDEX
so HANA knows to analyze the contents. After some consideration we decided to try if we can upload use HANA's OData implementation to upload document content to such a table.
In the next installment, we will explain in more detail how to build these backend objects.
No comments:
Post a Comment