Sunday, December 01, 2019

Building a UI5 Demo for SAP HANA Text Analysis: Part 2

In the previous blog post, I explained some of the prerequisites for building a SAP HANA Text Analysis application, and shared some thoughts on how to expose these features to an end-user facing web application. In this blog post, these considerations are put into practice, and a basic but functional backend is created to support such an application.

Building the HANA Text Analysis Backend

We'll quickly go through the setup so you can try this out yourself. This assumes you have access to a HANA system and development tools (either the SAP Web IDE, HANA Studio, the Eclipse IDE with the SAP HANA Development Tools, or whatever - it doesn't really matter which).

Package Structure

There is a lot that could be said about the proper package structure for HANA applications, but we won't go there now. It's enough to have just enough structure to keep the different responsibilities of the application apart. We settled on a simple base package called ta, which acts as the root package of the demo project, with 3 subpackages:
  • db for anything related to the physical database structure
  • service for the OData service, that exposes the database to our web-application
  • web for the HTML5/UI5 application - i.e. the stuff that is served from HANA to run inside the client's web-browser
Apart from these 3 subpackages, the ta package also contains the two files that are necessary to expose it as a web application: an .xsapp file, which marks the package as the root of the application, and an .xsaccess file, which controls how the package contents are exposed over HTTP. Note that with this setup, all of our subpackages are exposed, whereas only the service and web subpackages are actually required to be exposed. An actual, serious application would only expose whatever is minimally required, and would not expose any packages related to the physical database structure.
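
For reference, a minimal version of these two files could look like this (just a sketch for a demo setup; a real application would also configure things like authentication and authorization in the .xsaccess file):

.xsapp (may simply contain an empty JSON object):
{}

.xsaccess:
{
  "exposed": true
}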

The CT_FILE table

We created a HANA table to hold our uploaded files by creating a file called CT_FILE.hdbdd in the db package. This allows you to maintain the table definition as a repository object, which makes it transportable.

The CT_FILE.hdbdd file has the following contents:
namespace "system-local"."public"."rbouman"."ta"."db";

@Schema: 'RBOUMAN'
@Catalog.tableType: #COLUMN
Entity CT_FILE {
  Key "FILE_NAME"                     : String(256)  not null;
      "FILE_TYPE"                     : String(256)  not null;
      "FILE_LAST_MODIFIED"            : UTCTimestamp not null;
      "FILE_SIZE"                     : Integer      not null;
      "FILE_CONTENT"                  : LargeBinary  not null;
      "FILE_LAST_UPLOADED"            : UTCTimestamp not null ;
} 
technical configuration {
    FULLTEXT INDEX "FT_IDX_CT_FILE" ON ("FILE_CONTENT")
        ASYNCHRONOUS
        LANGUAGE DETECTION ('en')
        MIME TYPE COLUMN "FILE_TYPE"
        FUZZY SEARCH INDEX off
        PHRASE INDEX RATIO 0.721
        SEARCH ONLY OFF
        FAST PREPROCESS OFF
        TEXT ANALYSIS ON
        CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS';
};
The important features here are the definition of the FILE_CONTENT column as a LargeBinary, and the FULLTEXT INDEX definition on that column. The particular syntax to define the fulltext index in a .hdbdd table definition is described in the SAP HANA Core Data Services (CDS) Reference, whereas the actual options that are applicable to FULLTEXT INDEXes are described in the SAP HANA SQL and System Views Reference. Finally, the meaning and functionality of the text analysis configurations are described in the SAP HANA Text Analysis Developer Guide.
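
For comparison, a similar index could also be created directly on a catalog table with a plain SQL statement. The following is just a sketch (the option keywords and their order are per my reading of the SQL reference, so double-check against your HANA version; note that in plain SQL the keyword is ASYNC rather than ASYNCHRONOUS):

CREATE FULLTEXT INDEX "FT_IDX_CT_FILE"
    ON "RBOUMAN"."system-local.public.rbouman.ta.db::CT_FILE" ("FILE_CONTENT")
    ASYNC
    LANGUAGE DETECTION ('EN')
    MIME TYPE COLUMN "FILE_TYPE"
    FUZZY SEARCH INDEX OFF
    CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS'
    TEXT ANALYSIS ON;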

The short of it is that with this configuration, the (binary) content of documents stored in the FILE_CONTENT column will be analyzed automatically. The results of the analysis will be stored in a separate $TA_ table called $TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE. This table is created and maintained by the HANA system. The structure of this $TA_ table is described in the SAP HANA Text Analysis Developer Guide.
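
Once the analysis has run, you can inspect its output with a straightforward query on the $TA_ table. For example, to list all analysis results for one particular document (the file name used here is just a placeholder):

SELECT "TA_RULE", "TA_COUNTER", "TA_TOKEN", "TA_TYPE", "TA_NORMALIZED"
FROM   "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
WHERE  "FILE_NAME" = 'some-document.pdf'
ORDER BY "TA_RULE", "TA_COUNTER";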

ASYNCHRONOUS

I just mentioned that the analysis results will be stored in the $TA_ table automatically. While this is true, the analysis does not occur immediately. This is because the FULLTEXT INDEX is created with the ASYNCHRONOUS option. This allows HANA to store documents in the CT_FILE table without having to wait for the text analysis process to finish.

We could debate the advantages and drawbacks of the ASYNCHRONOUS option, and whether it would make more sense to specify SYNCHRONOUS instead, or to leave the option out altogether (in which case SYNCHRONOUS would be implied). However, there is a very simple reason why it is currently specified as ASYNCHRONOUS: if a FULLTEXT INDEX specifies a CONFIGURATION option, then it must be specified as ASYNCHRONOUS, or else the following error occurs upon activation:
CONFIGURATION not supported with synchronous processing
For actual analysis, we really do need the CONFIGURATION option, as it offers all the truly interesting properties of text analysis. So, it seems there's just no way around it - text analysis results are collected in the background, and finish at some point after our document is stored in the table. And there seems to be no way of finding out whether the analysis is finished, or even if it is still busy. For instance, this makes it impossible to determine whether a recently uploaded document is still being analyzed, or whether the document was not eligible for text analysis at all: in both cases, the analysis results will remain absent.

That said, even though the FULLTEXT INDEX is specified as ASYNCHRONOUS, HANA should let you specify when the analysis results are to be updated. At least, according to the SAP HANA SQL and System Views Reference, it is possible to specify a FLUSH [QUEUE] <flush_queue_elem> clause right after the ASYNCHRONOUS option, with <flush_queue_elem> indicating either a time interval (expressed as a number of minutes) or a number of documents. So, in theory, it would be possible to write:
ASYNCHRONOUS FLUSH QUEUE AFTER 1 DOCUMENTS
which would indicate that the update of the analysis results should kick in as soon as a new document has been loaded.

Unfortunately, on the HANA System I have access to, this results in the following error upon activation:
Flush based on documents/minutes not yet supported
The same error message occurs when I try ASYNCHRONOUS FLUSH QUEUE EVERY 1 MINUTES instead.

So, it looks like we'll just have to live with this for now. I did some checks and I noticed that analysis kicks in after a couple of seconds, but this is on a system that is not very heavily used. So for the purpose of exploration it's not too bad, but this does seem like it could become a problem for real-time applications (like chatbots).

The Key

Another thing worth mentioning here is the key of the CT_FILE table. For this very simple demo application, we chose to make only the FILE_NAME column the primary key of the table. The choice of key will depend on what kind of application you're building. In many practical cases you might not care about the physical name of the uploaded file at all, and a name given by the uploader might be a better choice. Or maybe you don't care about names at all, only about whether the content of the document may be considered unique, in which case some hash of the file contents may be a suitable choice.

No matter what key you choose for the FULLTEXT INDEXed table, the column definitions that make up the key are copied to the corresponding $TA_ table in order to maintain the relationship between the analysis results and the original source document, effectively making those columns a foreign key. Note however that HANA does not automatically create a FOREIGN KEY constraint to enforce referential integrity; you may add such a constraint yourself. This may be useful in particular to cascade deletes on the document table to the text analysis results table. (Adding such a constraint manually is suggested in the introduction of the SAP HANA Text Analysis Developer Guide.)
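
For this demo, such a cascading foreign key could be added with a statement along the following lines (just a sketch: the constraint name is made up, and the constraint would have to be re-created if the fulltext index, and hence the $TA_ table, is ever dropped and recreated):

ALTER TABLE "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
  ADD CONSTRAINT "FK_TA_CT_FILE"
  FOREIGN KEY ("FILE_NAME")
  REFERENCES "RBOUMAN"."system-local.public.rbouman.ta.db::CT_FILE" ("FILE_NAME")
  ON DELETE CASCADE;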

The primary key of the $TA_ table consists of the key columns from the document table, plus two additional columns that identify an individual analysis result: TA_RULE and TA_COUNTER, where TA_RULE is the type of analysis that yielded the result, and TA_COUNTER is an integer that identifies, in order, each analysis result within a document and within a particular analysis type.

The .xsodata Service Definition

We expose both the CT_FILE table and the $TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE table via an OData service. The OData service is defined by creating a .xsodata service definition file called ta.xsodata in the service subpackage.

The contents of the ta.xsodata service definition file are shown below:
service {

    entity "system-local.public.rbouman.ta.db::CT_FILE" as "Files";
    
    entity "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE" as "TextAnalysis";
}
annotations {
  enable OData4SAP;
}
settings {
  support null;
}
This creates an OData service and maps our two tables, CT_FILE and the $TA_ table, to the OData EntitySets Files and TextAnalysis respectively. The OData service will be available at a URL whose path corresponds to the fully qualified package name of the .xsodata file.

Note that the syntax for mapping the tables to EntitySets is slightly different, depending upon whether the table is created as a repository object or as a database catalog object:
  • for CT_FILE, it is the package name containing the table's .hdbdd file, followed by 2 colons, and then followed by the local table name.
  • for the $TA_ table, it is the (quoted) database schema name, followed by a dot, and then followed by the quoted table name.
The reason for the difference is that we only maintain the CT_FILE table as a repository object. There is no corresponding repository object for the $TA_ table, since HANA creates that autonomously as a result of the full text index on CT_FILE. Since the $TA_ table is created automatically, we can assume the entire thing is transportable as it is, as long as we make sure we maintain our document table as a repository object, and refer to it in the .xsodata file using its repository object name.

Activation and Verification

Now that we have all these artifacts, we should try and activate our package and test the service. You can either attempt to activate the entire package, or activate each file individually. For the latter, you need to make sure to activate the .hdbdd file before attempting to activate the .xsodata file, because the .xsodata file is dependent upon the existence of the tables in the database catalog.

After successful activation, you can check the service by navigating to its service document or metadata document in your web browser. These documents should be available at the following URLs:
  • Service Document: http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata
  • Metadata Document: http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata/$metadata
where:
  • yourhanahost is the hostname or IP address of your SAP HANA system
  • yourxsport is the port number of your xsengine. This is normally 80 followed by the two-digit HANA instance number. For example, if the instance number is 10, the port will be 8010
  • path/to/your/package is the path you get when you take the identifier of the package containing the db, service and web subpackages, and replace each dot (.) that separates the individual package names with a slash (/).
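
For example, with the package structure used in this post, and assuming a HANA host named myhanahost with instance number 00 (both placeholders), the following URL would return the text analysis results for one particular uploaded document as JSON:

http://myhanahost:8000/system-local/public/rbouman/ta/service/ta.xsodata/TextAnalysis?$filter=FILE_NAME eq 'some-document.pdf'&$format=json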

Inspect the $metadata document of the service

If all is well, your $metadata document should look something like this:

<edmx:Edmx 
  xmlns:edmx="http://schemas.microsoft.com/ado/2007/06/edmx"
  xmlns:sap="http://www.sap.com/Protocols/SAPData" 
  Version="1.0"
>
  <edmx:DataServices 
    xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" 
    m:DataServiceVersion="2.0"
  >
    <Schema 
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://schemas.microsoft.com/ado/2008/09/edm"
      Namespace="system-local.public.rbouman.ta.service.ta"
    >
      <EntityType Name="FilesType">
        <Key>
          <PropertyRef Name="FILE_NAME"/>
        </Key>
        <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
        <Property Name="FILE_TYPE" Type="Edm.String" Nullable="false" MaxLength="256"/>
        <Property Name="FILE_LAST_MODIFIED" Type="Edm.DateTime" Nullable="false"/>
        <Property Name="FILE_SIZE" Type="Edm.Int32" Nullable="false"/>
        <Property Name="FILE_CONTENT" Type="Edm.Binary" Nullable="false"/>
        <Property Name="FILE_LAST_UPLOADED" Type="Edm.DateTime" Nullable="false"/>
      </EntityType>
      <EntityType Name="TextAnalysisType">
        <Key>
          <PropertyRef Name="FILE_NAME"/>
          <PropertyRef Name="TA_RULE"/>
          <PropertyRef Name="TA_COUNTER"/>
        </Key>
        <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
        <Property Name="TA_RULE" Type="Edm.String" Nullable="false" MaxLength="200"/>
        <Property Name="TA_COUNTER" Type="Edm.Int64" Nullable="false"/>
        <Property Name="TA_TOKEN" Type="Edm.String" MaxLength="5000"/>
        <Property Name="TA_LANGUAGE" Type="Edm.String" MaxLength="2"/>
        <Property Name="TA_TYPE" Type="Edm.String" MaxLength="100"/>
        <Property Name="TA_NORMALIZED" Type="Edm.String" MaxLength="5000"/>
        <Property Name="TA_STEM" Type="Edm.String" MaxLength="5000"/>
        <Property Name="TA_PARAGRAPH" Type="Edm.Int32"/>
        <Property Name="TA_SENTENCE" Type="Edm.Int32"/>
        <Property Name="TA_CREATED_AT" Type="Edm.DateTime"/>
        <Property Name="TA_OFFSET" Type="Edm.Int64"/>
        <Property Name="TA_PARENT" Type="Edm.Int64"/>
      </EntityType>
      <EntityContainer 
        Name="ta" 
        m:IsDefaultEntityContainer="true"
      >
        <EntitySet Name="Files" EntityType="system-local.public.rbouman.ta.service.ta.FilesType"/>
        <EntitySet Name="TextAnalysis" EntityType="system-local.public.rbouman.ta.service.ta.TextAnalysisType"/>
      </EntityContainer>
    </Schema>
  </edmx:DataServices>
</edmx:Edmx>

Summary

In this installment, we executed the plan formulated in part 1 of this series. We should now have a functional back-end which we may use to support our front-end application.

In the next installment we will present a front-end application, and explain how you can obtain it and install it yourself on your own SAP HANA System.
