- Part 1 - an Overview: SAP HANA Text Analysis on Documents uploaded by an end-user
- Part 2 - Hands on: Building the backend for a SAP HANA Text Analysis application
- Part 3 - Presenting: A UI5 front-end to upload documents and explore SAP HANA Text Analytics features
- Part 4 - Deep dive: How to upload documents with OData in a UI5 Application
Building the HANA Text Analysis Backend
We'll quickly go through the setup so you can try this out yourself. This assumes you have access to a HANA system and development tools (either the SAP Web IDE, HANA Studio, or the Eclipse IDE with the SAP HANA Development tools - it doesn't really matter which).
Package Structure
There is a lot that could be said about the proper package structure for HANA applications, but we won't go there now. It's enough to have something with just enough structure to keep the different responsibilities of the application apart. We settled for a simple base package called ta, which acts as the root package of the demo project, with 3 subpackages:
- db for anything related to the physical database structure
- service for the OData service that exposes the database to our web application
- web for the HTML5/UI5 application - i.e. the stuff that is served from HANA to run inside the client's web browser
The ta package also contains these 2 files, which are necessary to expose it as a web application:
- .xsapp - an empty file to make the contents of the package available via the XS webserver.
- .xsaccess - a configuration file for managing access and authorizations for the web application; a minimal sketch follows below.
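Such a .xsaccess file might look like the snippet below. This is only a bare-bones sketch that exposes the package and requests form-based logon - the exact options you need depend on your XS version and your security requirements:

{
    "exposed": true,
    "authentication": {
        "method": "Form"
    }
}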
The CT_FILE table
We created a HANA table to hold our uploaded files by creating a file called CT_FILE.hdbdd in the db package. This allows you to maintain the table definition as a repository object, which makes it transportable.
The CT_FILE.hdbdd file has the following contents:

namespace "system-local"."public"."rbouman"."ta"."db";
@Schema: 'RBOUMAN'
@Catalog.tableType: #COLUMN
Entity CT_FILE {
    Key "FILE_NAME"      : String(256) not null;
    "FILE_TYPE"          : String(256) not null;
    "FILE_LAST_MODIFIED" : UTCTimestamp not null;
    "FILE_SIZE"          : Integer not null;
    "FILE_CONTENT"       : LargeBinary not null;
    "FILE_LAST_UPLOADED" : UTCTimestamp not null;
}
technical configuration {
    FULLTEXT INDEX "FT_IDX_CT_FILE" ON ("FILE_CONTENT")
        ASYNCHRONOUS
        LANGUAGE DETECTION ('en')
        MIME TYPE COLUMN "FILE_TYPE"
        FUZZY SEARCH INDEX off
        PHRASE INDEX RATIO 0.721
        SEARCH ONLY OFF
        FAST PREPROCESS OFF
        TEXT ANALYSIS ON
        CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS';
};
The important features here are the definition of the FILE_CONTENT column as a LargeBinary, and the FULLTEXT INDEX defined on that column.
The particular syntax to define the fulltext index in a .hdbdd table definition is described in the SAP HANA Core Data Services (CDS) Reference, whereas the actual options applicable to FULLTEXT INDEXes are described in the SAP HANA SQL and System Views Reference. Finally, guidance on the meaning and functionality of the text analysis configurations can be found in the SAP HANA Text Analysis Developer Guide.
The short of it is that with this configuration, the (binary) content of documents stored in the FILE_CONTENT column will be analyzed automatically. The results of the analysis will be stored in a separate $TA_ table called $TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE. This table is created and maintained by the HANA system. The structure of this $TA_ table is described in the SAP HANA Text Analysis Developer Guide.
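Once a document has been analyzed, you can get a feel for these results by simply querying the $TA_ table. A minimal sketch, assuming the RBOUMAN schema from the table definition above and a hypothetical uploaded file called example.pdf:

SELECT "TA_RULE"
,      "TA_COUNTER"
,      "TA_TOKEN"
,      "TA_TYPE"
,      "TA_NORMALIZED"
FROM   "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
WHERE  "FILE_NAME" = 'example.pdf'
ORDER BY "TA_RULE", "TA_COUNTER";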
ASYNCHRONOUS
I just mentioned that the analysis results will be stored in the $TA_ table automatically. While this is true, the analysis does not occur immediately. This is because the FULLTEXT INDEX is created with the ASYNCHRONOUS option. This allows HANA to store documents in the CT_FILE table without having to wait for the text analysis process to finish.
We could debate the advantages and drawbacks of the ASYNCHRONOUS option, and whether it would make more sense to specify SYNCHRONOUS instead, or leave the option out altogether (in which case SYNCHRONOUS would be implied). However, there is a very simple reason why it is currently specified as ASYNCHRONOUS: if a FULLTEXT INDEX specifies a CONFIGURATION option, then it must be specified as ASYNCHRONOUS, or else the following error occurs upon activation:

CONFIGURATION not supported with synchronous processing

For actual analysis, we really do need the CONFIGURATION option, as it offers all the truly interesting properties of text analysis. So, it seems there's just no way around it - text analysis results are collected in the background, and finish at some point after our document is stored in the table. And there seems to be no way of finding out whether the analysis is finished, or even whether it is still busy. For instance, this makes it impossible to determine whether a recently uploaded document is still being analyzed, or whether the document was not eligible for text analysis at all: in both cases, the analysis results will remain absent.
That said, even though the FULLTEXT INDEX is specified as ASYNCHRONOUS, HANA will let you specify when the analysis results should be updated. At least, according to the SAP HANA SQL and System Views Reference, it is possible to specify a FLUSH [QUEUE] <flush_queue_elem> clause right after the ASYNCHRONOUS option, with <flush_queue_elem> indicating either a time interval (expressed as a number of minutes) or a number of documents. So, in theory, it would be possible to write:

ASYNCHRONOUS FLUSH QUEUE AFTER 1 DOCUMENTS

which would indicate that the update of the analysis should kick in as soon as a new document has been loaded.
Unfortunately, on the HANA System I have access to, this results in the following error upon activation:
Flush based on documents/minutes not yet supported

The same error message occurs when I try ASYNCHRONOUS FLUSH QUEUE EVERY 1 MINUTES instead.
So, it looks like we'll just have to live with this for now. I did some checks and I noticed that analysis kicks in after a couple of seconds, but this is on a system that is not very heavily used. So for the purpose of exploration it's not too bad, but this does seem like it could become a problem for real-time applications (like chatbots).
The Key
Another thing worth mentioning here is the key of the CT_FILE table. For this very simple demo application, we chose to make only the FILE_NAME column the primary key of the table. The choice of key will depend on what kind of application you're building. In many practical cases you might not care about the physical name of the uploaded file at all, and a name given by the uploader might be a better choice. Or maybe you don't care about names at all, only about whether the content of the document may be considered unique, in which case some hash of the file contents may be a suitable choice; a sketch of that idea follows below.
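To illustrate the last option: a content hash could be computed with HANA's built-in HASH_SHA256 function. This is only a sketch of the idea - it assumes a HANA version that provides HASH_SHA256, and the hash-key design itself is not part of the demo:

SELECT HASH_SHA256("FILE_CONTENT") AS "FILE_HASH"
FROM   "RBOUMAN"."system-local.public.rbouman.ta.db::CT_FILE"
WHERE  "FILE_NAME" = 'example.pdf';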
No matter what key you choose for the FULLTEXT INDEXed table, the column definitions that make up the key are copied to the corresponding $TA_ table in order to maintain the relationship between the analysis results and the original source document, thus using those columns as a foreign key. Note however that HANA does not automatically create a FOREIGN KEY constraint to enforce referential integrity. But you may add such a constraint yourself. This may be useful in particular to cascade deletes on the document table to the text analysis results table.
(Adding such a constraint manually is suggested in the introduction of the SAP HANA Text Analysis Developer Guide.)
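For the demo tables, such a constraint might look like the statement below. This is a sketch: the constraint name is made up, and the schema and generated $TA_ table name are the ones used earlier in this post:

ALTER TABLE "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
    ADD CONSTRAINT "FK_TA_CT_FILE"
    FOREIGN KEY ("FILE_NAME")
    REFERENCES "RBOUMAN"."system-local.public.rbouman.ta.db::CT_FILE" ("FILE_NAME")
    ON DELETE CASCADE;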
The primary key of the $TA_ table consists of the key columns from the document table, plus two additional columns that identify an individual analysis result: TA_RULE and TA_COUNTER, where TA_RULE is the type of analysis that yielded the result, and TA_COUNTER is an integer that identifies, in order, each analysis result within a document and within a particular analysis type.
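A quick way to see these two columns at work is to count the analysis results per rule for a single document; again a sketch against the demo tables:

SELECT "TA_RULE"
,      COUNT(*) AS "RESULT_COUNT"
FROM   "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
WHERE  "FILE_NAME" = 'example.pdf'
GROUP BY "TA_RULE";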
The .xsodata Service Definition
We expose both the CT_FILE and the $TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE tables via an OData service. The OData service is created by creating a .xsodata service definition file called ta.xsodata in the service subpackage.
The contents of the ta.xsodata service definition file are shown below:

service {
    entity "system-local.public.rbouman.ta.db::CT_FILE" as "Files";
    entity "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE" as "TextAnalysis";
}
annotations {
    enable OData4SAP;
}
settings {
    support null;
}
This creates an OData service and maps our two tables, CT_FILE and the $TA_ table, to the OData EntitySets Files and TextAnalysis respectively. The OData service will be available at a URL whose path corresponds to the fully qualified package name of the .xsodata file.
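Once activated, the entity sets can be queried like any other OData collection. For example (hypothetical host, port and file name; the query options are standard OData):

http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata/Files?$format=json
http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata/TextAnalysis?$filter=FILE_NAME eq 'example.pdf'&$format=json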
Note that the syntax for mapping the tables to EntitySets is slightly different, depending upon whether the table is created as a repository object or as a database catalog object:
- for CT_FILE, it is the package name containing the table's .hdbdd file, followed by 2 colons, and then followed by the local table name.
- for the $TA_ table, it is the (quoted) database schema name, followed by a dot, and then followed by the quoted table name.
This difference exists because we created the CT_FILE table as a repository object. There is no corresponding repository object for the $TA_ table, since HANA creates that autonomously as a result of the full text index on CT_FILE. Since the $TA_ table is created automatically, we can assume the entire thing is transportable as it is, as long as we make sure we maintain our document table as a repository object, and refer to it in the .xsodata file using its repository object name.
Activation and Verification
Now that we have all these artifacts, we should try and activate our package and test the service. You can either attempt to activate the entire package, or activate each file individually. For the latter, you need to make sure to activate the .hdbdd file before attempting to activate the .xsodata file, because the .xsodata file is dependent upon the existence of the tables in the database catalog.
After successful activation, you can attempt to visit the service by navigating to its service document or metadata document using your web browser. These documents should be available at the following URLs:
- Service Document: http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata
- Metadata Document: http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata/$metadata
where:
- yourhanahost is the hostname or IP address of your SAP HANA system
- yourxsport is the port of your xsengine. This is normally 80 followed by the HANA instance number. For example, if the instance number is 10, the port will be 8010
- path/to/your/package is the path you get when you take the package identifier where you put the db, service and web subpackages and replace the dot (.) that separates the individual package names with a slash (/)
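So, for the demo package used throughout this post and an instance number of 10, the metadata document would live at a URL like this (the hostname is of course made up):

http://myhanahost:8010/system-local/public/rbouman/ta/service/ta.xsodata/$metadata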
Inspect the $metadata document of the service
If all is well, your $metadata document should look something like this:
<edmx:Edmx
    xmlns:edmx="http://schemas.microsoft.com/ado/2007/06/edmx"
    xmlns:sap="http://www.sap.com/Protocols/SAPData"
    Version="1.0"
>
    <edmx:DataServices
        xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
        m:DataServiceVersion="2.0"
    >
        <Schema
            xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
            xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
            xmlns="http://schemas.microsoft.com/ado/2008/09/edm"
            Namespace="system-local.public.rbouman.ta.service.ta"
        >
            <EntityType Name="FilesType">
                <Key>
                    <PropertyRef Name="FILE_NAME"/>
                </Key>
                <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
                <Property Name="FILE_TYPE" Type="Edm.String" Nullable="false" MaxLength="256"/>
                <Property Name="FILE_LAST_MODIFIED" Type="Edm.DateTime" Nullable="false"/>
                <Property Name="FILE_SIZE" Type="Edm.Int32" Nullable="false"/>
                <Property Name="FILE_CONTENT" Type="Edm.Binary" Nullable="false"/>
                <Property Name="FILE_LAST_UPLOADED" Type="Edm.DateTime" Nullable="false"/>
            </EntityType>
            <EntityType Name="TextAnalysisType">
                <Key>
                    <PropertyRef Name="FILE_NAME"/>
                    <PropertyRef Name="TA_RULE"/>
                    <PropertyRef Name="TA_COUNTER"/>
                </Key>
                <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
                <Property Name="TA_RULE" Type="Edm.String" Nullable="false" MaxLength="200"/>
                <Property Name="TA_COUNTER" Type="Edm.Int64" Nullable="false"/>
                <Property Name="TA_TOKEN" Type="Edm.String" MaxLength="5000"/>
                <Property Name="TA_LANGUAGE" Type="Edm.String" MaxLength="2"/>
                <Property Name="TA_TYPE" Type="Edm.String" MaxLength="100"/>
                <Property Name="TA_NORMALIZED" Type="Edm.String" MaxLength="5000"/>
                <Property Name="TA_STEM" Type="Edm.String" MaxLength="5000"/>
                <Property Name="TA_PARAGRAPH" Type="Edm.Int32"/>
                <Property Name="TA_SENTENCE" Type="Edm.Int32"/>
                <Property Name="TA_CREATED_AT" Type="Edm.DateTime"/>
                <Property Name="TA_OFFSET" Type="Edm.Int64"/>
                <Property Name="TA_PARENT" Type="Edm.Int64"/>
            </EntityType>
            <EntityContainer
                Name="ta"
                m:IsDefaultEntityContainer="true"
            >
                <EntitySet Name="Files" EntityType="system-local.public.rbouman.ta.service.ta.FilesType"/>
                <EntitySet Name="TextAnalysis" EntityType="system-local.public.rbouman.ta.service.ta.TextAnalysisType"/>
            </EntityContainer>
        </Schema>
    </edmx:DataServices>
</edmx:Edmx>
Summary
In this installment, we executed the plan formulated in part 1 of this series. We should now have a functional back-end which we may use to support our front-end application. In the next installment we will present a front-end application, and explain how you can obtain it and install it yourself on your own SAP HANA system.