Roland Bouman's blog - Programming - Databases - Analytics

<h1>UI5 Tips: Persistent UI State</h1>
<i>January 3, 2024</i>
This tip provides a way to centrally manage UI state, and to persist it - automatically, and without requiring intrusive custom code sprinkled throughout your apps.
<h1>The UI State</h1>
Many UI5 controls and widgets allow some aspect of their appearance or behavior to be changed by the user. For example, a panel may be collapsed or expanded, a tab may be selected, column widths in a data grid may be adjusted, and so on. We call all of this the UI state.
When the user restarts the app, the UI state is normally reset: properties that were explicitly set are reinitialized to those values, and properties that were not explicitly assigned get some default, which may be either a constant or a calculated value, depending on how the component is coded.
A reset of the UI state may not always be desirable. For example, if the user has to go through multiple clicks and selections before they arrive at a certain item inside the application that interests them, it will be frustrating to have to repeat that sequence the next time they open the application. Fortunately, for these use cases, UI5 offers <a href="https://sapui5.hana.ondemand.com/#/topic/e5200ee755f344c8aef8efcbab3308fb" rel="nofollow">routing and navigation</a>, which lets the user find content inside the application by navigating to a particular URL.
However, not all ui state is about navigation. For example, the user may collapse a panel to get a bit more screen real estate, or resize the width of a column in a data grid, or toggle the state of a checkbox that controls some application-wide setting. These cases are clearly not navigational in nature, but have to do with layout and presentation. It would be confusing for the user to control this by visiting a particular url. Rather, we'd like the application to be able to retain the ui state exactly as the user left it.
In this tip we will explain a way to achieve this that does not require any application-specific code: it can all be handled centrally and automatically.
<h1>Sample Application</h1>
The sample application for this tip is in the <a href="https://github.com/just-bi/ui5tips/tree/main/uistate"><code>uistate</code></a> directory. Simply expose the contents of the directory with your webserver and use your browser to navigate to <code>index.html</code>. A screenshot is shown below:
<img src="https://github.com/just-bi/ui5tips/raw/main/uistate/images/screenshot.png?raw=true" alt="Screenshot of the UI State App" />
<h2>Sample Application Features</h2>
The application has the following features:
<ul>
<li>A <a href="https://github.com/just-bi/ui5tips/tree/main/uistate/components/mainpage">Main page</a> with a splitter (<a href="https://openui5.hana.ondemand.com/api/sap.ui.layout.Splitter" rel="nofollow"><code>sap.ui.layout.Splitter</code></a>)</li>
<li>On the left, a <a href="https://github.com/just-bi/ui5tips/tree/main/uistate/components/sidebar">Sidebar</a> with a <a href="https://openui5.hana.ondemand.com/api/sap.ui.table.Table" rel="nofollow"><code>sap.ui.table.Table</code></a> showing the Name and Country of a list of companies.</li>
<li>On the right, a <a href="https://github.com/just-bi/ui5tips/tree/main/uistate/components/detailpage">Detail page</a>.</li>
</ul>
Users can click on a company in the sidebar to select it, and then the company will be shown in more detail in the Detail Page.
The Detail Page has some features of its own:
<ul>
<li>In the top, there's a <a href="https://openui5.hana.ondemand.com/api/sap.m.Panel" rel="nofollow"><code>sap.m.Panel</code></a> which shows the Company Name as title. The Panel is <a href="https://openui5.hana.ondemand.com/api/sap.m.Panel#methods/getExpandable" rel="nofollow"><code>expandable</code></a> and <a href="https://openui5.hana.ondemand.com/api/sap.m.Panel#methods/getExpanded" rel="nofollow"><code>expanded</code></a> by default. Inside the panel we can see the company's phone number.
Below the Panel, there's an <a href="https://openui5.hana.ondemand.com/api/sap.m.IconTabBar" rel="nofollow"><code>sap.m.IconTabBar</code></a> with 2 tabs:</li>
<li><a href="https://github.com/just-bi/ui5tips/blob/main/uistate/components/detailpage/DetailsIconTabFilter.fragment.xml">Details</a>, which shows the address of the currently selected company. This tab is also selected by default.</li>
<li><a href="https://github.com/just-bi/ui5tips/blob/main/uistate/components/detailpage/DepartmentsIconTabFilter.fragment.xml">Departments</a>, which shows the departments of the selected company.</li>
</ul>
<h2>Sample Application Demo</h2>
To test the application, try the following sequence of actions:
<ol>
<li>Use the browser to navigate to the index.html page. The sidebar should show the list of Companies, but no company will be selected yet. You can click any row in the sidebar to select a company; if you do, its details will be shown in the detail page. For this demonstration it doesn't matter whether you select one or not.</li>
<li>In the sidebar, the Name column is not wide enough to show the full company name <code>Euismod Ac Fermentum Corp.</code>. Adjust the width of the column by dragging the right end of its header to the right until the full company name is visible.</li>
<li>Also in the sidebar, the Country column is not wide enough to show the full name of the country <code>Congo, the Democratic Republic of the</code>. Adjust the width of that column too so the full name is visible.</li>
<li>After adjusting the column widths in steps 2 and 3, the sidebar will have a horizontal scrollbar at the bottom, as the data grid is now wider than the position of the splitter grip. Drag the splitter grip to the right so both columns of the sidebar are visible and the sidebar's horizontal scrollbar disappears again.</li>
<li>The Panel is expanded and the company's phone number is visible inside it. Click the button to the left of the panel header title to collapse it.</li>
<li>The Details tab is selected by default. Click the Departments tab instead.</li>
</ol>
After all these actions, the application should now look more like this:
<img src="https://github.com/just-bi/ui5tips/raw/main/uistate/images/screenshot-uistate-changed.png?raw=true" alt="Application after changing the UI State" />
If you now refresh the browser window (or even close the browser altogether) and then revisit the application, you will notice that the selection is lost. However, the column widths, the position of the splitter grip, the collapsed state of the panel, and the selected tab have all been preserved.
(You can restore the UI to the original state by pressing the Undo button in the top of the main page.)
<h1>Using the <code>LocalStorageJSONModel</code> to manage UI State</h1>
The UI State behavior is the result of binding all the relevant properties of the UI to a <a href="https://github.com/just-bi/ui5tips/wiki/LocalStorageJSONModel"><code>LocalStorageJSONModel</code></a>.
The model is declared in the sample application's <a href="https://github.com/just-bi/ui5tips/blob/main/uistate/manifest.json"><code>manifest.json</code></a>, so it is instantiated automatically when the application starts and becomes available throughout the application as <code>uistate</code>.
<div>
<pre> <b>"uistate"</b>: {
<b>"type": "ui5tips.utils.LocalStorageJSONModel"</b>,
"dataSource": "uistateTemplate"
}
</pre>
</div>
The model is initialized with the <code>uistateTemplate</code> data source:
<div>
<pre> <b>"uistateTemplate"</b>: {
<b>"uri": "data/uistateTemplate.json"</b>,
"type": "JSON"
}
</pre>
</div>
The datasource grabs its data from <a href="https://github.com/just-bi/ui5tips/blob/main/uistate/data/uistateTemplate.json"><code>data/uistateTemplate.json</code></a>:
<div>
<pre>{
<b>"autoSaveTimeout": 1000</b>,
"storagePrefix": "uistate",
<b>"template": {</b>
"appSettings": {
"sidebar": {
"splitterSize": "431px",
"columns": {
"name": {
"width": "190px"
},
"country": {
"width": "190px"
}
}
},
"detailpage": {
"panel": {
"expanded": true
},
"tabContainer": {
"selectedTab": "details"
}
}
}
<b>}</b>
}
</pre>
</div>
The options for the <code>LocalStorageJSONModel</code> are described in <a href="https://github.com/just-bi/ui5tips/wiki/LocalStorageJSONModel#instantiating-the-localstoragejsonmodel-from-the-manifestjson">the <code>LocalStorageJSONModel</code> wiki page</a>.
For the sample app, the <code>autoSaveTimeout</code> option is the most relevant. Here it is set to <code>1000</code>, which means that whenever the state of the model changes, there is a 1000 ms (1 second) waiting period, after which the data from the model is persisted in the browser's local storage.
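The effect of <code>autoSaveTimeout</code> is essentially a debounce. The sketch below is plain JavaScript with hypothetical names - it is not the actual <code>LocalStorageJSONModel</code> code - but it shows the idea: every change restarts a timer, and persistence only happens once the UI has been quiet for the configured number of milliseconds.

```javascript
// Hypothetical sketch of the autoSaveTimeout behavior: debounce writes
// so a rapid burst of changes results in a single persist call.
function makeAutoSaver(persist, autoSaveTimeout) {
  var timer = null;
  return function onModelChanged(data) {
    if (timer !== null) {
      clearTimeout(timer);
    }
    timer = setTimeout(function () {
      timer = null;
      // only the latest state is written, e.g. to local storage
      persist(JSON.stringify(data));
    }, autoSaveTimeout);
  };
}
```

With a timeout of 1000, dragging the splitter grip around generates many property changes, but only one write happens once the user pauses.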
The template represents the default initial state of the UI. Please refer to the <a href="https://github.com/just-bi/ui5tips/wiki/LocalStorageJSONModel#template-and-model-initialization"><code>LocalStorageJSONModel</code> wiki page</a> for a detailed discussion of the template and model initialization.
<h1>Managing Binding</h1>
The <a href="https://github.com/just-bi/ui5tips/wiki/LocalStorageJSONModel#data-binding"><code>LocalStorageJSONModel</code> wiki page</a> has some general remarks and clarifications about UI5 data binding. But it may not be entirely clear how to practically organize it to manage UI state. After all, even a simple example like the sample application we discuss here already has 5 distinct UI properties that the user can change.
<h2>Template Structure</h2>
One could, in principle, put each and every UI property in one big property bag, and this may be the right choice if the application remains really simple. But as an application gains more features, views, and functionality, one may prefer a data structure that mimics the structure of the application, and that is the approach we have taken in this example.
If you look at <a href="https://github.com/just-bi/ui5tips/blob/main/uistate/data/uistateTemplate.json"><code>data/uistateTemplate.json</code></a>, you'll notice that the <code>template</code> contains an <code>appSettings</code> key, which itself has two keys: one to maintain all settings for the <code>sidebar</code>, and one for all settings for the <code>detailpage</code>:
<div>
<pre> "template": {
"appSettings": {
"sidebar": {
...
},
"detailpage": {
...
}
}
}
</pre>
</div>
Each of these keys is assigned an object, which may itself have further hierarchical structure, depending on the UI control tree.
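Individual settings in such a nested structure are addressed by slash-separated paths. The tiny helper below is illustrative only - it is not part of UI5 or the sample app - but it shows how a model path maps onto the nested template object:

```javascript
// Illustrative only: look up a value in a nested state object using a
// slash-separated path, the way model paths address the template above.
function getByPath(state, path) {
  return path.split("/").filter(Boolean).reduce(function (node, key) {
    return node === undefined ? undefined : node[key];
  }, state);
}

// A fragment of the template from data/uistateTemplate.json:
var template = {
  appSettings: {
    sidebar: { splitterSize: "431px" },
    detailpage: { panel: { expanded: true } }
  }
};
getByPath(template, "/appSettings/sidebar/splitterSize"); // "431px"
```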
<h2>Template Structure and UI Tree Structure</h2>
Once we decide to structure the template hierarchically, it may be tempting to faithfully mimic the actual container/component structure of the UI in the model structure. In our experience, however, this is neither necessary nor productive: the UI tree is only to some extent a reflection of the functional organization of an application's parts.
A simple example from the sample application illustrates this. Let's take a look at the structure of the template that manages the settings for the <code>detail</code> page:
<div>
<pre> "detailpage": {
"panel": {
"expanded": true
},
"tabContainer": {
"selectedTab": "details"
}
}</pre>
</div>
Let's compare this to the UI tree of the detail page, which is defined in <a href="https://github.com/just-bi/ui5tips/blob/main/uistate/components/detailpage/DetailPage.view.xml"><code>DetailPage.view.xml</code></a>:
<div>
<pre> <<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">FixFlex</span> <span class="pl-e">binding</span>=<span class="pl-s"><span class="pl-pds">"</span>{uistate>detailpage}<span class="pl-pds">"</span></span>>
<<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">fixContent</span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">Panel</span>
<span class="pl-e">expandable</span>=<span class="pl-s"><span class="pl-pds">"</span>true<span class="pl-pds">"</span></span>
<span class="pl-e">expandAnimation</span>=<span class="pl-s"><span class="pl-pds">"</span>false<span class="pl-pds">"</span></span>
<span class="pl-e">binding</span>=<span class="pl-s"><span class="pl-pds">"</span>{uistate>panel}<span class="pl-pds">"</span></span>
<span class="pl-e">expanded</span>=<span class="pl-s"><span class="pl-pds">"</span>{uistate>expanded}<span class="pl-pds">"</span></span>
<span class="pl-e">headerText</span>=<span class="pl-s"><span class="pl-pds">"</span>{companies>CompanyName}<span class="pl-pds">"</span></span>
>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">content</span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">Text</span> <span class="pl-e">text</span>=<span class="pl-s"><span class="pl-pds">"</span>Phone: {companies>Phone}<span class="pl-pds">"</span></span>/>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">content</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">Panel</span>>
</<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">fixContent</span>>
<<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">flexContent</span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">IconTabBar</span>
<span class="pl-e">stretchContentHeight</span>=<span class="pl-s"><span class="pl-pds">"</span>true<span class="pl-pds">"</span></span>
<span class="pl-e">applyContentPadding</span>=<span class="pl-s"><span class="pl-pds">"</span>false<span class="pl-pds">"</span></span>
<span class="pl-e">expandable</span>=<span class="pl-s"><span class="pl-pds">"</span>false<span class="pl-pds">"</span></span>
<span class="pl-e">binding</span>=<span class="pl-s"><span class="pl-pds">"</span>{uistate>tabContainer}<span class="pl-pds">"</span></span>
<span class="pl-e">selectedKey</span>=<span class="pl-s"><span class="pl-pds">"</span>{uistate>selectedTab}<span class="pl-pds">"</span></span>
>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
<<span class="pl-ent">core</span><span class="pl-ent">:</span><span class="pl-ent">Fragment</span> <span class="pl-e">fragmentName</span>=<span class="pl-s"><span class="pl-pds">"</span>ui5tips.components.detailpage.DetailsIconTabFilter<span class="pl-pds">"</span></span> <span class="pl-e">type</span>=<span class="pl-s"><span class="pl-pds">"</span>XML<span class="pl-pds">"</span></span> />
<<span class="pl-ent">core</span><span class="pl-ent">:</span><span class="pl-ent">Fragment</span> <span class="pl-e">fragmentName</span>=<span class="pl-s"><span class="pl-pds">"</span>ui5tips.components.detailpage.DepartmentsIconTabFilter<span class="pl-pds">"</span></span> <span class="pl-e">type</span>=<span class="pl-s"><span class="pl-pds">"</span>XML<span class="pl-pds">"</span></span> />
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">IconTabBar</span>>
</<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">flexContent</span>>
</<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">FixFlex</span>></pre>
</div>
While we refer to it as the detail page, the actual UI component used to implement it is a <a href="https://openui5.hana.ondemand.com/api/sap.ui.layout.FixFlex" rel="nofollow"><code>sap.ui.layout.FixFlex</code></a>. But it might just as well have been another type of container, like, say, a <a href="https://openui5.hana.ondemand.com/api/sap.m.Page" rel="nofollow"><code>sap.m.Page</code></a>. The tabs are another example: in this sample application we chose the <a href="https://openui5.hana.ondemand.com/api/sap.m.IconTabBar" rel="nofollow"><code>sap.m.IconTabBar</code></a>, and in the future we might change that to the <a href="https://openui5.hana.ondemand.com/api/sap.m.TabContainer" rel="nofollow"><code>sap.m.TabContainer</code></a>. While these are functionally similar components, details such as property names and aggregation names may differ.
During normal application development, changing and rewriting the UI, and swapping out particular containers for other types of containers, is quite common. Often, most of the functional aspects are retained and expanded, even though the details of the implementation and the choice and structure of UI components may change considerably. If our template mimicked the UI structure too closely, we would have to modify the template along with it, often without any real benefit. So the recommendation is to organize the template according to functionality, not the exact details of the UI tree.
<h2>Managing Binding Paths with Element Binding</h2>
Once you settle on a hierarchical template structure, the problem arises of how to deal with the paths. For example, consider the selected tab in the detail page:
<div>
<pre> "template": {
"appSettings": {
"detailpage": {
"tabContainer": {
"selectedTab": "details"
}
}
}
}
</pre>
</div>
If we had to write the full path for this in a data binding, we would get <code>uistate>/appSettings/detailpage/tabContainer/selectedTab</code>. Obviously, in a realistic application we would have many such properties, and the UI code would soon become littered with these very long paths.
<a href="https://sapui5.hana.ondemand.com/1.36.6/docs/guide/91f05e8b6f4d1014b6dd926db0e91070.html" rel="nofollow">Element binding</a> is a UI5 feature that lets you bind a particular path from the model to a UI container or control, thus establishing a scope. Within that scope, you can use relative paths, which UI5 resolves against the path bound at the higher level.
To understand this feature fully it's best to first look at the component that sits at the top of the UI tree - <a href="https://github.com/just-bi/ui5tips/blob/main/uistate/components/app/App.view.xml"><code>App.view.xml</code></a>:
<div>
<pre> <<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">App</span>
<span class="pl-e">id</span>=<span class="pl-s"><span class="pl-pds">"</span>app<span class="pl-pds">"</span></span>
<b><span class="pl-e">binding</span>=<span class="pl-s"><span class="pl-pds">"</span>{uistate>/appSettings}<span class="pl-pds">"</span></span></b>
>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">pages</span>>
<<span class="pl-ent">mvc</span><span class="pl-ent">:</span><span class="pl-ent">XMLView</span> <span class="pl-e">viewName</span>=<span class="pl-s"><span class="pl-pds">"</span>ui5tips.components.mainpage.MainPage<span class="pl-pds">"</span></span>/>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">pages</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">App</span>></pre>
</div>
Note how the <a href="https://openui5.hana.ondemand.com/api/sap.ui.core.Element#methods/getElementBinding" rel="nofollow"><code>binding</code> property</a> on the <a href="https://openui5.hana.ondemand.com/api/sap.m.App" rel="nofollow"><code>sap.m.App</code></a> is bound to <code>uistate>/appSettings</code>. This means that relative bindings to the <code>uistate</code> model on that component itself, as well as on any components nested within it, will be resolved against <code>uistate>/appSettings</code>.
<a href="https://github.com/just-bi/ui5tips/blob/main/uistate/components/mainpage/MainPage.view.xml">MainPage.view.xml</a> is nested inside the <code>sap.m.App</code>. If we look at that:
<div>
<pre> <<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">Page</span> <span class="pl-e">title</span>=<span class="pl-s"><span class="pl-pds">"</span>UIState App<span class="pl-pds">"</span></span>>
...
<<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">Splitter</span>>
<<span class="pl-ent">mvc</span><span class="pl-ent">:</span><span class="pl-ent">XMLView</span> <span class="pl-e">id</span>=<span class="pl-s"><span class="pl-pds">"</span>sideBar<span class="pl-pds">"</span></span> <span class="pl-e">viewName</span>=<span class="pl-s"><span class="pl-pds">"</span>ui5tips.components.sidebar.SideBar<span class="pl-pds">"</span></span>>
<<span class="pl-ent">mvc</span><span class="pl-ent">:</span><span class="pl-ent">layoutData</span>>
<<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">SplitterLayoutData</span> <b><span class="pl-e">size</span>=<span class="pl-s"><span class="pl-pds">"</span>{uistate>sidebar/splitterSize}<span class="pl-pds">"</span></span></b>/>
</<span class="pl-ent">mvc</span><span class="pl-ent">:</span><span class="pl-ent">layoutData</span>>
</<span class="pl-ent">mvc</span><span class="pl-ent">:</span><span class="pl-ent">XMLView</span>>
<<span class="pl-ent">mvc</span><span class="pl-ent">:</span><span class="pl-ent">XMLView</span> <span class="pl-e">id</span>=<span class="pl-s"><span class="pl-pds">"</span>detailPage<span class="pl-pds">"</span></span> <span class="pl-e">viewName</span>=<span class="pl-s"><span class="pl-pds">"</span>ui5tips.components.detailpage.DetailPage<span class="pl-pds">"</span></span>/>
</<span class="pl-ent">layout</span><span class="pl-ent">:</span><span class="pl-ent">Splitter</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">Page</span>></pre>
</div>
You might notice one property bound to the <code>uistate</code> model here: the <code>size</code> property of the <code>sap.ui.layout.SplitterLayoutData</code> object, which is bound to <code>uistate>sidebar/splitterSize</code>. This binding also refers to the <code>uistate</code> model, but since it does not start with a <code>/</code>, it is a relative path. UI5 will resolve it by going up the UI tree, where it finds the scope for the <code>uistate</code> model established by the element binding in <code>App.view.xml</code>.
In <code>App.view.xml</code>, the binding was to <code>uistate>/appSettings</code>. If we resolve <code>uistate>sidebar/splitterSize</code> against that, the effective path will become <code>uistate>/appSettings/sidebar/splitterSize</code>.
If you look back at the earlier example of <a href="https://github.com/just-bi/ui5tips/blob/main/uistate/components/detailpage/DetailPage.view.xml"><code>DetailPage.view.xml</code></a>, you'll notice that its top-level <code>sap.ui.layout.FixFlex</code> component was bound to <code>uistate>detailpage</code>, effectively causing all relative bindings to the <code>uistate</code> model inside it to be resolved against <code>uistate>/appSettings/detailpage</code>.
Element binding does not only let you use shorter paths: using it consistently also makes it much easier to maintain the structure of the template. Whenever there is a radical change of structure, you should be able to rewire the element bindings without having to change each and every property binding individually.
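The resolution rule described above can be summarized in a few lines of plain JavaScript. This is illustrative only - UI5's actual resolution logic handles more cases, such as named models and binding contexts - but it captures the behavior seen in the sample app:

```javascript
// Illustrative sketch of relative path resolution: absolute paths
// (starting with "/") ignore the context path established by an
// element binding; relative paths are appended to it.
function resolvePath(contextPath, path) {
  if (path.charAt(0) === "/") {
    return path;
  }
  return contextPath + "/" + path;
}

resolvePath("/appSettings", "sidebar/splitterSize");
// "/appSettings/sidebar/splitterSize"
```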
<h1>Finally</h1>
Did you like this tip? Do you have a better tip? Feel free to post a comment and share your approach to the same or similar problem.
Want more tips? Find other posts with the <a href="https://blogs.sap.com/tag/ui5tips/">ui5tips tag</a>!

<h1>UI5 Tips: Persisting JSONModel data using browser Storage</h1>
<i>January 3, 2024</i>
In this UI5 tip, we'll take a look at integrating UI5's <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.model.json.JSONModel" rel="nofollow"><code>sap.ui.model.json.JSONModel</code></a> with the <a href="https://openui5.hana.ondemand.com/api/module:sap/ui/util/Storage" rel="nofollow"><code>sap.ui.util.Storage</code> utility</a>.
Our immediate use case for this was to allow easy and transparent persistence of UI State, which <a href="https://github.com/just-bi/ui5tips/wiki/Persistent-UI-State">has its own dedicated tip</a>. In this tip, we describe how the actual persistence is implemented.
<h1>Sample application: a Shopping List</h1>
To illustrate just the storage model, we developed a tiny Shopping List application. You can run it yourself by downloading the contents of the <a href="https://github.com/just-bi/ui5tips/tree/main/localstoragejsonmodel"><code>localstoragejsonmodel</code> folder</a> and exposing them with your webserver.
This is what the application looks like:
<img src="https://github.com/just-bi/ui5tips/raw/main/localstoragejsonmodel/images/screenshot.png?raw=true" alt="Screenshot of the Shopping List sample application to illustrate the local storage model." />
<h2>Sample Application Features</h2>
<ul>
<li>A Products list (left), and a Shopping list (right). Users can browse the Products list and see name and price. The Products list has a row action button with a shopping cart icon. If the product is already on the shopping list, the shopping cart appears full. Hitting the row action button will add the product to the Shopping List.</li>
<li>In the Shopping List, users see the product name, item price, quantity, and item total. The items in the shopping list also have a row action to remove the item from the shopping list.</li>
</ul>
The Shopping List also has a toolbar with some buttons that control the Shopping List Data:
<ul>
<li>The Save button will save the current contents of the shopping list to the local storage</li>
<li>The Undo button will restore the current contents of the shopping list with whatever data was stored in the local storage</li>
<li>The Submit button represents the action of actually placing an order for the shopping list. It will also clear the shopping list and save.</li>
<li>The Clear button will empty the current contents of the shopping list, but without saving the state to the local storage.</li>
</ul>
<h2>Sample Application Demo</h2>
To test the application, try the following sequence of actions:
<ol>
<li>Open the application. Initially, the Shopping list should be empty.</li>
<li>In the Products list, add a Product to the shopping list by hitting the shopping cart button. The item should be added to the Shopping list.</li>
<li>Refresh the browser window. When the application reloads, you'll notice that the shopping list is empty - that's expected, since you didn't save the shopping list.</li>
<li>Now, repeat step 2 and add some products to the shopping list. Hit the Save button.</li>
<li>Refresh the browser again. Now, when the application reloads, the products you added in step 4 should reappear automatically in the list.</li>
</ol>
This demonstrates that the application is capable of persisting the saved shopping list data. Instead of refreshing the window, you can also completely close the browser, or even reboot your machine. When you revisit the application - with the same browser - you'll notice that the data is still there.
In addition to persistence, the application provides a simple, one-level undo action. Whenever you modify the list, either by adding a new item, removing an item, or changing the quantity of an item, both the Save and the Undo button become enabled. We already demonstrated the Save button. Hitting the Undo button restores the contents of the shopping list to whatever was available in the Storage, that is, the previously saved state.
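The one-level undo boils down to keeping a serialized snapshot of the last saved state. A minimal sketch of the idea (a hypothetical helper, not the sample app's actual code):

```javascript
// Hypothetical one-level undo: "save" snapshots the current data as a
// JSON string, "undo" restores the data from that snapshot.
function UndoBuffer(initialData) {
  this.current = initialData;
  this._saved = JSON.stringify(initialData);
}
UndoBuffer.prototype.save = function () {
  this._saved = JSON.stringify(this.current);
};
UndoBuffer.prototype.undo = function () {
  this.current = JSON.parse(this._saved);
  return this.current;
};
```

In the sample app, the snapshot lives in the browser's local storage rather than in memory, which is why Undo still works after a page reload.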
You can use your browser to inspect the local storage. It might look something like this:
<img src="https://github.com/just-bi/ui5tips/raw/main/localstoragejsonmodel/images/Inspect-localstorage.png?raw=true" alt="Screenshot of the Shopping List sample application to illustrate the local storage model." />
In the remainder of this tip we will discuss how these features were built by combining two classes in the UI5 framework: the <code>sap.ui.util.Storage</code> utility and the <code>sap.ui.model.json.JSONModel</code>.
<h1><code>sap.ui.util.Storage</code> utility</h1>
The <a href="https://openui5.hana.ondemand.com/api/module:sap/ui/util/Storage" rel="nofollow"><code>sap.ui.util.Storage</code> utility</a> offers a UI5 API to access the browser's standard HTML5 <a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Storage_API" rel="nofollow">Web Storage API</a>. It's a basic, no-nonsense wrapper for managing modest amounts of data through key/value access.
Of course, you can use the <code>sap.ui.util.Storage</code> utility directly and code your own logic to control exactly when you want to retrieve and store some data. While there is nothing against that approach, we envisioned something that also works when using <a href="https://openui5.hana.ondemand.com/topic/e5310932a71f42daa41f3a6143efca9c" rel="nofollow">models and data binding</a>.
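Using the utility directly boils down to prefixed key/value reads and writes. The sketch below imitates that with a plain wrapper - a hypothetical class, not UI5 code - with the storage backend injected so it also runs outside a browser; real code would load <code>sap/ui/util/Storage</code> and let it use the browser's local storage:

```javascript
// Hypothetical sketch of a prefixed key/value wrapper in the spirit of
// sap/ui/util/Storage. The backend is anything with the Web Storage
// getItem/setItem interface, e.g. window.localStorage in the browser.
function SimpleStorage(backend, prefix) {
  this._backend = backend;
  this._prefix = prefix + "-";
}
SimpleStorage.prototype.put = function (key, value) {
  // values are serialized to JSON so objects round-trip intact
  this._backend.setItem(this._prefix + key, JSON.stringify(value));
};
SimpleStorage.prototype.get = function (key) {
  var raw = this._backend.getItem(this._prefix + key);
  return raw === null ? undefined : JSON.parse(raw);
};
```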
This may need a little bit of explanation.
<h1>UI5 models</h1>
Models are a way to achieve managed data access. A model manages a particular collection of data, and can be shared across multiple elements of the application, or even be accessible to all elements of the application.
For example, both the Product List and the Shopping List are each managed by their own model, which are declared in the application's <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/manifest.json"><code>manifest.json</code></a>:
<div>
<pre> ...,
"models": {
"products": {
"type": "sap.ui.model.json.JSONModel",
"dataSource": "products"
},
"shoppingList": {
"type": "ui5tips.utils.LocalStorageJSONModel",
"dataSource": "shoppingListTemplate"
}
},
...
</pre>
</div>
The model can be observed by listening to its events, which allows different parts of the application to react whenever something interesting happens to the state of the model - i.e., when its data is manipulated. For example, in <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/components/mainpage/MainPage.controller.js"><code>MainPage.controller.js</code></a>, event handlers are attached to the Shopping List model's <code>dirtyStateChange</code> and <code>propertyChange</code> events, which in turn control certain aspects of the screen logic, such as enabling and disabling the Save and Undo buttons:
<div>
<pre> ...,
initShoppingListModelHandlers: function(){
var shoppingListModel = this.getShoppingListModel();
shoppingListModel.attachDirtyStateChange(function(event){
this.dirtyStateChanged(event.getParameters());
}, this);
shoppingListModel.attachPropertyChange(function(event){
var path = event.getParameter('path');
var context = event.getParameter('context');
var itemsPath = '/items';
if (path === itemsPath || context && context.getPath() === itemsPath) {
this.itemsChanged(shoppingListModel.getProperty(itemsPath));
}
}, this);
},
...
</pre>
</div>
<h1><a id="user-content-data-binding" class="anchor" href="#data-binding" aria-hidden="true"></a>Data Binding</h1>
In UI5, data binding is a mechanism that lets you declaratively construct objects and change their properties based on the state of a model. The declarative aspect means that no explicit coding is involved. For example, rather than setting up an event handler that contains explicit code to respond to changes in the state of the model, you can use a special syntax in design-time property assignments that ensures the runtime property value will be assigned directly from some part of the data in the model.
Some examples include:
<ul>
<li>The actual data in both the Product List and Shopping List. Both are implemented using a <a href="https://openui5.hana.ondemand.com/api/sap.ui.table.Table" rel="nofollow"><code>sap.ui.table.Table</code> control</a>, which only supports adding rows through data binding. For example, take a look at <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/components/mainpage/ShoppingList.fragment.xml"><code>ShoppingList.fragment.xml</code></a> to see how it gets its rows from the Shopping List model:</li>
</ul>
<div>
<pre> ...
<table:Table
id="shoppingList"
title="Shopping List"
editable="true"
selectionMode="None"
enableBusyIndicator="true"
visibleRowCountMode="Auto"
rowActionCount="1"
<b>rows="{
path: 'shoppingList>/items'
}"</b>
>
...
</pre>
</div>
(This example basically says: create a row in the table for each item in the shopping list model.)
<ul>
<li>The Product List's row action shows a full or empty shopping cart, depending upon whether the product is already in the shopping list. This is achieved in <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/components/mainpage/Products.fragment.xml"><code>Products.fragment.xml</code></a> with databinding, which passes the current product and the items from the shopping list to the controller's <code>getShoppingCartRowActionIconSource</code> formatter function:</li>
</ul>
<div>
<pre> ...
<table:RowActionItem
binding="{shoppingList>/items}"
icon="{
parts: [
{path: 'products>'},
{path: 'shoppingList>'}
],
formatter: '.getShoppingCartRowActionIconSource'
}"
text="Add to Cart"
press="onCartButtonPressed"
/>
...
</pre>
</div>
(This example says: call the <code>getShoppingCartRowActionIconSource</code> method to obtain an icon, depending on the current product from the product model and all the items in the shopping list model.)
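Such a formatter is just a plain function that receives the bound parts as arguments and returns a value. Below is a minimal sketch of what a formatter like this could look like. Note that this is an illustration, not the code from the sample app: the property name <code>productId</code> is a made-up assumption, and the actual <code>getShoppingCartRowActionIconSource</code> may use different icons and fields:

```javascript
// Sketch of a multi-part formatter (assumption: the real formatter in the
// sample app may differ). It receives the current product and the shopping
// list items, and returns a sap-icon URI: a full cart if the product is
// already on the list, an empty cart otherwise.
function getShoppingCartRowActionIconSource(product, shoppingListItems) {
  if (!product || !shoppingListItems) {
    return 'sap-icon://cart';                 // fall back to the empty cart
  }
  // "productId" is a hypothetical key property used here for illustration.
  var inCart = shoppingListItems.some(function(item) {
    return item.productId === product.productId;
  });
  return inCart ? 'sap-icon://cart-full' : 'sap-icon://cart';
}
```

Because the formatter is bound to both parts, UI5 re-evaluates it whenever either the product row or the shopping list items change, so the icon stays in sync automatically.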
<ul>
<li>In <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/components/mainpage/ShoppingList.fragment.xml"><code>ShoppingList.fragment.xml</code></a>, the Submit and Clear buttons are enabled depending on whether the shopping list has any items:</li>
</ul>
<div>
<pre> ...
<m:contentMiddle>
<m:Button
id="approvalButton"
icon="sap-icon://cart-approval"
tooltip="Send Order"
<b>enabled="{= ${shoppingList>/items}.length > 0 && ${shoppingList>/items/0} !== undefined }"</b>
press="onApproveButtonPressed"
/>
</m:contentMiddle>
<m:contentRight>
<m:Button
id="clearAllButton"
icon="sap-icon://clear-all"
tooltip="Clear Shoppinglist"
<b>enabled="{= ${shoppingList>/items}.length > 0 && ${shoppingList>/items/0} !== undefined}"</b>
press="onClearButtonPressed"
/>
</m:contentRight>
...
</pre>
</div>
(This example uses a slightly different binding syntax, called <a href="https://sapui5.hana.ondemand.com/#/topic/daf6852a04b44d118963968a1239d2c0.html">expression binding</a>, to enable the button only if there is at least one item in the shopping list.)
While all these features could also have been implemented by explicit coding, data binding allows a lot of this to be defined completely declaratively in the view, with much less code, and denoted in a way that transparently and unambiguously ties the data to the relevant item in the UI.
Now - there is no shame (I think) in not immediately embracing ui5's data binding. Some areas can be complex and unintuitive at first. But by using it more often, you start experiencing the benefits, and - just as important - you learn about the limitations. This post is not an in-depth article on ui5 data binding. The point is that, at some point, you learn to use it in such a way that it becomes one of the most important factors in how you design ui5 applications, as well as in how different parts of the application communicate with each other.
So, we consider using ui5 models and data binding as a given. And if you find you have a need for the kind of client-side persistence capabilities offered by the Web Storage API, then you are probably not interested in that as an isolated way of storing some bits of data. Instead, you're going to want to have a normal, regular ui5 model that incorporates these persistence features.
<h1><a id="user-content-a-sapuimodeljsonjsonmodel-backed-by-sapuiutilstorage" class="anchor" href="#a-sapuimodeljsonjsonmodel-backed-by-sapuiutilstorage" aria-hidden="true"></a>A <code>sap.ui.model.json.JSONModel</code> backed by <code>sap.ui.util.Storage</code></h1>
We decided to take the <code>sap.ui.model.json.JSONModel</code> as a base, and extend it to add a few methods that allow the model's data to be stored and retrieved from <code>sap.ui.util.Storage</code>.
The reason for this approach is that it makes our model behave exactly like the standard ui5 <code>sap.ui.model.json.JSONModel</code>. In particular, all behavior with regard to data binding will be exactly the same as with the <code>sap.ui.model.json.JSONModel</code>.
In theory, it would also be possible to extend the abstract <a href="https://openui5.hana.ondemand.com/api/sap.ui.model.ClientModel" rel="nofollow"><code>sap.ui.model.ClientModel</code></a>, but it turns out that implementing reliable data binding is not as easy as it seems. Or I should say: I took a naive shot at it, and failed. While it might be very instructive to try it in earnest, I decided that at this point I am more interested in having a working solution than in learning all the ui5 internals required to successfully implement data binding.
The result is the <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/utils/LocalStorageJSONModel.js"><code>LocalStorageJSONModel</code></a>.
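Before diving into the details, the core idea can be sketched in plain JavaScript. This is a simplified illustration, not the actual implementation: a <code>Map</code> stands in for <code>sap.ui.util.Storage</code>, and a minimal class stands in for <code>sap.ui.model.json.JSONModel</code>:

```javascript
// Simplified sketch of the approach: take a JSON model as a base and add
// save/load/delete methods backed by a storage. (Assumption: the real
// LocalStorageJSONModel extends sap.ui.model.json.JSONModel and uses
// sap.ui.util.Storage; this sketch only illustrates the shape of the idea.)
class SimpleJSONModel {
  constructor(data) { this._data = data || {}; }
  getData() { return this._data; }
  setData(data) { this._data = data; }
}

class LocalStorageJSONModelSketch extends SimpleJSONModel {
  constructor(settings, storage) {
    super(settings.template);
    // Default the storage key prefix to the class name, as described below.
    this._storagePrefix = settings.storagePrefix || 'LocalStorageJSONModel';
    this._storage = storage;
  }
  saveToStorage() {
    this._storage.set(this._storagePrefix, JSON.stringify(this.getData()));
  }
  loadFromStorage() {
    const stored = this._storage.get(this._storagePrefix);
    if (stored !== undefined) this.setData(JSON.parse(stored));
  }
  deleteFromStorage() {
    this._storage.delete(this._storagePrefix);
  }
}
```

Because the persistence lives in a handful of extra methods, everything else - including data binding - behaves exactly as it does for a plain JSON model.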
<h2><a id="user-content-instantiating-the-localstoragejsonmodel-from-the-manifestjson" class="anchor" href="#instantiating-the-localstoragejsonmodel-from-the-manifestjson" aria-hidden="true"></a>Instantiating the <code>LocalStorageJSONModel</code> from the <code>manifest.json</code></h2>
The sample application creates the <code>LocalStorageJSONModel</code> implicitly by declaring it in the <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/manifest.json"><code>manifest.json</code></a>:
<div>
<pre> "shoppingList": {
<b>"type": "ui5tips.utils.LocalStorageJSONModel"</b>,
"dataSource": "shoppingListTemplate"
}
</pre>
</div>
It gets initialized with a <code>dataSource</code> called <code>shoppingListTemplate</code> which is also declared in the <code>manifest.json</code>:
<div>
<pre> "shoppingListTemplate": {
<b>"uri": "data/shoppingListTemplate.json"</b>,
"type": "JSON"
}
</pre>
</div>
The <code>dataSource</code> refers to configuration data stored in <a href="https://github.com/just-bi/ui5tips/blob/main/localstoragejsonmodel/data/shoppingListTemplate.json"><code>data/shoppingListTemplate.json</code></a>; its contents are:
<div>
<pre>{
"autoSaveTimeout": -1,
"storagePrefix": "shoppingList",
"template": {
"items": [
]
}
}
</pre>
</div>
This data is passed as the first argument to the <code>LocalStorageJSONModel</code> constructor. Its properties are:
<ul>
<li><code>int autoSaveTimeout</code>: (optional) an integer specifying the number of milliseconds to wait after the last change to the model before automatically saving the model's data to the persistent storage. If this is <code>0</code> or less, data is not automatically persisted.</li>
<li><code>string storagePrefix</code>: (optional) a string that is used to prefix the key under which the model's data will be stored in storage. The
<a href="https://openui5.hana.ondemand.com/api/module:sap/ui/util/Storage#constructor" rel="nofollow"><code>sap.ui.util.Storage</code> constructor</a> takes a <code>storagePrefix</code>, and the <code>LocalStorageJSONModel</code> uses its own class name for that. But if you have several of these models in one application, you can keep them apart by specifying a specific <code>storagePrefix</code> here.</li>
<li><code>object template</code>: (optional) an object that will be used as template data for the model.</li>
</ul>
<h2><a id="user-content-instantiating-the-localstoragejsonmodel-directly" class="anchor" href="#instantiating-the-localstoragejsonmodel-directly" aria-hidden="true"></a>Instantiating the <code>LocalStorageJSONModel</code> directly</h2>
Of course, you can also import the class into your own ui5 classes (for example, a controller) and call its constructor to create an instance:
<div>
<pre>sap.ui.define([
"sap/ui/core/mvc/Controller",
<b>"ui5tips/utils/LocalStorageJSONModel"</b>
],
function(
Controller,
LocalStorageJSONModel
){
"use strict";
var controller = Controller.extend("ui5tips.components.app.App", {
onInit: function(){
var localStorageModel = <b>new LocalStorageJSONModel</b>({
"autoSaveTimeout": -1,
"storagePrefix": "myApp",
"template": {
...data...
}
});
this.getView().setModel(localStorageModel, 'localStorageModel');
}
});
return controller;
});
</pre>
</div>
<h1><a id="user-content-key-methods" class="anchor" href="#key-methods" aria-hidden="true"></a>Key Methods</h1>
The most important methods provided by <code>LocalStorageJSONModel</code> are:
<ul>
<li><code>loadFromStorage(template)</code>: populates the model with the data persisted in the storage. If the <code>template</code> argument is specified, then the data from the storage is patched with the data in the template. (For more details, see the next section about model initialization and the template). In the sample application, the Undo button action is implemented by calling <code>loadFromStorage()</code>:</li>
</ul>
<div>
<pre> onUndoButtonPressed: function(){
var shoppingListModel = this.getShoppingListModel();
<b>shoppingListModel.loadFromStorage()</b>;
}
</pre>
</div>
<ul>
<li><code>saveToStorage()</code>: stores the model data to the browser storage. In the sample application, the Save button action is implemented by calling the <code>saveToStorage()</code> method.</li>
</ul>
<div>
<pre> onSaveButtonPressed: function(){
var shoppingListModel = this.getShoppingListModel();
shoppingListModel.saveToStorage();
}
</pre>
</div>
<ul>
<li><code>deleteFromStorage()</code>: permanently removes the data from the local storage. Use this if you're sure the application will not need any of the currently stored data anymore.</li>
<li><code>isDirty()</code>: returns a boolean indicating whether the current model state differs from the stored state. Note that you can also use the <code>dirtyStateChange</code> event to get notified of changes in the dirty state.</li>
</ul>
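To illustrate the idea behind <code>isDirty()</code>, here is a plain-JavaScript sketch of one possible way to implement such a check: compare the serialized model data against the last stored snapshot. This is an assumption for illustration only - the actual model may track its dirty state differently:

```javascript
// Dirty-check sketch by snapshot comparison (assumption: the real model
// may implement this differently). A plain object stands in for the
// browser's localStorage.
const storage = {};

function saveToStorage(key, data) {
  storage[key] = JSON.stringify(data);
}

function isDirty(key, data) {
  // Dirty when the current model data no longer matches the stored snapshot.
  return storage[key] !== JSON.stringify(data);
}
```

With this scheme, saving naturally clears the dirty state, because the snapshot is refreshed to match the model.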
<h2><a id="user-content-template-and-model-initialization" class="anchor" href="#template-and-model-initialization" aria-hidden="true"></a>Template and Model initialization</h2>
As part of model initialization, whatever data the browser had associated with the <code>storagePrefix</code> is retrieved.
If a <code>template</code> is specified, the data retrieved from the storage is patched with the template, and the resulting data structure is immediately saved back to the storage. This provides a basic way to evolve the structure of the model and pre-populate it with defaults.
The patching of the data occurs non-destructively: only those paths in the template that do not exist already in the stored data structure will be added.
If you need it, you can always apply more advanced patching schemes after instantiation, but in many cases, this built-in behavior will suffice to update and upgrade the model structure as your application grows and gets more features.
You can use the following methods to work with the template:
<ul>
<li><code>getTemplateData()</code>: retrieve the template passed to the constructor.</li>
<li><code>resetToTemplate()</code>: repopulates the model with the template. Any data stored in the model will be lost.</li>
<li><code>updateDataFromTemplate(data, template)</code>: utility method that is used to patch the <code>data</code> argument with the <code>template</code> argument. It returns an object that represents the merge of the <code>data</code> argument and the <code>template</code> argument.</li>
</ul>
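To make the non-destructive patching concrete, here is a plain-JavaScript sketch of such a merge. It illustrates the behavior described above - existing values win, missing paths are added - and is not the actual implementation; the real <code>updateDataFromTemplate</code> may handle more cases:

```javascript
// Sketch of a non-destructive patch: paths present in the template but
// missing from the data are added; existing values in the data are kept.
// (Assumption: the real updateDataFromTemplate may handle more cases.)
function updateDataFromTemplate(data, template) {
  if (data === undefined) {
    // Path missing in the stored data: take a deep copy of the template value.
    return JSON.parse(JSON.stringify(template));
  }
  const isPlainObject = function(value) {
    return value !== null && typeof value === 'object' && !Array.isArray(value);
  };
  if (isPlainObject(data) && isPlainObject(template)) {
    // Recurse so nested paths from the template are added as well.
    for (const key of Object.keys(template)) {
      data[key] = updateDataFromTemplate(data[key], template[key]);
    }
  }
  return data;   // existing values always win
}
```

This is what lets you add new default settings to the template as the application evolves, without wiping out data the user already stored.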
<h1><a id="user-content-events" class="anchor" href="#events" aria-hidden="true"></a>Events</h1>
The <code>LocalStorageJSONModel</code> provides these events:
<ul>
<li><code>dirtyStateChange</code>: this event has two parameters, <code>isDirty</code> to indicate whether the model is now dirty and <code>wasDirty</code>, indicating whether the model was dirty prior to the latest change.
The sample application uses this event to determine whether to enable or disable the Save and Undo buttons:</li>
</ul>
<div>
<pre> shoppingListModel.attachDirtyStateChange(function(event){
this.dirtyStateChanged(event.getParameters());
}, this);
</pre>
</div>
and
<div>
<pre> dirtyStateChanged: function(parameters){
var isDirty = parameters.isDirty;
this.byId('saveButton').setEnabled(isDirty);
this.byId('undoButton').setEnabled(isDirty);
},
</pre>
</div>
The following method can be used to keep track of the model state:
<ul>
<li><code>attachDirtyStateChange(data, handler, listener)</code>: attach a <code>handler</code> function to get notified of changes in the dirty state. If some change is made that causes a difference between the stored data and the model data, this event is fired and the <code>handler</code> is called in the scope of the <code>listener</code>, and gets passed the application-specific payload <code>data</code>.</li>
</ul>
<h2>Autosave</h2>
The sample application controls when the model data will be persisted to local storage by calling <code>saveToStorage()</code> explicitly. But there are also use cases where you simply want the storage to always reflect the state of the model, or at least, track it as closely as possible. The <a href="https://github.com/just-bi/ui5tips/wiki/Persistent-UI-State">persistence of UI state</a> is such a case, and for these scenarios the <code>LocalStorageJSONModel</code> supports an automatic save feature.
Autosave works by monitoring the state of the model, and then saving to the storage whenever a change is detected. While saving to storage should generally be pretty fast, it is a blocking operation. So rather than always explicitly persisting after a change occurs, we simply buffer the change events with the <a href="https://github.com/just-bi/ui5tips/wiki/bufferedEventHandler"><code>bufferedEventHandler</code></a> and persist the data to storage some time after the occurrence of the last change event.
To use the autosave feature, simply pass a positive value for the <code>autoSaveTimeout</code> property when you instantiate the model. Alternatively, you can get or set the value of the <code>autoSaveTimeout</code> property after model construction by calling the <code>getAutoSaveTimeout()</code> and <code>setAutoSaveTimeout()</code> methods respectively. To disable autosave, simply set the property to zero or a negative value.
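The buffering idea can be sketched as a simple debounce in plain JavaScript. This is a simplified illustration; the sample application uses the <code>bufferedEventHandler</code> utility instead:

```javascript
// Debounce sketch: every change (re)starts a timer, so the actual save runs
// only once, some time after the last change. (Assumption: the real autosave
// is built on the bufferedEventHandler utility; this only shows the idea.)
function createAutoSaver(saveFn, autoSaveTimeout) {
  let timer = null;
  return {
    notifyChange() {
      if (autoSaveTimeout <= 0) return;        // autosave disabled
      if (timer !== null) clearTimeout(timer); // restart the countdown
      timer = setTimeout(() => { timer = null; saveFn(); }, autoSaveTimeout);
    },
    flush() {                                  // persist a pending change now
      if (timer !== null) { clearTimeout(timer); timer = null; saveFn(); }
    }
  };
}
```

A burst of rapid changes then results in a single write to storage, which keeps the blocking storage operation off the hot path.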
<h1>Finally</h1>
Did you like this tip? Do you have a better tip? Feel free to post a comment and share your approach to the same or similar problem.
Want more tips? Find other posts with the <a href="https://blogs.sap.com/tag/ui5tips/">ui5tips tag</a>!
<h1>UI5 Tips: Change expand/collapse icons for Tree, Panel and TreeTable using only CSS</h1>
UI5 offers a couple of widgets that can expand and collapse. To do that, these controls render a button with an icon that indicates the current state, and which the user can click to toggle the state.
The standard icons that UI5 renders for the expand/collapse button are navigation arrows, which some of our users disliked. In this tip, you'll learn how you can replace them with more appropriate icons using only a few lines of CSS. No javascript code is involved.
If you want to check out this tip yourself, download the app from the <a href="https://github.com/just-bi/ui5tips/tree/main/expandcollapse"><code>expandcollapse</code> directory</a> and expose it to your webserver. You can then navigate to <a href="https://github.com/just-bi/ui5tips/blob/main/expandcollapse/index.html"><code>index.html</code></a> to see the sample app in effect.
<h1><a id="user-content-ui5-exandablecollapsible-controls" class="anchor" href="#ui5-exandablecollapsible-controls" aria-hidden="true"></a>UI5 expandable/collapsible Controls</h1>
First, let's take a look at the standard UI5 controls.
<h2><a id="user-content-panel" class="anchor" href="#panel" aria-hidden="true"></a>Panel</h2>
The <a href="https://openui5.hana.ondemand.com/api/sap.m.Panel" rel="nofollow"><code>sap.m.Panel</code></a> has an <a href="https://openui5.hana.ondemand.com/api/sap.m.Panel#controlProperties" rel="nofollow"><code>expandable</code> property</a>. If <code>true</code>, the Panel renders a button that the user can use to hide and show the contents of the panel. A screenshot is shown below:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/UI5%20Sample%20-%20Panel.png?raw=true" alt="An expandable sap.m.Panel" />
(This screenshot is taken from UI5's <a href="https://openui5.hana.ondemand.com/entity/sap.m.Panel/sample/sap.m.sample.PanelExpanded" rel="nofollow">Panel - Expand / Collapse sample</a>)
<h2><a id="user-content-tree" class="anchor" href="#tree" aria-hidden="true"></a>Tree</h2>
The <a href="https://openui5.hana.ondemand.com/api/sap.m.Tree" rel="nofollow"><code>sap.m.Tree</code></a> is a classical way of presenting hierarchically organized items like a folder structure. A screenshot is shown below:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/UI5%20Sample%20-%20Tree.png?raw=true" alt="A sap.m.Tree control" />
(This screenshot is taken from UI5's <a href="https://openui5.hana.ondemand.com/entity/sap.m.Tree/sample/sap.m.sample.Tree" rel="nofollow">Tree - Basic sample</a>)
<h2><a id="user-content-treetable" class="anchor" href="#treetable" aria-hidden="true"></a>TreeTable</h2>
The <a href="https://openui5.hana.ondemand.com/api/sap.ui.table.TreeTable" rel="nofollow"><code>sap.ui.table.TreeTable</code></a> is just like a regular data grid table (<a href="https://openui5.hana.ondemand.com/api/sap.ui.table.Table" rel="nofollow"><code>sap.ui.table.Table</code></a>), but with an added functionality to hierarchically organize the rows in the table, and with the ability to expand or collapse rows according to the hierarchy. A screenshot is shown below:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/UI5%20Sample%20-%20TreeTable.png?raw=true" alt="A sap.ui.table.TreeTable" />
(This screenshot is taken from UI5's <a href="https://openui5.hana.ondemand.com/entity/sap.ui.table.TreeTable/sample/sap.ui.table.sample.TreeTable.JSONTreeBinding" rel="nofollow"><code>sap.ui.table.TreeTable</code> JSONTreeBinding sample</a>)
<h1><a id="user-content-a-look-at-the-icons" class="anchor" href="#a-look-at-the-icons" aria-hidden="true"></a>A look at the icons</h1>
Let's take a look at the standard icons that UI5 renders for the expand/collapse button:
<ul>
<li>When collapsed, the icon is the <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-right-arrow" rel="nofollow">navigation-right-arrow</a> icon. This is what it looks like:</li>
</ul>
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/right-arrow.png?raw=true" alt="UI5 Right arrow icon" />
<ul>
<li>When expanded, it's the <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-down-arrow" rel="nofollow">navigation-down-arrow</a> icon. This is what it looks like:</li>
</ul>
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/down-arrow.png?raw=true" alt="UI5 Down arrow icon" />
<h1><a id="user-content-propsed-icons" class="anchor" href="#propsed-icons" aria-hidden="true"></a>Proposed Icons</h1>
While I don't really have a problem with these icons, some of our users had trouble recognizing the collapse/expand functionality for Panels. We looked around a bit in <a href="https://openui5.hana.ondemand.com/test-resources/sap/m/demokit/iconExplorer/webapp/index.html" rel="nofollow">the UI5 Icon explorer</a> and decided we'd rather use these icons instead:
<ul>
<li><a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=expand&search=expand" rel="nofollow">expand</a></li>
</ul>
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/expand.png?raw=true" alt="UI5 Expand icon" />
<ul>
<li><a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=expand&search=collapse" rel="nofollow">collapse</a></li>
</ul>
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/collapse.png?raw=true" alt="UI5 Collapse icon" />
Going by their names, it's a bit of a mystery to me why UI5 didn't use them in the first place. But anyway, now we have this tip to explain how you can change them.
<h1><a id="user-content-css-to-change-the-icons" class="anchor" href="#css-to-change-the-icons" aria-hidden="true"></a>CSS to change the icons</h1>
We prepared a separate CSS file for each of the aforementioned UI5 controls, and included them into the app via the <a href="https://github.com/just-bi/ui5tips/blob/main/expandcollapse/manifest.json"><code>manifest.json</code></a>:
<div>
<pre><span class="pl-s">"resources"</span>: <span class="pl-kos">{</span>
<span class="pl-s">"css"</span>: <span class="pl-kos">[</span>
<span class="pl-kos">{</span> <span class="pl-s">"uri"</span>: <span class="pl-s">"css/ui5-customization-m.Panel.css"</span> <span class="pl-kos">}</span><span class="pl-kos">,</span>
<span class="pl-kos">{</span> <span class="pl-s">"uri"</span>: <span class="pl-s">"css/ui5-customization-m.TabContainer.css"</span> <span class="pl-kos">}</span><span class="pl-kos">,</span>
<span class="pl-kos">{</span> <span class="pl-s">"uri"</span>: <span class="pl-s">"css/ui5-customization-m.Tree.css"</span> <span class="pl-kos">}</span><span class="pl-kos">,</span>
<span class="pl-kos">{</span> <span class="pl-s">"uri"</span>: <span class="pl-s">"css/ui5-customization-ui.tree.TreeTable.css"</span> <span class="pl-kos">}</span>
<span class="pl-kos">]</span>
<span class="pl-kos">}</span></pre>
</div>
<h2><a id="user-content-how-ui5-renders-icons" class="anchor" href="#how-ui5-renders-icons" aria-hidden="true"></a>How UI5 renders icons</h2>
Before we discuss how to apply the CSS to change the icons, it's useful to understand how UI5 icon rendering works.
In general, UI5 uses icon fonts. The UI5 framework loads a <code>library.css</code> stylesheet, which has a <code>@font-face</code> rule like this:
<div>
<pre><span class="pl-k">@font-face</span> {
<span class="pl-c1">font-family</span><span class="pl-kos">:</span> <b><span class="pl-s">"SAP-icons"</span></b>;
<span class="pl-c1">src</span><span class="pl-kos">:</span> <span class="pl-en">url</span>(<span class="pl-s">'../base/fonts/SAP-icons.woff2'</span>) <span class="pl-en">format</span>(<span class="pl-s">'woff2'</span>)<span class="pl-kos">,</span>
<span class="pl-en">url</span>(<span class="pl-s">'../base/fonts/SAP-icons.woff'</span>) <span class="pl-en">format</span>(<span class="pl-s">'woff'</span>)<span class="pl-kos">,</span>
<span class="pl-en">url</span>(<span class="pl-s">'../base/fonts/SAP-icons.ttf'</span>) <span class="pl-en">format</span>(<span class="pl-s">'truetype'</span>)<span class="pl-kos">,</span>
<span class="pl-en">local</span>(<span class="pl-s">'SAP-icons'</span>);
<span class="pl-c1">font-weight</span><span class="pl-kos">:</span> normal;
<span class="pl-c1">font-style</span><span class="pl-kos">:</span> normal
}</pre>
</div>
This binds the name <code>SAP-icons</code> to the font resource, and ensures that whenever an HTML element is assigned the <code>font-family: "SAP-icons"</code> CSS property, it will render whatever text it contains with glyphs from that font.
Now, when using the UI5 JavaScript API, you don't ever have to deal with these details at this level. Rather, if you need to assign an icon explicitly, for example when using a <a href="https://openui5.hana.ondemand.com/api/sap.ui.core.Icon" rel="nofollow"><code>sap.ui.core.Icon</code></a> control, you can assign a custom icon URI using the <code>sap-icon</code> protocol, which maps more or less reasonable icon names to the glyphs that depict the desired icons. (You can read more about the <code>sap-icon</code> URI protocol in <a href="https://sapui5.hana.ondemand.com/#/topic/776f7352807e4f82b18176c8fbdc0c56" rel="nofollow">the Icon topic of the SAP UI5 walkthrough</a>.)
Apart from these explicitly assigned icons, the renderer classes of various UI5 controls write out the required HTML for icons that are simply fixed parts of the control. Let's call these structural icons. For example, there is no property that allows you to change the icon that a <code>sap.m.Panel</code> uses for its expand/collapse button - that's just part of how the Panel happens to be coded - it's part of its structure.
As we will see in the following sections, the <code>font-family</code> is just the underlying medium that allows the UI5 framework to render icons. The details of how a particular control renderer renders its structural icons can still vary a bit, and we'll need to figure out how a particular control renders its icons before we can change them.
<h1><a id="user-content-how-sapuitabletreetable-renders-the-collapseexpand-icons" class="anchor" href="#how-sapuitabletreetable-renders-the-collapseexpand-icons" aria-hidden="true"></a>How <code>sap.ui.table.TreeTable</code> renders the collapse/expand icons</h1>
The <code>sap.ui.table.TreeTable</code> renderer takes a straightforward approach to rendering the collapse/expand icons. If you open one of the <a href="https://openui5.hana.ondemand.com/entity/sap.ui.table.TreeTable/sample/sap.ui.table.sample.TreeTable.JSONTreeBinding" rel="nofollow">standard UI5 TreeTable samples</a>, and right click the expand/collapse icon to inspect it (for example, with the Chrome developer tools), then you might see something like this:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/TreeTableIconStandard.png?raw=true" alt="Inspecting the sap.ui.table.TreeTable expand/collapse icon with chrome developer tools" />
The <code>sap.ui.table.TreeTable</code> renderer has written a <code>&lt;span&gt;</code> element with a <code>sapUiTableTreeIcon</code> class:
<div>
<pre><span class="pl-kos"><</span><span class="pl-ent">span</span>
<span class="pl-c1">class</span>="<span class="pl-s">
<b>sapUiTableTreeIcon </b>
<b>sapUiTableTreeIconNodeClosed</b>
</span>"
<span class="pl-c1">title</span>="<span class="pl-s">Expand Node</span>"
<span class="pl-c1">role</span>="<span class="pl-s">button</span>"
<span class="pl-c1">aria-expanded</span>="<span class="pl-s">false</span>"
<span class="pl-kos">></span><span class="pl-kos"></</span><span class="pl-ent">span</span><span class="pl-kos">></span></pre>
</div>
The span does not actually contain any text - rather, a css <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/::before" rel="nofollow"><code>::before</code> pseudo-element</a> is used for that. This is also where it is bound to the <code>"SAP-icons"</code> font, using the <code>font-family</code> property - this ensures the element will render glyphs from the icon font:
<div>
<pre>.<span class="pl-c1">sapUiTableTreeIcon</span>::<span class="pl-ent">before</span> {
<b><span class="pl-c1">font-family</span><span class="pl-kos">:</span> <span class="pl-s">"SAP-icons"</span>;</b>
<span class="pl-c1">font-size</span><span class="pl-kos">:</span> <span class="pl-c1">.75<span class="pl-smi">rem</span></span>;
<span class="pl-c1">color</span><span class="pl-kos">:</span> <span class="pl-pds"><span class="pl-kos">#</span>0854a0</span>;
}</pre>
</div>
The actual text content that determines the icon is controlled through another rule, using another css class, which uses <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/content" rel="nofollow">the css <code>content</code> property</a> to write out the character that renders the appropriate icon from the font.
When collapsed, it's:
<div>
<pre>.<span class="pl-c1">sapUiTableTreeIcon</span>.<b><span class="pl-c1">sapUiTableTreeIconNodeClosed</span></b>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <b><span class="pl-s">'\e066'</span></b>;
}</pre>
</div>
(You may recall that <code>\e066</code> is the character that corresponds to the <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-right-arrow" rel="nofollow">navigation-right-arrow icon</a>.)
When expanded, it's:
<div>
<pre>.<span class="pl-c1">sapUiTableTreeIcon</span>.<b><span class="pl-c1">sapUiTableTreeIconNodeOpen</span></b>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <b><span class="pl-s">'\e1e2'</span></b>;
}</pre>
</div>
(You may recall that <code>\e1e2</code> is the character that corresponds to the <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-down-arrow" rel="nofollow">navigation-down-arrow icon</a>.)
This way, the <code>sap.ui.table.TreeTable</code> only needs to change the style class from <code>sapUiTableTreeIconNodeClosed</code> to <code>sapUiTableTreeIconNodeOpen</code> on the <code>&lt;span&gt;</code>, depending on the expanded/collapsed state of the row: the css magic takes care of rendering the right icon.
<h2><a id="user-content-changing-the-expandcollapse-icons-for-the-sapuitabletreetable" class="anchor" href="#changing-the-expandcollapse-icons-for-the-sapuitabletreetable" aria-hidden="true"></a>Changing the expand/collapse icons for the <code>sap.ui.table.TreeTable</code></h2>
As we have just witnessed, the <code>sap.ui.table.TreeTable</code> uses separate classes for the collapse and expand icons. This makes it really quite simple to change the icons. We only have to write our own rules for the <code>sapUiTableTreeIconNodeOpen::before</code> and <code>sapUiTableTreeIconNodeClosed::before</code> classes to mask the default ones, and assign the proper value for the <code>content</code> property:
<div>
<pre><span class="pl-c">/**</span>
<span class="pl-c">* sap.ui.table.TreeTable: better icons for expanded</span>
<span class="pl-c">*/</span>
.<span class="pl-c1">sapUiTableTreeIcon</span>.<b><span class="pl-c1">sapUiTableTreeIconNodeOpen</span></b>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <b><span class="pl-s">'\e1d9'</span></b>;
}
<span class="pl-c">/**</span>
<span class="pl-c">* sap.ui.table.TreeTable: better icons for collapsed</span>
<span class="pl-c">*/</span>
.<span class="pl-c1">sapUiTableTreeIcon</span>.<b><span class="pl-c1">sapUiTableTreeIconNodeClosed</span></b>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <b><span class="pl-s">'\e1da'</span></b>;
}</pre>
</div>
(You will find similar rules in the <a href="https://github.com/just-bi/ui5tips/blob/main/expandcollapse/css/ui5-customization-ui.tree.TreeTable.css"><code>ui5-customization-ui.tree.TreeTable.css</code></a> provided by this ui5tip.)
The only thing we need to take care of when applying this stylesheet is that it is loaded after the UI5 framework loads the CSS specific to the <code>sap.ui.table.TreeTable</code> control: if our CSS is loaded before the framework's CSS, then our rules will be masked by the framework's, and we want it exactly the other way around.
To ensure that the framework's CSS for the <code>sap.ui.table.TreeTable</code> control is loaded before our custom CSS, simply include the <code>sap.ui.table</code> library in the <code>data-sap-ui-libs</code> property of the <code><script></code> element you use to load UI5 (see the <a href="https://github.com/just-bi/ui5tips/blob/main/expandcollapse/index.html"><code>index.html</code></a> for this tip):
<div>
<pre><span class="pl-kos"><</span><span class="pl-ent">script</span>
<span class="pl-c1">id</span>="<span class="pl-s">sap-ui-bootstrap</span>"
<span class="pl-c1">src</span>="<span class="pl-s">https://openui5.hana.ondemand.com/1.87.0/resources/sap-ui-core.js</span>"
<span class="pl-c1">data-sap-ui-theme</span>="<span class="pl-s">sap_belize</span>"
<span class="pl-c1">data-sap-ui-libs</span>="<span class="pl-s">sap.m, <b>sap.ui.table</b></span>"
<span class="pl-c1">data-sap-ui-bindingSyntax</span>="<span class="pl-s">complex</span>"
<span class="pl-c1">data-sap-ui-compatVersion</span>="<span class="pl-s">edge</span>"
<span class="pl-c1">data-sap-ui-preload</span>="<span class="pl-s">async</span>"
<span class="pl-c1">data-sap-ui-resourceroots</span>='<span class="pl-s">{</span>
<span class="pl-s"> "ui5tips": "./"</span>
<span class="pl-s"> }</span>'
<span class="pl-kos">></span><span class="pl-kos"></</span><span class="pl-ent">script</span><span class="pl-kos">></span></pre>
</div>
That's it! The screenshot below shows what the TreeTable looks like in this tip's sample app:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/TreeTable-modified.png?raw=true" alt="sap.ui.table.TreeTable with modified collapse/expand icons" />
<h1><a id="user-content-how-sapmpanel-renders-the-collapsexpand-icons" class="anchor" href="#how-sapmpanel-renders-the-collapsexpand-icons" aria-hidden="true"></a>How <code>sap.m.Panel</code> renders the collaps/expand icons</h1>
Let's take a look at how the <a href="https://openui5.hana.ondemand.com/api/sap.m.Panel" rel="nofollow"><code>sap.m.Panel</code></a> renders its collapse/expand icon. We can again open <a href="https://openui5.hana.ondemand.com/entity/sap.m.Panel/sample/sap.m.sample.PanelExpanded" rel="nofollow">UI5's own <code>sap.m.Panel</code> sample</a> and use our browser's development tools to inspect the page's HTML code:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/PanelIconStandard.png?raw=true" alt="Inspecting sap.m.Panel's standard expand/collapse icon" />
Just like the <code>sap.ui.table.TreeTable</code> we discussed in the previous section, the <code>sap.m.Panel</code> renders a <code><span></code> element for the icon, which is assigned a CSS class to mark it as the icon, and which is bound to the icon font face:
<div>
<pre><span class="pl-kos"><</span><span class="pl-ent">span</span>
<b><span class="pl-c1">data-sap-ui-icon-content</span>="<span class="pl-s"></span>"</b>
<span class="pl-c1">class</span>="<span class="pl-s">
<b>sapUiIcon</b>
sapUiIconMirrorInRTL
sapMBtnCustomIcon
sapMBtnIcon
sapMBtnIconLeft</span>"
<span class="pl-c1">style</span>="<span class="pl-s">font-family: 'SAP\2dicons';</span>"
<span class="pl-kos">></span><span class="pl-kos"></</span><span class="pl-ent">span</span><span class="pl-kos">></span></pre>
</div>
And, just like for the <code>sap.ui.table.TreeTable</code>, there is a CSS rule that selects the <code>::before</code> pseudo-element and uses the <code>content</code> property to insert the appropriate character that corresponds to the glyph:
<div>
<pre>.<span class="pl-c1">sapUiIcon</span>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <b><span class="pl-en">attr</span>(data-sap-ui-icon-content);</b>
<span class="pl-c1">speak</span><span class="pl-kos">:</span> none;
<span class="pl-c1">font-weight</span><span class="pl-kos">:</span> normal;
<span class="pl-c1">-webkit-font-smoothing</span><span class="pl-kos">:</span> antialiased;
}</pre>
</div>
There are also some notable differences with respect to the <code>sap.ui.table.TreeTable</code> example.
In this case, there are no separate classes corresponding to the collapsed/expanded state of the Panel. Instead, the <code>content</code> property of the <code>.sapUiIcon::before</code> pseudo-element uses the value of the element's <code>data-sap-ui-icon-content</code> attribute: it will render whatever text that attribute contains.
If you check the code for the <code><span></code>, you'll note that the <code>data-sap-ui-icon-content</code> attribute has been assigned some text, which is rendered as a so-called .notdef glyph, both in the developer tools and here on the page. (The <a href="https://www.high-logic.com/fontcreator/manual11/recommendedglyphs.html" rel="nofollow">.notdef glyph</a> is the "boxed question mark".)
You can copy the text from the <code>data-sap-ui-icon-content</code> attribute in the browser tools and paste it into a hex editor, or into a JavaScript string, to figure out what its character code is, for example:
<div>
<pre><span class="pl-c">// decimal: 57839</span>
<span class="pl-s">""</span><span class="pl-kos">.</span><span class="pl-en">charCodeAt</span><span class="pl-kos">(</span><span class="pl-c1">0</span><span class="pl-kos">)</span>
<span class="pl-c">// hex: 0xE1EF</span>
<span class="pl-kos">(</span><span class="pl-s">""</span><span class="pl-kos">.</span><span class="pl-en">charCodeAt</span><span class="pl-kos">(</span><span class="pl-c1">0</span><span class="pl-kos">)</span><span class="pl-kos">)</span><span class="pl-kos">.</span><span class="pl-en">toString</span><span class="pl-kos">(</span><span class="pl-c1">16</span><span class="pl-kos">)</span></pre>
</div>
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/charcode.png?raw=true" alt="Inspecting the value of the data-sap-ui-icon-content attribute" />
It turns out that this corresponds to UI5's <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=slim-arrow-down" rel="nofollow">slim-arrow-down</a> icon, which is similar in appearance to the navigation-down-arrow icon we saw earlier.
If you collapse the panel and inspect it again, you'll notice that the value of the <code>data-sap-ui-icon-content</code> attribute is now the character <code>0xE1ED</code>, which corresponds to UI5's <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=slim-arrow-right" rel="nofollow">slim-arrow-right</a> icon.
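For reference, the round trip between the hex escapes used in the CSS and the decimal values can be checked in any JavaScript console:

```javascript
// Round trip between the icon code points and their decimal/hex notations.
// 0xE1EF is slim-arrow-down (expanded), 0xE1ED is slim-arrow-right (collapsed).
var expandedChar = String.fromCharCode(0xE1EF);
var collapsedChar = String.fromCharCode(0xE1ED);
console.log(expandedChar.charCodeAt(0));               // 57839
console.log(collapsedChar.charCodeAt(0).toString(16)); // "e1ed"
```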
<h2><a id="user-content-changing-the-expandcollapse-icons-for-the-sapmpanel" class="anchor" href="#changing-the-expandcollapse-icons-for-the-sapmpanel" aria-hidden="true"></a>Changing the expand/collapse icons for the <code>sap.m.Panel</code></h2>
Now, it's clear that we cannot simply mask the existing classes in the same way we did in the <code>sap.ui.table.TreeTable</code> case. The reason is that here the icon is driven directly by an attribute value, not by a change of style class.
Since the icon is so clearly driven by the value of the attribute, your initial hunch might be to somehow change the value that is written out to the HTML. But this would involve rewriting or overriding the <code>sap.m.Panel</code> or its renderer, and we're not quite prepared to do that just to change the icon.
But, there is a way.
What we can do is write rules that match the <code><span></code> depending on the value of its <code>data-sap-ui-icon-content</code> attribute. And if we can match a CSS selector based on the attribute value, we can simply write out a <code>content</code> property with the desired character instead. This works as long as we know in advance what values the attribute will have, which is of course the case here: there will only be two different values, corresponding to the collapsed and expanded state of the panel.
This is what it looks like in <a href="https://github.com/just-bi/ui5tips/blob/main/expandcollapse/css/ui5-customization-m.Panel.css"><code>ui5-customization-m.Panel.css</code></a>:
<div>
<pre><span class="pl-c">/*</span>
<span class="pl-c"> sap.m.Panel better expanded button. </span>
<span class="pl-c"> The value in the predicate for data-sap-ui-icon-content may not render correctly,</span>
<span class="pl-c"> but this is decimal 57839, or 0xE1EF, which corresponds to UI5's "slim-arrow-down" icon</span>
<span class="pl-c"> (https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=slim-arrow-down)</span>
<span class="pl-c">*/</span>
<span class="pl-ent">div</span>.<span class="pl-c1">sapMPanel</span>.<span class="pl-c1">sapMPanelExpandable</span> <span class="pl-c1">></span> <span class="pl-ent">div</span> <span class="pl-c1">></span> <b><span class="pl-ent">span</span>[<span class="pl-c1">data-sap-ui-icon-content</span><span class="pl-c1">=</span>]</b>.<span class="pl-c1">sapUiIcon</span>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <b><span class="pl-s">'\e1d9'</span></b>;
}
<span class="pl-c">/*</span>
<span class="pl-c"> sap.m.Panel better collapse button</span>
<span class="pl-c"> The value in the predicate for data-sap-ui-icon-content may not render correctly,</span>
<span class="pl-c"> but this is decimal 57837, or 0xE1ED, which corresponds to UI5's "slim-arrow-right" icon</span>
<span class="pl-c"> (https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=slim-arrow-down)</span>
<span class="pl-c">*/</span>
<span class="pl-ent">div</span>.<span class="pl-c1">sapMPanel</span>.<span class="pl-c1">sapMPanelExpandable</span> <span class="pl-c1">></span> <span class="pl-ent">div</span> <span class="pl-c1">></span> <b><span class="pl-ent">span</span>[<span class="pl-c1">data-sap-ui-icon-content</span><span class="pl-c1">=</span>]</b>.<span class="pl-c1">sapUiIcon</span>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <b><span class="pl-s">'\e1da'</span></b>;
}</pre>
</div>
Note that the <code>span[data-sap-ui-icon-content=].sapUiIcon::before</code> selector is the essential bit that allows us to react to a specific icon value. The selector part before it ensures the rule will only apply to the expand/collapse button of a Panel, and not to some other control's icon.
And, here's what it looks like in the sample app:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/Panel-modified.png?raw=true" alt="sap.m.Panel with modified expand/collapse icons." />
<h1><a id="user-content-how-sapmtree-renders-the-collapseexpand-icons" class="anchor" href="#how-sapmtree-renders-the-collapseexpand-icons" aria-hidden="true"></a>How <code>sap.m.Tree</code> renders the collapse/expand icons</h1>
The <code>sap.m.Tree</code> uses exactly the same mechanism to render the icons as the <code>sap.m.Panel</code> does - the character that corresponds to the appropriate icon glyph is written to a <code>data-sap-ui-icon-content</code> attribute, and the value of the attribute is rendered against the icon font's font face. The only difference with the Panel is that the <code>sap.m.Tree</code> uses the <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-right-arrow" rel="nofollow">navigation-right-arrow</a> and <a href="https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-down-arrow" rel="nofollow">navigation-down-arrow</a> icons, just like the <code>sap.ui.table.TreeTable</code> did.
Apart from that, we also need to ensure the first part of the selectors is specific to the <code>sap.m.Tree</code>, similar to what we did for the <code>sap.m.Panel</code>.
This is what the CSS looks like in <a href="https://github.com/just-bi/ui5tips/blob/main/expandcollapse/css/ui5-customization-m.Tree.css">ui5-customization-m.Tree.css</a>:
<div>
<pre><span class="pl-c">/**</span>
<span class="pl-c"> sap.m.TreeItem : better icons for collapsed </span>
<span class="pl-c"> The value in the predicate for data-sap-ui-icon-content may not render correctly,</span>
<span class="pl-c"> but this is decimal 57446, or 0xE066, which corresponds to UI5's "navigation-right-arrow" icon</span>
<span class="pl-c"> (https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-right-arrow)</span>
<span class="pl-c">*/</span>
<span class="pl-ent">li</span>.<span class="pl-c1">sapMTreeItemBase</span> <span class="pl-c1">></span> <span class="pl-ent">span</span>[<span class="pl-c1">data-sap-ui-icon-content</span><span class="pl-c1">=</span>].<span class="pl-c1">sapMTreeItemBaseExpander</span>.<span class="pl-c1">sapUiIcon</span>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <span class="pl-s">'\e1da'</span>;
}
<span class="pl-c">/**</span>
<span class="pl-c"> sap.m.TreeItem : better icons for expanded</span>
<span class="pl-c"> The value in the predicate for data-sap-ui-icon-content may not render correctly,</span>
<span class="pl-c"> but this is decimal 57826, or 0xE1E2, which corresponds to UI5's "navigation-down-arrow" icon</span>
<span class="pl-c"> (https://sapui5.hana.ondemand.com/sdk/test-resources/sap/m/demokit/iconExplorer/webapp/index.html#/overview/SAP-icons/?tab=grid&icon=navigation-down-arrow)</span>
<span class="pl-c">*/</span>
<span class="pl-ent">li</span>.<span class="pl-c1">sapMTreeItemBase</span> <span class="pl-c1">></span> <span class="pl-ent">span</span>[<span class="pl-c1">data-sap-ui-icon-content</span><span class="pl-c1">=</span>].<span class="pl-c1">sapMTreeItemBaseExpander</span>.<span class="pl-c1">sapUiIcon</span>::<span class="pl-ent">before</span> {
<span class="pl-c1">content</span><span class="pl-kos">:</span> <span class="pl-s">'\e1d9'</span>;
}</pre>
</div>
And this is what the Tree looks like in the sample app:
<img src="https://github.com/just-bi/ui5tips/raw/main/expandcollapse/images/Tree-modified.png?raw=true" alt="A sap.m.Tree with modified expand/collapse icons." />
<h1>Finally</h1>
Did you like this tip? Do you have a better tip? Feel free to post a comment and share your approach to the same or similar problem.
Want more tips? Find other posts with the <a href="https://blogs.sap.com/tag/ui5tips/">ui5tips tag</a>!
<h1>UI5 Tips: Buffering Events to avoid a request-storm</h1>
Standard UI5 event handling will usually go a long way. Yet sometimes, certain user actions cause ui5 objects to generate a lot of similar events within a short period of time, and it is often not useful to handle each and every one of them: only the last event needs handling.
A very common scenario is doing a search in response to the <a href="https://openui5.hana.ondemand.com/#/api/sap.m.SearchField%23events/liveChange" rel="nofollow"><code>liveChange</code> event</a>: if you'd attach a handler to handle the <code>liveChange</code> event, and do the backend query from there, then a backend request would be sent for each keystroke while the user is typing in the search field. This causes a storm of requests that the backend must somehow handle. But most of these requests will be for naught, as the user is only interested in the result of the query that matches the last complete search term they typed.
So, rather than firing a query to the backend for each and every keystroke, it makes more sense to buffer these events, and react only to the last one. The <code>bufferedEventHandler</code> utility helps you to do just that in a generic and reusable way.
This ui5tip describes the <a href="https://github.com/just-bi/ui5tips/blob/main/bufferedeventhandler/utils/bufferedEventHandler.js">bufferedEventHandler</a> utility. It is available on github under terms of the Apache 2.0 License. There's also a <a href="https://github.com/just-bi/ui5tips/tree/main/bufferedeventhandler">sample application</a> so you can try it out yourself.
<h1><a id="user-content-the-bufferedeventhandler-sample-app" class="anchor" href="#the-bufferedeventhandler-sample-app" aria-hidden="true"></a>The BufferedEventHandler sample app</h1>
The <a href="https://github.com/just-bi/ui5tips/tree/main/bufferedeventhandler"><code>bufferedEventHandler</code> sample application</a> illustrates the scenario from the introduction. It consists of a single page showing mockup company data in a <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.table.Table" rel="nofollow"><code>sap.ui.table.Table</code></a>. A screenshot is shown below:
<img src="https://github.com/just-bi/ui5tips/raw/main/bufferedeventhandler/images/BufferedEventHandlerApp-schreenshot.png?raw=true" alt="Screenshot of the BufferedEventHandler sample app." />
At the top left of the grid, there's a <a href="https://openui5.hana.ondemand.com/#/api/sap.m.SearchField" rel="nofollow"><code>sap.m.SearchField</code></a> labeled "Search in Name". The user can type some search term into the searchfield, and the grid will automatically refresh and show only the rows for which the CompanyName has a case-insensitive match with the entered search term.
While the search happens <strong>automatically</strong>, it does not happen <strong>immediately</strong> as the search term changes at every keystroke. Rather, about 1 second after the user stops typing, the data grid is filtered.
At the top right of the grid, there's a <a href="https://openui5.hana.ondemand.com/#/api/sap.m.ProgressIndicator" rel="nofollow"><code>sap.m.ProgressIndicator</code></a> labeled "Event buffer Timeout". The progress indicator reflects how much time has passed since the last keystroke. When the progress indicator reaches 100%, the filter action is executed.
<h1><a id="user-content-the-bufferedeventhandler-utility" class="anchor" href="#the-bufferedeventhandler-utility" aria-hidden="true"></a>The bufferedEventHandler Utility</h1>
To buffer events we provide a <code>bufferedEventHandler</code> utility object with just one <code>bufferEvents</code> function. You can find this in the <a href="https://github.com/just-bi/ui5tips/blob/main/bufferedeventhandler/utils/bufferedEventHandler.js"><code>bufferedEventHandler</code> file</a> in the <a href="https://github.com/just-bi/ui5tips/blob/main/bufferedeventhandler/utils"><code>utils</code> directory</a>.
To use it, we need to import it into the source file where we want to use it. This will usually be in a ui5 controller and in the sample app we do this in <a href="https://github.com/just-bi/ui5tips/blob/main/bufferedeventhandler/components/mainpage/MainPage.controller.js"><code>MainPage.controller.js</code></a>:
<pre>sap.ui.define([
"sap/ui/core/mvc/Controller",
"sap/ui/table/Column",
"sap/m/Text",
"sap/ui/model/Filter",
"sap/ui/model/FilterOperator",
"sap/ui/model/FilterType",
<b>"ui5tips/utils/bufferedEventHandler"</b>
],
function(
Controller,
Column,
Text,
Filter,
FilterOperator,
FilterType,
<b>bufferedEventHandler</b>
){
"use strict";
var controller = Controller.extend("ui5tips.components.mainpage.MainPage", {
...
});
return controller;
});
</pre>
We can now refer to the <code>bufferedEventHandler</code> utility through the local variable that is also called <code>bufferedEventHandler</code>.
The controller uses the <code>bufferedEventHandler</code> utility in the <code>initSearchField()</code> method. This is called from the controller's standard <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.core.mvc.Controller%23methods/onInit" rel="nofollow"><code>onInit()</code> lifecycle method</a>, which is called just once for the <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.core.mvc.Controller" rel="nofollow"><code>Controller</code></a> instance:
<pre> ...
onInit: function() {
this.initSearchField();
},
initSearchField: function(){
var searchField = this.byId('searchField');
<b>bufferedEventHandler.bufferEvents</b>(
// event provider
searchField,
// timeInterval
1000,
// eventId
'liveChange',
// data
null,
// handler
this.doSearch,
// listener
this,
// progressHandler
this.searchFieldProgress,
// progressUpdateInterval
50
);
},
...
</pre>
<h1><a id="user-content-the-bufferevents-method" class="anchor" href="#the-bufferevents-method" aria-hidden="true"></a>The <code>bufferEvents()</code> Method</h1>
The meat of the <code>initSearchField()</code> method is the call to the <code>bufferEvents</code> method of the <code>bufferedEventHandler</code> utility. This method has the following arguments:
<ul>
<li><code>eventProvider</code>: the 1st argument should be the object that emits the events - in our example this is the <code>sap.m.SearchField</code>. This object should be a subclass of <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.base.EventProvider" rel="nofollow"><code>sap.ui.base.EventProvider</code></a>. (<code>bufferEvents</code> will throw an error if it's not!)</li>
<li><code>timeInterval</code>: the 2nd argument is the timeout, in milliseconds. This is the amount of time that should pass between the occurrence of the last event and the call to the actual handler of the event. If a new event occurs during the wait period, the timeout is reset, and a new waiting period is started. In the example, we use a <code>timeInterval</code> of <code>1000</code> - that is, we will wait 1000 milliseconds (1 second) before handling the last event.</li>
</ul>
Choosing the <code>timeInterval</code> is a balancing act. In the case of the example, where the events are generated in response to user actions, the <code>timeInterval</code> should not be too short, as the user should be given enough time to type a meaningful search term before the actual query kicks in. But if the <code>timeInterval</code> is too long, the application may appear unresponsive, and the user may try to retype their search term, which will only postpone the reaction even more. (There's more about this in the section about the ProgressIndicator.)
The next 4 arguments of <code>bufferEvents</code> correspond to <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.base.EventProvider%23methods/attachEvent" rel="nofollow"><code>sap.ui.base.EventProvider</code>'s <code>attachEvent()</code> method</a>:
<ul>
<li><code>eventId</code>: a string that identifies the event to listen to. In our example this is <code>'liveChange'</code>. Some ui5 objects (for example, <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.base.ManagedObject" rel="nofollow"><code>sap.ui.base.ManagedObject</code>s</a>, which includes all <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.core.Control" rel="nofollow"><code>sap.ui.core.Control</code>s</a>) describe the events they expose through their <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.base.Object%23methods/getMetadata" rel="nofollow">metadata</a>. In these cases, <code>bufferEvents</code> will verify whether the passed <code>eventId</code> is in fact exposed by the object, and it will throw an error in case it isn't. <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.base.EventProvider" rel="nofollow"><code>EventProvider</code>s</a> that do not expose their events through metadata can still be used with the <code>bufferedEventHandler</code>, but then you'll need to make sure yourself that the value for <code>eventId</code> is valid, as <code>bufferEvents</code> has no way of checking it.</li>
<li><code>data</code>: an optional argument to pass any "extra" data that the event handler might need. In the example, we pass <code>null</code> as we have no need for any additional data.</li>
<li><code>handler</code>: this should be the callback function that will be called upon to actually handle the event. The callback function will receive an instance of <a href="https://openui5.hana.ondemand.com/#/api/sap.ui.base.Event" rel="nofollow"><code>sap.ui.base.Event</code></a> as its single argument, which typically provides access to all relevant information pertaining to the event. In the example, we pass <code>this.doSearch</code>, which is a method of the controller that will perform the actual filtering of the data grid.</li>
<li><code>listener</code>: this is an optional argument which you can use to specify the scope in which the handler will be called. Typically the handler will not be completely standalone, but it will refer to a <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/this" rel="nofollow"><code>this</code> object</a>, one way or another. If the handler function is not already bound (for example, by using the function's <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Function/bind" rel="nofollow"><code>bind()</code> method</a>), then you should pass whatever object should act as <code>this</code> for the handler function via the <code>listener</code> argument. In the example, we simply use <code>this</code> which refers to the controller instance itself. This makes sense as the handler function is also a method of the controller. (Remember: we passed <code>this.doSearch</code> as handler.)</li>
</ul>
In the call to <code>bufferEvents</code>, these arguments will be used to create an actual handler for the event, and to automatically attach it to the <code>eventProvider</code> for the specified <code>eventId</code>. But rather than calling the passed <code>handler</code> directly, it will start a <a href="https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/setTimeout" rel="nofollow">javascript timeout</a> for a duration of the passed <code>timeInterval</code>. If the timeout was already initiated, it is cleared, thus canceling the previous event and initiating a new waiting period.
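The core of this mechanism is a classic debounce. Here is a minimal standalone sketch - not the actual <code>bufferedEventHandler</code> code, and the timer functions are made injectable here purely so the sketch can be exercised without a browser:

```javascript
// Minimal sketch of the buffering logic (illustrative, not the ui5tips source).
// Every incoming event cancels the pending timeout and starts a new one, so
// the wrapped handler fires only after timeInterval ms of event silence.
function makeBufferedHandler(handler, timeInterval, setTimeoutFn, clearTimeoutFn) {
  setTimeoutFn = setTimeoutFn || setTimeout;      // injectable for testing
  clearTimeoutFn = clearTimeoutFn || clearTimeout;
  var timeoutId = null;
  return function(event) {
    if (timeoutId !== null) {
      clearTimeoutFn(timeoutId);  // a newer event arrived: drop the stale one
    }
    timeoutId = setTimeoutFn(function() {
      timeoutId = null;
      handler(event);             // only the last event is actually handled
    }, timeInterval);
  };
}
```

Attaching such a wrapper for the <code>liveChange</code> event is essentially what <code>bufferEvents</code> automates, on top of the <code>eventId</code> validation and progress callbacks.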
<h1><a id="user-content-monitoring-wait-progress" class="anchor" href="#monitoring-wait-progress" aria-hidden="true"></a>Monitoring wait progress</h1>
The final 2 arguments to <code>bufferEvents</code> are optional, and may be used by the application to monitor the waiting period between the occurrence of the last event and the time when the handler will actually be called:
<ul>
<li><code>progressHandler</code>: when passed, this should be a callback function which is to be called at the start and during the waiting period. If the <code>progressHandler</code> callback is called, it will be called using the <code>listener</code> as scope. The callback will be passed a floating point number between <code>0</code> and <code>1</code>, indicating the fraction of the time that has passed between the last event and now. If a <code>progressHandler</code> is specified, it is always called at least once and passed <code>0</code> whenever a new waiting period is initiated. In this example we passed <code>this.searchFieldProgress</code>, which is a method of the controller that updates the <a href="https://openui5.hana.ondemand.com/#/api/sap.m.ProgressIndicator" rel="nofollow"><code>sap.m.ProgressIndicator</code></a> that sits in the right top of the data grid.</li>
<li><code>progressUpdateInterval</code>: this should be an integer, indicating the number of milliseconds between the calls to the <code>progressHandler</code>. In our example it is 50, which means we will get <code>1000 / 50 = 20</code> updates during the waiting period, which ensures a smooth and regular update of the <code>ProgressIndicator</code> control.</li>
</ul>
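To make the arithmetic concrete, here is an illustrative helper (my own, not part of the utility) that lists the sequence of fractions the <code>progressHandler</code> would see over one waiting period with the sample's settings:

```javascript
// Illustrative only: the fractions passed to progressHandler over one
// waiting period, from the initial 0 up to and including 1.
function progressFractions(timeInterval, progressUpdateInterval) {
  var fractions = [];
  for (var t = 0; t <= timeInterval; t += progressUpdateInterval) {
    fractions.push(t / timeInterval);
  }
  return fractions;
}
var f = progressFractions(1000, 50);
console.log(f.length);    // 21: the initial 0 plus 20 updates
console.log(f[1], f[20]); // 0.05 1
```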
<h1><a id="user-content-the-progressindicator" class="anchor" href="#the-progressindicator" aria-hidden="true"></a>The ProgressIndicator</h1>
The sample application provides a <a href="https://openui5.hana.ondemand.com/#/api/sap.m.ProgressIndicator"><code>sap.m.ProgressIndicator</code></a> to indicate when the entered search term will be used to filter the data.
A progress indicator may not be necessary in case the <code>timeInterval</code> is so short that it will appear to the user as if the event is handled immediately. But when the <code>timeInterval</code> exceeds <code>200</code> or <code>250</code> milliseconds, most users will start to experience a noticeable lag.
Now, there is a strange psychological phenomenon at work here: while the user is still typing their search term, they will be happy that the backend query is not fired yet - it would make them feel rushed if the grid was constantly being updated while they were typing. But once the user is done typing their search term, they want the result as quickly as possible. Obviously, the software cannot read the user's mind (yet!), so once the user stops typing, the application needs to let the user know it has acknowledged their input, and that it is 'working on it'.
Hence the need for a progress indicator: by having a visual indicator that "something's happening", the user will be assured the application has acknowledged their input, and this will make the wait period before actually handling the event more acceptable.
If the wait is sufficiently short, a simple busyIndicator might do the trick, but since the <code>progressHandler</code> gets passed an exact estimate of how much longer the user will need to wait, our progress indicator can communicate this to the user. This will make the application's behavior more predictable and hopefully more satisfying to use.
Of course, it is not absolutely necessary to use the <a href="https://openui5.hana.ondemand.com/#/api/sap.m.ProgressIndicator"><code>sap.m.ProgressIndicator</code></a> to give this kind of feedback to the user. It's just that for this sample, this was the easiest, most straightforward illustration of this principle. You can use the <code>progressHandler</code> callback to do anything you like to fit your need.
<h1><a id="user-content-detaching" class="anchor" href="#detaching" aria-hidden="true"></a>Detaching</h1>
The <code>bufferEvents</code> method will create and attach a handler to the eventProvider. <code>bufferEvents</code> will also return that generated handler so you can detach it explicitly from the <code>eventProvider</code> if you need to. As a convenience, the returned handler provides its own <code>detach()</code> method for this purpose:
<pre>var bufferedEventHandlerInstance = bufferedEventHandler.bufferEvents(...);
...
<b>bufferedEventHandlerInstance.detach()</b>;
</pre>
(Note that in a typical scenario, the eventHandler and the eventProvider will almost certainly be in the same scope and lifecycle, so there is rarely a need to explicitly do this.)
<h1>Other Use Cases</h1>
The live search scenario may not always be a convincing use case. For example, if the query is done against a client model rather than a remote backend system, then it might not actually be a problem to re-issue the query for every keystroke. But there are other scenarios that benefit from event buffering. We will encounter one such case in the tip about <a href="https://github.com/just-bi/ui5tips/wiki/Persistent-UI-State">Persisting UI State</a>.
<h1>Finally</h1>
Did you like this tip? Do you have a better tip? Feel free to post a comment and share your approach to the same or similar problem.
Want more tips? Find other posts with the <a href="https://blogs.sap.com/tag/ui5tips/">ui5tips tag</a>!rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-19433416576026356842024-01-03T12:17:00.002+01:002024-01-03T12:27:58.987+01:00UI5 Tips: Manipulating the sap.m.TabContainer close buttons with custom CSSHere's a ui5tip to show how you can change the look and feel of the <a href="https://openui5.hana.ondemand.com/#/api/sap.m.TabContainer" rel="nofollow"><code>sap.m.TabContainer</code></a> with a minimal amount of custom CSS. If you want to try this for yourself, be sure to <a href="https://github.com/just-bi/ui5tips/tree/main/tabcontainer">check out the sample application from github</a>.
<h1>The sap.m.TabContainer</h1>
The <a href="https://openui5.hana.ondemand.com/#/api/sap.m.TabContainer" rel="nofollow"><code>sap.m.TabContainer</code></a> provides a simple, no-nonsense widget to build a tabbed user interface (check out <a href="https://openui5.hana.ondemand.com/entity/sap.m.TabContainer" rel="nofollow">the samples</a>). Tabs can be added via <a href="https://openui5.hana.ondemand.com/api/sap.m.TabContainer#aggregations" rel="nofollow">the items aggregation</a>, which should contain a collection of <a href="https://openui5.hana.ondemand.com/#/api/sap.m.TabContainerItem" rel="nofollow"><code>sap.m.TabContainerItem</code></a>'s.
While this control generally suits my needs, it has one feature I find problematic: each tab always has a close button, which appears as a little 'X' icon on the right side of the tab. If the user clicks it, it will actually 'close' the tab, that is: the respective <code>sap.m.TabContainerItem</code> will be removed from the TabContainer.
See the screenshot below to see what the default looks like (close buttons highlighted in red):<img src="https://github.com/just-bi/ui5tips/raw/main/tabcontainer/images/DefaultTabContainer_closebutton.png?raw=true" alt="default sap.m.TabContainer with close buttons on each tab." />
<h1><a id="user-content-suppressing-the-close-action" class="anchor" href="#suppressing-the-close-action" aria-hidden="true"></a>Suppressing the close action</h1>
The openui5 samples show how you can suppress that behavior: you can write an event handler for <a href="https://sapui5.hana.ondemand.com/#/api/sap.m.TabContainer%23events/itemClose" rel="nofollow">the <code>itemClose</code> event</a>, and then call <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.base.Event%23methods/preventDefault" rel="nofollow">the <code>preventDefault()</code> method</a> on the event:
(In the view xml:)
<pre><<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainer</span> <b><span class="pl-e">itemClose</span>=<span class="pl-s"><span class="pl-pds">"</span>onTabContainerItemClose<span class="pl-pds">"</span></span></b>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>>
...
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>>
...
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainer</span>>
</pre>
(In the controller javascript:)
<pre> <b>onTabContainerItemClose</b>: <span class="pl-k">function</span><span class="pl-kos">(</span><span class="pl-s1">event</span><span class="pl-kos">)</span><span class="pl-kos">{</span>
<span class="pl-s1">event</span><span class="pl-kos">.</span><b><span class="pl-en">preventDefault</span><span class="pl-kos">(</span><span class="pl-kos">)</span></b><span class="pl-kos">;</span>
<span class="pl-kos">}</span>
</pre>
Obviously, it would be strange if we always prevented the tab from being closed: suppressing the default close action only makes sense in a context where the user is supposed to be able to close the tab at all. In such a case, the event handler could pop up a dialog asking the user whether they really meant to close the tab or want to keep it open.
But the use case I frequently encounter is that the tab should not be closeable in the first place. While suppressing the close action would ensure the tab is never closed, it would confuse and anger the user, as the close button itself would still be there, inviting users to perform an action that can never be fulfilled.
<h1><a id="user-content-using-the-other-tab-widget" class="anchor" href="#using-the-other-tab-widget" aria-hidden="true"></a>Using the other Tab widget</h1>
One might suggest using the <a href="https://sapui5.hana.ondemand.com/#/api/sap.m.IconTabBar" rel="nofollow"><code>sap.m.IconTabBar</code></a> widget instead of the <code>sap.m.TabContainer</code>. The <code>sap.m.IconTabBar</code> takes <a href="https://sapui5.hana.ondemand.com/#/api/sap.m.IconTabFilter" rel="nofollow"><code>sap.m.IconTabFilter</code></a>'s in its items collection, and these do not have a close button.
Now, in some cases the <code>sap.m.IconTabBar</code>/<code>sap.m.IconTabFilter</code> may suit your needs and then you're fine. However I find that it has a number of other drawbacks (which I won't get into here).
Besides, the <code>sap.m.IconTabBar</code> introduces a similar problem, but the other way around: whereas we cannot get rid of the close button in the <code>sap.m.TabContainer</code>, we can never have a close button in the <code>sap.m.IconTabBar</code>. What we really want is a property that lets us control whether a tab has a close button or not.
<h1><a id="user-content-css-to-the-rescue" class="anchor" href="#css-to-the-rescue" aria-hidden="true"></a>CSS to the rescue</h1>
In the previous section we argued that we'd really like to be able to control for each individual tab whether it has a close button at all, for example, by setting a property.
To add a property one would normally have to extend a ui5 control, and attach some code so that the property setting can somehow influence the behavior of the control - in this case, control whether or not the close button will be displayed. While this is probably possible (I haven't tried it for this case), it does seem like an extraordinary measure for such a humble request.
I found that a similar effect can be achieved by <a href="https://sapui5.hana.ondemand.com/1.32.7/docs/guide/723f4b2334e344c08269159797f6f796.html" rel="nofollow">applying some custom CSS</a> in combination with standard ui5 features. That's what this entire sample is about.
With this tip you can:
<ul>
<li>hide all close buttons for an entire <code>sap.m.TabContainer</code></li>
<li>hide the close button on an individual <code>sap.m.TabContainerItem</code></li>
<li>show the close button on an individual <code>sap.m.TabContainerItem</code> in case the close buttons are hidden by default on the <code>sap.m.TabContainer</code></li>
</ul>
All this functionality requires the inclusion of some css. In the sample this is all isolated in a single <a href="https://github.com/just-bi/ui5tips/blob/main/tabcontainer/css/ui5-customization.css"><code>ui5-customization.css</code> file</a>, which is included into the application by <a href="https://sapui5.hana.ondemand.com/1.32.7/docs/guide/723f4b2334e344c08269159797f6f796.html" rel="nofollow">declaring it in the manifest.json</a>.
<h1><a id="user-content-hiding-all-close-buttons" class="anchor" href="#hiding-all-close-buttons" aria-hidden="true"></a>Hiding all close buttons</h1>
To hide all the close buttons for all <code>sap.m.TabContainerItem</code> in the items collection of a particular <code>sap.m.TabContainer</code>, simply add the <code>noCloseButtons</code> style class:
<pre><<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainer</span>
<b><span class="pl-e">class</span>=<span class="pl-s"><span class="pl-pds">"</span>noCloseButtons<span class="pl-pds">"</span></span></b>
>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
<span class="pl-c"><span class="pl-c"><!--</span> note: close button will be hidden by default for each m:TabContainerItem <span class="pl-c">--></span></span>
...
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainer</span>>
</pre>
This works because the <code>class</code> property in the ui5 xml view is rendered to the html dom directly. So the html elements that ui5 creates to implement the TabContainer widget will be selectable with a class selector in css, and this is how we can relatively simply influence the look of our TabContainer through css.
In our <code>ui5-customization.css</code> file, this is how we use <code>noCloseButtons</code> class to hide the buttons:
<pre><span class="pl-ent">div</span>.<span class="pl-c1">sapMTabContainer</span><b>.<span class="pl-c1">noCloseButtons</span></b> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStripContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStrip</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTSTabsContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTSTabs</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStripItem</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTSItemCloseBtnCnt</span> {
<b><span class="pl-c1">visibility</span>: hidden;</b>
}
</pre>
(Note the initial selector, <code>div.sapMTabContainer.</code><strong><code>noCloseButtons</code></strong>, and the chain of <code>></code> child selectors, which target the actual bit of html that renders the close button; it is hidden simply by setting the css <code>visibility</code> property to <code>hidden</code>.)
In the sample application, you can see this behavior in action in the <a href="https://github.com/just-bi/ui5tips/blob/main/tabcontainer/components/app/App.view.xml">App.view.xml</a>. This contains the code for the outermost tabcontainer. A screenshot is shown below, and as you can see, both tabs ("Hide individually" and "Show individually") lack a close button:
<img src="https://github.com/just-bi/ui5tips/raw/main/tabcontainer/images/TabContainer_noCloseButtons.png?raw=true" alt="Hiding the close buttons on a sap.m.TabContainer by applying the noCloseButtons css class." />
<h1><a id="user-content-hiding-an-individual-close-button" class="anchor" href="#hiding-an-individual-close-button" aria-hidden="true"></a>Hiding an individual close button</h1>
If we can add a custom css class to <code>sap.m.TabContainer</code> to hide all close buttons, then surely it should be possible to follow the same approach for an individual <code>sap.m.TabContainerItem</code>, right? Yes, it should, <a href="https://github.com/SAP/openui5/issues/2946">but sadly, we cannot</a>. (The reason is that <code>sap.m.TabContainer</code> is a subclass of <code>sap.ui.core.Control</code>, which provides an <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.core.Control%23methods/addStyleClass" rel="nofollow"><code>addStyleClass()</code> method</a>, whereas <code>sap.m.TabContainerItem</code> is a subclass of <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.core.Element" rel="nofollow"><code>sap.ui.core.Element</code></a>, which does not have such a method.)
Now, let's take a step back and think about how we used the css style class on the <code>sap.m.TabContainer</code> to hide all the close buttons. By setting the custom css style class on the <code>sap.m.TabContainer</code>, the html dom was changed to include the custom class, and we could then use that in a css selector.
So even if we cannot apply a css style class to a <code>sap.m.TabContainerItem</code>, might there be another way to influence how ui5 writes the html dom, so that we may target it with a selector in our custom css? It turns out that such a mechanism exists, in the form of <a href="https://sapui5.hana.ondemand.com/1.36.6/docs/guide/1ef9fefa2a574735957dcf52502ab8d0.html" rel="nofollow">ui5 custom data</a>.
The <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.core.Element%23aggregations" rel="nofollow">custom data aggregation</a> is provided by <code>sap.ui.core.Element</code> and thus available to its subclasses, including <code>sap.m.TabContainerItem</code>. A <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.core.CustomData" rel="nofollow">custom data item</a> is an arbitrary key/value pair, and by setting its <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.core.CustomData%23constructor" rel="nofollow"><code>writeToDom</code> property</a>, ui5 will render it to the html dom as a <a href="https://developer.mozilla.org/en-US/docs/Learn/HTML/Howto/Use_data_attributes" rel="nofollow">html data attribute</a>.
To see what it looks like in our sample, take a look at <a href="https://github.com/just-bi/ui5tips/blob/main/tabcontainer/components/app/TabContainerItemWithHiddenCloseButton.fragment.xml#L45"><code>TabContainerItemWithHiddenCloseButton.fragment.xml</code></a>, which uses it to hide the close button in the second <code>sap.m.TabContainerItem</code>, in an otherwise normal <code>sap.m.TabContainer</code>:
<pre><<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>
<span class="pl-e">id</span>=<span class="pl-s"><span class="pl-pds">"</span>item2<span class="pl-pds">"</span></span>
<span class="pl-e">name</span>=<span class="pl-s"><span class="pl-pds">"</span>No Close Button<span class="pl-pds">"</span></span>
>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">customData</span>>
<b><<span class="pl-ent">core</span><span class="pl-ent">:</span><span class="pl-ent">CustomData</span> <span class="pl-e">writeToDom</span>=<span class="pl-s"><span class="pl-pds">"</span>true<span class="pl-pds">"</span></span> <span class="pl-e">key</span>=<span class="pl-s"><span class="pl-pds">"</span>noCloseButton<span class="pl-pds">"</span></span> <span class="pl-e">value</span>=<span class="pl-s"><span class="pl-pds">"</span>true<span class="pl-pds">"</span></span>/></b>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">customData</span>>
...
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>>
</pre>
Because the <code>CustomData</code>'s <code>writeToDom</code> property is set to <code>true</code>, ui5 will render a html data attribute into the html dom that looks something like this:
<div>
<pre>data-noclosebutton='true'</pre>
</div>
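Note that the <code>noCloseButton</code> key ends up lowercased in the attribute name. The mapping can be sketched as follows (the helper function here is purely our own illustration, not a ui5 API):

```javascript
// Sketch of how a CustomData key/value pair with writeToDom=true ends up
// as a html data attribute: the key is lowercased and prefixed with
// "data-", and the value is written as the attribute value.
// (customDataToDomAttribute is a hypothetical helper, not a ui5 API.)
function customDataToDomAttribute(key, value) {
  return {
    name: "data-" + key.toLowerCase(),
    value: String(value)
  };
}

// The CustomData from the fragment above...
const attribute = customDataToDomAttribute("noCloseButton", true);
// ...yields: { name: "data-noclosebutton", value: "true" }
```

This lowercasing matters when writing the css selector: the attribute to match is <code>data-noclosebutton</code>, not <code>data-noCloseButton</code>.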
And in our <a href="https://github.com/just-bi/ui5tips/blob/main/tabcontainer/css/ui5-customization.css#L41"><code>ui5-customization.css</code> file</a>, the following rule is intended to pick that up and hide the close button:
<pre><span class="pl-ent">div</span>.<span class="pl-c1">sapMTabContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStripContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStrip</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTSTabsContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTSTabs</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStripItem</span>[<b><span class="pl-c1">data-noclosebutton</span><span class="pl-c1">=</span><span class="pl-s">'true'</span></b>] <span class="pl-c1">></span> .<span class="pl-c1">sapMTSItemCloseBtnCnt</span> {
<b><span class="pl-c1">visibility</span>: hidden;</b>
}
</pre>
As you can see, it is very similar to the rule we used to hide all close buttons on any <code>sap.m.TabContainer</code> having the <code>noCloseButtons</code> class, except that now the class is missing and instead we use a css attribute selector based on the data attribute:
<pre>.<span class="pl-c1">sapMTabStripItem</span>[<span class="pl-c1">data-noclosebutton</span><span class="pl-c1">=</span><span class="pl-s">'true'</span>]
</pre>
The screenshot below shows what it looks like in the app. Note that the tab named "Default" has the close button as usual, but the one named "No Close Button" - the one with the custom data attribute - does not show a close button:
<img src="https://github.com/just-bi/ui5tips/raw/main/tabcontainer/images/TabContainerItem_noCloseButton.png?raw=true" alt="Individual TabContainerItem with a hidden close button." />
<h1><a id="user-content-showing-an-individual-close-button" class="anchor" href="#showing-an-individual-close-button" aria-hidden="true"></a>Showing an individual close button</h1>
The final hack in this sample combines the style class and the custom data attribute. CSS allows us to write a selector that takes both the presence of the css style class and the presence of a html data attribute into account. We can put this to good use if we want a <code>sap.m.TabContainer</code> that hides all close buttons by default, but undoes the hiding of the close button for specific <code>sap.m.TabContainerItem</code>'s, based on the value of a data attribute (which is in turn controlled by the ui5 Custom Data feature).
You can see this in action in the <a href="https://github.com/just-bi/ui5tips/blob/main/tabcontainer/components/app/TabContainerItemWithHiddenCloseButtons.fragment.xml"><code>TabContainerItemWithHiddenCloseButtons.fragment.xml</code> file</a> of the example:
<pre><<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainer</span>
<b><span class="pl-e">class</span>=<span class="pl-s"><span class="pl-pds">"</span>noCloseButtons<span class="pl-pds">"</span></span></b>
>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span> <span class="pl-e">name</span>=<span class="pl-s"><span class="pl-pds">"</span>Default<span class="pl-pds">"</span></span>>
...
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span> <span class="pl-e">name</span>=<span class="pl-s"><span class="pl-pds">"</span>Show Close Button<span class="pl-pds">"</span></span>>
<<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">customData</span>>
<b><<span class="pl-ent">core</span><span class="pl-ent">:</span><span class="pl-ent">CustomData</span> <span class="pl-e">key</span>=<span class="pl-s"><span class="pl-pds">"</span>noCloseButton<span class="pl-pds">"</span></span> <span class="pl-e">value</span>=<span class="pl-s"><span class="pl-pds">"</span>false<span class="pl-pds">"</span></span> <span class="pl-e">writeToDom</span>=<span class="pl-s"><span class="pl-pds">"</span>true<span class="pl-pds">"</span></span> /></b>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">customData</span>>
..
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainerItem</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">items</span>>
</<span class="pl-ent">m</span><span class="pl-ent">:</span><span class="pl-ent">TabContainer</span>>
</pre>
Again the second <code>sap.m.TabContainerItem</code> has a <code>CustomData</code> item with the key <code>noCloseButton</code>, but now the value is <code>false</code>, so as to override the effect of the <code>noCloseButtons</code> style class applied to the <code>sap.m.TabContainer</code>.
In our <a href="https://github.com/just-bi/ui5tips/blob/main/tabcontainer/css/ui5-customization.css#L57"><code>ui5-customization.css</code> file</a>, the following rule is intended to pick that up and show the close button:
<pre><span class="pl-ent">div</span>.<span class="pl-c1">sapMTabContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStripContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStrip</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTSTabsContainer</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTSTabs</span> <span class="pl-c1">></span> .<span class="pl-c1">sapMTabStripItem</span>[<b><span class="pl-c1">data-noclosebutton</span><span class="pl-c1">=</span><span class="pl-s">'false'</span></b>] <span class="pl-c1">></span> .<span class="pl-c1">sapMTSItemCloseBtnCnt</span> {
<b><span class="pl-c1">visibility</span>: visible;</b>
}
</pre>
The screenshot below shows what it looks like when you run the sample application:
<img src="https://github.com/just-bi/ui5tips/raw/main/tabcontainer/images/TabContainerItem_noCloseButton_false.png?raw=true" alt="Show the close button in an individual TabContainerItem, overriding the effect of the noCloseButtons style class of the sap.m.TabContainer." />
<h1>Finally</h1>
Did you like this tip? Do you have a better tip? Feel free to post a comment and share your approach to the same or similar problem.
Want more tips? Find other posts with the <a href="https://blogs.sap.com/tag/ui5tips/">ui5tips tag</a>!rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-10526430861262424812024-01-03T12:15:00.000+01:002024-01-03T12:15:45.461+01:00UI5 Tips: Adding a Splash-Screen / Loading indicatorUI5 apps always take a little time to load. If you don't take any precautions, the user will be looking at a blank screen while the app is loading.
It is fairly simple to add a splash screen and loading indicator to improve the user experience. This will give your app a more professional appearance, and it will cost minimal effort.
This post shows you exactly how to do this. You might also be interested in checking out <a href="https://github.com/just-bi/ui5tips/tree/main/loadingscreen">the sample app from github</a> so you can run it yourself.
<h1><a id="user-content-running-the-app" class="anchor" href="#running-the-app" aria-hidden="true"></a>Running the app</h1>
The loadingscreen sample app is in the <a href="../tree/main/loadingscreen">loadingscreen</a> subdirectory of this repository. Simply expose the folder and all its contents with a webserver, and navigate to its index.html.
<h1><a id="user-content-how-does-it-work" class="anchor" href="#how-does-it-work" aria-hidden="true"></a>How does it work?</h1>
In various tutorials and the <a href="https://sapui5.hana.ondemand.com/#/topic/3da5f4be63264db99f2e5b04c5e853db" rel="nofollow">ui5 walkthrough</a>, the <code><body></code> is usually left empty. This is for a good reason - any content that is in the body would simply show up on the page, as is shown in the walkthrough's <a href="https://sapui5.hana.ondemand.com/#/topic/2680aa9b16c14a00b01261d04babbb39" rel="nofollow">"Hello world!" example</a>. But there is an exception: if an element has an <code>id</code> attribute with the value <code>busyIndicator</code>, then this content will be hidden after the ui5 bootstrap code is finished and the app content is placed inside the body.
In the loadingscreen sample, the index.html page has a static <code><div></code> element with an <code>id</code> attribute with the value <code>busyIndicator</code> as static content in the <code><body></code> element. Inside that <code><div></code> we can place anything we like. In the example, the div contains a header and a footer with static text to indicate that the application is loading. Between the header and the footer is an image of the ui5 logo, and a css animation, which superficially resembles the ui5 busy indicator animation:
<pre><body class="sapUiBody" id="content">
<!-- Loading splash screen -->
<div id="busyIndicator" style="text-align: center; font-family: Sans, Arial">
<!-- static header text -->
<h3>My UI5 App is loading</h3>
<!-- static image of the ui5logo -->
<img src="images/openui5-logo.png" class="logo"/>
<!--
loader animation
CC0 licensed code used with permission from
https://loading.io/css/
-->
<div class="lds-ellipsis">
<div></div>
<div></div>
<div></div>
<div></div>
</div>
<!-- end of loader animation -->
<!-- static footer text -->
<center><h5>This may take a few moments...</h5></center>
</div>
</body>
</pre>
The css animation requires some css, and this is included simply as a static css resource by including an appropriate <code><link></code> element in the <code><head></code> of the page:
<pre><head>
<!-- css required for the loading screen css animation -->
<link id="animation" rel="stylesheet" type="text/css" href="css/progress-animation.css"/>
</head>
</pre>
In this case, we pulled the css animation from the excellent site <a href="https://loading.io/css/" rel="nofollow">https://loading.io/css/</a>, which provides many different free css animations.
Note that any css required by the loading screen must really be included statically via the <code><link></code> or <code><style></code> element. The standard ui5 mechanism to <a href="https://sapui5.hana.ondemand.com/#/topic/723f4b2334e344c08269159797f6f796" rel="nofollow">include css by declaring it in the manifest.json</a> is no good as it will be loaded as part of the ui5 bootstrap, and the whole idea of the loading screen is to show something <strong>before</strong> the ui5 bootstrap even starts.
<h1><a id="user-content-what-does-it-look-like" class="anchor" href="#what-does-it-look-like" aria-hidden="true"></a>What does it look like?</h1>
This is what it looks like when the app is loading:
<img src="https://github.com/just-bi/ui5tips/raw/main/loadingscreen/images/loadingscreen.png?raw=true" alt="Screenshot of the loading screen" />
The text, logo image, and the loader animation are all static content and will show during UI5 bootstrap.
<h1><a id="user-content-next-steps" class="anchor" href="#next-steps" aria-hidden="true"></a>Next Steps</h1>
Obviously, this loading screen is only an example to show how you can make it work. The entire design of the loading screen is up to you.
Just remember to keep it light and quick: the whole reason to include a loading screen in the first place was to give the user something to look at while ui5 is bootstrapping. If the loading screen itself requires a lot of resources, it defeats its purpose. For this reason, you might consider including any css directly using the <code><style></code> element, rather than relying on a network request to load external css with the <code><link></code> element.
<h1>Finally</h1>
Did you like this tip? Do you have a better tip? Feel free to post a comment and share your approach to the same or similar problem.
Want more tips? Find other posts with the <a href="https://blogs.sap.com/tag/ui5tips/">ui5tips tag</a>!rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-52078364607608262002021-02-22T02:49:00.004+01:002021-02-22T02:49:54.157+01:00Year-to-Date on Synapse Analytics 5: Using Window FunctionsFor one of our <a href="https://www.just-bi.nl/" target="_justbi">Just-BI</a> customers we implemented a Year-to-Date calculation in a Azure Synapse Backend.
We encountered a couple of approaches, and in this series I'd like to share some sample code and discuss the merits and drawbacks of each approach.
<br/>
<br/>
<b>TL;DR</b>: <a href="#windowfunction">A Year-to-Date solution based on a <code><b style="color:magenta">SUM</b>()</code> window function</a> is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">self-<code style="color: blue">JOIN</code></a> (combined with a <code style="color: blue">GROUP BY</code>), a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery</a>, and a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html"><code style="color: blue">UNION</code></a> (also combined with a <code style="color: blue">GROUP BY</code>).
<br/>
<br/>
Note: this is the 5th post in a series.
<ul>
<li>For sample data and setup, please <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-1.html">see the 1st post</a> in this series. </li>
<li>For a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">solution based on a self-<code><b style="color:blue">JOIN</b></code> and <code><b style="color:blue">GROUP BY</b></code></a>, please find <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">the 2nd post</a> in this series.</li>
<li>For a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">solution based on a subquery</a>, please find <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">the 3rd post</a> in this series.</li>
<li>For a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html">solution based on a <code><b style="color:blue">UNION</b></code></a>, please find <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html">the 4th post</a> in this series.</li>
</ul>
(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)
<br/>
<br/>
<h3><a name="windowfunction">Using window functions</a></h3>
<br/>
Nowadays, many SQL engines and virtually all major RDBMSes support <a href="https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15" target="mssql">window functions</a> (sometimes called analytic functions).
A window function looks like a classic aggregate function. In some respects it also behaves like one, but at the same time there are essential differences.
<br/>
<br/>
<h4>Aggregate functions</h4>
<br/>
Consider the following example:<br>
<pre>
<b style="color: blue">select</b> <b style="color: magenta">sum</b>(SalesAmount) <b style="color: blue">as</b> SumOfSalesAmount
, <b style="color: magenta">count</b>(*) <b style="color: blue">as</b> RowCount
<b style="color: blue">from</b> SalesYearMonth
</pre>
The example uses two <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/aggregate-functions-transact-sql?view=sql-server-ver15" target="mssql">aggregate functions</a>, <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/sum-transact-sql?view=sql-server-ver15" target="mssql"><code style="color:magenta">SUM()</code></a> and <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/count-transact-sql?view=sql-server-ver15" target="mssql"><code style="color:magenta">COUNT()</code></a>. It returns a result like this:
<br/>
<br/>
<code><table style="font-family: courier,monospace;" border="1" cellpadding="5" cellspacing="5">
<thead>
<tr>
<th>SumOfSalesAmount</th>
<th>RowCount</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">109,846,381.43</td>
<td style="text-align: right">38</td>
</tr>
</tbody>
</table></code>
<br/>
Two things are happening here: <ul>
<li>Even though there are multiple rows in the <code>SalesYearMonth</code> table, the result consists of just one row. In other words, a collection of source rows has been <i>aggregated</i> into fewer (in this case, only one) result rows.</li>
<li>The functions have calculated a value based on some aspect of the individual rows in the source collection. In the case of <code><span style="color:magenta">SUM</span>(SalesAmount)</code>, the value of the <code>SalesAmount</code> column of each individual row was added to obtain a total. In the case of <code><span style="color:magenta">COUNT</span>(*)</code>, each row was counted, adding up to the total number of rows.</li>
</ul>
Because the previous example uses aggregate functions, we cannot also select any non-aggregated columns. For example, while <code>SalesYear</code> and <code>SalesMonth</code> are present in the individual underlying rows, we cannot simply select them, because they do not exist in the result row, which is an aggregate.
<br/>
<br/>
<h4>Window functions</h4>
<br/>
Now, <code style="color:magenta">SUM()</code> and <code style="color:magenta">COUNT()</code> also exist as window functions. Consider the following query:
<pre>
<b style="color: blue">select</b> SalesYear
, SalesMonth
, SalesAmount
, <b style="color: magenta">sum</b>(SalesAmount) <b style="color: blue">over</b>() <b style="color: blue">as</b> TotalOfSalesAmount
, <b style="color: magenta">count</b>(*) <b style="color: blue">over</b>() <b style="color: blue">as</b> RowCount
<b style="color: blue">from</b> SalesYearMonth
</pre>
You might notice that the last two expressions in the <code><b style="color: blue">SELECT</b></code>-list look almost identical to the aggregate functions in the previous example.
The difference is that in this query, the function call is followed by an <code><b style="color:blue">OVER()</b></code>-clause.
Syntactically, this is what distinguishes ordinary aggregate functions from window functions.
<br/>
<br/>
Here is its result: <br/>
<br/>
<code><table style="font-family: courier,monospace;" border="1" cellpadding="5" cellspacing="5">
<thead>
<tr>
<th>SalesYear</th>
<th>SalesMonth</th>
<th>SalesAmount</th>
<th>TotalOfSalesAmount</th>
<th>RowCount</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2011</td>
<td style="text-align: right">5</td>
<td style="text-align: right">503,805.92</td>
<td style="text-align: right">109,846,381.43</td>
<td style="text-align: right">38</td>
</tr>
<tr>
<td style="text-align: right">2011</td>
<td style="text-align: right">6</td>
<td style="text-align: right">458,910.82</td>
<td style="text-align: right">109,846,381.43</td>
<td style="text-align: right">38</td>
</tr>
<tr>
<td colspan="5" style="text-align: center">...more rows...</td>
</tr>
<tr>
<td style="text-align: right">2014</td>
<td style="text-align: right">6</td>
<td style="text-align: right">49,005.84</td>
<td style="text-align: right">109,846,381.43</td>
<td style="text-align: right">38</td>
</tr>
</tbody>
</table></code><br/>
Note that we now get all the rows from the underlying <code>SalesYearMonth</code> table: no aggregation has occurred.
But the window functions do return a result that is identical to the one we got when using them as aggregate functions, and they do so for each row of the <code>SalesYearMonth</code> table.
<br/>
<br/>
It's as if for each row of the underlying table, the respective aggregate function was called over all rows in the entire table.
Conceptually this is quite like the construct we used in the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery-solution</a>.
The following example illustrates this:<pre>
<b style="color: blue">select</b> SalesYear
, SalesMonth
, SalesAmount
, (
<b style="color: blue">select</b> <b style="color: magenta">sum</b>(SalesAmount)
<b style="color: blue">from</b> SalesYearMonth
) <b style="color: blue">as</b> TotalOfSalesAmount
, (
<b style="color: blue">select</b> <b style="color: magenta">count</b>(*)
<b style="color: blue">from</b> SalesYearMonth
) <b style="color: blue">as</b> RowCount
<b style="color: blue">from</b> SalesYearMonth
</pre>
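This equivalence is easy to verify with a quick script. The following is a minimal sketch using Python's built-in <code>sqlite3</code> module (it assumes an SQLite build with window-function support, version 3.25 or later); the miniature <code>SalesYearMonth</code> table and its numbers are made up for illustration, not the figures shown above:

```python
import sqlite3

# A miniature, hypothetical stand-in for the SalesYearMonth table
# (made-up numbers, not the actual figures from the article).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table SalesYearMonth(SalesYear int, SalesMonth int, SalesAmount real)"
)
conn.executemany(
    "insert into SalesYearMonth values (?, ?, ?)",
    [(2011, 5, 100.0), (2011, 6, 200.0), (2012, 1, 50.0)],
)

# Window-function version: SUM() OVER () and COUNT(*) OVER () repeat the
# grand total and the row count on every row.
window_rows = conn.execute("""
    select SalesYear, SalesMonth, SalesAmount
         , sum(SalesAmount) over () as TotalOfSalesAmount
         , count(*) over ()         as RowCount
    from SalesYearMonth
    order by SalesYear, SalesMonth
""").fetchall()

# Subquery version: one scalar subquery per expression yields the same rows.
subquery_rows = conn.execute("""
    select SalesYear, SalesMonth, SalesAmount
         , (select sum(SalesAmount) from SalesYearMonth) as TotalOfSalesAmount
         , (select count(*)         from SalesYearMonth) as RowCount
    from SalesYearMonth
    order by SalesYear, SalesMonth
""").fetchall()

print(window_rows == subquery_rows)  # the two queries return identical rows
```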
<h4>The <i>window</i> and the <code><b style="color:blue">OVER()</b></code>-clause</h4>
<br/>
Thinking about window functions as a shorthand for a subquery helps to understand how they work and also explains their name:
a window function returns the result of an aggregate function on a particular subset of the rows in the query scope.
This subset is called the <i>window</i> and it is defined by the <code><b style="color: blue">OVER</b>()</code>-clause.
<br/>
<br/>
The parentheses after the <b style="color: blue">OVER</b>-keyword can be used to define which rows are considered part of the window.
When left empty (as in the example above), all rows are considered.
<br/>
<br/>
<h4>Controlling the <i>window</i> using the <code><b style="color:blue">PARTITION BY</b></code>-clause</h4>
<br/>
If you compare the previous example with our prior <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery-solution</a>, you'll notice that here, we do not have a <code><b style="color:blue">WHERE</b></code>-clause to tie the subquery to the current row of the outer query. That's why our result is calculated over the entire table, rather than with respect to the current year and preceding months, as in our prior <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery-solution</a>.
This is equivalent to the empty parentheses following the <b style="color: blue">OVER</b>-keyword in the corresponding window-function example.
<br/>
<br/>
In the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery-solution</a> we wrote a <code><b style="color:blue">WHERE</b></code>-clause to specify a condition that ties the rows of the subquery to the current row.
For window functions, we can control which rows make up the window by writing a <code><b style="color:blue">PARTITION BY</b></code>-clause inside the parentheses following the <b style="color: blue">OVER</b>-keyword.
<br/>
<br/>
The <code><b style="color:blue">PARTITION BY</b></code>-clause does not let you specify an arbitrary condition, like we could in a subquery.
Instead, the relationship between the current row and rows in the window must be expressed through one or more attributes for which they share a common value. The following example may illustrate this:
<pre>
<b style="color: blue">select</b> SalesYear
, SalesMonth
, SalesAmount
, <b style="color: magenta">sum</b>(SalesAmount) <b style="color: blue">over</b>(<b style="color: blue">partition by</b> SalesYear) <b style="color: blue">as</b> YearTotalOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
</pre>
In the example above,
<code><b style="color: magenta">sum</b>(SalesAmount) <b style="color: blue">over</b>(<b style="color: blue">partition by</b> SalesYear)</code>
means: calculate the total of <code>SalesAmount</code> over all rows where the value of <code>SalesYear</code> is equal to the value of the <code>SalesYear</code> in the current row.
<br/>
<br/>The equivalent query using subqueries would be:<pre>
<b style="color: blue">select</b> OriginalSales.SalesYear
, OriginalSales.SalesMonth
, OriginalSales.SalesAmount
, (
<b style="color: blue">select</b> <b style="color: magenta">sum</b>(YearSales.SalesAmount)
<b style="color: blue">from</b> SalesYearMonth <b style="color: blue">as</b> YearSales
<b style="color: blue">where</b> YearSales.SalesYear = OriginalSales.SalesYear
) <b style="color: blue">as</b> YearTotalOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth <b style="color: blue">as</b> OriginalSales
</pre>
The result is shown below:
<br/>
<br/>
<code><table style="font-family: courier,monospace;" border="1" cellpadding="5" cellspacing="5">
<thead>
<tr>
<th>SalesYear</th>
<th>SalesMonth</th>
<th>SalesAmount</th>
<th>YearTotalOfSalesAmount</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">2011</td>
<td style="text-align: right">5</td>
<td style="text-align: right">503,805.92</td>
<td style="text-align: right">12,641,672.21</td>
</tr>
<tr>
<td style="text-align: right">2011</td>
<td style="text-align: right">6</td>
<td style="text-align: right">458,910.82</td>
<td style="text-align: right">12,641,672.21</td>
</tr>
<tr>
<td colspan="4" style="text-align: center">...more rows...</td>
</tr>
<tr>
<td style="text-align: right">2014</td>
<td style="text-align: right">6</td>
<td style="text-align: right">49,005.84</td>
<td style="text-align: right">20,057,928.81</td>
</tr>
</tbody>
</table></code>
(Note that <code>12,641,672.21</code> is the sum of the <code>SalesAmount</code> for <code>SalesYear 2011</code>; <code>20,057,928.81</code> is the total for <code>2014</code>.)
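The behavior of <code><b style="color:blue">PARTITION BY</b></code> can likewise be sketched with Python's <code>sqlite3</code> module (SQLite 3.25+ assumed); the miniature data below is hypothetical:

```python
import sqlite3

# Hypothetical miniature data: two years with a couple of months each.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table SalesYearMonth(SalesYear int, SalesMonth int, SalesAmount real)"
)
conn.executemany(
    "insert into SalesYearMonth values (?, ?, ?)",
    [(2011, 5, 100.0), (2011, 6, 200.0), (2012, 1, 50.0), (2012, 2, 70.0)],
)

# PARTITION BY SalesYear: each row is paired with the total of its own year.
rows = conn.execute("""
    select SalesYear, SalesMonth, SalesAmount
         , sum(SalesAmount) over (partition by SalesYear) as YearTotalOfSalesAmount
    from SalesYearMonth
    order by SalesYear, SalesMonth
""").fetchall()

year_totals = [row[3] for row in rows]
print(year_totals)  # 2011 rows carry 300.0; 2012 rows carry 120.0
```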
<br/>
<br/>
<h4>A partition for the preceding months?</h4>
<br/>
It's great that the <code><b style="color:blue">PARTITION BY</b></code>-clause allows us to specify a window for the relevant year, but that window is still too wide:
we want it to contain only the rows from the current year, and of those, only the current month and its preceding months.
In the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery-solution</a> this was easy, as we could write whatever condition we wanted in the <code><b style="color:blue">WHERE</b></code>-clause.
So we wrote:<pre>
<b style="color: blue">where</b> SalesYtd.SalesYear = SalesOriginal.SalesYear
<b style="color: blue">and</b> SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
</pre>
Specifying <code>SalesYear</code> in the window functions' <code><b style="color:blue">PARTITION BY</b></code>-clause is equivalent to the first part of the subquery's <code><b style="color:blue">WHERE</b></code>-clause condition.
<br/>
<br/>
It's less clear what our partition expression should look like to select all months preceding the current month.
It's not impossible though.
We can, for instance, write an expression that marks whether the current <code>SalesMonth</code> is equal to or less than a <i>specific</i> month. For example:
<pre>
<span style="color: teal">-- every month up to and including june is 1, all months beyond june are 0</span>
<b style="color: blue">case</b>
<b style="color: blue">when</b> SalesMonth <= <b style="color: red">6</b> <b style="color: blue">then</b> <b style="color: red">1</b>
<b style="color: blue">else</b> <b style="color: red">0</b>
<b style="color: blue">end</b>
</pre>
If we can write such an expression, then of course, we can also use it in a <code><b style="color:blue">PARTITION BY</b></code>-clause, like so:
<pre>
<b style="color:magenta">sum</b>(SalesAmount) <b style="color:blue">over</b> (
<b style="color:blue">partition by</b>
SalesYear
, <b style="color: blue">case</b>
<b style="color: blue">when</b> SalesMonth <= <b style="color: red">6</b> <b style="color: blue">then</b> <b style="color: red">1</b>
<b style="color: blue">else</b> <b style="color: red">0</b>
<b style="color: blue">end</b>
)
</pre>
Let's try and think what this brings us.
<br/>
<br/>
Suppose the value of <code>SalesMonth</code> is <span style="color:red;">6</span> (june), or less.
The <b style="color:blue;">CASE</b>-expression would return <span style="color:red;">1</span>, and the window function would take into account all rows for which this is the case.
So january, february, march and so on, up to june, would all get the total of those six months: that is, the YTD value for june.
<br/>
<br/>
On the other hand, if <code>SalesMonth</code> is larger than <span style="color:red;">6</span>, the <b style="color:blue;">CASE</b> expression evaluates to <span style="color:red;">0</span>.
So all months beyond june (that is: july, august, and so on, up to december) form a partition as well, and for those months, the sum over that partition is returned.
<br/>
<br/>
Now, it's not really clear what the outcome means when the month is beyond june. But it doesn't really matter: what is important is that we now know how to calculate the correct YTD value for a given month.
And, what we did for june, we can do for any other month.
So, once we have the YTD expressions for each individual month, we can set up yet another <b style="color:blue">CASE</b>-expression to pick the right one according to the current <code>SalesMonth</code>.
<br/>
<br/>
Putting all that together, we get:
<pre>
<b style="color: blue">select</b> SalesYear
, SalesMonth
, SalesAmount
, <b style="color: blue">case</b> SalesMonth
<span style="color: teal">-- january</span>
<b style="color: blue">when</b> <b style="color: red">1</b> <b style="color: blue">then</b> SalesAmount
<span style="color: teal">-- february</span>
<b style="color: blue">when</b> <b style="color: red">2</b> <b style="color: blue">then</b>
<b style="color: magenta">sum</b>(SalesAmount) <b style="color: blue">over</b>(
<b style="color: blue">partition by</b>
SalesYear
, <b style="color: blue">case when</b> SalesMonth <= <b style="color: red">2</b> <b style="color: blue">then</b> <b style="color: red">1</b> <b style="color: blue">else</b> <b style="color: red">0</b> <b style="color: blue">end</b>
)
...more cases for the other months...
<span style="color: teal">-- december</span>
<b style="color: blue">when</b> <b style="color: red">12</b> <b style="color: blue">then</b>
<b style="color: magenta">sum</b>(SalesAmount) <b style="color: blue">over</b>(
<b style="color: blue">partition by</b>
SalesYear
, <b style="color: blue">case when</b> SalesMonth <= <b style="color: red">12</b> <b style="color: blue">then</b> <b style="color: red">1</b> <b style="color: blue">else</b> <b style="color: red">0</b> <b style="color: blue">end</b>
)
<b style="color: blue">end as</b> YtDOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
</pre>
Like with the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html" ><code><b style="color:blue">UNION</b></code>-solution</a>, we are taking advantage of our knowledge of the calendar, which allows us to create these static expressions.
We would not be able to do this in a general case, or where the number of distinct values is very large. But for 12 months, we can manage.
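The construction above can be verified with a short script, again using Python's <code>sqlite3</code> module on made-up data (SQLite 3.25+ assumed). To avoid typing out the repetitive branches, the sketch generates the 11 <code><b style="color:blue">CASE</b></code>-branches programmatically:

```python
import sqlite3

# Hypothetical data: one year, all 12 months, SalesAmount = 100 * month,
# so the correct YTD for month m is 100 * (1 + 2 + ... + m).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table SalesYearMonth(SalesYear int, SalesMonth int, SalesAmount real)"
)
conn.executemany(
    "insert into SalesYearMonth values (?, ?, ?)",
    [(2011, m, 100.0 * m) for m in range(1, 13)],
)

# Generate the 11 repetitive CASE branches rather than typing them out:
# each branch sums over the partition (year, month <= m).
branches = "\n".join(
    f"when {m} then sum(SalesAmount) over ("
    f"partition by SalesYear, "
    f"case when SalesMonth <= {m} then 1 else 0 end)"
    for m in range(2, 13)
)
rows = conn.execute(f"""
    select SalesYear, SalesMonth, SalesAmount
         , case SalesMonth
             when 1 then SalesAmount
             {branches}
           end as YtdOfSalesAmount
    from SalesYearMonth
    order by SalesMonth
""").fetchall()

ytd = [row[3] for row in rows]
print(ytd)  # 100.0, 300.0, 600.0, ... up to 7800.0 for december
```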
<br/>
<br/>
While it's nice to know that this is possible, there is a much, much nicer way to achieve the same effect - the frame specification.
<br/>
<br/>
<h4>Frame Specification</h4>
<br/>
The frame specification lets you specify a subset of rows <i>within</i> the partition.
The way you can specify the frame feels a bit odd (to me at least), as it is specified in terms of the current row's position in the window.
Hopefully the following example will make this more clear:
<pre>
<b style="color: blue">select</b> SalesYear
, SalesMonth
, SalesAmount
, <b style="color: magenta">sum</b>(SalesAmount) <b style="color: blue">over</b>(
<b style="color: blue">partition by</b> SalesYear
<b style="color: blue">order by</b> SalesMonth
<b style="color: blue">rows between unbounded preceding</b>
<b style="color: blue">and current row</b>
) <b style="color: blue">as</b> SalesYtd
<b style="color: blue">from</b> SalesYearMonth
</pre>
We already discussed the <code><b style="color: blue">PARTITION BY</b></code>-clause; all the clauses after it are new.
<br/>
<br/>
The <code><b style="color: blue">ORDER BY</b></code>-clause sorts the rows within the window, in this case by <code>SalesMonth</code>.
We need the rows to be ordered because of how the frame specification works: it lets you pick rows by position, relative to the current row.
The position of the rows is undetermined unless we sort them explicitly, so if we want to pick rows reliably, we need the <code><b style="color: blue">ORDER BY</b></code>-clause to guarantee the order.
<br/>
<br/>
The frame specification follows the <code><b style="color: blue">ORDER BY</b></code>-clause.
There are a number of possible options here, but I will only discuss the one used in the example.
In this case, it almost explains itself: we want the current row, and all rows that precede it.
Since we ordered by <code>SalesMonth</code>, this means all the rows that chronologically precede it.
As this selection applies within the current partition, we will only encounter months that fall within the current year.
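The frame specification is easy to try out interactively. Below is a minimal sketch with Python's <code>sqlite3</code> module (SQLite 3.25+ assumed) on hypothetical two-year data, showing the running total restarting per year:

```python
import sqlite3

# Hypothetical data spanning two years.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table SalesYearMonth(SalesYear int, SalesMonth int, SalesAmount real)"
)
conn.executemany(
    "insert into SalesYearMonth values (?, ?, ?)",
    [(2011, 1, 10.0), (2011, 2, 20.0), (2011, 3, 30.0),
     (2012, 1, 5.0), (2012, 2, 15.0)],
)

# The frame clause turns SUM() into a running total within each year.
rows = conn.execute("""
    select SalesYear, SalesMonth, SalesAmount
         , sum(SalesAmount) over (
               partition by SalesYear
               order by SalesMonth
               rows between unbounded preceding and current row
           ) as SalesYtd
    from SalesYearMonth
    order by SalesYear, SalesMonth
""").fetchall()

print([row[3] for row in rows])  # [10.0, 30.0, 60.0, 5.0, 20.0]
```

Note how the running total restarts at the partition boundary: the 2012 rows start again from 5.0.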
<br/>
<br/>
So here we have it: a YTD calculation implemented using a window function.
It's about the same amount of code as the subquery-solution, but more declarative, as we do not need to spell out the details of a condition.
On the other hand, it is also less flexible than a subquery; in general, though, one should expect the window function to perform better than the equivalent subquery.
<br/>
<br/>
<h3><a name="generalized">Generalizing the solutions</a></h3>
<br/>
So far all our examples were based on the <code>SalesYearMonth</code> table, which provides <code>SalesYear</code> and <code>SalesMonth</code> as separate columns.
One might wonder what it would take to apply these various methods to a realistic use case.
<br/>
<br/>
For example, it is likely that in a real dataset, the time would be available as a single column of a <code>DATE</code> or <code>DATETIME</code> data type.
A single date column potentially affects the YTD calculation in a number of ways: <ul>
<li>
Year: The YTD is calculated over a period of a year, and almost all the solutions we described use the <code>SalesYear</code> column explicitly to implement that logic.
</li>
<li>
Preceding rows: To calculate the YTD for a specific row, there has to be a clear definition of which rows are in the same year but precede it.
In our examples we could use the <code>SalesMonth</code> column for that, but this might be a bit different in a realistic case.
</li>
<li>
Lowest granularity: The lowest granularity of the <code>SalesYearMonth</code> table is the month level, and we calculated the YTD values at that level.
(If we wanted to be precise, we would have to call that year-to-month.)
</li>
</ul>
Apart from the time aspect, the definition of the key affects all solutions that generate "extra" rows and require a <code><b style="color:blue">GROUP BY</b></code> to re-aggregate to the original granularity.
<br/>
<br/>
<h4>The Year</h4>
<br/>
The <code><b style="color:blue">ON</b></code>-condition of the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code><b style="color:blue">JOIN</b></code>-solution</a>
and
the <code><b style="color:blue">WHERE</b></code>-condition of the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery-solution</a> both rely on a condition that finds other rows in the same year,
and the <a href="#window">window function-solution</a> uses the year in its <code><b style="color:blue">PARTITION BY</b></code>-clause.
<br/>
<br/>
It is usually quite simple to extract the year from a date, date/time or timestamp.
In Synapse Analytics or MS SQL one can use the
<a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/datepart-transact-sql?view=sql-server-ver15" target="mssql"><code><b style="color:magenta">DATEPART</b></code></a>
or
<a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/year-transact-sql?view=sql-server-ver15" target="mssql"><code><b style="color:magenta">YEAR</b></code></a> function to do this.
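For a quick illustration outside MS SQL: the sketch below uses Python's <code>sqlite3</code> module, where <code>strftime('%Y', ...)</code> serves as a stand-in for T-SQL's <code><b style="color:magenta">YEAR</b></code>/<code><b style="color:magenta">DATEPART</b></code> (SQLite has neither); the date value is made up:

```python
import sqlite3

# DATEPART()/YEAR() are T-SQL; SQLite extracts the year from an
# ISO date string with strftime('%Y', ...) instead.
conn = sqlite3.connect(":memory:")
sales_year = conn.execute(
    "select cast(strftime('%Y', '2011-05-31') as integer)"
).fetchone()[0]
print(sales_year)  # 2011
```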
<br/>
<br/>
The <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html"><code><b style="color:blue">UNION</b></code>-solution</a> has no direct dependency on the year.
<br/>
<br/>
<h4>The preceding rows</h4>
<br/>
The need to find the preceding rows applies to all solutions that use the year to find the rows to apply the YTD calculation on.
In our samples, this could all be solved using the <code>SalesMonth</code> column.
<br/>
<br/>
Again, it is the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code><b style="color:blue">JOIN</b></code>-solution</a> and the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery-solution</a> that used it in their conditions,
whereas the <a href="#window">window function-solution</a> uses it in its <code><b style="color:blue">ORDER BY</b></code>-clause.
<br/>
<br/>
In this case, the fix is more straightforward than with the year: instead of the month column, these solutions can simply use the date or date/time column directly.
No conversion or datepart extraction is required.
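Putting the two generalizations together, here is a sketch of a YTD over a single date column, using Python's <code>sqlite3</code> module (SQLite 3.25+ assumed); the <code>Sales</code> table, its column names, and the data are hypothetical:

```python
import sqlite3

# Hypothetical sales table keyed by a single ISO date column instead of
# separate year and month columns.
conn = sqlite3.connect(":memory:")
conn.execute("create table Sales(SalesDate text, SalesAmount real)")
conn.executemany(
    "insert into Sales values (?, ?)",
    [("2011-01-15", 10.0), ("2011-02-15", 20.0),
     ("2011-03-15", 30.0), ("2012-01-15", 5.0)],
)

# Partition on the extracted year; order on the date column directly.
# (strftime('%Y', ...) is SQLite's stand-in for T-SQL's YEAR().)
rows = conn.execute("""
    select SalesDate, SalesAmount
         , sum(SalesAmount) over (
               partition by strftime('%Y', SalesDate)
               order by SalesDate
               rows between unbounded preceding and current row
           ) as SalesYtd
    from Sales
    order by SalesDate
""").fetchall()

print([row[2] for row in rows])  # running totals restart at the year boundary
```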
<br/>
<br/>
<h4>Lowest granularity</h4>
<br/>
The granularity is of special concern to the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html"><code><b style="color:blue">UNION</b></code>-solution</a>.
The solution relies on an exhaustive and static enumeration of all possible future dates within the year.
At the month level, this already required a lot of manual code.
<br/>
<br/>
Below the month, the next level would be day.
While it would in theory be possible to extend the solution to that level, it already borders on the impractical at the month level.
<br/>
<br/>
<h4>The Key</h4>
<br/>
The key definition affects both the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code><b style="color:blue">JOIN</b></code>-solution</a> and the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html"><code><b style="color:blue">UNION</b></code>-solution</a>,
as both require a <code><b style="color:blue">GROUP BY</b></code> over the key. rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-85323826489911712072021-02-22T02:36:00.003+01:002021-02-22T02:50:57.117+01:00Year-to-Date on Synapse Analytics 4: Using UNION and GROUP BYFor one of our <a href="https://www.just-bi.nl/" target="_justbi">Just-BI</a> customers we implemented a Year-to-Date calculation in an Azure Synapse backend.
We encountered a couple of approaches, and in this series I'd like to share some sample code and discuss some of the merits and drawbacks of each approach.
<br/>
<br/>
<b>TL;DR</b>: <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-5.html">A Year-to-Date solution based on a <code><b style="color:magenta">SUM</b>()</code> window function</a> is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">self-<code style="color: blue">JOIN</code></a> (combined with a <code style="color: blue">GROUP BY</code>), a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery</a>, and a <a href="#union"><code style="color: blue">UNION</code></a> (also combined with a <code style="color: blue">GROUP BY</code>).
<br/>
<br/>
Note: this is the 4th post in a series.
<ul>
<li>For sample data and setup, please <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-1.html">see the 1st post</a> in this series. </li>
<li>For a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">solution based on a self-<code><b style="color:blue">JOIN</b></code> and <code><b style="color:blue">GROUP BY</b></code></a>, please find <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">the 2nd post</a> in this series.</li>
<li>For a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">solution based on a subquery</a>, please find <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">the 3rd post</a> in this series.</li>
</ul>
(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)
<br/>
<br/>
<h3><a name="union">Using a <code style="color:blue">UNION</code></a></h3>
<br/>
We mentioned how the <code style="color:blue">JOIN</code>-solution relates each row of the main set to a subset of "extra" rows, over which the YTD value is then calculated by aggregating over the key of the main set using a <code style="color:blue">GROUP BY</code>.
<br/>
<br/>
It may not be immediately obvious, but we can also use the SQL <code style="color:blue">UNION</code> (or rather, <code style="color:blue">UNION ALL</code>) operator to generate such a related subset.
Just like with the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code style="color:blue">JOIN</code>-solution</a>, this can then be aggregated using <code style="color:blue">GROUP BY</code>.
An example will help to explain this:<pre>
<b style="color: blue">select</b> SalesYear
, SalesMonth
, <b style="color: magenta">sum</b>(SumOfSalesAmount) <b style="color: blue">as</b> SumOfSalesAmount
, <b style="color: magenta">sum</b>(YtdOfSumOfSalesAmount) <b style="color: blue">as</b> YtdOfSumOfSalesAmount
<b style="color: blue">from</b> (
<b style="color: blue">select</b> SalesYear
, SalesMonth
, SumOfSalesAmount
, SumOfSalesAmount <b style="color: blue">as</b> YtdOfSumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">union all</b>
<span style="color: teal">-- JANUARY</span>
<b style="color: blue">select</b> SalesYear
, SalesMonth + <span style="color: red">1</span> <span style="color: teal">--</span> february
, <b style="color: red">null</b>
, SumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">where</b> SalesMonth = <span style="color: red">1</span>
<b style="color: blue">union all</b>
<b style="color: blue">select</b> SalesYear
, SalesMonth + <span style="color: red">2</span> <span style="color: teal">--</span> march
, <b style="color: red">null</b>
, SumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">where</b> SalesMonth = <span style="color: red">1</span>
<b style="color: blue">union all</b>
... and so on, all for JANUARY ...
<b style="color: blue">union all</b>
<b style="color: blue">select</b> SalesYear
, SalesMonth + <span style="color: red">11</span> <span style="color: teal">--</span> december
, <b style="color: red">null</b>
, SumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">where</b> SalesMonth = <span style="color: red">1</span>
<b style="color: blue">union all</b>
<span style="color: teal">-- FEBRUARY</span>
<b style="color: blue">select</b> SalesYear
, SalesMonth + <span style="color: red">1</span> <span style="color: teal">--</span> march
, <b style="color: red">null</b>
, SumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">where</b> SalesMonth = <span style="color: red">2</span>
<b style="color: blue">union all</b>
... and so on, for the rest of FEBRUARY,
and then again for MARCH, APRIl, MAY, JUNE, JULY, AUGUST, SEPTEMBER, OCTOBER...
<span style="color: teal">-- NOVEMBER</span>
<b style="color: blue">select</b> SalesYear
, SalesMonth + <span style="color: red">1</span> <span style="color: teal">--</span> december
, <b style="color: red">null</b>
, SumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">where</b> SalesMonth = <span style="color: red">11</span>
) Sales
<b style="color: blue">group by</b> SalesYear
, SalesMonth
</pre>
<h4>Duplicating metric-data so it contributes to the following months</h4>
<br/>
In the first part of the <code>UNION</code> we simply provide the entire resultset from <code>SalesYearMonth</code>, reporting the <code>SumOfSalesAmount</code>-metric as-is, but also copying it into <code>YtdOfSumOfSalesAmount</code>.
The other parts of the <code>UNION</code> selectively duplicate the data for the <code>SumOfSalesAmount</code>-metric into <code>YtdOfSumOfSalesAmount</code>, so that it contributes to the <code>YtdOfSumOfSalesAmount</code> of all following months.
<br/>
<br/>
We start by grabbing january's data by applying the condition that demands that the <code>SalesMonth</code> equals <span style="color: red">1</span>:<pre>
<span style="color: teal">-- JANUARY</span>
<b style="color: blue">select</b> SalesYear
, SalesMonth + <span style="color: red">1</span> <span style="color: teal">--</span> february
, <b style="color: red">null</b>
, SumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">where</b> SalesMonth = <span style="color: red">1</span>
<b style="color: blue">union all</b>
... repeat to duplicate january's data into february, march, and so on all the way up to december...
</pre>
This is done a total of 11 times, each time adding 1, 2, 3, and so on, all the way up to 11, to the <code>SalesMonth</code> attribute.
This ensures january's data, as captured by the condition in the <code style="color:blue">WHERE</code> clause, is reported also in february (<code>SalesMonth + <span style="color: red">1</span></code>), march (<code>SalesMonth + <span style="color: red">2</span></code>), and so on, all the way up to december (<code>SalesMonth + <span style="color: red">11</span></code>).
<br/>
<br/>
After the string of <code>UNION</code>s for january appear more parts to duplicate the data also for february and all following months:<pre>
<span style="color: teal">-- FEBRUARY</span>
<b style="color: blue">select</b> SalesYear
, SalesMonth + <span style="color: red">1</span> <span style="color: teal">--</span> march
, <b style="color: red">null</b>
, SumOfSalesAmount
<b style="color: blue">from</b> SalesYearMonth
<b style="color: blue">where</b> SalesMonth = <span style="color: red">2</span>
<b style="color: blue">union all</b>
... repeat to duplicate february's data into march, april, and so on all the way up to december...
</pre>
Again, february's data is selected by applying the condition <pre><b style="color: blue">where</b> SalesMonth = <span style="color: red">2</span></pre>, and this now happens 10 times, again adding a number to <code>SalesMonth</code> so the data is duplicated into march, april, may, and so on, all the way up to december: in other words, all months following february.
<br/>
<br/>
What we thus did for january and february is repeated for march, april, and so on, for all months up to november.
November is the last month we need to do this for: november's data still needs to be copied into december, but december itself is the last month, so its data does not need to be duplicated anywhere.
<br/>
<br/>
While it may seem wasteful to duplicate all this data, in that respect it really is not that different from the other solutions we've seen so far.
It's just that now it's really in your face, because there is a pretty direct correspondence between the SQL code and the data sets that are being handled.
The <code style="color:blue">JOIN</code> and subquery solutions handle similar amounts of data; it's just achieved with far less code, and in a far more implicit manner.
<br/>
<br/>
<h4>Original metric is retained</h4>
<br/>
Note that the original metric is also computed correctly, because the extra parts of the union duplicate the data only into the YTD column.
The union parts that duplicate the data to the subsequent months select a <code style="color: red">NULL</code> for the original metric.
So the data for the original metric is never duplicated, and it thus retains its normal value.
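The mechanics are easiest to see at a smaller scale. The sketch below uses Python's <code>sqlite3</code> module on a hypothetical single year with only three months, so only three duplicate-forward branches are needed (versus 66 for a full year):

```python
import sqlite3

# Miniature of the UNION ALL approach: one hypothetical year, three months.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table SalesYearMonth(SalesYear int, SalesMonth int, SumOfSalesAmount real)"
)
conn.executemany(
    "insert into SalesYearMonth values (?, ?, ?)",
    [(2011, 1, 10.0), (2011, 2, 20.0), (2011, 3, 30.0)],
)

rows = conn.execute("""
    select SalesYear, SalesMonth
         , sum(SumOfSalesAmount)      as SumOfSalesAmount
         , sum(YtdOfSumOfSalesAmount) as YtdOfSumOfSalesAmount
    from (
        select SalesYear, SalesMonth, SumOfSalesAmount
             , SumOfSalesAmount as YtdOfSumOfSalesAmount
        from SalesYearMonth
        union all   -- january's data, duplicated into february and march
        select SalesYear, SalesMonth + 1, null, SumOfSalesAmount
        from SalesYearMonth where SalesMonth = 1
        union all
        select SalesYear, SalesMonth + 2, null, SumOfSalesAmount
        from SalesYearMonth where SalesMonth = 1
        union all   -- february's data, duplicated into march
        select SalesYear, SalesMonth + 1, null, SumOfSalesAmount
        from SalesYearMonth where SalesMonth = 2
    ) Sales
    group by SalesYear, SalesMonth
    order by SalesMonth
""").fetchall()

for row in rows:
    print(row)  # original metric intact, YTD accumulating month by month
```

The <code>NULL</code>s in the duplicated rows are ignored by <code>SUM()</code>, so the original metric survives the <code>GROUP BY</code> unchanged while the YTD column accumulates.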
<br/>
<br/>
<h4>Drawbacks to the <code style="color:blue">UNION</code>-solution</h4>
<br/>
The main drawback of the <code style="color:blue">UNION</code>-solution is its maintainability.
A lot of code is required, far more than for any of the methods we have seen so far.
Although the individual patterns are simple (a condition to select one month's data, plus an offset to project that data onto a future month), it is surprisingly easy to make a little mistake somewhere.
<br/>
<br/>
We just argued that this solution is not so much different from the <code style="color:blue">JOIN</code> solution, but that remark only pertains to how the calculation is performed.
The <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code style="color:blue">JOIN</code>-solution</a> generates the data it operates upon dynamically and declaratively; the <code style="color:blue">UNION</code> solution does this statically and explicitly.
This is also why it is impossible to generalize this approach for any arbitrary <code style="color:blue">JOIN</code>: YTD is a special case, because we know exactly how often we should duplicate the data, as this is dictated by the cyclical structure of our calendar.
<br/>
<br/>
<h3>Next installment: Solution 4 - window functions</h3>
<br/>
In the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-5.html">next installment</a> we will present and discuss a solution based on window functions.rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-73290240485753267422021-02-22T02:25:00.007+01:002021-02-22T02:53:01.676+01:00Year-to-Date on Synapse Analytics 3: Using a SubqueryFor one of our <a href="https://www.just-bi.nl/" target="_justbi">Just-BI</a> customers we implemented a Year-to-Date calculation in an Azure Synapse backend.
We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and drawbacks of each approach.
<br/>
<br/>
<b>TL;DR</b>: <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-5.html">A Year-to-Date solution based on a <code><b style="color:magenta">SUM</b>()</code> window function</a> is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">self-<code style="color: blue">JOIN</code></a> (combined with a <code style="color: blue">GROUP BY</code>), a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery</a>, and a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html"><code style="color: blue">UNION</code></a> (also combined with a <code style="color: blue">GROUP BY</code>).
<br/>
<br/>
Note: this is the 3rd post in a series.
<ul>
<li>For sample data and setup, please <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-1.html">see the 1st post</a> in this series. </li>
<li>For a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">solution based on a self-<code><b style="color:blue">JOIN</b></code> and <code><b style="color:blue">GROUP BY</b></code></a>, please find <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">the 2nd post</a> in this series.</li>
</ul>
(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)
<br/>
<br/>
<h3><a name="subquery">Using a subquery</a></h3>
<br/>
We can also think of YTD calculation as a separate query that we perform for each row of the <code>SalesYearMonth</code> table.
While this does imply a row-by-row approach, we can still translate this easily to pure SQL by creating an expression in the <code style="color:blue">SELECT</code>-list, which uses a subquery to calculate the YTD value for the current row:<pre>
<b style="color: blue">select</b> SalesOriginal.SalesYear
, SalesOriginal.SalesMonth
, SalesOriginal.SalesAmount
, (
<b style="color: blue">select</b> <b style="color:magenta">sum</b>(SalesYtd.SalesAmount)
<b style="color: blue">from</b> SalesYearMonth <b style="color: blue">as</b> SalesYtd
<b style="color: blue">where</b> SalesYtd.SalesYear = SalesOriginal.SalesYear
<b style="color: blue">and</b> SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
) <b style="color: blue">as</b> SalesYtd
<b style="color: blue">from</b> SalesYearMonth <b style="color: blue">as</b> SalesOriginal
</pre>
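As a quick sanity check of this pattern, here's a sketch that runs the correlated subquery on a tiny, made-up dataset. Python's sqlite3 module stands in for Synapse here purely for portability (an assumption on my part); the SQL pattern itself is plain ANSI SQL:

```python
import sqlite3

# In-memory stand-in for the Synapse table (tiny, invented amounts).
con = sqlite3.connect(":memory:")
con.execute("""
    create table SalesYearMonth (
        SalesYear    integer
      , SalesMonth   integer
      , SalesAmount  decimal(20,2)
      , primary key (SalesYear, SalesMonth)
    )
""")
con.executemany(
    "insert into SalesYearMonth values (?, ?, ?)",
    [(2011, 5, 100.25), (2011, 6, 50.25), (2011, 7, 25.25), (2012, 1, 10.25)]
)

# The correlated subquery computes the YTD value per row of SalesOriginal.
rows = con.execute("""
    select SalesOriginal.SalesYear
    ,      SalesOriginal.SalesMonth
    ,      SalesOriginal.SalesAmount
    ,      ( select sum(SalesYtd.SalesAmount)
             from   SalesYearMonth as SalesYtd
             where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
             and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
           ) as SalesYtd
    from SalesYearMonth as SalesOriginal
    order by SalesOriginal.SalesYear, SalesOriginal.SalesMonth
""").fetchall()
```

The YTD column accumulates within 2011 and resets at the first month of 2012, matching the iterative definition from the first post.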
There's a similarity with the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code style="color:blue">JOIN</code>-solution</a>, in that we use the <code>SalesYearMonth</code> table twice, but in different roles.
In the <code style="color:blue">JOIN</code>-solution, they appeared on either side of the <code style="color:blue">JOIN</code> keyword, and we used the aliases <code>SalesOriginal</code> and <code>SalesYtd</code> to be able to keep them apart.
In the subquery approach, the distinction between these two different instances of the <code>SalesYearMonth</code> table is more explicit: the main instance of the <code>SalesYearMonth</code> table occurs in the <code style="color:blue">FROM</code>-clause, and the one for the YTD calculation occurs in the <code style="color:blue">SELECT</code>-list.
<br/>
<br/>
Also similar to the <code style="color:blue">JOIN</code> solution is the condition to tie the set for the YTD calculation to the main query using the <code>SalesYear</code> and <code>SalesMonth</code> columns.
Such a subquery is referred to as a <i>correlated</i> subquery.
<br/>
<br/>
As for any differences with the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code style="color:blue">JOIN</code> solution</a>:
In the condition, the only difference is the left/right placement of <code>SalesOriginal</code> and <code>SalesYtd</code>, which merely reflects their order of appearance in the query and is functionally completely equivalent.
The most striking difference between the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code style="color:blue">JOIN</code> solution</a> and the subquery is the absence of the <code style="color:blue">GROUP BY</code>-list in the latter.
<br/>
<br/>
<h4>Drawbacks of the subquery</h4>
<br/>
Given how much we had to complain about the <code style="color:blue">GROUP BY</code>-list in the <code style="color:blue">JOIN</code> solution, it might seem that the subquery solution is somehow "better".
However, a solution with a correlated subquery in general tends to be slower than a <code style="color:blue">JOIN</code> solution.
Whether this is actually the case depends on many variables, and you'd really have to check it against your SQL engine and datasets.
<br/>
<br/>
Another drawback of the subquery solution becomes clear when we want to calculate the YTD for multiple measures.
Our example only has one <code>SalesAmount</code> measure, but in this same context we can easily imagine that we also want to know about price, discount amounts, tax amounts, shipping costs, and so on.
<br/>
<br/>
In the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html"><code style="color:blue">JOIN</code> solution</a>, we would simply add any extra measures to the select list, using <code style="color: magenta">MAX()</code> (or <code style="color: magenta">MIN()</code> or <code style="color: magenta">AVG()</code>) to obtain the original value, and <code style="color: magenta">SUM()</code> to calculate its respective YTD value:
As long as it's over the same set, the <code style="color:blue">JOIN</code>, its condition, and even the <code style="color:blue">GROUP BY</code>-list would remain the same, no matter for how many different measures we would add a YTD calculation.
<br/>
<br/>
This is very different in the subquery case.
Each measure for which you need a YTD calculation would get its own subquery.
Even though the condition would be the same for each such YTD calculation, you would still need to repeat the subquery code - one for each YTD measure.
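To make the duplication concrete, here's a sketch (again using sqlite3 as a stand-in, with an invented <code>TaxAmount</code> column as the second measure) where each YTD measure drags in its own copy of the subquery:

```python
import sqlite3

# Hypothetical second measure, TaxAmount, next to SalesAmount.
con = sqlite3.connect(":memory:")
con.execute("""
    create table SalesYearMonth (
        SalesYear   integer
      , SalesMonth  integer
      , SalesAmount real
      , TaxAmount   real
      , primary key (SalesYear, SalesMonth)
    )
""")
con.executemany(
    "insert into SalesYearMonth values (?, ?, ?, ?)",
    [(2011, 5, 100.0, 10.0), (2011, 6, 50.0, 5.0)]
)
rows = con.execute("""
    select m.SalesYear
    ,      m.SalesMonth
    ,      ( select sum(y.SalesAmount)  -- subquery #1
             from SalesYearMonth as y
             where y.SalesYear = m.SalesYear and y.SalesMonth <= m.SalesMonth
           ) as SalesYtd
    ,      ( select sum(y.TaxAmount)    -- subquery #2: same condition, repeated verbatim
             from SalesYearMonth as y
             where y.SalesYear = m.SalesYear and y.SalesMonth <= m.SalesMonth
           ) as TaxYtd
    from SalesYearMonth as m
    order by m.SalesYear, m.SalesMonth
""").fetchall()
```

Two measures, two near-identical subqueries; a third measure would mean a third copy.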
<br/>
<br/>
<h3>Next installment: Solution 3 - using a <b style="color:blue">UNION</b></h3>
<br/>
In the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html">next installment</a> we will present and discuss a solution based on a <b style="color:blue">UNION</b> and a <b style="color:blue">GROUP BY</b>.rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-12200249065919944212021-02-22T02:13:00.005+01:002021-02-22T02:52:16.125+01:00Year-to-Date on Synapse Analytics 2: Using a self-JOIN and GROUP BYFor one of our <a href="https://www.just-bi.nl/" target="_justbi">Just-BI</a> customers we implemented a Year-to-Date calculation in an Azure Synapse backend.
We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and drawbacks of each approach.
<br/>
<br/>
<b>TL;DR</b>: <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-5.html">A Year-to-Date solution based on a <code><b style="color:magenta">SUM</b>()</code> window function</a> is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html#selfjoin">self-<code style="color: blue">JOIN</code></a> (combined with a <code style="color: blue">GROUP BY</code>), a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery</a>, and a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html"><code style="color: blue">UNION</code></a> (also combined with a <code style="color: blue">GROUP BY</code>).
<br/>
<br/>
Note: this is the 2nd post in a series. For sample data and setup, please <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-1.html">see the first post</a> in this series.
(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)
<br/>
<br/>
<h3><a name="selfjoin">Using a self-<code style="color:blue">JOIN</code></a></h3>
<br/>
The recipe for the set-oriented approach can be directly translated to SQL:
<pre>
<b style="color: blue">select</b> SalesOriginal.SalesYear
, SalesOriginal.SalesMonth
, <b style="color: magenta">max</b>(SalesOriginal.SalesAmount) <b style="color: blue">as</b> SalesAmount
, <b style="color: magenta">sum</b>(SalesYtd.SalesAmount) <b style="color: blue">as</b> SalesYtd
<b style="color: blue">from</b> SalesYearMonth <b style="color: blue">as</b> SalesOriginal
<b style="color: blue">inner join</b> SalesYearMonth <b style="color: blue">as</b> SalesYtd
<b style="color: blue">on</b> SalesOriginal.SalesYear = SalesYtd.SalesYear
<b style="color: blue">and</b> SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
<b style="color: blue">group by</b> SalesOriginal.SalesYear
, SalesOriginal.SalesMonth
</pre>
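As a quick check, the query above can be run against a tiny, made-up dataset; here's a sketch using Python's sqlite3 module as a stand-in for Synapse (an assumption for portability; the SQL is the same pattern):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table SalesYearMonth (
        SalesYear    integer
      , SalesMonth   integer
      , SalesAmount  decimal(20,2)
      , primary key (SalesYear, SalesMonth)
    )
""")
con.executemany(
    "insert into SalesYearMonth values (?, ?, ?)",
    [(2011, 5, 100.25), (2011, 6, 50.25), (2012, 1, 10.25)]
)

# Self-join: each row pairs with itself and all earlier months of the same
# year; GROUP BY collapses those pairs back to one row per (year, month).
rows = con.execute("""
    select SalesOriginal.SalesYear
    ,      SalesOriginal.SalesMonth
    ,      max(SalesOriginal.SalesAmount) as SalesAmount
    ,      sum(SalesYtd.SalesAmount)      as SalesYtd
    from SalesYearMonth as SalesOriginal
    inner join SalesYearMonth as SalesYtd
    on  SalesOriginal.SalesYear   = SalesYtd.SalesYear
    and SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
    group by SalesOriginal.SalesYear, SalesOriginal.SalesMonth
    order by SalesOriginal.SalesYear, SalesOriginal.SalesMonth
""").fetchall()
```

The YTD column accumulates within 2011 and resets at the first month of 2012.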
<h4>The self-<code style="color:blue">JOIN</code></h4>
<br/>
In our discussion of the set-oriented approach we mentioned combining the rows from the table with each other to produce all different combinations.
In the code sample above, the <code style="color:blue">JOIN</code>-clause takes care of that aspect.
<br/>
<br/>
As you can see, the <code>SalesYearMonth</code> table appears twice: on the left hand and on the right hand of the <code style="color:blue">JOIN</code>-keyword, but using different aliases: <code>SalesOriginal</code> and <code>SalesYtd</code>.
It is a so-called <i>self-join</i>.
<br/>
<br/>
Even though both aliases refer to an instance of the same <code>SalesYearMonth</code> base table, each has a very different role.
We can think of the one with the <code>SalesOriginal</code> alias as really the <code>SalesYearMonth</code> table itself.
The <code>SalesYtd</code> alias refers to an instance of the <code>SalesYearMonth</code> table that, for any given row from <code>SalesOriginal</code>, represents a subset of rows that chronologically precedes the row from <code>SalesOriginal</code>.
<br/>
<br/>
The <code style="color:blue">ON</code>-clause that follows controls which combinations should be retained: for each particular row of <code>SalesOriginal</code> we only want to consider rows from <code>SalesYtd</code> from the same year, which is why the first predicate in the <code style="color:blue">ON</code>-clause is:<pre>SalesOriginal.SalesYear = SalesYtd.SalesYear</pre>
Within that year, we only want to consider rows that precede it chronologically, and that explains the second predicate: <pre>SalesOriginal.SalesMonth >= SalesYtd.SalesMonth</pre>
<h4><code style="color:blue">GROUP BY</code> and <code style="color:magenta">SUM()</code></h4>
<br/>
It is important to realize that the <code style="color:blue">JOIN</code> is only half of the solution.
<br/>
<br/>
While the <code style="color:blue">JOIN</code> takes care of gathering and combining all related rows necessary to compute the YTD value,
the actual calculation is done by the <code style="color:magenta">SUM()</code> function in the <code style="color:blue">SELECT</code>-list, and the <code style="color:blue">GROUP BY</code> defines which rows should be taken together to be summed.
<br/>
<br/>
In summary: <ul>
<li>the <code style="color:blue">JOIN</code> generates new rows by combining rows from its left-hand table with the rows from its right-hand table, bound by the condition in the <code style="color:blue">ON</code>-clause.</li>
<li>The <code style="color:blue">GROUP BY</code> partitions the rows into subsets having the same combinations of values for <code>SalesYear</code> and <code>SalesMonth</code>.</li>
<li>The <code style="color:magenta">SUM()</code> aggregates the rows in each <code>SalesYear, SalesMonth</code> partition, turning its associated set of rows into one single row, while adding the values of the <code>SalesAmount</code> column together.</li>
</ul>
Note that the columns in the <code style="color:blue">GROUP BY</code> list are qualified by the <code>SalesOriginal</code> alias - and not <code>SalesYtd</code>.
Also note that the <code style="color:blue">GROUP BY</code> columns form the key of the original <code>SalesYearMonth</code> table - together they uniquely identify a single row from the <code>SalesYearMonth</code> table.
This is not a coincidence: it expresses precisely that <code>SalesOriginal</code> really has the role of being just itself - the <code>SalesYearMonth</code> table.
<br/>
<br/>
<h4>What about the other columns?</h4>
<br/>
The <code style="color:blue">GROUP BY</code> affects treatment of the non-key columns as well.
In this overly simple example, we had only one other column - <code>SalesOriginal.SalesAmount</code>.
<br/>
<br/>
(Note that this is different from <code>SalesYtd.SalesAmount</code>, which we aggregated using <code style="color:magenta">SUM()</code> to calculate the YTD value.)
<br/>
<br/>
Since <code>SalesOriginal.SalesAmount</code> comes from the <code>SalesOriginal</code> instance of the <code>SalesYearMonth</code> table, we can reason that after the <code style="color:blue">GROUP BY</code> on the key columns <code>SalesYear</code> and <code>SalesMonth</code>, there must be exactly one <code>SalesAmount</code> value for each distinct combination of <code>SalesYear</code> and <code>SalesMonth</code>.
In other words, <code>SalesAmount</code> is functionally dependent on <code>SalesYear</code> and <code>SalesMonth</code>.
<br/>
<br/>
Some SQL engines are smart enough to realize this and will let you refer, in the <code style="color:blue">SELECT</code>-list, to any expression that is functionally dependent upon the expressions in the <code style="color:blue">GROUP BY</code>-list.
Unfortunately, Synapse and MS SQL Server are not among these, and if we try, we will get an error:
<br/>
<pre>
Msg 8120, Level 16, State 1, Line 11
Column 'Sales.SalesAmount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
</pre>
The error message suggests we can do two things to solve it:<ul>
<li>either we aggregate by wrapping the <code>SalesOriginal.SalesAmount</code>-expression into some <a target="mssql" href="https://docs.microsoft.com/en-us/sql/t-sql/functions/aggregate-functions-transact-sql?view=sql-server-ver15">aggregate function</a></li>
<li>or we expand the <code style="color:blue">GROUP BY</code>-list and add the <code>SalesOriginal.SalesAmount</code>-expression there.</li>
</ul>
To me, neither feels quite right.
<br/>
<br/>
<code>SalesAmount</code> is clearly intended as a measure, and it feels weird to treat it the same as the attributes <code>SalesYear</code> and <code>SalesMonth</code>.
So adding it to the <code style="color:blue">GROUP BY</code>-list feels like the wrong choice. Besides, it also makes the code less maintainable, as each such column will now appear twice: once in the <code style="color:blue">SELECT</code>-list, where we need it no matter what, and once again in the <code style="color:blue">GROUP BY</code>-list, just to satisfy the SQL engine.
<br/>
<br/>
So, if we don't want to put it in the <code style="color:blue">GROUP BY</code>-list, we are going to need to wrap it in an aggregate function.
We just mentioned that <code>SalesAmount</code> is a measure and therefore that does not sound unreasonable.
However, we have to be careful which one we choose.
<br/>
<br/>
Normally, one would use <code>SalesAmount</code> as an additive measure, and <code style="color:magenta">SUM()</code> would be the obvious aggregate for it.
But here, in this context, <code style="color:magenta">SUM()</code> is definitely the wrong choice!
<br/>
<br/>
All we want to do is to "get back" whatever value we had for <code>SalesAmount</code> - in other words, a value unaffected by the whole routine of join-and-then-aggregate, which we did only to calculate the YTD value.
The "extra" rows generated by the <code>JOIN</code> are only needed to do the YTD calculation and should not affect any of the other measures.
Using <code style="color:magenta">SUM()</code> would simply add the <code>SalesAmount</code> just as many times as there are preceding rows in the current year, which simply does not have any meaningful application.
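A small sketch makes the inflation visible (sqlite3 as a stand-in; every month carries the same invented amount of 100 so the multiplication is easy to spot):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table SalesYearMonth (
        SalesYear integer, SalesMonth integer, SalesAmount real
      , primary key (SalesYear, SalesMonth)
    )
""")
con.executemany("insert into SalesYearMonth values (?, ?, ?)",
                [(2011, 1, 100.0), (2011, 2, 100.0), (2011, 3, 100.0)])

# After the self-join, month n pairs with n rows, so SUM() returns n * 100,
# while MAX() still reports the original 100.
rows = con.execute("""
    select SalesOriginal.SalesMonth
    ,      sum(SalesOriginal.SalesAmount) as WrongSalesAmount
    ,      max(SalesOriginal.SalesAmount) as SalesAmount
    from SalesYearMonth as SalesOriginal
    inner join SalesYearMonth as SalesYtd
    on  SalesOriginal.SalesYear   = SalesYtd.SalesYear
    and SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
    group by SalesOriginal.SalesYear, SalesOriginal.SalesMonth
    order by SalesOriginal.SalesMonth
""").fetchall()
```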
<br/>
<br/>
What we want instead is to report back the original <code>SalesAmount</code> for any given <code>SalesYear, SalesMonth</code> combination.
We just reasoned that there will be just one distinct <code>SalesOriginal.SalesAmount</code> value for any combination of values in <code>SalesOriginal.SalesYear, SalesOriginal.SalesMonth</code>,
and it would be great if we had an aggregate function that would simply pick the <code>SalesOriginal.SalesAmount</code> value from any of those rows.
To the best of my knowledge, no such aggregate function exists in MS SQL Server or Synapse Analytics.
<br/>
<br/>
We can use <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/max-transact-sql?view=sql-server-ver15" target="mssql"><code style="color:magenta">MAX()</code></a> or <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/min-transact-sql?view=sql-server-ver15" target="mssql"><code style="color:magenta">MIN()</code></a>, or even <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/avg-transact-sql?view=sql-server-ver15" target="mssql"><code style="color:magenta">AVG()</code></a>.
While this would all work and deliver the intended result, it still feels wrong as it seems wasteful to ask the SQL engine to do some calculation on a set of values while it could pick just any value.
<h3>Next installment: Solution 2 - using a subquery</h3>
<br/>
In the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">next installment</a> we will present and discuss a solution based on a subquery.rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-69425156949443197852021-02-22T02:03:00.006+01:002021-02-22T02:51:40.353+01:00Year-to-Date on Synapse Analytics 1: BackgroundFor one of our <a href="https://www.just-bi.nl/" target="_justbi">Just-BI</a> customers we implemented a Year-to-Date calculation in an Azure Synapse backend.
We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and drawbacks of each approach.
<br/>
<br/>
(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)
<br/>
<br/>
<b>TL;DR</b>: <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-5.html">A Year-to-Date solution based on a <code><b style="color:magenta">SUM</b>()</code> window function</a> is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">self-<code style="color: blue">JOIN</code></a> (combined with a <code style="color: blue">GROUP BY</code>), a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-3.html">subquery</a>, and a <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-4.html"><code style="color: blue">UNION</code></a> (also combined with a <code style="color: blue">GROUP BY</code>).
<br/>
<br/>
<h3>In this Installment</h3>
<ul>
<li><a href="#ytd">A definition of what Year-to-Date (YTD) is</a></li>
<li><a href="#context">Some background</a> on why we are interested in calculating the YTD using Synapse Analytics / MS SQL Server</li>
<li>How to calculate YTD using an <a href="#iteration">iterative</a> and a <a href="#set">set-oriented</a> approach</li>
<li><a href="#sample-data">A sample table and dataset</a> which will be used in the next installments to demonstrate our sample code.</li>
</ul>
<h3><a name="context">Context</a></h3>
<br/>
Our customer is using an Azure Data Lake to store data from all kinds of source systems, including its SAP ERP system. Azure Synapse Analytics sits on top of the Data Lake and is used as analytics workhorse, but also to integrate various data sets present in the data lake. Front-end BI tools, such as Microsoft PowerBI, can then connect to Synapse and import or query the data from there.
<br/>
<br/>
In many cases, the datamarts presented by Synapse are pretty straightforward.
Calculations and derived measures needed to build dashboards and data visualizations can typically be developed rather quickly inside the Power BI data model.
Once the front-end development has stabilized, one can consider refactoring the solution, moving parts away from the front-end and pushing them down to the backend for performance or maintainability.
<br/>
<br/>
(There are all kinds of opinions regarding data architecture and on when to put what where. We do not pretend to have the final answer to that, but the current workflow allows us to very quickly deliver solutions that can be used and verified by the users. At present I do not think we could achieve the same productivity if we would demand that everything be designed and built on the backend right from the get go.)
<br/>
<br/>
So, today we were refactoring some of the logic in a PowerBI model, including a Year-to-Date calculation.
The solution we ended up implementing to solve it seems to work rather nicely so I figured to share it.
<br/>
<br/>
<h3><a name="ytd">Year-to-Date value</a></h3>
<br/>
What's a year-to-date (YTD) value? Basically, it's the cumulative value of a metric over time, which resets once a year.
In other words, the year to date value is the per-year total of the value achieved up to the current date.
<br/>
<br/>
This is best explained with an example. Consider the following dataset:
<br/>
<br/>
<code><table style="font-family: courier,monospace;" border="1" cellpadding="5" cellspacing="5">
<thead>
<tr>
<th>Date</th>
<th>Value</th>
<th>YTD Value</th>
</tr>
</thead>
<tbody>
<tr><td>2012-01-10</td><td style="text-align: right">35,401.14</td><td style="text-align: right">35,401.14</td></tr>
<tr><td>2012-01-20</td><td style="text-align: right">15,012.18</td><td style="text-align: right">50,413.32</td></tr>
<tr><td>2012-02-01</td><td style="text-align: right">25,543.71</td><td style="text-align: right">75,957.03</td></tr>
<tr><td>2012-02-10</td><td style="text-align: right">32,115.41</td><td style="text-align: right">108,072.43</td></tr>
<tr><td>2012-02-20</td><td style="text-align: right">17,688.07</td><td style="text-align: right">125,760.50</td></tr>
<tr><td>2012-03-01</td><td style="text-align: right">10,556.53</td><td style="text-align: right">136,317.03</td></tr>
<tr><td>...</td><td style="text-align: right">...</td><td style="text-align: right">...</td></tr>
<tr><td>2013-01-01</td><td style="text-align: right">19,623.90</td><td style="text-align: right">19,623.90</td></tr>
<tr><td>2013-01-10</td><td style="text-align: right">8,351.18</td><td style="text-align: right">27,975.08 </td></tr>
<tr><td>2013-01-20</td><td style="text-align: right">20,287.65</td><td style="text-align: right">48,262.73</td></tr>
<tr><td>2013-02-01</td><td style="text-align: right">33,055.69</td><td style="text-align: right">81,318.42</td></tr>
</tbody>
</table></code>
<br/>
In the table above we have dates from two years - <code>2012</code> and <code>2013</code> - and for each date a <code>Value</code>.
<br/>
<br/>
For the first date encountered within a year, the <code>YTD Value</code> is equal to the <code>Value</code> itself;
for each subsequent <code>Date</code>, the <code>YTD Value</code> is maintained as a running total of the values that appeared at the earlier dates.
<br/>
<br/>
So, <code>2012-01-10</code> is the first date we encounter in <code>2012</code> and therefore its <code>YTD Value</code> is equal to the <code>Value</code> at that date (<code>35,401.14</code>).
The next date is <code>2012-01-20</code> and its <code>Value</code> is <code>15,012.18</code>; therefore its <code>YTD Value</code> is <code>50,413.32</code>, which is <code>15,012.18 + 35,401.14</code>.
The accumulation continues until we reach the last date of <code>2012</code>.
<br/>
<br/>
At <code>2013-01-01</code>, the first date of the next year, the <code>YTD Value</code> resets to be equal to the <code>Value</code>, and then on the subsequent dates of <code>2013</code>, the <code>YTD Value</code> again accumulates the current <code>Value</code> by adding it to the preceding <code>YTD Value</code>.
<br/>
<br/>
<h4>How to use YTD</h4>
<br/>
You can use YTD values to analyze how actual results are developing over time as compared to a plan or prediction.
By comparing the calculated YTD of a measure to a projected value (for example, a sales target), we can see how far off we are at any point in time.
<br/>
<br/>
If you gather these comparisons for a couple of moments in time, you can get a sense of the pace at which the actual situation is deviating from or converging to the target or projected situation.
These insights allow you to intervene in some way: maybe you need to adjust your planning, or change your expectations. Or maybe you need to adjust your efforts in order to more closely approximate your target.
<br/>
<br/>
<h3><a name="iteration">Thinking about YTD as iteration</a></h3>
<br/>
From the way we explained what a year-to-date value is, you might think about it as an <i>actual</i> "rolling sum".
By that I mean, you might think about it as an iterative problem, that you solve by going through the rows, one by one.
In pseudocode, such a solution would do something like:
<pre>
<b>declare</b> year, ytd
<b>loop through</b> rows:
<b>if</b> year <b>equals</b> row.year <b>then</b>
<b>assign</b> ytd + row.value <b>to</b> ytd
<b>else</b>
<b>assign</b> row.value <b>to</b> ytd
<b>assign</b> row.year <b>to</b> year
<b>end if</b>
<b>end loop through</b> rows
</pre>
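For reference, here is the pseudocode turned into a small runnable Python function (the function name and tuple layout are just illustrative choices). It assumes the rows arrive sorted by year and then month, which is exactly the hidden requirement discussed next:

```python
def ytd_running_totals(rows):
    """rows: iterable of (year, month, value) tuples, sorted by year, month.
    Yields (year, month, value, ytd)."""
    current_year = None
    ytd = 0
    for year, month, value in rows:
        if year == current_year:
            ytd += value          # same year: accumulate
        else:
            ytd = value           # new year: reset the running total
            current_year = year
        yield year, month, value, ytd

result = list(ytd_running_totals(
    [(2011, 5, 100.0), (2011, 6, 50.0), (2012, 1, 10.0)]
))
```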
While this approach would apparently give you the desired result, it does not help you to solve the problem in SQL directly.
Pure SQL does not let you iterate rows like that, and it also does not let you work with variables like that.
<br/>
<br/>
Even with the iterative approach there is a hidden problem: the reset of the <code>ytd</code> variable and the update of the <code>year</code> variable occur whenever <code>row.year</code> differs from the current value of the <code>year</code> variable. This will only work properly if the rows of one particular year are next to each other (for example, when the rows are ordered by year prior to iteration).
The same applies within the year: the rows need to be sorted in chronological order, as the YTD value should reflect how much of the value was accumulated at that date within that year.
<br/>
<br/>
It may seem like a waste of time to think about an approach that is of no use to solving the problem.
But this simple iterative approach provides a very simple recipe for quickly checking whether an actual solution behaves as expected.
We'll use it later to verify some results.
<br/>
<br/>
<h3><a name="set">A set-oriented approach</a></h3>
<br/>
To implement it in SQL we have to think in a set-oriented way.
Conceptually, we can think about it as if we combine each row in the set with all of the other rows, forming a cartesian product,
and then retain only those combinations that have identical values for <code>year</code>, but a smaller or equal value for the <code>month</code>.
<br/>
<br/>
This way, each row will combine with itself, and with all the other rows that chronologically precede it within the same year.
The YTD value is then obtained by aggregating the rows over the <code>year</code> and <code>month</code> values, summing the <code>value</code> column to produce the YTD value.
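The recipe above can be expressed quite literally in a few lines of Python (a sketch: <code>itertools.product</code> plays the role of the cartesian product, and a dict keyed on year and month plays the role of the grouping):

```python
from itertools import product

rows = [(2011, 5, 100.0), (2011, 6, 50.0), (2012, 1, 10.0)]  # (year, month, value)

# Cartesian product of the row set with itself, keeping only pairs with the
# same year where the second row's month is smaller than or equal to the first's.
pairs = [(a, b) for a, b in product(rows, rows)
         if a[0] == b[0] and b[1] <= a[1]]

# Aggregate over (year, month) of the "original" row, summing the values of
# the rows it combined with - that sum is the YTD value.
ytd = {}
for a, b in pairs:
    key = (a[0], a[1])
    ytd[key] = ytd.get(key, 0) + b[2]
```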
<br/>
<br/>
<h3><a name="sample-data">Sample Data</a></h3>
<br/>
To play around a bit with the problem in SQL, let's set up a simple table:
<pre>
<b style="color: blue">create table</b> SalesYearMonth (
SalesYear <b style="color: blue">int</b>
, SalesMonth <b style="color: blue">int</b>
, SalesAmount <b style="color: blue">decimal</b>(<code style="color: red">20</code>,<code style="color: red">2</code>)
, <b style="color: blue">primary key</b>(SalesYear, SalesMonth)
);
</pre>
And, here's some data:
<pre>
<b style="color: blue">insert into</b> SalesYearMonth (
SalesYear
, SalesMonth
, SalesAmount
) <b style="color: blue">values</b>
(<code style="color: red">2011</code>,<code style="color: red">5</code>,<code style="color: red">503805.92</code>)
,(<code style="color: red">2011</code>,<code style="color: red">6</code>,<code style="color: red">458910.82</code>)
,(<code style="color: red">2011</code>,<code style="color: red">7</code>,<code style="color: red">2044600.00</code>)
,(<code style="color: red">2011</code>,<code style="color: red">8</code>,<code style="color: red">2495816.73</code>)
,(<code style="color: red">2011</code>,<code style="color: red">9</code>,<code style="color: red">502073.85</code>)
,(<code style="color: red">2011</code>,<code style="color: red">10</code>,<code style="color: red">4588761.82</code>)
,(<code style="color: red">2011</code>,<code style="color: red">11</code>,<code style="color: red">737839.82</code>)
,(<code style="color: red">2011</code>,<code style="color: red">12</code>,<code style="color: red">1309863.25</code>)
,(<code style="color: red">2012</code>,<code style="color: red">1</code>,<code style="color: red">3970627.28</code>)
,(<code style="color: red">2012</code>,<code style="color: red">2</code>,<code style="color: red">1475426.91</code>)
,(<code style="color: red">2012</code>,<code style="color: red">3</code>,<code style="color: red">2975748.24</code>)
,(<code style="color: red">2012</code>,<code style="color: red">4</code>,<code style="color: red">1634600.80</code>)
,(<code style="color: red">2012</code>,<code style="color: red">5</code>,<code style="color: red">3074602.81</code>)
,(<code style="color: red">2012</code>,<code style="color: red">6</code>,<code style="color: red">4099354.36</code>)
,(<code style="color: red">2012</code>,<code style="color: red">7</code>,<code style="color: red">3417953.87</code>)
,(<code style="color: red">2012</code>,<code style="color: red">8</code>,<code style="color: red">2175637.22</code>)
,(<code style="color: red">2012</code>,<code style="color: red">9</code>,<code style="color: red">3454151.94</code>)
,(<code style="color: red">2012</code>,<code style="color: red">10</code>,<code style="color: red">2544091.11</code>)
,(<code style="color: red">2012</code>,<code style="color: red">11</code>,<code style="color: red">1872701.98</code>)
,(<code style="color: red">2012</code>,<code style="color: red">12</code>,<code style="color: red">2829404.82</code>)
,(<code style="color: red">2013</code>,<code style="color: red">1</code>,<code style="color: red">2087872.46</code>)
,(<code style="color: red">2013</code>,<code style="color: red">2</code>,<code style="color: red">2316922.15</code>)
,(<code style="color: red">2013</code>,<code style="color: red">3</code>,<code style="color: red">3412068.97</code>)
,(<code style="color: red">2013</code>,<code style="color: red">4</code>,<code style="color: red">2532265.91</code>)
,(<code style="color: red">2013</code>,<code style="color: red">5</code>,<code style="color: red">3245623.76</code>)
,(<code style="color: red">2013</code>,<code style="color: red">6</code>,<code style="color: red">5081069.13</code>)
,(<code style="color: red">2013</code>,<code style="color: red">7</code>,<code style="color: red">4896353.74</code>)
,(<code style="color: red">2013</code>,<code style="color: red">8</code>,<code style="color: red">3333964.07</code>)
,(<code style="color: red">2013</code>,<code style="color: red">9</code>,<code style="color: red">4532908.71</code>)
,(<code style="color: red">2013</code>,<code style="color: red">10</code>,<code style="color: red">4795813.29</code>)
,(<code style="color: red">2013</code>,<code style="color: red">11</code>,<code style="color: red">3312130.25</code>)
,(<code style="color: red">2013</code>,<code style="color: red">12</code>,<code style="color: red">4075486.63</code>)
,(<code style="color: red">2014</code>,<code style="color: red">1</code>,<code style="color: red">4289817.95</code>)
,(<code style="color: red">2014</code>,<code style="color: red">2</code>,<code style="color: red">1337725.04</code>)
,(<code style="color: red">2014</code>,<code style="color: red">3</code>,<code style="color: red">7217531.09</code>)
,(<code style="color: red">2014</code>,<code style="color: red">4</code>,<code style="color: red">1797173.92</code>)
,(<code style="color: red">2014</code>,<code style="color: red">5</code>,<code style="color: red">5366674.97</code>)
,(<code style="color: red">2014</code>,<code style="color: red">6</code>,<code style="color: red">49005.84</code>);
</pre>
This setup is slightly different from the original problem statement. Instead of a column with a <code style="color:blue">DATE</code> data type, we have separate <code>SalesYear</code> and <code>SalesMonth</code> columns.
This is fine - it doesn't change the problem or the solution in any way.
<br/>
<br/>
In fact, this setup allows us to think about the essential elements of the problem without having to worry about the details of getting to that point.
Once we've done that, we can apply the approach to a more realistic case.
<br/>
<br/>
<h3>Next installment: Solution 1 - a self-<code><b style="color:blue">JOIN</b></code></h3>
<br/>
In the <a href="https://rpbouman.blogspot.com/2021/02/year-to-date-on-synapse-analytics-2.html">next installment</a> we will present and discuss a solution based on a self-<code><b style="color:blue">JOIN</b></code> and a <code><b style="color:blue">GROUP BY</b></code>.
<h1>Building a UI5 Demo for SAP HANA Text Analysis: Part 4</h1>
This is the last of a series of blogposts describing a simple web front-end tool to explore SAP HANA's Text Analysis features on documents uploaded by the user.
As a reminder, the following overview outlines all the posts in the series:
<ul>
<li><a href="http://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text.html">Part 1 - an Overview: SAP HANA Text Analysis on Documents uploaded by an end-user</a></li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html">Part 2 - Hands on: Building the backend for a SAP HANA Text Analysis application</a></li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_28.html">Part 3 - Presenting: A UI5 front-end to upload documents and explore SAP HANA Text Analytics features</a></li>
<li>Part 4 - Deep dive: How to upload documents with OData in a UI5 Application</li>
</ul>
In the <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_28.html">previous post</a> we presented the sample application, explained its functionality, and concluded by pointing to <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo" target="github">the github repository</a> and the installation instructions so that you may run the application on your own HANA system.
<br/>
<br/>
In this post, we'll explain in detail how the upload functionality works from the UI5 side of things.
<h2>Uploading Files and Binary data to HANA <code>.xsodata</code> services using UI5</h2>
In the <a href="http://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text.html">first installment of the series</a>, options and concerns were discussed on the topic of loading the (binary) document content into the SAP HANA database. We chose to use an OData service. In this installment, we'll go into fairly deep detail about how to implement a file upload feature backed by a HANA .xsodata service using the UI5 front-end framework.
<h3>Some notes on the UI5 Application Implementation</h3>
Before we discuss any particular details of the implementation of the UI5 application, it is necessary to point out that this particular application is demoware.
Many typical patterns of UI5 application development were omitted here: there is no <a href="https://sapui5.hana.ondemand.com/1.36.6/docs/guide/df86bfbeab0645e5b764ffa488ed57dc.html" target="ui5">internationalization</a>, and no <a href="https://sapui5.hana.ondemand.com/#/topic/f665d0de4dba405f9af4294de824b03b" target="ui5">modules</a> or dependency injection with <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui/methods/sap.ui.define" target="ui5"><code>sap.ui.define()</code></a>. There is not even a MVC architecture, so no <a href="https://sapui5.hana.ondemand.com/#/topic/1409791afe4747319a3b23a1e2fc7064" target="ui5">XML views</a> or <a href="https://sapui5.hana.ondemand.com/#/topic/50579ddf2c934ce789e056cfffe9efa9" target="ui5">controllers</a>; no <a href="https://sapui5.hana.ondemand.com/#/topic/4cfa60872dca462cb87148ccd0d948ee" target="ui5">Component configuration</a>, and no <a href="https://sapui5.hana.ondemand.com/#/topic/8f93bf2b2b13402e9f035128ce8b495f" target="ui5">application descriptor</a> (<code>manifest.json</code>).
<br/>
<br/>
Instead, the application consists of just a single <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo/blob/master/web/index.html" target="github"><code>index.html</code></a>, which contains 2 <code><script></code> tags:<pre><code><script
src=<span style="color:red">"https://sapui5.hana.ondemand.com/1.71.5/resources/sap-ui-core.js"</span>
id=<span style="color:red">"sap-ui-bootstrap"</span>
data-sap-ui-libs=<span style="color:red">"
sap.ui.core,
sap.ui.layout,
sap.ui.unified,
sap.ui.table,
sap.ui.commons,
sap.m
"</span>
data-sap-ui-theme=<span style="color:red">"sap_bluecrystal"</span>
>
</script>
<script src=<span style="color:red">"index.js"</span> type=<span style="color:red">"text/javascript"</span>></script></code></pre>
The first one <a href="https://sapui5.hana.ondemand.com/#/topic/a04b0d10fb494d1cb722b9e341b584ba.html" target="ui5">bootstraps ui5</a>, and the second one loads <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo/blob/master/web/index.js" target="github"><code>index.js</code></a>, which contains the implementation.
<br/>
<br/>
The main reason for this rather spartan approach is that the primary goal of my colleagues Arjen and Mitchell and me was to quickly come up with a functional prototype that demonstrates the file upload feature. Although I have grown used to a more orthodox UI5 boilerplate, it was a distraction when it came to just quickly illustrating an upload feature. Once we built the upload feature, I wanted to see how easy it would be to augment it and make it a somewhat useful application, and I was kind of interested to experience how it would be to carry on using this unorthodox, pure-javascript approach.<br/><br/>There's much more that could be said about this approach, but that's another topic. So for now: if you're new to UI5 and want to learn more, don't take this application as an example - it's atypical. And if you are an experienced UI5 developer: now you have the background, let's move on to the essence.
<h3>Communication with the backend using an OData Model</h3>
Before we get to the topic of building and controlling the upload feature, a couple of words should be said about how UI5 applications can communicate with their (xsodata) backend.<br/><br/>
In UI5, we needn't worry about the exact details of doing the backend call directly. Rather, UI5 offers an object that provides javascript methods that take care of this. This object is the <b>model</b>. In our application, the model is an instance of the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel" target="ui5"><code>sap.ui.model.odata.v2.ODataModel</code></a>, which we instantiated somewhere <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo/blob/master/web/index.js#L7" target="github">in the top of our <code>index.js</code></a>:<pre><code><span style="color:blue">var</span> pkg = document.location.pathname.split(<span style="color:red">'/'</span>).slice(<span style="color:red">1</span>, <span style="color:red">-2</span>);
<span style="color:blue">var</span> odataService = [].concat(pkg, [<span style="color:red">'service'</span>, <span style="color:red">'ta'</span>]);
<span style="color:grey">/**
* OData
*/</span>
<span style="color:blue">var</span> modelName = <span style="color:red">'data'</span>;
<b><span style="color:blue">var</span> model = <span style="color:blue">new</span> sap.ui.model.odata.v2.ODataModel(<span style="color:red">'/'</span> + odataService.join(<span style="color:red">'/'</span>) + <span style="color:red">'.xsodata'</span></b>, {
disableHeadRequestForToken: <span style="color:blue">true</span>
});</code></pre>It's not necessary to go over the model instantiation in detail - for now it is enough to know that upon instantiation, the model is passed the uri of <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-xsodata-service-definition"> the <code>.xsodata</code> service we already built</a>. We obtain the url in the code preceding the model instantiation by taking the url of the current webpage and building a path to <code>service/ta.xsodata</code> relative to that location:<pre><code><span style="color:blue">var</span> pkg = document.location.pathname.split(<span style="color:red">'/'</span>).slice(<span style="color:red">1</span>, <span style="color:red">-2</span>);
<span style="color:blue">var</span> odataService = [].concat(pkg, [<span style="color:red">'service'</span>, <span style="color:red">'ta'</span>]); </code></pre>
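To make the slicing concrete, here is a minimal standalone sketch of the same derivation, using a hypothetical page path in place of <code>document.location.pathname</code>:

```javascript
// Hypothetical page path; in the app this comes from document.location.pathname
var pathname = '/my/package/web/index.html';

// split yields ['', 'my', 'package', 'web', 'index.html'];
// slice(1, -2) drops the leading empty string, the web folder and the file name
var pkg = pathname.split('/').slice(1, -2);               // ['my', 'package']

// append the service package and the service name
var odataService = [].concat(pkg, ['service', 'ta']);

// reassemble into the service uri that is passed to the ODataModel
var serviceUri = '/' + odataService.join('/') + '.xsodata';
console.log(serviceUri);                                  // /my/package/service/ta.xsodata
```

Deriving the uri this way means an app deployed under a different package automatically addresses the service relative to its own location.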
<h3>Uploading a file: high level client-side tasks</h3>
From a functional point of view, there are two distinct tasks for the web app (client) to consider:<ul>
<li>Building the user interface so the user can select the file to upload.</li>
<li>Loading the file contents into the database.</li>
</ul>
The first high-level task is strictly a matter of user interaction and is more or less independent of how the second high-level task is implemented.
For the second high-level task, we already have the backend in place - this is <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-xsodata-service-definition">the OData service we built in the second installment of this blogpost series</a>. What remains is how to do this from within UI5.
<br/>
<br/>But already, we can break down this task into two subtasks:
<ul>
<li>Extracting the content from the chosen file. Once the user has chosen a file, they have only identified the thing they want to upload. The web app does not need to parse or understand the file content, but it does need to extract the data (file content) so it can send it to the server.</li>
<li>Sending the right request to the backend. The request will somehow include the contents extracted from the file, and it will have such a form that the server understands what to do with those contents - in this case, store it in a table for text analysis.</li>
</ul>
<h3>A UI5 File Upload control</h3>
For the file upload user interaface, we settled on the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.unified.FileUploader" target="ui5"><code>sap.ui.unified.FileUploader</code></a> control. Here's the relevant code from <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo/blob/master/web/index.js#L257" target="github"><code>index.js</code></a> that instantiates the control:<pre><code><span style="color:blue">var</span> fileUploader = <span style="color:blue">new</span> <b>sap.ui.unified.FileUploader</b>({
buttonText: <span style="color:red">'Browse File...'</span>,
  <b>change: onFileToUploadChanged</b>,
busyIndicatorDelay: <span style="color:red">0</span>
});
</code></pre>
The <code>sap.ui.unified.FileUploader</code> control is presented to the user as an input field and a button to open a file chooser. This lets the user browse and pick a file from their client device.
<br/>
<br/>
In addition, the <code>sap.ui.unified.FileUploader</code> control provides events, configuration options and methods to validate the user's choice, and to send the file off to a server. For example, you can set the <code>uploadUrl</code> property to specify where to send the file to, and there's an <code>upload()</code> method to let the control do the request.
<br/>
<br/>
As it turns out, most of this additional functionality did not prove to be very useful for the task at hand, because the request we need to make is quite specific, and we didn't really find a clear way of configuring the control to send just the right request. Perhaps it is possible, and if so, we would be most obliged to learn how.
<br/>
<br/>
What we ended up doing instead is to only use the file choosing capabilities of the <code>sap.ui.unified.FileUploader</code> control. To keep track of the user's choice, we <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo/blob/master/web/index.js#L204" target="github">configured a handler</a> for the <code>change</code> event, which gets called whenever the user chooses a file, or cancels the choice.
<br/>
<br/>
The handler does a couple of things: <ul>
<li>Determine whether a file was chosen. If not, the Upload confirmation button gets disabled so the user can only either retry choosing a file, or close the upload dialog.</li>
<li>If a file is chosen, a request is sent to the backend to figure out if the file already exists.</li>
<li>Depending upon whether the file already exists, the state of the upload dialog is set to inform the user of what action will be taken if they confirm the upload. </li>
</ul>
Let's go over these tasks in detail. First, validating the user's choice by checking if the user did in fact choose a file:<pre><code><b><span style="color:blue">var</span> fileToUpload;</b>
<span style="color:blue">var</span> fileToUploadExists;
<b><span style="color:blue">function</span> onFileToUploadChanged(event)</b>{
fileToUpload = null;
fileToUploadExists = false;
<b><span style="color:blue">var</span> files = event.getParameter(<span style="color:red">'files'</span>)</b>;
<span style="color:blue">if</span> (files.length === 0) {
initFileUploadDialog();
return;
}
fileToUpload = files[0];
...more code here...
}
</code></pre>
Note that we set up the <code>fileToUpload</code> variable to keep track of the user's choice. We need to keep track of it somewhere, since the choosing of the file and the upload are separate tasks with regard to the UI: choosing the file happens when the user hits the Browse button provided by the <code>sap.ui.unified.FileUploader</code> control, whereas the upload is triggered by hitting the confirm button of the upload dialog.
<br/>
<br/>
When the user is done choosing the file, the <code>sap.ui.unified.FileUploader</code> will fire the <code>change</code> event, and our handler <code>onFileToUploadChanged()</code> gets called and passed the event as an argument. This event provides access to the <a href="https://developer.mozilla.org/en-US/docs/Web/API/FileList" target="mdn"><code>FileList</code></a> object associated with the file chooser:<pre><code> <b><span style="color:blue">var</span> files = event.getParameter(<span style="color:red">'files'</span>)</b>;</code></pre>
Note: the <code>FileList</code> is not part of UI5. Rather, it is one of a number of browser built-in objects, which together form the <a href="https://www.w3.org/TR/FileAPI/" target="w3">Web File API</a>. We would have loved to obtain the <code>FileList</code> or the <code>File</code> object from our <code>sap.ui.unified.FileUploader</code> control directly by using a getter or something like that, but at the time we found no such method, and settled for a handler on the <code>change</code> event.
<br/>
<br/>
Once we have the <code>FileList</code>, we can check whether the user selected any files, and either disable the upload confirmation button (if no file was selected), or assign the chosen file to our <code>fileToUpload</code> variable so we can refer to it when the upload is confirmed: <pre><code><span style="color:blue">function</span> onFileToUploadChanged(event){
...
<b><span style="color:blue">if</span> (files.length === 0)</b> {
initFileUploadDialog();
<b>return;</b>
}
<b>fileToUpload = files[0];</b>
....
}</code></pre>If we pass the check, our variable <code>fileToUpload</code> will now contain the <a href="https://developer.mozilla.org/en-US/docs/Web/API/File" target="mdn"><code>File</code></a> object reflecting the user's choice. (Note that this object too is not a UI5 object, it's also part of the Web File API.)
<br/>
<br/>
Note that in theory, the list of files associated with the <code>sap.ui.unified.FileUploader</code> could have contained more than one file. But the default behavior is to let the user choose only one file. You can override that behavior by setting the <code>sap.ui.unified.FileUploader</code>'s <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.unified.FileUploader/methods/getMultiple" target="ui5"><code>multiple</code></a> property to <code><span style="color:blue">true</span></code>. Because we know that in this case there can be at most one file, we only need to check whether there is a file or not - there's no need to consider multiple files.
<h3>Checking whether the File was already Uploaded</h3>
Once we know for sure the user has chosen a file, it remains to be determined what should be done with it should the user decide to confirm the upload. To help the user decide whether they should confirm, we send a request to the backend to find out if the file was already uploaded: <pre><code><span style="color:blue">function</span> onFileToUploadChanged(event){
...
<b>fileToUpload = files[0];</b>
fileUploader.setBusy(<span style="color:blue">true</span>);
<b>model.read('/' + filesEntityName</b>, {
filters: [<span style="color:blue">new</span> sap.ui.model.Filter({
path: fileNamePath,
operator: sap.ui.model.FilterOperator.EQ,
<b>value1: fileToUpload.name</b>
})],
urlParameters: {
$select: [fileNamePath, <span style="color:red">'FILE_LAST_MODIFIED'</span>]
},
success: <span style="color:blue">function</span>(data){
...update state depending upon whether the file exists...
},
error: <span style="color:blue">function</span>(error){
...update state to inform the user of an error...
}
});
....
}</code></pre>
The model provides a <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel/methods/read" target="ui5"><code>read()</code> method</a> which can be used to query the backend OData service. The first argument to the <code>read()</code> method is the so-called <code>path</code>, which identifies the OData EntitySet we want to query. In this case, we are interested in the <code>Files</code> EntitySet, as this corresponds to our <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-CT_FILE-table"><code>CT_FILE</code> database table</a> in our backend. Because we use the name of the <code>Files</code> EntitySet in a lot of places, we stored it in the <code>filesEntityName</code> variable. So, our path becomes: <pre><code><span style="color:red">'/'</span> + filesEntityName</code></pre>
Apart from the path, the <code>read()</code> method takes a second argument, which is an object of query options. We'll highlight the few we need here.
<br/>
<br/>
Because we only want to know whether the backend already has a file with the same name as the one the user just selected, we add a parameter to restrict the search. This is done with the <code>filters</code> option:<pre><code>    filters: [<b><span style="color:blue">new</span> sap.ui.model.Filter</b>({
path: fileNamePath,
operator: sap.ui.model.FilterOperator.EQ,
<b>value1: fileToUpload.name</b>
})],
</code></pre>The <code>filters</code> option takes an array of <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.Filter" target="ui5"><code>sap.ui.model.Filter</code></a> objects. When we instantiate the <code>sap.ui.model.Filter</code> object, we pass an object with the following configuration options:<ul>
<li><code>path</code> - this should get a value that refers to a property defined by the OData entity type of this Entity Set. It corresponds to the name of a column of our database table. In this case, it is set to <code>fileNamePath</code>, which is a variable we initialized with <code><span style="color:red">'FILE_NAME'</span></code>, i.e., the name of the column in the <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-CT_FILE-table"><code>CT_FILE</code></a> table that holds the name of our files.</li>
<li><code>value1</code> - this should be the literal value that we want to use in our filter. In this case, we want to look for files with the same name as the file chosen by the user, so we set it to the <a href="https://developer.mozilla.org/en-US/docs/Web/API/File/name" target="mdn"><code>name</code> property of the <code>File</code> object</a> that the user selected - <code>fileToUpload.name</code></li>
<li><code>operator</code> - this should be one of the values defined by the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.FilterOperator" target="ui5"><code>sap.ui.model.FilterOperator</code></a> object, which defines how the given filter value should be compared to the value of the column. In this case the operator is <code>sap.ui.model.FilterOperator.EQ</code>, which stands for an equals comparison. By using this operator, we demand that the value of the column should be exactly the same as the name of the chosen file.</li>
</ul>
There is one other option specified that affects the request: <pre><code> urlParameters: {
$select: [fileNamePath, <span style="color:red">'FILE_LAST_MODIFIED'</span>]
},
</code></pre>This specifies for which columns we want to retrieve the values from the backend. It may be omitted, but in that case, all columns would be returned. Often this will not be a problem, but in this case, we really want to prevent the server from returning the values for the <code>FILE_CONTENT</code> column. Always retrieving the file contents would be an unnecessary burden for both the front-end and the backend, so we actively suppress the default behavior. The only columns requested here are <code>FILE_NAME</code> and <code>FILE_LAST_MODIFIED</code>. The latter is currently unused, but might come in handy to provide even more information to the user so they can better decide whether they want to re-upload an existing file.
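Putting the pieces together, the <code>read()</code> call shown above boils down to a single GET request against the <code>Files</code> entity set. The following standalone sketch (the service path and file name are made up for illustration; this is what the model derives for you, not code you would write) constructs the query url corresponding to the filter and <code>$select</code> options:

```javascript
// Sketch only: mimics the query url the ODataModel derives from the read()
// options; in application code the model builds and sends this itself.
var serviceUri = '/my/package/service/ta.xsodata';   // hypothetical service path
var fileToUploadName = 'contract.pdf';               // hypothetical chosen file

// the EQ filter becomes an OData $filter expression: FILE_NAME eq 'contract.pdf'
var filterExpression = "FILE_NAME eq '" + fileToUploadName + "'";

var url = serviceUri + '/Files'
        + '?$filter=' + encodeURIComponent(filterExpression)
        + '&$select=' + ['FILE_NAME', 'FILE_LAST_MODIFIED'].join(',');

console.log(url);
```

Note how the <code>$select</code> list is what keeps the bulky <code>FILE_CONTENT</code> column out of the response.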
<br/>
<br/>
The remaining options in the call to the model's <code>read()</code> method have nothing to do with the request, but are callback functions for handling the result of the read request. The <code>error</code> callback gets called if there is some kind of issue with the request itself - maybe the backend has gone away, or maybe the structure of the service changed. The <code>success</code> callback is called when the read request executes normally, and any results are then passed to it as an argument. This is even true if no results are found - the callback then simply gets passed an empty list of results.<br/>In our example, the main purpose of the <code>success</code> callback is to flag whether the file already exists, and to update the state of the file uploader accordingly to inform the user. The existence of the file is flagged by assigning the <code>fileToUploadExists</code> variable, and we will see its significance in the next section where we discuss the implementation of the upload of the file contents.
<h3>Handling the Upload</h3>
We've just seen exactly how the UI5 application can let the user choose a file, and we even used our model to check whether the chosen file is already present in the backend. Once these steps are done, we have successfully initialized two variables, <code>fileToUpload</code> and <code>fileToUploadExists</code>. This is all we need to handle the upload.
<br/>
<br/>
In the application, the user initiates the upload by clicking the Confirmation Button of the uploadDialog. This then triggers the button's <code>press</code> event, where we've attached the function <code>uploadFile</code> as handler.
<br/><br/>
So, in this handler, we must examine the value of the <code>fileToUploadExists</code> variable and take the appropriate action:<ul>
<li>If <code>fileToUploadExists</code> is <code><span style="color:blue">false</span></code>, we should tell our model to add a new item. This is done by calling the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel/methods/createEntry" target="ui5"><code>createEntry()</code>-method</a></li>
<li>If <code>fileToUploadExists</code> is <code><span style="color:blue">true</span></code>, we should tell our model to update the existing item. This is done by calling the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel/methods/update" target="ui5"><code>update()</code>-method</a></li>
</ul>
<h4>The path argument</h4>
Both methods take a path as their first argument to indicate where the new item should be added, or which item to update.<br/><br/>When adding a new item, the path is simply the path of the Files Entity Set within the model:<pre><code><span style="color:red">'/'</span> + filesEntityName</code></pre>(Note: this is exactly the same as the path we used in the <code>read()</code> call to figure out whether the file already exists.)
<br/>
<br/>
The path for updating an existing item also starts with the path of the Entity Set, but includes the key to identify the item that is to be updated. Lucky for us, the <code>sap.ui.model.odata.v2.ODataModel</code> model provides the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel/methods/createKey" target="ui5"><code>createKey()</code> method</a> which constructs such a path, including the key part, based on the values of the properties that make up the key. So, the code to construct the path for the <code>update</code> method becomes:<pre><code><span style="color:red">'/'</span> + <b>model.createKey</b>(filesEntityName, {<span style="color:red">"FILE_NAME"</span>: fileToUpload.name})</code></pre>(For more detailed information about how OData keys and paths work, see <a href="https://www.odata.org/documentation/odata-version-2-0/uri-conventions/" target="odata">OData URI Conventions</a>, in particular the section on "Addressing Entries".)
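A sketch of what such a key path looks like for an entity set with a single string key. The helper function and file name below are illustrative only - in the app, <code>createKey()</code> does this work, and more, for you:

```javascript
// Sketch of the OData V2 key addressing scheme: a single string key is
// rendered as a quoted literal in parentheses after the entity set name.
function createKeyPath(entitySetName, keyValue) {
  // per the OData string literal rules, a single quote inside the value
  // is escaped by doubling it
  return entitySetName + "('" + keyValue.replace(/'/g, "''") + "')";
}

var updatePath = '/' + createKeyPath('Files', 'report.pdf');
console.log(updatePath);   // /Files('report.pdf')
```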
<h4>The payload</h4>
In addition to the path, we also need to pass the data, which is sometimes referred to as <em>the payload</em>. While the path tells the model <em>where</em> to add or update an item, the payload specifies <em>what</em> should be added or updated.
<br/><br/>
Now, even though the UI5 documentation is not very specific about how to construct the payload, we have used the <code>createEntry()</code> and <code>update()</code> methods of the <code>sap.ui.model.odata.v2.ODataModel</code> in the past without any problems. It is normally quite intuitive and hassle-free: you simply create an Object with keys that match the property names of the target entity set, and assign JavaScript values, just as-is. So, if we disregard the <code>FILE_CONTENT</code> field for a moment, the payload for the Files entity set could be something like this:<pre><code><span style="color:blue">var</span> payload = {
<span style="color:red">"FILE_NAME"</span>: fileToUpload.name,
<span style="color:red">"FILE_TYPE"</span>: fileToUpload.type,
<span style="color:red">"FILE_LAST_MODIFIED"</span>: <span style="color:blue">new</span> Date(fileToUpload.lastModified),
<span style="color:red">"FILE_SIZE"</span>: fileToUpload.size,
<span style="color:red">"FILE_LAST_UPLOADED"</span>: <span style="color:blue">new</span> Date(Date.now())
};</code></pre>Let's compare this to the data types of the corresponding properties in the entity type of the entity set:<pre><code><EntityType Name=<span style="color:red">"FilesType"</span>>
<Key>
<PropertyRef Name=<span style="color:red">"FILE_NAME"</span>/>
</Key>
<Property Name=<span style="color:red">"FILE_NAME"</span> Type=<span style="color:red">"Edm.String"</span> Nullable=<span style="color:red">"false"</span> MaxLength=<span style="color:red">"256"</span>/>
<Property Name=<span style="color:red">"FILE_TYPE"</span> Type=<span style="color:red">"Edm.String"</span> Nullable=<span style="color:red">"false"</span> MaxLength=<span style="color:red">"256"</span>/>
<Property Name=<span style="color:red">"FILE_LAST_MODIFIED"</span> Type=<span style="color:red">"Edm.DateTime"</span> Nullable=<span style="color:red">"false"</span>/>
<Property Name=<span style="color:red">"FILE_SIZE"</span> Type=<span style="color:red">"Edm.Int32"</span> Nullable=<span style="color:red">"false"</span>/>
<b><Property Name=<span style="color:red">"FILE_CONTENT"</span> Type=<span style="color:red">"Edm.Binary"</span> Nullable=<span style="color:red">"false"</span>/></b>
<Property Name=<span style="color:red">"FILE_LAST_UPLOADED"</span> Type=<span style="color:red">"Edm.DateTime"</span> Nullable=<span style="color:red">"false"</span>/>
</EntityType>
</code></pre>(Note: this entity type is taken from <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-xsodata-metadata">the <code>$metadata</code> document</a> of our service.)
<br/><br/>
So, in short - there is a pretty straightforward mapping between the JavaScript runtime values and the <a href="https://www.odata.org/documentation/odata-version-2-0/overview/" target="odata">Edm Type System</a> (see: "6. Primitive Data Types") used by OData: JavaScript <code>String</code>s may be assigned to <code>Edm.String</code>s, JavaScript <code>Date</code> objects may be assigned to <code>Edm.DateTime</code>s, and JavaScript <code>Number</code>s may be assigned to <code>Edm.Int32</code>s.
<br/><br/>
This is less trivial than one might think when one considers what happens here: on the one hand, we have the types as declared by the OData service, which are Edm Types. Then, we have to consider the content type used to transport the payload in the HTTP request: OData services may support several content types, and by default OData supports both <a href="https://www.odata.org/documentation/odata-version-2-0/atom-format/" target="odata">application/atom+xml</a> and <a href="https://www.odata.org/documentation/odata-version-2-0/json-format/" target="odata">application/json</a>. So, when starting with a payload as a JavaScript runtime object, this first needs to be converted to an equivalent representation in one of these content types (UI5 uses the JSON representation) by the client before it can be sent off to the OData service in an HTTP request.
<br/>
<br/>It bears repeating that this is not a simple, standard JSON serialization, since the type system used by the JSON standard only knows how to represent JavaScript <code>String</code>s, <code>Number</code>s, <code>Boolean</code>s (and arrays and objects containing values of those types). The native JSON type system is simply too minimal to represent all the types in the Edm Type system used by OData, hence the need for an extra JSON representation format. The <code>sap.ui.model.odata.v2.ODataModel</code> does a pretty remarkable job of hiding all this complexity and making sure things work relatively painlessly.
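To make this concrete, consider <code>Edm.DateTime</code>: in the OData version 2 JSON format, a date/time value is represented as a string of the form <code>/Date(&lt;milliseconds&gt;)/</code> rather than as any native JSON value. A minimal sketch of that conversion - which the <code>ODataModel</code> performs internally, so you never write this yourself - looks like this:

```javascript
// Sketch: how a JavaScript Date maps to the OData v2 JSON representation
// of Edm.DateTime. The ODataModel does this conversion internally.
function toODataV2DateTime(date) {
  // milliseconds since the Unix epoch, wrapped in the "/Date(...)/" marker
  return "/Date(" + date.getTime() + ")/";
}

toODataV2DateTime(new Date(Date.UTC(2019, 11, 1))); // "/Date(1575158400000)/"
```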
<h4>Representing <code>FILE_CONTENT</code> in the payload</h4>
Now for the <code>FILE_CONTENT</code> property. In the entity type, we notice that the data type is <code>Edm.Binary</code>. What would be the proper JavaScript runtime type to construct the payload?
<br/>
<br/>
We just mentioned that normally, the mapping from JavaScript runtime types is usually taken care of by the <code>sap.ui.model.odata.v2.ODataModel</code>. So we might be tempted to simply pass the <code>File</code> object itself directly as value for the <code>FILE_CONTENT</code> property. But when we call either the <code>createEntry</code> or <code>update</code> method with a payload like this:<pre><code><span style="color:blue">var</span> payload = {
<span style="color:red">"FILE_NAME"</span>: fileToUpload.name,
<span style="color:red">"FILE_TYPE"</span>: fileToUpload.type,
<span style="color:red">"FILE_LAST_MODIFIED"</span>: <span style="color:blue">new</span> Date(fileToUpload.lastModified),
<span style="color:red">"FILE_SIZE"</span>: fileToUpload.size,
<span style="color:red">"FILE_LAST_UPLOADED"</span>: <span style="color:blue">new</span> Date(Date.now()),
<b><span style="color:red">"FILE_CONTENT"</span>: fileToUpload</b>
};</code></pre> we get an error in the response:<pre><code>The serialized resource has an invalid value in member 'FILE_CONTENT'.</code></pre>So clearly, the <code>sap.ui.model.odata.v2.ODataModel</code> needs some help here.
<br/>
<br/>
One might assume that the problem has to do with the <code>File</code> object being a little bit too specific for UI5 - after all, a <code>File</code> object is not just some binary value, but a subclass of the <code>Blob</code> object, adding all kinds of file-specific properties of its own. However, assigning a proper, plain <code>Blob</code> object in the payload yields exactly the same result, so that's not it either.
<br/>
<br/>
Instead of continuing to experiment with different values and types, we took a step back and took a look at the OData specification to see if we could learn a bit more about the <code>Edm.Binary</code> type. In the <a href="https://www.odata.org/documentation/odata-version-2-0/json-format/" target="odata">part about the JSON representation</a> (See: "4. Primitive Types") we found this: <pre>Base64 encoded value of an EDM.Binary value represented as a JSON string</pre>This suggests that whatever value represents the <code>Edm.Binary</code> value needs to be Base64 encoded, which yields a string value at runtime, and this string may then be serialized to a JSON string. So, if we could make a Base64 encoded string value of our binary value, we could assign that in the payload. (We already saw that the <code>sap.ui.model.odata.v2.ODataModel</code> will turn JavaScript <code>String</code> values into a JSON representation, so we don't have to do that step ourselves.)
<br/>
<br/>
Fortunately, it's easy to create Base64 encoded values. The browser built-in function <a href="https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/btoa" target="mdn"><code>btoa()</code></a> does this for us.
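For example (you can verify these values in any browser console):

```javascript
// btoa() Base64-encodes a "binary string": a string in which every
// character has a code point in the 0-255 range, representing one byte.
btoa("Hello");        // "SGVsbG8="
atob(btoa("Hello"));  // "Hello" - atob() is the inverse

// Characters above U+00FF do not fit in a single byte, so calling
// btoa() on them, e.g. btoa("€"), throws an InvalidCharacterError.
```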
<br/><br/>However, we're not there yet, as the spec starts with a binary value, and JavaScript does not have a binary type (and hence, no binary values).<br/><br/>
We then took a look at the specification to find out exactly what an <code>Edm.Binary</code> value is. We found something in <a href="https://www.odata.org/documentation/odata-version-2-0/overview/" target="odata">the section about Primitive Data Types</a> on how to create literal <code>Edm.Binary</code> values:<pre>binary'[A-Fa-f0-9][A-Fa-f0-9]*' OR X '[A-Fa-f0-9][A-Fa-f0-9]*'
NOTE: X and binary are case sensitive.
Spaces are not allowed between binary and the quoted portion.
Spaces are not allowed between X and the quoted portion.
Odd pairs of hex digits are not allowed.
Example 1: X'23AB'
Example 2: binary'23ABFF'</pre>
At this point the thinking was that we could take the bytes that make up the binary value, convert them to their hexadecimal string representation, single-quote the resulting hex string, and finally prepend either <code>X</code> or <code>binary</code> to it. At runtime, this would then be a JavaScript string value representing an <code>Edm.Binary</code> literal, which we could then turn into its Base64 encoded value and assign to the payload.<br/><br/>When we went this route, the error message went away, and sure enough, documents started to show up in our backend table. Unfortunately, the documents ended up there as <code>Edm.Binary</code> literals, that is, as strings that accurately represent the document as an <code>Edm.Binary</code> literal, but are otherwise useless.
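For illustration, that (ultimately wrong) intermediate step can be sketched as follows. Note that this is a reconstruction of the failed attempt, not code from the final application:

```javascript
// Reconstruction of the failed attempt: turn a binary string into an
// Edm.Binary literal such as X'23AB'. Sending this literal to the backend
// stores the literal text itself, not the original bytes.
function toEdmBinaryLiteral(binaryString) {
  var hex = "";
  for (var i = 0; i < binaryString.length; i++) {
    var h = binaryString.charCodeAt(i).toString(16).toUpperCase();
    hex += (h.length < 2 ? "0" : "") + h;  // left-pad to two hex digits
  }
  return "X'" + hex + "'";
}

toEdmBinaryLiteral("AB"); // "X'4142'" - "A" is byte 0x41, "B" is byte 0x42
```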
<br/><br/>
At this point the solution was clear though - just leave out the intermediate step of converting the original value to an <code>Edm.Binary</code> literal.
<h3>The <code>uploadFile</code> function</h3>
Remember, at this point we have the <code>File</code> object stored in the <code>fileToUpload</code> variable, and a flag <code>fileToUploadExists</code> is set to <code>true</code> or <code>false</code> depending upon whether the file is already stored in the backend table. This is the code we ended up with for uploading the file:<pre><code><span style="color:blue">function</span> uploadFile(){
<span style="color:blue">var</span> fileReader = <span style="color:blue">new</span> FileReader();
fileReader.onload = <span style="color:blue">function</span>(event){
<span style="color:blue">var</span> binaryString = event.target.result;
<span style="color:blue">var</span> payload = {
<span style="color:red">"FILE_NAME"</span>: fileToUpload.name,
<span style="color:red">"FILE_TYPE"</span>: fileToUpload.type,
<span style="color:red">"FILE_LAST_MODIFIED"</span>: <span style="color:blue">new</span> Date(fileToUpload.lastModified),
<span style="color:red">"FILE_SIZE"</span>: fileToUpload.size,
<span style="color:red">"FILE_CONTENT"</span>: btoa(binaryString),
<span style="color:red">"FILE_LAST_UPLOADED"</span>: <span style="color:blue">new</span> Date(Date.now())
};
<span style="color:blue">if</span> (fileToUploadExists) {
model.update(
<span style="color:red">'/'</span> + model.createKey(filesEntityName, {
<span style="color:red">"FILE_NAME"</span>: fileToUpload.name
}),
payload
);
}
<span style="color:blue">else</span> {
model.createEntry(<span style="color:red">'/'</span> + filesEntityName, {
properties: payload
});
}
model.submitChanges({
success: <span style="color:blue">function</span>(){
closeUploadDialog();
}
});
};
fileReader.readAsBinaryString(fileToUpload);
}</code></pre>As explained earlier, uploading the file breaks down into 2 subtasks, and this handler takes care of both:<ul>
<li>First, we use the <code>FileReader</code> to read the contents of the <code>File</code> object</li>
<li>Then, we send it to the backend. To do that, we construct the path and the payload, and call either the <code>createEntry</code> or the <code>update</code> method, depending on whether the file already exists, passing the path and the payload.</li>
</ul>
<h3>Using the <code>FileReader</code> to read the contents of a <code>File</code> object</h3>
First, we need to read the contents of the file. We do that using a <a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader" target="mdn"><code>FileReader</code></a>, which is also part of the Web File API. To get the contents of a <code>File</code> object, we can call one of the <code>FileReader</code>'s <code>read</code> methods.
<br/>
<br/>The <code>FileReader</code>'s read methods do not return the contents of the file directly: the Web File API is mostly asynchronous. Instead, we have to attach an event handler to the <code>FileReader</code> which can respond to the <code>FileReader</code>'s events. In this case we overrode the <code>FileReader</code>'s <code>onload()</code> method, which gets called when the <code>FileReader</code> is done reading a <code>File</code>. (Instead of the override, we could also have attached a handler with <code>addEventListener()</code>, but it really doesn't matter too much how the handler is attached.)
<br/><br/>Once set up, we can now call a <code>read()</code> method and wait for the reader to call our <code>onload()</code> handler.
<br/>
<br/>So the general structure to read the file is as follows:<pre><code><span style="color:blue">function</span> uploadFile(){
<span style="color:blue">var</span> fileReader = <b><span style="color:blue">new</span> FileReader()</b>;
fileReader.<b>onload</b> = <span style="color:blue">function</span>(event){
<span style="color:blue">var</span> binaryString = <b>event.target.result</b>;
...do something with the file contents...
};
<b>fileReader.readAsBinaryString(fileToUpload)</b>;
}</code></pre>
<br/><br/>
We already mentioned the <code>FileReader</code> provides a number of different <a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader#Methods" target="mdn"><code>read</code> methods</a>, and the chosen method determines the type of the value that will be available in <code>event.target.result</code> by the time the load handler is called. Today, the <code>FileReader</code> provides:<ul>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsArrayBuffer" target="mdn"><code>readAsArrayBuffer()</code></a></li>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsBinaryString" target="mdn"><code>readAsBinaryString()</code></a></li>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsDataURL" target="mdn"><code>readAsDataURL()</code></a></li>
<li><a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsText" target="mdn"><code>readAsText()</code></a></li>
</ul>To figure out which method we should use, we should consider how our backend expects to receive the data. Or rather, how our <code>sap.ui.model.odata.v2.ODataModel</code> wants us to pass the data so it can make the appropriate call to the backend. In a previous section we already explained our struggle to figure out how to represent an <code>Edm.Binary</code> value in the payload, and based on those findings, <code>readAsBinaryString()</code> is the appropriate method. With this read method, the <code>FileReader</code> turns each individual byte of the file contents into a JavaScript character, much like the <code>fromCharCode()</code> method of the <code>String</code> object would do. The resulting value is a JavaScript binary string: each character represents a byte.<br/><br/>
Note that this is very different from what the <code>readAsText()</code> method would do: that would attempt to decode the bytes as UTF-8 encoded characters; in other words, it would result in a character string, not a binary string.
<br/>
<br/>
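The difference is easy to demonstrate with a character outside the ASCII range. Here we assume the Euro sign, whose UTF-8 encoding happens to be three bytes:

```javascript
// The Euro sign "€" is encoded in UTF-8 as the three bytes 0xE2 0x82 0xAC.
var bytes = new Uint8Array([0xE2, 0x82, 0xAC]);

// What readAsBinaryString() would produce: one character per byte.
var binaryString = String.fromCharCode.apply(null, bytes);
binaryString.length; // 3

// What readAsText() would produce: the bytes decoded as UTF-8 text.
var textString = new TextDecoder("utf-8").decode(bytes);
textString;          // "€"
textString.length;   // 1
```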
After obtaining the file contents as binary string, we can apply the Base64 encoding and assign it to the payload:<pre><code> <span style="color:blue">var</span> payload = {
<span style="color:red">"FILE_NAME"</span>: fileToUpload.name,
<span style="color:red">"FILE_TYPE"</span>: fileToUpload.type,
<span style="color:red">"FILE_LAST_MODIFIED"</span>: <span style="color:blue">new</span> Date(fileToUpload.lastModified),
<span style="color:red">"FILE_SIZE"</span>: fileToUpload.size,
<span style="color:red">"FILE_CONTENT"</span>: <b>btoa(binaryString)</b>,
<span style="color:red">"FILE_LAST_UPLOADED"</span>: <span style="color:blue">new</span> Date(Date.now())
};
</code></pre>
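As an aside: <code>readAsBinaryString()</code> is nowadays considered a legacy method. An equivalent approach - sketched here under the assumption that the backend still expects the Base64-encoded bytes, and with <code>arrayBufferToBase64</code> being a helper of our own invention - uses <code>readAsArrayBuffer()</code> and builds the binary string manually:

```javascript
// Alternative sketch: readAsArrayBuffer() instead of the legacy
// readAsBinaryString(). The resulting Base64 string is identical.
function arrayBufferToBase64(arrayBuffer) {
  var bytes = new Uint8Array(arrayBuffer);
  var binaryString = "";
  for (var i = 0; i < bytes.length; i++) {
    // one character per byte, as readAsBinaryString() would produce
    binaryString += String.fromCharCode(bytes[i]);
  }
  return btoa(binaryString);
}

// In the onload handler, event.target.result is then an ArrayBuffer:
//   payload.FILE_CONTENT = arrayBufferToBase64(event.target.result);
// ...and the read is started with:
//   fileReader.readAsArrayBuffer(fileToUpload);
```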
<h2>Summary</h2>
This concludes the final installment of this blog series. In this post we learned how to use:<ul>
<li>the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.unified.FileUploader" target="ui5"><code>sap.ui.unified.FileUploader</code></a> to present a file chooser to the user</li>
<li>the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.unified.FileUploader/events/change" target="ui5"><code>sap.ui.unified.FileUploader</code>'s <code>change</code> event</a> to get a hold of the <a href="https://developer.mozilla.org/en-US/docs/Web/API/File" target="mdn"><code>File</code></a> object representing the user's selection.</li>
<li>the <a href="https://developer.mozilla.org/en-US/docs/Web/API/FileReader" target="mdn"><code>FileReader</code></a> to read the contents of a <code>File</code> object</li>
<li>the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel/methods/read" target="ui5"><code>sap.ui.model.odata.v2.ODataModel</code>'s <code>read()</code> method</a> to specify a query using a <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.Filter" target="ui5"><code>sap.ui.model.Filter</code></a> to check whether an item already exists in the backend.</li>
<li>the <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel/methods/createEntry" target="ui5"><code>createEntry()</code></a> and <a href="https://sapui5.hana.ondemand.com/#/api/sap.ui.model.odata.v2.ODataModel/methods/update" target="ui5"><code>update()</code></a> methods of the <code>sap.ui.model.odata.v2.ODataModel</code> to create or update an entry in the backend.</li>
<li>the <a href="https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/btoa" target="mdn"><code>btoa()</code></a> function to create payloads for OData properties of the <code>Edm.Binary</code> type</li>
</ul>
And by putting these elements together we created a File upload feature for a UI5 application, backed by a HANA OData service.
<h3>Odds and Ends</h3>
It's nice that we finally found out how to write the file contents to our OData Service. But something does not feel quite right. Although we happened to find a way to write the binary data that satisfies both the <code>sap.ui.model.odata.v2.ODataModel</code> and the SAP HANA <code>.xsodata</code> service that backs it, we still haven't found any official documentation, either in the OData specification or from SAP, that confirms this is really the correct way. We would hope that SAP HANA's <code>.xsodata</code> implementation is a faithful implementation of the standard, but for the <code>Edm.Binary</code> type, I'm just not 100% sure. If anybody could chime in and confirm this, and preferably point me to something in the OData specification that confirms it, I would be most grateful.
<h1>Building a UI5 Demo for SAP HANA Text Analysis: Part 3</h1>
We now continue our series on building a simple web application for exploring SAP HANA Text Analysis features. As a reminder, here are the links to the other installments in the series:
<ul>
<li><a href="http://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text.html">Part 1 - an Overview: SAP HANA Text Analysis on Documents uploaded by an end-user</a></li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html">Part 2 - Hands on: Building the backend for a SAP HANA Text Analysis application</a></li>
<li>Part 3 - Presenting: A UI5 front-end to upload documents and explore SAP HANA Text Analytics features</li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_2.html">Part 4 - Deep dive: How to upload documents with OData in a UI5 Application</a></li>
</ul>
In the <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html">previous blog post</a>, we built a backend for our SAP HANA Text Analysis application.
In this blog post I present a simple web application which lets end-users upload documents to the backend and inspect the SAP HANA Text analysis results.
<br/>
<br/>
The application is a very simple UI5 application. The code, along with the back-end code, is available on github, and instructions to install this application on your own SAP HANA system are provided as well.
<h2>The UI5 Application: Functional Overview</h2>
Here's an overview of the UI5 demo application:
<br/>
<br/>
<img src="https://drive.google.com/uc?id=11_w_i0Q89XhYFM4NQCdWGpsFCfB3AxWr&authuser=0&export=download"/>
<br/>
<br/>
The application features a single page, which is split vertically.
On the left hand side of the splitter is the list of uploaded files, and it shows all rows from the <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-CT_FILE-table"><code>CT_FILE</code></a> database table.
On the right hand side of the splitter is the list of text analysis results, and this shows rows from the <code>$TA_</code> database table.
<br/>
<br/>
In the screenshot, only the <code>FILE_NAME</code> column is visible, but you can reveal the other columns by choosing them from the column header menu, which pops up when you right-click a column header:
<br/>
<br/>
<img src="https://drive.google.com/uc?id=1m0mH5OQZv3yCcwn2M4u-XQj6BPeIyCKG&authuser=0&export=download"/>
<br/>
<br/>
Since we haven't uploaded any files yet, both lists are currently empty. So, let's upload a file to see it in action!
To upload a file, hit the button on the top left side of the application toolbar (1):
<br/>
<br/>
<img src="https://drive.google.com/uc?id=1FC6_7rzeRgDH8qJamjDqrTfwQDkTP-5G&authuser=0&export=download"/>
<br/>
<br/>
After clicking the "Upload File for Text Analysis" toolbar button, a dialog appears that lets you browse files so you can upload them.
Hit the "Browse File..." button in the dialog to open a File explorer (2). Use the file explorer to choose a file (3).
Note that this demo project's github repository provides a number of sample files in the <code>sample-docs</code> folder.
<br/>
<br/>
After choosing a file in the File explorer, the file name appears in the dialog:
<br/>
<br/>
<img src="https://drive.google.com/uc?id=1q1uD5n3PYhdZLYtAmsCHnBbtN4tV4yqj&authuser=0&export=download"/>
<br/>
<br/>
To actually upload the chosen file, confirm the dialog by clicking the "Upload" button at the bottom of the dialog.
The file will then appear in the file list left of the splitter, and is then selected.
<br/>
<br/>
Whenever the selection in the file list changes, the text analysis results in the list on the right of the splitter are updated to match the selected item.
As we <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#hana_fulltext_asynchronous">mentioned in the previous post</a>, collection of text analysis results is <code>ASYNCHRONOUS</code>, so after uploading a new file, there is a possibility that the text analysis results have not yet arrived. Unfortunately, there is not much that can be done about that at this point.
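If the delay turns out to be a problem in practice, one pragmatic (if crude) workaround is to simply re-query the results a few times after the upload. The sketch below assumes you have a handle on the list binding of the results table; <code>refreshWithBackoff</code> is a hypothetical helper, not part of the demo application:

```javascript
// Hypothetical workaround: refresh the results binding a few times with
// increasing delays, since no event signals completion of the analysis.
function refreshWithBackoff(listBinding, attempts, delayMs) {
  if (attempts <= 0) {
    return;
  }
  setTimeout(function() {
    listBinding.refresh();  // re-query the $TA_ results from the backend
    refreshWithBackoff(listBinding, attempts - 1, delayMs * 2);
  }, delayMs);
}

// e.g.: refreshWithBackoff(taTable.getBinding("rows"), 5, 1000);
```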
<br/>
<br/>
<img src="https://drive.google.com/uc?id=1wJlPq9osrJZQQyA2Aw2St_JjBn8ReH51&authuser=0&export=download"/>
<br/>
<br/>
You can now browse, filter, and sort the list of analysis results to explore the results of the text analysis.
Obviously, by itself this is not very useful, but the point of this app is to make it very easy to inspect the actual raw text analysis results.
Hopefully, it will give you some ideas on how you could use this type of information to build actual real world applications.
<br/>
<br/>
Once you're done with a particular file, you can also remove it using this application: in the File list, simply hit the trashbin icon to remove that particular file.
A dialog will appear where you need to confirm the deletion of that file. When you confirm the dialog, the file will be deleted from the <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-CT_FILE-table"><code>CT_FILE</code></a> table.
Note that any corresponding analysis results from the <code>$TA_</code> table will not be removed by this demo application, unless you manually added a foreign key constraint on the <code>$TA_</code> table that cascades the deletes from the <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html#ta-CT_FILE-table"><code>CT_FILE</code></a> table.
<h2>Installing this application on your own HANA System</h2>
Front-end and back-end code for this application is <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo" target="github">available on github</a> and licensed as open source software under the terms and conditions of <a href="https://www.apache.org/licenses/LICENSE-2.0" target="apache">the Apache 2.0 software license</a>.
The remainder of this post provides the installation instructions.
<h3>Obtaining the source and placing it in a destination package on your HANA system</h3>
<ul>
<li>Create a package with your favorite IDE for SAP HANA (Web IDE, SAP HANA Studio, Eclipse with SAP HANA Developer Tools)</li>
<li><a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo/archive/master.zip">Download</a> an archive of <a href="https://github.com/just-bi/hana-ui5-text-analysis-upload-demo" target="github">the github repository</a></li>
<li>Unzip the archive and transfer its contents to the HANA package you just created.</li>
</ul>
<h3>Updating Package and Schema names</h3>
<ul>
<li>With <code>db/CT_FILE.hdbdd</code>:
<ul>
<li>in the <code>namespace</code> declaration, update the package identifier from <code>"system-local"."public"."rbouman"."ta"</code> to the name of the package you just created.</li>
<li>modify the <code>@Schema</code> from <code>'RBOUMAN'</code> to whatever schema you want to use. (Create a schema yourself if you don't already have one)</li>
<li>Activate <code>db/CT_FILE.hdbdd</code>. In the database catalog, you should now have this table. HANA should have created a <a href="https://help.sap.com/viewer/fedd7e90a382415cbdd273891651ab4d/1.0.12/en-US/e580220fc1014045ab9f45ea9f82d8d8.html" rel="nofollow">corresponding <code>$TA_</code> table</a> as well.</li>
</ul>
</li>
<li>With <code>service/ta.xsodata</code>:
<ul>
<li>In the first entity definition, update the table repository object identifier <code>"system-local.public.rbouman.ta.db::CT_FILE"</code> so it matches the location of the table on your system.</li>
<li>In the second entity definition, update the catalog table identifier from <code>"RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"</code> so it matches the database schema and catalog table name on your system.</li>
<li>Activate <code>service/ta.xsodata</code>.</li>
</ul>
</li>
</ul>
<h3>Activation</h3>
You can now activate the package you created to activate all remaining objects, such as the <code>.xsapp</code> and <code>.xsaccess</code> files, as well as the <code>web</code> subpackage and all its contents.
<h3>Running the application</h3>
After installation, you should be able to open the web application. You can do this by navigating to:
<pre><code>http://yourhanahost:yourxsport/path/to/your/package/web/index.html</code></pre>
where:
<ul>
<li><code>yourhanahost</code> is the hostname or IP address of your SAP HANA system</li>
<li><code>yourxsport</code> is the <a href="https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/1.0.12/en-US/116cc3f3f3f645159ee138c3ba50a48b.html" rel="nofollow">port where your HANA's xs engine is running</a>. Typically this is 80 followed by your HANA instance number.</li>
<li><code>path/to/your/package</code> is the name of the package where you installed the app, but using slashes (/) instead of dots (.) as the separator character.</li>
</ul>
<h2>Summary</h2>
In this blog post we finally got to use <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html">the backend we built previously</a> by installing and running the UI5 App.
You can use the app to explore the SAP HANA Text Analysis results and to experiment with different document formats.
<br/>
<br/>
If you're also interested in how the actual upload process works and how it is implemented in the UI5 app, then you can read all about it in the next and final installment of this series.
<h1>Building a UI5 Demo for SAP HANA Text Analysis: Part 2</h1>
In the <a href="http://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text.html">previous blog post</a>, I explained some of the prerequisites for building a SAP HANA Text Analysis application, and gave some thought to how to expose these features to an end-user facing web application. In this blog post, these considerations are put to practice and a basic but functional backend is created to support such an application.
<br/>
<br/>
<ul>
<li><a href="http://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text.html">Part 1 - an Overview: SAP HANA Text Analysis on Documents uploaded by an end-user</a></li>
<li>Part 2 - Hands on: Building the backend for a SAP HANA Text Analysis application</li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_28.html">Part 3 - Presenting: A UI5 front-end to upload documents and explore SAP HANA Text Analytics features</a></li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_2.html">Part 4 - Deep dive: How to upload documents with OData in a UI5 Application</a></li>
</ul>
<h2>Building the HANA Text Analysis Backend</h2>
We'll quickly go through a setup so you can try this out yourself. This assumes you have access to a HANA System and development tools (either the <a href="https://developers.sap.com/topics/sap-webide.html" target="sap">SAP Web IDE</a>, <a href="https://help.sap.com/viewer/52715f71adba4aaeb480d946c742d1f6/1.0.12/en-US/ade083aeded84e289d1710a1cf131499.html" target="sap">HANA Studio</a>, or <a href="https://www.eclipse.org/downloads/" target="eclipse">Eclipse IDE</a> with <a href="https://tools.hana.ondemand.com/#hanatools" target="sap">SAP HANA Development tools</a>, or whatever - it doesn't really matter).
<h3>Package Structure</h3>
There is a lot that could be said about the proper package structure for HANA applications, but we won't go there now. It's enough to have just enough structure to keep the different responsibilities of the application apart.
We settled on a simple base package called <code>ta</code>, which acts as the root package of the demo project, with 3 subpackages:<ul>
<li><code>db</code> for anything related to the physical database structure</li>
<li><code>service</code> for the OData service, that exposes the database to our web-application</li>
<li><code>web</code> for the HTML5/UI5 application - i.e. the stuff that is served from HANA to run inside the client's web-browser</li>
</ul>
Apart from these 3 subpackages, the <code>ta</code> package also contains these 2 files, which are necessary to expose it as a web application:<ul>
<li><code>.xsapp</code> - an empty file to make the contents of the package <a href="https://help.sap.com/viewer/400066065a1b46cf91df0ab436404ddc/1.0.12/en-US/fac9ec6995a0426c840f85ae5a8f6930.html" target="sap">available via the XS webserver</a>.</li>
<li><code>.xsaccess</code> - A configuration file <a href="https://help.sap.com/viewer/400066065a1b46cf91df0ab436404ddc/1.0.12/en-US/804d4967affd4a43b6a109e6f3987b21.html" target="sap">for managing access and authorizations</a> for the web application.</li>
</ul>
Note that with this setup, all of our subpackages are exposed, whereas only the <code>service</code> and <code>web</code> subpackages actually need to be exposed. An actual, serious application would only expose whatever is minimally required, and would not expose any packages related to the physical database structure.
<h3 id="ta-CT_FILE-table">The <code>CT_FILE</code> table</h3>
We created a HANA table to hold our uploaded files by creating a file called <code>CT_FILE.hdbdd</code> in the <code>db</code> package. This allows you to maintain the table definition as a repository object, which makes it transportable.
<br/>
<br/>
The <code>CT_FILE.hdbdd</code> file has the following contents:<pre><code>namespace "system-local"."public"."rbouman"."ta"."db";
@Schema: 'RBOUMAN'
@Catalog.tableType: #COLUMN
Entity CT_FILE {
Key "FILE_NAME" : String(256) not null;
"FILE_TYPE" : String(256) not null;
"FILE_LAST_MODIFIED" : UTCTimestamp not null;
"FILE_SIZE" : Integer not null;
<b>"FILE_CONTENT" : LargeBinary not null;</b>
"FILE_LAST_UPLOADED" : UTCTimestamp not null ;
}
technical configuration {
<b>FULLTEXT INDEX "FT_IDX_CT_FILE" ON ("FILE_CONTENT")
ASYNCHRONOUS
LANGUAGE DETECTION ('en')
MIME TYPE COLUMN "FILE_TYPE"
FUZZY SEARCH INDEX off
PHRASE INDEX RATIO 0.721
SEARCH ONLY OFF
FAST PREPROCESS OFF
TEXT ANALYSIS ON
CONFIGURATION 'GRAMMATICAL_ROLE_ANALYSIS';</b>
};</code></pre>
The important feature here is the definition of the <code>FILE_CONTENT</code> column as a <code>LargeBinary</code>, and the <code>FULLTEXT INDEX</code> definition on that column.
The particular syntax to define the fulltext index in a <code>.hdbdd</code> table definition is described in the <a href="https://help.sap.com/viewer/09b6623836854766b682356393c6c416/1.0.12/en-US/ad036c56b5e545ae8b31ece0ab95379f.html#loioad036c56b5e545ae8b31ece0ab95379f__subsection_kmv_f5r_qt" target="sap">SAP HANA Core Data Services (CDS) Reference</a>, whereas the actual options that are applicable to <code>FULLTEXT INDEX</code>es are described in the <a href="https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20d4117e75191014ba5aaab91b3f087d.html" target="sap">SAP HANA SQL and System Views Reference</a>. Finally, guidance on the meaning and functionality of the text analysis configurations is provided in the <a href="https://help.sap.com/viewer/fedd7e90a382415cbdd273891651ab4d/1.0.12/en-US/31b772b1530349a5bf32ec345f5a0080.html" target="sap">SAP HANA Text Analysis Developer Guide</a>.
<br/>
<br/>
The short of it is that with this configuration, the (binary) content of documents stored in the <code>FILE_CONTENT</code> column will be analyzed automatically. The results of the analysis will be stored in a separate <code>$TA_</code> table called <code>$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE</code>. This table is created and maintained by the HANA system. The structure of this <code>$TA_</code> table is described <a href="https://help.sap.com/viewer/fedd7e90a382415cbdd273891651ab4d/1.0.12/en-US/e580220fc1014045ab9f45ea9f82d8d8.html" target="sap">here</a>.
<h4 id="hana_fulltext_asynchronous"><code>ASYNCHRONOUS</code></h4>
I just mentioned that the analysis results will be stored in the <code>$TA_</code> table automatically. While this is true, the analysis does not occur immediately. This is because the <code>FULLTEXT INDEX</code> is created with the <code>ASYNCHRONOUS</code> option. This allows HANA to store documents in the <code>CT_FILE</code> table without having to wait for the text analysis process to finish.
<br/>
<br/>
We could debate the advantages and drawbacks of the <code>ASYNCHRONOUS</code> option and whether it would make more sense to specify <code>SYNCHRONOUS</code> instead, or to leave the option out altogether (in which case <code>SYNCHRONOUS</code> would be implied). However, there is a very simple reason why it is currently specified as <code>ASYNCHRONOUS</code>: if a <code>FULLTEXT INDEX</code> specifies a <code>CONFIGURATION</code> option, then it must be specified as <code>ASYNCHRONOUS</code>, or else the following error occurs upon activation:
<pre><code>CONFIGURATION not supported with synchronous processing</code></pre>
For actual analysis, we really do need the <code>CONFIGURATION</code> option, as it offers all the truly interesting properties of text analysis. So, it seems there's just no way around it - text analysis results are collected in the background, and finish at some point after our document is stored in the table. And there seems to be no way of finding out whether the analysis is finished, or even if it is still busy. For instance, this makes it impossible to determine whether a recently uploaded document is still being analyzed, or whether the document was not eligible for text analysis at all: in both cases, the analysis results will remain absent.
<br/>
<br/>
That said, even though the <code>FULLTEXT INDEX</code> is specified as <code>ASYNCHRONOUS</code>, HANA will let you specify when the analysis results should be updated. At least, according to the <a href="https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20d4117e75191014ba5aaab91b3f087d.html" target="sap">SAP HANA SQL and System Views Reference</a>, it is possible to specify a <code>FLUSH [QUEUE] <flush_queue_elem></code>-clause right after the <code>ASYNCHRONOUS</code> option, with <code><flush_queue_elem></code> indicating either a time interval (expressed as a number of minutes) or a number of documents. So, in theory, it would be possible to write:<pre><code>ASYNCHRONOUS FLUSH QUEUE AFTER 1 DOCUMENTS</code></pre> which would indicate that the analysis results should be updated as soon as a new document has been loaded.
<br/>
<br/>
Unfortunately, on the HANA system I have access to, this results in the following error upon activation:<pre><code>Flush based on documents/minutes not yet supported</code></pre>
The same error message occurs when I try <code>ASYNCHRONOUS FLUSH QUEUE EVERY 1 MINUTES</code> instead.
<br/>
<br/>
So, it looks like we'll just have to live with this for now. I did some checks and I noticed that analysis kicks in after a couple of seconds, but this is on a system that is not very heavily used. So for the purpose of exploration it's not too bad, but this does seem like it could become a problem for real-time applications (like chatbots).
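Since there is no status flag, about the best an application can do is poll the analysis results, for instance by requesting the <code>$count</code> of matching <code>$TA_</code> rows through the OData service described in the next section. A minimal sketch of such a polling URL (the helper name is made up, and keep in mind that a count that stays at zero may also mean the document was not analyzable at all):

```javascript
// Hypothetical helper: build an OData $count URL that returns the number
// of text analysis results currently available for one uploaded file.
// An application could poll this until the count becomes non-zero.
function taCountUrl(serviceRoot, fileName) {
  // OData string literals escape embedded single quotes by doubling them
  var literal = "'" + String(fileName).replace(/'/g, "''") + "'";
  return serviceRoot + "/TextAnalysis/$count?$filter=" +
         encodeURIComponent("FILE_NAME eq " + literal);
}
```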
<h4>The Key</h4>
Another thing worth mentioning here is the key of the <code>CT_FILE</code> table. For this very simple demo application, we chose to make only the <code>FILE_NAME</code> column the primary key of the table. The choice of key will depend on what kind of application you're building. In many practical cases you might not care about the physical name of the uploaded file at all, and a name given by the uploader might be a better choice. Or maybe you don't care about names at all, only about whether the content of the document may be considered unique, in which case some hash of the file contents may be a suitable choice.
<br/>
<br/>
No matter what key you choose for the <code>FULLTEXT INDEX</code>ed table, the column definitions that make up the key are copied to the corresponding <code>$TA_</code> table in order to maintain the relationship between the analysis results and the original source document, effectively using those columns as a foreign key. Note however that HANA does not automatically create a <code>FOREIGN KEY</code> constraint to enforce referential integrity. But you may add such a constraint yourself. This may be useful in particular to cascade deletes on the document table to the text analysis results table.
(Adding such a constraint manually is suggested in the introduction of the <a href="https://help.sap.com/viewer/fedd7e90a382415cbdd273891651ab4d/1.0.12/en-US/31b772b1530349a5bf32ec345f5a0080.html" target="sap">SAP HANA Text Analysis Developer Guide</a>.)
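A sketch of what such a constraint could look like is shown below. The schema name is an assumption (it follows the <code>RBOUMAN</code> schema used elsewhere in this post), and the exact DDL may vary per HANA version:

```sql
-- Hypothetical: cascade document deletes to the analysis results table.
-- Schema name and quoting follow the examples elsewhere in this post.
ALTER TABLE "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
ADD FOREIGN KEY ("FILE_NAME")
REFERENCES "RBOUMAN"."system-local.public.rbouman.ta.db::CT_FILE" ("FILE_NAME")
ON DELETE CASCADE;
```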
<br/>
<br/>
The primary key of the <code>$TA_</code> table consists of the key columns from the document table, plus two additional columns that identify an individual analysis result: <code>TA_RULE</code> and <code>TA_COUNTER</code>, where <code>TA_RULE</code> is the type of analysis that yielded the result, and <code>TA_COUNTER</code> is an integer that identifies, in order, each analysis result within a document and within a particular analysis type.
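To inspect the results for one document in that order, one might run a query like the following against the <code>$TA_</code> table (the schema and file name are illustrative):

```sql
-- Illustrative: list analysis results for one document, in analysis order.
SELECT "TA_RULE", "TA_COUNTER", "TA_TOKEN", "TA_TYPE", "TA_NORMALIZED"
FROM "RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"
WHERE "FILE_NAME" = 'example.pdf'
ORDER BY "TA_RULE", "TA_COUNTER";
```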
<h3 id="ta-xsodata-service-definition">The <code>.xsodata</code> Service Definition</h3>
We expose both the <code>CT_FILE</code> and the <code>$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE</code> tables via an OData service. The OData service is created by creating a <a href="https://help.sap.com/viewer/4505d0bdaf4948449b7f7379d24d0f0d/1.0.12/en-US/57920551c8ed4dea996c895ea05c6843.html" target="sap"><code>.xsodata</code> service definition file</a> called <code>ta.xsodata</code> in the <code>service</code> subpackage.
<br/>
<br/>
The contents of <code>ta.xsodata</code> service definition file are shown below:<pre><code>service {
entity <b>"system-local.public.rbouman.ta.db::CT_FILE"</b> as <b>"Files"</b>;
entity <b>"RBOUMAN"."$TA_system-local.public.rbouman.ta.db::CT_FILE.FT_IDX_CT_FILE"</b> as <b>"TextAnalysis"</b>;
}
annotations {
enable OData4SAP;
}
settings {
support null;
}</code></pre>
This creates an OData service and maps our two tables <code>CT_FILE</code> and the <code>$TA_</code> table to the OData EntitySets <code>Files</code> and <code>TextAnalysis</code> respectively. The OData service will be available at a url of which the path corresponds to the fully qualified package name of the <code>.xsodata</code> file.
<br/>
<br/>
Note that the syntax for mapping the tables to EntitySets is slightly different, depending upon whether the table is created as a repository object or as a database catalog object:<ul>
<li>for <code>CT_FILE</code>, it is the package name containing the table's <code>.hdbdd</code> file, followed by 2 colons, and then followed by the local table name.</li>
<li>for the <code>$TA_</code> table, it is the (quoted) database schema name, followed by a dot, and then followed by the quoted table name.</li>
</ul>
The reason for the difference is that we only maintain the <code>CT_FILE</code> table as a repository object. There is no corresponding repository object for the <code>$TA_</code> table, since HANA creates that autonomously as a result of the full text index on <code>CT_FILE</code>. Since the <code>$TA_</code> table is created automatically, we can assume the entire thing is transportable as it is, as long as we make sure we maintain our document table as a repository object, and refer to it in the <code>.xsodata</code> file using its repository object name.
<h3>Activation and Verification</h3>
Now that we have all these artifacts, we should try and activate our package and test the service. You can either attempt to activate the entire package, or activate each file individually. For the latter, you need to make sure to activate the <code>.hdbdd</code> file before attempting to activate the <code>.xsodata</code> file, because the <code>.xsodata</code> file is dependent upon the existence of the tables in the database catalog.
<br/>
<br/>
After successful activation, you can attempt to visit the service by navigating to its service document or metadata document using your web browser. These documents should be available at the following urls:<ul>
<li>Service Document: <code>http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata</code></li>
<li>Metadata Document: <code>http://yourhanahost:yourxsport/path/to/your/package/service/ta.xsodata/$metadata</code></li>
</ul>
where: <ul>
<li><code>yourhanahost</code> is the hostname or IP address of your SAP HANA system</li>
<li><code>yourxsport</code> is the port number of your xsengine. This is normally 80 followed by the two-digit HANA instance number. For example, if the instance number is 10, the port will be 8010</li>
<li><code>path/to/your/package</code> is the path you get when you take the package identifier where you put the <code>db</code>, <code>service</code> and <code>web</code> subpackages and replace the dot (.) that separates the individual package names with a slash (/).</li>
</ul>
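The URL rules above can be captured in a small helper. This is merely a sketch with a made-up function name, implementing the port and path conventions just described:

```javascript
// Hypothetical helper implementing the URL rules listed above:
// port = "80" + two-digit instance number, package dots become slashes.
function serviceDocumentUrl(host, instanceNr, packageId, serviceFile) {
  var port = '80' + String(instanceNr).padStart(2, '0');
  return 'http://' + host + ':' + port + '/' +
         packageId.split('.').join('/') + '/' + serviceFile;
}
```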
<h3 id="ta-xsodata-metadata">Inspect the <code>$metadata</code> document of the service</h3>
If all is well, your <code>$metadata</code> document should look something like this:<pre><code>
<edmx:Edmx
xmlns:edmx=<span style="color:red">"http://schemas.microsoft.com/ado/2007/06/edmx"</span>
xmlns:sap=<span style="color:red">"http://www.sap.com/Protocols/SAPData"</span>
Version=<span style="color:red">"1.0"</span>
>
<edmx:DataServices
xmlns:m=<span style="color:red">"http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"</span>
m:DataServiceVersion=<span style="color:red">"2.0"</span>
>
<Schema
xmlns:d=<span style="color:red">"http://schemas.microsoft.com/ado/2007/08/dataservices"</span>
xmlns:m=<span style="color:red">"http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"</span>
xmlns=<span style="color:red">"http://schemas.microsoft.com/ado/2008/09/edm"</span>
Namespace=<span style="color:red">"system-local.public.rbouman.ta.service.ta"</span>
>
<EntityType Name=<span style="color:red">"FilesType"</span>>
<Key>
<PropertyRef Name=<span style="color:red">"FILE_NAME"</span>/>
</Key>
<Property Name=<span style="color:red">"FILE_NAME"</span> Type=<span style="color:red">"Edm.String"</span> Nullable=<span style="color:red">"false"</span> MaxLength=<span style="color:red">"256"</span>/>
<Property Name=<span style="color:red">"FILE_TYPE"</span> Type=<span style="color:red">"Edm.String"</span> Nullable=<span style="color:red">"false"</span> MaxLength=<span style="color:red">"256"</span>/>
<Property Name=<span style="color:red">"FILE_LAST_MODIFIED"</span> Type=<span style="color:red">"Edm.DateTime"</span> Nullable=<span style="color:red">"false"</span>/>
<Property Name=<span style="color:red">"FILE_SIZE"</span> Type=<span style="color:red">"Edm.Int32"</span> Nullable=<span style="color:red">"false"</span>/>
<Property Name=<span style="color:red">"FILE_CONTENT"</span> Type=<span style="color:red">"Edm.Binary"</span> Nullable=<span style="color:red">"false"</span>/>
<Property Name=<span style="color:red">"FILE_LAST_UPLOADED"</span> Type=<span style="color:red">"Edm.DateTime"</span> Nullable=<span style="color:red">"false"</span>/>
</EntityType>
<EntityType Name=<span style="color:red">"TextAnalysisType"</span>>
<Key>
<PropertyRef Name=<span style="color:red">"FILE_NAME"</span>/>
<PropertyRef Name=<span style="color:red">"TA_RULE"</span>/>
<PropertyRef Name=<span style="color:red">"TA_COUNTER"</span>/>
</Key>
<Property Name=<span style="color:red">"FILE_NAME"</span> Type=<span style="color:red">"Edm.String"</span> Nullable=<span style="color:red">"false"</span> MaxLength=<span style="color:red">"256"</span>/>
<Property Name=<span style="color:red">"TA_RULE"</span> Type=<span style="color:red">"Edm.String"</span> Nullable=<span style="color:red">"false"</span> MaxLength=<span style="color:red">"200"</span>/>
<Property Name=<span style="color:red">"TA_COUNTER"</span> Type=<span style="color:red">"Edm.Int64"</span> Nullable=<span style="color:red">"false"</span>/>
<Property Name=<span style="color:red">"TA_TOKEN"</span> Type=<span style="color:red">"Edm.String"</span> MaxLength=<span style="color:red">"5000"</span>/>
<Property Name=<span style="color:red">"TA_LANGUAGE"</span> Type=<span style="color:red">"Edm.String"</span> MaxLength=<span style="color:red">"2"</span>/>
<Property Name=<span style="color:red">"TA_TYPE"</span> Type=<span style="color:red">"Edm.String"</span> MaxLength=<span style="color:red">"100"</span>/>
<Property Name=<span style="color:red">"TA_NORMALIZED"</span> Type=<span style="color:red">"Edm.String"</span> MaxLength=<span style="color:red">"5000"</span>/>
<Property Name=<span style="color:red">"TA_STEM"</span> Type=<span style="color:red">"Edm.String"</span> MaxLength=<span style="color:red">"5000"</span>/>
<Property Name=<span style="color:red">"TA_PARAGRAPH"</span> Type=<span style="color:red">"Edm.Int32"</span>/>
<Property Name=<span style="color:red">"TA_SENTENCE"</span> Type=<span style="color:red">"Edm.Int32"</span>/>
<Property Name=<span style="color:red">"TA_CREATED_AT"</span> Type=<span style="color:red">"Edm.DateTime"</span>/>
<Property Name=<span style="color:red">"TA_OFFSET"</span> Type=<span style="color:red">"Edm.Int64"</span>/>
<Property Name=<span style="color:red">"TA_PARENT"</span> Type=<span style="color:red">"Edm.Int64"</span>/>
</EntityType>
<EntityContainer
Name=<span style="color:red">"ta"</span>
m:IsDefaultEntityContainer=<span style="color:red">"true"</span>
>
<EntitySet Name=<span style="color:red">"Files"</span> EntityType=<span style="color:red">"system-local.public.rbouman.ta.service.ta.FilesType"</span>/>
<EntitySet Name=<span style="color:red">"TextAnalysis"</span> EntityType=<span style="color:red">"system-local.public.rbouman.ta.service.ta.TextAnalysisType"</span>/>
</EntityContainer>
</Schema>
</edmx:DataServices>
</edmx:Edmx></code></pre>
<h2>Summary</h2>
In this installment, we executed the plan formulated in <a href="http://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text.html">part 1 of this series</a>.
We should now have a functional back-end which we may use to support our front-end application.
<br/>
<br/>
In <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_28.html">the next installment</a> we will present a front-end application, and explain how you can obtain it and install it yourself on your own SAP HANA System.
rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-73811528280516447622019-12-01T00:18:00.000+01:002019-12-12T01:08:49.835+01:00Building a UI5 Demo for SAP HANA Text Analysis: Part 1Last week, my <a href="https://www.just-bi.nl/about/#team-a-roles" target="just">Just-BI</a> co-workers Arjen Koot and Mitchell Beekink and I had a bit of a rumble with HANA (1.0) and <a href="https://sapui5.hana.ondemand.com/" target="sap">the UI5 toolkit</a>.
In the process, we made a few observations and found out a few things which we figured might be worth sharing in a couple of blog posts:
<ul>
<li>Part 1 - an Overview: SAP HANA Text Analysis on Documents uploaded by an end-user</li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html">Part 2 - Hands on: Building the backend for a SAP HANA Text Analysis application</a></li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_28.html">Part 3 - Presenting: A UI5 front-end to upload documents and explore SAP HANA Text Analytics features</a></li>
<li><a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_2.html">Part 4 - Deep dive: How to upload documents with OData in a UI5 Application</a></li>
</ul>
(Even though this was all done on HANA 1.0, many of these things should still work on HANA 2.0 as well using XS Classic).
<br/>
<br/>
<h2>Exploring HANA Text Analysis</h2>
The main use case of our concern is <a href="https://help.sap.com/viewer/fedd7e90a382415cbdd273891651ab4d/1.0.12/en-US/31b772b1530349a5bf32ec345f5a0080.html" target="sap">SAP HANA Text Analytics</a>.
SAP HANA's Text Analysis features let you extract tokens and semantics from various sources. Text analysis is not limited to plaintext but is also supported for binary documents in various formats, such as PDF and Microsoft Office documents such as Word documents and Excel workbooks. After analysis, the analysis result may then be used for further processing.
<br/>
<br/>
Business cases that might utilize text analysis features include automated classification of invoices or reimbursement requests, matching CVs from employees or job applicants to vacancies, and detection of plagiarism, to name just a few.<br/>
<br/>
The hard work of converting the documents, and performing the actual text analysis is all handled fully by HANA, which is great. In our specific case, this process includes the conversion of binary documents in PDF format to text.
<h3>What Document Types can HANA handle?</h3>
To find out which types and formats your HANA instance can handle, run a query on <a href="https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.02/en-US/20c865f5751910148cf1f22a1a3a22a1.html" target="sap"><code>"SYS"."M_TEXT_ANALYSIS_MIME_TYPES"</code></a>:
<pre><code>SELECT *
FROM "SYS"."M_TEXT_ANALYSIS_MIME_TYPES";
+---------------------------------------------------------------------------+--------------------------------------------+
|MIME_TYPE_NAME | MIME_TYPE_DESCRIPTION |
+---------------------------------------------------------------------------+--------------------------------------------+
| text/plain | Plain Text |
| text/html | HyperText Markup Language |
| text/xml | Extensible Markup Language |
| application/x-cscompr | SAP compression format |
| application/x-abap-rawstring | ABAP rawstring format |
| application/msword | Microsoft Word |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Microsoft Word |
| application/vnd.ms-powerpoint | Microsoft PowerPoint |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | Microsoft PowerPoint |
| application/vnd.ms-excel | Microsoft Excel |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Microsoft Excel |
| application/rtf | Rich Text Format |
| application/vnd.ms-outlook | Microsoft Outlook e-mail (".msg") messages |
| message/rfc822 | Generic e-mail (".eml") messages |
| application/vnd.oasis.opendocument.text | Open Document Text |
| application/vnd.oasis.opendocument.spreadsheet | Open Document Spreadsheet |
| application/vnd.oasis.opendocument.presentation | Open Document Presentation |
| application/vnd.wordperfect | WordPerfect |
| application/pdf | Portable Document Format |
+---------------------------------------------------------------------------+--------------------------------------------+</code></pre>
<h3>HANA Text Analysis Applications: pre-requisites</h3>
To use the text analysis features, you need to<ul>
<li>Create a database table with a column having the <a href="https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/1.0.12/en-US/20a1569875191014b507cf392724b7eb.html?q=data%20types#loio20a1569875191014b507cf392724b7eb___csql_data_types_1sql_data_types_introduction_lob" target="sap"><code>BLOB</code></a> data type. SAP HANA's text analysis features also work with plaintext, but for our specific use case we are interested in analyzing PDF documents. From the point of view of the application and database storage, these documents are binary files, which is why we end up with a <code>BLOB</code>.</li>
<li>Create a <a target="sap" href="https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/1.0.12/en-US/20d4117e75191014ba5aaab91b3f087d.html"><code>FULLTEXT INDEX</code></a> on that <code>BLOB</code> column, which configures all text analysis features we need, such as tokenization, stemming, and semantic extraction.</li>
</ul>
Once this is in place, we only need to store our documents (binary PDF files) in the <code>BLOB</code> column, and SAP HANA will do the rest (more or less automatically). The text analysis results can then be collected from <a href="https://help.sap.com/viewer/fedd7e90a382415cbdd273891651ab4d/1.0.12/en-US/e580220fc1014045ab9f45ea9f82d8d8.html" target="sap">a <code>$TA</code> table</a> and used for further, application specific processing.
<h2>Uploading Document Content</h2>
Now, as humble a task as it may seem, storing binary document content into a <code>BLOB</code> column is a bit of a challenge.
<br/>
<br/>
Various resources published by SAP (like <a href="https://blogs.sap.com/2014/10/13/text-search-and-text-analysis-with-sap-hana/" target="sap">this one</a>) focus on the text analysis features themselves. They only offer a simple and pragmatic suggestion for loading the data into the table, which relies on a (client-side) Python script that reads the file from the client, and then uses a plain SQL <code>INSERT</code>-statement to upload the file contents to the <code>BLOB</code> column.
<br/>
<br/>
This approach is fine for exploring just the text analysis features of course, but it's not of much use if you want to create an end-user facing application.
What we would like to offer instead is a web application or mobile app (for example, based on UI5), which would allow end users to upload their own documents to the database through an easy-to-use graphical user interface.
<h3>What about a <code>.xsjs</code> script?</h3>
Now, it is entirely possible to come up with a solution that is somewhat similar to the client-side Python script.
For example, we could write an <a href="https://help.sap.com/viewer/52715f71adba4aaeb480d946c742d1f6/1.0.12/en-US/90878018cccd40f7a4b6754c04e2d34a.html" target="sap"><code>.xsjs</code> script</a> that runs on the HANA server.
This script would then handle a HTTP POST request and receive the raw document bytes in the request body.
<br/>
<br/>
Then, the <code>.xsjs</code> script would run a similar kind of SQL statement, and store the received data in the <code>BLOB</code> column.
(An example <code>.xsjs</code> script using SQL can be found <a href="https://help.sap.com/viewer/52715f71adba4aaeb480d946c742d1f6/1.0.12/en-US/0d2aa67a44a94b14ae80dc883a4c6419.html" target="sap">here</a>.)
<h4>Drawbacks of a <code>.xsjs</code> approach</h4>
However, there are a couple of drawbacks to this approach.
<br/>
<br/>
First of all, it is unlikely that a user-facing application would only want to upload the binary content of the document. Most likely, some application-specific metadata will need to be stored as well.
<br/><br/>
Another problem is that we would need to write dedicated code to handle a very specific kind of request: uploading data to one particular table - that is: the unstructured document content itself, plus whatever structured data the application needs to associate with it.
The script will need to refer to the table and its columns by name (or perhaps via a stored procedure that does the actual loading - same difference), and when the name of the table or one of its columns changes, the script must also be changed.
<br/>
<br/>
This same issue bites back when you need to transport such a solution to another HANA system. Since there is no formal dependency between the <code>.xsjs</code> script and the database catalog objects it references, this poses a bit of a challenge for the HANA transport system which is typically used for these kinds of tasks. At least, such a dependency is not registered anywhere and thus needs to be managed manually.
<br/><br/>
Another thing to keep in mind is that if an application can write data somewhere, it generally also needs to read data from that source.
For example, the user may want to search some of the metadata fields to see if a particular file was already uploaded, or to see when a particular file was last uploaded, or to update a previously uploaded file.
<br/>
<br/>
Now, we could certainly write the <code>.xsjs</code> script so that it does those things as well. But the point is, even though we only minimally need a service that allows us to upload the document, this interface is incomplete from the application's point of view, so the actual script will need to implement many more data operations than only the upload. And so, what seemed like a small task (writing a simple script to do one simple thing) becomes the fairly serious task of writing a full-featured service that does a whole bunch of things. And even if writing that would seem okay, it then needs to be maintained and documented and so on.
<br/>
<br/>
Finally, writing a server side script that executes SQL could introduce a security risk.
Even though it might be written in a way that is safe, one actually needs to make sure that it is indeed written safely, and this needs to be implemented for all functionalities that the service offers.
<br/>
<br/>
So, in summary - even though it may seem like a simple task, a realistic, practical solution that is safe, maintainable and fully featured is not so trivial at all. Not with .xsjs anyway! It will require substantial effort to write, document and maintain. The time and effort would be much better spent when dedicated to designing or building actual business features.
<h3>Using OData instead</h3>
HANA also offers <a href="https://help.sap.com/viewer/52715f71adba4aaeb480d946c742d1f6/1.0.12/en-US/7cc43e570b5648d69231fbd7a9c7bf90.html" target="sap">an OData implementation</a>.
HANA OData services are defined in <a href="https://help.sap.com/viewer/52715f71adba4aaeb480d946c742d1f6/1.0.12/en-US/1a8c8a3eaefc4e2aa7ab23195b684b16.html" target="sap"><code>.xsodata</code> service definitions</a> and they pretty much solve all drawbacks mentioned above:<ul>
<li>The WEB API is generic, and works the same for just about any database table-like object you want to expose</li>
<li>The API is complete and well defined, and supports all CRUD data operations you need in practice</li>
<li>HANA registers the dependency between the <code>.xsodata</code> file and the repository or database catalog objects it references. This is essential to create transportable solutions.</li>
<li>Typically, changes in table structure will be automatically picked up by the <code>.xsodata</code> service, which makes maintenance a low-effort task. Writing a <code>.xsjs</code> service that adapts in the same way implies another level of complexity, which is certainly doable but increases development and maintenance effort.</li>
</ul>
So: what's not to like about OData? We'll certainly get back to that topic, especially in <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_2.html">the last part of the series</a>! But for now let's just be happy with the benefits of <code>.xsodata</code> over <code>.xsjs</code>.
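To make this a bit more concrete: creating a new entity through such an OData service boils down to a plain HTTP POST of a JSON document. Below is a minimal sketch only; the entity set and property names are illustrative, a real XS application first needs a GET to obtain an X-CSRF-Token, and binary content must be base64-encoded for an Edm.Binary property. The fetch implementation is injectable so the function can be exercised without a server:

```javascript
// Sketch of an OData create (POST) request. The "Files" entity set name
// and the entity shape are illustrative; fetchImpl defaults to the
// browser's global fetch when not supplied.
function uploadDocument(serviceRoot, entity, fetchImpl) {
  return (fetchImpl || fetch)(serviceRoot + '/Files', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(entity)
  });
}
```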
<h2>Summary</h2>
We learned that we must create a table with a <code>BLOB</code> column to hold our documents, and a <code>FULLTEXT INDEX</code> so HANA knows to analyze the contents. After some consideration, we decided to try whether we can use HANA's OData implementation to upload document content to such a table.
<br/>
<br/>
In <a href="https://rpbouman.blogspot.com/2019/12/building-ui5-demo-for-sap-hana-text_1.html">the next installment</a>, we will explain in more detail how to build these backend objects.rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-10294097945236327762018-03-18T01:10:00.002+01:002018-03-18T01:56:05.639+01:00A Tale of a JavaScript Memory LeakAbstract: Matching JavaScript regular expressions against large input strings with V8 can result in memory leaks. In this post, I explain how to troubleshoot the issue using Google Chrome heap snapshots. Finally, a fix proposed by my son David (age 14) is presented.
<h3>Background</h3>
At <a href="https://www.just-bi.nl/" target="just-bi">Just-BI</a> we developed a browser-based application for one of our customers. One way the application gets its data is by loading and parsing Microsoft Excel files. The app is successful and our customers are happy, a fact they express by attempting to load ever larger files.
<br/>
<br/>
On mobile Safari (iPad), our app starts crashing when the files reach a certain size. This happens around 5 to 6 MB. Arguably, that's not that large, but things are a bit more complicated: Microsoft Excel files, at least those of the .xlsx variety, are actually zip-compressed folders, containing mostly <a href="https://msdn.microsoft.com/en-us/library/dd922181(v=office.12).aspx" target="ms">OOXML spreadsheet</a> documents.
<br/>
<br/>
It is a well known fact that XML is really verbose, and I suppose we should be grateful that our 5 - 6 MB excel files uncompress to only 40 MB of XML.
<br/>
<br/>
We could debug the issue somewhat, and we noticed that by the time Safari crashes, it does so reporting it is out of heap space. We can't be sure if that's the actual cause, but we think we might be able to overcome or at least postpone this issue by somehow cutting down on memory usage.
<br/>
<br/>
This brings us to the main topic of our tale.
<h3>Parsing xlsx files in the Browser</h3>
To parse xlsx files, we use a particular javascript library called <a href="https://github.com/SheetJS/js-xlsx" target="_github">js-xlsx</a>. This is actually a pretty nice piece of work, and I do not hesitate to recommend it. We have used it for quite a while without major issues; it's just that the particular strategy that this library uses to parse the xlsx file will temporarily spike memory usage, and we believe this triggers some bug in Safari, which eventually leads to a crash.
<br/>
<br/>
So, we're currently investigating a less general, less standards-compliant way to parse xlsx files. In return, this allows us to parse xlsx faster, and with a much reduced peak-memory usage.
<br/>
<br/>
I don't want to talk too much in detail about the xlsx parser we're developing. All I can say is that it is not meant to be a general, fully featured xlsx parser; the only requirement it is designed to fulfill is to avoid, or at least postpone, the crash we observe in Safari, and to parse portions of xlsx workbooks having a well-known structure specific to our application requirements.
<br/>
<br/>
I am happy to report that, with just one day of work, we managed to get an initial version of our parser to work. It has a peak-memory usage that is half that of the previous solution. And it sounds even more spectacular when I put it the other way around: the old solution used 100% more memory!
<br/>
<br/>
As an added bonus, the new solution is just a little less than 10x as fast as the old solution. For some reason, it does not sound much cooler when I say the new solution is about an order of magnitude faster, so I won't.
<h3>JavaScript Regular Expressions</h3>
Our parser relies on <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions" target="_mdn">JavaScript regular expressions</a>, which are exposed through the global built-in <code><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp" target="_mdn">RegExp</a></code> object.
<br/>
<br/>
One might be aware that for at least 100 different programming languages there are at least 10,000 stackexchange answers to wittily denounce any attempt to parse XML using regular expressions. Some bloggers' entire careers seem to be built entirely around their particular brand of scorn and disdain about this topic.
<br/>
<br/>
We have little to add to discussions like these, other than that in modern JavaScript runtimes, regular expressions are a very productive and powerful tool for quickly building tokenizers. These tokenizers can have amazing performance, and can serve admirably as a foundation to build parsers of many kinds, including but certainly not limited to XML-parsers.
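As an illustration of the idea (this is not our actual parser, and the token rules are made up for the example), a sticky-flag regular expression can drive a tiny tokenizer that walks an input string token by token:

```javascript
// Minimal regex-driven tokenizer sketch. The sticky (y) flag anchors each
// match at lastIndex, so the regex advances through the input without
// rescanning; when no rule matches, exec() returns null and the loop ends.
function tokenize(input) {
  var rule = /\s+|<\/?[A-Za-z][\w.:-]*|>|"[^"]*"|[^<>\s"]+/y;
  var tokens = [], m;
  while (rule.lastIndex < input.length && (m = rule.exec(input))) {
    if (!/^\s+$/.test(m[0])) tokens.push(m[0]); // drop whitespace tokens
  }
  return tokens;
}
```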
<h3>A Memory Leak</h3>
Despite the initial success, not all is well with our new xlsx parser. We found that, notwithstanding lower peak-memory consumption, it did suffer from a memory leak. We noticed this by creating a simple sample application, loading only our parser, and then making <a href="https://developers.google.com/web/tools/chrome-devtools/memory-problems/heap-snapshots" target="_chrome">heap snapshots using Chrome developer tools</a> during various phases of the process. See the screenshot below:
<br/>
<br/>
<img width="900" height="596" src="https://drive.google.com/uc?export=download&id=1iSDYRxG4YrpF1bScHkcRpCTSnqMJ2mWr"/>
<br/>
<br/>
This is what happens:<ol>
<li>First heap snapshot was made directly after loading the application, and measures 6.5 MB. Whether you think this is a lot or not, this is our baseline, and there is not much we can do about it now.</li>
<li>Next, the user picks an xlsx workbook, and the application opens it. The snapshot is now 12.6 MB, which is an increase of 6.1 MB as compared to our baseline. The workbook file is a little less than 6 MB and accounts for most of the increase. At this point, our sample application has also extracted and parsed the list of worksheets contained within the workbook, as well as the shared string table. I haven't looked at that in detail, but for now I am satisfied to believe that this accounts for the remaining extra memory.</li>
<li>At this point, we extracted the worksheet of interest from the workbook and uncompressed it into a javascript string. This made our heap snapshot increase by almost 34 MB. That is certainly a lot! However, the filesize of the worksheet document itself is 34,445 kB, so it seems everything is accounted for.</li>
<li>The next heap snapshot was taken after parsing the worksheet and building a parse tree. The snapshot weighs 77.3 MB - an increase another 31 MB. Now, the sheet has 32,294 rows with 24 cells of data each, and most of the cells are unique decimal numbers, so it is a decent chunk of data. But even then, it still feels as if this is way too large.<br/><br/>That said, things probably look worse than they really are. Our new parser is event based: the parse method accepts a configuration object that contains only a callback, which is called every time a new row is extracted from the sheet. For our sample application, the callback is only a very naive proof of concept. I suspect there are plenty opportunities to make the parse tree builder smarter and the parse tree smaller.</li>
<li>The last heap snapshot was taken after the parse. At this point, the parse tree, the workbook object, and the XML string went out of scope. But we are still looking at a heap snapshot of more than 40 MB! This is bad news: we really should be back at something close to heap snapshot 1. So, there's about 34 MB unaccounted for.</li>
</ol>
In the screenshot, you can also see what's hogging the memory: in the top right pane, we find our XML document string, which indeed accounts for the retained 34 MB of memory. In the bottom right pane, we can see who's still referencing it: it's some property called <code>parent</code> of <code>sliced string @15298471</code>. And these are referenced twice in some array, which is referenced by something called <code>regexp_last_match_info</code> in the native context.
<h3>Memory Leak, Explained</h3>
Now, what I think we're looking at is the <code><a target="_mdn" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastMatch">lastMatch</a></code> property of the global built-in <code><a target="_mdn" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp">RegExp</a></code>-object.
<br/>
<br/>
If you're not familiar with JavaScript regular expressions, it might be helpful to consider exactly how our parser uses them. We're using code like this:<pre>
function parse(){
  //regular expression to match a <row> start-tag.
  var regexp = /<row\s([^>]+)\s*>/g;
  var match, data;
  var xml = "...this is our huge XML document, loaded into a JavaScript String...";
  match = regexp.exec(xml);
  if (match) {
    //the matched substrings are contained as array elements in match
    data = match[1];
    //...do something with data, etc...
  }
}
</pre>
(Note that this is just an example of the concept - not literally the actual code)
<br/>
<br/>
The <code>parse()</code> function first assigns a literal regular expression to the <code>regexp</code> variable. Under the covers, this literal regular expression results in a call to the built-in global <code>RegExp</code> constructor, instantiating a new <code>RegExp</code> instance. Then, the <code><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec" target="_mdn">exec</a></code>-method of the <code>RegExp</code> instance is called, passing the (huge) XML document string. The <code>exec</code>-method returns an object representing the result: if there was no match, <code>null</code> is returned; if there is a match, an object is returned that contains information about the match(es).
<br/>
<br/>
If there was a match, the <code>match</code> object will look a lot like an overloaded array object, having the matched parts of the string argument as elements. The element at index <code>0</code> of the match object (<code>match[0]</code>) is the substring matching the entire regular expression, the element at index <code>1</code> is the substring that matches the first parenthesized <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-capturing-parentheses" target="_mdn">capturing group</a>, and so on.
<br/>
<br/>
Now, since the <code>match</code> variable is a local variable in our parse function, everything should be garbage collectible after the function ends, right? <br/>
<br/>
Yes. But No.
<h4>About <code>RegExp.lastMatch</code></h4>
As it turns out, when an instance of the <code>RegExp</code> object finds a match, then the corresponding match info object containing all the matching substrings is stored in the <code><a target="_mdn" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastMatch">lastMatch</a></code> property of the global built-in <code><a target="_mdn" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp">RegExp</a></code>-object. So, even if our <code>parse</code>-method is out of scope, the last match made by some regular expression inside it is still dangling around, attached to the global <code>RegExp</code> object in its <code>lastMatch</code> property.
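The retention is easy to observe in any V8-based runtime (Node.js or Chrome). The snippet below is my own minimal illustration - it uses a tiny string instead of a huge document, and reads the non-standard legacy <code>RegExp.lastMatch</code> property directly:

```javascript
function parse(xml) {
  // same pattern shape as in the parser: match a <row> start-tag
  var match = /<row\s([^>]+)\s*>/g.exec(xml);
  return match ? match[1] : null;
}

var attributes = parse('<row r="1" spans="1:3"></row>');
// parse() has returned and its locals are out of scope, yet the global
// RegExp object still holds the match - and, through the matched
// substrings, a reference to the input string they were taken from:
console.log(attributes);       // 'r="1" spans="1:3"'
console.log(RegExp.lastMatch); // '<row r="1" spans="1:3">'
```

With a multi-megabyte document as input, that lingering reference is exactly the 34 MB we saw retained in the heap snapshot.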
<h4>Substrings in V8</h4>
Now, if the <code>lastMatch</code>-object is still around, then the substrings representing the matches are also still around. As it turns out, V8 implements these substrings as "slice" objects. From within the JavaScript environment, they act and behave like <code>String</code> objects, but internally, the V8 JavaScript engine implements them as objects that have a <code>parent</code> property keeping a reference to the original <code>String</code> object from which they are a substring, along with some indexes to indicate what part of the original string makes up the substring.
<br/>
<br/>
Now, if you think about it, this way of implementing substrings is actually pretty clever, since it allows V8 to do many string manipulations very efficiently, minimizing the overhead and memory consumption caused by copying parts of strings to and fro. In my case, it just becomes an atrocious memory hog because of the <code>RegExp</code> object, which has decided to maintain a reference to the last match object (for whatever reason).
<br/>
<br/>
Other people have run into issues due to V8's substring design as well, and a bug was filed here:
<br/>
<br/>
<a href="https://bugs.chromium.org/p/v8/issues/detail?id=2869" target="_v8">https://bugs.chromium.org/p/v8/issues/detail?id=2869</a>
<h3>Solutions</h3>
My oldest son David, 14 years old, came up with a pretty creative solution: what if we'd write our own substring implementation, overriding the native one? If this makes you cringe, just think of 30 MB memory leaks and crashing browsers: it puts things in perspective. If this still sounds crazy to you, you should realize that when he came up with this idea, we had already been looking at this issue together for two hours. And even though we felt the substring issue was related, we still had no way to prove that this was actually the case. His idea was feasible and might confirm our suspicions, so we went ahead and did it, and attached our own implementations of <code><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring" target="_mdn">substring</a></code> and <code><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substr" target="_mdn">substr</a></code> to the <code><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/proto">__proto__</a></code> object of our XML string, to override only these methods of that string instance.
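A per-instance override along those lines might look like the sketch below. Note that this is a reconstruction of the idea, not the code we actually used: it relies on the document being a String <i>object</i> (primitives cannot carry own properties), and it severs the slice's <code>parent</code> reference simply by rebuilding the result character by character:

```javascript
// xml must be a String *object* so we can shadow substring on this one
// instance without touching String.prototype for all other strings:
var xml = new String('<rows><row r="1"/></rows>');

xml.substring = function(start, end) {
  var stop = (end === undefined) ? this.length : end;
  var copy = "";
  // building the result char by char yields a fresh string that keeps
  // no internal "sliced string" reference to the huge original
  for (var i = start; i < stop; i++) {
    copy += this.charAt(i);
  }
  return copy;
};

console.log(xml.substring(7, 10)); // "row"
```

As described below, a hand-rolled loop like this is of course dramatically slower than the native implementation; its only purpose was to test the hypothesis.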
<br/>
<br/>
As was to be expected, our own substring implementations were way slower than the native ones, and the parse took about 25 times longer than before. However, it *did* solve the memory leak. This was a strong indication that we were on the right track.
<br/>
<br/>
Then, David suggested another solution: why don't we simply clear out the <code>lastMatch</code> property of the global built-in <code>RegExp</code> object? We tried to do this directly, simply by assignment:<pre>
RegExp.lastMatch = null;
</pre>
Unfortunately, this does not work. Although it does not throw a runtime exception, the <code>RegExp</code> object is protected against this kind of assignment, and the property never gets overwritten. However, it is still possible to achieve what we want, simply by instantiating a new <code>RegExp</code> object, and then forcing a match against a known, short string. We can then wrap that in a utility function, so we can always call it after doing some serious regular expression matching on large strings:<pre>
function freeRegExp(){
  /\s*/g.exec("");
}
</pre>
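The effect of the utility can be observed directly via the same legacy <code>RegExp.lastMatch</code> property (again, a small self-contained check of my own, not code from the application):

```javascript
function freeRegExp() {
  // force a trivial match so RegExp.lastMatch no longer references
  // whatever huge string was matched before
  /\s*/g.exec("");
}

/<row\s([^>]+)\s*>/g.exec('<row r="1" spans="1:3"></row>');
console.log(RegExp.lastMatch); // '<row r="1" spans="1:3">' - still referenced

freeRegExp();
console.log(RegExp.lastMatch); // '' - the previous input can now be collected
```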
Here's the heap snapshot after applying this fix:
<br/>
<br/>
<img width="900" height="596" src="https://drive.google.com/uc?export=download&id=1yU20ARIYRVq0xnF9Hiq9CXWUlJSb7aIS"/>
<h3>Summary</h3>
<ul>
<li>Globals are bad.</li>
<li>Side-effects are bad, in particular if the modifications are global.</li>
<li>V8's substring implementation may lead to unexpected memory leaks.</li>
<li>Chrome heap snapshots are a powerful tool to troubleshoot them.</li>
<li>After applying regular expressions to huge strings, always force a match against a small string to prevent memory leaks.</li>
<li>David Rocks! He truly impressed me with his troubleshooting skills and his knack for pragmatic, feasible solutions.</li>
</ul>rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com7tag:blogger.com,1999:blog-15319370.post-30454833080688821102017-06-18T14:59:00.000+02:002017-06-18T21:56:21.305+02:00UI5: per-view Internationalization - What about inheritance?I started to learn <a href="https://sapui5.hana.ondemand.com/#docs/guide/95d113be50ae40d5b0b562b84d715227.html" target="_ui5">UI5</a> in earnest in September 2016. Quite soon after that, I wrote two blog posts about per-view internationalization:<ul>
<li><a href="http://rpbouman.blogspot.nl/2016/09/sap-ui5-per-view-internationalization.html" target="_rpbouman">SAP UI5: Per-view Internationalization</a></li>
<li><a href="http://rpbouman.blogspot.nl/2016/09/sap-ui5-internationalization-for-each.html" target="_rpbouman">SAP UI5: Internationalization for each view - Addendum for Nested Views</a></li>
</ul>
<br/>
<br/>
(In case you're wondering: per-view internationalization is the ability to maintain <code>i18n.properties</code> files, which contain translations for human-readable texts, together with the view code wherein these texts appear, rather than maintaining just one giant i18n.properties file that contains the translations for any and all internationalized texts that appear throughout the application. I think the benefits of this idea are clear from a developer's point of view. If you're interested in more information about this concept, then please check out the two prior blog posts.)
<h2>Per-view i18n and inheritance</h2>
The reason for me to revisit the topic is that I found that the method I use to figure out which i18n.properties files to load, doesn't work very well when you're using inheritance. In UI5, inheritance is achieved by calling the <a href="https://sapui5.hana.ondemand.com/#docs/api/symbols/sap.ui.base.Object.html#.extend" target="_ui5"><code>extend</code>-method</a> on the constructor you want to inherit from.
<br/>
<br/>
To understand why it doesn't work well, we should first define the scope and the desired behavior.
<br/>
<br/>
In my case, I'm using inheritance to build controllers for views that are a lot like other existing views, but with some additional features. Most if not all of the behavior (and texts) of the existing view should be copied, and I'm managing that copy by extending the controller of the existing view. It's just that we need some extra things in the view, and these extra things might bring their own internationalization texts along.
<br/>
<br/>
So, what we really need is to load not only the i18n files for the extended controller, but also any i18n files that might be created for the views managed by the superclasses of the extended controller. We need to mind the order too: the extended view might choose to override some of the texts defined by a superclass, so the i18n texts that are closer in the chain of inheritance should be given precedence.
<h2>A Per-view i18n solution that works with inheritance</h2>
The following code will solve this problem:
<pre>
_initI18n: function(){
  //note: i18n is assumed to be a variable holding the model name "i18n",
  //and ResourceModel is sap.ui.model.resource.ResourceModel (see the prior posts)
  //first, check if we already constructed the i18n model for this class
  if (this.constructor._i18Model) {
    //we did! Don't do all that work again, just use the existing one.
    this.setModel(this.constructor._i18Model, i18n);
    return;
  }
  //check if the view and controller are in the same directory.
  //if they are not, then we need to take the possibility into account
  //that the view and the controller might both have their own i18n files.
  var stack = [];
  var controllerClass = this.getMetadata();
  var viewName = this.getView().getViewName();
  if (controllerClass.getName() !== viewName) {
    //if the view name is different from the controller name,
    //then we assume the view may have its own i18n
    //that overrides those of the controller
    stack.push(viewName);
  }
  //walk the chain of inheritance up to sap.ui.core.mvc.Controller,
  //storing each superclass at the front of the stack
  var className, rootControllerClassName = "sap.ui.core.mvc.Controller";
  while (true) {
    className = controllerClass.getName();
    if (className === rootControllerClassName) {
      break;
    }
    stack.unshift(className);
    controllerClass = controllerClass.getParent();
  }
  //walk the stack and create a resource bundle for each class;
  //use it to enhance this class' i18n model.
  stack.forEach(function(className){
    var bundleData;
    if (window[className] && window[className]._i18Model) {
      bundleData = window[className]._i18Model.getResourceBundle();
    }
    else {
      className = className.split(".");
      //snip off the local class name to get the directory name
      className.pop();
      //add i18n to make the i18n directory name
      className.push(i18n);
      //add i18n again to point to the i18n.properties file(s)
      className.push(i18n);
      bundleData = {bundleName: className.join(".")};
    }
    var i18nModel = this.getModel(i18n);
    if (i18nModel) {
      i18nModel.enhance(bundleData);
    }
    else {
      i18nModel = new ResourceModel(bundleData);
      this.setModel(i18nModel, i18n);
    }
  }.bind(this));
  //cache the i18n model for new instances of this class.
  this.constructor._i18Model = this.getModel(i18n);
}
</pre>
Note that this code replaces the <code>_initI18n</code>-method that appeared in my prior blog posts on this topic. It is also assumed that this method sits in some abstract base controller, which you'll extend to create actual concrete controllers for your views.
<br/>
<br/>
Here are a couple of highlights that explain the new and improved <code>_initI18n</code>-method:<ul>
<li>Examining the entire chain of inheritance, up until <code>sap.ui.core.mvc.Controller</code>. This is achieved with this snippet: <pre>
var stack = [];
...
var className, rootControllerClassName = "sap.ui.core.mvc.Controller";
while (true) {
  className = controllerClass.getName();
  if (className === rootControllerClassName) {
    break;
  }
  stack.unshift(className);
  controllerClass = controllerClass.getParent();
}
</pre>
The subsequent <code>forEach</code>-iteration of the stack then constructs the resource bundle to enhance the i18n model in the usual way.
<br/>
<br/>
Note that names of superclasses that are "higher up" in the hierarchy (or, put another way: more basal) are stacked in front of subclasses. This way, the <code>forEach</code>-array method will encounter the class names in the desired order, allowing subclasses to override texts added by superclasses.</li>
<li>Distinguish between texts defined by the view and the controller.
<br/><br/>
Admittedly this scenario is quite rare, but if the controller and view each define their own i18n files, then we'd like to enhance our i18n model with both resource bundles. I somewhat arbitrarily decided that in this case, the view's texts should probably override those of the controller.
<br/><br/>
I achieve this with this piece of code: <pre>
var stack = [];
var controllerClass = this.getMetadata();
var viewName = this.getView().getViewName();
if (controllerClass.getName() !== viewName) {
  //if the view name is different from the controller name,
  //then we assume the view may have its own i18n
  //that overrides those of the controller
  stack.push(viewName);
}
</pre>
In other words, the view name is placed as the very last item on the stack; if it differs from the controller name, then the controller name appears directly before it, making the view's resource bundle the last to enhance our i18n model.
</li>
<li>Caching the i18n model at the class level, so that every instance may reuse it.
<br/>
<br/>
While I fixed the inheritance issue, it occurred to me that all instances of the controller would each go through their own cycle of building the i18n model. Since the i18n model deals almost exclusively with static texts, it seemed wasteful to repeat all that work for each instance. We can simply store the i18n model as a property of the constructor, and retrieve it any time we're creating a new instance.
<br/>
<br/>
This is achieved with the very first and last bits of the <code>_initI18n</code>-method:<pre>
//first, check if we already constructed the i18n model for this class
if (this.constructor._i18Model) {
  //we did! Don't do all that work again, just use the existing one.
  this.setModel(this.constructor._i18Model, i18n);
  return;
}
...
//cache the i18n model for new instances of this class.
this.constructor._i18Model = this.getModel(i18n);
</pre>
</li>
</ul>
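The ordering logic described in the highlights above can be exercised in isolation with plain objects. In the sketch below, <code>buildClassStack</code> mirrors the while-loop from <code>_initI18n</code>, and <code>mockMeta</code> is a hypothetical stand-in for UI5's class metadata (it is not part of the UI5 API; the class names are made up for illustration):

```javascript
// Sketch of the stack-building walk: superclass names end up at the
// front, so a subsequent forEach applies them first and lets subclasses
// override superclass texts.
function buildClassStack(controllerClass, rootControllerClassName) {
  var stack = [];
  while (controllerClass.getName() !== rootControllerClassName) {
    stack.unshift(controllerClass.getName());
    controllerClass = controllerClass.getParent();
  }
  return stack;
}

// mock metadata objects standing in for UI5 class metadata:
function mockMeta(name, parent) {
  return {
    getName: function(){ return name; },
    getParent: function(){ return parent; }
  };
}

var root = mockMeta("sap.ui.core.mvc.Controller", null);
var base = mockMeta("my.app.controller.Base", root);
var sub  = mockMeta("my.app.controller.Orders", base);

console.log(buildClassStack(sub, "sap.ui.core.mvc.Controller"));
// superclass first: ["my.app.controller.Base", "my.app.controller.Orders"]
```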
<h2>Finally</h2>
I hope you enjoyed this post. Let me know and drop a line! rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com1tag:blogger.com,1999:blog-15319370.post-48797927273298037092017-06-17T12:48:00.000+02:002017-06-18T21:55:48.836+02:00Team Just-BI wins 2nd Prize at Dutch Accountability Hack 2017!To whom it may concern,
<br/>
<br/>
One week ago, on Friday the 9th of June 2017, I was at the Dutch House of Representatives ("de Tweede Kamer") to participate in the <a href="https://openstate.eu/en/2017/04/dutch-house-of-representatives-hosts-hackers-for-accountability-hack-2017/" target="_acchack">2017 Accountability Hackathon</a>.
<h2>The Accountability Hack Event</h2>
<img src="https://openstate.eu/wp-content/uploads/sites/14/2017/06/Screen-Shot-2017-06-10-at-00.03.17-260x260.png" style="float: left; margin:1em"/>
The event was organized and sponsored by a number of Dutch ministries, the <a href="http://www.courtofaudit.nl/english" target="rekenkamer">Court of Audit</a> ("Algemene Rekenkamer"), the Central Agency for Statistics ("Centraal Bureau voor de Statistiek") and the <a href="https://openstate.eu/en" target="_openstate">Open State Foundation</a>. The goal of the event was to invite programmers, developers, data analysts, journalists and so on to come together and create applications that use one or more of the numerous <a href="https://data.overheid.nl/" target="opendata">open data sources</a> published by the Dutch government, to gain insights into the performance or spending of Dutch governmental or publicly subsidized organisations.
<br/>
<br/>
This assignment alone might need some clarification. In Dutch democracy, there has always been a push towards more transparency. But in the last decade in particular, there has been an increasing demand to provide this transparency by publishing openly accessible data sets. The idea is that publishing records and metadata contributes to an environment where citizens can themselves answer any question about how their government is functioning, by querying and combining this data. Now, obviously, not everybody is capable of working with raw data sets, so there is also a demand for tools, applications and people with the know-how to bridge the technical gap and truly make all this data available on a functional level.
<br/>
<br/>
This is where events like the Accountability Hackathon come in: it is a direct attempt to stimulate individuals, but also commercial companies to apply their expertise to create applications and tools that provide meaningful information and insights, based on open data.
<h2>Team Just-BI</h2>
<a href="https://www.just-bi.nl" target="justbi"><img src="https://just-bi.nl/wp-content/uploads/2013/03/logo3.png" style="float: right; margin: 1em;"/></a>
I participated on behalf of my company <a href="https://www.just-bi.nl/" target="justbi">Just-Business Intelligence</a>. Just-BI provides end-to-end Business Intelligence consultancy. I'm in the custom development branch, which creates web and mobile applications in the realm of self-service and operational Business Intelligence.
<br/>
<br/>
Just-BI has a policy of assigning consultants to billable projects for at most 80% of their working time; the remaining 20% is meant to be invested in knowledge development. We try to align agendas and meet each other every Friday at our office in Rijswijk.
<br/>
<br/>
This arrangement made it possible for me to attend an event like the Accountability Hackathon - in fact, Just-BI stimulates its consultants to reach out and participate in events like these.
<h2>Submission: Jubilant</h2>
The Just-BI submission is a generic <a href="http://www.odata.org/" target="_odata">OData</a> query and exploration tool called Jubilant (short for <b>Ju</b>st <b>B</b>usiness <b>I</b>nte<b>l</b>ligence <b>An</b>alysis <b>T</b>ool).
<br/>
<br/>
<img src="https://openstate.eu/wp-content/uploads/sites/19/2017/06/593ac7fc4a456_Jubilant2eKApi.png"/>
<br/>
<br/>
Jubilant is an <a href="http://openui5.org/" target="ui5">Open UI5</a> web application that provides a plugin architecture that makes it easy for developers to write their own data visualisations based on OData services. Jubilant provides rich metadata about OData services, as well as a number of reusable components that make it easy to quickly build a query editor/designer.
<br/>
<br/>
The Jubilant concept allows a plugin developer to focus on making a cool visualisation, without having to invest time and effort to provide the user with a query builder. During the hackathon I managed to create two plugins - one simple table visualisation, which simply renders raw data in a data grid, and an OData Metadata Graph visualiser, which plots the structure, entity types and relationships exposed by the OData service as a graph.
<h2>OData and Open Data</h2>
The connection to open data and the assignment for the Accountability Hack is that a number of key open data API's use the OData protocol. A good example is the <a href="https://opendata.tweedekamer.nl/" target="tk">Dutch Parliament API</a>.
<br/>
<br/>
Interestingly, there are relatively few OData query tools available, and none of them are particularly affordable. In fact, during the accountability hackathon a few teams tried to work with these OData APIs and discovered they didn't quite know how to access and process them. I don't know if this finding influenced the jury in any way, but it certainly highlighted the need for a tool like Jubilant.
<br/>
<br/>
For Just-BI OData is a key protocol as well, since it happens to be the standard way of exposing data by many SAP products, like <a href="https://www.sap.com/products/hana.html" target="_sap">SAP/HANA</a>. While Just-BI is a general end-to-end Business Intelligence shop, many of our customer engagements have a strong focus on SAP products. This is also reflected in the Open UI5 framework, which has rather good support for OData.
<h2>Result: 2nd Prize!</h2>
<img src="https://www.just-bi.nl/wp-content/uploads/posts/2017/post_4962/hack-2017-300x146.jpg" style="float: left; margin: 1em;"/>
I was surprised, but obviously very happy, to have been awarded the 2nd prize, which is good for 1,500 EUR. It is an honour and a privilege to be in a position to work on stuff I like and maybe contribute something to the transparency of Dutch democracy. And, frankly, I just had a great time hacking!
<br/>
<br/>
<a href="https://www.just-bi.nl/just-care/" target="justbi" style="float: right; margin: 1em;"><img src="https://www.just-bi.nl/wp-content/uploads/posts/2013/post_2133/JustCareFB.png"/></a>
Since this was basically just a working day for me, I decided to donate 500 EUR of the prize money to <a href="https://www.just-bi.nl/just-care/" target="justbi">Just-Care</a>, which is a charity supported by Just-BI.
<br/>
<br/>
<h2>Where to get Jubilant?</h2>
At Just-BI, we're currently working out the exact details around a release of Jubilant. I will write an update as soon as I can disclose more, but I can already say that all the work I did for the accountability hack will become available as Open Source Software in the very near future.
<br/>
<br/>
In the meantime, if you're interested in Jubilant and OData-based self-service BI, don't hesitate to <a href="mailto:roland.bouman@gmail.com" target="me">contact me</a>. Or <a href="https://www.just-bi.nl/contact/" target="justbi">contact Just-BI</a>.
rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-45844231153096584142017-05-12T17:53:00.000+02:002017-08-15T10:34:55.493+02:00Do we still need to talk about Data Vault 2.0 Hash keys?A few days ago, I ran into the article <a href="http://blog.scalefree.com/2017/04/28/hash-keys-in-the-data-vault/" target="_scalefree">"Hash Keys In The Data Vault"</a>, published recently (2017-04-28) on <a target="_scalefree" href="http://blog.scalefree.com">the Scalefree Company blog</a>. Scalefree is a company <a href="http://www.scalefree.com/aboutus/#founders" target="_scalefree">founded</a> by Dan Linstedt and Michael Olschminke. Linstedt is the inventor of <a href="https://danlinstedt.com/solutions-2/data-vault-basics/" target="_linstedt">Data Vault</a>, which is a method to model and implement <a href="https://en.wikipedia.org/wiki/Data_warehouse" target="_wiki">enterprise data warehouses</a>.
<br/>
<br/>
The article focuses on the use of <a href="https://en.wikipedia.org/wiki/Hash_function" target="_wiki">hash-functions</a> in <a target="_wiki" href="https://en.wikipedia.org/wiki/Data_vault_modeling">Data Vault data warehousing</a>. To be precise, it explains how Data Vault 2.0 differs from Data Vault 1.0 by using hash-functions rather than sequences to generate <a href="https://en.wikipedia.org/wiki/Surrogate_key" target="_wiki">surrogate key</a> values for business keys. In addition, hash-functions are suggested as a tool to detect changes in non-business-key attributes, so as to track how their values change over time.
<h3>Abstract</h3>
First I will <a href="#scalefree-summary">analyze</a> and <a href="#scalefree-comments">comment</a> on the Scalefree article and DV 2.0, and explain a number of tenets of DV thinking along the way. <a href="#critique">Critical comments</a> will be made about using hash values as primary keys in DV 2.0, and the apparent lack of progress made in DV thinking regarding this matter. <a href="#birthday">A discussion</a> follows about the birthday problem and how it relates to DV 2.0 usage of hash functions. Using the <a href="#square-approximation">Square approximation method</a> it will be demonstrated how we can make informed and accurate decisions about the risk of a collision with regard to the data volume and choice of hash function. <a href="#impact-of-collision">Different scenarios</a> regarding the impact of a collision on data integrity will then be explored. Finally, a <a href="#proposal">practical proposal</a> is made to detect hash collisions and to prevent them from introducing data integrity violations into our data warehouse.
<h2 id="scalefree-summary">A Summary of the Scalefree article</h2>
I encourage you to first <a href="http://blog.scalefree.com/2017/04/28/hash-keys-in-the-data-vault/" target="_scalefree">read the original article</a>. My summary is here below.
<ol>
<li>The business key is what the business users use to identify a business object.</li>
<li>To identify business objects in the data warehouse, we should use a <a href="https://en.wikipedia.org/wiki/Surrogate_key" target="_wiki">surrogate key</a> that fits in one column instead of the (possibly composite) business key.</li>
<li>The reason to use a single-column surrogate key is that business keys can be large (many bytes per key field, multiple key fields) which makes them slow - in particular for join operations.</li>
<li>In DV 1.0, sequences are used to generate values for surrogate keys.</li>
<li>Using sequences to generate surrogate key values implies a loading process consisting of at least 2 phases that have to be executed in order.
<br/>
<br/>
The previous point requires some context regarding the architecture of the Data Vault model. Without pretending to completely represent all tenets of DV, I believe the explanation below is fair and complete enough to grasp the point:
<br/>
<br/>
DV stores the data that makes up a business object in 3 types of tables:
<br/>
<br/>
<ul>
<li>The business key and its corresponding surrogate key are stored in hub-tables.</li>
<li>The change history of descriptive attributes are stored in satellite-tables</li>
<li>Relationships between business objects are stored in link-tables.</li>
</ul>
<br/>
So, each distinct type of business object corresponds to one hub-table, and at least one, but possibly multiple satellite-tables. The satellite-tables refer to their respective hub-table via the surrogate key. Likewise, link-tables maintain relationships between multiple business objects by storing a combination of surrogate keys referring to the hub-tables that participate in the relationship.
<br/>
<br/>
This means that for a single new business object, any of its satellite-tables can be loaded no sooner than when it is known which surrogate key value belongs to its business key. Since the new sequence value is drawn when loading the new business key into its designated hub-table, this means that the new business object must be loaded into its hub-table prior to loading its satellite-tables.
Likewise, any relationships that the business object may have with other business objects can be loaded into link-tables only after *all* business objects that are related through the link have been loaded into their respective hub-tables.
<br/>
<br/>
In practice this means that in DV 1.0, you'd first have to load all hub-tables, drawing new surrogate key values from the sequences as you encounter new business keys. Only in a subsequent phase would you be able to load the satellite- and link-tables (possibly in parallel), using the business key to look up the previously generated surrogate key stored in the hub-tables.
<br/>
<br/>
</li>
<li>In DV 2.0, <a href="https://en.wikipedia.org/wiki/Hash_function" target="_wikipedia">hash-functions</a> are used to generate values for surrogate keys.</li>
<li>When using hash-functions to generate surrogate key values, hub-, satellite- and link-tables can all be loaded in parallel.
<br/>
<br/>
The previous point needs some explanation.
<br/>
<br/>
There is no strict, formal definition of a hash-function.
Rather, there are a number of aspects that many hash-functions share and on which DV 2.0 relies:<ul>
<li><a href="https://en.wikipedia.org/wiki/Hash_function#Determinism" target="_wiki">Deterministic</a>: a given set of input values will always yield the same output value. In a DV context, deterministic just means one business key maps to exactly one surrogate key.
<br/><br/>
To some extent, a solution based on a sequence as generator of surrogate keys also appears to be deterministic, at least within the confines of one physical system.
</li>
<li>Stateless: While a solution based on a sequence appears to be deterministic, its values are generated by incrementing the previous value, and the mapping from a business key to its corresponding surrogate key is "remembered" by storing them together in the hub. Remembering the mapping by storing it in the hub is what makes it deterministic, because only then do we have the ability to look up the surrogate key based on the business key (and vice versa). But suppose we had two separate but otherwise identical systems, and loaded the same set of business objects into both, each in a different order: the mapping from business key to surrogate key would then differ between the two systems, because whichever object was loaded first in a particular system draws the lower sequence number.
</li>
<li><a href="https://en.wikipedia.org/wiki/Hash_function#Defined_range" target="_wiki">Fixed output size</a>: DV expects a chosen hash-function to always return a fixed-length output value. (And typically, the hash key should be smaller, ideally much smaller, than its corresponding business key.)
</li>
<li><a href="https://en.wikipedia.org/wiki/Hash_function#Uniformity" target="_wiki">Uniformity</a>: This means that inputs of the hash-function yield results that are very evenly distributed across the output domain. If that is the case, the chance that two different inputs (i.e., two different business keys) yield the same output value is very small. The phenomenon where two calls to a hash-function, using different arguments, yield the same result value is called a <i>collision</i>. We will have much to say about collisions later on in this article.
</li>
</ul>
<br/>
Basically the idea is that a hash-function can be used to generate a surrogate key value that is typically much smaller (and thus, "faster", in particular for join operations) than its corresponding business key, and it can do so without an actual lookup against the set of existing surrogate key values. Ideally, a hash-function calculates its output based solely on its input, and nothing else.
<br/><br/>
It is this latter aspect that makes it different from a sequence, which generates values that have no relationship at all with the values that make up the business key. In the case of a sequence, the relationship between surrogate key and business key is maintained by storing it in the hub, and looking it up from there when it is needed.
<br/><br/>
</li>
<li>When using hash-functions there's a risk of collision.
A collision occurs when two different inputs to the hash-function generate identical output.
</li>
<li>
The risk of collision is very small. In a database with more than one trillion hash-values, the probability that you will get a collision is like the odds of a meteor landing on your data center.
</li>
<li><a href="https://en.wikipedia.org/wiki/MD5" target="_wiki">The MD5 hash-function</a> is recommended for DV 2.0. As compared to other popular hash-functions (like MD6, SHA1 etc.), MD5's storage requirements are relatively modest (128 bits), while the chance of a hash-collision is 'decently low'. It's also almost ubiquitously available across platforms and databases.</li>
</ol>
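To make the deterministic, stateless aspect concrete, here is a minimal Python sketch of hash-based surrogate key generation (the business-key values and the delimiter convention are made up for illustration; MD5 is used because that is what DV 2.0 recommends):

```python
import hashlib

def hash_key(*business_key_parts):
    """Derive a surrogate key by hashing the business key.

    Deterministic and stateless: the same business key always yields
    the same 128-bit value, on any system, without a lookup against
    previously generated keys and without a shared sequence.
    """
    # Join the (stringified) business key parts with a delimiter,
    # so that ("ab", "c") and ("a", "bc") hash differently.
    data = "|".join(str(part) for part in business_key_parts)
    return hashlib.md5(data.encode("utf-8")).hexdigest()

# The same business key yields the same surrogate key on every run:
k1 = hash_key("ACME Corp", "NL")
k2 = hash_key("ACME Corp", "NL")
assert k1 == k2 and len(k1) == 32   # 128 bits = 32 hex characters
```

Note that a real implementation would also have to agree on details such as case normalization, trimming and the delimiter, since any difference there breaks determinism across systems.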
<h2 id="scalefree-comments">Comments</h2>
Many tenets of Data Vault thinking make good sense: <ul>
<li>The focus on business keys, rather than blindly using the primary key of the source systems feeding into the EDW, ensures that the result will serve the needs of the business.</li>
<li>While the use of surrogate keys in transactional systems is still often a source of debate, this practice is not controversial in data warehousing.
<br/>
<br/>
That said, I think the technical reasons for introducing a surrogate key as mentioned by the article (making keys smaller and more wieldy to improve join performance)
are not as important as the functional requirement of any data warehouse to integrate data from multiple data sources, while ensuring that identity is resilient to changes in the source systems.
<br/>
<br/>
</li>
<li>There's a lot to be said in favor of maintaining attribute history in satellite-tables. In particular, the ability to have multiple satellite-tables is appealing, since it allows you to create and maintain groups of attributes that somehow belong together, and maintain them as a unit. For example, grouping attributes based on their rate of change, or according to whether they tend to appear together in queries sounds like a very useful thing.</li>
<li>Link-tables also sound like a good idea. Decoupling direct links from the business objects and maintaining them in completely separate tables makes the EDW very resilient to changes in the data sources. In addition it allows you to add extra relationships that may not be directly present in the source systems but which make sense from the point of view of data integration and/or business intelligence.
</li>
<li>
Once we accept the benefits of satellite- and link-tables and surrogate keys, then we must also embrace hub-tables. The model wouldn't make any sense without them, since we need an integration point anyway to maintain the mapping between business key and surrogate key (which is, of course, the hub-table).
</li>
<li>Once we accept the modeling entities of DV, we also need to accept any constraints that surrogate key generation may have on the loading process. Of course, a dependency itself is something we would like to avoid, but the fact that a known limitation is recognized and anticipated is a good thing.</li>
</ul>
<h2>Flashback a few years ago</h2>
When I read the Scalefree article, I experienced a bit of a flashback that took me back some two, three years to <a target="linkedin" href="https://www.linkedin.com/groups/44926/44926-5922882750100033537">this question ("Hash Key Collisions") by Charles Choi on the DataVault discussions group on linkedin</a>. In case you cannot access linkedin, I'm reprinting the question below: <blockquote>According to Dan's paper "DV2.0 and Hash Keys", the chances of having a hash key collision are nearly nonexistent (1/2^128). But I still worry.
<br/>
<br/>What if a collision actually does occur that results in the business making a catastrophic decision? Can we really say we have 100% confidence in the "system of fact" when we choose to accept the risk of a collision? What is our course of action from a technology point of view if a collision did actually occur?</blockquote>
The paper "DV2.0 and Hash Keys" that Charles refers to is publicly available though not freely. (You can purchase it via Amazon.)
Here's the relevant statement from the article:<blockquote>The mathematical chances of a collision as a result of using MD5 are (1 / (2^128)) which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768
million 211 thousand 456.<br/><br/>In reality, you would have to produce 6 billion new business keys per second per hub for 100 years to reach a 50% chance of getting a collision. Not very likely to happen in our lifetime.</blockquote>
A <a href="http://danlinstedt.com/allposts/datavaultcat/datavault-2-0-hashes-versus-natural-keys/#comment-2905" target="_linstedt">similar question</a> was asked by Ray OBrien in response to a post by Linstedt (see: <a target="_linstedt" href="http://danlinstedt.com/allposts/datavaultcat/datavault-2-0-hashes-versus-natural-keys">#datavault 2.0 hashes versus natural keys</a>):<blockquote>collisions are real and MD5 not a good choice. but generally the smaller the input key domain and the larger the Hash output size, the less chance of collision, BUT it is always there.. so I would like to see some comments on the verification steps needed and cost to load of Collision ManagementI. If Integrity of data is important, then this is important.</blockquote>
In <a target="_linstedt" href="http://danlinstedt.com/allposts/datavaultcat/datavault-2-0-hashes-versus-natural-keys/#comment-2906">Linstedt's answer</a>, you can find a similar expression of this probability, which goes:<blockquote>because we split business keys across multiple hubs, the chances of collision (even with MD5) are next to none.<br/><br/>Yes, they do exist – but you would have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place.</blockquote>
<h2 id="critique">Objections</h2>
Regardless of the merits of DV (which I believe I stated fairly in the previous section), I have a few doubts and objections about the writing and thinking by DV 2.0 advocates that I have observed so far, all around the topic of hash-key collisions. My objections are:
<ul>
<li>The wording around the probability of hash-collisions is not helpful to understand the risk. As such, it does not help to decide whether to use DV 2.0, let alone choose a suitable hash-function for a concrete use-case.</li>
<li>The actual numbers regarding probability of hash-collisions are stated flat-out wrong on more than just one occasion.</li>
<li>The probability doesn't matter if you can't afford to lose any data.</li>
<li>DV 2.0 does not discuss the consequences of a hash-collision, and no concrete advice is given on how to detect hash-collisions, let alone handle them.</li>
</ul>
I believe the questions by Charles Choi and Ray OBrien show that I am not alone. At the time they voiced their doubts, DV 2.0 was relatively new and I can understand that maybe at the time these tenets of DV 2.0 would still need to mature.
<br/>
<br/>
A couple of years have passed since, and after reading the article on the scalefree company blog, I am sad to observe that, apparently, no progress has been made in DV 2.0 thinking. At least, if such progress has been made, the scalefree company blog article doesn't seem to offer any new views on the matter. Instead, it comes up with an - equally unhelpful - restatement of the probability of hash-collision by comparing it to the chance of being hit by a meteor.
<br/>
<br/>
In the remainder of the article I will explain my objections and attempt to offer some thoughts that may help advance these matters.
<h3 id="unhelpful-wording">The DV 2.0 wording around the probability of hash-collisions is not helpful</h3>
First, let's try and analyze the wording around probabilities, and what message it conveys.
<h4>Distracting Rhetorics</h4>
What does it really mean when someone says that "the probability [...] is like the odds of a meteor landing on your data center"? What does it mean, really, when someone says that the probability is "1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456"?
<br/>
<br/>
Well, obviously, they are saying the chance is very small. Maybe it's just me, but I also sense a level of rhetoric in the wording that seems intended to dwarf the reader with Big Serious Numbers.
<br/>
<br/>
It's almost as if they're saying: this won't happen, so you shouldn't worry. You're not worrying about meteors hitting your data center all the time, so why worry about a hash-collision? Right?
<br/>
<br/>
You can observe that the rhetoric is working too, just look at how Charles Choi voiced his question: "According to Dan (Linstedt) [...] the chances [...] <i>are nearly nonexistent</i> (1/2^128). <i>But I still worry.</i>".
<br/>
<br/>
It's as if Charles is apologizing in advance for worrying.
<h4>Probabilities are not absolute</h4>
There is a more fundamental problem with the probability wording of the previous 2 examples, and that's that they project probability as an absolute.
<br/>
<br/>
To be fair, if you read them in context, then you'll notice that both statements are about MD5 collisions. Obviously, this matters, since not all hash-functions have equal probability of collisions. For hash-functions that have a fixed-length output, the chance surely has to have a relationship with the length of the output, since that puts a hard limitation on the number of unique values it could possibly encode.
<br/>
<br/>
However, apart from the output length of the hash-function and the algorithm it uses, there is at least one other factor which determines the probability, and that is your data volume.
<br/>
<br/>
Intuitively, this is easy to understand: if you have an empty set, the probability of the first hash-value causing a collision is exactly zero, since there is nothing to collide with. At the other extreme end of possibilities, if the set contains as many items as the total number of unique values the hash-function is capable of generating, then the probability of a collision is exactly one, since the entire keyspace has been "used up" already. In between these extremes, we have a growing number of existing entries that could collide with a new entry, so the probability increases from zero to one as the actual number of items (i.e, the number of rows in the hub - the data volume) increases.
<br/>
<br/>
While this observation seems trivial, it is important to mention it in this discussion because the aspect of volume is, for some reason, often not touched at all by DV 2.0 advocates. It's a mystery why that should be so, because if we would know the relationship between probability of a collision, maximum possible number of unique values, and the maximum volume of data, then we could reason about these variables sensibly. Like:<ul>
<li>Given the maximum risk that I am willing to take to lose data due to a collision, what is the maximum volume I can store if I use a 128-bit hash-function?</li>
<li>Given the maximum risk of collision that I am willing to take, and given the maximum number of rows I need to store, what would be the minimum output length of the hash-function I should look for?</li>
<li>Given my current data volume, and my current choice of hash-function, what is the risk I am running now of losing data due to collision?</li>
</ul>
<h4 id="birthday">The Birthday Problem</h4>
Interestingly, Linstedt does provide one statement that at least takes the data volume into account: "you'd have to produce 6 billion new business keys Per Second PER HUB in order to reach a 50% chance of a collision in the first place". Let's see what that means exactly.
<br/>
<br/>
Apart from the probability, this statement includes the other two variables: the 128-bit key length of MD5, and a data volume of 6 billion rows per second for 100 years. But if you look at the probability (50%), you'll notice how completely useless this wording is, apart from its rhetorical power. Who in their right mind is interested in a system that can only accept half of the data you're trying to store in it? How can you possibly apply this piece of knowledge in any practical sense to a system you actually have to build?
<br/>
<br/>
I spent a while thinking about how this statement could have come about, and I have come to believe that it is actually a restatement of the classical <a href="https://en.wikipedia.org/wiki/Birthday_problem" target="_wiki">birthday paradox</a>: <blockquote>The birthday paradox, also known as the birthday problem, states that in a random gathering of 23 people, there is a 50% chance that two people will have the same birthday.</blockquote>
(As compared to Linstedt's statement, the 50% probability stays the same; the 6 billion rows per second for a 100 years is equivalent to the number of people gathered, and the number of possible unique values in the key corresponds to the number of days in a year.)
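The 23-people figure itself is easy to verify: the chance that n draws from m equally likely values are all distinct is the product of the factors (m - i)/m, and the collision probability is its complement. A quick Python check:

```python
def birthday_collision_probability(n, m=365):
    """Exact probability that among n random draws from m equally
    likely values, at least two are equal."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (m - i) / m
    return 1.0 - p_all_distinct

# With 23 people the probability first exceeds 50%:
p23 = birthday_collision_probability(23)   # ~0.5073
p22 = birthday_collision_probability(22)   # ~0.4757
```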
<br/>
<br/>
Whether or not the original birthday problem is actually what made Linstedt word his statement the way he did, I think it's clear that a 50% probability of a collision has no practical bearing on building any kind of database. To me, it just sounds like more rhetoric to convince us that hash-collisions are really rare.
<h3 id="square-approximation">Probabilities are stated flat out wrong</h3>
The discussion of the birthday problem brings us to an actual method to calculate the probability as a function of data volume and key length. The birthday problem wikipedia article explains it far better than I ever could: it provides an exact method as well as many useful approximations that are much easier to calculate, and it offers this really useful rule of thumb called <a href="https://en.wikipedia.org/wiki/Birthday_problem#Square_approximation">the square approximation</a>: <blockquote>A good rule of thumb which can be used for mental calculation is the relation<br/><br/><table style="font-size: 14pt;" cellpadding="0" cellspacing="0">
<tr><td rowspan="3">p(n) ≈</td><td style="text-align: center">n<sup>2</sup></td></tr>
<tr><td><div style="background-color: black; font-size: 2px; ">/</div></td></tr>
<tr><td style="text-align: center">2m</td></tr>
</table><br/>which can also be written as<br/><br/>
<div style="font-size:14pt">n ≈ <span style="font-family: courier">√</span> (2m * p(n))</div>
<br/>which works well for probabilities less than or equal to 0.5
</blockquote>
with
<dl>
<dt>n</dt><dd>the actual size of the keyset - i.e., the number of rows you need to fit into the hub</dd>
<dt>p(n)</dt><dd>the probability of a collision given <b>n</b>.</dd>
<dt>m</dt><dd>the theoretical maximum size of the keyset, i.e. the maximum number of unique values that your hash-function can encode.</dd>
</dl>
Given a fixed-length hash-function such as MD5, <b>m</b> can be calculated by raising 2 to the power of the key length (expressed as a number of bits): <blockquote>
<br/><br/>
m = 2 ^ bitlength (or 2<sup>bitlength</sup>)
<br/><br/>
</blockquote>(In case this needs explaining: if your key were just one bit long, it could hold only two (2<sup>1</sup>) values - 0 or 1. If it were 2 bits long, it could hold 2 * 2, or 2<sup>2</sup> = 4 unique values; with 3 bits, 2 * 2 * 2 or 2<sup>3</sup> = 8, and so on.)
<br/><br/>
From the discussion above as well as the previous section, we can now conclude that at least one statement of the probability:<blockquote>The mathematical chances of a collision as a result of using MD5 are (1 / (2^128))</blockquote>is simply flat-out wrong, since it does not take the data volume into account. Rather, since 2^128 is the number of possible unique values that MD5 can cover, 1 / 2 ^ 128 is the chance that the second row you put into your hub will collide with the first one.
<br/><br/>
So how big or small are the odds really of running into a MD5 collision in case we're handling a volume of 6 billion rows per second for a 100 years? Using square approximation, we get:<br/>
<br/>
<table style="font-size: 14pt;" cellpadding="0" cellspacing="0">
<tr><td rowspan="3">p(n) ≈</td><td style="text-align: center">(6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days * 100 years)<sup>2</sup></td><td rowspan="3"> ≈ 0.526</td></tr>
<tr><td><div style="background-color: black; font-size: 2px; ">/</div></td></tr>
<tr><td style="text-align: center">2 * (2 ^ 128 bits in a MD5 hash-value)</td></tr>
</table><br/>
<br/>
which is in fact closer to 53%. To figure out how many years we would need to insert 6 billion rows per second to achieve the 50% chance of running into a collision, we can use the second form of the formula:
<br/>
<br/>
<div style="font-size:14pt">n ≈ <span style="font-family: courier">√</span> (2 * (2 ^ 128 bits in MD5 hash-value) * 50% probability) ≈ 1.84467 × 10<sup>19</sup> rows in the hub</div>
<br/>
Dividing by 6,000,000,000 rows * 60 seconds * 60 minutes * 24 hours * 365.25 days gives slightly less than 97 and a half years.
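These figures are easy to reproduce; the following Python sketch applies the square approximation to Linstedt's scenario (the one-in-a-trillion risk threshold at the end is just an arbitrary example of the kind of question you can now answer):

```python
from math import sqrt

BITS = 128                      # MD5 output length
m = 2 ** BITS                   # number of distinct hash values
rows_per_year = 6_000_000_000 * 60 * 60 * 24 * 365.25

# Probability of at least one collision after 100 years at
# 6 billion rows per second, using p(n) ~ n^2 / 2m:
n = rows_per_year * 100
p = n ** 2 / (2 * m)            # ~0.527

# Number of rows needed for a 50% collision probability,
# and how long it takes to load them at that rate:
n_50 = sqrt(2 * m * 0.5)        # = 2**64, ~1.84e19 rows
years = n_50 / rows_per_year    # ~97.4 years

# The same formula answers a more practical question: how many rows
# can a 128-bit hash key hold before the collision risk exceeds,
# say, one in a trillion?
n_max = sqrt(2 * m * 1e-12)     # ~2.6e13 rows
```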
<br/>
<br/>
The point of doing these calculations here is obviously not to prove Linstedt wrong by showing you'd already arrive at a 50% chance after only 97 years and some, instead of after 100. Nor is it to determine that after 100 years, the chance is actually closer to 53%. Besides the fact that I'm using an approximation, neither makes any sense anyway, because a probability of 50% is already way, way beyond any definition of a working system.
<br/>
<br/>
The point I am trying to make is that it is perfectly possible to reason about large numbers and to clearly and transparently demonstrate how they are calculated. Using square approximation, you have a tool to calculate the value of the third variable once you have the value of the other two, allowing you to reason about it from three different angles.
<br/><br/>
I think we can all agree that's a much better position than getting stumped by Really Seriously Big Numbers.
<h4>Probabilities add up for each keyset</h4>
So far, we've just looked at the probability for encountering a collision while loading a single hub.
But the probability increases as you have more hubs.
<br/>
<br/>
Intuitively this is clear, because each hub could encounter a collision independently.
So, the chance of suffering a collision in any one of them grows as you have more hubs to maintain, and is quite a bit larger than the chance of suffering a collision in just one particular hub.
<br/>
<br/>
If we'd like to compute the chance of getting a collision in at least one hub we can apply the following calculation:<blockquote>
1 - ((1 - Ph1) * (1 - Ph2) * ... * (1 - PhN))
</blockquote>with: <dl>
<dt>Ph1</dt><dd>Probability of a collision in the first hub</dd>
<dt>Ph2</dt><dd>Probability of a collision in the second hub</dd>
<dt>...</dt><dd></dd>
<dt>PhN</dt><dd>Probability of a collision in the N-th (last) hub</dd>
</dl>
The rationale behind this is that if the probability of a collision is P(n), then the chance of not having a collision is 1 - P(n). To calculate the chance of not having a collision in any of the hubs, we multiply the individual chances of not having a collision in each particular hub with each other. If that number is the chance of not having a collision in any of the hubs, then all remaining probability must mean there is a collision in one or more hubs. So the chance of having at least one collision is obtained by subtracting the probability of having no collision at all from 1.
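A minimal Python sketch of this complement rule (the example probabilities are made up):

```python
def probability_any_collision(hub_probabilities):
    """Probability of at least one collision across independent hubs:
    the complement of no hub having a collision."""
    p_none = 1.0
    for p in hub_probabilities:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# Ten hubs, each with a one-in-a-million collision probability,
# give a combined probability of roughly one in a hundred thousand:
p_total = probability_any_collision([1e-6] * 10)   # ~1e-5
```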
<br/>
<br/>So the chance that someone will ever encounter a collision could be quite a bit larger than you'd expect if you're focusing on just one hub.
<h3 id="I-cannot-afford-to-lose-data">No matter how slight a Probability, I can't afford to lose data</h3>
Now we arrive at a more fundamental objection regarding the matter of using hash-values as keys in your database.
<br/>
<br/>
If you re-read Charles' question, you'll notice that he is politely explaining that, although he understands and appreciates that an MD5 hash-collision may be really rare, he simply doesn't ever want to lose any data because of it. Ray OBrien raises the exact same point, even mentioning data integrity as the reason why he cares.
<br/>
<br/>
When this issue is put forward in DV 2.0 discussions, it usually means the end of any meaningful discussion. The answers that DV 2.0 advocates give in response typically match one of the following:
<br/>
<br/>
<table>
<tr>
<td>"Look, do you realize how rare a hash-collision is?"</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Yes thank you. You just stumped me with some Really Seriously Big Numbers, and I get it. Super rare. I just don't want to lose data though.</td>
</tr>
<tr>
<td>"You don't have to use 128-bits MD5, you can use a hash-function that returns larger values, like 160-bits SHA1. Collisions will then be, you know, even more rare."</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Perfect thanks. Did I mention I can't afford to lose data?</td>
</tr>
<tr>
<td>"We use 2 hash-functions as key and it works for us."</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Ah, I get it now. You made collisions Super-duper-rare, how clever. So will you never lose data now?</td>
</tr>
<tr>
<td>"I have built hundreds of Terabyte-sized data warehouses, and I never encountered a hash-collision."</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Well let me guess. Might that be because they are very rare?</td>
</tr>
<tr>
<td>"Teradata is using hashing to solve MPP data distribution, and Hadoop uses hashing in HDFS. If it works for them, then why wouldn't it work in DV 2.0"</td>
<td></td>
</tr>
<tr>
<td></td>
<td>So you're saying Teradata and Hadoop use hashing for some purpose, and DV2.0 is using hashing for a completely different purpose, and now you want me to explain why it works in one use-case but not for a completely different use-case? That's...interesting. <br/><br/>How about: Teradata and Hadoop are not using hash values as primary keys, and DV 2.0 is?</td>
</tr>
<tr>
<td>"Look, why are you so worried about hash-collisions? You're not worrying all the time about a meteor hitting your data center, are you?"</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Actually, I do. That's why our database is geographically distributed across data centers.</td>
</tr>
<tr>
<td>"Ha! Gotcha now. What about two meteors? Wouldn't that be comparable to using two hashes?"</td>
<td></td>
</tr>
<tr>
<td></td>
<td>I suppose it would. Difference is, I can't help meteors falling on my data centers. But I can choose to stick to sequences instead of hash-functions.</td>
</tr>
<tr>
<td>"But I just explained, sequences are a bottleneck and prevent you from parallel loading your EDW!"</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Yes I heard. And I asked my customer: they are pretty sure about how they feel regarding the possibility of losing data, and they clearly told me they'd rather wait around for the data to be loaded as compared to being able to report super-quickly on wrong data.</td>
</tr>
<tr>
<td>"We use hash-keys to store tweets for sentiment analysis, and our results are pretty accurate, even if we lose data sometimes."</td>
<td></td>
</tr>
<tr>
<td></td>
<td>I'm sure you are- good for you! But we manage a monetary transaction log and we feel that the risk of losing one $1,000,000,000 transaction just doesn't justify loading 1,000,000,000 transactions worth $1 super-fast in a parallel fashion. Silly us eh?</td>
</tr>
</table>
<br/>
<br/>
And that's really all there is to it: Probabilities don't mean a thing if you're really sure you don't want to lose any data.
When it happens, it is no comfort that you were the one to have had such extraordinarily bad luck experiencing it.
<br/>
<br/>
Another thing people may overlook is that the probability also doesn't tell you when it will happen. The only real guarantee you have is that inserting the first key in an empty hub will always succeed. But already the 2nd row might collide. It probably won't, and the odds are really slim. But it might. If you're sure you don't want that, then don't use hash values as keys. It's really that simple.
<br/>
<br/>
Sure, there might be other risks that could make us lose data. For example, the probability of disk corruption might be larger than that of a hash-collision. But it doesn't follow that we should set ourselves a trap we can avoid, especially when we know how to avoid it.
<br/>
<br/>
We cannot control disaster like disk corruption or meteor impact. If we could though, we would! Whether to use hash keys or to stick to sequences is a conscious choice, so let's be sure we make it based on information and requirements, and rather not based on some analogy that is chosen with the express purpose of making you feel a little bit ridiculous for being so averse to taking a risk.
<br/>
<br/>
If you're building a database for someone else, and you're considering using hashes as keys for your data, then be prepared to ask your customer: "How much data can you afford to lose?" or "In the event that we cannot load some data due to hash-collisions, how much time and effort can we spend to take the system offline so we can fix it?"
<br/>
<br/>
Another way to think about it is this: suppose you would, in fact, lose data because of a collision. How comfortable are you then admitting to your customer that you constructed a solution that, by design, could end up giving wrong results, while there was an alternative that guarantees correctness, at least to the extent of things you can control? And suppose you did get wrong results: did you anticipate just how wrong those results could be?
<br/>
<br/>
I truly feel that considerations like this are not on the database/data warehousing professional - they are on the customer. It's their data. Please, respect that.
<h3 id="impact-of-collision">What if we do have a collision?</h3>
I get that there are use cases where you might want to accept the small risk of a collision. But you cannot really, truly make that assessment if you haven't considered and anticipated it as if it is a real event actually hitting you. I think DV 2.0 falls short in nourishing healthy discussion regarding the anticipation of such events.
<br/>
<br/>
So, what will happen if you have your hash-keys in place, and you encounter a collision? We can try and anticipate a few concrete scenarios.
<br/>
<br/>
First of all, will you even detect a collision when it happens? The Scalefree article has a few flow diagrams showing how raw data is staged, and then loaded into a hub. In that flow, the row is dropped when the hash already exists. So the question now is, why did the hash already exist?
<br/>
<br/>
Obviously, it's possible that we already loaded this business object, and we're merely seeing it again. In that case, we're fine, and we'll simply be loading newer data into the satellites and links for that business object. But it's also possible that this is an entirely different business object that happened to yield the exact same hash-value as a business object loaded earlier. In other words, you now have a collision. For the hub, it will pass by unnoticed, but the satellites and links that point to that business object will now store data pertaining to more than one distinct business object.
<br/>
<br/>
So in this case, we're not losing data, but compromising the integrity of the business object that arrived earlier. Our database integrity is now violated and our queries will return wrong results. You won't know though, because you didn't attempt to detect a collision. To your data vault, the distinction between multiple different business objects has ceased to exist.<br/><br/>
I don't know about you, but this does not feel like a happy place to me - especially since it is an entirely avoidable consequence of the design (combined with a whole bunch of bad luck, of course).
<br/>
<br/>
Alternatively, we build our solution in such a way that we can at least detect collisions. Once we detect it, we can maybe prevent loading associated data for the satellite-tables and link-tables for the colliding business object. This means we will be completely ignoring the later arriving business object, as if it isn't there.
<br/>
<br/>
We have now lost data. That is not a good thing, but at least this allows the earlier arriving business object to maintain its integrity. Our query results won't be wrong, they will just be incomplete. To me this is a slightly happier place, but the fact that it's a matter of fate which one of the objects made it into the database, and which one was rejected still makes me feel that the solution has failed.
<br/>
<br/>
But suppose we do want to go for that treatment (after of course getting confirmation from the business that this is really what they want) - how can we implement it? Well, at the very least, the load process for either the hub or the staging area would need to compare the hash value as well as the business key. Only if both are equal can we consider the objects equal.
<br/>
<br/>
Making the comparison is not hard but it will of course be slower than only calculating the hash, because in this case you need a lookup on the hub, just like you did with the DV 1.0 solution based on sequences.
<br/>
<br/>
But what if we detect the collision in this way? How can we then use this information to prevent loading the associated satellite- and link-table data?
<h2 id="proposal">Introducing a collision table</h2>
We could store collisions in a special collision table. We'd get one such table for each hub. The collision table would have the same layout as the hub table to which it corresponds, and you'd use it to store the colliding hash, as well as the business key for which the collision was met. The key of the collision table would have to be made up of both the hash key as well as the business key, so that we can handle multiple collisions for the same hash key.
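As a rough illustration (not part of DV 2.0 itself), here is an in-memory Python sketch of a hub load that detects collisions by comparing the business key as well as the hash, and records them in such a collision table; a real implementation would of course use database tables rather than dictionaries:

```python
import hashlib

def md5_key(business_key):
    """Hash-based surrogate key, as in DV 2.0."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

hub = {}            # hash key -> business key
collisions = set()  # (hash key, business key) pairs that collided

def load_hub(business_key):
    """Insert a business key into the hub, detecting collisions by
    comparing the business key in addition to the hash."""
    h = md5_key(business_key)
    existing = hub.get(h)
    if existing is None:
        hub[h] = business_key        # new business object
    elif existing != business_key:
        # Same hash, different business key: a genuine collision.
        # Record it so later loads can divert this object's data.
        collisions.add((h, business_key))
    # else: the same business object seen again - nothing to do
```

The satellite- and link-load processes would then look up the (hash key, business key) pair in the collision table, and discard or divert any rows found there.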
<br/>
<br/>
Once the collision table is in place, and the process for loading the hub-tables is modified to detect and store collisions, we can think about the process for loading the satellite- and link-tables. I can see two options:<ul>
<li>Have the load process for satellite- and link-tables do a lookup to the collision table to see if you need to discard data for business objects with collisions</li>
<li>Check collision tables after the load and run a clean-up process to restore integrity after detecting new collisions.</li>
</ul>
<h4>Collision table lookup</h4>
This solution relies on the process that loads the satellite- and link-tables to make a lookup to the collision table, using both the calculated hash and the business key. If collisions really are as rare as they should be, then that lookup should be really fast, because the collision tables will be pretty much empty.
<br/><br/>
If all goes well, then the lookup will fail. This means we have no collision and we can proceed loading the satellite- and link-tables.
In the rare event that the lookup succeeds there is apparently a hash-collision, and we must not load the satellite- and link-tables to prevent violating the integrity of our data. You are now in a position to either discard the data, or to store it someplace else in case you have a clever idea of reconciling the data later on.
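The lookup described above could be sketched like this (again an illustrative Python sketch, not actual DV tooling): rows whose (hash, business key) pair appears in the collision table are parked instead of loaded:

```python
def split_satellite_rows(collisions: set, rows: list) -> tuple:
    """Partition incoming satellite/link rows using a lookup on the collision
    table: rows whose (hash key, business key) pair is listed there belong to
    a colliding business object and must not be loaded.
    Each row is assumed to be a (hash_key, business_key, payload) tuple."""
    loadable, parked = [], []
    for row in rows:
        h, bk, _payload = row
        if (h, bk) in collisions:
            parked.append(row)    # discard, or keep for later reconciliation
        else:
            loadable.append(row)  # no collision: safe to load
    return loadable, parked
```

Because the collision table should be nearly empty, the lookup is cheap in the common case.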
<br/><br/>
However, we now have reintroduced the constraint on the loading process, because we now rely on the process that loads the hubs to also detect and store collisions in the collision table.
<br/>
<br/>
So, with this solution, we lose the ability to load hub-tables in parallel with satellite- and link-tables. What may be of some comfort in comparison with the DV 1.0 sequence based solution is that the collision table lookup will be much faster than a hub-lookup to find the value generated by a sequence, because the collision table will be pretty much empty. So, the burden of the constraint and loading dependency should be much lighter than in the case of a DV 1.0 sequence based solution.
<br/>
<br/>
Another important drawback is that if your solution spans multiple systems, you need to maintain one set of collision tables somewhere, and all loading processes will be dependent upon them. In other words, the solution is not stateless anymore.
<h4>Clean-up after load in case of new collisions</h4>
Alternatively, we keep loading hub-, satellite- and link-tables in parallel, and we check the collision tables after each load to see if the last load introduced any new collisions. If we find that it did, we need to perform clean-up after the load.
<br/>
<br/>
The way clean-up would work is as follows: our load process should have been logged, and our satellite- and link-tables should have metadata identifying the load process that put their contents in the data warehouse. The load identifier would also be stored in the collision table. Using that information, we can identify which new collisions our last load introduced. We now have the load identifier as well as the hash key of each new collision, and we can use that to delete all satellite- and link-table rows that have the load identifier of our latest load, as well as the hash key of the colliding business object.
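As an illustration of this clean-up step (a Python sketch with hypothetical column names; in practice this would be a DELETE against the satellite- and link-tables):

```python
def clean_up(satellite_rows: list, new_collisions: set, load_id: int) -> list:
    """After-the-fact clean-up: remove every satellite/link row that the
    latest load (identified by load_id) inserted for a colliding hash key.
    Rows are dicts carrying at least 'load_id' and 'hash_key' metadata."""
    colliding_hashes = {h for (h, _business_key) in new_collisions}
    return [row for row in satellite_rows
            if not (row["load_id"] == load_id
                    and row["hash_key"] in colliding_hashes)]
```

Note that this removes every row the last load inserted for a colliding hash, including rows that in fact belonged to the pre-existing business object: clean-up by itself cannot tell those apart.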
<br/><br/>
After clean-up, we have restored data integrity for those business objects that encountered a collision, up to the point prior to the last load.
<br/>
<br/>
Now, we know that our last load brought us a business object that had a hash collision with some business object that was already in our data warehouse, and we made the conscious decision to reject that data for now, or maybe store it someplace else until we know how to reconcile it. But the last load might also have brought us data that actually belonged to the business object that already existed in our data warehouse. We would really like to reload that part of the data for the existing business object. Our clean-up process had no way of distinguishing between satellite and link data belonging to one business object or the other, because it only knows about their colliding hash keys. In other words, our clean-up process might have removed data that actually did belong to the already existing business object, and now we need to put that back.
<br/>
<br/>
The solution would be to have an alternative load process especially for this -hopefully exceptional- case.
<br/>
<br/>
The alternative load process would be similar to the collision table lookup solution described above. It would only load satellite- and link-tables, and it would include a lookup to the collision table. If the lookup fails, we're dealing with data belonging to the existing business object and we can load it. If the lookup succeeds, then this is data that belongs to a new business object that caused a collision, and we should discard it or store it someplace else for later reconciliation.
<br/>
<br/>
If things go the way they should, the clean-up-and-reload process should seldom occur. And if we do need to run it, it would probably be quite fast, since it deals with only a few business objects - typically only one.
<br/>
<br/>
The only drawback now is that our data warehouse lived through a short period where integrity was compromised during the load process. But at least we can repair integrity for all existing business objects, and selectively discard only the data for those business objects that suffer from a hash collision with an already existing business object.
<br/><br/>
While this approach still relies on keeping collision tables around, we regain our ability to load hub-, satellite- and link-tables in parallel. We can even do loads spanning multiple systems; we just need to take care to clean those up as well in the case we do encounter a collision.
<h4>Other solutions?</h4>
I don't want to pretend this is an exhaustive list - I'm hoping there are more options and I just can't think of them right now.
<h3 id="How-to-load-colliding-business-objects">How to load the colliding business objects?</h3>
What these scenarios do not solve is loading the later arriving business object. We only managed to prevent these objects from entering our data warehouse; we have not found a solution to load that data as well.
<br/>
<br/>And we can't - not unless we change the key.
<br/>
<br/>
On the other hand, changing the key might just be doable: you could decide to try another hash-function for at least that hub. You would need to update all satellites and links pointing to that hub (and of course, the hub itself) and rehash every row that points to it.
<br/>
<br/>
Of course, since you're still relying on hash-keys, just - hopefully - larger ones, you haven't solved the problem, you've just improved the odds. And you might even run into new collisions while you're doing the rehashing operation. But that's just the life you've chosen. At least we now have something that resembles dealing with the problem rather than praying it won't happen.
<br/>
<br/>
I guess my main point here is - make sure every stakeholder actually accepts the risk, and make sure the procedures for dealing with a collision are specified and tested. The fact that the chance you'll need them may be next to negligible is not a license to pretend you do not need to be prepared for these tasks. If it is decided that you'll be taking the risk, then actually do take the risk, and take it seriously.
<h3 id="have-cake-eat-it-too">Can't we have our cake and eat it too?</h3>
Isn't there some way we can benefit from hashes and still, magically immunize ourselves against hash-collisions? It turns out we can.
<br/>
<br/>
To recap:<ul>
<li>DV - like other data warehousing methods - suggests using surrogate keys, because business keys are largish and unwieldy, and slow down join operations.</li>
<li>DV 2.0 suggests using hash-keys to avoid sequences, which make it impossible to load hubs, satellites and links all in parallel, and which slow down the load due to a lookup in a large hub.</li>
</ul>
This may be a long shot, but it has the advantage that it is guaranteed to work. As should be amply clear from the discussions in this article, hash-keys always come with a chance of collision, and it doesn't matter how small the risk is if you already know you do not want to accept losing data or giving up integrity. So, what is clear is that if that is the requirement, you cannot use hashes as keys. Period.
<br/>
<br/>
But that does not mean we cannot benefit from hashes.
<br/>
<br/>
Especially if the hash-key is small in comparison with the business key, we could build up our keys as a composite of the hash-code, followed by the field or fields that make up the business key. If our database uses B-tree indexing, then in the vast majority of cases joins will be resolved based only on the hash-key column, which should be the first column in the key definition. You would still keep using the fields of the business key in your join conditions, to ensure that, in case of a hash-collision, the query will still return the correct result. Since collisions are so rare, the database will end up with very few rows after it has resolved the join over the first field in the key, so the overhead of such a large key should be minimal.
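To illustrate why the composite key stays correct under collisions, here is a Python sketch of such a join (hypothetical row dicts standing in for hub and satellite rows): the hash does practically all the matching work, and the business-key comparison only disambiguates the rare collision:

```python
def composite_key_join(left_rows: list, right_rows: list) -> list:
    """Join on the composite key (hash_key, business_key). An index on the
    hash does virtually all of the matching work; the business-key comparison
    only matters in the rare collision case, where it keeps the result correct."""
    by_hash = {}
    for r in right_rows:
        by_hash.setdefault(r["hash_key"], []).append(r)
    joined = []
    for l in left_rows:
        for r in by_hash.get(l["hash_key"], []):        # cheap probe on the hash
            if r["business_key"] == l["business_key"]:  # rare disambiguation
                joined.append((l, r))
    return joined
```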
<br/>
<br/>
Of course, this solution does not help in cutting down the amount of data you need to store - that will be much more in this scenario, since you keep dragging the business keys literally everywhere: every satellite and every link table that points to this hub will inherit the column for the hash, as well as all columns that make up the business key. But the extra data should mostly add to the storage requirements, and not so much to the cost of join processing.
<br/>
<br/>
The benefit of the relatively fast join breaks down, though, when the database uses hash-joins. In that case, it will not be able to use the first column of the key as a prefix.
<h2>Finally</h2>
I hope you enjoyed this article. I am super curious to hear from DV practitioners if they have scenarios we can learn from. Drop a line in the comments! I'm looking forward to it.
rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com14tag:blogger.com,1999:blog-15319370.post-68901414153119408732017-03-30T00:52:00.002+02:002017-03-30T00:57:41.969+02:00Announcing Shivers - Visualizing SAP/HANA Information View DependenciesDear SAP HANA DBa's, developers, architects etc, we at <a href="http://www.just-bi.nl/" target="_just">Just BI</a> released <a href="https://github.com/just-bi/shivers">Shivers</a>.
<h4>What is Shivers?</h4>
Shivers stands for <b>S</b>AP/<b>H</b>ANA <b>I</b>nformation <b>V</b>iew<b>ers</b>. It is a tool to analyze and visualize dependencies between SAP/HANA information views (that is, Analytic Views, Attribute Views, and Calculation Views) and catalog objects (such as tables and views).
Below is a screenshot of Shivers so you can get an impression of what it looks like:
<br/>
<br/>
<a href="https://github.com/just-bi/shivers" target="_github"><img src="https://raw.githubusercontent.com/just-bi/shivers/master/doc/shivers-app-demo.png"/></a>
<h4>Why Shivers?</h4>
In our work as SAP- and Business Intelligence consultants, SAP/HANA information views are a key tool in delivering Business Intelligence and Custom App Development Solutions to our customers.
<br/>
<br/>
In some cases, information views can get quite complex. Visualization and documentation features offered by development tools, like HANA Studio, do not always provide the kind of insight we need to create or maintain our solutions.
<br/>
<br/>
One very particular and concrete case is creating technical documentation, or preparing presentation materials for knowledge transfer sessions to handover to the customer support organization. In the past, some of our consultants would manually piece together overview slides of the data objects (information views, tables) that make up our solutions. I decided to try to whip up a solution that would make tasks like this a little simpler.
<h4>How do I install and run Shivers?</h4>
The simplest option is to <a href="https://github.com/just-bi/shivers/archive/master.zip" target="_github">download a shivers.zip archive</a> from <a href="https://github.com/just-bi/shivers" target="_github">github</a>. You can then unzip the archive, and open the index.html file inside in a modern web browser (tested with Chrome and IE11, but it should work in all modern browsers).
<br/>
<br/>
Note that you do not need to install Shivers on a web server or HANA XS server. It runs directly from your local disk in your browser. Of course, you can install Shivers on a web server or HANA XS server if you like, and then you can open Shivers by navigating to the url corresponding to the place where you installed it. But this won't affect how Shivers works.
<h4>How do I load information views into Shivers so I can visualize them?</h4>
On the Shivers toolbar in the top of the application, you'll find a file chooser button. When you click that, a file browser will open. Use it to browse to whatever location on your local disk where you keep your information view source files.
<br/>
<br/>
For example, I have my HANA Studio workspace stored in C:\Users\rbouman\hana_work, so I browse to that location and then find the appropriate path in the subdirectories to the HANA System and package I'm interested in. When I find a directory containing information view source files, I select them and confirm the file chooser.
<br/>
<br/>
Shivers will then prompt for a package name. It needs to do this to know how other views could refer to this set of views. Since the package name is itself not stored in the information view source files, the user has to supply that information. Later, when analyzing dependencies, Shivers will use the entered package name and match it with package names it encounters inside the information view source files, whenever an information view refers to other information views.
<br/>
<br/>
After these steps, Shivers will populate the treeview on the left-hand side of the screen with the package structure and entries for the information views you loaded.
<br/>
<br/>
Note that Shivers has a log tab. The log tab will report about the loading process, and it will also report whenever it loads an information view that depends on other information views that are currently not loaded into Shivers. If you want to make complete and exhaustive graphs, you should then also load whichever file is reported in the log tab as an omitted dependency.
<br/>
<br/>
When the loaded files are visible in the treeview, you can click on any of the treenodes representing an information view. When you select one, a tab will open inside the Shivers window, showing a dependency graph of the selected information view.
<br/>
<br/>
Shivers currently does not do any clever layout of your dependency graphs. But you can manually drag and place items inside the graph to make it look good.
<br/>
<br/>
When you are done editing your graph, you can right click it, and choose "Export Image" to export the visualization to a png file, which you can then use in your technical documentation or knowledge transfer presentation slide deck.
<h4>What about HADES?</h4>
In my <a href="http://rpbouman.blogspot.nl/2016/10/sap-hana-on-which-base-columns-do-my.html" target="_me">previous blogpost</a> (<a target="_me" href="http://rpbouman.blogspot.nl/2016/10/sap-hana-on-which-base-columns-do-my.html">http://rpbouman.blogspot.nl/2016/10/sap-hana-on-which-base-columns-do-my.html</a>) I wrote about <a href="https://github.com/just-bi/hades" target="_github">HADES</a>, which is a bunch of (open source) utilities to report on information view metadata.
<br/>
<br/>
In a way, HADES and SHIVERS complement each other.
<br/>
<br/>
HADES is a set of server-side tools (so far, only stored routines) that help with the analysis of information views. HADES is data-oriented: it requires access to a HANA SQL client and allows you to extract just about any information you'll ever want to know about your information views. But HADES is in a way also very low-level, and using it effectively requires SQL skills.
<br/>
<br/>
Shivers is a graphically oriented client-side tool. It is implemented as a web page, that you start in your browser, directly from your file system. Shivers does not require any connection to a HANA Server. Rather, you use checked out information view source files (.analyticview, .attributeview and .calculationview files), load them into shivers, and then the tool does static code and dependency analysis. So far, possibilities to extend or influence reporting of Shivers are very limited. Shivers draws dependency graphs of your information views, and for now - that is it.
<h4>Why is Shivers an offline client-side tool?</h4>
Implementing Shivers as a non-server, client-side tool was a very deliberate decision. On our customers' systems, we cannot assume that we are allowed to install our tools on the HANA Server for just any purpose. So we really must have a solution that works regardless. What we can assume is that we have access to the information view source files, since we mostly work with HANA Studio and check out information view source files from the HANA repository all the time.
<h4>What are the terms and conditions for using Shivers?</h4>
Shivers is open source software under the <a href="https://github.com/just-bi/shivers/blob/master/LICENSE">Apache 2.0 License</a>.
This should give you all the freedom to copy, use, modify and distribute Shivers, free of charge, provided you respect the Apache License.
<br/>
<br/>
<h4>How can I get support for Shivers?</h4>
We at <a href="http://www.just-bi.nl">Just BI</a> are always happy to help if you run into any issue, but Shivers is not currently a for-profit solution. If necessary we can always negotiate professional support should you require it.
<h3>We Want your Feedback!</h3>
Please check out Shivers and let us know what you think. You can either drop me a line on this blog, or post issues in <a href="https://github.com/just-bi/shivers/issues" target="github">the github issue tracker</a>.
<br/>
<br/>
Don't be shy - bugs, new feature proposals, critique - everything is welcome. Just let us know :)
<br/>
<br/>
You can also <a target="_github" href="https://github.com/just-bi/shivers/issues#fork-destination-box">fork the shivers project</a>, and contribute your work back via a pull request.
rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com1tag:blogger.com,1999:blog-15319370.post-14310632421158076622016-10-24T22:30:00.000+02:002016-10-25T22:03:05.284+02:00SAP HANA: On which base columns do my information views depend?<p>
For one of <a href="http://just-bi.nl/">Just-BI</a>'s customers, we're currently working to productize a custom-built proof-of-concept SAP/HANA application.
</p>
<p>
This particular application is quite typical with regard to how we use SAP/HANA features and techniques: it is a web-application for desktop and tablet devices, that we serve through <a href="https://eaexplorer.hana.ondemand.com/_item.html?id=11384#!/overview">SAP/HANA's XS engine</a>. For database communication, we mostly use <a href="http://www.odata.org/documentation/odata-version-2-0/overview/">OData</a> (using <a href="https://help.sap.com/saphelp_hanaplatform/helpdata/en/7c/c43e570b5648d69231fbd7a9c7bf90/content.htm?frameset=/en/b8/0f8b626b3d44f882e8f2c3ff45952d/frameset.htm&current_toc=/en/6e/284b62132c41caa173bf590e9be084/plain.htm&node_id=128&show_children=false">XS OData Services</a>), as well as the odd <a href="https://help.sap.com/saphelp_hanaplatform/helpdata/en/90/878018cccd40f7a4b6754c04e2d34a/content.htm?frameset=/en/a2/acd502e9544de298c3959f250127f5/frameset.htm&current_toc=/en/6e/284b62132c41caa173bf590e9be084/plain.htm&node_id=176&show_children=false">xsjs</a> request (in this case, to offer MS Excel export using <a href="http://rpbouman.blogspot.nl/2016/05/odxl-generic-data-export-layer-for.html">ODXL</a>)
</p>
<p>
The OData services that our application uses are mostly backed by <a href="http://help.sap.com/saphelp_hanaplatform/helpdata/en/2b/914d0dec5a4e928a98e6b69f5347ec/content.htm?frameset=/en/dc/a9644841514bdea35d2825eff02c58/frameset.htm&current_toc=/en/51/778a327cf443f3941013f30d8cc003/plain.htm&node_id=7&show_children=false">SAP/HANA Calculation views</a>. These, in turn, are built on top of a mixed bag of objects:</p><ul>
<li>Some of these are custom base tables that belong to just our application;</li>
<li>Some are base tables that collect output from an advanced analytics recommendations algorithm that runs in an external R server</li>
<li>Some are information views (analytic views, attribute views and calculation views) that form a virtual datamart (of sorts) on top of base tables replicated from various SAP ERP source systems to our SAP/HANA database.</li>
</ul>
<p>
One of the prerequisites to productize the current solution is a re-design of the backend. Redesign is required because the new target system will be fed from even more ERP source systems than our proof-of-concept environment, and the new backend will need to align the data from all these different ERP implementations. In addition, the R algorithm will be optimized as well: in the proof-of-concept environment, the advanced analytics algorithm passes through a number of fields for convenience that will need to be acquired from elsewhere in the production environment.
</p>
<p>
To facilitate the redesign we need to have accurate insight into which base columns are ultimately used to implement our application's data services.
As it turns out, this is not so easily obtainable using standard tools. So, we developed something ourselves. We think this may be useful for others as well, which is why we'd like to share it with you through this blog.
</p>
<h3>Information View Dependencies</h3>
<p>
The standard toolset offers some support to obtain dependencies for information views (analytic views, attribute views and calculation views):
</p>
<ul>
<li>If you're a HANA Studio user, you might be able to use the "Where-used-list" and/or "Column lineage" features. Check out <a href="https://www.linkedin.com/pulse/hana-modeler-impact-analysis-refactoring-calculation-views-krishna?articleId=7946314775951594313">Krishnamoh Krishna's wonderful blog</a> about this topic.</li>
<li>You can query the <code><a href="https://help.sap.com/saphelp_hanaplatform/helpdata/en/20/cbd12e7519101489c7cfcd0f32868d/content.htm">OBJECT_DEPENDENCIES</a></code> system view</li>
</ul>
<p>
As it turns out, these standard tools do not give us the detailed information that we need.
The HANA studio features are mainly useful when designing and modifying information views, but do not let us obtain an overview of all dependencies, and not in a way that we can easily use outside of HANA Studio.
The usefulness of querying the <code>OBJECT_DEPENDENCIES</code> system view is limited by the fact that it only reports objects - that is, base tables or information views - but not the columns contained therein.
</p>
<p>
It looks like <a href="https://archive.sap.com/discussions/thread/3621015">we're not the only ones struggling with this issue</a>.
</p>
<h3>Getting the information view's definition as XML from _SYS_REPO.ACTIVE_OBJECT</h3>
<p>
To get the kind of information we need, we're just going to have to crack open the definition of the information view and look what's inside.
As it turns out, HANA stores this as XML in the <code>CDATA</code> column of the <code>_SYS_REPO.ACTIVE_OBJECT</code> system table, and we can query it by package name, object name and object suffix (which is basically the extension of the file containing the definition that is stored in the repository):
</p>
<pre>
SELECT <b>CDATA</b>
FROM <b>_SYS_REPO.ACTIVE_OBJECT</b>
WHERE PACKAGE_ID = <span style="color:red">'my.package.name'</span>
AND OBJECT_NAME = <span style="color:red">'CA_MY_CALCULATION_VIEW'</span>
AND OBJECT_SUFFIX = <span style="color:red">'calculationview'</span>
</pre>
<p>
With some effort, <code>_SYS_REPO.ACTIVE_OBJECT</code> can be joined to <code>OBJECT_DEPENDENCIES</code> to discover the objects on which the information view depends:
</p>
<pre>
SELECT od.BASE_SCHEMA_NAME
, od.BASE_OBJECT_NAME
, od.BASE_OBJECT_TYPE
FROM _SYS_REPO.ACTIVE_OBJECT ao
<b>INNER JOIN OBJECT_DEPENDENCIES od
ON <span style="color:red">'_SYS_BIC'</span> = od.DEPENDENT_SCHEMA_NAME
AND ao.PACKAGE_ID||<span style="color:red">'/'</span>||ao.OBJECT_NAME = od.DEPENDENT_OBJECT_NAME
AND <span style="color:red">'VIEW'</span> = od.DEPENDENT_OBJECT_TYPE</b>
WHERE ao.PACKAGE_ID = <span style="color:red">'my.package.name'</span>
AND ao.OBJECT_NAME = <span style="color:red">'CA_MY_CALCULATION_VIEW'</span>
AND ao.OBJECT_SUFFIX = <span style="color:red">'calculationview'</span>
</pre>
<p>
(Note: <code>OBJECT_DEPENDENCIES</code> reports all dependencies, not just direct dependencies)
</p>
<p>
Or we can query the other way around, and find the corresponding model for a dependency we found in <code>OBJECT_DEPENDENCIES</code>:
</p>
<pre>
SELECT ao.PACKAGE_ID
, ao.OBJECT_NAME
, ao.OBJECT_SUFFIX
, ao.CDATA
FROM object_dependencies od
<b>INNER JOIN _SYS_REPO.ACTIVE_OBJECT ao
ON SUBSTR_BEFORE(od.base_object_name, <span style="color:red">'/'</span>) = ao.package_id
AND SUBSTR_AFTER(od.base_object_name, <span style="color:red">'/'</span>) = ao.object_name</b>
AND ao.object_suffix in (
<span style="color:red">'analyticview'</span>
, <span style="color:red">'attributeview'</span>
, <span style="color:red">'calculationview'</span>
)
WHERE od.DEPENDENT_SCHEMA_NAME = <span style="color:red">'_SYS_BIC'</span>
AND od.DEPENDENT_OBJECT_NAME = <span style="color:red">'my.package.name/CA_MY_CALCULATION_VIEW'</span>
AND od.DEPENDENT_OBJECT_TYPE = <span style="color:red">'VIEW'</span>
</pre>
<p>
<b>NOTE:</b> It turns out that querying <code>OBJECT_DEPENDENCIES</code> fails at reporting dependencies between analytic views and the attribute views they use.
To capture those dependencies, you need to query <code>_SYS_REPO.ACTIVE_OBJECTCROSSREF</code>.
</p>
<h3>Parsing the information view's XML definition with stored procedure <code>p_parse_xml</code></h3>
<p>
Once we obtained the XML that defines the information view, we still need to pull it apart so we can figure out how it is tied to our base table columns.
To do that, we first apply a general XML parser that turns the XML text into a (temporary) table or table variable, such that each row represents a distinct, atomic element inside the XML document.
For this purpose I developed a HANA stored procedure called <code><a href="https://github.com/just-bi/hades/blob/master/procedures/p_parse_xml.sql">p_parse_xml</a></code>. Here is its signature:
</p>
<pre>
create PROCEDURE p_parse_xml (
<span style="color:grey">-- XML string to parse</span>
p_xml nclob
<span style="color:grey">-- Parse tree is returned as a table variable</span>
, out p_dom table (
<span style="color:grey">-- unique id of the node</span>
node_id int
<span style="color:grey">-- id of the parent node</span>
, parent_node_id int
<span style="color:grey">-- dom node type constant: 1=element, 2=attribute, 3=text, 4=cdata,</span>
<span style="color:grey">-- 5=entityref, 6=entity, 7=processing instruction, </span>
<span style="color:grey">-- 8=comment, 9=document, 10=document type, </span>
<span style="color:grey">-- 11=document fragment, 12=notation</span>
, node_type tinyint
<span style="color:grey">-- dom node name: tagname for element, attribute name for attribute,</span>
<span style="color:grey">-- target for processing instruction,</span>
<span style="color:grey">-- document type name for document type,</span>
<span style="color:grey">-- "#text" for text and cdata, "#comment" for comment,</span>
<span style="color:grey">-- "#document" for document, "#document-fragment" for document fragment.</span>
, node_name nvarchar(64)
<span style="color:grey">-- dom node value: text for text, comment, and cdata nodes, data for processing instruction node, null otherwise.</span>
, node_value nclob
<span style="color:grey">-- raw token from the parser</span>
, token_text nclob
<span style="color:grey">-- character position of token</span>
, pos int
<span style="color:grey">-- length of token.</span>
, len int
)
<span style="color:grey">-- flag whether to strip text nodes that only contain whitespace from the parse tree</span>
, p_strip_empty_text tinyint default 1
)
</pre>
<p>
Note that you can <a href="https://github.com/just-bi/hades/blob/master/procedures/p_parse_xml.sql">download the source code for the entire procedure from github</a>.
The <code>p_parse_xml</code> procedure depends on <a href="https://github.com/just-bi/hades/blob/master/procedures/p_decode_xml_entities.sql"><code>p_decode_xml_entities</code></a>,
so if you want to run it yourself, be sure to install that first.
</p>
<p>
To see how you can use this, consider the following, simple example:
</p>
<pre>
call p_parse_xml(
'<parent-element attribute1="value1">
<child-element attribute2="value2" attribute3="value3">
text-content1
</child-element>
<child-element att="bla">
text-content2
</child-element>
</parent-element>', ?);
</pre>
<p>This gives us the following result:</p>
<pre>
+---------+----------------+-----------+----------------+-----------------------------+-------------------------------------------------------------+-----+-----+
| NODE_ID | PARENT_NODE_ID | NODE_TYPE | NODE_NAME | NODE_VALUE | TOKEN_TEXT | POS | LEN |
+---------+----------------+-----------+----------------+-----------------------------+-------------------------------------------------------------+-----+-----+
| 0 | ? | 9 | #document | ? | ? | 1 | 221 |
| 1 | 0 | 1 | parent-element | ? | <parent-element attribute1=\"value1\"> | 1 | 36 |
| 2 | 1 | 2 | attribute1 | value1 | attribute1=\"value1\" | 2 | 20 |
| 3 | 1 | 1 | child-element | ? | <child-element attribute2=\"value2\" attribute3=\"value3\"> | 41 | 55 |
| 4 | 3 | 2 | attribute2 | value2 | attribute2=\"value2\" | 42 | 20 |
| 5 | 3 | 2 | attribute3 | value3 | attribute3=\"value3\" | 62 | 20 |
| 6 | 3 | 3 | #text | text-content1 | text-content1 | 96 | 23 |
| 7 | 1 | 1 | child-element | ? | <child-element att=\"bla\"> | 139 | 25 |
| 8 | 7 | 2 | att | bla | att=\"bla\" | 140 | 10 |
| 9 | 7 | 3 | #text | text-content2 | text-content2 | 164 | 23 |
+---------+----------------+-----------+----------------+-----------------------------+-------------------------------------------------------------+-----+-----+
</pre>
<p>
The result is a tabular representation of the XML parse tree. Each row essentially represents a <a href="https://www.w3.org/TR/REC-DOM-Level-1/level-one-core.html#ID-1950641247">DOM Node</a>, and the column values represent the node's properties:
</p>
<ul>
<li>The <code>NODE_TYPE</code> column tells us what kind of node we're dealing with. Values in this column conform to the w3c standard document object model (DOM) enumeration of node type values.
The most important ones are 1 for element nodes ("tags"); 2 for attributes, and 3 for text. The entire parse tree is contained in a document node, which has node type 9.
</li>
<li>The <code>NODE_ID</code> is the unique identifier of the node while <code>PARENT_NODE_ID</code> points to whatever node is considered the parent node of the current node.
The parent node is basically the container of the node.
As you can see, the element with <code>NODE_ID=3</code> has the element node with <code>NODE_ID=1</code> as parent.
These correspond to the first <code><child-element></code> and <code><parent-element></code> elements in the document.
Attribute nodes are also marked as children of the element to which they belong. The DOM standard does not consider attributes children of their respective element node, but <code>p_parse_xml</code> does, mainly to keep the result table as simple as possible.
</li>
<li>The <code>NODE_NAME</code> column is a further characterization of what kind of node we're dealing with. For most node types, the node name is a constant value which is essentially a friendly name for the node type. For example, document nodes (<code>NODE_TYPE=9</code>) always have <code>#document</code> as <code>NODE_NAME</code>, and text nodes (<code>NODE_TYPE=3</code>) always have <code>#text</code> as <code>NODE_NAME</code>.
For element nodes and attribute nodes (<code>NODE_TYPE</code> is <code>1</code> and <code>2</code> respectively), the <code>NODE_NAME</code> is not constant. Rather, their node name conveys information about the meaning of the node and its contents. In other words, element and attribute names are metadata.
</li>
<li>The <code>NODE_VALUE</code> column contains actual data. For element and document nodes, it is always NULL. For attributes, the <code>NODE_VALUE</code> column contains the attribute value, and for text nodes, it is the text content.</li>
<li>The <code>POS</code> column lists the position where the current element was found; the <code>LEN</code> column keeps track of the length of the current item as it appears in the document. Typically you won't need these columns, except maybe for debugging purposes. The <code>TOKEN_TEXT</code> column is also here mostly for debugging purposes.</li>
</ul>
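<p>To make the row layout concrete, here is a small plain-JavaScript sketch of such a parse tree (the rows and node ids are invented for illustration; the property names mirror the <code>p_parse_xml</code> output columns, and the <code>childrenOf()</code> helper does exactly what the SQL self-joins in this post do):</p>

```javascript
// Hypothetical rows as a parser like p_parse_xml might produce for:
// <parent-element attr1="value1"><child-element>text-content</child-element></parent-element>
// NODE_TYPE: 1 = element, 2 = attribute, 3 = text, 9 = document.
const rows = [
  {nodeId: 0, parentNodeId: null, nodeType: 9, nodeName: "#document",      nodeValue: null},
  {nodeId: 1, parentNodeId: 0,    nodeType: 1, nodeName: "parent-element", nodeValue: null},
  {nodeId: 2, parentNodeId: 1,    nodeType: 2, nodeName: "attr1",          nodeValue: "value1"},
  {nodeId: 3, parentNodeId: 1,    nodeType: 1, nodeName: "child-element",  nodeValue: null},
  {nodeId: 4, parentNodeId: 3,    nodeType: 3, nodeName: "#text",          nodeValue: "text-content"}
];

// Finding a node's children is a simple scan on parentNodeId.
// Note that attributes show up as children too, per the p_parse_xml convention.
function childrenOf(rows, nodeId) {
  return rows.filter(row => row.parentNodeId === nodeId);
}
```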
<h3>Extracting Base Columns from Analytic and Attribute views</h3>
<p>
If you examine the XML definition of Analytic and/or Attribute views, you'll notice that table base columns are referenced by <code><keyMapping></code> and <code><measureMapping></code>-elements like this:</p>
<pre>
<keyMapping schemaName="...database schema..." columnObjectName="...table name..." columnName="...column name..."/>
</pre>
<p>
So, assuming we already parsed the model of an analytic or attribute view using <code>p_parse_xml</code> and captured its result in a table variable called <code>tab_dom</code>, we can run a query like this to obtain all <code><keyMapping></code> and <code><measureMapping></code>-elements:
</p>
<pre>
select mapping.*
from :tab_dom mapping
<b>where mapping.node_type = <span style="color:red">1</span> <span style="color: gray">-- get us all elements</span>
and mapping.node_name in (<span style="color:red">'keyMapping'</span> <span style="color: gray">-- with tagnames 'keyMappping' or 'measureMappping'</span>
,<span style="color:red">'measureMapping'</span>)</b>
</pre>
<p>While this gives us the elements themselves, the data we're actually interested in is buried in the attributes of the <code><keyMapping></code> and <code><measureMapping></code>-elements. You might recall that in the <code>p_parse_xml</code> result, attribute nodes have <code>NODE_TYPE=2</code> and appear as child nodes of their respective elements. So, we can extract all attributes of all <code><keyMapping></code> and <code><measureMapping></code>-elements with a self-join like this:</p>
<pre>
select mapping_attributes.*
from :tab_dom mapping
<b>inner join :tab_dom mapping_attributes
on mapping.node_id = mapping_attributes.parent_node_id <span style="color: gray">-- find all nodes that have the keymapping element node as parent </span>
and <span style="color:red">2</span> = mapping_attributes.node_type <span style="color: gray">-- but only if their node type indicates they are attribute nodes</span></b>
where mapping.node_type = <span style="color:red">1</span>
and mapping.node_name in (<span style="color:red">'keyMapping'</span>, <span style="color:red">'measureMapping'</span>)
</pre>
<p>
Since we are interested in not just any attribute node, but attribute nodes having specific names like <code>schemaName</code>, <code>columnObjectName</code> and <code>columnName</code>, we should put a further restriction on the <code>NODE_NAME</code> of these attribute nodes. Also note that this query will potentially give us multiple rows per <code><keyMapping></code> or <code><measureMapping></code>-element (in fact, just as many as there are attributes). Since we'd like to have just one row for each <code><keyMapping></code> or <code><measureMapping></code>-element having the values of its <code>schemaName</code>, <code>columnObjectName</code> and <code>columnName</code> attributes in separate columns, we should rewrite this query so that each attribute gets its own self-join.
</p>
<p>Thus, the final query becomes:</p>
<pre>
<b>select mapping_schemaName.node_value as schema_name
, mapping_columnObjectName.node_value as table_name
, mapping_columnName.node_value as column_name</b>
from :tab_dom mapping
inner join :tab_dom <b>mapping_schemaName</b> <span style="color: gray">-- get the attribute called 'schemaName'</span>
on mapping.node_id = mapping_schemaName.parent_node_id
and <span style="color:red">2</span> = mapping_schemaName.node_type
<b>and <span style="color:red">'schemaName'</span> = mapping_schemaName.node_name </b>
inner join :tab_dom <b>mapping_columnObjectName</b> <span style="color: gray">-- get the attribute called 'columnObjectName'</span>
on mapping.node_id = mapping_columnObjectName.parent_node_id
and <span style="color:red">2</span> = mapping_columnObjectName.node_type
<b>and <span style="color:red">'columnObjectName'</span> = mapping_columnObjectName.node_name </b>
inner join :tab_dom <b>mapping_columnName</b> <span style="color: gray">-- get the attribute called 'columnName'</span>
on mapping.node_id = mapping_columnName.parent_node_id
and <span style="color:red">2</span> = mapping_columnName.node_type
<b>and <span style="color:red">'columnName'</span> = mapping_columnName.node_name </b>
where mapping.node_type = <span style="color:red">1</span>
and mapping.node_name in (<span style="color:red">'keyMapping'</span>, <span style="color:red">'measureMapping'</span>)
</pre>
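<p>To see why one self-join per attribute yields exactly one row per element, here is the same pivot mimicked in plain JavaScript (the rows, ids and helper names are invented for the illustration; this shows the join logic, not HANA code):</p>

```javascript
// Invented rows standing in for a parsed <keyMapping> element and its attributes.
const dom = [
  {nodeId: 10, parentNodeId: 9,  nodeType: 1, nodeName: "keyMapping",       nodeValue: null},
  {nodeId: 11, parentNodeId: 10, nodeType: 2, nodeName: "schemaName",       nodeValue: "MY_SCHEMA"},
  {nodeId: 12, parentNodeId: 10, nodeType: 2, nodeName: "columnObjectName", nodeValue: "MY_TABLE"},
  {nodeId: 13, parentNodeId: 10, nodeType: 2, nodeName: "columnName",       nodeValue: "MY_COLUMN"}
];

// One lookup per attribute name: the counterpart of one SQL self-join per attribute.
function attr(dom, elementId, name) {
  const node = dom.find(n =>
    n.parentNodeId === elementId && n.nodeType === 2 && n.nodeName === name
  );
  return node ? node.nodeValue : null;
}

// One output row per mapping element, with each attribute in its own column.
const baseColumns = dom
  .filter(n => n.nodeType === 1 && ["keyMapping", "measureMapping"].includes(n.nodeName))
  .map(n => ({
    schema_name: attr(dom, n.nodeId, "schemaName"),
    table_name:  attr(dom, n.nodeId, "columnObjectName"),
    column_name: attr(dom, n.nodeId, "columnName")
  }));
```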
<h3>Extracting base columns from Calculation views</h3>
<p>
Getting the base columns used in calculation views is a bit more work.
However, the good news is that, in terms of the queries we need to write, it does not get much more complicated than what we saw for analytic and attribute views in the previous section.
Querying the xml parse tree almost always boils down to finding elements and finding their attributes, and then doing something with their values.
</p>
<p>
The reason it is more work to write queries against the model underlying calculation views is that the XML documents defining calculation views use an extra level of mapping between the objects that represent the source of the columns and the way these columns are used inside the view.
The following snippet might illustrate this:
</p>
<pre>
<Calculation:scenario ...>
...
<dataSources>
...
<DataSource id=<span style="color:red">"...some id used to refer to this datasource..."</span> type=<span style="color:red">"DATA_BASE_TABLE"</span>>
...
<columnObject schemaName=<span style="color:red">"...db schema name..."</span> columnObjectName=<span style="color:red">"...table name..."</span>/>
...
</DataSource>
...
</dataSources>
...
<calculationViews>
...
<calculationView>
...
<input node=<span style="color:red">"#...id of a DataSource element..."</span>>
...
<mapping source=<span style="color:red">"...name of a column used as input..."</span> ... >
...
</input>
...
</calculationView>
...
</calculationViews>
...
</Calculation:scenario>
</pre>
<p>
The method for finding the base columns can be summarized as follows:
</p>
<ol>
<li>
<p>
Get all <code><DataSource></code>-elements having a <code>type</code>-attribute with the value <code>"DATA_BASE_TABLE"</code>.
These elements represent all base tables used by this view. Other types of objects used by this view will have another value for the <code>type</code>-attribute.
</p>
<p>
To obtain the schema and table name of the base table, find the <code><columnObject></code>-child element of the <code><DataSource></code>-element.
Its <code>schemaName</code> and <code>columnObjectName</code>-attributes respectively contain the database schema and table name of the base table.
</p>
<p>
The <code><DataSource></code>-elements have an <code>id</code>-attribute, and its value is used as a unique identifier to refer to this data source.
</p>
</li>
<li>
<p>
Find all instances where the base table datasources are used.
</p>
<p>
A calculation view is essentially a graph of data transformation steps, each of which takes one or more streams of data as input and turns them into a stream of output data.
In the XML document that defines the calculation view, these transformation steps are represented by <code><calculationView></code>-elements.
These <code><calculationView></code>-elements contain one or more <code><input></code>-child elements, each of which represents a data stream that is used as input for the transformation step.
</p>
<p>
The <code><input></code>-elements have a <code>node</code>-attribute.
The value of the <code>node</code>-attribute is the value of the <code>id</code>-attribute of whatever element it refers to, prefixed by a hash-sign (<code>#</code>).
</p>
<p>
Note that this is a general technique to reference elements within the same XML document.
So, in order to find where a <code><DataSource></code>-element is used,
it is enough to find all elements in the same XML document whose <code>node</code>-attribute references the value of the <code><DataSource></code>-element's <code>id</code>-attribute.
</p>
</li>
<li>
<p>
Once we have the elements that refer to our <code><DataSource></code>-element, we can find out which columns from the data source are used by looking for <code><mapping></code>-child elements.
</p>
<p>
The <code><mapping></code>-elements have a <code>source</code>-attribute, which holds the column-name.
</p>
</li>
</ol>
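<p>The three steps above can be sketched in plain JavaScript (the rows, ids and helper names are invented for the illustration; the actual procedure expresses the same logic as SQL self-joins on the parse-tree table):</p>

```javascript
// Invented parse-tree rows for a calculation view fragment: a DataSource "ds1"
// (a base table) and a calculationView input that references it via node="#ds1".
const dom = [
  {nodeId: 1,  parentNodeId: 0, nodeType: 1, nodeName: "DataSource",       nodeValue: null},
  {nodeId: 2,  parentNodeId: 1, nodeType: 2, nodeName: "id",               nodeValue: "ds1"},
  {nodeId: 3,  parentNodeId: 1, nodeType: 2, nodeName: "type",             nodeValue: "DATA_BASE_TABLE"},
  {nodeId: 4,  parentNodeId: 1, nodeType: 1, nodeName: "columnObject",     nodeValue: null},
  {nodeId: 5,  parentNodeId: 4, nodeType: 2, nodeName: "schemaName",       nodeValue: "MY_SCHEMA"},
  {nodeId: 6,  parentNodeId: 4, nodeType: 2, nodeName: "columnObjectName", nodeValue: "MY_TABLE"},
  {nodeId: 7,  parentNodeId: 0, nodeType: 1, nodeName: "input",            nodeValue: null},
  {nodeId: 8,  parentNodeId: 7, nodeType: 2, nodeName: "node",             nodeValue: "#ds1"},
  {nodeId: 9,  parentNodeId: 7, nodeType: 1, nodeName: "mapping",          nodeValue: null},
  {nodeId: 10, parentNodeId: 9, nodeType: 2, nodeName: "source",           nodeValue: "MY_COLUMN"}
];

const attr = (elId, name) => {
  const n = dom.find(x => x.parentNodeId === elId && x.nodeType === 2 && x.nodeName === name);
  return n ? n.nodeValue : null;
};

const baseColumns = [];
// Step 1: DataSource elements whose type-attribute is DATA_BASE_TABLE.
dom.filter(n => n.nodeType === 1 && n.nodeName === "DataSource")
   .filter(ds => attr(ds.nodeId, "type") === "DATA_BASE_TABLE")
   .forEach(ds => {
     const co  = dom.find(n => n.parentNodeId === ds.nodeId && n.nodeName === "columnObject");
     const ref = "#" + attr(ds.nodeId, "id");
     // Step 2: node-attributes that reference this DataSource's id, prefixed with '#'.
     dom.filter(n => n.nodeType === 2 && n.nodeName === "node" && n.nodeValue === ref)
        .forEach(usage => {
          // Step 3: mapping child elements; their source-attribute is the base column name.
          dom.filter(n => n.parentNodeId === usage.parentNodeId && n.nodeName === "mapping")
             .forEach(m => baseColumns.push({
               schema_name: attr(co.nodeId, "schemaName"),
               table_name:  attr(co.nodeId, "columnObjectName"),
               column_name: attr(m.nodeId, "source")
             }));
        });
   });
```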
<p>
With these steps in mind, the SQL query we need to do on the calculation view parse tree becomes:
</p>
<pre>
select distinct
ds_co_schemaName.node_value schema_name
, ds_co_columnObjectName.node_value table_name
, ds_usage_mapping_source.node_value column_name
<span style="color:grey">--
-- ds: DataSource elements (Note the WHERE clause)
-- </span>
from :tab_dom ds
<span style="color:grey">--
-- ds_type: demand that the value of the type-attribute of the DataSource elements equal 'DATA_BASE_TABLE'
-- this ensures we're only looking at base tables.
--</span>
inner join :tab_dom ds_type
on ds.node_id = ds_type.parent_node_id
and <span style="color:red">2</span> = ds_type.node_type
and <span style="color:red">'type'</span> = ds_type.node_name
and <span style="color:red">'DATA_BASE_TABLE'</span> = cast(ds_type.node_value as varchar(<span style="color:red">128</span>))
<span style="color:grey">--
-- ds_co: get the columnObject childelement of the DataSource element.
-- Also, get the schemaName and columnObjectName attributes of that columnObject-element.
--</span>
inner join :tab_dom ds_co
on ds.node_id = ds_co.parent_node_id
and <span style="color:red">1</span> = ds_co.node_type
and <span style="color:red">'columnObject'</span> = ds_co.node_name
inner join :tab_dom ds_co_schemaName
on ds_co.node_id = ds_co_schemaName.parent_node_id
and <span style="color:red">2</span> = ds_co_schemaName.node_type
and <span style="color:red">'schemaName'</span> = ds_co_schemaName.node_name
inner join :tab_dom ds_co_columnObjectName
on ds_co.node_id = ds_co_columnObjectName.parent_node_id
and <span style="color:red">2</span> = ds_co_columnObjectName.node_type
and <span style="color:red">'columnObjectName'</span> = ds_co_columnObjectName.node_name
<span style="color:grey">--
-- ds_id: get the id-attribute of the DataSource element.
--</span>
inner join :tab_dom ds_id
on ds.node_id = ds_id.parent_node_id
and <span style="color:red">2</span> = ds_id.node_type
and <span style="color:red">'id'</span> = ds_id.node_name
<span style="color:grey">--
-- ds_usage: find any attributes that refer to the id of the DataSource
--</span>
inner join :tab_dom ds_usage
on <span style="color:red">'node'</span> = ds_usage.node_name
and <span style="color:red">2</span> = ds_usage.node_type
and <span style="color:red">'#'</span>||ds_id.node_value = cast(ds_usage.node_value as nvarchar(<span style="color:red">128</span>))
<span style="color:grey">--
-- ds_usage_mapping: find any mapping child elements of the node that references the DataSource
--</span>
inner join :tab_dom ds_usage_mapping
on <span style="color:red">'mapping'</span> = ds_usage_mapping.node_name
and <span style="color:red">1</span> = ds_usage_mapping.node_type
and ds_usage.node_id = ds_usage_mapping.parent_node_id
<span style="color:grey">--
-- ds_usage_mapping_source: get the source-attribute of the mapping elements. These are our base column names.
--</span>
inner join :tab_dom ds_usage_mapping_source
on <span style="color:red">'source'</span> = ds_usage_mapping_source.node_name
and <span style="color:red">2</span> = ds_usage_mapping_source.node_type
and ds_usage_mapping.node_id = ds_usage_mapping_source.parent_node_id
where ds.node_type = <span style="color:red">1</span>
and ds.node_name = <span style="color:red">'DataSource'</span>
</pre>
<h3>Putting it all together</h3>
To recapitulate, we discussed:<ol>
<li>How to do general queries for dependencies using <code>OBJECT_DEPENDENCIES</code>, but that you need to query <code>_SYS_REPO.ACTIVE_OBJECTCROSSREF</code> to find out which Attribute views are used by Analytic views.</li>
<li>How to find the model XML code underlying our information views from the <code>_SYS_REPO.ACTIVE_OBJECT</code> table.</li>
<li>How to parse XML, and how to query the parse tree for elements and attributes.</li>
<li>How the XML documents for information views are structured, and how to find the base columns used in their models.</li>
</ol>
<p>
With all these bits and pieces of information, we can finally create a procedure that fulfills the original requirement to obtain the base columns used by our information views.
This is available as the <code><a href="https://github.com/just-bi/hades/blob/master/procedures/p_get_view_basecols.sql">p_get_view_basecols</a></code> stored procedure.
Here is its signature:
</p>
<pre>
create PROCEDURE p_get_view_basecols (
<span style="color:grey">-- package name pattern. Used to match packages containing analytic, attribute or calculation views. Can contain LIKE wildcards.</span>
p_package_id nvarchar(<span style="color:red">255</span>)
<span style="color:grey">-- object name pattern. Used to match name of analytic, attribute or calculation views. Can contain LIKE wildcards.</span>
, p_object_name nvarchar(<span style="color:red">255</span>) default <span style="color:red">'%'</span>
<span style="color:grey">-- object suffix pattern. Can be used to specify the type of view. Can contain LIKE wildcards.</span>
, p_object_suffix nvarchar(<span style="color:red">255</span>) default <span style="color:red">'%'</span>
<span style="color:grey">-- flag to indicate whether to recursively analyze analytic, attribute or calculation views on which the view to be analyzed depends. </span>
<span style="color:grey">-- 0 means only look at the given view, 1 means also look at underlying views.</span>
, p_recursive tinyint default <span style="color:red">1</span>
<span style="color:grey">-- result table: base columns on which the specified view(s) depends.</span>
, out p_cols table (
<span style="color:grey">-- schema name of the referenced base column</span>
schema_name nvarchar(<span style="color:red">128</span>)
<span style="color:grey">-- table name of the referenced base column</span>
, table_name nvarchar(<span style="color:red">128</span>)
<span style="color:grey">-- column name of the referenced base column</span>
, column_name nvarchar(<span style="color:red">128</span>)
<span style="color:grey">-- list of view names that depend on the base column</span>
, views nclob
)
)
</pre>
<p>
Obtaining the list of base columns on which our application depends is now as simple as calling the procedure, like so:
</p>
<pre>
call p_get_view_basecols(
<span style="color:grey">-- look in our application package (and its subpackages)</span>
<span style="color:red">'our.application.package.%'</span>
<span style="color:grey">-- consider all information views</span>
, <span style="color:red">'%'</span>
<span style="color:grey">-- consider all types of information views</span>
, <span style="color:red">'%'</span>
<span style="color:grey">-- consider also information views upon which our information views depend</span>
, <span style="color:red">1</span>
<span style="color:grey">-- put the results into our output table</span>
, ?
);
</pre>
<h3>Finally</h3>
<p>
I hope you enjoyed this post!
Feel free to leave a comment to share your insights or to give feedback.
</p>
<p>
Please note that all source code for this topic is freely available as open source software in our <a href="https://github.com/just-bi/hades">just-bi/hades github repository</a>.
You are free to use, modify and distribute it, as long as you respect the copyright notice.
</p>
<p>
We welcome contributions! You can contribute in many ways:
</p>
<ul>
<li>Simply use the procedures. Give us feedback. You can do so by leaving a comment on this blog.</li>
<li>Spread the word: tell your colleagues, and maybe tweet or write a blog post about it. Please use hashtag #justbihades.</li>
<li>Share your requirements. <a href="https://github.com/just-bi/hades/issues">Create an issue to ask for more features</a> so we can improve our software.</li>
<li><a href="https://github.com/just-bi/hades#fork-destination-box">Fork it!</a>. Send us <a href="https://github.com/just-bi/hades/pulls">pull requests</a>. We welcome your contribution and we will fully attribute you!</li>
</ul>
rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com1tag:blogger.com,1999:blog-15319370.post-3273712629800534412016-09-22T10:55:00.000+02:002016-09-22T10:56:05.589+02:00SAP UI5: Internationalization for each view - Addendum for Nested Views <p>
After writing my previous post on <a href="http://rpbouman.blogspot.nl/2016/09/sap-ui5-per-view-internationalization.html">SAP UI5: Per-view Internationalization</a>, I found out that the solution does not work completely as intended when using nested views.
</p>
<p>
If you're using nested views, each view would still have its own set of unique texts that are entirely specific to just that view, and for those cases, the solution as described still works. But there might also be a number of texts that are shared by both the outer and one or more of the inner views. It would make sense to be able to define those texts in the i18n model at the level of the outer view, and have the i18n models of the nested view pick up and enhance the i18n model of the outer view.
</p>
<h2>Problem: <code>onInit()</code> is not the right place to initialize the i18n model</h2>
<p>
The problem with the original solution is that the <code>onInit()</code> method of the nested views gets called before that of the outer view. It makes sense - the whole can be initialized only after its parts have been initialized. But this does mean that the <code>onInit()</code> method is not the right place to initialize the i18n model.</p>
<p>Please consider these lines from the <code>_initI18n()</code> method that I proposed to initialize the i18n model:</p>
<pre>
<span style="color:grey">//Use the bundledata to create or enhance the i18n model</span>
var i18nModel = this.getModel(i18n);
if (i18nModel) {
i18nModel.enhance(bundleData);
}
else {
i18nModel = new ResourceModel(bundleData);
}
<span style="color:grey">//set this i18n model.</span>
this.setModel(i18nModel, i18n);
</pre>
<p>
Suppose this code runs as part of a nested view's <code>onInit()</code>. The call to <code>getModel()</code> will try to acquire the i18n model that is already set, or else the i18n model of the owner component. That's how the <code>getModel()</code> method in the base controller works (<a href="http://rpbouman.blogspot.nl/2016/09/sap-ui5-per-view-internationalization.html">please see my previous blog post to review that code</a>).</p>
<p>Now, at this point, no i18n model has been set for the view, and so the owner component's i18n model will be picked up. The i18n model of the outer view will however never be found, since the <code>onInit()</code> of the controller of the outer view has not been called yet (and therefore, its <code>_initI18n()</code> has not been called either).</p>
<h2>Solution: Use <code>onBeforeRendering()</code> rather than <code>onInit()</code></h2>
<p>
It turns out that this can be solved by calling the <code>_initI18n()</code> method in the <code>onBeforeRendering()</code> method rather than in the <code>onInit()</code> method. While nested views are initialized before the outer view, it's the other way around for the rendering process. This makes sense: rendering the outer view requires rendering the views it contains. So the <code>onBeforeRendering()</code> method of the outer view will be called before the <code>onBeforeRendering()</code> method of its nested views. (It's the other way around for <code>onAfterRendering()</code>: an outer view is done rendering only after its nested views are done rendering.)
</p>
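<p>The call order described above can be illustrated with a tiny simulation (plain JavaScript, not UI5 code; the view objects and function names are made up for the illustration): initialization runs bottom-up, rendering runs top-down.</p>

```javascript
// Minimal simulation of the lifecycle ordering: onInit fires for nested views
// first (the whole is initialized after its parts), while onBeforeRendering
// fires for the outer view first (rendering the whole triggers its parts).
const calls = [];

function initView(view) {
  view.nested.forEach(initView);        // parts are initialized first...
  calls.push("onInit:" + view.name);    // ...then the whole
}

function renderView(view) {
  calls.push("onBeforeRendering:" + view.name); // outer view first...
  view.nested.forEach(renderView);              // ...then its nested views
}

const app = {name: "outer", nested: [{name: "inner", nested: []}]};
initView(app);
renderView(app);
```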
<h2>Ensure i18n initialization occurs only once</h2>
<p>
There is one extra consideration in moving the i18n initialization from <code>onInit()</code> to <code>onBeforeRendering()</code>: views may go through multiple rendering cycles, whereas <code>onInit()</code> runs only once. Since we do not want to reinitialize the i18n model on repeated rendering cycles, we add a flag that ensures the i18n model is initialized only once:
</p>
<pre>
...
onInit: function(){
<span style="color:red; text-decoration:line-through">this._initI18n();</span>
},
<b>onBeforeRendering: function(){</b>
<b>this._initI18n();</b>
<b>}</b>,
<b>_i18nInitialized: false,</b>
_initI18n: function(){
<b>if (this._i18nInitialized === true) {</b>
<b>return;</b>
<b>}</b>
var i18n = "i18n";
<span style="color:grey">//create bundle descriptor for this controllers i18n resource data</span>
var metadata = this.getMetadata();
var nameParts = metadata.getName().split(".");
nameParts.pop();
nameParts.push(i18n);
nameParts.push(i18n);
var bundleData = {bundleName: nameParts.join(".")};
<span style="color:grey">//Use the bundledata to create or enhance the i18n model</span>
var i18nModel = this.getModel(i18n);
if (i18nModel) {
i18nModel.enhance(bundleData);
}
else {
i18nModel = new ResourceModel(bundleData);
}
<span style="color:grey">//set this i18n model.</span>
this.setModel(i18nModel, i18n);
<b>this._i18nInitialized = true;</b>
},
...
</pre>
<h2>Overriding <code>onBeforeRendering()</code> in extensions of the base controller</h2>
<p>
And of course, when extending the base controller, you'll need to remember to call the <code>onBeforeRendering()</code> method of the ascendant when overriding the <code>onBeforeRendering()</code> method:
</p>
<pre>
sap.ui.define([
<b>"just/bi/apps/components/basecontroller/BaseController"</b>
], function(<b>Controller</b>){
"use strict";
var controller = Controller.extend("just.bi.apps.components.mainpanel.MainPanel", {
onBeforeRendering: function(){
<b>Controller.prototype.onBeforeRendering.call(this);</b>
...
}
});
return controller;
});
</pre>
<h2>Finally</h2>
I hope you enjoyed this addendum. Feel free to share your insights if you think there is a better way to handle i18n.rpboumanhttp://www.blogger.com/profile/13365137747952711328noreply@blogger.com0tag:blogger.com,1999:blog-15319370.post-65214130462044309382016-09-18T01:03:00.000+02:002016-09-22T11:01:40.783+02:00SAP UI5: Per-view Internationalization<p style="font-weight: bold">NOTE: There is an addendum to this blog post that suggests a number of improvements. You can check out the addendum here: <a href="http://rpbouman.blogspot.nl/2016/09/sap-ui5-internationalization-for-each.html">SAP UI5: Internationalization for each view - Addendum for Nested Views</a>.</p>
<p>Quite recently, I dove into SAP UI5 development. To educate myself, I followed a workshop and I used <a href="https://sapui5.netweaver.ondemand.com/sdk/#docs/guide/3da5f4be63264db99f2e5b04c5e853db.html">the Walkthrough</a>.</p>
<p></p>
<p>During my explorations, I ran into a particular issue which I didn't see very readily addressed. I also found a solution for this particular issue, and even though I still have a ton to learn, I think it is worth sharing. So, here goes:</p>
<h2>The Feature: Translatable texts and the i18n model</h2>
<p>One of the SAP UI5 features highlighted in the Walkthrough is <a href="https://sapui5.netweaver.ondemand.com/sdk/#docs/guide/df86bfbeab0645e5b764ffa488ed57dc.html">treatment of translatable texts</a>. In the walkthrough this is realized by setting up a resource model, the <a href="https://en.wikipedia.org/wiki/Internationalization_and_localization">i18n</a> model.</p>
<p><span style="font-size: 13.3333px;">The i18n model is sourced from i18n <code>.properties</code> files, which are essentially lists of key/value pairs, one per line, each key separated from its value by an equals sign:</span></p>
<pre>
<span style="color:grey"># Each line is a key=value pair.</span>
greetingAction=Say Hello
greeting=Hello {0}!
</pre>
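<p>Conceptually, such a file parses into a simple key/value map. Here is a simplified sketch of that parsing (an illustration only; a real <code>.properties</code> parser also handles escape sequences and line continuations):</p>

```javascript
// Simplified parser for the key=value lines shown above.
// Skips blank lines and #-comments; everything before the first '=' is the key.
function parseProperties(text) {
  const entries = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const pos = trimmed.indexOf("=");
    entries[trimmed.slice(0, pos)] = trimmed.slice(pos + 1);
  }
  return entries;
}

const bundle = parseProperties(
  "# Each line is a key=value pair.\n" +
  "greetingAction=Say Hello\n" +
  "greeting=Hello {0}!\n"
);
```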
<p style="font-size: 13.3333px;"><span style="font-size: 10pt;">To actually setup a i18n model using these texts, you can explicitly instantiate a <code>sap.ui.model.resource.ResourceModel</code>:</span></p>
<pre>
sap.ui.define([
"sap/ui/core/mvc/Controller",
"sap/ui/model/resource/ResourceModel"
], function(Controller, ResourceModel){
"use strict";
return Controller.extend("my.app.App", {
onInit: function(){
<b>var i18nModel = new ResourceModel({
bundleName: "just.bi.apps.JustBiApp.i18n.i18n"
});
this.getView().setModel(i18nModel, "i18n");</b>
}
});
});
</pre>
<p style="font-size: 13.3333px;"><span style="font-size: 10pt;">Or, you can have your application instantiate the model by listing it in the <code>models</code> property of the <code>sap.ui5</code> entry in the <a href="https://sapui5.hana.ondemand.com/sdk/#docs/guide/8f93bf2b2b13402e9f035128ce8b495f.html"><code>manifest.json</code> application descriptor</a> file:</span></p>
<pre >
"models": {
<b>"i18n": {
"type": "sap.ui.model.resource.ResourceModel",
"settings": {
"bundleName": "just.bi.apps.JustBiApp.i18n.i18n"
}
}</b>
}
</pre>
<p style="font-size: 13.3333px;"><span style="font-size: 10pt;">In many cases, the text is required for the static labels of ui elements like input fields, menus and so on. Inside a view, static texts may be retrieved from the i18n model through special data binding syntax, like so:</span></p>
<pre>
<span style="color:greay"><-- Button text will read "Say Hello" --></span>
<Button text="{<b>i18n>greetingAction</b>}"/>
</pre>
<p>Texts may also be retrieved programmatically inside controller code by calling the <code>.getText()</code> method on the resource bundle object. The resource bundle object is may be obtained from the i18n resource model with the <code>getResourceBundle()</code> getter method:</p>
<p></p>
<pre>var bundle = this.<b>getModel("i18n").getResourceBundle()</b>;
var text = <b>bundle.getText</b>("greeting", ["World"]); <span style="color:grey">// text has value "Hello World!"</span></pre>
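<p>The <code>{0}</code>-style placeholders in the bundle texts are filled from the arguments array passed to <code>getText()</code>. A minimal sketch of that substitution (the real implementation follows Java-MessageFormat-like rules, e.g. for quoting, which this sketch ignores):</p>

```javascript
// Minimal {0}-style placeholder substitution, as used by getText(key, args):
// each {n} in the pattern is replaced by the n-th element of args.
function formatMessage(pattern, args) {
  return pattern.replace(/\{(\d+)\}/g, (match, index) => args[index]);
}

const text = formatMessage("Hello {0}!", ["World"]);
```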
<p>Now, the cool thing is that you can write a separate i18n <code>.properties</code> file for each locale that you want to support. The framework discovers which locale is required by the client and uses that to find the i18n files that best match the client's locale.</p>
<p></p>
<p>The file name is used to identify to which language and/or locale the texts inside the file apply. For example, you'd put the German texts in a <code>i18n_de.properties</code> file, and the English texts in a <code>i18n_en.properties</code> file, and if you want to distinguish between British and American English, you'd create both a <code>i18n_en_GB.properties</code> and <code>i18n_en_US.properties</code> file.</p>
<p>(I haven't found out exactly which standard the SAP UI5 i18n <code>.properties</code> files follow, but from what I've seen so far I think it's safe to assume that you can use the <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">two-letter lowercase ISO 639-1 code</a> for the language and the <a href="https://en.wikipedia.org/wiki/ISO_3166-1#Officially_assigned_code_elements">two-letter uppercase ISO 3166-1 code</a> for the country.)</p>
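<p>The file-name convention implies a lookup order per locale, from most to least specific. A sketch of how the candidate file names could be derived (an illustration of the fallback chain, not the framework's actual lookup code):</p>

```javascript
// Candidate .properties files for a locale, most specific first.
// E.g. "en_GB" -> i18n_en_GB, then i18n_en, then the raw i18n fallback.
function candidateBundles(baseName, locale) {
  const candidates = [];
  const parts = locale ? locale.split("_") : [];
  for (let i = parts.length; i > 0; i--) {
    candidates.push(baseName + "_" + parts.slice(0, i).join("_") + ".properties");
  }
  candidates.push(baseName + ".properties"); // locale-independent fallback
  return candidates;
}
```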
<h2><br />The Problem: One i18n Model for entire application</h2>
<p>Now, the walkthrough demonstrates the feature by adding one i18n model for the entire application so that it becomes available in any of the views that make up the application. I appreciate that the walkthrough is not the right place to cover all kinds of more advanced scenarios, so I can understand why it settles for just one application-wide i18n model.</p>
<p></p>
<p><span style="font-size: 13.3333px;">However, I can't help but feeling this is not an ideal approach. Main reason is that it seems at odds with the fact that many texts are specific to just one particular view. This challenges both the development workflow as well as the reusability of our application components:</span></p>
<p><span style="font-size: 13.3333px;"><br /></span></p>
<ul>
<li>Each time you create or modify a particular view, you also have to edit the global i18n <code>.properties</code> files. To keep things manageable, you will probably invent some kind of view-specific prefix to prefix the keys pertaining to that view, and you'll probably end up creating a view-specific block in that i18n file. At some point, you'll end up with a lot of lines per i18n file, which is not so maintainable</li>
<li>Suppose you want to reuse a particular view in another application. Contrary to the practice used in the Walkthrough, I like to keep a view and its associated controller together, in a folder separate from any other view and controller. This way I can easily copy, move or remove the things that belong together. Except that the texts, which also belong to that view/controller, live in the global i18n <code>.properties</code> file and need to be managed separately.</li>
</ul>
<p></p>
<h2>The Solution: Keep per-view<span style="font-size: 13.3333px;"> </span>i18n files near view and controller code</h2>
<p>The solution I found is to create an i18n subfolder beneath the folder that contains my Controller and View. Since I already keep each associated view and controller together, and separate from the other views and controllers, this approach makes sense: it's just one step further in keeping code and resources that depend directly on each other physically together.</p>
<p><br />So, this is what my file and folder structure looks like:</p>
<p><img alt="FolderStructure.png" src="https://drive.google.com/uc?export=download&id=0BzdLoKoT3p_ea1BJemVhcXRvWWs" /></p>
<p></p>
<p>So, the <code>webapp</code> folder is the root of the sap ui5 project. The <code>components</code> folder is where I keep subfolders for each functional unit (i.e. View+Controller+resources) of the application. In the picture, you see two such subfolders, <code>basecontroller</code> (more about that below) and <code>mainpanel</code>.</p>
<p></p>
<p>The <code>mainpanel</code> folder is the one that contains an actual component of my application - a <code>MainPanel</code> View, and its controller (in <code>MainPanel.view.xml</code> and <code>MainPanel.controller.js</code> respectively). Here we also find the <code>i18n</code> folder specific to this view, and inside are the i18n <code>.properties</code> files (one for each locale we need to support).</p>
<p></p>
<p>In order to load and apply the view-specific i18n <code>.properties</code> files, I'm using generic extension of <code>sap.ui.core.mvc.Controller</code> which loads the "local", view-specific i18n resource bundle. This extension is called <code>BaseController</code> and is in the basecontroller folder. Here's the code:</p>
<p></p>
<pre>
sap.ui.define([
  "sap/ui/core/mvc/Controller",
  "sap/ui/model/resource/ResourceModel"
], function(Controller, ResourceModel){
  "use strict";
  var controller = Controller.extend("just.bi.apps.components.basecontroller.BaseController", {
    onInit: function(){
      this._initI18n();
    },
    _initI18n: function(){
      var i18n = "i18n";
      <span style="color:grey">//create bundle descriptor for this controller's i18n resource data</span>
      var metadata = this.getMetadata();
      var nameParts = metadata.getName().split(".");
      nameParts.pop();
      nameParts.push(i18n);
      nameParts.push(i18n);
      var bundleData = {bundleName: nameParts.join(".")};
      <span style="color:grey">//use the bundle data to create or enhance the i18n model</span>
      var i18nModel = this.getModel(i18n);
      if (i18nModel) {
        i18nModel.enhance(bundleData);
      }
      else {
        i18nModel = new ResourceModel(bundleData);
      }
      <span style="color:grey">//set this i18n model.</span>
      this.setModel(i18nModel, i18n);
    },
    getModel: function(modelname){
      var view = this.getView();
      var model = view.getModel.apply(view, arguments);
      if (!model) {
        var ownerComponent = this.getOwnerComponent();
        if (ownerComponent) {
          model = ownerComponent.getModel(modelname);
        }
      }
      return model;
    },
    setModel: function(model, modelName){
      var view = this.getView();
      view.setModel.apply(view, arguments);
    }
  });
  return controller;
});
</pre>
<p>Note how the <code>BaseController</code> initializes the i18n model by calling the <code>_initI18n()</code> method. In this method we extract the class name of the controller from its metadata (using the <code>.getName()</code> getter on the metadata obtained with the <code>.getMetadata()</code> getter), and we pop off the unqualified class name to obtain its namespace. We then add the string <code>"i18n"</code> twice - once for the folder, and once for the files inside it. The result is the bundle name, which we use to instantiate the actual <code>ResourceModel</code>.
</p>
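<p>Stripped of the UI5 specifics, the name derivation performed by <code>_initI18n()</code> can be sketched in plain JavaScript like this (the class name below is the one from this example app):</p>

```javascript
// Derive the i18n bundle name from a controller's fully qualified class name:
// drop the unqualified class name, then append "i18n" twice - once for the
// subfolder, once for the .properties base name inside it.
function i18nBundleName(controllerClassName){
  var i18n = "i18n";
  var nameParts = controllerClassName.split(".");
  nameParts.pop();
  nameParts.push(i18n);
  nameParts.push(i18n);
  return nameParts.join(".");
}

console.log(i18nBundleName("just.bi.apps.components.mainpanel.MainPanel"));
// just.bi.apps.components.mainpanel.i18n.i18n
```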
<p>
Before setting that model to the controller's view, we check if there is already an i18n model set using <code>getModel()</code>. This is a utility method that gets the model from this controller's associated view, or of the component that "owns" the view and this controller.
</p>
<p>
If an i18n model is already available, we enhance it by calling its <code>.enhance()</code> method, rather than replacing it. This way, any texts defined at a higher level remain available in this controller and view. This gives us a functional i18n model, which we then set using <code>setModel()</code>, which simply calls <code>setModel()</code> on the view associated with this controller.</p>
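<p>The effect of <code>enhance()</code> on text lookup can be illustrated with a plain-JavaScript sketch. The merge below only mimics the lookup behavior - <code>ResourceModel</code> itself works with <code>.properties</code> bundles - and the text keys are made up for the example:</p>

```javascript
// Sketch of what enhancing a resource bundle achieves for text lookup:
// texts from the enhancing (view-local) bundle take precedence, while texts
// only present in the earlier (app-wide) bundle remain available.
function enhance(baseTexts, localTexts){
  var merged = {};
  Object.keys(baseTexts).forEach(function(k){ merged[k] = baseTexts[k]; });
  Object.keys(localTexts).forEach(function(k){ merged[k] = localTexts[k]; });
  return merged;
}

var appTexts  = {appTitle: "My App", ok: "OK"};
var viewTexts = {panelTitle: "Main Panel", ok: "Okay"};
var texts = enhance(appTexts, viewTexts);

console.log(texts.appTitle);   // My App     (inherited from the app-wide bundle)
console.log(texts.panelTitle); // Main Panel
console.log(texts.ok);         // Okay       (view-local text wins)
```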
<p></p>
<p>To actually use this <code>BaseController</code>, we extend it when creating a "real" controller:</p>
<p></p>
<pre>
sap.ui.define([
  <b>"just/bi/apps/components/basecontroller/BaseController"</b>
], function(<b>Controller</b>){
  "use strict";
  var controller = Controller.extend("just.bi.apps.components.mainpanel.MainPanel", {
    onInit: function(){
      <b>Controller.prototype.onInit.call(this);</b>
      ...
    }
  });
  return controller;
});
</pre>
<p>Note that if that real controller has its own <code>onInit()</code> method, we need to first call the <code>onInit()</code> method of <code>BaseController</code>, or rather, of whatever class we're extending. Fortunately, since we are extending it we already have a reference to it (in the snippet above, it's the <code>Controller</code> parameter injected into our definition), so we call its <code>onInit()</code> method via the <code>prototype</code> while using the controller that is currently being defined (this) as scope.</p>
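<p>The underlying pattern here is ordinary JavaScript prototype-based inheritance. A minimal sketch, without any UI5 machinery:</p>

```javascript
// The base-call pattern in plain JavaScript: invoke the parent class's
// method with the subclass instance as `this`.
function Base(){}
Base.prototype.onInit = function(){
  this.initialized = true;
};

function Sub(){}
Sub.prototype = Object.create(Base.prototype);
Sub.prototype.onInit = function(){
  // same idea as Controller.prototype.onInit.call(this) in the UI5 snippet
  Base.prototype.onInit.call(this);
  this.subInitialized = true;
};

var s = new Sub();
s.onInit();
console.log(s.initialized, s.subInitialized); // true true
```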
<h2>Finally</h2>
<p>I hope you enjoyed this article, and that it will be of use to you. If you have an alternative solution - quite possibly a better one - then please feel free to leave a comment and point it out. I'm eager to hear and learn, so don't hesitate to share your opinion and point of view.</p>
<h2 style="color: red">ADDENDUM: Nested views</h2>
<p>
It turns out the approach described in this post does not work well when using nested views. Fortunately, a simple improvement of the ideas in this post solves that problem. The approach is described in my next blog post, <a href="http://rpbouman.blogspot.nl/2016/09/sap-ui5-internationalization-for-each.html">SAP UI5: Internationalization for each view - Addendum for Nested Views</a>.
</p>
</body>
<h1>ODXL - A generic Data Export Layer for SAP/HANA based on OData</h1>
<i>2016-05-04</i>
<br/>
I'm very pleased to be able to announce the immediate availability of the Open Data Export Layer (ODXL) for SAP/HANA!
<a href="http://scn.sap.com/community/hana-in-memory" target="saphana" style="float:right"><img src="https://drive.google.com/uc?export=download&id=0BzdLoKoT3p_eQjNSTkg3dkxvVGM" /></a>
<h3>Executive summary</h3>
ODXL is a framework that provides generic data export capabilities for the SAP/HANA platform.
ODXL is implemented as a <a href="http://scn.sap.com/community/developer-center/hana/blog/2012/11/29/sap-hana-extended-application-services" target="_sap">xsjs Web service</a> that understands <a href="http://www.odata.org/documentation/odata-version-2-0/uri-conventions/" target="odata">OData web requests</a>, and delivers a response by means of a pluggable data output handler.
Developers can use ODXL as a back-end component, or even as a global instance-wide service to provide clean, performant and extensible data export capabilities for their SAP/HANA applications.
<a href="http://just-bi.nl/" target="justbi" style="float:right"><img src="https://drive.google.com/uc?export=download&id=0BzdLoKoT3p_eaXpVTldxVDg4UU0" /></a>
<br/>
<br/>
Currently, ODXL provides output handlers for <a href="https://tools.ietf.org/html/rfc4180" target="csv">comma-separated values (csv)</a> as well as Microsoft Excel output.
However, ODXL is designed so that developers can write their own response handlers and extend ODXL to export data to other output formats according to their requirements.
<br/>
<br/>
<a style="float:left" href="https://github.com/just-bi/odxl" target="github"><img src="https://drive.google.com/uc?export=download&id=0BzdLoKoT3p_eVXFSTkhDTUt1cnM"/></a>
ODXL is provided by <a href="http://just-bi.nl/" target="justbi">Just BI</a> to the SAP/HANA developer community as open source software under the terms of <a href="http://www.apache.org/licenses/LICENSE-2.0" target="apache20">the Apache 2.0 License</a>. This means you are free to use, modify and distribute ODXL. For the exact terms and conditions, please refer to the license text.
<br/>
<br/>
The source code is <a href="https://github.com/just-bi/odxl" target="github">available on github</a>. Developers are encouraged to check out the source code and to contribute to the project.
You can contribute in many ways: we value any feedback, suggestions for new features, filing bug reports, or code enhancements.
<br/>
<br/>
If you require professional support for ODXL, please <a href="http://just-bi.nl/contact/" target="justbi">contact Just-BI</a> for details.
<h3>What exactly is ODXL?</h3>
ODXL started as an in-house project at the Just-BI department of custom development.
It was born from the observation that the SAP/HANA web applications we develop for our customers often require some form of data export, typically to Microsoft Excel.
Rather than creating this type of functionality again for each project, we decided to invest some time and effort to design and develop this solution in such a way that it can easily be deployed as a reusable component.
And preferably, in a way that feels natural to SAP/HANA xs platform application developers.
<br/>
<br/>
What we came up with is an xsjs web service that understands requests that look and feel like standard OData <code>GET</code> requests, but which returns the data in some custom output format.
ODXL was designed to make it easily extensible so that developers can build their own modules that create and deliver the data in whatever output format suits their requirements.
<br/>
<br/>
This is illustrated in the high-level overview below:
<br/>
<br/>
<a href="https://github.com/just-bi/odxl" target="github"><img src="https://drive.google.com/uc?export=download&id=0BzdLoKoT3p_ebWZ1aEN0SUY2bFU" /></a>
<br/>
<br/>
For customers of Just-BI, there is an immediate requirement to get Microsoft Excel output.
So, we went ahead and implemented output handlers for .xlsx and .csv formats, and we included those in the project.
This means that ODXL supports data export to the .xlsx and .csv formats right out of the box.
<br/>
<br/>
However, support for any particular output format is entirely optional and can be controlled by configuration and/or extension:<ul>
<li>Developers can develop their own output handlers to supply data export to whatever output format they like.</li>
<li>SAP/HANA Admins and/or application developers can choose to install only those output handlers they require, and configure how Content-Type headers and OData $format values map to output handlers.</li>
</ul>
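To give an idea of the mechanism, here is a minimal sketch of a MIME-type-keyed output-handler registry. Note that this is illustrative only - it does not reflect ODXL's actual internal API - and the csv handler shown is deliberately naive (it does not quote field values):

```javascript
// Illustrative output-handler registry, keyed by MIME type.
var outputHandlers = {};

function registerOutputHandler(mimeType, handler){
  outputHandlers[mimeType] = handler;
}

// A naive csv handler: one line per row, fields separated by commas.
registerOutputHandler("text/csv", function(rows){
  return rows.map(function(row){
    return row.join(",");
  }).join("\r\n");
});

var csv = outputHandlers["text/csv"]([
  ["PRODUCTCODE", "PRODUCTNAME"],
  ["S10_1678", "1969 Harley Davidson Ultimate Chopper"]
]);
console.log(csv);
```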
<h3>So ODXL is OData? Doesn't SAP/HANA suppport OData already?</h3>
The SAP/HANA platform provides data access via the OData standard.
This facility is very convenient for object-level read- and write access to database data for typical modern web applications.
In this scenario, the web application would typically use asynchronous XML Http requests, and data would be exchanged in either Atom (a XML dialect) or JSON format.
<a href="http://www.odata.org/" target="odata" style="float: right"><img src="https://drive.google.com/uc?export=download&id=0BzdLoKoT3p_ebGhmelFXbDJnWDA"/></a>
<br/>
<br/>
ODXL's primary goal is to provide web applications with a way to export datasets in the form of documents.
Data export tasks typically deal with data sets that are quite a bit larger than the ones accessed from within a web application.
In addition, a data export document may very well comprise multiple parts - in other words, it may contain multiple datasets.
A typical example is exporting multiple lists of different items from a web application to a workbook containing multiple spreadsheets with data.
In fact, the concrete use case from whence ODXL originated was the requirement to export multiple datasets to Microsoft Excel .xlsx workbooks.
<br/>
<br/>
So, ODXL is not OData.
Rather, ODXL is complementary to SAP/HANA OData services.
That said, the design of ODXL does borrow elements from standard OData.
<h3>OData Features, Extensions and omissions</h3>
ODXL <code>GET</code> requests follow the syntax and features of <a href="http://www.odata.org/documentation/odata-version-2-0/operations/" target="odata">OData standard <code>GET</code> requests</a>.
Here's a simple example to illustrate the ODXL <code>GET</code> request:
<pre>
GET "RBOUMAN"/"PRODUCTS"?$select=PRODUCTCODE, PRODUCTNAME&$filter=PRODUCTVENDOR eq 'Classic Metal Creations' and QUANTITYINSTOCK gt 1&$orderby=BUYPRICE desc&$skip=0&$top=5
</pre>
This request is built up like so:<ul>
<li><code>"RBOUMAN"/"PRODUCTS"</code>: get data from the <code>"PRODUCTS"</code> table in the database schema called <code>"RBOUMAN"</code>.</li>
<li><code>$select=PRODUCTCODE, PRODUCTNAME</code>: Only get values for the columns <code>PRODUCTCODE</code> and <code>PRODUCTNAME</code>.</li>
<li><code>$filter=PRODUCTVENDOR eq 'Classic Metal Creations' and QUANTITYINSTOCK gt 1</code>: Only get products from the vendor <code>'Classic Metal Creations'</code> that have more than one item in stock.</li>
<li><code>$orderby=BUYPRICE desc</code>: Order the data from highest price to lowest.</li>
<li><code>$skip=0&$top=5</code>: Only get the first five results.</li>
</ul>
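Since the query options are just URL parameters, client code can assemble them programmatically. Below is a hypothetical helper (not part of ODXL itself) that builds the query string used above; note that in a real request, the option values should additionally be passed through <code>encodeURIComponent()</code>:

```javascript
// Hypothetical helper: assemble an OData-style query string from an options
// object, prefixing each key with "$". Values are not URL-encoded here.
function buildODataQuery(options){
  return Object.keys(options)
    .map(function(key){
      return "$" + key + "=" + options[key];
    })
    .join("&");
}

var query = buildODataQuery({
  select: "PRODUCTCODE, PRODUCTNAME",
  filter: "PRODUCTVENDOR eq 'Classic Metal Creations' and QUANTITYINSTOCK gt 1",
  orderby: "BUYPRICE desc",
  skip: 0,
  top: 5
});
console.log(query);
```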
For more detailed information about invoking the ODXL service, check out the section about the sample application.
The sample application offers a very easy way to use ODXL for any table, view, or calculation view you can access, and allows you to familiarize yourself in detail with the URL format.
<br/>
<br/>
In addition, ODXL supports <a href="http://www.odata.org/documentation/odata-version-2-0/batch-processing/" target="odata">the OData <code>$batch</code> <code>POST</code> request</a> to support export of multiple datasets into a single response document.
<br/>
<br/>
The reasons to follow OData in these respects are quite simple:<ul>
<li>OData is simple and powerful. It is easy to use, and it gets the job done. There is no need to reinvent the wheel here.</li>
<li>ODXL's target audience, that is to say, SAP/HANA application developers, are already familiar with OData. They can integrate and use ODXL into their applications with minimal effort, and maybe even reuse the code they use to build their OData queries to target ODXL.</li>
</ul>
ODXL does not follow the OData standard with respect to the format of the response.
This is a feature: OData only specifies Atom (an XML dialect) and JSON output, whereas ODXL can supply any output format.
ODXL can support any output format because it allows developers to plug in their own modules, called output handlers, that create and deliver the output.
<br/>
<br/>
Currently ODXL provides two output handlers: one for comma-separated values (.csv), and one for Microsoft Excel (.xlsx).
If that is all you need, you're set. And if you need some special output format, you can use the code of these output handlers to see how it is done and then write your own output handler.
<br/>
<br/>
ODXL does respect the OData standard with regard to how the client can specify what type of response they would like to receive.
Clients can specify the MIME-type of the desired output format in a standard HTTP <code>Accept:</code> request header:<ul>
<li><code>Accept: text/csv</code> specifies that the response should be returned in comma separated values format.</li>
<li><code>Accept: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</code> specifies that the response should be returned in open office xml workbook format (Excel .xlsx format).</li>
</ul>
Alternatively, they can specify a <code>$format=&lt;format&gt;</code> query option, where <code>&lt;format&gt;</code> identifies the output format:<ul>
<li><code>$format=csv</code> for csv format</li>
<li><code>$format=xlsx</code> for .xlsx format</li>
</ul>
Note that a format specified by the <code>$format</code> query option will override any format specified in an <code>Accept:</code>-header, as per OData specification.
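The resolution rule can be summarized in a small sketch. The mapping table below is illustrative; the actual set of mappings is configurable, as described next:

```javascript
// Illustrative mapping from Accept MIME types to output format keys.
var formatsByMimeType = {
  "text/csv": "csv",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx"
};

function resolveOutputFormat(formatQueryOption, acceptHeader){
  // per the OData specification, an explicit $format wins over Accept:
  if (formatQueryOption) {
    return formatQueryOption;
  }
  return formatsByMimeType[acceptHeader];
}

console.log(resolveOutputFormat("xlsx", "text/csv")); // xlsx
console.log(resolveOutputFormat(null, "text/csv"));   // csv
```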
<br/>
<br/>
ODXL admins can configure which MIME-types will be supported by a particular ODXL service instance, and how these map to pluggable output handlers.
In addition, they can configure how values passed for the <code>$format</code> query option map to MIME-types.
ODXL comes with a standard configuration with mappings for the predefined output handlers for .csv and .xlsx output.
<br/>
<br/>
On the request side of things, most of OData's features are implemented by ODXL:<ul>
<li>The <code><a href="http://www.odata.org/documentation/odata-version-2-0/uri-conventions/#SelectSystemQueryOption" target="odata">$select</a></code> query option to specify which fields are to be returned</li>
<li>The <code><a href="http://www.odata.org/documentation/odata-version-2-0/uri-conventions/#FilterSystemQueryOption" target="odata">$filter</a></code> query option allows complex conditions restricting the returned data. OData standard functions are implemented too.</li>
<li>The <code><a href="http://www.odata.org/documentation/odata-version-2-0/uri-conventions/#SkipSystemQueryOption" target="odata">$skip</a></code> and <code><a href="http://www.odata.org/documentation/odata-version-2-0/uri-conventions/#TopSystemQueryOption" target="odata">$top</a></code> query options to export only a portion of the data</li>
<li>The <code><a href="http://www.odata.org/documentation/odata-version-2-0/uri-conventions/#OrderBySystemQueryOption" target="odata">$orderby</a></code> query option to specify how the data should be sorted</li>
</ul>
ODXL currently does not offer support for the following OData features:<ul>
<li><code><a href="http://www.odata.org/documentation/odata-version-2-0/uri-conventions/#ExpandSystemQueryOption" target="odata">$expand</a></code></li>
<li><code><a href="http://www.odata.org/blog/queryable-odata-metadata/" target="odata">$metadata</a></code></li>
</ul>
The features that are currently not supported may be implemented in the future.
For now, we feel the effort to implement them and adequately map their semantics to ODXL may not be worth the trouble.
However, an implementation can surely be provided should there be sufficient interest from the community.
<h3>Installation</h3>
Using ODXL presumes you already have a SAP/HANA installation with a properly working xs engine. You also need HANA Studio, or Eclipse with the SAP HANA Tools plugin installed.
The steps are a little bit different, depending on whether you just want to use ODXL, or whether you want to actively develop the ODXL project.
<br/>
<br/>
Here are the steps if you just want to use ODXL, and have no need to actively develop the project:<ol>
<li>In HANA Studio/Eclipse, create a new HANA xs project. Alternatively, find an existing HANA xs project.</li>
<li>Find the ODXL repository on github, and <a href="https://github.com/just-bi/odxl/archive/master.zip">download the project as a zipped folder</a>. (Select a particular branch if you desire so; typically you'll want to get the master branch)</li>
<li>Extract the project from the zip. This will yield a folder. Copy its contents, and place them into your xs project directory (or one of its sub directories)</li>
<li>Activate the new content.</li>
</ol>
After taking these steps, you should now have a working ODXL service, as well as a sample application.
The service itself is in the service subdirectory, and you'll find the sample application inside the app subdirectory.
<br/>
<br/>
The service and the application are both self-contained xs applications, and should be completely independent in terms of resources.
The service does not require the application to be present, but obviously, the application does rely on being able to call upon the service.
<br/>
<br/>
If you only need the service, for example because you want to call it directly from your own application, then you don't need the sample application.
You can safely copy only the contents of the service directory and put those right inside your project directory (or one of its subdirectories) in that case.
But even then, you might still want to hang on to the sample application, because you can use that to generate the web service calls that you might want to do from within your application.
<br/>
<br/>
If you want to hack on ODXL then you might want to fork or clone the <a href="https://github.com/just-bi/odxl" target="github">ODXL github repository</a>. If you do this inside a SAP/HANA xs project, or if you create a project pointing to that location, you can then deploy that to SAP/HANA and use that to send pull requests in case you want to contribute your changes back into the project.
<h3>Getting started with the sample application</h3>
To get up and running quickly, we included a sample web application in the ODXL project.
The purpose of this sample application is to provide an easy way to evaluate and test ODXL.
<br/>
<br/>
The sample application lets you browse the available database schemas and queryable objects: tables and views, including calculation views (or at least, their SQL queryable runtime representation).
After making the selection, it will build up a form showing the available columns. You can then use the form to select or deselect columns, apply filter conditions, and/or specify any sorting order.
If the selected object is a calculation view that defines input parameters, then a form will be shown where you can enter values for those too.
<br/>
<br/>
Meanwhile, as you enter options into the form, a textarea shows the URL that should be used to invoke the ODXL service. If you like, you can manually tweak this URL as well.
Finally, you can use one of the download links to immediately download the result corresponding to the current URL in either .csv or .xlsx format.
<br/>
<br/>
Alternatively, you can hit a button to add the URL to a batch request.
When you're done adding items to the batch, you can hit the download workbook button to download as single .xlsx workbook, containing one worksheet for each dataset in the batch.
<br/>
<br/>
<img src="https://drive.google.com/uc?export=download&id=0BzdLoKoT3p_edlFCVEFLNzFpZzg"/>
<h3>What versions of SAP/HANA are supported?</h3>
We initially built and tested ODXL on SPS9.
The initial implementation used the $.hdb database interface, as well as the $.util.Zip builtin.
<br/>
<br/>
We then built abstraction layers for both database access and zip support to allow automatic fallback to the $.db database interface, and to use a pure JavaScript implementation of the zip algorithm based on Stuart Knightley's JSZip library.
We tested this on SPS8, and everything seems to work fine there.
<br/>
<br/>
We have not actively tested earlier SAP/HANA versions, but as far as we know, ODXL should work on any earlier version.
If you find that it doesn't, then please let us know - we will gladly look into the issue and see if we can provide a solution.
<h3>Why Open Source? What's the Business Model? What's the catch?</h3>
For Just BI, Open Source software is not a business model, but a development model.
While some companies build a successful business model around selling custom code, this is currently not Just-BI's primary goal.
Rather, Just-BI is a consulting company that focuses mainly on Business Intelligence solutions around the SAP ecosystem.
Our areas of expertise include Business Objects, SAP BW, SAP HANA, as well as custom BI (web) applications.
Helping customers by providing solutions for their business problems is Just-BI's primary concern - not selling code.
<br/>
<br/>
However, we do acknowledge that sometimes, custom code plays an essential role in building a business solution for our customers.
In these cases, we will gladly help our customers to design, build and deploy such solutions.
But even in these cases we will try to look for standard component toolkits, like SAP UI5, or frameworks like Angular as a basis for our work.
<br/>
<br/>
The urge to standardize on familiar, well known toolkits and libraries hardly needs justification.
In the end, customers are not looking to acquire and own a pile of custom-coded solutions, because today's hot new custom solution is tomorrow's legacy.
The more a customer relies on custom code, the harder it becomes to maintain and to move forward.
<br/>
<br/>
Sometimes, a particular building block that we need for applications may not be publicly available already.
If such a building block is sufficiently generic (i.e., not bound to any particular customer) then we have every reason to want that to become a standard.
For a generic and reusable component like ODXL, we believe that an open source model is the right way to do that.
<br/>
<br/>
We think that an open source development model will help maintain and advance ODXL.
By using an open source release and development model, we have potentially more eyes to scrutinize our code, find bugs, suggest features, etc.
In addition we hope our customers will feel more confident to embrace an open source solution, since they need not be locked into only our company for support and ongoing development.
<h3>How to Contribute</h3>
If you want to, there are many different ways to contribute to ODXL.<ol>
<li>If you want to suggest a new feature, or report a defect, then please <a href="https://github.com/just-bi/odxl/issues" target="github">use the github issue tracker</a>.</li>
<li>If you want to contribute code for a bugfix, or for a new feature, then please send a pull request. If you are considering contributing code, then we do urge you to <a href="https://github.com/just-bi/odxl/issues" target="github">first create an issue</a> to open up discussion with fellow ODXL developers on how to best scratch your itch.</li>
<li>If you are using ODXL and if you like it, then consider to spread the word - tell your co-workers about it, write a blog, or a tweet, or a facebook post.</li>
</ol>
Thank you in advance for your contributions!
<h3>Finally</h3>
I hope you enjoyed this post! I hope ODXL will be useful to you. If so, I look forward to getting your feedback on how it works for you and how we might improve it. Thanks for your time!
<h1>Installing the Open Source Xavier XML/A client on the Jedox Premium OLAP Suite</h1>
<i>2016-03-20</i>
<br/>
<a href="http://www.jedox.com/en/" target="jedox">Jedox</a> is a software vendor that specializes in OLAP services and solutions. The company has been around quite a while and is probably best known for its <a href="https://en.wikipedia.org/wiki/Palo_(OLAP_database)" target="jedox">PALO</a> MOLAP engine and the matching add-in for Microsoft Excel.
<br/>
<br/>
Jedox' flagship product, Jedox Premium, comprises the Palo MOLAP engine, APIs, a REST server, an ETL server, and client tools. It also comes with an MDX interpreter and an XML for Analysis (XML/A) server. An interesting tidbit is that the MDX layer is not considered native; Jedox' own clients use a lower-level API, or address it via the REST service.
<br/>
<br/>
In this blog post I will explain how to install and configure the Open Source browser-based ad-hoc query and analysis tool Xavier to use it with Jedox. A video of the process is embedded below:<br/>
<iframe width="420" height="315" src="https://www.youtube.com/embed/18XoCj1aBz4" frameborder="0" allowfullscreen></iframe>
<br/>
<br/>
Here's a written list of instructions to get up and running with Xavier and Jedox:<ol>
<li>
<a href="http://www.jedox.com/en/product/free-software-trial/jedox-premium-trial" target="jedox">Download Jedox Premium</a>.
Run the downloaded installer to actually install the product.
By default, it will be installed in <code>C:\Program Files (x86)\Jedox\Jedox Suite</code>.
In the remainder of this post, I will refer to this directory as "the Jedox Suite directory".
</li>
<li>
<a href="https://github.com/rpbouman/xavier/blob/master/dist/xavier.zip?raw=true" target="xavier">Download xavier.zip</a>.
Unpack the zip. A xavier directory will be extracted.
</li>
<li>
Stop the JedoxSuiteHttpdService.
If you don't know about Windows services, then <a href="https://technet.microsoft.com/en-us/library/cc736564(v=ws.10).aspx#BKMK_services" target="microsoft">look here</a>.
</li>
<li>
Copy the xavier directory that you extracted from xavier.zip into the <code>Jedox Suite\httpd\app\docroot</code> directory.
</li>
<li>
Open the <code>Jedox Suite/httpd/conf/httpd.conf</code> file in a text editor.
You should probably make a backup copy of the <code>httpd.conf</code> file before editing it so you can always revert your changes.
</li>
<li>
Add a line to load the HTTP proxy module.
To do that, search the <code>httpd.conf</code> file for a bunch of lines that start with <code>LoadModule</code>.
Look for a line that reads:
<br/>
<br/>
<code>LoadModule proxy_http_module modules/mod_proxy_http.so</code>
<br/>
<br/>
In my installation, the line is already present, like this:
<pre>
&lt;IfDefine JDX_DEV&gt;
  LoadModule log_config_module modules/mod_log_config.so
  <b>LoadModule proxy_http_module modules/mod_proxy_http.so</b>
  LoadModule setenvif_module modules/mod_setenvif.so
&lt;/IfDefine&gt;
</pre>
Now, what you'll want to do is cut this line out of the <code>&lt;IfDefine JDX_DEV&gt;</code> block, and put it outside that block, for example, right before it, like this:
<pre>
<b>LoadModule proxy_http_module modules/mod_proxy_http.so</b>
&lt;IfDefine JDX_DEV&gt;
  LoadModule log_config_module modules/mod_log_config.so
  LoadModule setenvif_module modules/mod_setenvif.so
&lt;/IfDefine&gt;
</pre>
</li>
<li>
Add a proxy configuration so that web applications deployed on the Apache HTTP server can access the Jedox XML/A service as if it lives in the same domain as the web application.
To do that, add a <code>Location</code> directive at the end of the <code>httpd.conf</code> file, like this:
<pre>
&lt;Location /xavier/Xmla&gt;
  ProxyPass http://localhost:4242/xmla/
  ProxyPassReverse http://localhost:4242/xmla/
  SetEnv proxy-chain-auth
&lt;/Location&gt;
</pre>
This allows a web application on the Apache HTTP server to access the XML/A service via the URL <code>/xavier/Xmla</code>.
By default, the place where the Jedox XML/A service lives is <code>http://localhost:4242/xmla</code>.
You can verify this by cross-checking with the configuration in <code>Jedox Suite\odbo\config.ini</code>:
the values for the <code>MDXAddress</code> and <code>MDXPort</code> should match the server and port in the URLs configured for <code>ProxyPass</code> and <code>ProxyPassReverse</code>.
</li>
<li>
Save the changes to your httpd.conf file, and start the JedoxSuiteHttpdService.
If the service starts, you should be good to go.
If it doesn't, check the <code>Jedox Suite/log/apache_error.log</code> file and see if you can find some information there that can help you troubleshoot your problem.
</li>
</ol>
If all went well, you should now be able to navigate to <a href="http://localhost/xavier/resources/html/index.html" target="xavier">http://localhost/xavier/resources/html/index.html</a> and you should see the Xavier welcome screen. Note that this assumes the Jedox HTTP server is running on its default port (80). If you chose another port for the HTTP server when installing Jedox, the URL for Xavier has to be amended accordingly. For example, I chose port 8181, and hence my URL would be <code>http://localhost:8181/xavier/resources/html/index.html</code> instead.
<br/>
<br/>
If you're in doubt what port you chose for your Jedox HTTP server, you can look it up in the <code>Jedox Suite/httpd/conf/httpd.conf</code> file. Look for a line that starts with <code>Define JDX_PORT_HTTP</code>. The port is specified right after that, enclosed in double quotes.