Thursday, December 19, 2024

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is about a particular challenge posted on the DuckDB Discord server which happened to interest me. So, here goes!!

Yesterday on the DuckDB Discord server user @col_montium asked:
I've got a bit of a head scratcher. My json from https://api.census.gov/data/2000/dec/sf1/variables.json is a single struct with dynamic keys. I want to extract a table with columns field_code, label, concept, and group. I completed this with a small sample of the data but with the entire dataset the query uses up 24 gigabytes of RAM and then crashes. Here's my query:
WITH source AS (
  SELECT *
  FROM 'https://api.census.gov/data/2000/dec/sf1/variables.json'
),
all_keys AS (
  SELECT unnest( json_keys( variables ) ) AS "key",
         variables
  FROM   source
),
extracted_fields AS (
  SELECT "key" AS field_code
  ,      variables->key->>'$.label' AS label
  ,      variables->key->>'$.concept' AS concept
  ,      variables->key->>'$.predicateType' AS predicate_type
  ,      variables->key->>'$.group' AS "group"
  FROM all_keys
  WHERE "key" NOT IN ('for', 'in', 'ucgid')
)
SELECT *
FROM extracted_fields
ORDER BY field_code
I tried to run this myself on my laptop using the DuckDB 1.1.3 (GA) command line interface, and it failed with a similar error as reported by @col_montium:
Run Time (s): real 252.335 user 197.031250 sys 9.500000
Out of Memory Error: failed to allocate data of size 16.0 MiB (24.9 GiB/25.0 GiB used)
I was able to run it successfully using a 1.1.4 nightly build, but it took about 3 minutes to complete:
Run Time (s): real 188.681 user 148.500000 sys 8.078125

Dataset: Size and Structure


Surprisingly, the source data set is less than 2MB (2,039,748 bytes)! The query also doesn't appear to be terribly complicated, and if it finishes successfully, it yields only 8141 rows. So, clearly something interesting must be the matter with the structure of the dataset.

As it is so modest in size, we can easily inspect it with a text editor:
{
  "variables": {
    "for": {
      "label": "Census API FIPS 'for' clause",
      "concept": "Census API Geography Specification",
      "predicateType": "fips-for",
      "group": "N/A",
      "limit": 0,
      "predicateOnly": true
    },
    "in" {
      "label": "Census API FIPS 'in' clause",
      "concept": "Census API Geography Specification",
      "predicateType": "fips-in",
      "group": "N/A",
      "limit": 0,
      "predicateOnly": true
    },
    "ucgid" {
      "label": "Uniform Census Geography Identifier clause",
      "concept": "Census API Geography Specification",
      "predicateType": "ucgid",
      "group": "N/A",
      "limit": 0,
      "predicateOnly": true,
      "hasGeoCollectionSupport": true
    },
    "P029009" {
      "label": "Total!!In households!!Related child!!Own child!!6 to 11 years",
      "concept": "RELATIONSHIP BY AGE FOR THE POPULATION UNDER 18 YEARS [46]",
      "predicateType": "int",
      "group": "P029",
      "limit": 0
    },
    
    ...many more variables...
    
    "PCT012H185" {
      "label": "Total!!Female!!78 years",
      "concept": "SEX BY AGE (HISPANIC OR LATINO) [209]",
      "predicateType": "int",
      "group": "PCT012H",
      "limit": 0
    }
  }
}    
The structure of the dataset is quite simple: a single object with a single variables property, which has an object value with many object-typed properties. The object-type values of these properties contain a handful of recurring scalar properties like label, concept, predicateType, group and limit.

Now that we have analyzed the structure, it should be clear what the query attempts to achieve. It wants to produce a row for each property of the outermost variables-object. The property name becomes the field_code column, and forms its natural primary key. The recurring properties of the innermost objects become its non-key columns label, concept, predicate_type and group.
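
To make the intended transformation concrete, here is a small sketch in plain Python using only the standard json module, over a trimmed inline copy of two entries from the dataset (the names doc and rows are made up for illustration):

```python
# Sketch of the intended extraction, over a trimmed two-entry sample.
import json

doc = json.loads("""
{
  "variables": {
    "for":     {"label": "Census API FIPS 'for' clause",
                "concept": "Census API Geography Specification",
                "predicateType": "fips-for", "group": "N/A"},
    "P029009": {"label": "Total!!In households!!Related child!!Own child!!6 to 11 years",
                "concept": "RELATIONSHIP BY AGE FOR THE POPULATION UNDER 18 YEARS [46]",
                "predicateType": "int", "group": "P029"}
  }
}
""")

# One row per key of the variables object, skipping the metadata-like keys.
rows = [
    {
        "field_code": key,
        "label": value.get("label"),
        "concept": value.get("concept"),
        "predicate_type": value.get("predicateType"),
        "group": value.get("group"),
    }
    for key, value in doc["variables"].items()
    if key not in ("for", "in", "ucgid")
]
print(rows)
```

This is of course exactly the row shape the SQL query above is after; the challenge is getting DuckDB to produce it efficiently.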

Using read_text() and a DuckDB table column with the JSON-datatype


In an attempt to get the query working without running out of memory, I decided to store the dataset in a DuckDB table like so:
CREATE TABLE t_json
AS
SELECT  filename
,       content::JSON AS data
FROM    read_text('https://api.census.gov/data/2000/dec/sf1/variables.json');
With this in place, a modified version of the initial query might be:
WITH variables AS (
  SELECT  data->'variables'                       AS variables
  ,       unnest( json_keys( variables ) )        AS field_code
  FROM    t_json
)
SELECT    field_code
,         variables->field_code->>'label'         AS label
,         variables->field_code->>'concept'       AS concept
,         variables->field_code->>'predicateType' AS predicate_type
,         variables->field_code->>'group'         AS "group"
FROM      variables
WHERE     field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY  field_code
The results look like this:
┌────────────┬──────────────────────────────────┬─────────┬────────────────┬─────────┐
│ field_code │              label               │ concept │ predicate_type │  group  │
│  varchar   │             varchar              │ varchar │    varchar     │ varchar │
├────────────┼──────────────────────────────────┼─────────┼────────────────┼─────────┤
│ AIANHH     │ American Indian Area/Alaska Na…  │ NULL    │ NULL           │ N/A     │
│ AIHHTLI    │ American Indian Area (Off-Rese…  │ NULL    │ NULL           │ N/A     │
│ AITSCE     │ American Indian Tribal Subdivi…  │ NULL    │ NULL           │ N/A     │
│ ANRC       │ Alaska Native Regional Corpora…  │ NULL    │ NULL           │ N/A     │
│ BLKGRP     │ Census Block Group               │ NULL    │ NULL           │ N/A     │
│ BLOCK      │ Census Block                     │ NULL    │ NULL           │ N/A     │
│ CD106      │ Congressional District (106th)   │ NULL    │ string         │ N/A     │
│ CONCIT     │ Consolidated City                │ NULL    │ NULL           │ N/A     │
│   ·        │         ·                        │  ·      │  ·             │  ·      │
│   ·        │         ·                        │  ·      │  ·             │  ·      │
│   ·        │         ·                        │  ·      │  ·             │  ·      │
│ SUBMCD     │ Sub-Minor Civil Division (FIPS)  │ NULL    │ NULL           │ N/A     │
│ SUMLEVEL   │ Summary Level code               │ NULL    │ string         │ N/A     │
│ TRACT      │ Census Tract                     │ NULL    │ NULL           │ N/A     │
│ UA         │ Urban Area                       │ NULL    │ NULL           │ N/A     │
│ US         │ United States                    │ NULL    │ NULL           │ N/A     │
│ ZCTA3      │ ZIP Code Tabulation Area (Thre…  │ NULL    │ NULL           │ N/A     │
│ ZCTA5      │ Zip Code Tabulation Area (Five…  │ NULL    │ NULL           │ N/A     │
├────────────┴──────────────────────────────────┴─────────┴────────────────┴─────────┤
│ 8141 rows (15 shown)                                                     5 columns │
└────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 11.682 user 8.015625 sys 2.406250
Less than 12 seconds. Not super-fast, but still: a pretty substantial improvement as compared to the initial query. What could explain this difference? Two things stand out as compared to the initial query:
  • Using the read_text()-function to ingest the data. In the original query, the data was ingested by the first common table expression called source, which did a SELECT directly FROM the dataset's URL 'https://api.census.gov/data/2000/dec/sf1/variables.json'. Clearly, with this syntax, DuckDB will do some magic to perform the appropriate action, which in this case will be to invoke read_json_auto('https://api.census.gov/data/2000/dec/sf1/variables.json').
  • Explicitly casting the content of the data file to DuckDB's JSON-datatype. I expected the issue with the original query had something to do with the transformation of JSON data to a tabular result. Having the data in a JSON-datatype seemed like a good start to play with different methods to extract data. The read_text()-function returns the contents of the file as VARCHAR. It cannot and does not attempt to parse or process the data. That's why we need to cast its output explicitly to the JSON-type ourselves.
You might object to this comparison, as this query does not include the time to fetch the dataset from the internet, nor the time to store it in the DuckDB table. For me that takes about 6 to 10 seconds. However, we can easily rewrite the query to include read_text() directly:
WITH variables AS (
  SELECT  json( content )->'variables'            AS variables
  ,       unnest( json_keys( variables ) )        AS field_code
  FROM    read_text( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
)
SELECT    field_code
,         variables->field_code->>'label'         AS label
,         variables->field_code->>'concept'       AS concept
,         variables->field_code->>'predicateType' AS predicate_type
,         variables->field_code->>'group'         AS "group"
FROM      variables
WHERE     field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY  field_code
This query takes about 20 seconds, in other words: about the same time as it takes to run the previous query plus the time to fetch the dataset and load it in the table. For a serious comparison of query performance we should probably eliminate the download and use only local files, but we'll get to that later. For now it's enough to notice that this setup is a significant improvement, both in terms of query execution time as well as in stability/memory consumption.

You might also have noticed this alternative query uses slightly different syntax to extract label, concept, predicateType and group. Also, the final data presentation query is combined with the field extraction logic. However, these changes are really a matter of style; they do not have any bearing on query performance - not for this dataset, anyway.

Understanding the difference: the JSON-reader


With these results in mind, we have to conclude that the actual operations on the JSON data cannot really explain the difference. So, it must have something to do with the JSON-reader that was implicitly invoked by the initial query. Let's zoom in a bit on the difference between data extracted from a JSON-typed value and data extracted by the JSON-reader.

We know that in the second example, the data is of the JSON-type, because we explicitly cast it to that type. However, that only tells us that the data is essentially text conforming to the JSON-syntax. While it is of course very useful to know the encoding scheme used to represent values and structures, that still does not reveal anything about the content and structure of the data itself.

What data type does the JSON reader think the data has? Let's find out using this DESCRIBE-statement:
DESCRIBE
SELECT *
FROM 'https://api.census.gov/data/2000/dec/sf1/variables.json'
Its result is:
┌─────────────┬─────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │         column_type         │  null   │   key   │ default │  extra  │
│   varchar   │           varchar           │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables   │ STRUCT("for" STRUCT("labe…  │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 9.541 user 1.968750 sys 0.031250
As you can see, the variables property of the main object that makes up our dataset was detected and projected as a column. DuckDB has also inferred a pretty detailed data type for the column, expressed using DuckDB native SQL data types.

In the example above it's truncated, but if we run it again using .mode line, we can see the entire type descriptor. It goes like this:
STRUCT(
  "for" STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  )
, "in" STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  )
, ucgid STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  , hasGeoCollectionSupport BOOLEAN
  )
, P029009 STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  )
  
  ...many more ...
  
, PCT012H185 STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  )
)
Looks familiar? It should, as it corresponds closely to the structure we witnessed when analyzing the raw JSON data using a text editor.

So, this really is quite different as compared to the JSON-type column, as the JSON-reader must have done some work to explore the structure of the data. Not only did it detect the data type, it also uses that type to represent the data, which in turn determines what operators and functions will be available to the remainder of the query to work with the data.

Whether this is related to the difference in performance and to the out of memory errors, is yet to be determined, but it does not seem unlikely. For starters, the data type closely resembles the data. The textual description of the data type is about half the size of the data itself, which does not seem like a good thing!

We will investigate the JSON-reader's capabilities for data type detection and the possibilities to control it in the next couple of sections. But before we examine that in more detail, let's take a step back and consider what we just learned.

One lesson learned


So far, we've learned at least one important lesson: the JSON reader does in fact do exactly what its name implies - read JSON-encoded data. But while doing so, it tends to output data using DuckDB native data types, and typically not the JSON data type.

In hindsight this certainly sounds perfectly sensible. Yet, apparently neither @col_montium nor I fully realized it, as the original query takes the data coming out of the JSON reader, and processes it using functions like json_keys(), and the extraction operators -> and ->>. These are designed for working with values of the JSON data type, and not for the STRUCT values that the JSON reader hands back to us.

You may wonder: how is it possible that these functions and operators work at all on the STRUCT value returned by the JSON reader? The answer is quite simple: when we apply these functions and operators, they first implicitly cast their arguments and operands to the JSON-type. So, just as the JSON reader spent all its effort reading raw JSON-encoded data and converting it into neat native DuckDB STRUCTs, we immediately undid that effort by converting them back to the JSON-type. And initially we - or at least I - didn't even realize it.
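
The implicit cast is easy to observe on a tiny example. The following sketch (the literal is made up for illustration) passes a STRUCT - not JSON text - to json_keys(), which accepts only JSON and therefore casts its argument first:

```sql
-- A STRUCT literal, not JSON text; json_keys() implicitly casts it
-- to the JSON-type before extracting the keys label and group.
SELECT json_keys( {'label': 'x', 'group': 'N/A'} ) AS keys;
```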

Actually - it is a little bit different still, as the JSON-reader deals with reading JSON-encoded text whereas the JSON-data type is still a database type. But the point here is that the JSON-reader did work to provide a precise description of the data as STRUCTs, and we didn't bother treating the data as such. Instead, we implicitly cast it to the JSON-type, which is much looser. So it feels there might be an unused opportunity to benefit from the detected data type. We will examine this possibility in the next section.

Avoiding the implicit JSON-type cast?


Now that we have a better understanding of what we've done, could we rewrite the original query so we can avoid the implicit JSON-type cast, and benefit from the DuckDB native types returned by the JSON reader? Somewhat surprisingly, this is not quite as simple as one might think it would (or should) be.

First of all, there are no direct STRUCT equivalents for all of the JSON-type functions and operators.

For example, it is currently not possible to extract the keys from a STRUCT type. So there's no straightforward equivalent to how the original query uses json_keys() and then unnest() to spread out the keys from the original variables object into separate rows. (In the DuckDB GitHub repository, @pkoppstein opened a discussion about creating a feature to extract the keys from STRUCTs, in case you want to chime in.)

Even if we somehow managed to extract the keys from the STRUCT, there is currently no way to use it to extract the corresponding value. That is, STRUCTs of course support extraction methods, but all of them require the key to be a constant value. This prohibits using a column value, such as what we would get if we'd use an unnested set of keys:
WITH field_codes AS (
  SELECT  unnest( json_keys( variables ) ) AS field_code
  ,       variables
  FROM    'https://api.census.gov/data/2000/dec/sf1/variables.json'
)
SELECT  field_code
,       variables[field_code] AS "variable"
FROM    field_codes
WHERE   field_code NOT IN ('for', 'in', 'ucgid')
In the example above, variables is the STRUCT-typed value we receive from the JSON-reader. The expression variables[field_code] attempts to extract a value from it using the column value field_code. The statement fails because field_code is not a constant:
Run Time (s): real 8.345 user 1.531250 sys 0.031250
Binder Error: Key name for struct_extract needs to be a constant string
As the error message suggests, the square bracket extraction syntax is just syntactic sugar for the struct_extract() function. Using a function call instead yields the exact same error.

A Solution based on STRUCTs


Eventually I came up with an approach that at least allows me to envision what a solution based on STRUCTs, without relying on the JSON-type, might look like. Here's that attempt:
WITH variables AS (
  UNPIVOT (
    WITH variables AS (
      SELECT  COLUMNS( variables.* EXCLUDE( "for", "in", "ucgid" ) )
      FROM    read_json( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
    )
    SELECT  struct_pack(
              label := struct_extract( 
                COLUMNS( * )
              , 'label' 
              )
            , concept := struct_extract( 
                COLUMNS( * )
              , 'concept' 
              )
            , predicateType := struct_extract( 
                COLUMNS( * )
              , 'predicateType' 
              )
            , "group" := struct_extract( 
                COLUMNS( * )
              , 'group' 
              )
            )
    FROM    variables
  )
  ON( * )
  INTO NAME field_code VALUE "variable"
)
SELECT field_code, "variable".*
FROM variables
ORDER BY field_code
The main idea here is to use variables.* to turn the keys of the STRUCT given to us by the JSON-reader into separate columns. The UNPIVOT-statement can then turn those columns into rows. In the original query this was achieved by combining json_keys() and unnest().

Since we don't need the metadata-like objects ucgid, for and in, we also don't need to unpack them from the variables object. So, we wrapped variables.* into a COLUMNS-"star expression" so we could add an EXCLUDE-clause to remove them. In the original query, those were removed in the WHERE-clause.

The UNPIVOT-statement not only creates rows from columns, but also lets us transform the column names into values. We use the NAME-clause to collect them into a new field_code column. The VALUE-clause can be used to name the column that receives the corresponding column values. In the example above, that value column is assigned the name variable.

Now, the values extracted from the variables-object are themselves STRUCTs, and they do not all have the same set of keys. From the DuckDB perspective, they therefore have distinct types, and thus cannot all simply be lumped together directly into a single column.

However, we're only interested in a particular set of properties. They are 'label', 'concept', 'predicateType', and 'group'. If the type of their value is the same across all objects, then we can extract them and create new STRUCTs having only those keys. This is achieved using struct_extract() and struct_pack().
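
In isolation, the extraction-and-assembly step looks like this - a sketch with a made-up STRUCT literal, where struct_extract() pulls a field out by a constant key and struct_pack() assembles a new, narrower STRUCT:

```sql
-- Extract one field from a wider STRUCT and repack it into a narrower one.
SELECT struct_pack(
         label := struct_extract( {'label': 'Census Tract', 'limit': 0}, 'label' )
       ) AS narrowed;
```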

Again, the COLUMNS(*)-"star expression" proves to be a very useful tool! We use it here as argument for struct_extract(). This way, we need to write the STRUCT extraction-and-assembly business only once. The star expression then applies it to all columns without having to name them explicitly. The STRUCT-value assembled by struct_pack() gets the name of the original column.

The final touch is to unwrap the STRUCTs we assembled into separate columns. This is done in the terminal SELECT, again using the "variable".* syntax.

Nice! Pity it works only in theory


When we run it, we get:
Run Time (s): real 14.183 user 12.484375 sys 0.062500
Binder Error: Could not find key "concept" in struct
Apparently, not only do the objects in our data set sometimes have more keys than we're interested in, some also don't have all the keys we're attempting to extract. There does not seem to be any way to detect whether some arbitrary STRUCT-value has a specific key, so we can't simply work around it.

If we comment out the extraction for the concept key, we get the same error, but now for the predicateType key. If you comment out that extraction too, the query executes successfully. On my laptop, it takes about as long as the original query, that is to say, much worse than the alternative using the JSON-type.

So even if we would somehow be able to overcome the issues with extracting the properties and creating the STRUCT, we still can't seem to really benefit from the more strict typing. I guess the main take-away here is, we can STRUCTggle all we want, but our data simply appears not to be given to us in a way that allows it to work for us.

JSON reader Parameters


We mentioned earlier that SELECT-ing directly from the url, as the original query did, causes DuckDB to invoke read_json() or read_json_auto(), which are synonyms of each other. If you like, you can convince yourself by running an equivalent DESCRIBE-statement that explicitly invokes the reader:
DESCRIBE
SELECT *
FROM read_json( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
If you execute it, you'll notice that the inferred column name and data type are identical to that returned by the prior DESCRIBE-statement. But invoking the reader explicitly has a benefit in that it offers us the possibility to pass parameters to control its behavior. Some relevant to the topic at hand are:
BOOLEAN auto_detect
Whether to auto detect the schema at all. Contrary to what the current documentation states, the default value is TRUE rather than FALSE.
STRUCT(name VARCHAR, type VARCHAR) columns
If you choose auto_detect to be FALSE, you are required to explicitly specify the columns and column types.
INTEGER maximum_depth
The number of nesting levels that are considered when detecting the datatype. The default is -1, which means there is no restriction on the depth of type detection. When set to a positive integer, and there are object-type values at the maximum level of nesting, then those will get 'detected' as being of the JSON-type.
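
As a sketch of how the first two parameters combine, the following skips detection entirely and declares the single top-level column ourselves (choosing the JSON type for it is our assumption here, not something the reader infers):

```sql
SELECT *
FROM   read_json(
         'https://api.census.gov/data/2000/dec/sf1/variables.json'
       , auto_detect = false
       , columns     = { variables: 'JSON' }
       );
```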

Controlling the depth of the JSON-reader's type detection


Now that we learned about the maximum_depth parameter, let's apply it and experience its effect:
DESCRIBE
SELECT  *
FROM    read_json( 
          'https://api.census.gov/data/2000/dec/sf1/variables.json' 
        , maximum_depth = 0
        )
At maximum_depth = 0, there is no detection at all, and the entire data set is just a JSON-typed value:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ json        │ JSON        │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 7.583 user 0.375000 sys 0.015625
This is on par with what we achieved using read_text() and explicitly casting its content column to the JSON-type. Let's allow for one level:
DESCRIBE
SELECT  *
FROM    read_json( 
          'https://api.census.gov/data/2000/dec/sf1/variables.json' 
        , maximum_depth = 1
        )
At maximum_depth = 1 we get at least extraction into separate columns. However, the column data type will still be the generic JSON-type:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables   │ JSON        │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 8.521 user 0.343750 sys 0.000000
We could allow another level:
DESCRIBE
SELECT  *
FROM    read_json( 
          'https://api.census.gov/data/2000/dec/sf1/variables.json' 
        , maximum_depth = 2
        )
At maximum_depth = 2, the type detected for the variables column is a bag of properties of the JSON-type. In the DuckDB type system, this is represented as MAP( VARCHAR, JSON ):
┌─────────────┬────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │    column_type     │  null   │   key   │ default │  extra  │
│   varchar   │      varchar       │ varchar │ varchar │ varchar │ varchar │
├─────────────┼────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables   │ MAP(VARCHAR, JSON) │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴────────────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 8.229 user 0.343750 sys 0.000000
This is interesting! So far we've seen STRUCTs, but no MAPs yet.

JSON-objects as DuckDB MAPs

Let's attempt to rewrite the original query using the MAP( VARCHAR, JSON )-typed value we get from the JSON-reader when we pass maximum_depth = 2:
WITH variables AS (
  SELECT  unnest( map_keys( variables ) )   AS field_code
  ,       unnest( map_values( variables ) ) AS "variable"
  FROM    read_json( 
            'https://api.census.gov/data/2000/dec/sf1/variables.json' 
          , maximum_depth = 2
          )
)
SELECT    field_code
,         "variable"->>'label'              AS label
,         "variable"->>'concept'            AS concept
,         "variable"->>'predicateType'      AS predicate_type
,         "variable"->>'group'              AS "group"
FROM      variables
WHERE     field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY  field_code
The main difference as compared to using only the JSON-type is that the MAP-type lets us extract both the keys and the corresponding values, using map_keys() and map_values() respectively. This means the final SELECT does not have to extract the variable object explicitly using the key: instead, we can immediately extract the properties of interest from the variable objects.
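
On a tiny made-up MAP literal, the mechanism looks like this: unnest() spreads the keys and the values into rows in lockstep, pairing each key with its value:

```sql
-- map_keys() and map_values() return lists; unnesting both in the
-- same SELECT pairs each key with its corresponding value, row-wise.
SELECT unnest( map_keys(   MAP {'a': 10, 'b': 20} ) ) AS k
,      unnest( map_values( MAP {'a': 10, 'b': 20} ) ) AS v;
```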

In addition, unlike in the original query, the JSON-reader now hands the variable objects to us as JSON-typed values. The original query relied on implicit typecasting, which does not happen here.

If we download the JSON file and store it on the local disk, and modify our queries to use that, the results are quite remarkable: the one using read_text() and the explicit cast to the JSON-type takes about 12 seconds. However, that result is smashed completely by the last query, which now takes just half a second! A really nice improvement!

Conclusion


  • DuckDB's feature to SELECT directly from a file or URL is convenient, but it pays off to examine the underlying reader invocation and its parameters.
  • When using the JSON-reader, it's a good idea to examine whether the detected datatype is suitable for the extraction you're attempting to perform. By default, the JSON-reader will exhaustively detect the type. If you find the detected type is large as compared to your dataset, and/or more precise or detailed than required for your purpose, try using the maximum_depth parameter to curtail the type detection.
  • If you're using the JSON-reader and find that your extraction logic relies on JSON-type functions and operators, then beware of implicit casting of your extracted data to the JSON-type. Implicit casting to the JSON-type may point to an opportunity to limit the level of data type detection.
  • When reading JSON-data, you can try extracting from a JSON-typed value - either by loading your data into a table with a JSON-typed column, or by using a combination of read_text() and an explicit cast to the JSON-type. Even if your goal is to use the JSON-reader, the JSON-typed value is a useful baseline: if you find that your JSON-reader based query is slower than the one on the JSON-typed value, it means you have an opportunity to improve.
  • If you're using json_keys() with the purpose of extracting objects or values from the JSON-data, then consider using the JSON-reader and configuring it so that it returns a MAP-type. The MAP functions map_keys() and map_values() are really fast, and may help you avoid an extra step of using the key to extract an object.
  • It would be nice if STRUCTs were a little bit more flexible. For example, I'd like to be able to extract field values using non-constant expressions. It would also be useful to have a function to obtain a STRUCT's keys, and a way to check whether it has a particular key. Another feature that would have helped is the ability to define STRUCT-fields as optional, or with a default value.

Friday, July 12, 2024

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the sixth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for recording digitized chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Rolling up each game's lines into a single game row


All essential elements are in place for transforming the PGN lines into a tabular result. The actual transformation entails two things:
  • Grouping all lines with the same game_id into one single row.
  • Creating columns for each unique value of tag_name from tag_pair, and placing the corresponding tag_value into those columns.
We can do that using DuckDB's PIVOT-statement.

Typical PIVOT-statements


The PIVOT-statement is typically used for OLAP use cases to create analytical crosstabs. In this context, we can think of PIVOT as an extension of a standard SELECT-statement with a GROUP BY-clause: each row is actually an aggregate row that represents the group of rows from the underlying dataset that have a unique combination of values for all expressions appearing in the GROUP BY-clause.

In addition, PIVOT also has an ON-clause, and each unique combination of values coming from the ON-clause expressions generates an aggregate column.

At the intersection of the aggregate rows and aggregate columns of the crosstab are the cell values. These are specified by the USING-clause, which specifies aggregate function expressions that are to be applied on those rows from the underlying resultset that belong to both the aggregate row and the aggregate column.

In typical OLAP use cases, the cell values are SUM()s or AVG()s over monetary amounts or quantities, and sometimes COUNT()s.
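For instance, a classic analytical crosstab over a hypothetical sales table (table and column names invented purely for illustration) might look like:

```sql
-- One aggregate row per region, one generated column per year,
-- and SUM(amount) at each intersection.
PIVOT sales
ON    year
USING SUM(amount)
GROUP BY region;
```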

PIVOT as a row-to-column transposer


The description of the PIVOT-statement above provides some hints on how we can use it to transform the game lines to a single game row, and how to turn the tags into separate columns:
  • We want to roll up the game lines of one game into a single game row, so the game_id expression should be placed in the GROUP BY-clause.
  • The tag names should be used to generate columns, so the tag_name-member of the tag_pair STRUCT-value returned by regexp_extract() should be placed in the ON-clause.
  • The tag_values should appear as cell values, so the tag_value-member of tag_pair should be placed in the USING-clause.
This makes it all a bit different from the typical analytical crosstabs:
  • The tag_values that are to appear as cell values are primarily of a text type: player names, event locations, and sometimes date- or time-like values. There are some numerical values too, like Elo scores, but these are non-additive, and quite unlike the amounts and quantities we find in the typical OLAP-case.
  • As the tag pairs are just attributes of the game, we expect the set of tag_name values within one game to be unique. That means that for each game, we will find at most one tag_value in each generated column.
So for the PGN use case, it is more natural to think of the PIVOT-statement as a device to transpose rows to columns, rather than as an analytical crosstab.

Even though we expect only a single text value for the tag_value, the PIVOT-statement's USING-clause still requires some kind of aggregate function. To aggregate tag_value, we can settle for anything that preserves the text: MIN(), MAX(), or ANY_VALUE(), as well as the text aggregate STRING_AGG(), would all do.
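A quick way to convince ourselves: with a single inline value, each of these aggregates simply returns that value:

```sql
-- For a one-row group, all of these preserve the text unchanged.
SELECT MIN(v)             AS min_result
,      ANY_VALUE(v)       AS any_value_result
,      STRING_AGG(v, ' ') AS string_agg_result
FROM   (VALUES ('Hastings')) AS t(v);
-- all three columns contain Hastings
```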

When we put it all together, this is what our initial PIVOT-statement looks like:
PIVOT(
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE WHEN line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
            , [ 'tag_name', 'tag_value' ]
            )
        END                                          AS tag_pair
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
)
ON tag_pair['tag_name']
USING ANY_VALUE( tag_pair['tag_value'] )
GROUP BY game_id
This is what its result looks like:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬─────────┬─────────────────────┬──────────────────────┬──────────┐
│ game_id │        Black         │ BlackElo │    Date    │ . │  Round  │        Site         │        White         │ WhiteElo │
│  int64  │       varchar        │ varchar  │  varchar   │   │ varchar │       varchar       │       varchar        │ varchar  │
├─────────┼──────────────────────┼──────────┼────────────┼───┼─────────┼─────────────────────┼──────────────────────┼──────────┤
│       1 │ Pollock, William H.  │          │ 1895.??.?? │ . │ ?       │ Hastings            │ Tinsley, Samuel      │          │
│       2 │ Lasker, Edward       │          │ 1913.??.?? │ . │ ?       │ Scheveningen        │ Loman, Rudolf        │          │
│       3 │ Tartakower, Saviely  │          │ 1921.??.?? │ . │ 5       │ The Hague           │ Alekhine, Alexander  │          │
│       4 │ Wegemund, Otto       │          │ 1922.??.?? │ . │ 7       │ Bad Oeynhausen      │ Antze, O.            │          │
│       5 │ Tarrasch, Siegbert   │          │ 1922.??.?? │ . │ 19      │ Bad Pistyan         │ Johner, Paul F       │          │
│       6 │ Alekhine, Alexander  │          │ 1922.??.?? │ . │ ?       │ Hastings            │ Bogoljubow, Efim     │          │
│       7 │ Kmoch, Hans          │          │ 1922.??.?? │ . │ ?       │ Vienna              │ Rubinstein, Akiba    │          │
│       8 │ Mieses, Jacques      │          │ 1923.??.?? │ . │ 9       │ Hastings            │ Norman, George Mar.  │          │
│       9 │ Orlando, Placido     │          │ 1923.??.?? │ . │ ?       │ Trieste             │ Szabados, Eugenio    │          │
│      10 │ Tarrasch, Siegbert   │          │ 1923.??.?? │ . │ 1       │ Trieste             │ Seitz, Jakob Adolf   │          │
│      11 │ Wolf, Siegfried Re.  │          │ 1923.??.?? │ . │ 5       │ Vienna              │ Von Patay, J.        │          │
│      12 │ Tartakower, Saviely  │          │ 1924.??.?? │ . │ ?       │ New York            │ Bogoljubow, Efim     │          │
│      13 │ Pokorny, Amos        │          │ 1926.??.?? │ . │ 3       │ Trencianske Teplice │ Kostic, Boris        │          │
│      14 │ Tartakower, Saviely  │          │ 1927.??.?? │ . │ 2       │ Kecskemet           │ Vukovic, Vladimir    │          │
│      15 │ Botvinnik, Mikhail   │          │ 1927.??.?? │ . │ 2       │ Moscow              │ Rabinovich, Ilya L.  │          │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│    7229 │ Kovacevic,Bl         │ 2400     │ 2023.12.09 │ . │ 7.3     │ Zagreb CRO          │ Kozul,Z              │ 2532     │
│    7230 │ Iskos,A              │ 2153     │ 2023.12.10 │ . │ 6.46    │ Skopje MKD          │ Zhezhovska,Monika    │ 1826     │
│    7231 │ Spichkin,A           │ 2035     │ 2023.12.12 │ . │ 2       │ chess.com INT       │ Rodriguez Santiago,J │ 2043     │
│    7232 │ Rogov,Matfey         │ 2213     │ 2023.12.12 │ . │ 3       │ chess.com INT       │ Clarke,Matthew       │ 2127     │
│    7233 │ Osmonbekov,T         │ 2137     │ 2023.12.12 │ . │ 3       │ chess.com INT       │ Sroczynski,M         │ 2266     │
│    7234 │ Novikova,Galina      │ 2073     │ 2023.12.12 │ . │ 8       │ chess.com INT       │ Marcziter,D          │ 2192     │
│    7235 │ Tomazini,A           │ 2336     │ 2023.12.14 │ . │ 5.24    │ Zagreb CRO          │ Pultinevicius,Paul.  │ 2584     │
│    7236 │ Spichkin,A           │ 2035     │ 2023.12.19 │ . │ 2       │ chess.com INT       │ Levine,D             │ 2040     │
│    7237 │ Kanyamarala,Tarun    │ 2305     │ 2023.12.19 │ . │ 4       │ chess.com INT       │ Nechitaylo,Nikita    │ 2203     │
│    7238 │ Ronka,E              │ 2291     │ 2023.12.19 │ . │ 4       │ chess.com INT       │ Gruzman,Ilya         │ 2151     │
│    7239 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.26 │ . │ 1.37    │ Samarkand UZB       │ Mammadzada,G         │ 2449     │
│    7240 │ Koneru,H             │ 2554     │ 2023.12.26 │ . │ 3.24    │ Samarkand UZB       │ Peycheva,Gergana     │ 2271     │
│    7241 │ Carlsson,Andreas     │ 1902     │ 2023.12.28 │ . │ 4.9     │ Karlstad SWE        │ Kreken,Eivind Grunt  │ 2271     │
│    7242 │ Kazarjan,Gachatur    │ 2078     │ 2023.12.30 │ . │ 9.31    │ Groningen NED       │ Schuricht,Emil Fre.  │ 2095     │
│    7243 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.30 │ . │ 14.47   │ Samarkand UZB       │ Zhu Chen             │ 2423     │
├─────────┴──────────────────────┴──────────┴────────────┴───┴─────────┴─────────────────────┴──────────────────────┴──────────┤
│ 7243 rows (30 shown)                                                                                    11 columns (8 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This certainly is starting to look a lot like the result we were after.

Folding in the movetext


The only thing still missing is the movetext, but at this point it's almost trivial to add. We can simply amend the tag_pair-expression and let it return a new STRUCT-value, with the literal text 'moves' as its name-member and the line itself as its value-member, whenever the is_header expression is not TRUE:
CASE
  WHEN is_header THEN
    regexp_extract(
      line
    , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
    , [ 'column_name', 'column_value' ]
    )
  ELSE {
    'column_name': 'moves'
  , 'column_value': line
  }
END AS column_name_value
For consistency, we also changed the tag_pair alias to column_name_value, and its member names from tag_name and tag_value to column_name and column_value respectively. Therefore, we must also update the references elsewhere in the PIVOT statement accordingly.

Also, because one game can span multiple lines of movetext, we must change the aggregate function in the USING-clause from ANY_VALUE() to STRING_AGG(). After these changes, we get the final statement:
PIVOT(
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE WHEN line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
            , [ 'column_name', 'column_value' ]
            )
          ELSE {
            'column_name': 'moves'
          , 'column_value': line
          }
        END                                          AS column_name_value
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
)
ON column_name_value['column_name']
USING STRING_AGG( column_name_value['column_value'], ' ' )
GROUP BY game_id
And its result:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬──────────────────────┬──────────┬──────────────────────┐
│ game_id │        Black         │ BlackElo │    Date    │ . │        White         │ WhiteElo │        moves         │
│  int64  │       varchar        │ varchar  │  varchar   │   │       varchar        │ varchar  │       varchar        │
├─────────┼──────────────────────┼──────────┼────────────┼───┼──────────────────────┼──────────┼──────────────────────┤
│       1 │ Pollock, William H.  │          │ 1895.??.?? │ . │ Tinsley, Samuel      │          │ 1.d4 f5 2.c4 e6 3..  │
│       2 │ Lasker, Edward       │          │ 1913.??.?? │ . │ Loman, Rudolf        │          │ 1.d4 e6 2.c4 f5 3..  │
│       3 │ Tartakower, Saviely  │          │ 1921.??.?? │ . │ Alekhine, Alexander  │          │ 1.d4 f5 2.c4 e6 3..  │
│       4 │ Wegemund, Otto       │          │ 1922.??.?? │ . │ Antze, O.            │          │ 1.c4 f5 2.d4 Nf6 3.  │
│       5 │ Tarrasch, Siegbert   │          │ 1922.??.?? │ . │ Johner, Paul F       │          │ 1.d4 e6 2.c4 f5 3..  │
│       6 │ Alekhine, Alexander  │          │ 1922.??.?? │ . │ Bogoljubow, Efim     │          │ 1.d4 f5 2.c4 Nf6 3.  │
│       7 │ Kmoch, Hans          │          │ 1922.??.?? │ . │ Rubinstein, Akiba    │          │ 1.d4 e6 2.c4 f5 3..  │
│       8 │ Mieses, Jacques      │          │ 1923.??.?? │ . │ Norman, George Mar.  │          │ 1.d4 f5 2.g3 Nf6 3.  │
│       9 │ Orlando, Placido     │          │ 1923.??.?? │ . │ Szabados, Eugenio    │          │ 1.d4 e6 2.c4 f5 3..  │
│      10 │ Tarrasch, Siegbert   │          │ 1923.??.?? │ . │ Seitz, Jakob Adolf   │          │ 1.c4 e6 2.d4 f5 3..  │
│      11 │ Wolf, Siegfried Re.  │          │ 1923.??.?? │ . │ Von Patay, J.        │          │ 1.d4 e6 2.c4 f5 3..  │
│      12 │ Tartakower, Saviely  │          │ 1924.??.?? │ . │ Bogoljubow, Efim     │          │ 1.d4 f5 2.g3 e6 3..  │
│      13 │ Pokorny, Amos        │          │ 1926.??.?? │ . │ Kostic, Boris        │          │ 1.c4 f5 2.d4 Nf6 3.  │
│      14 │ Tartakower, Saviely  │          │ 1927.??.?? │ . │ Vukovic, Vladimir    │          │ 1.d4 f5 2.c4 e6 3..  │
│      15 │ Botvinnik, Mikhail   │          │ 1927.??.?? │ . │ Rabinovich, Ilya L.  │          │ 1.d4 e6 2.c4 f5 3..  │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│    7229 │ Kovacevic,Bl         │ 2400     │ 2023.12.09 │ . │ Kozul,Z              │ 2532     │ 1.d4 e6 2.c4 f5 3..  │
│    7230 │ Iskos,A              │ 2153     │ 2023.12.10 │ . │ Zhezhovska,Monika    │ 1826     │ 1.d4 e6 2.c4 f5 3..  │
│    7231 │ Spichkin,A           │ 2035     │ 2023.12.12 │ . │ Rodriguez Santiago,J │ 2043     │ 1.d4 e6 2.c4 f5 3..  │
│    7232 │ Rogov,Matfey         │ 2213     │ 2023.12.12 │ . │ Clarke,Matthew       │ 2127     │ 1.d4 e6 2.c4 f5 3..  │
│    7233 │ Osmonbekov,T         │ 2137     │ 2023.12.12 │ . │ Sroczynski,M         │ 2266     │ 1.d4 e6 2.c4 f5 3..  │
│    7234 │ Novikova,Galina      │ 2073     │ 2023.12.12 │ . │ Marcziter,D          │ 2192     │ 1.d4 e6 2.c4 f5 3..  │
│    7235 │ Tomazini,A           │ 2336     │ 2023.12.14 │ . │ Pultinevicius,Paul.  │ 2584     │ 1.d4 e6 2.c4 f5 3..  │
│    7236 │ Spichkin,A           │ 2035     │ 2023.12.19 │ . │ Levine,D             │ 2040     │ 1.d4 e6 2.c4 f5 3..  │
│    7237 │ Kanyamarala,Tarun    │ 2305     │ 2023.12.19 │ . │ Nechitaylo,Nikita    │ 2203     │ 1.d4 e6 2.c4 f5 3..  │
│    7238 │ Ronka,E              │ 2291     │ 2023.12.19 │ . │ Gruzman,Ilya         │ 2151     │ 1.d4 f5 2.g3 e6 3..  │
│    7239 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.26 │ . │ Mammadzada,G         │ 2449     │ 1.d4 f5 2.c4 Nf6 3.  │
│    7240 │ Koneru,H             │ 2554     │ 2023.12.26 │ . │ Peycheva,Gergana     │ 2271     │ 1.d4 e6 2.c4 f5 3..  │
│    7241 │ Carlsson,Andreas     │ 1902     │ 2023.12.28 │ . │ Kreken,Eivind Grunt  │ 2271     │ 1.d4 e6 2.c4 f5 3..  │
│    7242 │ Kazarjan,Gachatur    │ 2078     │ 2023.12.30 │ . │ Schuricht,Emil Fre.  │ 2095     │ 1.d4 f5 2.g3 e6 3..  │
│    7243 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.30 │ . │ Zhu Chen             │ 2423     │ 1.d4 f5 2.g3 Nf6 3.  │
├─────────┴──────────────────────┴──────────┴────────────┴───┴──────────────────────┴──────────┴──────────────────────┤
│ 7243 rows (30 shown)                                                                           12 columns (7 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Next steps


From this point on, there are many things that could be done to improve the solution, for example:
  • Column values that originate from header tags should have their escaped characters un-escaped.
  • The PGN-syntax itself does not dictate this, but there are established conventions for which tag names to use and what kind of values are appropriate. For example, see the seven tag roster and optional tag pairs in the Wikipedia entry for Portable Game Notation. It would make sense to further cleanse and conform the corresponding columns and give them a more suitable data type, or to actively validate their values.
  • Further data normalization could be attempted by creating separate tables for players, events, openings, etc.
  • The moves could be further processed and analyzed to derive a table of board positions, something which would greatly increase the opportunities to analyze games.
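The first of these points, un-escaping, could be sketched with a single regexp_replace() call. The input string here is a made-up example value; treat this as an untested sketch rather than a finished solution:

```sql
-- Replace each PGN escape sequence (\" or \\) with the escaped character itself.
SELECT regexp_replace(
  'The \"Dutch\" Defence \\ demo'
, '\\(["\\])'   -- a backslash followed by a double quote or a backslash...
, '\1'          -- ...is replaced by just the captured character
, 'g'           -- apply to all matches, not just the first
) AS unescaped;
```

This should yield The "Dutch" Defence \ demo.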
We could also improve the statement and make it more robust:
  • Better detection of the first game line. Our assumption has been that the first movetext always starts with '1.'. But what if a game does not have any moves at all? That may sound impossible, but especially on an online chess site, a player's connection might break after a game was started but before a move was made.

    Whatever the reason may be, and whether we're interested in such games or not, our current solution is not capable of detecting these cases. Instead, games simply aren't identified as intended, and our final game row would likely be a mixture of two or maybe even more games. Bad bad bad!

    (If you're interested in looking into such a scenario, the Lichess chess database for October 2013 contains 411,039 games, but 140 do not have any moves.)
  • A more robust regular expression to deal with games that may not adhere fully to the PGN syntax (for example, more forgiving handling of whitespace).
  • Better handling of errors when reading CSV.
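As an illustration of the first of these robustness points, the game boundary could instead be derived from the transition into a header block: a header line that is not preceded by another header line starts a new game. A rough sketch (untested, and relying on file order just like the original query does):

```sql
WITH lines AS (
  SELECT  line
  ,       line LIKE '[%' AS is_header
  FROM    read_csv(
            'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
          , columns = {'line': 'VARCHAR'}
          )
  WHERE   line IS NOT NULL
),
flagged AS (
  SELECT  *
  -- a header line not preceded by another header line starts a new game
  ,       is_header AND NOT COALESCE(LAG(is_header) OVER(), FALSE) AS starts_game
  FROM    lines
)
SELECT  *
,       SUM(CASE WHEN starts_game THEN 1 ELSE 0 END) OVER (
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS game_id
FROM    flagged
```

This would also count a game that consists only of header lines, with no movetext at all.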
For now, we leave these considerations as an exercise for the reader.

The goal of these posts was to show how DuckDB's features and SQL-dialect allow these kinds of raw-text processing tasks to be solved quickly and elegantly. I hope I have succeeded in demonstrating that - it sure was a lot of fun to try!

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Extracting Tagpairs with Regular Expressions (5/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the fifth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for recording digitized chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Parsing PGN tagpairs: Regular Expressions to the rescue!


Let's apply our attention now to the header lines to extract information embedded in the tagpairs to expose it in a more useful form. We'll be using this regular expression to do that:
^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$
If you're not familiar with regular expressions, or if you're not convinced this is correct, that's fine! The first part of this installment describes in detail how I arrived at this expression. In the second part, we will see how we can use regular expressions in DuckDB to extract information from PGN headers.

Crafting a regular expression for PGN tagpairs


From our description of the tagpair-syntax, we know that the tag and value must be inside the square brackets, and that tag and its value are separated by whitespace. We also know that value is enclosed in double quotes. We may stylize this syntax as follows:
line-start [ tag whitespace "value" ] line-end
In the syntax model above, the square brackets and double quotes are literal characters; the remaining items - tag, whitespace, and value - are syntactical concepts that we can't quite describe as a single, fixed sequence of characters. But we can describe them using a pattern that specifies exactly what sequence of characters we would expect there, given the assumption that we're reading valid PGN files.

We will now turn this model into a regular expression by replacing the items from the model with patterns. Please be advised that this section is not intended to be a regular expressions tutorial - there are plenty of online resources for that. It's just here to explain how I arrived at the final regular expression: this will make it easier for you to understand its structure, put you in a better position to understand its limitations, and, if necessary, to adapt it to your specific needs.

First, lets replace line-start and line-end with ^ and $ anchors:
^ [ tag whitespace "value" ] $
Next, the square brackets that enclose our tagpair are regular expression metacharacters, so to match them literally, we need to escape them using a backslash:
^ \[ tag whitespace "value" \] $
Extracting tags: whitespace and not-whitespace
We can now replace the whitespace-placeholder with a pattern that matches actual whitespace. We do this by specifying one whitespace character, \s, followed by +, indicating that it must occur at least once and could repeat many times:
^ \[ tag \s+ "value" \] $
We know the tag is followed by whitespace. From that fact, it follows that the tag cannot itself contain any whitespace: if it could, we wouldn't be able to distinguish the whitespace that demarcates the end of the tag from whitespace appearing inside the tag.

So, we can craft the pattern matching the tag as any character that is not whitespace ([^\s]), one or more times (denoted by +). The tag is information we want to extract; therefore we enclose it in parentheses so we can capture it:
^ \[ ([^\s]+) \s+ "value" \] $
Refine the syntax model to extract the value
To replace the value placeholder with a pattern it will be helpful to refine our syntax model a little bit. Let's zoom in on just the value from our original syntax model, and introduce a value-character.

The value-character stands for any character that may appear between the double-quotes demarcating the value. There may be any number of value-characters inside value:
" value-character* "
As for finding a pattern for value-character, we know for sure that such a character cannot be a double quote, because we would not be able to distinguish it from the double quote that indicates the end of the value.

This is completely analogous to the prior example we saw with tag, which is followed by white-space and therefore cannot itself contain whitespace. So just like we wrote "not-whitespace" as [^\s] for the tag, we can write "not-double-quote" as: [^"].
Modeling Character Escape Sequences
The PGN syntax specifies that a literal double quote inside the value can be denoted by immediately preceding it with a backslash: \". The backslash preceding the double quote is a so-called "escape" character, relieving the double quote from its normal role of demarcating the end of the value.

Whenever a syntax defines an escape character device, it automatically introduces the problem of how to denote a literal occurrence of the escape character. That issue is usually solved simply by escaping the escape character, and that's also how the PGN syntax solves it. So in order to represent a literal backslash \ in the PGN-value, it must be preceded immediately with the escape character, which is also the backslash. Thus, it becomes: \\.

Now that we realize a value-character can either be escaped or not escaped, it is a good idea to replace its syntax model with something more refined:
" ( \ escaped-value-character | unescaped-value-character )* "
Note: the pipe-character | is the standard regular expression syntax to denote alternative options ("either/or" in English).

We can immediately replace unescaped-value-character with the "not-double-quote" pattern we identified earlier:
" ( \ escaped-value-character | [^"] )* "
Now let's replace the PGN escape character \ with its regular expression pattern. To do that, we need to realize the backslash is also a metacharacter in the regular expression context, where it's also used as escape character! So, in order to represent the PGN escape character in a regular expression, we need to escape it in the regular expression context. That sounds harder than it is - it simply becomes \\. Here, the first backslash is the regular expression escape character, and the second backslash is a literal backslash, which is the escape character in the PGN context:
" ( \\ escaped-value-character | [^"] )* "
We know that in the PGN-context, both the double quote and the backslash are escaped characters. We also know that in the regular expression context, the backslash needs to be escaped. So, we can now replace escaped-value-character with ["\\]:
" ( \\ ["\\] | [^"] )* "
As a final touch to the value pattern, we enclose the value in parentheses to establish a capturing group. Just like we did earlier for the tag, this will allow us to extract the value:
" ( (\\ ["\\] | [^"] )* ) "
Now that we have found a pattern for value, we can complete the entire pattern. The double quotes around value can be matched directly, and need no special treatment to be used in the regular expression:
^ \[ ([^\s]+) \s+ "( ( \\ ["\\] | [^"] )* )" \] $
Are you still reading? Thanks for hanging in there! Let's move on to pluck the fruits from our labour.

Using regexp_extract() to extract tags and values


In DuckDB, we can extract the matches made by a regular expression using regexp_extract(). This is an overloaded function. For all versions of regexp_extract(), the first two arguments are the same:
  1. The string to which the regular expression is to be applied
  2. The regular expression
In the simplest usage, it takes only these two arguments and returns the string that was matched. For our use case, this would not be useful, as our regular expression should match any complete header line: it would simply return the header line we feed into the regular expression.
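For example, applying the two-argument form to a header line returns the line itself:

```sql
SELECT regexp_extract(
  '[Event "Hastings"]'
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
) AS match;
-- returns the entire matched string: [Event "Hastings"]
```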
Using the capturing groups
We want to extract the tag and tagvalue from the header lines, and for this purpose we created corresponding capturing groups. Fortunately, we can pass the index of the capturing group as the third argument to regexp_extract(). In this usage, the function should return whatever was matched by the capturing group indicated by the 3rd argument.

The first capturing group corresponds to the tag, so to extract it we could write:
regexp_extract(
  '[Event "Hastings"]'
  --   capture1
  --   ___|__
  --  |      |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 1
)
And this would give us Event as result. Likewise, to extract the value we could write:
regexp_extract(
  '[Event "Hastings"]'
  --                  capture2
  --               _______|_______
  --              |               |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 2
)
And this would yield Hastings as the result.
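Group 2 also works when the value contains escape sequences. With a made-up tag pair containing escaped double quotes:

```sql
SELECT regexp_extract(
  '[Event "The \"Dutch\" Open"]'
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 2
) AS tag_value;
-- the value is captured with its escape sequences intact: The \"Dutch\" Open
```

Note that the escape sequences are not resolved by extraction; un-escaping them is a separate step.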
Regular Expressions in SQL
Now, many database products offer support for regular expressions. Products with basic support offer at least matching against a regular expression, and typically also extraction and replacement of a single matched pattern or capturing group, which is the example I just gave above.

More advanced products also offer a feature to split a string on the regular expression matches to an array or a resultset.
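In DuckDB, for example, splitting on a pattern looks like this:

```sql
-- Split a fragment of movetext on runs of whitespace
SELECT regexp_split_to_array('1.d4 f5  2.c4   e6', '\s+') AS tokens;
-- tokens is a LIST containing 1.d4, f5, 2.c4 and e6
```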
Regular Expressions in DuckDB
DuckDB supports all of the previously mentioned regular expression usages. In addition, DuckDB allows you to extract the matches of multiple capturing groups, all in one go! To do that, we can pass a list of names to regexp_extract() as the 3rd argument:
regexp_extract(
  '[Event "Hastings"]'
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, [ 'tag_name', 'tag_value' ]
)
With this usage, the function returns a STRUCT-value with the names passed in the 3rd argument as keys. Each key is assigned the text matched by the capturing group at the corresponding position in the regular expression.

So in our example, the 3rd argument is the list ['tag_name', 'tag_value']. The name 'tag_name' appears at the first position in the list, and thus the 'tag_name' key in the return value will be assigned the text matched by the first capturing group in the regular expression. Likewise, 'tag_value' appears at the second position, and thus the match made by the second capturing group will be assigned to that key.

For our example, the return value is:
┌───────────────────────────────────────────────┐
│                value                          │
│ struct(tag_name varchar, "tag_value" varchar) │
├───────────────────────────────────────────────┤
│ {'tag_name': Event, 'tag_value': Hastings}    │
└───────────────────────────────────────────────┘
I think this is pretty awesome! We only have to write and apply a particular regular expression once, instead of having to repeat it for each group we want to capture. This avoids code duplication and your SQL can be shorter, easier to read, and therefore, more maintainable.

It would not be unreasonable to assume that a single regular expression invocation returning all the matches all at once might also be faster than calling out multiple times to return only one match at a time. However, for this particular case, I found the performance to be on par.

Adding the Regular Expression to our Query


Let's incorporate the tag extraction logic into our query. Because tag extraction is only applicable to header lines, it feels appropriate to use the is_header flag to check whether the regular expression should be applied. As we already saw, we can benefit from DuckDB's reusable alias feature to do that:
CASE
  WHEN is_header THEN
    regexp_extract(
      line
    , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
    , [ 'tag_name', 'tag_value' ]
    )
END
With this addition our query becomes:
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE WHEN line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
            , [ 'tag_name', 'tag_value' ]
            )
        END                                          AS tag_pair
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
Our results are now:
┌──────────────────────────────────────────────────┬───────────┬─────────┬───────────────────────────────────────────────────────┐
│                       line                       │ is_header │ game_id │                       tag_pair                        │
│                     varchar                      │  boolean  │  int64  │         struct(tag_name varchar, "tag_value" varchar) │
├──────────────────────────────────────────────────┼───────────┼─────────┼───────────────────────────────────────────────────────┤
│ [Event "Hastings"]                               │ true      │       1 │ {'tag': Event, 'value': Hastings}                     │
│ [Site "Hastings"]                                │ true      │       1 │ {'tag': Site, 'value': Hastings}                      │
│ [Date "1895.??.??"]                              │ true      │       1 │ {'tag': Date, 'value': 1895.??.??}                    │
│ [Round "?"]                                      │ true      │       1 │ {'tag': Round, 'value': ?}                            │
│ [White "Tinsley, Samuel"]                        │ true      │       1 │ {'tag': White, 'value': Tinsley, Samuel}              │
│ [Black "Pollock, William Henry Kraus"]           │ true      │       1 │ {'tag': Black, 'value': Pollock, William Henry Kraus} │
│ [Result "1-0"]                                   │ true      │       1 │ {'tag': Result, 'value': 1-0}                         │
│ [WhiteElo ""]                                    │ true      │       1 │ {'tag': WhiteElo, 'value': }                          │
│ [BlackElo ""]                                    │ true      │       1 │ {'tag': BlackElo, 'value': }                          │
│ [ECO "A90"]                                      │ true      │       1 │ {'tag': ECO, 'value': A90}                            │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O …  │ false     │       1 │                                                       │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc…  │ false     │       1 │                                                       │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20…  │ false     │       1 │                                                       │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 …  │ false     │       1 │                                                       │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34…  │ false     │       1 │                                                       │
│               ·                                  │  ·        │       · │                       ·                               │
│               ·                                  │  ·        │       · │                       ·                               │
│               ·                                  │  ·        │       · │                       ·                               │
│ [Black "Kurbonboeva,Sarvinoz"]                   │ true      │    7243 │ {'tag': Black, 'value': Kurbonboeva,Sarvinoz}         │
│ [Result "1-0"]                                   │ true      │    7243 │ {'tag': Result, 'value': 1-0}                         │
│ [WhiteElo "2423"]                                │ true      │    7243 │ {'tag': WhiteElo, 'value': 2423}                      │
│ [BlackElo "2154"]                                │ true      │    7243 │ {'tag': BlackElo, 'value': 2154}                      │
│ [ECO "A90"]                                      │ true      │    7243 │ {'tag': ECO, 'value': A90}                            │
│ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh…  │ false     │    7243 │                                                       │
│ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe…  │ false     │    7243 │                                                       │
│ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rx…  │ false     │    7243 │                                                       │
│ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe…  │ false     │    7243 │                                                       │
│ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 3…  │ false     │    7243 │                                                       │
│ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3…  │ false     │    7243 │                                                       │
│ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.h…  │ false     │    7243 │                                                       │
│ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2…  │ false     │    7243 │                                                       │
│ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63…  │ false     │    7243 │                                                       │
│ 66.Ke4  1-0                                      │ false     │    7243 │                                                       │
├──────────────────────────────────────────────────┴───────────┴─────────┴───────────────────────────────────────────────────────┤
│ 117000 rows (30 shown)                                                                                               4 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Next Installment: Rolling up PGN lines to one game row


Next time, we will see how to roll up the multiple lines from a game into a single game row.
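To give a taste of where this is heading, here is one way such a roll-up could be sketched in DuckDB SQL. This is a hypothetical sketch, not necessarily the approach the next post will take: it assumes the parsed lines are stored in a table called pgn_lines with the line, is_header, game_id, and tag_pair columns shown in the result above, and it uses DuckDB's FILTER clause on aggregates to separate header lines from move-text lines.

```sql
-- Hypothetical sketch: collapse the per-line table into one row per game.
-- Assumes a table pgn_lines(line, is_header, game_id, tag_pair) as shown above.
SELECT game_id
,      list(tag_pair) FILTER (WHERE is_header)            AS tags
,      string_agg(line, ' ') FILTER (WHERE NOT is_header) AS movetext
FROM   pgn_lines
GROUP BY game_id
ORDER BY game_id
```

Note that without an explicit line-number column, the order in which string_agg concatenates the move-text lines is not guaranteed; that is one of the details a real solution would need to handle.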
