Friday, July 12, 2024

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Extracting Tagpairs with Regular Expressions (5/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the fifth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Parsing PGN tagpairs: Regular Expressions to the rescue!


Let's apply our attention now to the header lines to extract information embedded in the tagpairs to expose it in a more useful form. We'll be using this regular expression to do that:
^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$
If you're not familiar with regular expressions, or if you're not convinced this is correct, that's fine! The first part of this installment describes in detail how I arrived at this expression. In the second part, we will see how we can use regular expression in DuckDB to extract information from PGN headers.

Crafting a regular expression for PGN tagpairs


From our description of the tagpair-syntax, we know that the tag and value must be inside the square brackets, and that tag and its value are separated by whitespace. We also know that value is enclosed in double quotes. We may stylize this syntax as follows:
line-start [ tag whitespace "value" ] line-end
In the syntax model above, literal characters are in red; items in italics are syntactical concepts that we can't quite describe as a single, fixed sequence of characters. But we can describe them using a pattern that specifies exactly what sequence of characters we would expect there, given the assumption that we're reading valid PGN files.

We will now turn this model into a regular expression by replacing the items from the model into patterns. Please be advised that this section is not intended to be a regular expressions tutorial - there are plenty of online resources that do that. It's just here to help you explain how I arrived at the final regular expression: this will make it easier for you to understand its structure, and put you in a better position to undertand its limitations, and if necessary, to adapt it to your specific needs.

First, lets replace line-start and line-end with ^ and $ anchors:
^ [ tag whitespace "value" ] $
Next, the square brackets that enclose our tagpair are regular expression metacharacters, so to match them literally, we need to escape them using a backslash:
^ \[ tag whitespace "value" \] $
Extracting tags: whitespace and not-whitespace
We can now replace the whitespace-placeholder with a pattern that matches actual whitespace. We do this by specifying one whitespace character, \s, follewed by +, indicating that it must occur at least once and could repeat many times:
^ \[ tag \s+ "value" \] $
We know the tag is followed by whitespace. From that fact, it follows that the tag cannot itself contain any whitespace: if that would be possible, we wouldn't be able to distinguish between the whitespace that demarcates the end of the tag as opposed to whitespace appearing inside the tag.

So, we can craft the pattern matching the tag as any character that is not whitespace ([^\s]), one or more times (denoted by +). The tag is information we want to extract; therefore we enclose it in parentheses so we can capture it:
^ \[ ([^\s]+) \s+ "value" \] $
Refine the syntax model to extract the value
To replace the value placeholder with a pattern it will be helpful to refine our syntax model a little bit. Let's zoom in on just the value from our original syntax model, and introduce a value-character.

The value-character stands for any character that may appear between the double-quotes demarcating the value. There may be any number of value-characters inside value:
" value-character* "
As for finding a pattern for value-character, we know for sure is that such a character cannot be a double quote, because we would not be able to distinguish it from the double-quote that indicates the end of the value.

This is completely analogous to the prior example we saw with tag, which is followed by white-space and therefore cannot itself contain whitespace. So just like we wrote "not-whitespace" as [^\s] for the tag, we can write "not-double-quote" as: [^"].
Modeling Character Escape Sequences
The PGN syntax specifies that a literal double-quote inside the value can be denoted by immediately preceding it using a backslash: \". The backslash preceding the double-quote is a so-called "escape" character, relieving the double-quote from its normal role of demarcating the end of the value.

Whenever a syntax defines an escape character device, it automatically introduces the problem of how to denote a literal occurrence of the escape character. That issue is usually solved simply by escaping the escape character, and that's also how the PGN syntax solves it. So in order to represent a literal backslash \ in the PGN-value, it must be preceded immediately with the escape character, which is also the backslash. Thus, it becomes: \\.

Now that we realize a value-character can either be escaped or not escaped, it is a good idea to replace its syntax model with something more refined:
" ( \ escaped-value-character | unescaped-value-character )* "
Note: the pipe-character | is the standard regular expression syntax to denote alternative options ("either/or" in English).

We can immediately replace unescaped-value-character with the "not-double-quote" pattern we identified earlier:
" ( \ escaped-value-character | [^"] )* "
Now let's replace the PGN escape character \ with its regular expression pattern. To do that we need to realize the backslash is also a metacharacter in the regular expression context, where it's also used as escape character! So, in order to represent the PGN escape character in a regular expression, we need to escape it in the the regular expression context. That sounds harder than it is - it simply becomes \\. Here, the first backslash is the regular expression escape character, and the second backslash a literal backslash, which is the escape character in the PGN context:
" ( \\ escaped-value-character | [^"] )* "
We know that in the PGN-context both the double quote and the backslash are ecaped characters. We also know that in the reqular expression context, the backslash needs to be escaped. So, we can now replace escaped-value-character with ["\\]:
" ( \\ ["\\] | [^"] )* "
As final touch to the value pattern, we enclose the value in parenthesis to establish a capturing group. Just like we did earlier for tag, this will allow us to extract the value:
" ( (\\ ["\\] | [^"] )* ) "
Now that we found a pattern for value, we can complete the entire pattern. The double quotes around value can be matched directly, and need no special treatment to be used in the regular expression:
^ \[ ([^\s]+) \s+ "( ( \\ ["|\\] | [^"] )* )" \] $
Are you still reading? Thanks for hanging in there! Let's move on to pluck the fruits from our labour.

Using regexp_extract() to extract tags and values


In DuckDB, we can extract the matches made by a regular expression using regexp_extract(). This is an overloaded function. For all versions of regexp_extract(), the first two arguments are the same:
  1. The string to which the regular expression is to be applied
  2. The regular expression
In the simplest usage, it takes only these two arguments, and returns the string that was matched. For our use case, this would not be usful, as our regular expression should match any complete header line: it would simply return the header line we feed into the regular expression.
Using the capturing groups
We want to extract the tag and tagvalue from the header lines, and for this purpose we created corresponding capturing groups. Fortunately, we can pass the index of the capturing group as the third argument to regexp_extract(). In this usage, the function should return whatever was matched by the capturing group indicated by the 3rd argument.

The first capturing group corresponds to the tag, so to extract it we could write:
regexp_extract(
  '[Event "Hastings"]'
  --   capture1
  --   ___|__
  --  |      |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 1
)
And this would give us Event as result. Likewise, to extract the value we could write:
regexp_extract(
  '[Event "Hastings"]'
  --                  capture2
  --               _______|_______
  --              |               |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 2
)
And this would yield the result "Hastings"
Regular Expressions in SQL
Now, many database products offer support for regular expressions. Products with basic support offer at least matching against a regular expression, and typically also extraction and replacement of a single matched pattern or capturing group, which is the example I just gave above.

More advanced products also offer a feature to split a string on the regular expression matches to an array or a resultset.
Regular Expressions in DuckDB
DuckDB supports all prior mentioned regular expression usages. In addition, DuckDB allows you to extract the matches of multiple capturing groups, all in one go! To do that, we can pass a list of names to regexp_extract() as 3rd argument:
regexp_extract(
  '[Event "Hastings"]'
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, [ 'tag_name', 'tag_value' ]
)
With this usage, the function returns a STRUCT-value with the names passed in 3rd argument as keys. The value assigned to the keys is the text matched by the capturing group appearing at the same position as the index of the name in the list.

So in our example the 3rd argument is this list ['tag_name', 'tag_value']. The name 'tag_name' appears at the first position in the list, and thus, the value assigned to the 'tag_name'-key in the return value will be assigned the text that matches the first capturing group in the regular expression. Likewise, 'tag_value' appears in the list at the second position and thus the match made by the second capturing group will be assigned to that key in the return value.

For our example, the return value is:
┌───────────────────────────────────────────────┐
│                value                          │
│ struct(tag_name varchar, "tag_value" varchar) │
├───────────────────────────────────────────────┤
│ {'tag_name': Event, 'tag_value': Hastings}    │
└───────────────────────────────────────────────┘
I think this is pretty awesome! We only have to write and apply a particular regular expression once, instead of having to repeat it for each group we want to capture. This avoids code duplication and your SQL can be shorter, easier to read, and therefore, more maintainable.

It would not be unreasonable to assume that a single regular expression invocation returning all the matches all at once might also be faster than calling out multiple times to return only one match at a time. However, for this particular case, I found the performance to be on par.

Adding the Regular Expression to our Query


Let's incorporate the tag extraction logic into our query. Because tag extraction is only applicable for header lines, it feels appropriate to use the is_header flag to check whether the regular expression should be applied. Like we already saw, we can benefit from the DuckDB reusable alias feature to do that:
CASE
  WHEN is_header THEN
    regexp_extract(
      line
    , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
    , [ 'tag_name', 'tag_value' ]
    )
END
With this addition our query becomes:
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
            , [ 'tag_name', 'tag_value' ]
            )
        END                                          AS tag_pair
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
Our results are now:
┌──────────────────────────────────────────────────┬───────────┬─────────┬───────────────────────────────────────────────────────┐
│                       line                       │ is_header │ game_id │                       tag_pair                        │
│                     varchar                      │  boolean  │  int64  │         struct(tag_name varchar, "tag_value" varchar) │
├──────────────────────────────────────────────────┼───────────┼─────────┼───────────────────────────────────────────────────────┤
│ [Event "Hastings"]                               │ true      │       1 │ {'tag': Event, 'value': Hastings}                     │
│ [Site "Hastings"]                                │ true      │       1 │ {'tag': Site, 'value': Hastings}                      │
│ [Date "1895.??.??"]                              │ true      │       1 │ {'tag': Date, 'value': 1895.??.??}                    │
│ [Round "?"]                                      │ true      │       1 │ {'tag': Round, 'value': ?}                            │
│ [White "Tinsley, Samuel"]                        │ true      │       1 │ {'tag': White, 'value': Tinsley, Samuel}              │
│ [Black "Pollock, William Henry Kraus"]           │ true      │       1 │ {'tag': Black, 'value': Pollock, William Henry Kraus} │
│ [Result "1-0"]                                   │ true      │       1 │ {'tag': Result, 'value': 1-0}                         │
│ [WhiteElo ""]                                    │ true      │       1 │ {'tag': WhiteElo, 'value': }                          │
│ [BlackElo ""]                                    │ true      │       1 │ {'tag': BlackElo, 'value': }                          │
│ [ECO "A90"]                                      │ true      │       1 │ {'tag': ECO, 'value': A90}                            │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O .  │ false     │       1 │                                                       │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc.  │ false     │       1 │                                                       │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.  │ false     │       1 │                                                       │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 .  │ false     │       1 │                                                       │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.  │ false     │       1 │                                                       │
│               ·                                  │  ·        │       · │                       ·                               │
│               ·                                  │  ·        │       · │                       ·                               │
│               ·                                  │  ·        │       · │                       ·                               │
│ [Black "Kurbonboeva,Sarvinoz"]                   │ true      │    7243 │ {'tag': Black, 'value': Kurbonboeva,Sarvinoz}         │
│ [Result "1-0"]                                   │ true      │    7243 │ {'tag': Result, 'value': 1-0}                         │
│ [WhiteElo "2423"]                                │ true      │    7243 │ {'tag': WhiteElo, 'value': 2423}                      │
│ [BlackElo "2154"]                                │ true      │    7243 │ {'tag': BlackElo, 'value': 2154}                      │
│ [ECO "A90"]                                      │ true      │    7243 │ {'tag': ECO, 'value': A90}                            │
│ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh.  │ false     │    7243 │                                                       │
│ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe.  │ false     │    7243 │                                                       │
│ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rx.  │ false     │    7243 │                                                       │
│ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe.  │ false     │    7243 │                                                       │
│ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 3.  │ false     │    7243 │                                                       │
│ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3.  │ false     │    7243 │                                                       │
│ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.h.  │ false     │    7243 │                                                       │
│ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2.  │ false     │    7243 │                                                       │
│ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63.  │ false     │    7243 │                                                       │
│ 66.Ke4  1-0                                      │ false     │    7243 │                                                       │
├──────────────────────────────────────────────────┴───────────┴─────────┴───────────────────────────────────────────────────────┤
│ 117000 rows (30 shown)                                                                                               4 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Next Installment: Rolling up PGN lines to one game row


Next time, we will see how to roll up the multiple lines from a game into a single game row.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...