Friday, July 12, 2024

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the sixth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Rolling up each game's lines into a single game row


All essential elements are in place for transforming the PGN lines into a tabular result. The actual transformation entails two things:
  • Grouping all lines with the same game_id into one single row.
  • Create columns for each unique value of tag_name from tag_pair, and place the corresponding tag_value into those columns.
We can do that using DuckDB's PIVOT-statement.

Typical PIVOT-statements


The PIVOT-statement is typically used for OLAP use cases to create analytical crosstabs. In this context, we can think of PIVOT as an extension of a standard SELECT-statement with a GROUP BY-clause: each row is actually an aggregate row that represents the group of rows from the underlying dataset that have a unique combination of values for all expressions appearing in the GROUP BY-clause.

In addition, PIVOT also has an ON-clause, and each unique combination of values coming from the ON-clause expressions generates an aggregate column.

At the intersection of the aggregate rows and aggregate columns of the crosstab are the cell values. These are specified by the USING-clause, which specifies aggregate function expressions that are to be applied on those rows from the underlying resultset that belong to both the aggregate row and the aggregate column.

In typical OLAP use cases, the cell values are typically SUM()s or AVG()s over monetary amounts or quantities, sometimes COUNT()s.

PIVOT as a row-to-column transposer


The description of the PIVOT-statement above provides some hints on how we can use it to transform the game lines to a single game row, and how to turn the tags into separate columns:
  • We want to roll up the game lines of one game into a single game row, so the game_id expression should be placed in the GROUP BY-clause.
  • The tag names should be used to generate columns, so the tag_name-member of the tag_pair STRUCT-value returned by regexp_extract() should be placed in the ON-clause.
  • The tag_values should appear as cell values, so the tag_value-member of tag_pair should be placed in the USING-clause.
This makes it all a bit different from the typical analytical crosstabs:
  • The tag_values that are to appear as cell values are primarily of a text type: player names, event locations, and sometimes date- or time-like values. There are some numerical values too, like Elo scores, but these are non-additive, and quite unlike the amounts and quantities we find in the typical OLAP-case.
  • As the tagspairs are just attributes of the game, we expect set of tag_name-values for in one game to be unique. That means that for each game, we will find at most one tag_value in each generated column.
So for the PGN use case, it is more natural to think of the PIVOT-statement as a device to transpose rows to column, rather than an analytical crosstab.

Even though we expect only a single text value for the tag_value, the PIVOT-statement's USING-clause still requires some kind of aggregate function. To aggregate tag_value we can settle for anything that preserves the text: MIN(), MAX(), or ANY_VALUE(), as well as the text aggregate STRING_AGG() would all do.

When we put it all together, this is what our inital PIVOT-statement looks like:
PIVOT(
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
            , [ 'tag_name', 'tag_value' ]
            )
        END                                          AS tag_pair
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
)
ON tag_pair['tag_name']
USING ANY_VALUE( tag_pair['tag_value'] )
GROUP BY game_id
This is what its result looke like:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬─────────┬─────────────────────┬──────────────────────┬──────────┐
│ game_id │        Black         │ BlackElo │    Date    │ . │  Round  │        Site         │        White         │ WhiteElo │
│  int64  │       varchar        │ varchar  │  varchar   │   │ varchar │       varchar       │       varchar        │ varchar  │
├─────────┼──────────────────────┼──────────┼────────────┼───┼─────────┼─────────────────────┼──────────────────────┼──────────┤
│       1 │ Pollock, William H.  │          │ 1895.??.?? │ . │ ?       │ Hastings            │ Tinsley, Samuel      │          │
│       2 │ Lasker, Edward       │          │ 1913.??.?? │ . │ ?       │ Scheveningen        │ Loman, Rudolf        │          │
│       3 │ Tartakower, Saviely  │          │ 1921.??.?? │ . │ 5       │ The Hague           │ Alekhine, Alexander  │          │
│       4 │ Wegemund, Otto       │          │ 1922.??.?? │ . │ 7       │ Bad Oeynhausen      │ Antze, O.            │          │
│       5 │ Tarrasch, Siegbert   │          │ 1922.??.?? │ . │ 19      │ Bad Pistyan         │ Johner, Paul F       │          │
│       6 │ Alekhine, Alexander  │          │ 1922.??.?? │ . │ ?       │ Hastings            │ Bogoljubow, Efim     │          │
│       7 │ Kmoch, Hans          │          │ 1922.??.?? │ . │ ?       │ Vienna              │ Rubinstein, Akiba    │          │
│       8 │ Mieses, Jacques      │          │ 1923.??.?? │ . │ 9       │ Hastings            │ Norman, George Mar.  │          │
│       9 │ Orlando, Placido     │          │ 1923.??.?? │ . │ ?       │ Trieste             │ Szabados, Eugenio    │          │
│      10 │ Tarrasch, Siegbert   │          │ 1923.??.?? │ . │ 1       │ Trieste             │ Seitz, Jakob Adolf   │          │
│      11 │ Wolf, Siegfried Re.  │          │ 1923.??.?? │ . │ 5       │ Vienna              │ Von Patay, J.        │          │
│      12 │ Tartakower, Saviely  │          │ 1924.??.?? │ . │ ?       │ New York            │ Bogoljubow, Efim     │          │
│      13 │ Pokorny, Amos        │          │ 1926.??.?? │ . │ 3       │ Trencianske Teplice │ Kostic, Boris        │          │
│      14 │ Tartakower, Saviely  │          │ 1927.??.?? │ . │ 2       │ Kecskemet           │ Vukovic, Vladimir    │          │
│      15 │ Botvinnik, Mikhail   │          │ 1927.??.?? │ . │ 2       │ Moscow              │ Rabinovich, Ilya L.  │          │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│    7229 │ Kovacevic,Bl         │ 2400     │ 2023.12.09 │ . │ 7.3     │ Zagreb CRO          │ Kozul,Z              │ 2532     │
│    7230 │ Iskos,A              │ 2153     │ 2023.12.10 │ . │ 6.46    │ Skopje MKD          │ Zhezhovska,Monika    │ 1826     │
│    7231 │ Spichkin,A           │ 2035     │ 2023.12.12 │ . │ 2       │ chess.com INT       │ Rodriguez Santiago,J │ 2043     │
│    7232 │ Rogov,Matfey         │ 2213     │ 2023.12.12 │ . │ 3       │ chess.com INT       │ Clarke,Matthew       │ 2127     │
│    7233 │ Osmonbekov,T         │ 2137     │ 2023.12.12 │ . │ 3       │ chess.com INT       │ Sroczynski,M         │ 2266     │
│    7234 │ Novikova,Galina      │ 2073     │ 2023.12.12 │ . │ 8       │ chess.com INT       │ Marcziter,D          │ 2192     │
│    7235 │ Tomazini,A           │ 2336     │ 2023.12.14 │ . │ 5.24    │ Zagreb CRO          │ Pultinevicius,Paul.  │ 2584     │
│    7236 │ Spichkin,A           │ 2035     │ 2023.12.19 │ . │ 2       │ chess.com INT       │ Levine,D             │ 2040     │
│    7237 │ Kanyamarala,Tarun    │ 2305     │ 2023.12.19 │ . │ 4       │ chess.com INT       │ Nechitaylo,Nikita    │ 2203     │
│    7238 │ Ronka,E              │ 2291     │ 2023.12.19 │ . │ 4       │ chess.com INT       │ Gruzman,Ilya         │ 2151     │
│    7239 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.26 │ . │ 1.37    │ Samarkand UZB       │ Mammadzada,G         │ 2449     │
│    7240 │ Koneru,H             │ 2554     │ 2023.12.26 │ . │ 3.24    │ Samarkand UZB       │ Peycheva,Gergana     │ 2271     │
│    7241 │ Carlsson,Andreas     │ 1902     │ 2023.12.28 │ . │ 4.9     │ Karlstad SWE        │ Kreken,Eivind Grunt  │ 2271     │
│    7242 │ Kazarjan,Gachatur    │ 2078     │ 2023.12.30 │ . │ 9.31    │ Groningen NED       │ Schuricht,Emil Fre.  │ 2095     │
│    7243 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.30 │ . │ 14.47   │ Samarkand UZB       │ Zhu Chen             │ 2423     │
├─────────┴──────────────────────┴──────────┴────────────┴───┴─────────┴─────────────────────┴──────────────────────┴──────────┤
│ 7243 rows (30 shown)                                                                                    11 columns (8 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This certainly is starting to look a lot like the result we were after.

Folding in the movetext


The only thing stil missing is the movetext, but at this point, it's almost trivial to add that as well. We can simply amend the tag_pair-expression and let it return a new STRUCT-value with the literal text 'moves' as name member and the line itself as value in case the is_header expression is not TRUE:
CASE
  WHEN is_header THEN
    regexp_extract(
      line
    , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
    , [ 'column_name', 'column_value' ]
    )
  ELSE {
    'column_name': 'moves'
  , 'column_value': line
  }
END AS column_name_value
For consistency, we also changed the tag_pair alias to column_name_value and its member names from column_name and column_value to column_name and column_value respectively. Therefore we must also update the references elsewhere in the PIVOT statement accordingly.

Also, because one game could have multiple lines of movetext, we must also change the aggregate function in the USING-clause from ANY_VALUE() to STRING_AGG(). After these changes we get the final statement:
PIVOT(
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
            , [ 'column_name', 'column_value' ]
            )
          ELSE {
            'column_name': 'moves'
          , 'column_value': line
          }
        END                                          AS column_name_value
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
)
ON column_name_value['column_name']
USING STRING_AGG( column_name_value['column_value'], ' ' )
GROUP BY game_id
And its result:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬──────────────────────┬──────────┬──────────────────────┐
│ game_id │        Black         │ BlackElo │    Date    │ . │        White         │ WhiteElo │        moves         │
│  int64  │       varchar        │ varchar  │  varchar   │   │       varchar        │ varchar  │       varchar        │
├─────────┼──────────────────────┼──────────┼────────────┼───┼──────────────────────┼──────────┼──────────────────────┤
│       1 │ Pollock, William H.  │          │ 1895.??.?? │ . │ Tinsley, Samuel      │          │ 1.d4 f5 2.c4 e6 3..  │
│       2 │ Lasker, Edward       │          │ 1913.??.?? │ . │ Loman, Rudolf        │          │ 1.d4 e6 2.c4 f5 3..  │
│       3 │ Tartakower, Saviely  │          │ 1921.??.?? │ . │ Alekhine, Alexander  │          │ 1.d4 f5 2.c4 e6 3..  │
│       4 │ Wegemund, Otto       │          │ 1922.??.?? │ . │ Antze, O.            │          │ 1.c4 f5 2.d4 Nf6 3.  │
│       5 │ Tarrasch, Siegbert   │          │ 1922.??.?? │ . │ Johner, Paul F       │          │ 1.d4 e6 2.c4 f5 3..  │
│       6 │ Alekhine, Alexander  │          │ 1922.??.?? │ . │ Bogoljubow, Efim     │          │ 1.d4 f5 2.c4 Nf6 3.  │
│       7 │ Kmoch, Hans          │          │ 1922.??.?? │ . │ Rubinstein, Akiba    │          │ 1.d4 e6 2.c4 f5 3..  │
│       8 │ Mieses, Jacques      │          │ 1923.??.?? │ . │ Norman, George Mar.  │          │ 1.d4 f5 2.g3 Nf6 3.  │
│       9 │ Orlando, Placido     │          │ 1923.??.?? │ . │ Szabados, Eugenio    │          │ 1.d4 e6 2.c4 f5 3..  │
│      10 │ Tarrasch, Siegbert   │          │ 1923.??.?? │ . │ Seitz, Jakob Adolf   │          │ 1.c4 e6 2.d4 f5 3..  │
│      11 │ Wolf, Siegfried Re.  │          │ 1923.??.?? │ . │ Von Patay, J.        │          │ 1.d4 e6 2.c4 f5 3..  │
│      12 │ Tartakower, Saviely  │          │ 1924.??.?? │ . │ Bogoljubow, Efim     │          │ 1.d4 f5 2.g3 e6 3..  │
│      13 │ Pokorny, Amos        │          │ 1926.??.?? │ . │ Kostic, Boris        │          │ 1.c4 f5 2.d4 Nf6 3.  │
│      14 │ Tartakower, Saviely  │          │ 1927.??.?? │ . │ Vukovic, Vladimir    │          │ 1.d4 f5 2.c4 e6 3..  │
│      15 │ Botvinnik, Mikhail   │          │ 1927.??.?? │ . │ Rabinovich, Ilya L.  │          │ 1.d4 e6 2.c4 f5 3..  │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│    7229 │ Kovacevic,Bl         │ 2400     │ 2023.12.09 │ . │ Kozul,Z              │ 2532     │ 1.d4 e6 2.c4 f5 3..  │
│    7230 │ Iskos,A              │ 2153     │ 2023.12.10 │ . │ Zhezhovska,Monika    │ 1826     │ 1.d4 e6 2.c4 f5 3..  │
│    7231 │ Spichkin,A           │ 2035     │ 2023.12.12 │ . │ Rodriguez Santiago,J │ 2043     │ 1.d4 e6 2.c4 f5 3..  │
│    7232 │ Rogov,Matfey         │ 2213     │ 2023.12.12 │ . │ Clarke,Matthew       │ 2127     │ 1.d4 e6 2.c4 f5 3..  │
│    7233 │ Osmonbekov,T         │ 2137     │ 2023.12.12 │ . │ Sroczynski,M         │ 2266     │ 1.d4 e6 2.c4 f5 3..  │
│    7234 │ Novikova,Galina      │ 2073     │ 2023.12.12 │ . │ Marcziter,D          │ 2192     │ 1.d4 e6 2.c4 f5 3..  │
│    7235 │ Tomazini,A           │ 2336     │ 2023.12.14 │ . │ Pultinevicius,Paul.  │ 2584     │ 1.d4 e6 2.c4 f5 3..  │
│    7236 │ Spichkin,A           │ 2035     │ 2023.12.19 │ . │ Levine,D             │ 2040     │ 1.d4 e6 2.c4 f5 3..  │
│    7237 │ Kanyamarala,Tarun    │ 2305     │ 2023.12.19 │ . │ Nechitaylo,Nikita    │ 2203     │ 1.d4 e6 2.c4 f5 3..  │
│    7238 │ Ronka,E              │ 2291     │ 2023.12.19 │ . │ Gruzman,Ilya         │ 2151     │ 1.d4 f5 2.g3 e6 3..  │
│    7239 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.26 │ . │ Mammadzada,G         │ 2449     │ 1.d4 f5 2.c4 Nf6 3.  │
│    7240 │ Koneru,H             │ 2554     │ 2023.12.26 │ . │ Peycheva,Gergana     │ 2271     │ 1.d4 e6 2.c4 f5 3..  │
│    7241 │ Carlsson,Andreas     │ 1902     │ 2023.12.28 │ . │ Kreken,Eivind Grunt  │ 2271     │ 1.d4 e6 2.c4 f5 3..  │
│    7242 │ Kazarjan,Gachatur    │ 2078     │ 2023.12.30 │ . │ Schuricht,Emil Fre.  │ 2095     │ 1.d4 f5 2.g3 e6 3..  │
│    7243 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.30 │ . │ Zhu Chen             │ 2423     │ 1.d4 f5 2.g3 Nf6 3.  │
├─────────┴──────────────────────┴──────────┴────────────┴───┴──────────────────────┴──────────┴──────────────────────┤
│ 7243 rows (30 shown)                                                                           12 columns (7 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Next steps


From this point on there's many things that could be done to improve the solution, for example:
  • Column values that originate from headers tags should un-escape escaped characters.
  • The PGN-syntax itself does not dictate this but there are established conventions for what tag names to use and what kind of values are appropriate. For example, see the seven tag roster and optional tag pairs in the wikipedia entry for Portable Game Notation. It would make sense to further cleanse and conform the corresponding columns and give them a more suitable data type, or to actively validate their value.
  • Further data normalization could be attempted by creating separate tables for player, event, opening etc.
  • The moves could be further processed and analyzed to derive a table of board positions, something which would greatly increase the opportunities to analyze games.
We could also improve the statement and make it more robust:
  • Better detection of the first game line. Our assumption has been that the first movetext always starts with '1.'. But what if a game does not have any moves at all? This may sound like that shouldn't be possible, but especially on an online chess site, a player's connection might break after a game was started, but before a move was made.

    Whatever the reason may be, and whether we're interested in such games or not, our current solution is not capable to detect these cases. Instead, games simply aren't identified as intended, and our final game would likely be a mixture of 2 or maybe even more games. Bad bad bad!

    (If you're interested in looking into such a scenario, the Lichess chess database for October 2013 contains 411,039 games, but 140 do not have any moves.)
  • A more robust regular expression to deal with games that may not adhere fully to the PGN syntax (for example, more foregiving handling of whitespace)
  • Better handling of errors when reading CSV.
For now we leave these considerations as an excercise for the reader.

The goal of these posts was to show how DuckDB's features and SQL-dialect allow these kind of raw-text processing tasks to be solved quickly and elegantly. I hope I have succeeded in demonstrating that - it sure was a lot of fun to try!

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Extracting Tagpairs with Regular Expressions (5/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the fifth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Parsing PGN tagpairs: Regular Expressions to the rescue!


Let's apply our attention now to the header lines to extract information embedded in the tagpairs to expose it in a more useful form. We'll be using this regular expression to do that:
^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$
If you're not familiar with regular expressions, or if you're not convinced this is correct, that's fine! The first part of this installment describes in detail how I arrived at this expression. In the second part, we will see how we can use regular expression in DuckDB to extract information from PGN headers.

Crafting a regular expression for PGN tagpairs


From our description of the tagpair-syntax, we know that the tag and value must be inside the square brackets, and that tag and its value are separated by whitespace. We also know that value is enclosed in double quotes. We may stylize this syntax as follows:
line-start [ tag whitespace "value" ] line-end
In the syntax model above, literal characters are in red; items in italics are syntactical concepts that we can't quite describe as a single, fixed sequence of characters. But we can describe them using a pattern that specifies exactly what sequence of characters we would expect there, given the assumption that we're reading valid PGN files.

We will now turn this model into a regular expression by replacing the items from the model into patterns. Please be advised that this section is not intended to be a regular expressions tutorial - there are plenty of online resources that do that. It's just here to help you explain how I arrived at the final regular expression: this will make it easier for you to understand its structure, and put you in a better position to undertand its limitations, and if necessary, to adapt it to your specific needs.

First, lets replace line-start and line-end with ^ and $ anchors:
^ [ tag whitespace "value" ] $
Next, the square brackets that enclose our tagpair are regular expression metacharacters, so to match them literally, we need to escape them using a backslash:
^ \[ tag whitespace "value" \] $
Extracting tags: whitespace and not-whitespace
We can now replace the whitespace-placeholder with a pattern that matches actual whitespace. We do this by specifying one whitespace character, \s, follewed by +, indicating that it must occur at least once and could repeat many times:
^ \[ tag \s+ "value" \] $
We know the tag is followed by whitespace. From that fact, it follows that the tag cannot itself contain any whitespace: if that would be possible, we wouldn't be able to distinguish between the whitespace that demarcates the end of the tag as opposed to whitespace appearing inside the tag.

So, we can craft the pattern matching the tag as any character that is not whitespace ([^\s]), one or more times (denoted by +). The tag is information we want to extract; therefore we enclose it in parentheses so we can capture it:
^ \[ ([^\s]+) \s+ "value" \] $
Refine the syntax model to extract the value
To replace the value placeholder with a pattern it will be helpful to refine our syntax model a little bit. Let's zoom in on just the value from our original syntax model, and introduce a value-character.

The value-character stands for any character that may appear between the double-quotes demarcating the value. There may be any number of value-characters inside value:
" value-character* "
As for finding a pattern for value-character, we know for sure is that such a character cannot be a double quote, because we would not be able to distinguish it from the double-quote that indicates the end of the value.

This is completely analogous to the prior example we saw with tag, which is followed by white-space and therefore cannot itself contain whitespace. So just like we wrote "not-whitespace" as [^\s] for the tag, we can write "not-double-quote" as: [^"].
Modeling Character Escape Sequences
The PGN syntax specifies that a literal double-quote inside the value can be denoted by immediately preceding it using a backslash: \". The backslash preceding the double-quote is a so-called "escape" character, relieving the double-quote from its normal role of demarcating the end of the value.

Whenever a syntax defines an escape character device, it automatically introduces the problem of how to denote a literal occurrence of the escape character. That issue is usually solved simply by escaping the escape character, and that's also how the PGN syntax solves it. So in order to represent a literal backslash \ in the PGN-value, it must be preceded immediately with the escape character, which is also the backslash. Thus, it becomes: \\.

Now that we realize a value-character can either be escaped or not escaped, it is a good idea to replace its syntax model with something more refined:
" ( \ escaped-value-character | unescaped-value-character )* "
Note: the pipe-character | is the standard regular expression syntax to denote alternative options ("either/or" in English).

We can immediately replace unescaped-value-character with the "not-double-quote" pattern we identified earlier:
" ( \ escaped-value-character | [^"] )* "
Now let's replace the PGN escape character \ with its regular expression pattern. To do that we need to realize the backslash is also a metacharacter in the regular expression context, where it's also used as escape character! So, in order to represent the PGN escape character in a regular expression, we need to escape it in the the regular expression context. That sounds harder than it is - it simply becomes \\. Here, the first backslash is the regular expression escape character, and the second backslash a literal backslash, which is the escape character in the PGN context:
" ( \\ escaped-value-character | [^"] )* "
We know that in the PGN-context both the double quote and the backslash are ecaped characters. We also know that in the reqular expression context, the backslash needs to be escaped. So, we can now replace escaped-value-character with ["\\]:
" ( \\ ["\\] | [^"] )* "
As final touch to the value pattern, we enclose the value in parenthesis to establish a capturing group. Just like we did earlier for tag, this will allow us to extract the value:
" ( (\\ ["\\] | [^"] )* ) "
Now that we found a pattern for value, we can complete the entire pattern. The double quotes around value can be matched directly, and need no special treatment to be used in the regular expression:
^ \[ ([^\s]+) \s+ "( ( \\ ["|\\] | [^"] )* )" \] $
Are you still reading? Thanks for hanging in there! Let's move on to pluck the fruits from our labour.

Using regexp_extract() to extract tags and values


In DuckDB, we can extract the matches made by a regular expression using regexp_extract(). This is an overloaded function. For all versions of regexp_extract(), the first two arguments are the same:
  1. The string to which the regular expression is to be applied
  2. The regular expression
In the simplest usage, it takes only these two arguments, and returns the string that was matched. For our use case, this would not be usful, as our regular expression should match any complete header line: it would simply return the header line we feed into the regular expression.
Using the capturing groups
We want to extract the tag and tagvalue from the header lines, and for this purpose we created corresponding capturing groups. Fortunately, we can pass the index of the capturing group as the third argument to regexp_extract(). In this usage, the function should return whatever was matched by the capturing group indicated by the 3rd argument.

The first capturing group corresponds to the tag, so to extract it we could write:
regexp_extract(
  '[Event "Hastings"]'
  --   capture1
  --   ___|__
  --  |      |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 1
)
And this would give us Event as result. Likewise, to extract the value we could write:
regexp_extract(
  '[Event "Hastings"]'
  --                  capture2
  --               _______|_______
  --              |               |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 2
)
And this would yield the result "Hastings"
Regular Expressions in SQL
Now, many database products offer support for regular expressions. Products with basic support offer at least matching against a regular expression, and typically also extraction and replacement of a single matched pattern or capturing group, which is the example I just gave above.

More advanced products also offer a feature to split a string on the regular expression matches to an array or a resultset.
Regular Expressions in DuckDB
DuckDB supports all prior mentioned regular expression usages. In addition, DuckDB allows you to extract the matches of multiple capturing groups, all in one go! To do that, we can pass a list of names to regexp_extract() as 3rd argument:
regexp_extract(
  '[Event "Hastings"]'
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, [ 'tag_name', 'tag_value' ]
)
With this usage, the function returns a STRUCT-value with the names passed in 3rd argument as keys. The value assigned to the keys is the text matched by the capturing group appearing at the same position as the index of the name in the list.

So in our example the 3rd argument is this list ['tag_name', 'tag_value']. The name 'tag_name' appears at the first position in the list, and thus, the value assigned to the 'tag_name'-key in the return value will be assigned the text that matches the first capturing group in the regular expression. Likewise, 'tag_value' appears in the list at the second position and thus the match made by the second capturing group will be assigned to that key in the return value.

For our example, the return value is:
┌───────────────────────────────────────────────┐
│                value                          │
│ struct(tag_name varchar, "tag_value" varchar) │
├───────────────────────────────────────────────┤
│ {'tag_name': Event, 'tag_value': Hastings}    │
└───────────────────────────────────────────────┘
I think this is pretty awesome! We only have to write and apply a particular regular expression once, instead of having to repeat it for each group we want to capture. This avoids code duplication and your SQL can be shorter, easier to read, and therefore, more maintainable.

It would not be unreasonable to assume that a single regular expression invocation returning all the matches all at once might also be faster than calling out multiple times to return only one match at a time. However, for this particular case, I found the performance to be on par.

Adding the Regular Expression to our Query


Let's incorporate the tag extraction logic into our query. Because tag extraction is only applicable for header lines, it feels appropriate to use the is_header flag to check whether the regular expression should be applied. Like we already saw, we can benefit from the DuckDB reusable alias feature to do that:
CASE
  WHEN is_header THEN
    regexp_extract(
      line
    , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
    , [ 'tag_name', 'tag_value' ]
    )
END
With this addition our query becomes:
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
            , [ 'tag_name', 'tag_value' ]
            )
        END                                          AS tag_pair
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
Our results are now:
┌──────────────────────────────────────────────────┬───────────┬─────────┬───────────────────────────────────────────────────────┐
│                       line                       │ is_header │ game_id │                       tag_pair                        │
│                     varchar                      │  boolean  │  int64  │         struct(tag_name varchar, "tag_value" varchar) │
├──────────────────────────────────────────────────┼───────────┼─────────┼───────────────────────────────────────────────────────┤
│ [Event "Hastings"]                               │ true      │       1 │ {'tag': Event, 'value': Hastings}                     │
│ [Site "Hastings"]                                │ true      │       1 │ {'tag': Site, 'value': Hastings}                      │
│ [Date "1895.??.??"]                              │ true      │       1 │ {'tag': Date, 'value': 1895.??.??}                    │
│ [Round "?"]                                      │ true      │       1 │ {'tag': Round, 'value': ?}                            │
│ [White "Tinsley, Samuel"]                        │ true      │       1 │ {'tag': White, 'value': Tinsley, Samuel}              │
│ [Black "Pollock, William Henry Kraus"]           │ true      │       1 │ {'tag': Black, 'value': Pollock, William Henry Kraus} │
│ [Result "1-0"]                                   │ true      │       1 │ {'tag': Result, 'value': 1-0}                         │
│ [WhiteElo ""]                                    │ true      │       1 │ {'tag': WhiteElo, 'value': }                          │
│ [BlackElo ""]                                    │ true      │       1 │ {'tag': BlackElo, 'value': }                          │
│ [ECO "A90"]                                      │ true      │       1 │ {'tag': ECO, 'value': A90}                            │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O .  │ false     │       1 │                                                       │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc.  │ false     │       1 │                                                       │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.  │ false     │       1 │                                                       │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 .  │ false     │       1 │                                                       │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.  │ false     │       1 │                                                       │
│               ·                                  │  ·        │       · │                       ·                               │
│               ·                                  │  ·        │       · │                       ·                               │
│               ·                                  │  ·        │       · │                       ·                               │
│ [Black "Kurbonboeva,Sarvinoz"]                   │ true      │    7243 │ {'tag': Black, 'value': Kurbonboeva,Sarvinoz}         │
│ [Result "1-0"]                                   │ true      │    7243 │ {'tag': Result, 'value': 1-0}                         │
│ [WhiteElo "2423"]                                │ true      │    7243 │ {'tag': WhiteElo, 'value': 2423}                      │
│ [BlackElo "2154"]                                │ true      │    7243 │ {'tag': BlackElo, 'value': 2154}                      │
│ [ECO "A90"]                                      │ true      │    7243 │ {'tag': ECO, 'value': A90}                            │
│ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh.  │ false     │    7243 │                                                       │
│ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe.  │ false     │    7243 │                                                       │
│ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rx.  │ false     │    7243 │                                                       │
│ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe.  │ false     │    7243 │                                                       │
│ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 3.  │ false     │    7243 │                                                       │
│ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3.  │ false     │    7243 │                                                       │
│ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.h.  │ false     │    7243 │                                                       │
│ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2.  │ false     │    7243 │                                                       │
│ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63.  │ false     │    7243 │                                                       │
│ 66.Ke4  1-0                                      │ false     │    7243 │                                                       │
├──────────────────────────────────────────────────┴───────────┴─────────┴───────────────────────────────────────────────────────┤
│ 117000 rows (30 shown)                                                                                               4 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Next Installment: Parsing PGn Tagpairs


Next time, we will see how to roll up the multiple lines from a game into a single game row.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Keeping the lines of a game together (4/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the fourth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Keeping game lines together: window functions


The transformation to the tabular format that we want to achieve implies a big structural change compared to the raw lines from the PGN file, as we need to transform multiple subsequent lines into a single row that represents a single game. And, we want to do that for all games present in a PGN file. So we want to treat the lines that belong to one game together as a unit, in other words: we need to group them.

You might know that in SQL, the GROUP BY-clause would achieve that. But, if we want to use that, we'll need to have an expression (or combination of expressions) that identifies which lines should be grouped together. We currently don't have any such thing, so we're going to have to create it.

The fact that we're trying to work "across" rows should trigger an instinct to consider window functions. If you're unfamiliar with window functions, some time ago I wrote about year-to-date calculation using window functions - perhaps that may be of interest to you. To recap, a window function calculates a metric by performing some kind of aggregate function over a subset of the rows taken from the current resultset. That subset of rows is called the window.

Window-functions as 'running totals'


A simple example of a metric that would be calculated by a window function is the 'running total' (cumulative sum): the running total for the current row is the value of the current rows' amount, added to the sum of amounts calculated over the rows appearing before the current row. So for the running total, the window is defined as the current row and all its previous rows, and the aggregation that is performed over the window is SUM().

Row counters as running totals


Let's simplify this. Forget about the previous running total example on amount and use COUNT(*) instead of SUM(). What would the metric be?

Well, COUNT(*) counts the number of rows. For the first row, it would count only the current row because there are no preceding rows, and the result would be 1. For the second row, it would count the current row and the previous row, and the result would be 2. Of course, for the 3rd row the result would be 3, for the 4th row it would be 4, and so on, calculating an integer value for each row that correspons to the row's position within the resultset as a whole.

The metric from the simplified running total example would be aptly called 'row count' or 'row number'. To calculate it, we would write:
COUNT(*) OVER (
  ROWS BETWEEN UNBOUNDED PRECEDING
       AND     CURRENT ROW
)
Note how the OVER()-clause defines the window: the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW-bit matches our explanation of the running total in a remarkably straight-forward way.

Counting only specifc rows


To be sure, if we'd only wanted to calculate the row number, we would have used the built-in ROW_NUMBER() window function, and this would allow you to write the previous expression simply as ROW_NUMBER() OVER().

But we're not interested in counting lines: we want to count the games, while each game spans multiple lines. To count games, and not just any line, we have to find something that occurs in only one line for each game, and apply the COUNT() window function to that instead.

Using the first movetext line to identify games


By analyzing the PGN-syntax, you might find that the movetext is the key that solves this problem:
  • Each game has at least one line of movetext, and that first movetext line starts with the first move.
  • We also know that moves are numbered, starting with move 1, and that the move numbers are followed by a full-stop.

So, instead of COUNT(*), we should count the cases where the line starts with '1.'! Our expression thus becomes:
COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
  ROWS BETWEEN UNBOUNDED PRECEDING
       AND     CURRENT ROW
) AS first_movetext_linecount
If the line is the first movetext line, LIKE '1.%' will be TRUE, and the CASE-expression evaluates to 1, which will be counted. Any non-NULL-value would have done the job, but 1 is a convenient choice: it reminds us that we're counting the first movetext line and at the same time, symbolizes that we're counting that as one game. In all other cases, the CASE-expression evaluates to NULL, and those lines will not be counted.

If we'd add this to our query, this would be the result:
┌────────────────────────────────────────────────────────────────────────────────────┬───────────┬──────────────────────────┐
│                                        line                                        │ is_header │ first_movetext_linecount │
│                                      varchar                                       │  boolean  │          int64           │
├────────────────────────────────────────────────────────────────────────────────────┼───────────┼──────────────────────────┤
│ [Event "Hastings"]                                                                 │ true      │                        0 │
│ [Site "Hastings"]                                                                  │ true      │                        0 │
│ [Date "1895.??.??"]                                                                │ true      │                        0 │
│ [Round "?"]                                                                        │ true      │                        0 │
│ [White "Tinsley, Samuel"]                                                          │ true      │                        0 │
│ [Black "Pollock, William Henry Kraus"]                                             │ true      │                        0 │
│ [Result "1-0"]                                                                     │ true      │                        0 │
│ [WhiteElo ""]                                                                      │ true      │                        0 │
│ [BlackElo ""]                                                                      │ true      │                        0 │
│ [ECO "A90"]                                                                        │ true      │                        0 │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4false1 │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5 │ false     │                        1 │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6     │ false     │                        1 │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7     │ false     │                        1 │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8        │ false     │                        1 │
│ 37.Qa5 Kh7 38.Qd8 Nxc3 39.Qf8 Qc8 40.Qf7 b4 41.g4 fxg4 42.Qg6+ Kh8 43.Bxg4 Qd8     │ false     │                        1 │
│ 44.f4 Qf6 45.Qe8+ Kh7 46.Be6 Qxe6 47.Qxe6 b3 48.Qb6 Ne4 49.Qg6+ Kg8 50.Rxd5 Rc2+   │ false     │                        1 │
│ 51.Kf3 Rd2 52.Qxe4  1-0                                                            │ false     │                        1 │
│ [Event "Scheveningen"]                                                             │ true      │                        1 │
│ [Site "Scheveningen"]                                                              │ true      │                        1 │
│ [Date "1913.??.??"]                                                                │ true      │                        1 │
│ [Round "?"]                                                                        │ true      │                        1 │
│ [White "Loman, Rudolf"]                                                            │ true      │                        1 │
│ [Black "Lasker, Edward"]                                                           │ true      │                        1 │
│ [Result "1/2-1/2"]                                                                 │ true      │                        1 │
│ [WhiteElo ""]                                                                      │ true      │                        1 │
│ [BlackElo ""]                                                                      │ true      │                        1 │
│ [ECO "A90"]                                                                        │ true      │                        1 │
│ 1.d4 e6 2.c4 f5 3.g3 Nf6 4.Bg2 d5 5.cxd5 exd5 6.Bg5 c6 7.Nc3 Bd6 8.Qc2 O-Ofalse2 │
│ 9.O-O-O Qa5 10.Bxf6 gxf6 11.Bh3 Bb4 12.Kb1 Bxc3 13.Qxc3 Qxc3 14.bxc3 Na6           │ false     │                        2 │
├────────────────────────────────────────────────────────────────────────────────────┴───────────┴──────────────────────────┤
│ 30 rows                                                                                                         3 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Compensating Headers to get the game id


There's one problem left, and that's that the game's header lines occur before that game's first movetext line. So, the game's header lines get a number corresponding to the previous game's movetext line. But, this is easily solved: because it applies to all header lines of each game, we can simply compensate by adding 1 in case the line is a header line.

In the previous section, we already solved the problem of identifying header lines. Thanks to a DuckDB feature called reusable column alias (a concept similar to lateral column alias in DataBricks SQL), fixing first_movetext_counter so it becomes a game_id is trivial! We can simply check whether is_header is TRUE, and if so, add 1:
COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
  ROWS BETWEEN UNBOUNDED PRECEDING
       AND     CURRENT ROW
) + 
CASE
  WHEN is_header THEN 1
  ELSE 0
END AS game_id
With this compensation in place, the query now looks like this:
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
The result now becomes:
┌────────────────────────────────────────────────────────────────────────────────────┬───────────┬─────────┐
│                                        line                                        │ is_header │ game_id │
│                                      varchar                                       │  boolean  │  int64  │
├────────────────────────────────────────────────────────────────────────────────────┼───────────┼─────────┤
│ [Event "Hastings"]                                                                 │ true      │       1 │
│ [Site "Hastings"]                                                                  │ true      │       1 │
│ [Date "1895.??.??"]                                                                │ true      │       1 │
│ [Round "?"]                                                                        │ true      │       1 │
│ [White "Tinsley, Samuel"]                                                          │ true      │       1 │
│ [Black "Pollock, William Henry Kraus"]                                             │ true      │       1 │
│ [Result "1-0"]                                                                     │ true      │       1 │
│ [WhiteElo ""]                                                                      │ true      │       1 │
│ [BlackElo ""]                                                                      │ true      │       1 │
│ [ECO "A90"]                                                                        │ true      │       1 │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4          │ false     │       1 │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5 │ false     │       1 │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6     │ false     │       1 │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7     │ false     │       1 │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8        │ false     │       1 │
│ 37.Qa5 Kh7 38.Qd8 Nxc3 39.Qf8 Qc8 40.Qf7 b4 41.g4 fxg4 42.Qg6+ Kh8 43.Bxg4 Qd8     │ false     │       1 │
│ 44.f4 Qf6 45.Qe8+ Kh7 46.Be6 Qxe6 47.Qxe6 b3 48.Qb6 Ne4 49.Qg6+ Kg8 50.Rxd5 Rc2+   │ false     │       1 │
│ 51.Kf3 Rd2 52.Qxe4  1-0                                                            │ false     │       1 │
│ [Event "Scheveningen"]                                                             │ true      │       2 │
│ [Site "Scheveningen"]                                                              │ true      │       2 │
│ [Date "1913.??.??"]                                                                │ true      │       2 │
│ [Round "?"]                                                                        │ true      │       2 │
│ [White "Loman, Rudolf"]                                                            │ true      │       2 │
│ [Black "Lasker, Edward"]                                                           │ true      │       2 │
│ [Result "1/2-1/2"]                                                                 │ true      │       2 │
│ [WhiteElo ""]                                                                      │ true      │       2 │
│ [BlackElo ""]                                                                      │ true      │       2 │
│ [ECO "A90"]                                                                        │ true      │       2 │
│ 1.d4 e6 2.c4 f5 3.g3 Nf6 4.Bg2 d5 5.cxd5 exd5 6.Bg5 c6 7.Nc3 Bd6 8.Qc2 O-O         │ false     │       2 │
│ 9.O-O-O Qa5 10.Bxf6 gxf6 11.Bh3 Bb4 12.Kb1 Bxc3 13.Qxc3 Qxc3 14.bxc3 Na6           │ false     │       2 │
├────────────────────────────────────────────────────────────────────────────────────┴───────────┴─────────┤
│ 30 rows                                                                                        3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Perfect! We now have a game_id expression that we can use to identify all the lines belonging to one game. Remember, we need that to be able to roll up all the game's lines into a single row on our desired tabular result. We will return to that topic in the last installment.

Next Installment: Parsing PGN Tagpairs


Next time, we'll work with regular expressions to extract information from PGN tagpairs.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Distinguishing the Line Type (3/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the third installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Distinguishing PGN Headers, Movetext and separators


In the previous installment, we learned how to access the raw lines of text from a PGN file. Before we can move on to greater things, we need to do a little bit of data enrichment. It's not difficult, and it doesn't show off any particular features specific to DuckDB, but it has to be done and this installment covers how. Think of it as a warm-up!

When you inspect the query result presented in the previous installment, you notice quickly there are three types of lines:
  • header lines (marked up in green)
  • movetext lines (marked up in brown)
  • NULL-values (marked up in red), which correspond to empy lines in the file.
We can safely ignore the NULL-values, as they don't carry any information. We do that by simply adding a WHERE-clause to our query:
WHERE line IS NOT NULL
Now let's add an expression that tells us whether a line is a header line (or not). We know header lines have a tagpair enclosed in square brackets. So in principle, checking whether the line starts with an opening square bracket is enough to know we're dealing with a header line. We can do that by using a LIKE-comparison:
line LIKE '[%'
Adding this to our query, we now get:
SELECT line
,      line LIKE '[%' AS is_header
FROM   read_csv(
         'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
       , columns = {'line': 'VARCHAR'}
       )
WHERE  line IS NOT NULL
Our result now looks like this:
┌──────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│                                         line                                         │ is_header │
│                                       varchar                                        │  boolean  │
├──────────────────────────────────────────────────────────────────────────────────────┼───────────┤
│ [Event "Hastings"]                                                                   │ true      │
│ [Site "Hastings"]                                                                    │ true      │
│ [Date "1895.??.??"]                                                                  │ true      │
│ [Round "?"]                                                                          │ true      │
│ [White "Tinsley, Samuel"]                                                            │ true      │
│ [Black "Pollock, William Henry Kraus"]                                               │ true      │
│ [Result "1-0"]                                                                       │ true      │
│ [WhiteElo ""]                                                                        │ true      │
│ [BlackElo ""]                                                                        │ true      │
│ [ECO "A90"]                                                                          │ true      │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4            │ false     │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5   │ false     │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6       │ false     │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7       │ false     │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8          │ false     │
│               ·                                                                      │  ·        │
│               ·                                                                      │  ·        │
│               ·                                                                      │  ·        │
│ [Black "Kurbonboeva,Sarvinoz"]                                                       │ true      │
│ [Result "1-0"]                                                                       │ true      │
│ [WhiteElo "2423"]                                                                    │ true      │
│ [BlackElo "2154"]                                                                    │ true      │
│ [ECO "A90"]                                                                          │ true      │
│ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh3 Bd6 7.O-O Qe7 8.Bf4 e5 9.dxe5 Bxe5   │ false     │
│ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe7 14.Nf4 Rd8 15.Nd4 Nc6 16.Rc1 Nxd4    │ false     │
│ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rxd4 21.Re7 Kf8 22.Rc7 Be6 23.Rxb7 Nc5   │ false     │
│ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe2 28.Re1 Bb5 29.a4 Bd7 30.Bf3 Rd3      │ false     │
│ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 35.Bxb3+ Rxb3 36.Rd7+ Kf6 37.Rxa7 Rb1+  │ false     │
│ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3 Ra3+ 43.Ke2 Ra2+ 44.Ke3 Ra3+          │ false     │
│ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.hxg6 Kxg6 50.Kc5 h5 51.Kb6 h4           │ false     │
│ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2+ 56.Kb8 Rb2+ 57.Rb7 Ra2 58.a8=Q Rxa8+ │ false     │
│ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63.Kd5 Kg3 64.Ke4 f2 65.Ke3 Kg2          │ false     │
│ 66.Ke4  1-0                                                                          │ false     │
├──────────────────────────────────────────────────────────────────────────────────────┴───────────┤
│ 117000 rows (30 shown)                                                                 2 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
We can be humble, and still say: that's not a bad start!

Next Installment: Keeping the lines of a game together


Next time, we'll add a game_id to keep all lines that belong to one game together.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Ingesting raw text with the CSV Reader (2/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the second installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Ingesting lines of raw text from files using DuckDB's read_csv()


For this installment I'm going to use DuckDB's capabilities to read and query CSV-files. There is already a lot of content available that explains the ins and outs of using the DuckDB CSV-reader for typical, tabular datatasets, so I'm not going to cover that. Instead, for this particular example, I'm going to (ab)use DuckDB's CSV-reader just to read raw lines of text. So rather than letting the CSV reader discover columns, I want the file to be returned as a list of lines. This query does the trick:
SELECT line
FROM   read_csv(
         'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
       , columns = {'line': 'VARCHAR'}
       )
We use the columns argument to specify that our column should be called line. We also could've omitted it, but then the column would've been assigned the name column0, which I think is less clear. Running the query gives us a result like this:
┌──────────────────────────────────────────────────────────────────────────────────────┐
│                                         line                                         │
│                                       varchar                                        │
├──────────────────────────────────────────────────────────────────────────────────────┤
│ [Event "Hastings"]                                                                   │
│ [Site "Hastings"]                                                                    │
│ [Date "1895.??.??"]                                                                  │
│ [Round "?"]                                                                          │
│ [White "Tinsley, Samuel"]                                                            │
│ [Black "Pollock, William Henry Kraus"]                                               │
│ [Result "1-0"]                                                                       │
│ [WhiteElo ""]                                                                        │
│ [BlackElo ""]                                                                        │
│ [ECO "A90"]                                                                          │
│ <NULL>                                                                               │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4            │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5   │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6       │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7       │
│         ·                                                                            │
│         ·                                                                            │
│         ·                                                                            │
│ [WhiteElo "2423"]                                                                    │
│ [BlackElo "2154"]                                                                    │
│ [ECO "A90"]                                                                          │
│ <NULL>                                                                               │
│ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh3 Bd6 7.O-O Qe7 8.Bf4 e5 9.dxe5 Bxe5   │
│ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe7 14.Nf4 Rd8 15.Nd4 Nc6 16.Rc1 Nxd4    │
│ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rxd4 21.Re7 Kf8 22.Rc7 Be6 23.Rxb7 Nc5   │
│ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe2 28.Re1 Bb5 29.a4 Bd7 30.Bf3 Rd3      │
│ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 35.Bxb3+ Rxb3 36.Rd7+ Kf6 37.Rxa7 Rb1+  │
│ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3 Ra3+ 43.Ke2 Ra2+ 44.Ke3 Ra3+          │
│ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.hxg6 Kxg6 50.Kc5 h5 51.Kb6 h4           │
│ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2+ 56.Kb8 Rb2+ 57.Rb7 Ra2 58.a8=Q Rxa8+ │
│ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63.Kd5 Kg3 64.Ke4 f2 65.Ke3 Kg2          │
│ 66.Ke4  1-0                                                                          │
│ <NULL>                                                                               │
├──────────────────────────────────────────────────────────────────────────────────────┤
│                                131486 rows (30 shown)                                │
└──────────────────────────────────────────────────────────────────────────────────────┘

Text lines versus CSV reader rows: different, but related


At this point it's good to note that there's a direct correspondence between the lines in the file, and the rows returned by csv_read(): Not only does the content match, the order of the lines as they appear in the file is also preserved.

The latter sounds obvious but really isn't. The CSV-reader has a parallel option that controls whether multiple threads are used to read the file is read in parallel. Such a feature could easily lead to an implementation that does not guarantee that the order of the rows in the returned resultset matches the order of the lines read from the file(s). The SQL language also doesn't guarantee row order unless enforced by an explicit ORDER BY-clause, so there wouldn't be a clear requirement from that side either to preserve the line order.

Fortunately, the DuckDB CSV-reader does guarantee that the order is preserved. This is controlled by the preserve_insertion_order-configuration option, which is TRUE by default.

Next Installment: Detecting line types


Next time, we'll add an expression that can detect whether the line is a header line, or a movetext line.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - An Introduction to PGN (1/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the first installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Chess and Portable Game Notation (PGN)


In this series of posts I will demonstrate how you can use DuckDB to read chess games recorded in PGN-format. You might not be interested in his particular task, but the DuckDB features that allow us to do this are applicable to a wide range of text-processing problems:

PGN Syntax


Portable Game Notation (PGN) is a plain-text format for recording and exchanging chess games. PGN is easily readable and writeable - both for humans and machines. It's widely used by chess software programs as well as by (online) chess magazines, tutorials etc.

Here's a simple PGN example illustrating the shortest game in history:
[Event "a chess café in Paris "]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "Gibaud"]
[Black "Lazard"]
[Result "0-1"]

1. d4 Nf6 2. Nd2 e5 3. dxe5 Ng4 4. h3 Ne3 0-1
This nicely illustrates the PGN-format:
  • A game consists of a set of header-lines, and one or more lines of movetext, separated by a blank line. In our example PGN, the first 7 lines are header lines, and the last line is movetext.
  • A header line contains a single tagpair, which is enclosed in square brackets ([ and ]):
    • Inside the square brackets there are two pieces of text: the tag and the value, which are separated by whitespace.
    • The tag should be a single token, immediately following the opening bracket ([). A tag cannot contain any whitespace.
    • The value is enclosed in double quotes ("). The value appears directly before the closing square bracket (]) that demarcates the end of the header line.
    • Almost any content can appear between the value's double quotes, except for linebreaks. Obviously, the value cannot contain a naked double-quote; see the next point.
    • The backslash (\) may be used as escape character for denoting a literal double quote inside the value, like so: \".
    • Because the backslash is used as escape character, a literal backslash also needs to be escaped. That is denoted by writing two backslashes: \\.
  • The movetext is denoted in Standard Algebraic Notation (SAN).
    • SAN denotes a numbered sequence of moves.
    • A move is the change of position of chess pieces on the chessboard.
    • Each move starts with the move number, which is denoted as a decimal integer. Move numbers start at 1.
    • The move number is immediately followed by a full stop character (.).
    • Multiple moves may appear on a single line, and a single game's movetext may span multiple lines.
    There's a lot more that could be said about the syntax of individual moves. However, this would require a lot more elaboration. For now, we'll proceed to treat the movetext as a whole, without worrying about the syntax that defines its structure.

From PGN to Tabular


While the PGN-format may be good for recording and exchanging games, it is not well suited for searching and analyzing large collections of chess games. A tabular format would fare a lot better there, and this is where DuckDB comes in. I will show you how a single, well formatted and indented DuckDB SELECT statement of barely more than 30 lines can transform the PGN above into the tabular dataset below:
┌─────────┬─────────┬────────────┬─────────┬─────────┬─────────┬─────────┬───────────────────────────────────────────────┐
│ game_id │  Black  │    Date    │ Result  │  Round  │  Site   │  White  │                     moves                     │
│  int64  │ varchar │  varchar   │ varchar │ varchar │ varchar │ varchar │                    varchar                    │
├─────────┼─────────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────────────────────────────────────────┤
│       1 │ Lazard  │ ????.??.?? │ 0-1     │ ?       │ ?       │ Gibaud  │ 1. d4 Nf6 2. Nd2 e5 3. dxe5 Ng4 4. h3 Ne3 0-1 │
└─────────┴─────────┴────────────┴─────────┴─────────┴─────────┴─────────┴───────────────────────────────────────────────┘
As you can see,
  • The game has been assigned a unique number which appears in the game_id column.
  • Multiple lines that made up a game in PGN notation are grouped together into a single row.
  • Tags occurring in the header lines now appear as column names, and the corresponding values appear as their column values.
  • The movetext too gets its own column.
Please note: I'm not saying this is the perfect relational format for representing chess games. However, on a spectrum with PGN blobs on the one end and a fully normalized relational schema on the other end, the format above is well underway. It is a complete representation of the original PGN source that already offers superior opportunities for search and analysis. In addition, it can quite easily be used for further transformation and processing, in particular with SQL.

Obtaining PGN-files


Many chess-related websites offer downloads of either games, puzzles or openings in PGN-format. To name a few:
  • PGN Mentor: collections of games by chess grandmasters. Offers .zip compressed files that group games by player, opening, event. Files are small and typically contain a couple of thousand games. Largest download I found is 13MB (40MB uncompressed)
  • Lichess open database: standard chess and chess-variant games played on the lichess community, grouped by month. Files typically contain millions of games, download sizes range from 17MB to 30GB
To follow along with this post, you'll need at least one PGN-file.

I suggest downloading the DutchClassical.zip openings from the PGN Mentor site. Please note that after download, you will need to unzip it first before you can read it with DuckDB. This is also the file I'll be using for most code samples, and for me the location of the unzipped .pgn file is:
C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn
So, if you want to run the code samples yourself, make sure to adjust that to whatever is the appropriate location on your system.

Alternatively you can get lichess_db_standard_rated_2013-01.pgn.zst, which is the smallest set of standard chess games from the the lichess.org database. DuckDB can already read .zst-compressed files, so you won't need to uncompress it. In fact, the DuckDB manual recommends against uncompressing .zst-files, so you probably shouldn't.

Next Installment: Ingestion of PGN Files


Next time, we'll start by using DuckDB's CSV Reader to ingest raw lines of text from PGN files.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...