Friday, July 12, 2024

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the sixth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Rolling up each game's lines into a single game row


All essential elements are in place for transforming the PGN lines into a tabular result. The actual transformation entails two things:
  • Grouping all lines with the same game_id into one single row.
  • Create columns for each unique value of tag_name from tag_pair, and place the corresponding tag_value into those columns.
We can do that using DuckDB's PIVOT-statement.

Typical PIVOT-statements


The PIVOT-statement is typically used for OLAP use cases to create analytical crosstabs. In this context, we can think of PIVOT as an extension of a standard SELECT-statement with a GROUP BY-clause: each row is actually an aggregate row that represents the group of rows from the underlying dataset that have a unique combination of values for all expressions appearing in the GROUP BY-clause.

In addition, PIVOT also has an ON-clause, and each unique combination of values coming from the ON-clause expressions generates an aggregate column.

At the intersection of the aggregate rows and aggregate columns of the crosstab are the cell values. These are specified by the USING-clause, which specifies aggregate function expressions that are to be applied on those rows from the underlying resultset that belong to both the aggregate row and the aggregate column.

In typical OLAP use cases, the cell values are typically SUM()s or AVG()s over monetary amounts or quantities, sometimes COUNT()s.

PIVOT as a row-to-column transposer


The description of the PIVOT-statement above provides some hints on how we can use it to transform the game lines to a single game row, and how to turn the tags into separate columns:
  • We want to roll up the game lines of one game into a single game row, so the game_id expression should be placed in the GROUP BY-clause.
  • The tag names should be used to generate columns, so the tag_name-member of the tag_pair STRUCT-value returned by regexp_extract() should be placed in the ON-clause.
  • The tag_values should appear as cell values, so the tag_value-member of tag_pair should be placed in the USING-clause.
This makes it all a bit different from the typical analytical crosstabs:
  • The tag_values that are to appear as cell values are primarily of a text type: player names, event locations, and sometimes date- or time-like values. There are some numerical values too, like Elo scores, but these are non-additive, and quite unlike the amounts and quantities we find in the typical OLAP-case.
  • As the tagspairs are just attributes of the game, we expect set of tag_name-values for in one game to be unique. That means that for each game, we will find at most one tag_value in each generated column.
So for the PGN use case, it is more natural to think of the PIVOT-statement as a device to transpose rows to column, rather than an analytical crosstab.

Even though we expect only a single text value for the tag_value, the PIVOT-statement's USING-clause still requires some kind of aggregate function. To aggregate tag_value we can settle for anything that preserves the text: MIN(), MAX(), or ANY_VALUE(), as well as the text aggregate STRING_AGG() would all do.

When we put it all together, this is what our inital PIVOT-statement looks like:
PIVOT(
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
            , [ 'tag_name', 'tag_value' ]
            )
        END                                          AS tag_pair
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
)
ON tag_pair['tag_name']
USING ANY_VALUE( tag_pair['tag_value'] )
GROUP BY game_id
This is what its result looke like:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬─────────┬─────────────────────┬──────────────────────┬──────────┐
│ game_id │        Black         │ BlackElo │    Date    │ . │  Round  │        Site         │        White         │ WhiteElo │
│  int64  │       varchar        │ varchar  │  varchar   │   │ varchar │       varchar       │       varchar        │ varchar  │
├─────────┼──────────────────────┼──────────┼────────────┼───┼─────────┼─────────────────────┼──────────────────────┼──────────┤
│       1 │ Pollock, William H.  │          │ 1895.??.?? │ . │ ?       │ Hastings            │ Tinsley, Samuel      │          │
│       2 │ Lasker, Edward       │          │ 1913.??.?? │ . │ ?       │ Scheveningen        │ Loman, Rudolf        │          │
│       3 │ Tartakower, Saviely  │          │ 1921.??.?? │ . │ 5       │ The Hague           │ Alekhine, Alexander  │          │
│       4 │ Wegemund, Otto       │          │ 1922.??.?? │ . │ 7       │ Bad Oeynhausen      │ Antze, O.            │          │
│       5 │ Tarrasch, Siegbert   │          │ 1922.??.?? │ . │ 19      │ Bad Pistyan         │ Johner, Paul F       │          │
│       6 │ Alekhine, Alexander  │          │ 1922.??.?? │ . │ ?       │ Hastings            │ Bogoljubow, Efim     │          │
│       7 │ Kmoch, Hans          │          │ 1922.??.?? │ . │ ?       │ Vienna              │ Rubinstein, Akiba    │          │
│       8 │ Mieses, Jacques      │          │ 1923.??.?? │ . │ 9       │ Hastings            │ Norman, George Mar.  │          │
│       9 │ Orlando, Placido     │          │ 1923.??.?? │ . │ ?       │ Trieste             │ Szabados, Eugenio    │          │
│      10 │ Tarrasch, Siegbert   │          │ 1923.??.?? │ . │ 1       │ Trieste             │ Seitz, Jakob Adolf   │          │
│      11 │ Wolf, Siegfried Re.  │          │ 1923.??.?? │ . │ 5       │ Vienna              │ Von Patay, J.        │          │
│      12 │ Tartakower, Saviely  │          │ 1924.??.?? │ . │ ?       │ New York            │ Bogoljubow, Efim     │          │
│      13 │ Pokorny, Amos        │          │ 1926.??.?? │ . │ 3       │ Trencianske Teplice │ Kostic, Boris        │          │
│      14 │ Tartakower, Saviely  │          │ 1927.??.?? │ . │ 2       │ Kecskemet           │ Vukovic, Vladimir    │          │
│      15 │ Botvinnik, Mikhail   │          │ 1927.??.?? │ . │ 2       │ Moscow              │ Rabinovich, Ilya L.  │          │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│       · │      ·               │  ·       │     ·      │ · │ ·       │   ·                 │    ·                 │  ·       │
│    7229 │ Kovacevic,Bl         │ 2400     │ 2023.12.09 │ . │ 7.3     │ Zagreb CRO          │ Kozul,Z              │ 2532     │
│    7230 │ Iskos,A              │ 2153     │ 2023.12.10 │ . │ 6.46    │ Skopje MKD          │ Zhezhovska,Monika    │ 1826     │
│    7231 │ Spichkin,A           │ 2035     │ 2023.12.12 │ . │ 2       │ chess.com INT       │ Rodriguez Santiago,J │ 2043     │
│    7232 │ Rogov,Matfey         │ 2213     │ 2023.12.12 │ . │ 3       │ chess.com INT       │ Clarke,Matthew       │ 2127     │
│    7233 │ Osmonbekov,T         │ 2137     │ 2023.12.12 │ . │ 3       │ chess.com INT       │ Sroczynski,M         │ 2266     │
│    7234 │ Novikova,Galina      │ 2073     │ 2023.12.12 │ . │ 8       │ chess.com INT       │ Marcziter,D          │ 2192     │
│    7235 │ Tomazini,A           │ 2336     │ 2023.12.14 │ . │ 5.24    │ Zagreb CRO          │ Pultinevicius,Paul.  │ 2584     │
│    7236 │ Spichkin,A           │ 2035     │ 2023.12.19 │ . │ 2       │ chess.com INT       │ Levine,D             │ 2040     │
│    7237 │ Kanyamarala,Tarun    │ 2305     │ 2023.12.19 │ . │ 4       │ chess.com INT       │ Nechitaylo,Nikita    │ 2203     │
│    7238 │ Ronka,E              │ 2291     │ 2023.12.19 │ . │ 4       │ chess.com INT       │ Gruzman,Ilya         │ 2151     │
│    7239 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.26 │ . │ 1.37    │ Samarkand UZB       │ Mammadzada,G         │ 2449     │
│    7240 │ Koneru,H             │ 2554     │ 2023.12.26 │ . │ 3.24    │ Samarkand UZB       │ Peycheva,Gergana     │ 2271     │
│    7241 │ Carlsson,Andreas     │ 1902     │ 2023.12.28 │ . │ 4.9     │ Karlstad SWE        │ Kreken,Eivind Grunt  │ 2271     │
│    7242 │ Kazarjan,Gachatur    │ 2078     │ 2023.12.30 │ . │ 9.31    │ Groningen NED       │ Schuricht,Emil Fre.  │ 2095     │
│    7243 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.30 │ . │ 14.47   │ Samarkand UZB       │ Zhu Chen             │ 2423     │
├─────────┴──────────────────────┴──────────┴────────────┴───┴─────────┴─────────────────────┴──────────────────────┴──────────┤
│ 7243 rows (30 shown)                                                                                    11 columns (8 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This certainly is starting to look a lot like the result we were after.

Folding in the movetext


The only thing stil missing is the movetext, but at this point, it's almost trivial to add that as well. We can simply amend the tag_pair-expression and let it return a new STRUCT-value with the literal text 'moves' as name member and the line itself as value in case the is_header expression is not TRUE:
CASE
  WHEN is_header THEN
    regexp_extract(
      line
    , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
    , [ 'column_name', 'column_value' ]
    )
  ELSE {
    'column_name': 'moves'
  , 'column_value': line
  }
END AS column_name_value
For consistency, we also changed the tag_pair alias to column_name_value and its member names from column_name and column_value to column_name and column_value respectively. Therefore we must also update the references elsewhere in the PIVOT statement accordingly.

Also, because one game could have multiple lines of movetext, we must also change the aggregate function in the USING-clause from ANY_VALUE() to STRING_AGG(). After these changes we get the final statement:
PIVOT(
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
,       CASE
          WHEN is_header THEN
            regexp_extract(
              line
            , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
            , [ 'column_name', 'column_value' ]
            )
          ELSE {
            'column_name': 'moves'
          , 'column_value': line
          }
        END                                          AS column_name_value
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
)
ON column_name_value['column_name']
USING STRING_AGG( column_name_value['column_value'], ' ' )
GROUP BY game_id
And its result:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬──────────────────────┬──────────┬──────────────────────┐
│ game_id │        Black         │ BlackElo │    Date    │ . │        White         │ WhiteElo │        moves         │
│  int64  │       varchar        │ varchar  │  varchar   │   │       varchar        │ varchar  │       varchar        │
├─────────┼──────────────────────┼──────────┼────────────┼───┼──────────────────────┼──────────┼──────────────────────┤
│       1 │ Pollock, William H.  │          │ 1895.??.?? │ . │ Tinsley, Samuel      │          │ 1.d4 f5 2.c4 e6 3..  │
│       2 │ Lasker, Edward       │          │ 1913.??.?? │ . │ Loman, Rudolf        │          │ 1.d4 e6 2.c4 f5 3..  │
│       3 │ Tartakower, Saviely  │          │ 1921.??.?? │ . │ Alekhine, Alexander  │          │ 1.d4 f5 2.c4 e6 3..  │
│       4 │ Wegemund, Otto       │          │ 1922.??.?? │ . │ Antze, O.            │          │ 1.c4 f5 2.d4 Nf6 3.  │
│       5 │ Tarrasch, Siegbert   │          │ 1922.??.?? │ . │ Johner, Paul F       │          │ 1.d4 e6 2.c4 f5 3..  │
│       6 │ Alekhine, Alexander  │          │ 1922.??.?? │ . │ Bogoljubow, Efim     │          │ 1.d4 f5 2.c4 Nf6 3.  │
│       7 │ Kmoch, Hans          │          │ 1922.??.?? │ . │ Rubinstein, Akiba    │          │ 1.d4 e6 2.c4 f5 3..  │
│       8 │ Mieses, Jacques      │          │ 1923.??.?? │ . │ Norman, George Mar.  │          │ 1.d4 f5 2.g3 Nf6 3.  │
│       9 │ Orlando, Placido     │          │ 1923.??.?? │ . │ Szabados, Eugenio    │          │ 1.d4 e6 2.c4 f5 3..  │
│      10 │ Tarrasch, Siegbert   │          │ 1923.??.?? │ . │ Seitz, Jakob Adolf   │          │ 1.c4 e6 2.d4 f5 3..  │
│      11 │ Wolf, Siegfried Re.  │          │ 1923.??.?? │ . │ Von Patay, J.        │          │ 1.d4 e6 2.c4 f5 3..  │
│      12 │ Tartakower, Saviely  │          │ 1924.??.?? │ . │ Bogoljubow, Efim     │          │ 1.d4 f5 2.g3 e6 3..  │
│      13 │ Pokorny, Amos        │          │ 1926.??.?? │ . │ Kostic, Boris        │          │ 1.c4 f5 2.d4 Nf6 3.  │
│      14 │ Tartakower, Saviely  │          │ 1927.??.?? │ . │ Vukovic, Vladimir    │          │ 1.d4 f5 2.c4 e6 3..  │
│      15 │ Botvinnik, Mikhail   │          │ 1927.??.?? │ . │ Rabinovich, Ilya L.  │          │ 1.d4 e6 2.c4 f5 3..  │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│       · │      ·               │  ·       │     ·      │ · │    ·                 │  ·       │          ·           │
│    7229 │ Kovacevic,Bl         │ 2400     │ 2023.12.09 │ . │ Kozul,Z              │ 2532     │ 1.d4 e6 2.c4 f5 3..  │
│    7230 │ Iskos,A              │ 2153     │ 2023.12.10 │ . │ Zhezhovska,Monika    │ 1826     │ 1.d4 e6 2.c4 f5 3..  │
│    7231 │ Spichkin,A           │ 2035     │ 2023.12.12 │ . │ Rodriguez Santiago,J │ 2043     │ 1.d4 e6 2.c4 f5 3..  │
│    7232 │ Rogov,Matfey         │ 2213     │ 2023.12.12 │ . │ Clarke,Matthew       │ 2127     │ 1.d4 e6 2.c4 f5 3..  │
│    7233 │ Osmonbekov,T         │ 2137     │ 2023.12.12 │ . │ Sroczynski,M         │ 2266     │ 1.d4 e6 2.c4 f5 3..  │
│    7234 │ Novikova,Galina      │ 2073     │ 2023.12.12 │ . │ Marcziter,D          │ 2192     │ 1.d4 e6 2.c4 f5 3..  │
│    7235 │ Tomazini,A           │ 2336     │ 2023.12.14 │ . │ Pultinevicius,Paul.  │ 2584     │ 1.d4 e6 2.c4 f5 3..  │
│    7236 │ Spichkin,A           │ 2035     │ 2023.12.19 │ . │ Levine,D             │ 2040     │ 1.d4 e6 2.c4 f5 3..  │
│    7237 │ Kanyamarala,Tarun    │ 2305     │ 2023.12.19 │ . │ Nechitaylo,Nikita    │ 2203     │ 1.d4 e6 2.c4 f5 3..  │
│    7238 │ Ronka,E              │ 2291     │ 2023.12.19 │ . │ Gruzman,Ilya         │ 2151     │ 1.d4 f5 2.g3 e6 3..  │
│    7239 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.26 │ . │ Mammadzada,G         │ 2449     │ 1.d4 f5 2.c4 Nf6 3.  │
│    7240 │ Koneru,H             │ 2554     │ 2023.12.26 │ . │ Peycheva,Gergana     │ 2271     │ 1.d4 e6 2.c4 f5 3..  │
│    7241 │ Carlsson,Andreas     │ 1902     │ 2023.12.28 │ . │ Kreken,Eivind Grunt  │ 2271     │ 1.d4 e6 2.c4 f5 3..  │
│    7242 │ Kazarjan,Gachatur    │ 2078     │ 2023.12.30 │ . │ Schuricht,Emil Fre.  │ 2095     │ 1.d4 f5 2.g3 e6 3..  │
│    7243 │ Kurbonboeva,Sarvinoz │ 2154     │ 2023.12.30 │ . │ Zhu Chen             │ 2423     │ 1.d4 f5 2.g3 Nf6 3.  │
├─────────┴──────────────────────┴──────────┴────────────┴───┴──────────────────────┴──────────┴──────────────────────┤
│ 7243 rows (30 shown)                                                                           12 columns (7 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Next steps


From this point on there's many things that could be done to improve the solution, for example:
  • Column values that originate from headers tags should un-escape escaped characters.
  • The PGN-syntax itself does not dictate this but there are established conventions for what tag names to use and what kind of values are appropriate. For example, see the seven tag roster and optional tag pairs in the wikipedia entry for Portable Game Notation. It would make sense to further cleanse and conform the corresponding columns and give them a more suitable data type, or to actively validate their value.
  • Further data normalization could be attempted by creating separate tables for player, event, opening etc.
  • The moves could be further processed and analyzed to derive a table of board positions, something which would greatly increase the opportunities to analyze games.
We could also improve the statement and make it more robust:
  • Better detection of the first game line. Our assumption has been that the first movetext always starts with '1.'. But what if a game does not have any moves at all? This may sound like that shouldn't be possible, but especially on an online chess site, a player's connection might break after a game was started, but before a move was made.

    Whatever the reason may be, and whether we're interested in such games or not, our current solution is not capable to detect these cases. Instead, games simply aren't identified as intended, and our final game would likely be a mixture of 2 or maybe even more games. Bad bad bad!

    (If you're interested in looking into such a scenario, the Lichess chess database for October 2013 contains 411,039 games, but 140 do not have any moves.)
  • A more robust regular expression to deal with games that may not adhere fully to the PGN syntax (for example, more foregiving handling of whitespace)
  • Better handling of errors when reading CSV.
For now we leave these considerations as an excercise for the reader.

The goal of these posts was to show how DuckDB's features and SQL-dialect allow these kind of raw-text processing tasks to be solved quickly and elegantly. I hope I have succeeded in demonstrating that - it sure was a lot of fun to try!

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...