This post is the sixth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.
The installments are:
- Chess and Portable Game Notation (PGN)
- Ingesting raw text with the CSV Reader
- Distinguishing the Line Type
- Keeping game lines together: window functions
- Extracting Tagpairs with Regular Expressions
- Rolling up each game's lines into a single game row
Rolling up each game's lines into a single game row
All essential elements are in place for transforming the PGN lines into a tabular result. The actual transformation entails two things:
- Grouping all lines with the same
game_id
into one single row. - Create columns for each unique value of
tag_name
fromtag_pair
, and place the correspondingtag_value
into those columns.
PIVOT
-statement.
Typical PIVOT
-statements
The
PIVOT
-statement is typically used for OLAP use cases to create analytical crosstabs.
In this context, we can think of PIVOT
as an extension of a standard SELECT
-statement with a GROUP BY
-clause: each row is actually an aggregate row that represents the group of rows from the underlying dataset that have a unique combination of values for all expressions appearing in the GROUP BY
-clause.
In addition,
PIVOT
also has an ON
-clause, and each unique combination of values coming from the ON
-clause expressions generates an aggregate column.
At the intersection of the aggregate rows and aggregate columns of the crosstab are the cell values. These are specified by the
USING
-clause, which specifies aggregate function expressions that are to be applied on those rows from the underlying resultset that belong to both the aggregate row and the aggregate column.
In typical OLAP use cases, the cell values are typically
SUM()
s or AVG()
s over monetary amounts or quantities, sometimes COUNT()
s.
PIVOT
as a row-to-column transposer
The description of the
PIVOT
-statement above provides some hints on how we can use it to transform the game lines to a single game row, and how to turn the tags into separate columns:
- We want to roll up the game lines of one game into a single game row, so the
game_id
expression should be placed in theGROUP BY
-clause. - The tag names should be used to generate columns, so the
tag_name
-member of thetag_pair
STRUCT
-value returned byregexp_extract()
should be placed in theON
-clause. - The
tag_value
s should appear as cell values, so thetag_value
-member oftag_pair
should be placed in theUSING
-clause.
-
The
tag_value
s that are to appear as cell values are primarily of a text type: player names, event locations, and sometimes date- or time-like values. There are some numerical values too, like Elo scores, but these are non-additive, and quite unlike the amounts and quantities we find in the typical OLAP-case. - As the tagspairs are just attributes of the game, we expect set of tag_name-values for in one game to be unique. That means that for each game, we will find at most one tag_value in each generated column.
PIVOT
-statement as a device to transpose rows to column, rather than an analytical crosstab.
Even though we expect only a single text value for the
tag_value
, the PIVOT
-statement's USING
-clause still requires some kind of aggregate function.
To aggregate tag_value
we can settle for anything that preserves the text: MIN()
, MAX()
, or ANY_VALUE()
, as well as the text aggregate STRING_AGG()
would all do.
When we put it all together, this is what our inital
PIVOT
-statement looks like:
PIVOT( SELECT line , line LIKE '[%' AS is_header , COUNT(CASE line LIKE '1.%' THEN 1 END) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) + CASE WHEN is_header THEN 1 ELSE 0 END AS game_id , CASE WHEN is_header THEN regexp_extract( line , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$' , [ 'tag_name', 'tag_value' ] ) END AS tag_pair FROM read_csv( 'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn' , columns = {'line': 'VARCHAR'} ) WHERE line IS NOT NULL ) ON tag_pair[This is what its result looke like:'tag_name'
] USINGANY_VALUE
( tag_pair['tag_value'
] ) GROUP BY game_id
┌─────────┬──────────────────────┬──────────┬────────────┬───┬─────────┬─────────────────────┬──────────────────────┬──────────┐ │ game_id │ Black │ BlackElo │ Date │ . │ Round │ Site │ White │ WhiteElo │ │ int64 │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │ varchar │ ├─────────┼──────────────────────┼──────────┼────────────┼───┼─────────┼─────────────────────┼──────────────────────┼──────────┤ │ 1 │ Pollock, William H. │ │ 1895.??.?? │ . │ ? │ Hastings │ Tinsley, Samuel │ │ │ 2 │ Lasker, Edward │ │ 1913.??.?? │ . │ ? │ Scheveningen │ Loman, Rudolf │ │ │ 3 │ Tartakower, Saviely │ │ 1921.??.?? │ . │ 5 │ The Hague │ Alekhine, Alexander │ │ │ 4 │ Wegemund, Otto │ │ 1922.??.?? │ . │ 7 │ Bad Oeynhausen │ Antze, O. │ │ │ 5 │ Tarrasch, Siegbert │ │ 1922.??.?? │ . │ 19 │ Bad Pistyan │ Johner, Paul F │ │ │ 6 │ Alekhine, Alexander │ │ 1922.??.?? │ . │ ? │ Hastings │ Bogoljubow, Efim │ │ │ 7 │ Kmoch, Hans │ │ 1922.??.?? │ . │ ? │ Vienna │ Rubinstein, Akiba │ │ │ 8 │ Mieses, Jacques │ │ 1923.??.?? │ . │ 9 │ Hastings │ Norman, George Mar. │ │ │ 9 │ Orlando, Placido │ │ 1923.??.?? │ . │ ? │ Trieste │ Szabados, Eugenio │ │ │ 10 │ Tarrasch, Siegbert │ │ 1923.??.?? │ . │ 1 │ Trieste │ Seitz, Jakob Adolf │ │ │ 11 │ Wolf, Siegfried Re. │ │ 1923.??.?? │ . │ 5 │ Vienna │ Von Patay, J. │ │ │ 12 │ Tartakower, Saviely │ │ 1924.??.?? │ . │ ? │ New York │ Bogoljubow, Efim │ │ │ 13 │ Pokorny, Amos │ │ 1926.??.?? │ . │ 3 │ Trencianske Teplice │ Kostic, Boris │ │ │ 14 │ Tartakower, Saviely │ │ 1927.??.?? │ . │ 2 │ Kecskemet │ Vukovic, Vladimir │ │ │ 15 │ Botvinnik, Mikhail │ │ 1927.??.?? │ . │ 2 │ Moscow │ Rabinovich, Ilya L. │ │ │ · │ · │ · │ · │ · │ · │ · │ · │ · │ │ · │ · │ · │ · │ · │ · │ · │ · │ · │ │ · │ · │ · │ · │ · │ · │ · │ · │ · │ │ 7229 │ Kovacevic,Bl │ 2400 │ 2023.12.09 │ . │ 7.3 │ Zagreb CRO │ Kozul,Z │ 2532 │ │ 7230 │ Iskos,A │ 2153 │ 2023.12.10 │ . │ 6.46 │ Skopje MKD │ Zhezhovska,Monika │ 1826 │ │ 7231 │ Spichkin,A │ 2035 │ 2023.12.12 │ . │ 2 │ chess.com INT │ Rodriguez Santiago,J │ 2043 │ │ 7232 │ Rogov,Matfey │ 2213 │ 2023.12.12 │ . │ 3 │ chess.com INT │ Clarke,Matthew │ 2127 │ │ 7233 │ Osmonbekov,T │ 2137 │ 2023.12.12 │ . │ 3 │ chess.com INT │ Sroczynski,M │ 2266 │ │ 7234 │ Novikova,Galina │ 2073 │ 2023.12.12 │ . │ 8 │ chess.com INT │ Marcziter,D │ 2192 │ │ 7235 │ Tomazini,A │ 2336 │ 2023.12.14 │ . │ 5.24 │ Zagreb CRO │ Pultinevicius,Paul. │ 2584 │ │ 7236 │ Spichkin,A │ 2035 │ 2023.12.19 │ . │ 2 │ chess.com INT │ Levine,D │ 2040 │ │ 7237 │ Kanyamarala,Tarun │ 2305 │ 2023.12.19 │ . │ 4 │ chess.com INT │ Nechitaylo,Nikita │ 2203 │ │ 7238 │ Ronka,E │ 2291 │ 2023.12.19 │ . │ 4 │ chess.com INT │ Gruzman,Ilya │ 2151 │ │ 7239 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.26 │ . │ 1.37 │ Samarkand UZB │ Mammadzada,G │ 2449 │ │ 7240 │ Koneru,H │ 2554 │ 2023.12.26 │ . │ 3.24 │ Samarkand UZB │ Peycheva,Gergana │ 2271 │ │ 7241 │ Carlsson,Andreas │ 1902 │ 2023.12.28 │ . │ 4.9 │ Karlstad SWE │ Kreken,Eivind Grunt │ 2271 │ │ 7242 │ Kazarjan,Gachatur │ 2078 │ 2023.12.30 │ . │ 9.31 │ Groningen NED │ Schuricht,Emil Fre. │ 2095 │ │ 7243 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.30 │ . │ 14.47 │ Samarkand UZB │ Zhu Chen │ 2423 │ ├─────────┴──────────────────────┴──────────┴────────────┴───┴─────────┴─────────────────────┴──────────────────────┴──────────┤ │ 7243 rows (30 shown) 11 columns (8 shown) │ └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘This certainly is starting to look a lot like the result we were after.
Folding in the movetext
The only thing stil missing is the movetext, but at this point, it's almost trivial to add that as well. We can simply amend the
tag_pair
-expression and let it return a new STRUCT
-value with the literal text 'moves' as name member and the line
itself as value in case the is_header
expression is not TRUE
:CASE WHEN is_header THEN regexp_extract( line , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$' , [ 'column_name', 'column_value' ] ) ELSE { 'column_name': 'moves' , 'column_value': line } END AS column_name_valueFor consistency, we also changed the
tag_pair
alias to column_name_value
and its member names from column_name
and column_value
to column_name
and column_value
respectively.
Therefore we must also update the references elsewhere in the PIVOT
statement accordingly.
Also, because one game could have multiple lines of movetext, we must also change the aggregate function in the
USING
-clause from ANY_VALUE()
to STRING_AGG()
.
After these changes we get the final statement:PIVOT( SELECT line , line LIKE '[%' AS is_header , COUNT(CASE line LIKE '1.%' THEN 1 END) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) + CASE WHEN is_header THEN 1 ELSE 0 END AS game_id , CASE WHEN is_header THEN regexp_extract( line , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$' , [ 'column_name', 'column_value' ] ) ELSE { 'column_name': 'moves' , 'column_value': line } END AS column_name_value FROM read_csv( 'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn' , columns = {'line': 'VARCHAR'} ) WHERE line IS NOT NULL ) ON column_name_value[And its result:'column_name'
] USINGSTRING_AGG
( column_name_value['column_value'
], ' ' ) GROUP BY game_id
┌─────────┬──────────────────────┬──────────┬────────────┬───┬──────────────────────┬──────────┬──────────────────────┐ │ game_id │ Black │ BlackElo │ Date │ . │ White │ WhiteElo │ moves │ │ int64 │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │ ├─────────┼──────────────────────┼──────────┼────────────┼───┼──────────────────────┼──────────┼──────────────────────┤ │ 1 │ Pollock, William H. │ │ 1895.??.?? │ . │ Tinsley, Samuel │ │ 1.d4 f5 2.c4 e6 3.. │ │ 2 │ Lasker, Edward │ │ 1913.??.?? │ . │ Loman, Rudolf │ │ 1.d4 e6 2.c4 f5 3.. │ │ 3 │ Tartakower, Saviely │ │ 1921.??.?? │ . │ Alekhine, Alexander │ │ 1.d4 f5 2.c4 e6 3.. │ │ 4 │ Wegemund, Otto │ │ 1922.??.?? │ . │ Antze, O. │ │ 1.c4 f5 2.d4 Nf6 3. │ │ 5 │ Tarrasch, Siegbert │ │ 1922.??.?? │ . │ Johner, Paul F │ │ 1.d4 e6 2.c4 f5 3.. │ │ 6 │ Alekhine, Alexander │ │ 1922.??.?? │ . │ Bogoljubow, Efim │ │ 1.d4 f5 2.c4 Nf6 3. │ │ 7 │ Kmoch, Hans │ │ 1922.??.?? │ . │ Rubinstein, Akiba │ │ 1.d4 e6 2.c4 f5 3.. │ │ 8 │ Mieses, Jacques │ │ 1923.??.?? │ . │ Norman, George Mar. │ │ 1.d4 f5 2.g3 Nf6 3. │ │ 9 │ Orlando, Placido │ │ 1923.??.?? │ . │ Szabados, Eugenio │ │ 1.d4 e6 2.c4 f5 3.. │ │ 10 │ Tarrasch, Siegbert │ │ 1923.??.?? │ . │ Seitz, Jakob Adolf │ │ 1.c4 e6 2.d4 f5 3.. │ │ 11 │ Wolf, Siegfried Re. │ │ 1923.??.?? │ . │ Von Patay, J. │ │ 1.d4 e6 2.c4 f5 3.. │ │ 12 │ Tartakower, Saviely │ │ 1924.??.?? │ . │ Bogoljubow, Efim │ │ 1.d4 f5 2.g3 e6 3.. │ │ 13 │ Pokorny, Amos │ │ 1926.??.?? │ . │ Kostic, Boris │ │ 1.c4 f5 2.d4 Nf6 3. │ │ 14 │ Tartakower, Saviely │ │ 1927.??.?? │ . │ Vukovic, Vladimir │ │ 1.d4 f5 2.c4 e6 3.. │ │ 15 │ Botvinnik, Mikhail │ │ 1927.??.?? │ . │ Rabinovich, Ilya L. │ │ 1.d4 e6 2.c4 f5 3.. │ │ · │ · │ · │ · │ · │ · │ · │ · │ │ · │ · │ · │ · │ · │ · │ · │ · │ │ · │ · │ · │ · │ · │ · │ · │ · │ │ 7229 │ Kovacevic,Bl │ 2400 │ 2023.12.09 │ . │ Kozul,Z │ 2532 │ 1.d4 e6 2.c4 f5 3.. │ │ 7230 │ Iskos,A │ 2153 │ 2023.12.10 │ . │ Zhezhovska,Monika │ 1826 │ 1.d4 e6 2.c4 f5 3.. │ │ 7231 │ Spichkin,A │ 2035 │ 2023.12.12 │ . │ Rodriguez Santiago,J │ 2043 │ 1.d4 e6 2.c4 f5 3.. │ │ 7232 │ Rogov,Matfey │ 2213 │ 2023.12.12 │ . │ Clarke,Matthew │ 2127 │ 1.d4 e6 2.c4 f5 3.. │ │ 7233 │ Osmonbekov,T │ 2137 │ 2023.12.12 │ . │ Sroczynski,M │ 2266 │ 1.d4 e6 2.c4 f5 3.. │ │ 7234 │ Novikova,Galina │ 2073 │ 2023.12.12 │ . │ Marcziter,D │ 2192 │ 1.d4 e6 2.c4 f5 3.. │ │ 7235 │ Tomazini,A │ 2336 │ 2023.12.14 │ . │ Pultinevicius,Paul. │ 2584 │ 1.d4 e6 2.c4 f5 3.. │ │ 7236 │ Spichkin,A │ 2035 │ 2023.12.19 │ . │ Levine,D │ 2040 │ 1.d4 e6 2.c4 f5 3.. │ │ 7237 │ Kanyamarala,Tarun │ 2305 │ 2023.12.19 │ . │ Nechitaylo,Nikita │ 2203 │ 1.d4 e6 2.c4 f5 3.. │ │ 7238 │ Ronka,E │ 2291 │ 2023.12.19 │ . │ Gruzman,Ilya │ 2151 │ 1.d4 f5 2.g3 e6 3.. │ │ 7239 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.26 │ . │ Mammadzada,G │ 2449 │ 1.d4 f5 2.c4 Nf6 3. │ │ 7240 │ Koneru,H │ 2554 │ 2023.12.26 │ . │ Peycheva,Gergana │ 2271 │ 1.d4 e6 2.c4 f5 3.. │ │ 7241 │ Carlsson,Andreas │ 1902 │ 2023.12.28 │ . │ Kreken,Eivind Grunt │ 2271 │ 1.d4 e6 2.c4 f5 3.. │ │ 7242 │ Kazarjan,Gachatur │ 2078 │ 2023.12.30 │ . │ Schuricht,Emil Fre. │ 2095 │ 1.d4 f5 2.g3 e6 3.. │ │ 7243 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.30 │ . │ Zhu Chen │ 2423 │ 1.d4 f5 2.g3 Nf6 3. │ ├─────────┴──────────────────────┴──────────┴────────────┴───┴──────────────────────┴──────────┴──────────────────────┤ │ 7243 rows (30 shown) 12 columns (7 shown) │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Next steps
From this point on there's many things that could be done to improve the solution, for example:
- Column values that originate from headers tags should un-escape escaped characters.
- The PGN-syntax itself does not dictate this but there are established conventions for what tag names to use and what kind of values are appropriate. For example, see the seven tag roster and optional tag pairs in the wikipedia entry for Portable Game Notation. It would make sense to further cleanse and conform the corresponding columns and give them a more suitable data type, or to actively validate their value.
- Further data normalization could be attempted by creating separate tables for player, event, opening etc.
- The moves could be further processed and analyzed to derive a table of board positions, something which would greatly increase the opportunities to analyze games.
-
Better detection of the first game line.
Our assumption has been that the first movetext always starts with
'1.'
. But what if a game does not have any moves at all? This may sound like that shouldn't be possible, but especially on an online chess site, a player's connection might break after a game was started, but before a move was made.
Whatever the reason may be, and whether we're interested in such games or not, our current solution is not capable to detect these cases. Instead, games simply aren't identified as intended, and our final game would likely be a mixture of 2 or maybe even more games. Bad bad bad!
(If you're interested in looking into such a scenario, the Lichess chess database for October 2013 contains 411,039 games, but 140 do not have any moves.) - A more robust regular expression to deal with games that may not adhere fully to the PGN syntax (for example, more foregiving handling of whitespace)
- Better handling of errors when reading CSV.
The goal of these posts was to show how DuckDB's features and SQL-dialect allow these kind of raw-text processing tasks to be solved quickly and elegantly. I hope I have succeeded in demonstrating that - it sure was a lot of fun to try!
No comments:
Post a Comment