Friday, July 12, 2024

DuckDB bag of tricks: Processing PGN chess games with DuckDB - An Introduction to PGN (1/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the first installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Chess and Portable Game Notation (PGN)


In this series of posts I will demonstrate how you can use DuckDB to read chess games recorded in PGN-format. You might not be interested in his particular task, but the DuckDB features that allow us to do this are applicable to a wide range of text-processing problems:

PGN Syntax


Portable Game Notation (PGN) is a plain-text format for recording and exchanging chess games. PGN is easily readable and writeable - both for humans and machines. It's widely used by chess software programs as well as by (online) chess magazines, tutorials etc.

Here's a simple PGN example illustrating the shortest game in history:
[Event "a chess café in Paris "]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "Gibaud"]
[Black "Lazard"]
[Result "0-1"]

1. d4 Nf6 2. Nd2 e5 3. dxe5 Ng4 4. h3 Ne3 0-1
This nicely illustrates the PGN-format:
  • A game consists of a set of header-lines, and one or more lines of movetext, separated by a blank line. In our example PGN, the first 7 lines are header lines, and the last line is movetext.
  • A header line contains a single tagpair, which is enclosed in square brackets ([ and ]):
    • Inside the square brackets there are two pieces of text: the tag and the value, which are separated by whitespace.
    • The tag should be a single token, immediately following the opening bracket ([). A tag cannot contain any whitespace.
    • The value is enclosed in double quotes ("). The value appears directly before the closing square bracket (]) that demarcates the end of the header line.
    • Almost any content can appear between the value's double quotes, except for linebreaks. Obviously, the value cannot contain a naked double-quote; see the next point.
    • The backslash (\) may be used as escape character for denoting a literal double quote inside the value, like so: \".
    • Because the backslash is used as escape character, a literal backslash also needs to be escaped. That is denoted by writing two backslashes: \\.
  • The movetext is denoted in Standard Algebraic Notation (SAN).
    • SAN denotes a numbered sequence of moves.
    • A move is the change of position of chess pieces on the chessboard.
    • Each move starts with the move number, which is denoted as a decimal integer. Move numbers start at 1.
    • The move number is immediately followed by a full stop character (.).
    • Multiple moves may appear on a single line, and a single game's movetext may span multiple lines.
    There's a lot more that could be said about the syntax of individual moves. However, this would require a lot more elaboration. For now, we'll proceed to treat the movetext as a whole, without worrying about the syntax that defines its structure.

From PGN to Tabular


While the PGN-format may be good for recording and exchanging games, it is not well suited for searching and analyzing large collections of chess games. A tabular format would fare a lot better there, and this is where DuckDB comes in. I will show you how a single, well formatted and indented DuckDB SELECT statement of barely more than 30 lines can transform the PGN above into the tabular dataset below:
┌─────────┬─────────┬────────────┬─────────┬─────────┬─────────┬─────────┬───────────────────────────────────────────────┐
│ game_id │  Black  │    Date    │ Result  │  Round  │  Site   │  White  │                     moves                     │
│  int64  │ varchar │  varchar   │ varchar │ varchar │ varchar │ varchar │                    varchar                    │
├─────────┼─────────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────────────────────────────────────────┤
│       1 │ Lazard  │ ????.??.?? │ 0-1     │ ?       │ ?       │ Gibaud  │ 1. d4 Nf6 2. Nd2 e5 3. dxe5 Ng4 4. h3 Ne3 0-1 │
└─────────┴─────────┴────────────┴─────────┴─────────┴─────────┴─────────┴───────────────────────────────────────────────┘
As you can see,
  • The game has been assigned a unique number which appears in the game_id column.
  • Multiple lines that made up a game in PGN notation are grouped together into a single row.
  • Tags occurring in the header lines now appear as column names, and the corresponding values appear as their column values.
  • The movetext too gets its own column.
Please note: I'm not saying this is the perfect relational format for representing chess games. However, on a spectrum with PGN blobs on the one end and a fully normalized relational schema on the other end, the format above is well underway. It is a complete representation of the original PGN source that already offers superior opportunities for search and analysis. In addition, it can quite easily be used for further transformation and processing, in particular with SQL.

Obtaining PGN-files


Many chess-related websites offer downloads of either games, puzzles or openings in PGN-format. To name a few:
  • PGN Mentor: collections of games by chess grandmasters. Offers .zip compressed files that group games by player, opening, event. Files are small and typically contain a couple of thousand games. Largest download I found is 13MB (40MB uncompressed)
  • Lichess open database: standard chess and chess-variant games played on the lichess community, grouped by month. Files typically contain millions of games, download sizes range from 17MB to 30GB
To follow along with this post, you'll need at least one PGN-file.

I suggest downloading the DutchClassical.zip openings from the PGN Mentor site. Please note that after download, you will need to unzip it first before you can read it with DuckDB. This is also the file I'll be using for most code samples, and for me the location of the unzipped .pgn file is:
C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn
So, if you want to run the code samples yourself, make sure to adjust that to whatever is the appropriate location on your system.

Alternatively you can get lichess_db_standard_rated_2013-01.pgn.zst, which is the smallest set of standard chess games from the the lichess.org database. DuckDB can already read .zst-compressed files, so you won't need to uncompress it. In fact, the DuckDB manual recommends against uncompressing .zst-files, so you probably shouldn't.

Next Installment: Ingestion of PGN Files


Next time, we'll start by using DuckDB's CSV Reader to ingest raw lines of text from PGN files.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...