This post is the first installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.
The installments are:
- Chess and Portable Game Notation (PGN)
- Ingesting raw text with the CSV Reader
- Distinguishing the Line Type
- Keeping game lines together: window functions
- Extracting Tagpairs with Regular Expressions
- Rolling up each game's lines into a single game row
Chess and Portable Game Notation (PGN)
In this series of posts I will demonstrate how you can use DuckDB to read chess games recorded in PGN-format. You might not be interested in his particular task, but the DuckDB features that allow us to do this are applicable to a wide range of text-processing problems:
- Ingesting raw text with the CSV Reader. The fundamental task of acquiring the raw data is covered in the 2nd installment
- Detecting line types. Some light preprocessing of the data to allow for more focused processing is covered in the 3rd installment
- Identifying groups of lines that belong together. Here we describe the key task of deciding how physcial lines should be grouped into one logical entity - the chess game.
- Parsing lines according to line type to extract attributes. Here we do detailed analysis at the line level and extract the data that will become attributes of our chess games.
- Rolling up multiple lines into single row objects. The final installment, we we use all of the previous steps to achieve the final desired tabular format of the data.
PGN Syntax
Portable Game Notation (PGN) is a plain-text format for recording and exchanging chess games. PGN is easily readable and writeable - both for humans and machines. It's widely used by chess software programs as well as by (online) chess magazines, tutorials etc.
Here's a simple PGN example illustrating the shortest game in history:
[Event "a chess café in Paris "] [Site "?"] [Date "????.??.??"] [Round "?"] [White "Gibaud"] [Black "Lazard"] [Result "0-1"] 1. d4 Nf6 2. Nd2 e5 3. dxe5 Ng4 4. h3 Ne3 0-1This nicely illustrates the PGN-format:
- A game consists of a set of header-lines, and one or more lines of movetext, separated by a blank line. In our example PGN, the first 7 lines are header lines, and the last line is movetext.
- A header line contains a single tagpair, which is enclosed in square brackets (
[
and]
):- Inside the square brackets there are two pieces of text: the tag and the value, which are separated by whitespace.
- The tag should be a single token, immediately following the opening bracket (
[
). A tag cannot contain any whitespace. - The value is enclosed in double quotes (
"
). The value appears directly before the closing square bracket (]
) that demarcates the end of the header line. - Almost any content can appear between the value's double quotes, except for linebreaks. Obviously, the value cannot contain a naked double-quote; see the next point.
- The backslash (
\
) may be used as escape character for denoting a literal double quote inside the value, like so:\"
. - Because the backslash is used as escape character, a literal backslash also needs to be escaped. That is denoted by writing two backslashes:
\\
.
- The movetext is denoted in Standard Algebraic Notation (SAN).
- SAN denotes a numbered sequence of moves.
- A move is the change of position of chess pieces on the chessboard.
- Each move starts with the move number, which is denoted as a decimal integer. Move numbers start at
1
. - The move number is immediately followed by a full stop character (
.
). - Multiple moves may appear on a single line, and a single game's movetext may span multiple lines.
From PGN to Tabular
While the PGN-format may be good for recording and exchanging games, it is not well suited for searching and analyzing large collections of chess games. A tabular format would fare a lot better there, and this is where DuckDB comes in. I will show you how a single, well formatted and indented DuckDB
SELECT
statement of barely more than 30 lines can transform the PGN above into the tabular dataset below:
┌─────────┬─────────┬────────────┬─────────┬─────────┬─────────┬─────────┬───────────────────────────────────────────────┐ │ game_id │ Black │ Date │ Result │ Round │ Site │ White │ moves │ │ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ ├─────────┼─────────┼────────────┼─────────┼─────────┼─────────┼─────────┼───────────────────────────────────────────────┤ │ 1 │ Lazard │ ????.??.?? │ 0-1 │ ? │ ? │ Gibaud │ 1. d4 Nf6 2. Nd2 e5 3. dxe5 Ng4 4. h3 Ne3 0-1 │ └─────────┴─────────┴────────────┴─────────┴─────────┴─────────┴─────────┴───────────────────────────────────────────────┘As you can see,
- The game has been assigned a unique number which appears in the
game_id
column. - Multiple lines that made up a game in PGN notation are grouped together into a single row.
- Tags occurring in the header lines now appear as column names, and the corresponding values appear as their column values.
- The movetext too gets its own column.
Obtaining PGN-files
Many chess-related websites offer downloads of either games, puzzles or openings in PGN-format. To name a few:
- PGN Mentor: collections of games by chess grandmasters. Offers .zip compressed files that group games by player, opening, event. Files are small and typically contain a couple of thousand games. Largest download I found is 13MB (40MB uncompressed)
- Lichess open database: standard chess and chess-variant games played on the lichess community, grouped by month. Files typically contain millions of games, download sizes range from 17MB to 30GB
I suggest downloading the DutchClassical.zip openings from the PGN Mentor site. Please note that after download, you will need to unzip it first before you can read it with DuckDB. This is also the file I'll be using for most code samples, and for me the location of the unzipped .pgn file is:
C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgnSo, if you want to run the code samples yourself, make sure to adjust that to whatever is the appropriate location on your system.
Alternatively you can get lichess_db_standard_rated_2013-01.pgn.zst, which is the smallest set of standard chess games from the the lichess.org database. DuckDB can already read .zst-compressed files, so you won't need to uncompress it. In fact, the DuckDB manual recommends against uncompressing .zst-files, so you probably shouldn't.
Next Installment: Ingestion of PGN Files
Next time, we'll start by using DuckDB's CSV Reader to ingest raw lines of text from PGN files.
No comments:
Post a Comment