Friday, July 12, 2024

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Keeping the lines of a game together (4/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the fourth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Keeping game lines together: window functions


The transformation to the tabular format that we want to achieve implies a big structural change compared to the raw lines from the PGN file, as we need to transform multiple subsequent lines into a single row that represents a single game. And, we want to do that for all games present in a PGN file. So we want to treat the lines that belong to one game together as a unit, in other words: we need to group them.

You might know that in SQL, the GROUP BY-clause would achieve that. But, if we want to use that, we'll need to have an expression (or combination of expressions) that identifies which lines should be grouped together. We currently don't have any such thing, so we're going to have to create it.

The fact that we're trying to work "across" rows should trigger an instinct to consider window functions. If you're unfamiliar with window functions, some time ago I wrote about year-to-date calculation using window functions - perhaps that may be of interest to you. To recap, a window function calculates a metric by performing some kind of aggregate function over a subset of the rows taken from the current resultset. That subset of rows is called the window.

Window-functions as 'running totals'


A simple example of a metric that would be calculated by a window function is the 'running total' (cumulative sum): the running total for the current row is the value of the current rows' amount, added to the sum of amounts calculated over the rows appearing before the current row. So for the running total, the window is defined as the current row and all its previous rows, and the aggregation that is performed over the window is SUM().

Row counters as running totals


Let's simplify this. Forget about the previous running total example on amount and use COUNT(*) instead of SUM(). What would the metric be?

Well, COUNT(*) counts the number of rows. For the first row, it would count only the current row because there are no preceding rows, and the result would be 1. For the second row, it would count the current row and the previous row, and the result would be 2. Of course, for the 3rd row the result would be 3, for the 4th row it would be 4, and so on, calculating an integer value for each row that correspons to the row's position within the resultset as a whole.

The metric from the simplified running total example would be aptly called 'row count' or 'row number'. To calculate it, we would write:
COUNT(*) OVER (
  ROWS BETWEEN UNBOUNDED PRECEDING
       AND     CURRENT ROW
)
Note how the OVER()-clause defines the window: the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW-bit matches our explanation of the running total in a remarkably straight-forward way.

Counting only specifc rows


To be sure, if we'd only wanted to calculate the row number, we would have used the built-in ROW_NUMBER() window function, and this would allow you to write the previous expression simply as ROW_NUMBER() OVER().

But we're not interested in counting lines: we want to count the games, while each game spans multiple lines. To count games, and not just any line, we have to find something that occurs in only one line for each game, and apply the COUNT() window function to that instead.

Using the first movetext line to identify games


By analyzing the PGN-syntax, you might find that the movetext is the key that solves this problem:
  • Each game has at least one line of movetext, and that first movetext line starts with the first move.
  • We also know that moves are numbered, starting with move 1, and that the move numbers are followed by a full-stop.

So, instead of COUNT(*), we should count the cases where the line starts with '1.'! Our expression thus becomes:
COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
  ROWS BETWEEN UNBOUNDED PRECEDING
       AND     CURRENT ROW
) AS first_movetext_linecount
If the line is the first movetext line, LIKE '1.%' will be TRUE, and the CASE-expression evaluates to 1, which will be counted. Any non-NULL-value would have done the job, but 1 is a convenient choice: it reminds us that we're counting the first movetext line and at the same time, symbolizes that we're counting that as one game. In all other cases, the CASE-expression evaluates to NULL, and those lines will not be counted.

If we'd add this to our query, this would be the result:
┌────────────────────────────────────────────────────────────────────────────────────┬───────────┬──────────────────────────┐
│                                        line                                        │ is_header │ first_movetext_linecount │
│                                      varchar                                       │  boolean  │          int64           │
├────────────────────────────────────────────────────────────────────────────────────┼───────────┼──────────────────────────┤
│ [Event "Hastings"]                                                                 │ true      │                        0 │
│ [Site "Hastings"]                                                                  │ true      │                        0 │
│ [Date "1895.??.??"]                                                                │ true      │                        0 │
│ [Round "?"]                                                                        │ true      │                        0 │
│ [White "Tinsley, Samuel"]                                                          │ true      │                        0 │
│ [Black "Pollock, William Henry Kraus"]                                             │ true      │                        0 │
│ [Result "1-0"]                                                                     │ true      │                        0 │
│ [WhiteElo ""]                                                                      │ true      │                        0 │
│ [BlackElo ""]                                                                      │ true      │                        0 │
│ [ECO "A90"]                                                                        │ true      │                        0 │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4false1 │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5 │ false     │                        1 │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6     │ false     │                        1 │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7     │ false     │                        1 │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8        │ false     │                        1 │
│ 37.Qa5 Kh7 38.Qd8 Nxc3 39.Qf8 Qc8 40.Qf7 b4 41.g4 fxg4 42.Qg6+ Kh8 43.Bxg4 Qd8     │ false     │                        1 │
│ 44.f4 Qf6 45.Qe8+ Kh7 46.Be6 Qxe6 47.Qxe6 b3 48.Qb6 Ne4 49.Qg6+ Kg8 50.Rxd5 Rc2+   │ false     │                        1 │
│ 51.Kf3 Rd2 52.Qxe4  1-0                                                            │ false     │                        1 │
│ [Event "Scheveningen"]                                                             │ true      │                        1 │
│ [Site "Scheveningen"]                                                              │ true      │                        1 │
│ [Date "1913.??.??"]                                                                │ true      │                        1 │
│ [Round "?"]                                                                        │ true      │                        1 │
│ [White "Loman, Rudolf"]                                                            │ true      │                        1 │
│ [Black "Lasker, Edward"]                                                           │ true      │                        1 │
│ [Result "1/2-1/2"]                                                                 │ true      │                        1 │
│ [WhiteElo ""]                                                                      │ true      │                        1 │
│ [BlackElo ""]                                                                      │ true      │                        1 │
│ [ECO "A90"]                                                                        │ true      │                        1 │
│ 1.d4 e6 2.c4 f5 3.g3 Nf6 4.Bg2 d5 5.cxd5 exd5 6.Bg5 c6 7.Nc3 Bd6 8.Qc2 O-Ofalse2 │
│ 9.O-O-O Qa5 10.Bxf6 gxf6 11.Bh3 Bb4 12.Kb1 Bxc3 13.Qxc3 Qxc3 14.bxc3 Na6           │ false     │                        2 │
├────────────────────────────────────────────────────────────────────────────────────┴───────────┴──────────────────────────┤
│ 30 rows                                                                                                         3 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Compensating Headers to get the game id


There's one problem left, and that's that the game's header lines occur before that game's first movetext line. So, the game's header lines get a number corresponding to the previous game's movetext line. But, this is easily solved: because it applies to all header lines of each game, we can simply compensate by adding 1 in case the line is a header line.

In the previous section, we already solved the problem of identifying header lines. Thanks to a DuckDB feature called reusable column alias (a concept similar to lateral column alias in DataBricks SQL), fixing first_movetext_counter so it becomes a game_id is trivial! We can simply check whether is_header is TRUE, and if so, add 1:
COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
  ROWS BETWEEN UNBOUNDED PRECEDING
       AND     CURRENT ROW
) + 
CASE
  WHEN is_header THEN 1
  ELSE 0
END AS game_id
With this compensation in place, the query now looks like this:
SELECT  line
,       line LIKE '[%'                               AS is_header
,       COUNT(CASE line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
              AND     CURRENT ROW
        ) + 
        CASE
          WHEN is_header THEN 1
          ELSE 0
        END                                          AS game_id
FROM    read_csv(
          'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
        , columns = {'line': 'VARCHAR'}
        )
WHERE   line IS NOT NULL
The result now becomes:
┌────────────────────────────────────────────────────────────────────────────────────┬───────────┬─────────┐
│                                        line                                        │ is_header │ game_id │
│                                      varchar                                       │  boolean  │  int64  │
├────────────────────────────────────────────────────────────────────────────────────┼───────────┼─────────┤
│ [Event "Hastings"]                                                                 │ true      │       1 │
│ [Site "Hastings"]                                                                  │ true      │       1 │
│ [Date "1895.??.??"]                                                                │ true      │       1 │
│ [Round "?"]                                                                        │ true      │       1 │
│ [White "Tinsley, Samuel"]                                                          │ true      │       1 │
│ [Black "Pollock, William Henry Kraus"]                                             │ true      │       1 │
│ [Result "1-0"]                                                                     │ true      │       1 │
│ [WhiteElo ""]                                                                      │ true      │       1 │
│ [BlackElo ""]                                                                      │ true      │       1 │
│ [ECO "A90"]                                                                        │ true      │       1 │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4          │ false     │       1 │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5 │ false     │       1 │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6     │ false     │       1 │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7     │ false     │       1 │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8        │ false     │       1 │
│ 37.Qa5 Kh7 38.Qd8 Nxc3 39.Qf8 Qc8 40.Qf7 b4 41.g4 fxg4 42.Qg6+ Kh8 43.Bxg4 Qd8     │ false     │       1 │
│ 44.f4 Qf6 45.Qe8+ Kh7 46.Be6 Qxe6 47.Qxe6 b3 48.Qb6 Ne4 49.Qg6+ Kg8 50.Rxd5 Rc2+   │ false     │       1 │
│ 51.Kf3 Rd2 52.Qxe4  1-0                                                            │ false     │       1 │
│ [Event "Scheveningen"]                                                             │ true      │       2 │
│ [Site "Scheveningen"]                                                              │ true      │       2 │
│ [Date "1913.??.??"]                                                                │ true      │       2 │
│ [Round "?"]                                                                        │ true      │       2 │
│ [White "Loman, Rudolf"]                                                            │ true      │       2 │
│ [Black "Lasker, Edward"]                                                           │ true      │       2 │
│ [Result "1/2-1/2"]                                                                 │ true      │       2 │
│ [WhiteElo ""]                                                                      │ true      │       2 │
│ [BlackElo ""]                                                                      │ true      │       2 │
│ [ECO "A90"]                                                                        │ true      │       2 │
│ 1.d4 e6 2.c4 f5 3.g3 Nf6 4.Bg2 d5 5.cxd5 exd5 6.Bg5 c6 7.Nc3 Bd6 8.Qc2 O-O         │ false     │       2 │
│ 9.O-O-O Qa5 10.Bxf6 gxf6 11.Bh3 Bb4 12.Kb1 Bxc3 13.Qxc3 Qxc3 14.bxc3 Na6           │ false     │       2 │
├────────────────────────────────────────────────────────────────────────────────────┴───────────┴─────────┤
│ 30 rows                                                                                        3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Perfect! We now have a game_id expression that we can use to identify all the lines belonging to one game. Remember, we need that to be able to roll up all the game's lines into a single row on our desired tabular result. We will return to that topic in the last installment.

Next Installment: Parsing PGN Tagpairs


Next time, we'll work with regular expressions to extract information from PGN tagpairs.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...