This post is the fourth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:

- Chess and Portable Game Notation (PGN)
- Ingesting raw text with the CSV Reader
- Distinguishing the Line Type
- Keeping game lines together: window functions
- Extracting Tagpairs with Regular Expressions
- Rolling up each game's lines into a single game row

### Keeping game lines together: window functions

The transformation to the tabular format that we want to achieve implies a big structural change compared to the raw lines from the PGN file, as we need to transform multiple subsequent lines into a single row that represents a single game. And, we want to do that for all games present in a PGN file. So we want to treat the lines that belong to one game together as a unit, in other words: we need to group them.

You might know that in SQL, the

`GROUP BY`

-clause would achieve that.
But, if we want to use that, we'll need to have an expression (or combination of expressions) that identifies which lines should be grouped together.
We currently don't have any such thing, so we're going to have to create it.
The fact that we're trying to work "across" rows should trigger an instinct to consider window functions. If you're unfamiliar with window functions, some time ago I wrote about year-to-date calculation using window functions - perhaps that may be of interest to you. To recap, a window function calculates a metric by performing some kind of aggregate function over a subset of the rows taken from the current resultset. That subset of rows is called the

*window*.

#### Window-functions as 'running totals'

A simple example of a metric that would be calculated by a window function is the 'running total' (cumulative sum): the running total for the current row is the value of the current rows' amount, added to the sum of amounts calculated over the rows appearing before the current row. So for the running total, the window is defined as the current row and all its previous rows, and the aggregation that is performed over the window is

`SUM()`

.
#### Row counters as running totals

Let's simplify this. Forget about the previous running total example on amount and use

`COUNT(*)`

instead of `SUM()`

.
What would the metric be?
Well,

`COUNT(*)`

counts the number of rows.
For the first row, it would count only the current row because there are no preceding rows, and the result would be `1`

.
For the second row, it would count the current row and the previous row, and the result would be `2`

.
Of course, for the 3rd row the result would be `3`

, for the 4th row it would be `4`

, and so on, calculating an integer value for each row that correspons to the row's position within the resultset as a whole.
The metric from the simplified running total example would be aptly called 'row count' or 'row number'. To calculate it, we would write:

Note how theCOUNT(*)OVER(ROWS BETWEEN UNBOUNDED PRECEDINGAND CURRENT ROW)

`OVER()`

-clause defines the window: the **ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW**-bit matches our explanation of the running total in a remarkably straight-forward way.

#### Counting only specifc rows

To be sure, if we'd only wanted to calculate the row number, we would have used the built-in

`ROW_NUMBER()`

window function, and this would allow you to write the previous expression simply as **ROW_NUMBER()** **OVER**()

.
But we're not interested in counting lines: we want to count the games, while each game spans multiple lines. To count games, and not just any line, we have to find something that occurs in only one line for each game, and apply the

`COUNT()`

window function to that instead.
#### Using the first movetext line to identify games

By analyzing the PGN-syntax, you might find that the movetext is the key that solves this problem:

- Each game has at least one line of movetext, and that first movetext line starts with the first move.
- We also know that moves are numbered, starting with move 1, and that the move numbers are followed by a full-stop.

So, instead of

`COUNT(*)`

, we should count the cases where the line starts with `'1.'`

!
Our expression thus becomes:If the line is the first movetext line,COUNT(CASElineLIKE'1.%'THEN1END)OVER(ROWS BETWEEN UNBOUNDED PRECEDINGAND CURRENT ROW)ASfirst_movetext_linecount

**LIKE** '1.%'

will be `TRUE`

, and the `CASE`

-expression evaluates to `1`

, which will be counted.
Any non-`NULL`

-value would have done the job, but `1`

is a convenient choice: it reminds us that we're counting the first movetext line and at the same time, symbolizes that we're counting that as one game.
In all other cases, the `CASE`

-expression evaluates to `NULL`

, and those lines will not be counted.
If we'd add this to our query, this would be the result:

┌────────────────────────────────────────────────────────────────────────────────────┬───────────┬──────────────────────────┐ │ line │ is_header │ first_movetext_linecount │ │ varchar │ boolean │ int64 │ ├────────────────────────────────────────────────────────────────────────────────────┼───────────┼──────────────────────────┤ │ [Event "Hastings"] │ true │ 0 │ │ [Site "Hastings"] │ true │ 0 │ │ [Date "1895.??.??"] │ true │ 0 │ │ [Round "?"] │ true │ 0 │ │ [White "Tinsley, Samuel"] │ true │ 0 │ │ [Black "Pollock, William Henry Kraus"] │ true │ 0 │ │ [Result "1-0"] │ true │ 0 │ │ [WhiteElo ""] │ true │ 0 │ │ [BlackElo ""] │ true │ 0 │ │ [ECO "A90"] │ true │ 0 │ │1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4│false│1│ │ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5 │ false │ 1 │ │ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6 │ false │ 1 │ │ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7 │ false │ 1 │ │ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8 │ false │ 1 │ │ 37.Qa5 Kh7 38.Qd8 Nxc3 39.Qf8 Qc8 40.Qf7 b4 41.g4 fxg4 42.Qg6+ Kh8 43.Bxg4 Qd8 │ false │ 1 │ │ 44.f4 Qf6 45.Qe8+ Kh7 46.Be6 Qxe6 47.Qxe6 b3 48.Qb6 Ne4 49.Qg6+ Kg8 50.Rxd5 Rc2+ │ false │ 1 │ │ 51.Kf3 Rd2 52.Qxe4 1-0 │ false │ 1 │ │ [Event "Scheveningen"] │ true │ 1 │ │ [Site "Scheveningen"] │ true │ 1 │ │ [Date "1913.??.??"] │ true │ 1 │ │ [Round "?"] │ true │ 1 │ │ [White "Loman, Rudolf"] │ true │ 1 │ │ [Black "Lasker, Edward"] │ true │ 1 │ │ [Result "1/2-1/2"] │ true │ 1 │ │ [WhiteElo ""] │ true │ 1 │ │ [BlackElo ""] │ true │ 1 │ │ [ECO "A90"] │ true │ 1 │ │1.d4 e6 2.c4 f5 3.g3 Nf6 4.Bg2 d5 5.cxd5 exd5 6.Bg5 c6 7.Nc3 Bd6 8.Qc2 O-O│false│2│ │ 9.O-O-O Qa5 10.Bxf6 gxf6 11.Bh3 Bb4 12.Kb1 Bxc3 13.Qxc3 Qxc3 14.bxc3 Na6 │ false │ 2 │ ├────────────────────────────────────────────────────────────────────────────────────┴───────────┴──────────────────────────┤ │ 30 rows 3 columns │ └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

#### Compensating Headers to get the game id

There's one problem left, and that's that the game's header lines occur

*before*that game's first movetext line. So, the game's header lines get a number corresponding to the

*previous*game's movetext line. But, this is easily solved: because it applies to all header lines of each game, we can simply compensate by adding

`1`

in case the line is a header line.
In the previous section, we already solved the problem of identifying header lines. Thanks to a DuckDB feature called reusable column alias (a concept similar to lateral column alias in DataBricks SQL), fixing

`first_movetext_counter`

so it becomes a `game_id`

is trivial!
We can simply check whether `is_header`

is `TRUE`

, and if so, add `1`

: With this compensation in place, the query now looks like this:COUNT(CASElineLIKE'1.%'THEN1END)OVER(ROWS BETWEEN UNBOUNDED PRECEDINGAND CURRENT ROW) +CASEWHENis_headerTHEN1ELSE0ENDASgame_id

The result now becomes:SELECTline , lineLIKE'[%'ASis_header ,COUNT(CASElineLIKE'1.%'THEN1END)OVER(ROWS BETWEEN UNBOUNDED PRECEDINGAND CURRENT ROW) +CASEWHENis_headerTHEN1ELSE0ENDASgame_idFROMread_csv( 'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn' , columns = {'line': 'VARCHAR'} )WHERElineIS NOT NULL

┌────────────────────────────────────────────────────────────────────────────────────┬───────────┬─────────┐ │ line │ is_header │ game_id │ │ varchar │ boolean │ int64 │ ├────────────────────────────────────────────────────────────────────────────────────┼───────────┼─────────┤ │ [Event "Hastings"] │ true │ 1 │ │ [Site "Hastings"] │ true │ 1 │ │ [Date "1895.??.??"] │ true │ 1 │ │ [Round "?"] │ true │ 1 │ │ [White "Tinsley, Samuel"] │ true │ 1 │ │ [Black "Pollock, William Henry Kraus"] │ true │ 1 │ │ [Result "1-0"] │ true │ 1 │ │ [WhiteElo ""] │ true │ 1 │ │ [BlackElo ""] │ true │ 1 │ │ [ECO "A90"] │ true │ 1 │ │ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4 │ false │ 1 │ │ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5 │ false │ 1 │ │ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6 │ false │ 1 │ │ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7 │ false │ 1 │ │ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8 │ false │ 1 │ │ 37.Qa5 Kh7 38.Qd8 Nxc3 39.Qf8 Qc8 40.Qf7 b4 41.g4 fxg4 42.Qg6+ Kh8 43.Bxg4 Qd8 │ false │ 1 │ │ 44.f4 Qf6 45.Qe8+ Kh7 46.Be6 Qxe6 47.Qxe6 b3 48.Qb6 Ne4 49.Qg6+ Kg8 50.Rxd5 Rc2+ │ false │ 1 │ │ 51.Kf3 Rd2 52.Qxe4 1-0 │ false │ 1 │ │ [Event "Scheveningen"] │ true │ 2 │ │ [Site "Scheveningen"] │ true │ 2 │ │ [Date "1913.??.??"] │ true │ 2 │ │ [Round "?"] │ true │ 2 │ │ [White "Loman, Rudolf"] │ true │ 2 │ │ [Black "Lasker, Edward"] │ true │ 2 │ │ [Result "1/2-1/2"] │ true │ 2 │ │ [WhiteElo ""] │ true │ 2 │ │ [BlackElo ""] │ true │ 2 │ │ [ECO "A90"] │ true │ 2 │ │ 1.d4 e6 2.c4 f5 3.g3 Nf6 4.Bg2 d5 5.cxd5 exd5 6.Bg5 c6 7.Nc3 Bd6 8.Qc2 O-O │ false │ 2 │ │ 9.O-O-O Qa5 10.Bxf6 gxf6 11.Bh3 Bb4 12.Kb1 Bxc3 13.Qxc3 Qxc3 14.bxc3 Na6 │ false │ 2 │ ├────────────────────────────────────────────────────────────────────────────────────┴───────────┴─────────┤ │ 30 rows 3 columns │ └──────────────────────────────────────────────────────────────────────────────────────────────────────────┘Perfect! We now have a

`game_id`

expression that we can use to identify all the lines belonging to one game.
Remember, we need that to be able to roll up all the game's lines into a single row on our desired tabular result.
We will return to that topic in the last installment.
### Next Installment: Parsing PGN Tagpairs

Next time, we'll work with regular expressions to extract information from PGN tagpairs.

## No comments:

Post a Comment