Friday, July 12, 2024

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Distinguishing the Line Type (3/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is the third installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.

The installments are:
  1. Chess and Portable Game Notation (PGN)
  2. Ingesting raw text with the CSV Reader
  3. Distinguishing the Line Type
  4. Keeping game lines together: window functions
  5. Extracting Tagpairs with Regular Expressions
  6. Rolling up each game's lines into a single game row

Distinguishing PGN Headers, Movetext and separators


In the previous installment, we learned how to access the raw lines of text from a PGN file. Before we can move on to greater things, we need to do a little bit of data enrichment. It's not difficult, and it doesn't show off any particular features specific to DuckDB, but it has to be done and this installment covers how. Think of it as a warm-up!

When you inspect the query result presented in the previous installment, you notice quickly there are three types of lines:
  • header lines (marked up in green)
  • movetext lines (marked up in brown)
  • NULL-values (marked up in red), which correspond to empy lines in the file.
We can safely ignore the NULL-values, as they don't carry any information. We do that by simply adding a WHERE-clause to our query:
WHERE line IS NOT NULL
Now let's add an expression that tells us whether a line is a header line (or not). We know header lines have a tagpair enclosed in square brackets. So in principle, checking whether the line starts with an opening square bracket is enough to know we're dealing with a header line. We can do that by using a LIKE-comparison:
line LIKE '[%'
Adding this to our query, we now get:
SELECT line
,      line LIKE '[%' AS is_header
FROM   read_csv(
         'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
       , columns = {'line': 'VARCHAR'}
       )
WHERE  line IS NOT NULL
Our result now looks like this:
┌──────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│                                         line                                         │ is_header │
│                                       varchar                                        │  boolean  │
├──────────────────────────────────────────────────────────────────────────────────────┼───────────┤
│ [Event "Hastings"]                                                                   │ true      │
│ [Site "Hastings"]                                                                    │ true      │
│ [Date "1895.??.??"]                                                                  │ true      │
│ [Round "?"]                                                                          │ true      │
│ [White "Tinsley, Samuel"]                                                            │ true      │
│ [Black "Pollock, William Henry Kraus"]                                               │ true      │
│ [Result "1-0"]                                                                       │ true      │
│ [WhiteElo ""]                                                                        │ true      │
│ [BlackElo ""]                                                                        │ true      │
│ [ECO "A90"]                                                                          │ true      │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4            │ false     │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5   │ false     │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6       │ false     │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7       │ false     │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.Rd3 Rc5 35.h5 Kg8 36.Qa2 Kh8          │ false     │
│               ·                                                                      │  ·        │
│               ·                                                                      │  ·        │
│               ·                                                                      │  ·        │
│ [Black "Kurbonboeva,Sarvinoz"]                                                       │ true      │
│ [Result "1-0"]                                                                       │ true      │
│ [WhiteElo "2423"]                                                                    │ true      │
│ [BlackElo "2154"]                                                                    │ true      │
│ [ECO "A90"]                                                                          │ true      │
│ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh3 Bd6 7.O-O Qe7 8.Bf4 e5 9.dxe5 Bxe5   │ false     │
│ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe7 14.Nf4 Rd8 15.Nd4 Nc6 16.Rc1 Nxd4    │ false     │
│ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rxd4 21.Re7 Kf8 22.Rc7 Be6 23.Rxb7 Nc5   │ false     │
│ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe2 28.Re1 Bb5 29.a4 Bd7 30.Bf3 Rd3      │ false     │
│ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 35.Bxb3+ Rxb3 36.Rd7+ Kf6 37.Rxa7 Rb1+  │ false     │
│ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3 Ra3+ 43.Ke2 Ra2+ 44.Ke3 Ra3+          │ false     │
│ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.hxg6 Kxg6 50.Kc5 h5 51.Kb6 h4           │ false     │
│ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2+ 56.Kb8 Rb2+ 57.Rb7 Ra2 58.a8=Q Rxa8+ │ false     │
│ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63.Kd5 Kg3 64.Ke4 f2 65.Ke3 Kg2          │ false     │
│ 66.Ke4  1-0                                                                          │ false     │
├──────────────────────────────────────────────────────────────────────────────────────┴───────────┤
│ 117000 rows (30 shown)                                                                 2 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────┘
We can be humble, and still say: that's not a bad start!

Next Installment: Keeping the lines of a game together


Next time, we'll add a game_id to keep all lines that belong to one game together.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...