This post is the second installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for digitized recording Chess games.
The installments are:
- Chess and Portable Game Notation (PGN)
- Ingesting raw text with the CSV Reader
- Distinguishing the Line Type
- Keeping game lines together: window functions
- Extracting Tagpairs with Regular Expressions
- Rolling up each game's lines into a single game row
Ingesting lines of raw text from files using DuckDB's read_csv()
For this installment I'm going to use DuckDB's capabilities to read and query CSV-files. There is already a lot of content available that explains the ins and outs of using the DuckDB CSV-reader for typical, tabular datatasets, so I'm not going to cover that. Instead, for this particular example, I'm going to (ab)use DuckDB's CSV-reader just to read raw lines of text. So rather than letting the CSV reader discover columns, I want the file to be returned as a list of lines. This query does the trick:
SELECT line FROM read_csv( 'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn' , columns = {'line': 'VARCHAR'} )We use the
columns
argument to specify that our column should be called line
.
We also could've omitted it, but then the column would've been assigned the name column0
, which I think is less clear.
Running the query gives us a result like this:┌──────────────────────────────────────────────────────────────────────────────────────┐ │ line │ │ varchar │ ├──────────────────────────────────────────────────────────────────────────────────────┤ │ [Event "Hastings"] │ │ [Site "Hastings"] │ │ [Date "1895.??.??"] │ │ [Round "?"] │ │ [White "Tinsley, Samuel"] │ │ [Black "Pollock, William Henry Kraus"] │ │ [Result "1-0"] │ │ [WhiteElo ""] │ │ [BlackElo ""] │ │ [ECO "A90"] │ │ <NULL> │ │ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O 6.Qb3 c5 7.e3 Nc6 8.a3 cxd4 │ │ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc1+ 13.Rhxc1 b6 14.Nd4 Ne4 15.Rd1 Qg5 │ │ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.Rxa7 Rxa7 21.Nxa7 Kh8 22.b5 Qf6 │ │ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 27.Kf1 Ra1+ 28.Kg2 b5 29.Qb4 Kh7 │ │ · │ │ · │ │ · │ │ [WhiteElo "2423"] │ │ [BlackElo "2154"] │ │ [ECO "A90"] │ │ <NULL> │ │ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh3 Bd6 7.O-O Qe7 8.Bf4 e5 9.dxe5 Bxe5 │ │ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe7 14.Nf4 Rd8 15.Nd4 Nc6 16.Rc1 Nxd4 │ │ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rxd4 21.Re7 Kf8 22.Rc7 Be6 23.Rxb7 Nc5 │ │ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe2 28.Re1 Bb5 29.a4 Bd7 30.Bf3 Rd3 │ │ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 35.Bxb3+ Rxb3 36.Rd7+ Kf6 37.Rxa7 Rb1+ │ │ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3 Ra3+ 43.Ke2 Ra2+ 44.Ke3 Ra3+ │ │ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.hxg6 Kxg6 50.Kc5 h5 51.Kb6 h4 │ │ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2+ 56.Kb8 Rb2+ 57.Rb7 Ra2 58.a8=Q Rxa8+ │ │ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63.Kd5 Kg3 64.Ke4 f2 65.Ke3 Kg2 │ │ 66.Ke4 1-0 │ │ <NULL> │ ├──────────────────────────────────────────────────────────────────────────────────────┤ │ 131486 rows (30 shown) │ └──────────────────────────────────────────────────────────────────────────────────────┘
Text lines versus CSV reader rows: different, but related
At this point it's good to note that there's a direct correspondence between the lines in the file, and the rows returned by
csv_read()
:
Not only does the content match, the order of the lines as they appear in the file is also preserved.
The latter sounds obvious but really isn't. The CSV-reader has a
parallel
option that controls whether multiple threads are used to read the file is read in parallel.
Such a feature could easily lead to an implementation that does not guarantee that the order of the rows in the returned resultset matches the order of the lines read from the file(s).
The SQL language also doesn't guarantee row order unless enforced by an explicit ORDER BY-clause, so there wouldn't be a clear requirement from that side either to preserve the line order.
Fortunately, the DuckDB CSV-reader does guarantee that the order is preserved. This is controlled by the
preserve_insertion_order
-configuration option, which is TRUE by default.
Next Installment: Detecting line types
Next time, we'll add an expression that can detect whether the line is a header line, or a movetext line.
No comments:
Post a Comment