The DataZen winter meetup 2025 is nigh!
Join us 18 - 20 February 2025 for 3 days of expert-led sessions on AI, LLM, ChatGPT, Big Data, MLOps, and more. This FREE online event is open to data enthusiasts of all levels!
Check out the program here: https://wearecommunity.io/events/winter-data-meetup2025/talks/84304
I'm doing a talk on DuckDB and Huey - an open source browser app for pivoting hundreds of millions of rows directly in your browser.
My talk, "Huey: Blazing-Fast Browser Pivot Tables with DuckDB/WASM", is on February 18, 2025, 10:00-11:00 CET.
I'm looking forward to seeing you there!
Thursday, December 19, 2024
DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance
DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.
This post is about a particular challenge posted on the DuckDB Discord server which happened to interest me. So, here goes!!
Yesterday, on the DuckDB Discord server, user @col_montium asked:

I've got a bit of a head scratcher. My json from https://api.census.gov/data/2000/dec/sf1/variables.json is a single struct with dynamic keys. I want to extract a table with columns field_code, label, concept, and group. I completed this with a small sample of the data but with the entire dataset the query uses up 24 gigabytes of RAM and then crashes. Here's my query:
WITH source AS (
  SELECT *
  FROM   'https://api.census.gov/data/2000/dec/sf1/variables.json'
)
, all_keys AS (
  SELECT unnest( json_keys( variables ) ) AS "key"
  ,      variables
  FROM   source
)
, extracted_fields AS (
  SELECT "key"                              AS field_code
  ,      variables->key->>'$.label'         AS label
  ,      variables->key->>'$.concept'       AS concept
  ,      variables->key->>'$.predicateType' AS predicate_type
  ,      variables->key->>'$.group'         AS "group"
  FROM   all_keys
  WHERE  "key" NOT IN ('for', 'in', 'ucgid')
)
SELECT   *
FROM     extracted_fields
ORDER BY field_code

I tried to run this myself on my laptop using the DuckDB 1.1.3 (GA) command line interface, and it failed with a similar error as reported by @col_montium:
Run Time (s): real 252.335 user 197.031250 sys 9.500000
Out of Memory Error: failed to allocate data of size 16.0 MiB (24.9 GiB/25.0 GiB used)

I was able to run it successfully using a 1.1.4 nightly build, but it took about 3 minutes to complete:
Run Time (s): real 188.681 user 148.500000 sys 8.078125
Dataset: Size and Structure
Surprisingly, the source data set is less than 2MB (2,039,748 bytes)! The query also doesn't appear to be terribly complicated, and if it finishes successfully, it yields only 8141 rows.
So, clearly something interesting must be the matter with the structure of the dataset.
As it is so modest in size, we can easily inspect it with a text editor:
{ "variables": { "for": { "label": "Census API FIPS 'for' clause", "concept": "Census API Geography Specification", "predicateType": "fips-for", "group": "N/A", "limit": 0, "predicateOnly": true }, "in" { "label": "Census API FIPS 'in' clause", "concept": "Census API Geography Specification", "predicateType": "fips-in", "group": "N/A", "limit": 0, "predicateOnly": true }, "ucgid" { "label": "Uniform Census Geography Identifier clause", "concept": "Census API Geography Specification", "predicateType": "ucgid", "group": "N/A", "limit": 0, "predicateOnly": true, "hasGeoCollectionSupport": true }, "P029009" { "label": "Total!!In households!!Related child!!Own child!!6 to 11 years", "concept": "RELATIONSHIP BY AGE FOR THE POPULATION UNDER 18 YEARS [46]", "predicateType": "int", "group": "P029", "limit": 0 }, ...many more variables... "PCT012H185" { "label": "Total!!Female!!78 years", "concept": "SEX BY AGE (HISPANIC OR LATINO) [209]", "predicateType": "int", "group": "PCT012H", "limit": 0 } } }The structure of the dataset is quite simple: a single object with a single
variables
property, which has an object value with many object-typed properties.
The object-type values of these properties contain a handful of recurring scalar properties like label
, concept
, predicateType
, group
and limit
.
Now that we analyzed the structure, it should be clear what the query attempts to achieve. It wants to make a row for each property of the outermost variables-object. The property name becomes the field_code column, and forms its natural primary key. The recurring properties of the innermost objects become its non-key columns label, concept, predicate_type and group.
Using read_text() and a DuckDB table column with the JSON-datatype
In an attempt to get the query working without running out of memory, I decided to store the dataset in a DuckDB table like so:
CREATE TABLE t_json AS
SELECT filename
,      content::JSON AS data
FROM   read_text('https://api.census.gov/data/2000/dec/sf1/variables.json');

With this in place, a modified version of the initial query might be:
WITH variables AS (
  SELECT data->'variables' AS variables
  ,      unnest( json_keys( variables ) ) AS field_code
  FROM   t_json
)
SELECT   field_code
,        variables->field_code->>'label'         AS label
,        variables->field_code->>'concept'       AS concept
,        variables->field_code->>'predicateType' AS predicate_type
,        variables->field_code->>'group'         AS "group"
FROM     variables
WHERE    field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY field_code

The results look like this:
┌────────────┬──────────────────────────────────┬─────────┬────────────────┬─────────┐
│ field_code │ label │ concept │ predicate_type │ group │
│ varchar │ varchar │ varchar │ varchar │ varchar │
├────────────┼──────────────────────────────────┼─────────┼────────────────┼─────────┤
│ AIANHH │ American Indian Area/Alaska Na… │ NULL │ NULL │ N/A │
│ AIHHTLI │ American Indian Area (Off-Rese… │ NULL │ NULL │ N/A │
│ AITSCE │ American Indian Tribal Subdivi… │ NULL │ NULL │ N/A │
│ ANRC │ Alaska Native Regional Corpora… │ NULL │ NULL │ N/A │
│ BLKGRP │ Census Block Group │ NULL │ NULL │ N/A │
│ BLOCK │ Census Block │ NULL │ NULL │ N/A │
│ CD106 │ Congressional District (106th) │ NULL │ string │ N/A │
│ CONCIT │ Consolidated City │ NULL │ NULL │ N/A │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │
│ SUBMCD │ Sub-Minor Civil Division (FIPS) │ NULL │ NULL │ N/A │
│ SUMLEVEL │ Summary Level code │ NULL │ string │ N/A │
│ TRACT │ Census Tract │ NULL │ NULL │ N/A │
│ UA │ Urban Area │ NULL │ NULL │ N/A │
│ US │ United States │ NULL │ NULL │ N/A │
│ ZCTA3 │ ZIP Code Tabulation Area (Thre… │ NULL │ NULL │ N/A │
│ ZCTA5 │ Zip Code Tabulation Area (Five… │ NULL │ NULL │ N/A │
├────────────┴──────────────────────────────────┴─────────┴────────────────┴─────────┤
│ 8141 rows (15 shown) 5 columns │
└─────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 11.682 user 8.015625 sys 2.406250

Less than 12 seconds. Not super-fast, but still: a pretty substantial improvement as compared to the initial query. What could explain this difference? Two things stand out as compared to the initial query:
- Using the read_text()-function to ingest the data. In the original query, the data was ingested by the first common table expression called source, which did a SELECT directly FROM the dataset's URL 'https://api.census.gov/data/2000/dec/sf1/variables.json'. Clearly, with this syntax, DuckDB will do some magic to perform the appropriate action, which in this case will be to invoke read_json_auto('https://api.census.gov/data/2000/dec/sf1/variables.json').
- Explicitly casting the content of the data file to DuckDB's JSON-datatype. I expected the issue with the original query had something to do with the transformation of JSON data to a tabular result. Having the data in a JSON-datatype seemed like a good start to play with different methods to extract data. The read_text()-function returns the contents of the file as VARCHAR. It cannot and does not attempt to parse or process the data. That's why we need to cast its output explicitly to the JSON-type ourselves.
Actually, we don't even need the intermediate table: we can query the output of read_text() directly:

WITH variables AS (
  SELECT json( content )->'variables' AS variables
  ,      unnest( json_keys( variables ) ) AS field_code
  FROM   read_text( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
)
SELECT   field_code
,        variables->field_code->>'label'         AS label
,        variables->field_code->>'concept'       AS concept
,        variables->field_code->>'predicateType' AS predicate_type
,        variables->field_code->>'group'         AS "group"
FROM     variables
WHERE    field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY field_code

This query takes about 20 seconds - in other words, about the same time as it takes to run the previous query plus the time to fetch the dataset and load it into the table. For a serious comparison of query performance we should probably eliminate the download and use only local files, but we'll get to that later. For now it's enough to notice that this setup is a significant improvement, both in terms of query execution time as well as in stability/memory consumption.
You might also have noticed this alternative query uses slightly different syntax to extract label, concept, predicateType and group. Also, the final data presentation query is combined with the field extraction logic. However, these changes are really a matter of style. They do not have any bearing on query performance - not for this data set anyway.
Understanding the difference: the JSON-reader
With these results in mind, we have to conclude that the actual operations on the JSON data cannot really explain the difference. So, it must have something to do with the JSON-reader that was implicitly invoked by the initial query. Let's zoom in a bit on the difference between data extracted from a JSON-typed value and data extracted by the JSON-reader.
We know that in the second example the data is of the JSON-type, because we explicitly cast it to that type. However, that only tells us that the data is essentially text, conforming to the JSON-syntax. While it is of course very useful to know the encoding scheme used to represent values and structures, that still does not reveal anything about the content and structure of the data itself.
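A quick check against the t_json table we created earlier confirms this - the type tells us nothing beyond "it's JSON":

SELECT typeof( data ) AS data_type
FROM   t_json
LIMIT  1;
-- JSON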
What data type does the JSON reader think the data has? Let's find out using this DESCRIBE-statement:
DESCRIBE
SELECT *
FROM 'https://api.census.gov/data/2000/dec/sf1/variables.json'
Its result is:

┌─────────────┬─────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables │ STRUCT("for" STRUCT("labe… │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 9.541 user 1.968750 sys 0.031250

As you can see, the variables property of the main object that makes up our dataset was detected and projected as a column. DuckDB has also inferred a pretty detailed data type for the column, expressed using DuckDB native SQL data types.
In the example above it's truncated, but if we run it again using .mode line, we can see the entire type descriptor. It goes like this:
STRUCT(
  "for" STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  )
, "in" STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  )
, ucgid STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  , hasGeoCollectionSupport BOOLEAN
  )
, P029009 STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  )
...many more ...
, PCT012H185 STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  )
)

Looks familiar? It should, as it corresponds closely to the structure we witnessed when analyzing the raw JSON data using a text editor.
So, this really is quite different as compared to the JSON-type column, as the JSON-reader must have done some work to explore the structure of the data. Not only did it detect the data type, it also uses the type to represent the data, which in turn determines what operators and functions will be available to the remainder of the query to work with the data.
Whether this is related to the difference in performance and to the out-of-memory errors is yet to be determined, but it does not seem unlikely. For starters, the data type closely resembles the data. The textual description of the data type is about half the size of the data itself, which does not seem like a good thing!
We will investigate the JSON reader's capabilities for data type detection, and the possibilities to control it, in the next couple of sections. But before we examine that in more detail, let's take a step back and consider what we just learned.
One lesson learned
So far, we've learned at least one important lesson: the JSON reader does in fact do exactly what its name implies - read JSON-encoded data. But while doing so, it tends to output data using DuckDB native data types, and typically not the JSON data type.
In hindsight this certainly sounds perfectly sensible. Yet, both @col_montium and I apparently did not fully realize that, as the original query takes the data coming out of the JSON reader, and processes it using functions like json_keys(), and the extraction operators -> and ->>. These are designed for working with values of the JSON data type, and not for the STRUCT values that the JSON reader hands back to us.
You may wonder: how is it possible that these functions and operators work at all on the STRUCT value returned by the JSON reader? The answer is quite simple: when we apply these functions and operators, they first implicitly cast their arguments and operands to the JSON-type. So, just as the JSON reader spent all its effort to read raw JSON-encoded data and convert it into neat native DuckDB STRUCTs, we immediately undid that effort only to convert them back to a JSON-type. And initially we - or at least I - didn't even realize it.
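A tiny sketch makes the implicit cast visible: json_keys() is a JSON-type function, yet it happily accepts a STRUCT literal (the values below are made up), because the argument is cast to JSON first:

-- the STRUCT argument is implicitly cast to JSON before json_keys() does its work
SELECT json_keys( { 'label': 'Census Tract', 'group': 'N/A' } ) AS keys;
-- [label, group]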
Actually - it is a little bit different still, as the JSON-reader deals with reading JSON-encoded text whereas the JSON-data type is still a database type. But the point here is that the JSON-reader did work to provide a precise description of the data as STRUCTs, and we didn't bother treating the data as such. Instead, we implicitly cast it to the JSON-type, which is much looser. So it feels there might be an unused opportunity to benefit from the detected data type. We will examine this possibility in the next section.
Avoiding the implicit JSON-type cast?
Now that we have a better understanding of what we've done, could we rewrite the original query so we can avoid the implicit JSON-type cast, and benefit from the DuckDB native types returned by the JSON reader? Somewhat surprisingly, this is not quite as simple as one might think it would (or should) be.
First of all, we cannot simply swap in STRUCT equivalents for the JSON-type functions and operators. For example, it is currently not possible to extract the keys from a STRUCT type. So there's no straightforward equivalent to how the original query uses json_keys() and then unnest() to spread out the keys from the original variables object into separate rows. (In the DuckDB github repository, @pkoppstein opened a discussion about creating a feature to extract the keys from STRUCTs, in case you want to chime in.)
Even if we somehow managed to extract the keys from the STRUCT, there is currently no way to use them to extract the corresponding values. That is, STRUCTs of course support extraction methods, but all of them require the key to be a constant value. This prohibits using a column value, such as what we would get if we'd use an unnested set of keys:
WITH field_codes AS (
  SELECT unnest( json_keys( variables ) ) AS field_code
  ,      variables
  FROM   'https://api.census.gov/data/2000/dec/sf1/variables.json'
)
SELECT field_code
,      variables[field_code] AS "variable"
FROM   field_codes
WHERE  field_code NOT IN ('for', 'in', 'ucgid')

In the example above, variables is the STRUCT-typed value we receive from the JSON-reader. The expression variables[field_code] attempts to extract a value from it using the column value field_code. The statement fails because field_code is not a constant:

Run Time (s): real 8.345 user 1.531250 sys 0.031250
Binder Error: Key name for struct_extract needs to be a constant string

As the error message suggests, the square bracket extraction syntax is just syntactic sugar for the struct_extract() function. Using a function call instead yields the exact same error.
A Solution based on STRUCTs
Eventually I came up with an approach that at least allows me to envision what a solution based on STRUCTs, without relying on the JSON-type, might look like. Here's that attempt:
WITH variables AS (
  UNPIVOT (
    WITH variables AS (
      SELECT COLUMNS( variables.* EXCLUDE( "for", "in", "ucgid" ) )
      FROM   read_json( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
    )
    SELECT struct_pack(
             label         := struct_extract( COLUMNS( * ), 'label' )
           , concept       := struct_extract( COLUMNS( * ), 'concept' )
           , predicateType := struct_extract( COLUMNS( * ), 'predicateType' )
           , "group"       := struct_extract( COLUMNS( * ), 'group' )
           )
    FROM variables
  )
  ON( * )
  INTO NAME field_code VALUE "variable"
)
SELECT   field_code, "variable".*
FROM     variables
ORDER BY field_code

The main idea here is to use variables.* to turn the keys of the STRUCT given to us by the JSON-reader into separate columns. The UNPIVOT-statement can then turn those columns into rows. In the original query this was achieved by combining json_keys() and unnest().
Since we don't need the metadata-like objects ucgid, for and in, we also don't need to unpack them from the variables object. So, we wrapped variables.* into a COLUMNS-"star expression" so we could add an EXCLUDE-clause to remove them. In the original query, those were removed in the WHERE-clause.
The UNPIVOT-statement not only creates rows from columns, but also lets us transform the column names into values. We use the NAME-clause to collect them into a new field_code column. The VALUE-clause can be used to name the column that receives the corresponding column values. In the example above, that value column is assigned the name variable.
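Stripped of our data, the NAME/VALUE mechanics look like this - a minimal sketch with two made-up columns a and b:

UNPIVOT ( SELECT 1 AS a, 2 AS b )
ON a, b
INTO NAME key_name VALUE key_value;
-- two rows: ('a', 1) and ('b', 2)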
Now, the values extracted from the variables-object are themselves STRUCTs, and they do not all have the same set of keys. From the DuckDB perspective, they therefore have distinct types, and thus cannot all simply be lumped together directly into a single column.
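You can see this in a tiny sketch (the literals below are made up): two STRUCT values that differ only in their key set already have different types:

SELECT typeof( { 'label': 'x', 'group': 'N/A' } )                        AS type_a
,      typeof( { 'label': 'x', 'group': 'N/A', 'predicateOnly': true } ) AS type_b;
-- the two reported STRUCT types differ, so such values cannot share a single column as-is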
However, we're only interested in a particular set of properties. They are 'label', 'concept', 'predicateType', and 'group'. If the type of their value is the same across all objects, then we can extract them and create new STRUCTs having only those keys. This is achieved using struct_extract() and struct_pack().
Again, the COLUMNS(*)-"star expression" proves to be a very useful tool! We use it here as argument for struct_extract(). This way, we need to write the STRUCT extraction-and-assembly business only once. The star expression then applies it to all columns without having to name them explicitly. The STRUCT-value assembled by struct_pack() gets the name of the original column.
The final touch is to unwrap the STRUCTs we assembled into separate columns. This is done in the terminal SELECT, again using the variable.* syntax.
Nice! Pity it works only in theory
When we run it, we get:
Run Time (s): real 14.183 user 12.484375 sys 0.062500
Binder Error: Could not find key "concept" in struct

Apparently, not only do the objects in our data set sometimes have more keys than we're interested in, some also don't have all the keys we're attempting to extract. There does not seem to be any way to detect whether some arbitrary STRUCT-value has a specific key, so we can't simply work around it.
If we comment out the extraction for the concept key, we get the same error, but now for the predicateType key. If you comment out that extraction too, the query executes successfully. On my laptop, it takes about as long as the original query - that is to say, much worse than the alternative using the JSON-type.
So even if we would somehow be able to overcome the issues with extracting the properties and creating the STRUCT, we still can't seem to really benefit from the more strict typing. I guess the main take-away here is: we can STRUCTggle all we want, but our data simply appears not to be given to us in a way that allows it to work for us.
JSON reader Parameters
We mentioned earlier that SELECT-ing directly from the URL, as the original query did, causes DuckDB to invoke read_json() or read_json_auto(), which are synonyms of each other. If you like, you can convince yourself by running an equivalent DESCRIBE-statement that explicitly invokes the reader:
DESCRIBE
SELECT *
FROM read_json( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )

If you execute it, you'll notice that the inferred column name and data type are identical to those returned by the prior DESCRIBE-statement. But invoking the reader explicitly has a benefit: it offers us the possibility to pass parameters to control its behavior.
Some relevant to the topic at hand are:
- BOOLEAN auto_detect - Whether to auto-detect the schema at all. Contrary to what the current documentation states, the default value is TRUE rather than FALSE.
- STRUCT(name VARCHAR, type VARCHAR) columns - If you choose auto_detect to be FALSE, you are required to explicitly specify the columns and column types (see the sketch right after this list).
- INTEGER maximum_depth - The number of nesting levels that are considered when detecting the datatype. The default is -1, which means there is no restriction on the depth of type detection. When set to a positive integer, and there are object-type values at the maximum level of nesting, then those will get 'detected' as being of the JSON-type.
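For example, if we already know the type we want, we could skip detection altogether. A sketch - the MAP type string here is our own choice, not something the reader infers:

DESCRIBE
SELECT *
FROM read_json(
       'https://api.census.gov/data/2000/dec/sf1/variables.json'
     , auto_detect = false
     , columns = { 'variables': 'MAP( VARCHAR, JSON )' }
     );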
Controlling the depth of the JSON-reader's type detection
Now that we learned about the maximum_depth parameter, let's apply it and experience its effect:
DESCRIBE
SELECT *
FROM read_json(
       'https://api.census.gov/data/2000/dec/sf1/variables.json'
     , maximum_depth = 0
     )

At maximum_depth = 0, there is no detection at all, and the entire data set is just a JSON-typed value:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ json │ JSON │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 7.583 user 0.375000 sys 0.015625

This is on par with what we achieved with read_text() and explicitly casting its content column to the JSON-type. Let's allow for one level:
DESCRIBE
SELECT *
FROM read_json(
       'https://api.census.gov/data/2000/dec/sf1/variables.json'
     , maximum_depth = 1
     )

At maximum_depth = 1 we get at least extraction into separate columns. However, the column data type will still be the generic JSON-type:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables │ JSON │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 8.521 user 0.343750 sys 0.000000

We could allow another level:

DESCRIBE
SELECT *
FROM read_json(
       'https://api.census.gov/data/2000/dec/sf1/variables.json'
     , maximum_depth = 2
     )

At maximum_depth = 2, the type detected for the variables column is a bag of properties of the JSON-type. In the DuckDB type system, this is represented as MAP( VARCHAR, JSON ):
┌─────────────┬────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables │ MAP(VARCHAR, JSON) │ YES │ NULL │ NULL │ NULL │
└─────────────┴────────────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 8.229 user 0.343750 sys 0.000000

This is interesting! So far we've seen STRUCTs, but not MAPs before.
JSON-objects as DuckDB MAPs
Let's attempt to rewrite the original query using the MAP( VARCHAR, JSON )-typed value we get from the JSON-reader when we pass maximum_depth = 2:
WITH variables AS (
  SELECT unnest( map_keys( variables ) )   AS field_code
  ,      unnest( map_values( variables ) ) AS "variable"
  FROM   read_json(
           'https://api.census.gov/data/2000/dec/sf1/variables.json'
         , maximum_depth = 2
         )
)
SELECT   field_code
,        "variable"->>'label'         AS label
,        "variable"->>'concept'       AS concept
,        "variable"->>'predicateType' AS predicate_type
,        "variable"->>'group'         AS "group"
FROM     variables
WHERE    field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY field_code

The main difference as compared to using only the JSON-type is that the MAP-type lets us extract both the keys as well as the corresponding values, using map_keys() and map_values() respectively. This means the final SELECT does not have to extract the variable object explicitly using the key: instead, we can immediately extract the properties of interest from the variable objects.
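In isolation, the two MAP functions look like this - a tiny sketch on a made-up MAP, not the census data:

SELECT map_keys(   map( ['a', 'b'], [1, 2] ) ) AS ks   -- [a, b]
,      map_values( map( ['a', 'b'], [1, 2] ) ) AS vs;  -- [1, 2]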
Compared to the original query, the other main difference is that the JSON-reader now hands the variable objects to us as JSON-typed values already. The original query relied on implicit typecasting, which does not happen here.
If we download the JSON file and store it on the local disc, and modify our queries to use that, the results are quite remarkable: the one using read_text() and the explicit cast to the JSON-type takes about 12 seconds. However, that result is smashed completely by the last query, which now takes just half a second! A really nice improvement!
Conclusion
- DuckDB's feature to SELECT directly from a file or URL is convenient, but it pays off to examine the underlying reader invocation and its parameters.
- When using the JSON-reader, it's a good idea to examine whether the detected datatype is suitable for the extraction you're attempting to perform. By default, the JSON-reader will exhaustively detect the type. If you find the detected type is large as compared to your dataset, and/or more precise or detailed than required for your purpose, try using the maximum_depth parameter to curtail the type detection.
- If you're using the JSON-reader and find that your extraction logic relies on JSON-type functions and operators, then beware of implicit casting of your extracted data to the JSON-type. Implicit casting to the JSON-type may point to an opportunity to limit the level of data type detection.
- When reading JSON-data, you can try extracting from a JSON-typed value - either by loading your data into a table with a JSON-typed column, or by using a combination of read_text() and an explicit cast to the JSON-type. Even if your goal is to use the JSON-reader, the JSON-typed value is a useful baseline: if you find that your JSON-reader based query is slower than the one on the JSON-typed value, it means you have an opportunity to improve.
- If you're using json_keys() with the purpose of extracting objects or values from the JSON-data, then consider using the JSON-reader and configuring it so that it returns a MAP-type. The MAP functions map_keys() and map_values() are really fast, and may help you avoid an extra step to use the key to extract an object.
- It would be nice if STRUCTs would be a little bit more flexible. For example, I'd like to be able to extract field values using non-constant expressions. It would also be useful to have a function to obtain a STRUCT's keys, and a way to check whether it has a particular key. Another feature that would have helped is the possibility to define STRUCT-fields as optional, or to give them a default value.
Friday, July 12, 2024
DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)
DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.
This post is the sixth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for the digitized recording of Chess games.
The installments are:
- Chess and Portable Game Notation (PGN)
- Ingesting raw text with the CSV Reader
- Distinguishing the Line Type
- Keeping game lines together: window functions
- Extracting Tagpairs with Regular Expressions
- Rolling up each game's lines into a single game row
Rolling up each game's lines into a single game row
All essential elements are in place for transforming the PGN lines into a tabular result. The actual transformation entails two things:
- Grouping all lines with the same game_id into one single row.
- Creating columns for each unique value of tag_name from tag_pair, and placing the corresponding tag_value into those columns.

We can achieve both of these at once using DuckDB's PIVOT-statement.
Typical PIVOT-statements
The PIVOT-statement is typically used for OLAP use cases to create analytical crosstabs. In this context, we can think of PIVOT as an extension of a standard SELECT-statement with a GROUP BY-clause: each row is actually an aggregate row that represents the group of rows from the underlying dataset that have a unique combination of values for all expressions appearing in the GROUP BY-clause.
In addition, PIVOT also has an ON-clause, and each unique combination of values coming from the ON-clause expressions generates an aggregate column.
At the intersection of the aggregate rows and aggregate columns of the crosstab are the cell values. These are specified by the USING-clause, which specifies aggregate function expressions that are to be applied to those rows from the underlying resultset that belong to both the aggregate row and the aggregate column.
In OLAP use cases, the cell values are typically SUM()s or AVG()s over monetary amounts or quantities, and sometimes COUNT()s.
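To make the three clauses concrete, here's a tiny, hypothetical crosstab on made-up sales data (not our chess lines):

PIVOT (
  SELECT 2024 AS yr, 'Q1' AS quarter, 100 AS amount
  UNION ALL
  SELECT 2024, 'Q2', 150
)
ON       quarter        -- each distinct quarter becomes a column
USING    SUM( amount )  -- the cell values
GROUP BY yr;            -- each distinct year becomes a row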
PIVOT as a row-to-column transposer
The description of the PIVOT-statement above provides some hints on how we can use it to transform the game lines to a single game row, and how to turn the tags into separate columns:
- We want to roll up the game lines of one game into a single game row, so the game_id expression should be placed in the GROUP BY-clause.
- The tag names should be used to generate columns, so the tag_name-member of the tag_pair STRUCT-value returned by regexp_extract() should be placed in the ON-clause.
- The tag_values should appear as cell values, so the tag_value-member of tag_pair should be placed in the USING-clause.
However, our use case is a little different from the typical OLAP crosstab:
- The tag_values that are to appear as cell values are primarily of a text type: player names, event locations, and sometimes date- or time-like values. There are some numerical values too, like Elo scores, but these are non-additive, and quite unlike the amounts and quantities we find in the typical OLAP-case.
- As the tag pairs are just attributes of the game, we expect the set of tag_name-values in one game to be unique. That means that for each game, we will find at most one tag_value in each generated column.

In other words, we're using the PIVOT-statement as a device to transpose rows to columns, rather than as an analytical crosstab.
Even though we expect only a single text value for the tag_value, the PIVOT-statement's USING-clause still requires some kind of aggregate function. To aggregate tag_value we can settle for anything that preserves the text: MIN(), MAX(), or ANY_VALUE(), as well as the text aggregate STRING_AGG(), would all do.
When we put it all together, this is what our initial PIVOT-statement looks like:

PIVOT(
  SELECT line
  ,      line LIKE '[%' AS is_header
  ,      COUNT( CASE WHEN line LIKE '1.%' THEN 1 END ) OVER (
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         )
         + CASE WHEN is_header THEN 1 ELSE 0 END AS game_id
  ,      CASE
           WHEN is_header THEN
             regexp_extract(
               line
             , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
             , [ 'tag_name', 'tag_value' ]
             )
         END AS tag_pair
  FROM   read_csv(
           'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
         , columns = {'line': 'VARCHAR'}
         )
  WHERE  line IS NOT NULL
)
ON       tag_pair['tag_name']
USING    ANY_VALUE( tag_pair['tag_value'] )
GROUP BY game_id

This is what its result looks like:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬─────────┬─────────────────────┬──────────────────────┬──────────┐
│ game_id │ Black │ BlackElo │ Date │ . │ Round │ Site │ White │ WhiteElo │
│ int64 │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │ varchar │
├─────────┼──────────────────────┼──────────┼────────────┼───┼─────────┼─────────────────────┼──────────────────────┼──────────┤
│ 1 │ Pollock, William H. │ │ 1895.??.?? │ . │ ? │ Hastings │ Tinsley, Samuel │ │
│ 2 │ Lasker, Edward │ │ 1913.??.?? │ . │ ? │ Scheveningen │ Loman, Rudolf │ │
│ 3 │ Tartakower, Saviely │ │ 1921.??.?? │ . │ 5 │ The Hague │ Alekhine, Alexander │ │
│ 4 │ Wegemund, Otto │ │ 1922.??.?? │ . │ 7 │ Bad Oeynhausen │ Antze, O. │ │
│ 5 │ Tarrasch, Siegbert │ │ 1922.??.?? │ . │ 19 │ Bad Pistyan │ Johner, Paul F │ │
│ 6 │ Alekhine, Alexander │ │ 1922.??.?? │ . │ ? │ Hastings │ Bogoljubow, Efim │ │
│ 7 │ Kmoch, Hans │ │ 1922.??.?? │ . │ ? │ Vienna │ Rubinstein, Akiba │ │
│ 8 │ Mieses, Jacques │ │ 1923.??.?? │ . │ 9 │ Hastings │ Norman, George Mar. │ │
│ 9 │ Orlando, Placido │ │ 1923.??.?? │ . │ ? │ Trieste │ Szabados, Eugenio │ │
│ 10 │ Tarrasch, Siegbert │ │ 1923.??.?? │ . │ 1 │ Trieste │ Seitz, Jakob Adolf │ │
│ 11 │ Wolf, Siegfried Re. │ │ 1923.??.?? │ . │ 5 │ Vienna │ Von Patay, J. │ │
│ 12 │ Tartakower, Saviely │ │ 1924.??.?? │ . │ ? │ New York │ Bogoljubow, Efim │ │
│ 13 │ Pokorny, Amos │ │ 1926.??.?? │ . │ 3 │ Trencianske Teplice │ Kostic, Boris │ │
│ 14 │ Tartakower, Saviely │ │ 1927.??.?? │ . │ 2 │ Kecskemet │ Vukovic, Vladimir │ │
│ 15 │ Botvinnik, Mikhail │ │ 1927.??.?? │ . │ 2 │ Moscow │ Rabinovich, Ilya L. │ │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ 7229 │ Kovacevic,Bl │ 2400 │ 2023.12.09 │ . │ 7.3 │ Zagreb CRO │ Kozul,Z │ 2532 │
│ 7230 │ Iskos,A │ 2153 │ 2023.12.10 │ . │ 6.46 │ Skopje MKD │ Zhezhovska,Monika │ 1826 │
│ 7231 │ Spichkin,A │ 2035 │ 2023.12.12 │ . │ 2 │ chess.com INT │ Rodriguez Santiago,J │ 2043 │
│ 7232 │ Rogov,Matfey │ 2213 │ 2023.12.12 │ . │ 3 │ chess.com INT │ Clarke,Matthew │ 2127 │
│ 7233 │ Osmonbekov,T │ 2137 │ 2023.12.12 │ . │ 3 │ chess.com INT │ Sroczynski,M │ 2266 │
│ 7234 │ Novikova,Galina │ 2073 │ 2023.12.12 │ . │ 8 │ chess.com INT │ Marcziter,D │ 2192 │
│ 7235 │ Tomazini,A │ 2336 │ 2023.12.14 │ . │ 5.24 │ Zagreb CRO │ Pultinevicius,Paul. │ 2584 │
│ 7236 │ Spichkin,A │ 2035 │ 2023.12.19 │ . │ 2 │ chess.com INT │ Levine,D │ 2040 │
│ 7237 │ Kanyamarala,Tarun │ 2305 │ 2023.12.19 │ . │ 4 │ chess.com INT │ Nechitaylo,Nikita │ 2203 │
│ 7238 │ Ronka,E │ 2291 │ 2023.12.19 │ . │ 4 │ chess.com INT │ Gruzman,Ilya │ 2151 │
│ 7239 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.26 │ . │ 1.37 │ Samarkand UZB │ Mammadzada,G │ 2449 │
│ 7240 │ Koneru,H │ 2554 │ 2023.12.26 │ . │ 3.24 │ Samarkand UZB │ Peycheva,Gergana │ 2271 │
│ 7241 │ Carlsson,Andreas │ 1902 │ 2023.12.28 │ . │ 4.9 │ Karlstad SWE │ Kreken,Eivind Grunt │ 2271 │
│ 7242 │ Kazarjan,Gachatur │ 2078 │ 2023.12.30 │ . │ 9.31 │ Groningen NED │ Schuricht,Emil Fre. │ 2095 │
│ 7243 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.30 │ . │ 14.47 │ Samarkand UZB │ Zhu Chen │ 2423 │
├─────────┴──────────────────────┴──────────┴────────────┴───┴─────────┴─────────────────────┴──────────────────────┴──────────┤
│ 7243 rows (30 shown) 11 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

This certainly is starting to look a lot like the result we were after.
Folding in the movetext
The only thing still missing is the movetext, but at this point, it's almost trivial to add that as well. We can simply amend the tag_pair-expression and let it return a new STRUCT-value with the literal text 'moves' as name member and the line itself as value in case the is_header expression is not TRUE:

CASE
  WHEN is_header THEN
    regexp_extract(
      line
    , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
    , [ 'column_name', 'column_value' ]
    )
  ELSE {
    'column_name': 'moves'
  , 'column_value': line
  }
END AS column_name_value

For consistency, we also changed the tag_pair alias to column_name_value and its member names from tag_name and tag_value to column_name and column_value, respectively. Therefore we must also update the references elsewhere in the PIVOT-statement accordingly.
Also, because one game could have multiple lines of movetext, we must change the aggregate function in the USING-clause from ANY_VALUE() to STRING_AGG().
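As a reminder of what STRING_AGG() does, here is a tiny, self-contained sketch (the two movetext lines are made up):

SELECT STRING_AGG( line, ' ' ) AS moves  -- concatenates the lines, separated by a single space
FROM ( VALUES ('1.d4 f5 2.c4 Nf6'), ('3.g3 e6 4.Bg2 Be7') ) AS t( line );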
After these changes we get the final statement:

PIVOT(
  SELECT line
  ,      line LIKE '[%' AS is_header
  ,      COUNT( CASE WHEN line LIKE '1.%' THEN 1 END ) OVER (
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         )
         + CASE WHEN is_header THEN 1 ELSE 0 END AS game_id
  ,      CASE
           WHEN is_header THEN
             regexp_extract(
               line
             , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]*$'
             , [ 'column_name', 'column_value' ]
             )
           ELSE {
             'column_name': 'moves'
           , 'column_value': line
           }
         END AS column_name_value
  FROM   read_csv(
           'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
         , columns = {'line': 'VARCHAR'}
         )
  WHERE  line IS NOT NULL
)
ON       column_name_value['column_name']
USING    STRING_AGG( column_name_value['column_value'], ' ' )
GROUP BY game_id

And its result:
┌─────────┬──────────────────────┬──────────┬────────────┬───┬──────────────────────┬──────────┬──────────────────────┐
│ game_id │ Black │ BlackElo │ Date │ . │ White │ WhiteElo │ moves │
│ int64 │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │
├─────────┼──────────────────────┼──────────┼────────────┼───┼──────────────────────┼──────────┼──────────────────────┤
│ 1 │ Pollock, William H. │ │ 1895.??.?? │ . │ Tinsley, Samuel │ │ 1.d4 f5 2.c4 e6 3.. │
│ 2 │ Lasker, Edward │ │ 1913.??.?? │ . │ Loman, Rudolf │ │ 1.d4 e6 2.c4 f5 3.. │
│ 3 │ Tartakower, Saviely │ │ 1921.??.?? │ . │ Alekhine, Alexander │ │ 1.d4 f5 2.c4 e6 3.. │
│ 4 │ Wegemund, Otto │ │ 1922.??.?? │ . │ Antze, O. │ │ 1.c4 f5 2.d4 Nf6 3. │
│ 5 │ Tarrasch, Siegbert │ │ 1922.??.?? │ . │ Johner, Paul F │ │ 1.d4 e6 2.c4 f5 3.. │
│ 6 │ Alekhine, Alexander │ │ 1922.??.?? │ . │ Bogoljubow, Efim │ │ 1.d4 f5 2.c4 Nf6 3. │
│ 7 │ Kmoch, Hans │ │ 1922.??.?? │ . │ Rubinstein, Akiba │ │ 1.d4 e6 2.c4 f5 3.. │
│ 8 │ Mieses, Jacques │ │ 1923.??.?? │ . │ Norman, George Mar. │ │ 1.d4 f5 2.g3 Nf6 3. │
│ 9 │ Orlando, Placido │ │ 1923.??.?? │ . │ Szabados, Eugenio │ │ 1.d4 e6 2.c4 f5 3.. │
│ 10 │ Tarrasch, Siegbert │ │ 1923.??.?? │ . │ Seitz, Jakob Adolf │ │ 1.c4 e6 2.d4 f5 3.. │
│ 11 │ Wolf, Siegfried Re. │ │ 1923.??.?? │ . │ Von Patay, J. │ │ 1.d4 e6 2.c4 f5 3.. │
│ 12 │ Tartakower, Saviely │ │ 1924.??.?? │ . │ Bogoljubow, Efim │ │ 1.d4 f5 2.g3 e6 3.. │
│ 13 │ Pokorny, Amos │ │ 1926.??.?? │ . │ Kostic, Boris │ │ 1.c4 f5 2.d4 Nf6 3. │
│ 14 │ Tartakower, Saviely │ │ 1927.??.?? │ . │ Vukovic, Vladimir │ │ 1.d4 f5 2.c4 e6 3.. │
│ 15 │ Botvinnik, Mikhail │ │ 1927.??.?? │ . │ Rabinovich, Ilya L. │ │ 1.d4 e6 2.c4 f5 3.. │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ 7229 │ Kovacevic,Bl │ 2400 │ 2023.12.09 │ . │ Kozul,Z │ 2532 │ 1.d4 e6 2.c4 f5 3.. │
│ 7230 │ Iskos,A │ 2153 │ 2023.12.10 │ . │ Zhezhovska,Monika │ 1826 │ 1.d4 e6 2.c4 f5 3.. │
│ 7231 │ Spichkin,A │ 2035 │ 2023.12.12 │ . │ Rodriguez Santiago,J │ 2043 │ 1.d4 e6 2.c4 f5 3.. │
│ 7232 │ Rogov,Matfey │ 2213 │ 2023.12.12 │ . │ Clarke,Matthew │ 2127 │ 1.d4 e6 2.c4 f5 3.. │
│ 7233 │ Osmonbekov,T │ 2137 │ 2023.12.12 │ . │ Sroczynski,M │ 2266 │ 1.d4 e6 2.c4 f5 3.. │
│ 7234 │ Novikova,Galina │ 2073 │ 2023.12.12 │ . │ Marcziter,D │ 2192 │ 1.d4 e6 2.c4 f5 3.. │
│ 7235 │ Tomazini,A │ 2336 │ 2023.12.14 │ . │ Pultinevicius,Paul. │ 2584 │ 1.d4 e6 2.c4 f5 3.. │
│ 7236 │ Spichkin,A │ 2035 │ 2023.12.19 │ . │ Levine,D │ 2040 │ 1.d4 e6 2.c4 f5 3.. │
│ 7237 │ Kanyamarala,Tarun │ 2305 │ 2023.12.19 │ . │ Nechitaylo,Nikita │ 2203 │ 1.d4 e6 2.c4 f5 3.. │
│ 7238 │ Ronka,E │ 2291 │ 2023.12.19 │ . │ Gruzman,Ilya │ 2151 │ 1.d4 f5 2.g3 e6 3.. │
│ 7239 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.26 │ . │ Mammadzada,G │ 2449 │ 1.d4 f5 2.c4 Nf6 3. │
│ 7240 │ Koneru,H │ 2554 │ 2023.12.26 │ . │ Peycheva,Gergana │ 2271 │ 1.d4 e6 2.c4 f5 3.. │
│ 7241 │ Carlsson,Andreas │ 1902 │ 2023.12.28 │ . │ Kreken,Eivind Grunt │ 2271 │ 1.d4 e6 2.c4 f5 3.. │
│ 7242 │ Kazarjan,Gachatur │ 2078 │ 2023.12.30 │ . │ Schuricht,Emil Fre. │ 2095 │ 1.d4 f5 2.g3 e6 3.. │
│ 7243 │ Kurbonboeva,Sarvinoz │ 2154 │ 2023.12.30 │ . │ Zhu Chen │ 2423 │ 1.d4 f5 2.g3 Nf6 3. │
├─────────┴──────────────────────┴──────────┴────────────┴───┴──────────────────────┴──────────┴──────────────────────┤
│ 7243 rows (30 shown) 12 columns (7 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Next steps
From this point on there are many things that could be done to improve the solution, for example:
- Column values that originate from header tags should un-escape escaped characters.
- The PGN-syntax itself does not dictate this but there are established conventions for what tag names to use and what kind of values are appropriate. For example, see the seven tag roster and optional tag pairs in the wikipedia entry for Portable Game Notation. It would make sense to further cleanse and conform the corresponding columns and give them a more suitable data type, or to actively validate their value.
- Further data normalization could be attempted by creating separate tables for player, event, opening etc.
- The moves could be further processed and analyzed to derive a table of board positions, something which would greatly increase the opportunities to analyze games.
- Better detection of the first game line. Our assumption has been that the first movetext always starts with '1.'. But what if a game does not have any moves at all? This may sound like it shouldn't be possible, but especially on an online chess site, a player's connection might break after a game was started, but before a move was made. Whatever the reason may be, and whether we're interested in such games or not, our current solution is not capable of detecting these cases. Instead, games simply aren't identified as intended, and our final game would likely be a mixture of 2 or maybe even more games. Bad bad bad! (If you're interested in looking into such a scenario, the Lichess chess database for October 2013 contains 411,039 games, but 140 do not have any moves.)
- A more robust regular expression to deal with games that may not adhere fully to the PGN syntax (for example, more forgiving handling of whitespace).
- Better handling of errors when reading CSV.
The goal of these posts was to show how DuckDB's features and SQL-dialect allow these kinds of raw-text processing tasks to be solved quickly and elegantly. I hope I have succeeded in demonstrating that - it sure was a lot of fun to try!
DuckDB bag of tricks: Processing PGN chess games with DuckDB - Extracting Tagpairs with Regular Expressions (5/6)
DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.
This post is the fifth installment of a series in which I share tips and tricks on how to use DuckDB for raw text processing. As a model, we will use Portable Game Notation (PGN), a popular format for the digitized recording of Chess games.
The installments are:
- Chess and Portable Game Notation (PGN)
- Ingesting raw text with the CSV Reader
- Distinguishing the Line Type
- Keeping game lines together: window functions
- Extracting Tagpairs with Regular Expressions
- Rolling up each game's lines into a single game row
Parsing PGN tagpairs: Regular Expressions to the rescue!
Let's now turn our attention to the header lines and extract the information embedded in the tagpairs, to expose it in a more useful form. We'll be using this regular expression to do that:
^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$

If you're not familiar with regular expressions, or if you're not convinced this is correct, that's fine! The first part of this installment describes in detail how I arrived at this expression. In the second part, we will see how we can use regular expressions in DuckDB to extract information from PGN headers.
Crafting a regular expression for PGN tagpairs
From our description of the tagpair-syntax, we know that the tag and value must be inside the square brackets, and that tag and its value are separated by whitespace. We also know that value is enclosed in double quotes. We may stylize this syntax as follows:
line-start [ tag whitespace "value" ] line-end

In the syntax model above, the square brackets and double quotes are literal characters; the other items are syntactical concepts that we can't quite describe as a single, fixed sequence of characters. But we can describe them using a pattern that specifies exactly what sequence of characters we would expect there, given the assumption that we're reading valid PGN files.
We will now turn this model into a regular expression by replacing the items from the model with patterns. Please be advised that this section is not intended to be a regular expressions tutorial - there are plenty of online resources for that. It's just here to explain how I arrived at the final regular expression: this will make it easier for you to understand its structure, and put you in a better position to understand its limitations and, if necessary, to adapt it to your specific needs.
First, let's replace line-start and line-end with the ^ and $ anchors:

^ [ tag whitespace "value" ] $

Next, the square brackets that enclose our tagpair are regular expression metacharacters, so to match them literally, we need to escape them using a backslash:
^ \[ tag whitespace "value" \] $
Extracting tags: whitespace and not-whitespace
We can now replace the whitespace-placeholder with a pattern that matches actual whitespace. We do this by specifying one whitespace character, \s, followed by +, indicating that it must occur at least once and could repeat many times:

^ \[ tag \s+ "value" \] $

We know the tag is followed by whitespace. From that fact, it follows that the tag cannot itself contain any whitespace: if that were possible, we wouldn't be able to distinguish the whitespace that demarcates the end of the tag from whitespace appearing inside the tag.
So, we can craft the pattern matching the tag as any character that is not whitespace ([^\s]), one or more times (denoted by +). The tag is information we want to extract; therefore we enclose it in parentheses so we can capture it:

^ \[ ([^\s]+) \s+ "value" \] $
Refine the syntax model to extract the value
To replace the value placeholder with a pattern, it will be helpful to refine our syntax model a little bit. Let's zoom in on just the value from our original syntax model, and introduce a value-character. The value-character stands for any character that may appear between the double-quotes demarcating the value. There may be any number of value-characters inside value:

" value-character* "

As for finding a pattern for value-character, we know for sure that such a character cannot be a double quote, because we would not be able to distinguish it from the double-quote that indicates the end of the value.
This is completely analogous to the prior example we saw with tag, which is followed by white-space and therefore cannot itself contain whitespace. So just like we wrote "not-whitespace" as [^\s] for the tag, we can write "not-double-quote" as [^"].
Modeling Character Escape Sequences
The PGN syntax specifies that a literal double-quote inside the value can be denoted by immediately preceding it using a backslash: \". The backslash preceding the double-quote is a so-called "escape" character, relieving the double-quote from its normal role of demarcating the end of the value.
Whenever a syntax defines an escape character device, it automatically introduces the problem of how to denote a literal occurrence of the escape character. That issue is usually solved simply by escaping the escape character, and that's also how the PGN syntax solves it. So in order to represent a literal backslash
\
in the PGN-value, it must be preceded immediately with the escape character, which is also the backslash.
Thus, it becomes: \\
.
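To make this concrete, here is a made-up header line (the tag name and value are purely hypothetical) that uses both escape sequences:

[Annotator "John \"Deep\" Doe, see C:\\games\\notes.txt"]

After decoding the escape sequences, the intended value reads: John "Deep" Doe, see C:\games\notes.txt.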
Now that we realize a value-character can either be escaped or not escaped, it is a good idea to replace its syntax model with something more refined:
" ( \ escaped-value-character | unescaped-value-character )* "Note: the pipe-character
|
is the standard regular expression syntax to denote alternative options ("either/or" in English).
We can immediately replace unescaped-value-character with the "not-double-quote" pattern we identified earlier:
" ( \ escaped-value-character | [^"] )* "Now let's replace the PGN escape character
\
with its regular expression pattern.
To do that we need to realize the backslash is also a metacharacter in the regular expression context, where it's also used as escape character!
So, in order to represent the PGN escape character in a regular expression, we need to escape it in the the regular expression context.
That sounds harder than it is - it simply becomes \\
.
Here, the first backslash is the regular expression escape character, and the second backslash is a literal backslash, which is the escape character in the PGN context:

" ( \\ escaped-value-character | [^"] )* "

We know that in the PGN context, both the double quote and the backslash are escaped characters. We also know that in the regular expression context, the backslash needs to be escaped. So, we can now replace escaped-value-character with ["\\]:

" ( \\ ["\\] | [^"] )* "

As a final touch to the value pattern, we enclose the value in parentheses to establish a capturing group. Just like we did earlier for tag, this will allow us to extract the value:

" ( ( \\ ["\\] | [^"] )* ) "

Now that we have found a pattern for value, we can complete the entire pattern. The double quotes around value can be matched directly, and need no special treatment to be used in the regular expression:

^ \[ ([^\s]+) \s+ "( ( \\ ["\\] | [^"] )* )" \] $

Are you still reading? Thanks for hanging in there! Let's move on to pluck the fruits of our labour.
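But before we do, a quick stand-alone check may be reassuring. The sketch below feeds the made-up header line from the escape-sequence example to DuckDB's regexp_matches() function to verify that the finished pattern accepts escaped characters inside the value. Note that DuckDB's regular single-quoted string literals do not treat the backslash as an escape character, so the backslashes below reach the regular expression engine exactly as written:

SELECT regexp_matches(
         '[Annotator "John \"Deep\" Doe, see C:\\games\\notes.txt"]'
       , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
       )   -- true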
Using regexp_extract() to extract tags and values
In DuckDB, we can extract the matches made by a regular expression using regexp_extract().
This is an overloaded function.
For all versions of regexp_extract(), the first two arguments are the same:
- The string to which the regular expression is to be applied
- The regular expression
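When you pass only these two arguments, regexp_extract() returns the entire text matched by the pattern. A minimal sketch:

SELECT regexp_extract(
         '[Event "Hastings"]'
       , '"[^"]+"'
       )   -- "Hastings": the whole match, quotes included

To get at just the tag or just the value, we need the capturing groups.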
Using the capturing groups
We want to extract the tag and the tag value from the header lines, and for this purpose we created corresponding capturing groups. Fortunately, we can pass the index of the capturing group as the third argument to regexp_extract().
In this usage, the function returns whatever was matched by the capturing group indicated by the 3rd argument.
The first capturing group corresponds to the tag, so to extract it we could write:
regexp_extract(
  '[Event "Hastings"]'
--    capture1
--     ___|__
--    |      |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 1
)

And this would give us Event as result.
Likewise, to extract the value we could write:

regexp_extract(
  '[Event "Hastings"]'
--                capture2
--     _______________|_______________
--    |                               |
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, 2
)

And this would yield Hastings as result.
Regular Expressions in SQL
Now, many database products offer support for regular expressions. Products with basic support offer at least matching against a regular expression, and typically also extraction and replacement of a single matched pattern or capturing group, which is the example I just gave above. More advanced products also offer a feature to split a string on the regular expression matches into an array or a result set.
Regular Expressions in DuckDB
DuckDB supports all of the previously mentioned regular expression usages. In addition, DuckDB allows you to extract the matches of multiple capturing groups, all in one go! To do that, we can pass a list of names to regexp_extract() as 3rd argument:

regexp_extract(
  '[Event "Hastings"]'
, '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
, [ 'tag_name', 'tag_value' ]
)

With this usage, the function returns a STRUCT-value with the names passed in the 3rd argument as keys.
The value assigned to each key is the text matched by the capturing group appearing at the same position as the index of the name in the list.
So in our example the 3rd argument is the list ['tag_name', 'tag_value'].
The name 'tag_name' appears at the first position in the list, and thus, the 'tag_name' key in the return value will be assigned the text that matches the first capturing group in the regular expression.
Likewise, 'tag_value' appears in the list at the second position, and thus the match made by the second capturing group will be assigned to that key in the return value.
For our example, the return value is:
┌───────────────────────────────────────────────┐
│                     value                     │
│ struct(tag_name varchar, "tag_value" varchar) │
├───────────────────────────────────────────────┤
│  {'tag_name': Event, 'tag_value': Hastings}   │
└───────────────────────────────────────────────┘

I think this is pretty awesome! We only have to write and apply a particular regular expression once, instead of having to repeat it for each group we want to capture. This avoids code duplication, and your SQL can be shorter, easier to read, and therefore more maintainable.
It would not be unreasonable to assume that a single regular expression invocation returning all the matches at once might also be faster than calling out multiple times to return only one match at a time. However, for this particular case, I found the performance to be on par.
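As a side note: once you have the STRUCT in hand, its fields can be addressed individually, for example with the struct_extract() function (for a struct stored in a column, such as the tag_pair column we are about to add, dot notation like tag_pair.tag_name works as well). A small sketch based on the same example:

SELECT struct_extract(
         regexp_extract(
           '[Event "Hastings"]'
         , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
         , [ 'tag_name', 'tag_value' ]
         )
       , 'tag_name'
       )   -- Event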
Adding the Regular Expression to our Query
Let's incorporate the tag extraction logic into our query. Because tag extraction is only applicable to header lines, it feels appropriate to use the is_header flag to check whether the regular expression should be applied.
As we already saw, we can benefit from DuckDB's reusable alias feature to do that:

CASE
  WHEN is_header
  THEN regexp_extract(
         line
       , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
       , [ 'tag_name', 'tag_value' ]
       )
END

With this addition our query becomes:
SELECT line
,      line LIKE '[%' AS is_header
,      COUNT(CASE WHEN line LIKE '1.%' THEN 1 END) OVER (
         ROWS BETWEEN UNBOUNDED PRECEDING
                  AND CURRENT ROW
       )
       + CASE WHEN is_header THEN 1 ELSE 0 END
       AS game_id
,      CASE
         WHEN is_header
         THEN regexp_extract(
                line
              , '^\[([^\s]+)\s+"((\\["\\]|[^"])*)"\]$'
              , [ 'tag_name', 'tag_value' ]
              )
       END AS tag_pair
FROM read_csv(
       'C:\Users\Roland_Bouman\Downloads\DutchClassical\DutchClassical.pgn'
     , columns = {'line': 'VARCHAR'}
     )
WHERE line IS NOT NULL

Our results are now:
┌──────────────────────────────────────────────────┬───────────┬─────────┬────────────────────────────────────────────────────────────────┐
│                       line                       │ is_header │ game_id │                            tag_pair                            │
│                      varchar                     │  boolean  │  int64  │         struct(tag_name varchar, "tag_value" varchar)          │
├──────────────────────────────────────────────────┼───────────┼─────────┼────────────────────────────────────────────────────────────────┤
│ [Event "Hastings"]                               │ true      │       1 │ {'tag_name': Event, 'tag_value': Hastings}                     │
│ [Site "Hastings"]                                │ true      │       1 │ {'tag_name': Site, 'tag_value': Hastings}                      │
│ [Date "1895.??.??"]                              │ true      │       1 │ {'tag_name': Date, 'tag_value': 1895.??.??}                    │
│ [Round "?"]                                      │ true      │       1 │ {'tag_name': Round, 'tag_value': ?}                            │
│ [White "Tinsley, Samuel"]                        │ true      │       1 │ {'tag_name': White, 'tag_value': Tinsley, Samuel}              │
│ [Black "Pollock, William Henry Kraus"]           │ true      │       1 │ {'tag_name': Black, 'tag_value': Pollock, William Henry Kraus} │
│ [Result "1-0"]                                   │ true      │       1 │ {'tag_name': Result, 'tag_value': 1-0}                         │
│ [WhiteElo ""]                                    │ true      │       1 │ {'tag_name': WhiteElo, 'tag_value': }                          │
│ [BlackElo ""]                                    │ true      │       1 │ {'tag_name': BlackElo, 'tag_value': }                          │
│ [ECO "A90"]                                      │ true      │       1 │ {'tag_name': ECO, 'tag_value': A90}                            │
│ 1.d4 f5 2.c4 e6 3.g3 Nf6 4.Bg2 Bb4+ 5.Nc3 O-O .  │ false     │       1 │                                                                │
│ 9.axb4 dxc3 10.bxc3 Ne5 11.Nf3 Nd3+ 12.Ke2 Nxc.  │ false     │       1 │                                                                │
│ 16.h4 Qg6 17.Bf3 Bb7 18.Nb5 d5 19.cxd5 exd5 20.  │ false     │       1 │                                                                │
│ 23.Rc1 Ra8 24.Nc6 Bxc6 25.bxc6 Qxc6 26.Rc2 h6 .  │ false     │       1 │                                                                │
│ 30.Be2 Nd6 31.Bf3 Ra4 32.Qb1 Rc4 33.Rd2 Ne4 34.  │ false     │       1 │                                                                │
│            ·                                     │     ·     │       · │                            ·                                   │
│            ·                                     │     ·     │       · │                            ·                                   │
│            ·                                     │     ·     │       · │                            ·                                   │
│ [Black "Kurbonboeva,Sarvinoz"]                   │ true      │    7243 │ {'tag_name': Black, 'tag_value': Kurbonboeva,Sarvinoz}         │
│ [Result "1-0"]                                   │ true      │    7243 │ {'tag_name': Result, 'tag_value': 1-0}                         │
│ [WhiteElo "2423"]                                │ true      │    7243 │ {'tag_name': WhiteElo, 'tag_value': 2423}                      │
│ [BlackElo "2154"]                                │ true      │    7243 │ {'tag_name': BlackElo, 'tag_value': 2154}                      │
│ [ECO "A90"]                                      │ true      │    7243 │ {'tag_name': ECO, 'tag_value': A90}                            │
│ 1.d4 f5 2.g3 Nf6 3.Bg2 e6 4.c4 d5 5.b3 c6 6.Nh.  │ false     │    7243 │                                                                │
│ 10.Bxe5 Qxe5 11.Nd2 O-O 12.cxd5 cxd5 13.Nf3 Qe.  │ false     │    7243 │                                                                │
│ 17.Qxd4 Ne4 18.Nxd5 Qf7 19.Rc7 Rxd5 20.Rxf7 Rx.  │ false     │    7243 │                                                                │
│ 24.Rc7 Na6 25.Rc6 Bc8 26.Rxa6 Bxa6 27.Bxa8 Bxe.  │ false     │    7243 │                                                                │
│ 31.Bd1 Kf7 32.Bc2 Rc3 33.Rd1 Be6 34.Rd2 Bxb3 3.  │ false     │    7243 │                                                                │
│ 38.Kg2 Ra1 39.h4 h6 40.h5 Ra3 41.a5 Ra2 42.Kf3.  │ false     │    7243 │                                                                │
│ 45.Kd4 Ra2 46.a6 Rxf2 47.Rc7 Ra2 48.a7 g5 49.h.  │ false     │    7243 │                                                                │
│ 52.gxh4 Kh5 53.Rh7+ Kg4 54.Kb7 Rb2+ 55.Kc8 Rc2.  │ false     │    7243 │                                                                │
│ 59.Kxa8 Kxh4 60.Rf7 Kg4 61.Kb7 f4 62.Kc6 f3 63.  │ false     │    7243 │                                                                │
│ 66.Ke4 1-0                                       │ false     │    7243 │                                                                │
├──────────────────────────────────────────────────┴───────────┴─────────┴────────────────────────────────────────────────────────────────┤
│ 117000 rows (30 shown)                                                                                                        4 columns  │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Next Installment: Rolling up PGN lines to one game row
Next time, we will see how to roll up the multiple lines from a game into a single game row.