Thursday, December 19, 2024

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB.

This post is about a particular challenge posted on the DuckDB Discord server which happened to interest me. So, here goes!!

Yesterday on the DuckDB discord server user @col_montium asked:
I've got a bit of a head scratcher. My json from https://api.census.gov/data/2000/dec/sf1/variables.json is a single struct with dynamic keys. I want to extract a table with columns field_code, label, concept, and group. I completed this with a small sample of the data but with the entire dataset the query uses up 24 gigabytes of RAM and then crashes. Here's my query:
WITH source AS (
  SELECT *
  FROM 'https://api.census.gov/data/2000/dec/sf1/variables.json'
),
all_keys AS (
  SELECT unnest( json_keys( variables ) ) AS "key",
         variables
  FROM   source
),
extracted_fields AS (
  SELECT "key" AS field_code
  ,      variables->key->>'$.label' AS label
  ,      variables->key->>'$.concept' AS concept
  ,      variables->key->>'$.predicateType' AS predicate_type
  ,      variables->key->>'$.group' AS "group"
  FROM all_keys
  WHERE "key" NOT IN ('for', 'in', 'ucgid')
)
SELECT *
FROM extracted_fields
ORDER BY field_code
I tried to run this myself on my laptop using the DuckDB 1.1.3 (GA) command line interface, and it failed with a similar error as reported by @col_montium:
Run Time (s): real 252.335 user 197.031250 sys 9.500000
Out of Memory Error: failed to allocate data of size 16.0 MiB (24.9 GiB/25.0 GiB used)
I was able to run it successfully using a 1.1.4 nightly build, but it took about 3 minutes to complete:
Run Time (s): real 188.681 user 148.500000 sys 8.078125

Dataset: Size and Structure


Surprisingly, the source data set is less than 2MB (2,039,748 bytes)! The query also doesn't appear to be terribly complicated, and if it finishes successfully, it yields only 8141 rows. So, clearly something interesting must be the matter with the structure of the dataset.

As it is so modest in size, we can easily inspect it with a text editor:
{
  "variables": {
    "for": {
      "label": "Census API FIPS 'for' clause",
      "concept": "Census API Geography Specification",
      "predicateType": "fips-for",
      "group": "N/A",
      "limit": 0,
      "predicateOnly": true
    },
    "in" {
      "label": "Census API FIPS 'in' clause",
      "concept": "Census API Geography Specification",
      "predicateType": "fips-in",
      "group": "N/A",
      "limit": 0,
      "predicateOnly": true
    },
    "ucgid" {
      "label": "Uniform Census Geography Identifier clause",
      "concept": "Census API Geography Specification",
      "predicateType": "ucgid",
      "group": "N/A",
      "limit": 0,
      "predicateOnly": true,
      "hasGeoCollectionSupport": true
    },
    "P029009" {
      "label": "Total!!In households!!Related child!!Own child!!6 to 11 years",
      "concept": "RELATIONSHIP BY AGE FOR THE POPULATION UNDER 18 YEARS [46]",
      "predicateType": "int",
      "group": "P029",
      "limit": 0
    },
    
    ...many more variables...
    
    "PCT012H185" {
      "label": "Total!!Female!!78 years",
      "concept": "SEX BY AGE (HISPANIC OR LATINO) [209]",
      "predicateType": "int",
      "group": "PCT012H",
      "limit": 0
    }
  }
}    
The structure of the dataset is quite simple: a single object with a single variables property, which has an object value with many object-typed properties. The object-type values of these properties contain a handful of recurring scalar properties like label, concept, predicateType, group and limit.

Now that we analyzed the structure, it should be clear what the query attempts to achieve. It wants to make a row for each property of the outermost variables-object. The property name becomes the field_code column, and forms its natural primary key. The recurring properties of the innermost objects become its non-key columns label, concept, predicate_type and group.

Using read_text() and a DuckDB table column with the JSON-datatype


In an attempt to get the query working without running out of memory, I decided to store the dataset in a DuckDB table like so:
CREATE TABLE t_json
AS
SELECT  filename
,       content::JSON AS data
FROM    read_text('https://api.census.gov/data/2000/dec/sf1/variables.json');
With this in place, a modified version of the initial query might be:
WITH variables AS (
  SELECT  data->variables                         AS variables
  ,       unnest( json_keys( variables ) )        AS field_code
  FROM    t_json
)
SELECT    field_code
,         variables->field_code->>'label'         AS label
,         variables->field_code->>'concept'       AS concept
,         variables->field_code->>'predicateType' AS predicate_type
,         variables->field_code->>'group'         AS group
FROM      variables
WHERE     field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY  field_code
The results look like this:
┌────────────┬──────────────────────────────────┬─────────┬────────────────┬─────────┐
│ field_code │              label               │ concept │ predicate_type │  group  │
│  varchar   │             varchar              │ varchar │    varchar     │ varchar │
├────────────┼──────────────────────────────────┼─────────┼────────────────┼─────────┤
│ AIANHH     │ American Indian Area/Alaska Na…  │ NULL    │ NULL           │ N/A     │
│ AIHHTLI    │ American Indian Area (Off-Rese…  │ NULL    │ NULL           │ N/A     │
│ AITSCE     │ American Indian Tribal Subdivi…  │ NULL    │ NULL           │ N/A     │
│ ANRC       │ Alaska Native Regional Corpora…  │ NULL    │ NULL           │ N/A     │
│ BLKGRP     │ Census Block Group               │ NULL    │ NULL           │ N/A     │
│ BLOCK      │ Census Block                     │ NULL    │ NULL           │ N/A     │
│ CD106      │ Congressional District (106th)   │ NULL    │ string         │ N/A     │
│ CONCIT     │ Consolidated City                │ NULL    │ NULL           │ N/A     │
│   ·        │         ·                        │  ·      │  ·             │  ·      │
│   ·        │         ·                        │  ·      │  ·             │  ·      │
│   ·        │         ·                        │  ·      │  ·             │  ·      │
│ SUBMCD     │ Sub-Minor Civil Division (FIPS)  │ NULL    │ NULL           │ N/A     │
│ SUMLEVEL   │ Summary Level code               │ NULL    │ string         │ N/A     │
│ TRACT      │ Census Tract                     │ NULL    │ NULL           │ N/A     │
│ UA         │ Urban Area                       │ NULL    │ NULL           │ N/A     │
│ US         │ United States                    │ NULL    │ NULL           │ N/A     │
│ ZCTA3      │ ZIP Code Tabulation Area (Thre…  │ NULL    │ NULL           │ N/A     │
│ ZCTA5      │ Zip Code Tabulation Area (Five…  │ NULL    │ NULL           │ N/A     │
├────────────┴──────────────────────────────────┴─────────┴────────────────┴─────────┤
│ 8141 rows (15 shown)                                                     5 columns │
└────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 11.682 user 8.015625 sys 2.406250
Less than 12 seconds. Not super-fast, but still: a pretty substantial improvement as compared to the initial query. What could explain this difference? Two things stand out as compared to the initial query:
  • Using the read_text()-function to ingest the data. In the original query, the data was ingested by the first common table expression called source, which did a SELECT directly FROM the dataset's URL 'https://api.census.gov/data/2000/dec/sf1/variables.json'. Clearly, with this syntax, DuckDB will do some magic to perform the appropriate action, which in this case will be to invoke read_json_auto('https://api.census.gov/data/2000/dec/sf1/variables.json').
  • Explicitly casting the content of the data file to DuckDB's JSON-datatype. I expected the issue with the original query had something todo with the transformation of JSON data to a tabular result. Having the data in a JSON-datatype seemed like a good start to play with different methods to extract data. The read_text()-function returns the contents of the file as VARCHAR. It cannot and does not attempt to parse or process the data. That's why we need to cast its output explictly to the JSON-type ourselves.
You might object to a comparison as this query does not include the time to fetch the dataset from the internet, nor the time to store it in the DuckDB table. For me that takes about 6 - 10 seconds. However, we can easily rewrite the query to include read_text() directly:
WITH variables AS (
  SELECT  json( content )->variables              AS variables
  ,       unnest( json_keys( variables ) )        AS field_code
  FROM    read_text( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
)
SELECT    field_code
,         variables->field_code->>'label'         AS label
,         variables->field_code->>'concept'       AS concept
,         variables->field_code->>'predicateType' AS predicate_type
,         variables->field_code->>'group'         AS group
FROM      varibles
WHERE     field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY  field_code
This query takes about 20 seconds, in other words: about the same time as it takes to run the previous query plus the time to fetch the dataset and load it in the table. For a serious comparison of query performance we should probably eliminate the download and use only local files, but we'll get to that later. For now it's enough to notice that this setup is a significant improvement, both in terms of query execution time as well as in stability/memory consumption.

You might also have noticed this alternative query uses slightly different syntax to the extract label, concept, predicateType and group. Also, the final data presentation query is combined with the field extraction logic. However, these changes are really a matter of style. They do not have any bearing on query performance - not for this data set anyway.

Understanding the difference: the JSON-reader


With these results in mind, we have to conclude that the actual operations on the JSON data cannot really explain the difference. So, it must have something todo with the JSON-reader that was implicitly invoked by the initial query. Let's zoom in a bit on the difference between data extracted from a JSON-typed value and data extracted by the JSON-reader.

We know that in the second example, the data is of the JSON-type, because we explicitly cast it to that type. However, that only tells us that the data is essentially text, conforming to the JSON-syntax. It is of course very useful to know the encoding scheme used to represent values and structures, that still does not reveal anything about the content and structure of the data itself.

What data type does the JSON reader think the data has? Let's find out using this DESCRIBE-statement:
DESCRIBE
SELECT *
FROM 'https://api.census.gov/data/2000/dec/sf1/variables.json'
Its result is:
┌─────────────┬─────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │         column_type         │  null   │   key   │ default │  extra  │
│   varchar   │           varchar           │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables   │ STRUCT("for" STRUCT("labe…  │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────────────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 9.541 user 1.968750 sys 0.031250
As you can see, the variables property of the main object that makes up our dataset was detected and projected as a column. DuckDB has also inferred a pretty detailed data type for the column, expressed using DuckDB native SQL data types.

In the example above it's truncated, but if we run it again using .mode line, we can see the entire type descriptor. It goes like this:
STRUCT(
  "for" STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  )
, "in" STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  )
, ucgid STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  , predicateOnly BOOLEAN
  , hasGeoCollectionSupport BOOLEAN
  )
, P029009 STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  )
  
  ...many more ...
  
, PCT012H185 STRUCT(
    "label" VARCHAR
  , concept VARCHAR
  , predicateType VARCHAR
  , "group" VARCHAR
  , "limit" BIGINT
  )
)
Looks familiar? It should, as it corresponds closely to the structure we witnessed when analyzing the raw JSON data using a text editor.

So, this really is quite different as compared to the JSON-type column, as the JSON-reader must have done some work to explore the structure of the data. Not only did it detect the data type, it also uses the type to represent the data, which in turn determines what operators and functions will be available by the remainder of the query to work with the data.

Whether this is related to the difference in performance and to the out of memory errors, is yet to be determined, but it does not seem unlikely. For starters, the data type closely resembles the data. The textual description of the data type is about half the size of the data itself, which does not seem like a good thing!

We will investigate the JSON-readers' capabilities for data type detection and the possibilities to control that in the next couple of sections. But before we examine that in more detail, let's take a step back and consider what we just learned.

One lesson learned


So far, we've learned at least one important lesson: the JSON reader does in fact do exactly what its name implies - read JSON-encoded data. But while doing so, it tends to output data using DuckDB native data types, and typically not the JSON data type.

In hindsight this certainly sounds perfectly sensible. Yet, both me and @col_montium apparently did not realize that fully, as the original query takes the data coming out of the JSON reader, and processes it using functions like json_keys(), and the extraction operators -> and ->>. These are designed for working with values in the JSON data type, and not for the STRUCT values that the JSON reader hands back to us.

You may wonder: how it is possible that these functions and operators work at all on the STRUCT value returned by the JSON reader? The answer is quite simple: when we apply these functions and operators, they first implicitly cast their arguments and operands to the JSON-type. So, just as the JSON reader spent all its effort to read raw JSON-encoded data and convert it into neat native DuckDB STRUCTs, we immediately undid that effort only to convert them back to a JSON-type. And initially we - or at least I - didn't even realize it.

Actually - it is a little bit different still, as the JSON-reader deals with reading JSON-encoded text whereas the JSON-data type is still a database type. But the point here is that the JSON-reader did work to provide a precise description of the data as STRUCTs, and we didn't bother treating the data as such. Instead, we implicitly cast it to the JSON-type, which is much looser. So it feels there might be an unused opportunity to benefit from the detected data type. We will examine this possibility in the next section.

Avoiding the implicit JSON-type cast?


Now that we have a better understanding of what we've done, could we rewrite the original query so we can avoid the implicit JSON-type cast, and benefit from the DuckDB native types returned by the JSON reader? Somewhat surprisingly, this is not quite as simple as one might think it would (or should) be.

First of all, we cannot simply use whatever are the STRUCT equivalents for the JSON-type functions and operators.

For example, it is currently not possible to extract the keys from a STRUCT type. So there's no straightforward equivalent to how the original query uses json_keys() and then unnest() to spread out the keys from the original variables object into separate rows. (In the DuckDB github repository, @pkoppstein opened a discussion about creating a feature to extract the keys from STRUCTs in case you want to chime in.)

Even if we somehow managed to extract the keys from the STRUCT, there is currently no way to use it to extract the corresponding value. That is, STRUCTs of course support extraction methods, but all of them require the key to be a constant value. This prohibits using a column value, such as what we would get if we'd use an unnested set of keys:
WITH field_codes AS (
  SELECT  unnest( json_keys( variables ) ) AS field_code
  ,       variables
  FROM    'https://api.census.gov/data/2000/dec/sf1/variables.json'
)
SELECT  field_code
,       variables[field_code] AS "variable"
FROM    field_codes
WHERE   field_codes NOT IN ('for', 'in', 'ucgid')
In the example above, variables is the STRUCT-typed value we receive from the JSON-reader. The expression variables[field_code] attempts to extract a value from it using the column value field_code. The statement fails because field_code is not a constant:
Run Time (s): real 8.345 user 1.531250 sys 0.031250
Binder Error: Key name for struct_extract needs to be a constant string
As the error message suggests, the square bracket extraction syntax is just syntactic sugar for the struct_extract() function. Using a function call instead yields the exact same error.

A Solution based on STRUCTs


Eventually I came up with an approach that at least allows me to envision what the solution based on STRUCTs, and without relying on the JSON-type might look like. Here's that attempt:
WITH variables AS (
  UNPIVOT (
    WITH variables AS (
      SELECT  COLUMNS( variables.* EXCLUDE( "for", "in", "ucgid" ) )
      FROM    read_json( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
    )
    SELECT  struct_pack(
              label := struct_extract( 
                COLUMNS( * )
              , 'label' 
              )
            , concept := struct_extract( 
                COLUMNS( * )
              , 'concept' 
              )
            , predicateType := struct_extract( 
                COLUMNS( * )
              , 'predicateType' 
              )
            , "group" := struct_extract( 
                COLUMNS( * )
              , 'group' 
              )
            )
    FROM    variables
  )
  ON( * )
  INTO NAME field_code VALUE "variable"
)
SELECT field_code, "variable".*
FROM variables
ORDER BY field_code
The main idea here is to use variables.* to turn the keys of the STRUCT given to us by the JSON-reader into separate columns. The UNPIVOT-statement can then turn those columns into rows. In the original query this was achieved by combining json_keys() and unnest().

Since we don't need the metadata-like objects ucgid, for and in, we also don't need to unpack them from the variables object. So, we wrapped variables.* into a COLUMNS-"star expression" so we could add an EXCLUDE-clause to remove them. In the original query, those where removed in the WHERE-clause.

The UNPIVOT-statement not only creates rows from columns, but also lets us transform the column names into values. We use the NAME-clause to collect them into a new field_code column. The VALUE-clause can be used to name the column that receives the corresponding column values. In the example above, that value column is assigned the name variable.

Now, the values extracted from the variables-object are themselves STRUCTs, and they do not all have the same set of keys. From the DuckDB persective, they therefore have distinct types, and thus cannot all simply be lumped together directly into a single column.

However, we're only interested in a particular set of properties. They are 'label', 'concept', 'predicateType', and 'group'. If the type of their value is the same across all objects, then we can extract them and create new STRUCTs having only those keys. This is achieved using struct_extract() and struct_pack().

Again, the COLUMNS(*)-"star expression" proves to be a very useful tool! We use it here as argument for struct_extract(). This way, we need to write the STRUCT extraction-and-assembly business only once. The star expression then applies it to all columns without having to name them explicitly. The STRUCT-value assembled by struct_pack() gets the name of the original column.

The final touch is to unwrap the STRUCTs we assembled into separate columns. This is done in the terminal SELECT, again using variable.* syntax.

Nice! Pity it works only in theory


When we run it, we get:
Run Time (s): real 14.183 user 12.484375 sys 0.062500
Binder Error: Could not find key "concept" in struct
Apparently, not only do the objects in our data set sometimes have more keys than we're interested in, some also don't have all the keys we're attempting to extract. There does not seem to be any way to detect whether some arbitrary STRUCT-value has a specific key, so we can't simply work around it.

If we comment out the extraction for the the concept key, we get the same error but now for the predicateType key. If you comment out that extraction too, the query executes succesfully. On my laptop, it takes about as long as the original query, that is to say, much worse than the alternative using the JSON-type.

So even if we would somehow be able to overcome the issues with extracting the properties and creating the STRUCT, we still can't seem to really benefit from the more strict typing. I guess the main take-away here is, we can STRUCTggle all we want, but our data simply appears not to be given to us in a way that allows it to work for us.

JSON reader Parameters


We mentioned earlier that SELECT-ing directly from the url, as the original query did, causes DuckDB to invoke read_json() or read_json_auto(), which are synonyms of each other. If you like, you can convince yourself by running an equivalent DESCRIBE-statement that explicitly invokes the reader:
DESCRIBE
SELECT *
FROM read_json( 'https://api.census.gov/data/2000/dec/sf1/variables.json' )
If you execute it, you'll notice that the inferred column name and data type are identical to that returned by the prior DESCRIBE-statement. But invoking the reader explicitly has a benefit in that it offers us the possibility to pass parameters to control its behavior. Some relevant to the topic at hand are:
BOOLEAN auto_detect
Whether to auto detect the schema at all. Contrary to what the current documentation states, the default value is TRUE rather than FALSE.
STRUCT(name VARCHAR, type VARCHAR) columns
If you choose auto_detect to be FALSE, you are required to explicitly specify the columns and column types.
INTEGER maximum_depth
The number of nesting levels that are considered when detecting the datatype. The default is -1, which means there is no restriction on the depth of type detection. When set to a positive integer, and there are object-type values at the maximum level of nesting, then those will get 'detected' as being of the JSON-type.

Controlling the depth of the JSON-reader's type detection


Now that we learned about the maximum_depth parameter, let's apply it and experience its effect:
DESCRIBE
SELECT  *
FROM    read_json( 
          'https://api.census.gov/data/2000/dec/sf1/variables.json' 
        , maximum_depth = 0
        )
At maximum_depth = 0, there is no detection at all, and the entire data set is just a JSON-typed value:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ json        │ JSON        │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 7.583 user 0.375000 sys 0.015625
This is on par from what we achieved with the read_text() and explicitly casting its content column to the JSON-type. Let's allow for one level:
DESCRIBE
SELECT  *
FROM    read_json( 
          'https://api.census.gov/data/2000/dec/sf1/variables.json' 
        , maximum_depth = 1
        )
At maximum_depth = 1 we get at least extraction into separate columns. However, the column data type will still be the generic JSON-type:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables   │ JSON        │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 8.521 user 0.343750 sys 0.000000
We could allow another level:
DESCRIBE
SELECT  *
FROM    read_json( 
          'https://api.census.gov/data/2000/dec/sf1/variables.json' 
        , maximum_depth = 2
        )
At maximum_depth = 2, the type detected for the variables column is a bag of properties of the JSON-type. In the DuckDB type system, this is represented as MAP( VARCHAR, JSON ):
┌─────────────┬────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │    column_type     │  null   │   key   │ default │  extra  │
│   varchar   │      varchar       │ varchar │ varchar │ varchar │ varchar │
├─────────────┼────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ variables   │ MAP(VARCHAR, JSON) │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴────────────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 8.229 user 0.343750 sys 0.000000
This is interesting! So far we've seen STRUCTs, but not MAPs before.

JSON-objects as DuckDB MAPs

Let's attempt to rewrite the original query using the MAP( VARCHAR, JSON )-typed value we get from the JSON-reader when we pass maximum_depth = 2:
WITH variables AS (
  SELECT  unnest( map_keys( variables ) )   AS field_code
  ,       unnest( map_values( variables ) ) AS "variable"
  FROM    read_json( 
            'https://api.census.gov/data/2000/dec/sf1/variables.json' 
          , maximum_depth = 2
          )
)
SELECT    field_code
,         "variable"->>'label'              AS label
,         "variable"->>'concept'            AS concept
,         "variable"->>'predicateType'      AS predicate_type
,         "variable"->>'group'              AS group
FROM      variables
WHERE     field_code NOT IN ('for', 'in', 'ucgid')
ORDER BY  field_code
The main difference as compared to using only the JSON-type, is that the MAP-type lets us extract both the keys as well as the corresponding values using map_keys() and map_values() respectively. This means the final SELECT does not have to extract the variable object explicitly using the key: instead, we can immediately extract properties of interest from the variable objects.

In addition, the main difference with the original query is that the JSON-reader now hands the variable objects to us as JSON-typed values. The original query relied on implicit typecasting, which does not happen here.

If we download the JSON file and store it on the local disc, and modify our queries to use that, the results are quite remarkable: the one using read_text() and the explicit cast to the JSON-type takes about 12 seconds. However, that result is smashed completely by the last query, which now takes just half a second! A really nice improvement!

Conclusion


  • DuckDB's feature to SELECT directly from a file or URL is convenient, but it pays off to examine the underlying reader invocation and its parameters.
  • When using the JSON-reader, it's a good idea to examine whether the detected datatype is suitable for the extraction you're attempting to perform. By default, the JSON-reader will exhaustively detect the type. If you find the detected type is large as compared to your dataset, and/or more precise or detailed than required for your purpose, try using the maximum_depth parameter to curtail the type detection.
  • If you're using the JSON-reader and find that your extraction logic relies on JSON-type functions and operators, then beware of implicit casting of your extracted data to the JSON-type. Implicit casting to the JSON-type may point to an opportunity to limit the level of data type detection.
  • When reading JSON-data, you can try extracting from a JSON-typed value - either by loading your data to a table with a JSON-typed column, or using a combination of read_text() and an explicit cast to the JSON-type. Even if your goal is use the JSON-reader, the JSON-typed value is a useful baseline: if you find that your JSON-reader based query is slower than the one on the JSON-typed value, it means you have an opportunity to improve.
  • If you're using json_keys() with the purpose of extracting objects or values from the JSON-data, then consider using the JSON-reader and configuring it so that it returns a MAP-type. The MAP functions map_keys() and map_values() are really fast, and may help you avoid an an extra step to use the key to extract an object.
  • It would be nice if STRUCTs would be a little bit more flexible. For example, I'd like to be able to extract field values non-constant expressions. It would also be useful to have a function to obtain their keys, and it would help to check if it has a particular key. Another feature that would also have helped is if it would be possible to define STRUCT-fields to be optional, or to have a default value.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...