Tuesday, March 26, 2024

SAP HANA Trick: DISTINCT STRING_AGG

Nowadays, many SQL implementations offer some form of aggregate string concatenation function. Being an aggregate function, it has the effect concatenating values of a group of rows into a single string.

In most implementations, including MS SQL, SAP HANA, and PostgreSQL, this function goes by the name STRING_AGG(). In Snowflake it's called LISTAGG() and In MySQL, which is the first product I ever used that had such a function, it's called GROUP_CONCAT().

Basic Example

To illustrate how it behaves, lets assume we have a table of recipe ingredients like so:

CREATE TABLE RECIPE_INGREDIENT (
  RECIPE_NAME     VARCHAR(32) NOT NULL
, INGREDIENT_NAME VARCHAR(32) NOT NULL
, PRIMARY KEY(
    RECIPE_NAME
  , INGREDIENT_NAME
  )
);
With this data:
INSERT INTO 
RECIPE_INGREDIENT(
  RECIPE_NAME
, INGREDIENT_NAME
)
VALUES
  -- bread
  ('bread', 'flour'), ('bread', 'water'), ('bread', 'yeast'), ('bread', 'salt')
  -- cake 
, ('cake', 'flour'), ('cake', 'eggs'), ('cake', 'sugar'), ('cake', 'butter')
, ('cake', 'baking powder'), ('cake', 'salt'), ('cake', 'vanilla extract')
;
A typical query could then be:
SELECT   RECIPE_NAME
,        STRING_AGG(INGREDIENT_NAME||'; ') AS INGREDIENTS
FROM     RECIPE_INGREDIENT
GROUP BY RECIPE_NAME;
And the result would look something like this:
+-------------+-------------------------------------------------------------------+
| RECIPE_NAME | INGREDIENTS                                                       |
+-------------+-------------------------------------------------------------------+
| bread       | flour; water; yeast; salt;                                        |
| cake        | flour; eggs; sugar; butter; baking powder; salt; vanilla extract; |
+-------------+-------------------------------------------------------------------+

Arguments and Options: separator and ordering

All products that implement it offer the option to specify a separator string that is to be placed between pairs of values. All implementations also have a way to control the order in which the values are concatenated.

Options: DISTINCT

Only some products support the DISTINCT keyword in their string concatenation aggregate function. If supported the DISTINCT keyword appears right before the expression that is to be concatenated. It has the effect of adding each unique value only once to the concatenated string. That is, each distinct value is folded into the concatenated result only once.

MySQL and Snowflake support the DISTINCT keyword in their string concatenation aggregate function. SAP HANA, MS SQL and PostgreSQL, do not.

For our current example data, DISTINCT wouldn't make any difference, as INGREDIENT_NAME is (necessarily) unique for each RECIPE_NAME due to the PRIMARY KEY. But we can make a small change to the RECIPE_INGREDIENT table and add a INGREDIENT_TYPE column to help illustrate how DISTINCT can be useful:

CREATE TABLE RECIPE_INGREDIENT (
  RECIPE_NAME     VARCHAR(32) NOT NULL
, INGREDIENT_NAME VARCHAR(32) NOT NULL
, INGREDIENT_TYPE VARCHAR(32) NOT NULL
, PRIMARY KEY(
    RECIPE_NAME
  , INGREDIENT_NAME
  )
);
With this data:
INSERT INTO 
RECIPE_INGREDIENT(
  RECIPE_NAME
, INGREDIENT_NAME
, INGREDIENT_TYPE
)
VALUES
  -- bread
  ('bread', 'flour', 'carbs'), ('bread', 'water', NULL)
, ('bread', 'yeast', 'leavener'), ('bread', 'salt', 'flavor')
  -- cake 
, ('cake', 'flour', 'carbs'), ('cake', 'eggs', 'protein')
, ('cake', 'sugar', 'carbs'), ('cake', 'butter', 'fat')
, ('cake', 'baking powder', 'leavener'), ('cake', 'salt', 'flavor')
, ('cake', 'vanilla extract', 'flavor')
;
Suppose we write another query, grouping recipes by their name, and using STRING_AGG() on both INGREDIENT_NAME and INGREDIENT_TYPE. This time, we'll also use the separator option by passing it as second argument:
SELECT   RECIPE_NAME
,        STRING_AGG(INGREDIENT_NAME, ', ') AS INGREDIENTS
,        STRING_AGG(INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES
FROM     RECIPE_INGREDIENT
GROUP BY RECIPE_NAME;
The result would look something like this:
+-------------+----------------+-----------------------------------------------------+
| RECIPE_NAME | INGREDIENTS    | INGREDIENT_TYPES                                    |
+-------------+----------------+-----------------------------------------------------+
| bread       | flower, wat... | carbs, leavener, flavor                             |
| cake        | flower, egg... | carbs, protein, fat, leavener carbs, flavor, flavor |
+-------------+----------------+-----------------------------------------------------+
The list of INGREDIENT_NAMEs is just like what we saw before but edited for brevity.

Notice how the list of INGREDIENT_TYPE has 3 values for bread even though it has 4 ingredients. That's because the ingredient water has a NULL value for INGREDIENT_TYPE, and STRING_AGG() ignores NULL values (just like most other aggregate functions do).

Notice how the list of INGREDIENT_TYPE for cake has just as many entries as there are ingredients. Because the cake ingredients flour and sugar are both of type carbs, and ingredients vanilla extract and salt are both of type flavor, the INGREDIENT_TYPE list contains duplicates.

In those SQL implementations that support DISTINCT, we can get rid of the duplicates in the INGREDIENT_TYPE list simply by writing:

SELECT   RECIPE_NAME
,        STRING_AGG(INGREDIENT_NAME, ', ') AS INGREDIENTS
,        STRING_AGG( DISTINCT INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES
FROM     RECIPE_INGREDIENT
GROUP BY RECIPE_NAME;

Alternatives to DISTINCT

What can we do if the SQL implementation you're working with doesn't support it? A little googling finds many suggestions to solve it by applying a two-step process:

  1. The first step serves to obtain unique column values. This can be done either by applying DISTINCT in the SELECT-list, or by adding a GROUP BY-clause.
  2. Then, in a second step, STRING_AGG() is applied on the pre-deduplicated column values to obtain a list containing only unique entries.

The process is simple enough if you only need a single column:

WITH deduplicated_ingredient_types AS (
  SELECT   DISTINCT RECIPE_NAME, INGREDIENT_TYPE
  FROM     RECIPE_INGREDIENT
)
SELECT   RECIPE_NAME
,        STRING_AGG( INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES
FROM     deduplicated_ingredient_types
GROUP BY RECIPE_NAME;

This approach soon becomes unwieldy when we also want to have lists on other columns. For example, to get a result with a list of both INGREDIENT_NAME and INGREDIENT_TYPE, we'd have to write something like:

WITH deduplicated_ingredient_types AS (
  SELECT   DISTINCT RECIPE_NAME, INGREDIENT_TYPE
  FROM     RECIPE_INGREDIENT
), ingredient_types AS (
  SELECT   RECIPE_NAME
  ,        STRING_AGG( INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES
  FROM     deduplicated_ingredient_types
  GROUP BY RECIPE_NAME
)
SELECT   RECIPE_INGREDIENT.RECIPE_NAME
,        STRING_AGG( RECIPE_INGREDIENT.INGREDIENT_NAME, ', ') AS INGREDIENTS
,        ingredient_types.INGREDIENT_TYPES
FROM     RECIPE_INGREDIENT
INNER JOIN ingredient_types
ON RECIPE_INGREDIENT.RECIPE_NAME = ingredient_types.RECIPE_NAME
GROUP BY RECIPE_INGREDIENT.RECIPE_NAME;

The pattern of having a CTE to obtain unique values and JOIN would have to be repeated for each column from wich we want to derive a list.

A better way: the ROW_NUMBER() window function

We recently ran into a case for an actual SAP HANA customer where we need to produce such lists, and for many different columns. We found out that although we still need a two-step process, we can solve the problem more elegantly using the ROW_NUMBER() window function. Consider the following query:

SELECT  RECIPE_NAME
,       INGREDIENT_NAME
,       INGREDIENT_TYPE
,       ROW_NUMBER() OVER (
            PARTITION BY RECIPE_NAME, INGREDIENT_TYPE
        ) AS INGREDIENT_TYPE_OCCURRENCE 
FROM    RECIPE_INGREDIENT;
And the results are:
+-------------+-----------------+-----------------+----------------------------+
| RECIPE_NAME | INGREDIENT_NAME | INGREDIENT_TYPE | INGREDIENT_TYPE_OCCURRENCE |
+-------------+-----------------+-----------------+----------------------------+
| bread       | flour           | carbs           |                          1 |
| bread       | water           | NULL            |                          1 |
| bread       | yeast           | leavener        |                          1 |
| bread       | salt            | flavor          |                          1 |
| cake        | flour           | carbs           |                          1 |
| cake        | eggs            | protein         |                          1 |
| cake        | sugar           | carbs           |                          2 |
| cake        | butter          | fat             |                          1 |
| cake        | baking powder   | leavener        |                          1 |
| cake        | salt            | flavor          |                          1 |
| cake        | vanilla extract | flavor          |                          2 |
+-------------+-----------------------------------+----------------------------+

You might notice that every INGREDIENT_TYPE_OCCURRENCE is 1, except for the duplicate values for carbs and flavor in the cake recipe. And, this is of course the key to solving the problem!

Using CASE to let STRING_AGG() ignore duplicates

Now that we know (by looking at the INGREDIENT_TYPE_OCCURRENCE column) where the duplicate INGREDIENT_TYPE values are at, we can direct the STRING_AGG() function to ignore them using some CASE logic:

WITH occurrences AS (
  SELECT  RECIPE_NAME
  ,       INGREDIENT_NAME
  ,       INGREDIENT_TYPE
  ,       ROW_NUMBER() OVER (
              PARTITION BY RECIPE_NAME, INGREDIENT_TYPE
          ) AS INGREDIENT_TYPE_OCCURRENCE 
  FROM     RECIPE_INGREDIENT
)
SELECT  RECIPE_NAME
,       STRING_AGG(INGREDIENT_NAME, ', ') AS INGREDIENTS
,       STRING_AGG(
          CASE INGREDIENT_TYPE_OCCURRENCE
            WHEN 1 THEN INGREDIENT_TYPE  
            ELSE NULL  
          END, ', '
        ) AS INGREDIENT_TYPES
FROM     occurrences;
GROUP BY RECIPE_NAME;

And that's that! Of course, if we would like to add more columns for which we need unique lists of values, we would still need to repeat some steps. But rather than creating a new CTE to isolate the unique values, and a JOIN to add each of those to our main result, we now only need to add 1 ROW_NUMBER() expression, and one STRING_AGG() with nested CASE expression, which I think is a pretty good improvement.

Finally...

I hope you enjoyed this post! Drop a line in the comments if you have another suggestion or some remarks.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...