Nowadays, many SQL implementations offer some form of aggregate string concatenation function. Being an aggregate function, it has the effect concatenating values of a group of rows into a single string.
In most implementations, including MS SQL, SAP HANA, and PostgreSQL, this function goes by the name STRING_AGG()
. In Snowflake it's called LISTAGG()
and In MySQL, which is the first product I ever used that had such a function, it's called GROUP_CONCAT()
.
Basic Example
To illustrate how it behaves, lets assume we have a table of recipe ingredients like so:
CREATE TABLE RECIPE_INGREDIENT ( RECIPE_NAME VARCHAR(32) NOT NULL , INGREDIENT_NAME VARCHAR(32) NOT NULL , PRIMARY KEY( RECIPE_NAME , INGREDIENT_NAME ) );With this data:
INSERT INTO RECIPE_INGREDIENT( RECIPE_NAME , INGREDIENT_NAME ) VALUES -- bread ('bread', 'flour'), ('bread', 'water'), ('bread', 'yeast'), ('bread', 'salt') -- cake , ('cake', 'flour'), ('cake', 'eggs'), ('cake', 'sugar'), ('cake', 'butter') , ('cake', 'baking powder'), ('cake', 'salt'), ('cake', 'vanilla extract') ;A typical query could then be:
SELECT RECIPE_NAME
, STRING_AGG(INGREDIENT_NAME||'; ') AS INGREDIENTS
FROM RECIPE_INGREDIENT
GROUP BY RECIPE_NAME;
And the result would look something like this:
+-------------+-------------------------------------------------------------------+ | RECIPE_NAME | INGREDIENTS | +-------------+-------------------------------------------------------------------+ | bread | flour; water; yeast; salt; | | cake | flour; eggs; sugar; butter; baking powder; salt; vanilla extract; | +-------------+-------------------------------------------------------------------+
Arguments and Options: separator and ordering
All products that implement it offer the option to specify a separator string that is to be placed between pairs of values. All implementations also have a way to control the order in which the values are concatenated.
Options: DISTINCT
Only some products support the DISTINCT
keyword in their string concatenation aggregate function. If supported the DISTINCT
keyword appears right before the expression that is to be concatenated. It has the effect of adding each unique value only once to the concatenated string. That is, each distinct value is folded into the concatenated result only once.
MySQL and Snowflake support the DISTINCT
keyword in their string concatenation aggregate function.
SAP HANA, MS SQL and PostgreSQL, do not.
For our current example data, DISTINCT
wouldn't make any difference, as INGREDIENT_NAME
is (necessarily) unique for each RECIPE_NAME
due to the PRIMARY KEY
. But we can make a small change to the RECIPE_INGREDIENT
table and add a INGREDIENT_TYPE
column to help illustrate how DISTINCT
can be useful:
CREATE TABLE RECIPE_INGREDIENT ( RECIPE_NAME VARCHAR(32) NOT NULL , INGREDIENT_NAME VARCHAR(32) NOT NULL , INGREDIENT_TYPE VARCHAR(32) NOT NULL , PRIMARY KEY( RECIPE_NAME , INGREDIENT_NAME ) );With this data:
INSERT INTO RECIPE_INGREDIENT( RECIPE_NAME , INGREDIENT_NAME , INGREDIENT_TYPE ) VALUES -- bread ('bread', 'flour', 'carbs'), ('bread', 'water', NULL) , ('bread', 'yeast', 'leavener'), ('bread', 'salt', 'flavor') -- cake , ('cake', 'flour', 'carbs'), ('cake', 'eggs', 'protein') , ('cake', 'sugar', 'carbs'), ('cake', 'butter', 'fat') , ('cake', 'baking powder', 'leavener'), ('cake', 'salt', 'flavor') , ('cake', 'vanilla extract', 'flavor') ;Suppose we write another query, grouping recipes by their name, and using
STRING_AGG()
on both INGREDIENT_NAME
and INGREDIENT_TYPE
. This time, we'll also use the separator option by passing it as second argument:
SELECT RECIPE_NAME , STRING_AGG(INGREDIENT_NAME, ', ') AS INGREDIENTS , STRING_AGG(INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES FROM RECIPE_INGREDIENT GROUP BY RECIPE_NAME;The result would look something like this:
+-------------+----------------+-----------------------------------------------------+ | RECIPE_NAME | INGREDIENTS | INGREDIENT_TYPES | +-------------+----------------+-----------------------------------------------------+ | bread | flower, wat... | carbs, leavener, flavor | | cake | flower, egg... | carbs, protein, fat, leavener carbs, flavor, flavor | +-------------+----------------+-----------------------------------------------------+The list of
INGREDIENT_NAME
s is just like what we saw before but edited for brevity.
Notice how the list of INGREDIENT_TYPE
has 3 values for bread
even though it has 4 ingredients.
That's because the ingredient water
has a NULL
value for INGREDIENT_TYPE
, and STRING_AGG()
ignores NULL
values (just like most other aggregate functions do).
Notice how the list of INGREDIENT_TYPE
for cake
has just as many entries as there are ingredients.
Because the cake
ingredients flour
and sugar
are both of type carbs
, and ingredients vanilla extract
and salt
are both of type flavor
, the INGREDIENT_TYPE
list contains duplicates.
In those SQL implementations that support DISTINCT
, we can get rid of the duplicates in the INGREDIENT_TYPE
list simply by writing:
SELECT RECIPE_NAME , STRING_AGG(INGREDIENT_NAME, ', ') AS INGREDIENTS , STRING_AGG( DISTINCT INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES FROM RECIPE_INGREDIENT GROUP BY RECIPE_NAME;
Alternatives to DISTINCT
What can we do if the SQL implementation you're working with doesn't support it? A little googling finds many suggestions to solve it by applying a two-step process:
- The first step serves to obtain unique column values. This can be done either by applying
DISTINCT
in theSELECT
-list, or by adding aGROUP BY
-clause. - Then, in a second step,
STRING_AGG()
is applied on the pre-deduplicated column values to obtain a list containing only unique entries.
The process is simple enough if you only need a single column:
WITH deduplicated_ingredient_types AS (
SELECT DISTINCT RECIPE_NAME, INGREDIENT_TYPE
FROM RECIPE_INGREDIENT
)
SELECT RECIPE_NAME
, STRING_AGG( INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES
FROM deduplicated_ingredient_types
GROUP BY RECIPE_NAME;
This approach soon becomes unwieldy when we also want to have lists on other columns. For example, to get a result with a list of both INGREDIENT_NAME
and INGREDIENT_TYPE
, we'd have to write something like:
WITH deduplicated_ingredient_types AS ( SELECT DISTINCT RECIPE_NAME, INGREDIENT_TYPE FROM RECIPE_INGREDIENT ), ingredient_types AS ( SELECT RECIPE_NAME , STRING_AGG( INGREDIENT_TYPE, ', ') AS INGREDIENT_TYPES FROM deduplicated_ingredient_types GROUP BY RECIPE_NAME ) SELECT RECIPE_INGREDIENT.RECIPE_NAME , STRING_AGG( RECIPE_INGREDIENT.INGREDIENT_NAME, ', ') AS INGREDIENTS , ingredient_types.INGREDIENT_TYPES FROM RECIPE_INGREDIENT INNER JOIN ingredient_types ON RECIPE_INGREDIENT.RECIPE_NAME = ingredient_types.RECIPE_NAME GROUP BY RECIPE_INGREDIENT.RECIPE_NAME;
The pattern of having a CTE to obtain unique values and JOIN would have to be repeated for each column from wich we want to derive a list.
A better way: the ROW_NUMBER()
window function
We recently ran into a case for an actual SAP HANA customer where we need to produce such lists, and for many different columns.
We found out that although we still need a two-step process, we can solve the problem more elegantly using the ROW_NUMBER()
window function.
Consider the following query:
SELECT RECIPE_NAME
, INGREDIENT_NAME
, INGREDIENT_TYPE
, ROW_NUMBER() OVER (
PARTITION BY RECIPE_NAME, INGREDIENT_TYPE
) AS INGREDIENT_TYPE_OCCURRENCE
FROM RECIPE_INGREDIENT;
And the results are:
+-------------+-----------------+-----------------+----------------------------+ | RECIPE_NAME | INGREDIENT_NAME | INGREDIENT_TYPE | INGREDIENT_TYPE_OCCURRENCE | +-------------+-----------------+-----------------+----------------------------+ | bread | flour | carbs | 1 | | bread | water | NULL | 1 | | bread | yeast | leavener | 1 | | bread | salt | flavor | 1 | | cake | flour | carbs | 1 | | cake | eggs | protein | 1 | | cake | sugar | carbs | 2 | | cake | butter | fat | 1 | | cake | baking powder | leavener | 1 | | cake | salt | flavor | 1 | | cake | vanilla extract | flavor | 2 | +-------------+-----------------------------------+----------------------------+
You might notice that every INGREDIENT_TYPE_OCCURRENCE
is 1
, except for the duplicate values for carbs
and flavor
in the cake
recipe.
And, this is of course the key to solving the problem!
Using CASE
to let STRING_AGG()
ignore duplicates
Now that we know (by looking at the INGREDIENT_TYPE_OCCURRENCE
column) where the duplicate INGREDIENT_TYPE
values are at, we can direct the STRING_AGG()
function to ignore them using some CASE
logic:
WITH occurrences AS ( SELECT RECIPE_NAME , INGREDIENT_NAME , INGREDIENT_TYPE , ROW_NUMBER() OVER ( PARTITION BY RECIPE_NAME, INGREDIENT_TYPE ) AS INGREDIENT_TYPE_OCCURRENCE FROM RECIPE_INGREDIENT ) SELECT RECIPE_NAME , STRING_AGG(INGREDIENT_NAME, ', ') AS INGREDIENTS , STRING_AGG( CASE INGREDIENT_TYPE_OCCURRENCE WHEN 1 THEN INGREDIENT_TYPE ELSE NULL END, ', ' ) AS INGREDIENT_TYPES FROM occurrences; GROUP BY RECIPE_NAME;
And that's that! Of course, if we would like to add more columns for which we need unique lists of values, we would still need to repeat some steps.
But rather than creating a new CTE to isolate the unique values, and a JOIN
to add each of those to our main result, we now only need to add 1 ROW_NUMBER()
expression, and one STRING_AGG()
with nested CASE
expression, which I think is a pretty good improvement.
Finally...
I hope you enjoyed this post! Drop a line in the comments if you have another suggestion or some remarks.
No comments:
Post a Comment