Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Sunday, April 16, 2006

Nested repeating groups in MySQL query output

There are a lot of articles around that explain that relations (tables) should be normalized (an article by Mike Hillyer will get you started) before you can store them in a relational database. The complete normalization process consists of two steps:

  1. elimination of multi-valued attributes ('at most one value per column')

  2. elimination of redundancy ('no duplication of data')


Although these are two distinct activities that solve different problems, the methods used to perform them have a great deal in common. In both cases, some columns from the original table are moved to a new table, and a foreign key relationship is constructed to be able to relate the two tables.

The goal of the entire normalization process is to make the data as maintainable as possible to ensure integrity and validity of the data. Essentially, that goal is achieved by putting each separate 'thing' in it's own table reducing the amount of effort that is needed to add, remove or modify it.

This ususally means that retrieving and presenting data requires more effort than it would've been if the data would've been stored in an unnormalized structure. Because data is spread out across multiple tables, answering even simple questions usually involves accessing multiple tables.

More often than not, data presentation involves recreating (but not storing) a redundant structure with repeating groups. This is especially true for human-readable presentations of data (reports etc), but it may also occur when generating flat files from the database for the purpose of data-exchange (for example, XML-documents).

Although it may require considerable effort to query and retrieve a normalized structure to present data in a redundant format with repeating groups, it is doable. It's certainly a lot easier than storing the data in a redundant format and maintaining integrity and validity in case data is added, modified or removed.

GROUP_CONCAT(): repeating groups through aggregation


Consider the following structure from the Sakila sample database:

(Some of you might recall I used this structure in my previous blog entry to support my plea for a new JOIN USING FOREIGN KEY syntax.)
This is a normalized structure. There are no multi-valued attributes (no column can have more than one value for a particular row). There is no redundancy either: each column is dependant only upon the primary key of it's table.

Suppose you'd want an overview of all the countries and their cities. As described in my previous blog entry, this is easily achieved by JOIN-ing the city table to the country table:

SELECT country.country
, city.city
FROM country
JOIN city
ON country.country_id = city.country_id
ORDER BY country.country
, city.city
LIMIT 5
;

(The ORDER BY and LIMIT clauses are there only to allow me to present a small, managable list of data - it has no other bearing).
This query yields the following result:

+----------------+--------+
| country | city |
+----------------+--------+
| Afghanistan | Kabul |
| Algeria | Batna |
| Algeria | Bchar |
| Algeria | Skikda |
| American Samoa | Tafuna |
+----------------+--------+
5 rows in set (0.03 sec)


Now, let's rephrase the requirements just a little bit. What if we want an overview of all countries and a list of all it's cities. It resembles the previous overview in that each country is displayed with all of it's cities; it's just the presentation that is different. Here, the output is required to use a repeating group: the result itself will be a list of countries, and each country in the list will contain a list of cities itself. That's pretty much the definition of a repeating group:

+----------------+----------------------+
| country | cities |
+----------------+----------------------+
| Afghanistan | Kabul |
| Algeria | Batna, Bchar, Skikda |
| American Samoa | Tafuna |
+----------------+----------------------+
3 rows in set (0.03 sec)

As it turns out, it is impossible to solve this with a single SELECT expression in Oracle or MS SQL. How convenient that this is not a problem at all in MySQL:

SELECT country.country
, GROUP_CONCAT(
city.city
ORDER BY city
SEPARATOR ', '
) cities
FROM country country
JOIN city city
ON country.country_id = city.country_id
GROUP BY country.country ASC
LIMIT 3

This query has two extra elements compared to the previous one: the GROUP BY clause and the GROUP_CONCAT() function.

GROUP BY accepts a comma-separated list of expressions. GROUP BY modifies the returned resultset: all rows from the original resultset that have a distinct combination of values in the list of GROUP BY expressions, are bundled (grouped) and returned as one single row in the returned resultset.

Columns from the original resultset that do not appear in the GROUP BY list ('detail'-columns, as opposed to 'grouping'- or 'summary'-columns) should not be used directly in the SELECT list. That's because for one group in the final resultset, mutliple distinct values may occur in those columns. Because the expressions in the GROUP BY list are returned a single row, the resultset simply doesn't have room to return all of them.

(Most rdbms-es consider it an error to have a column that does not appear in the GROUP BY list in the SELECT list. By default MySQL will accept such a statement. MySQL even returns the correct value too if the column has a unique value within the group. Otherwise, the returned value is undefined. This particular behaviour can be overidden using the ONLY_FULL_GROUP_BY server SQL mode. In that particular mode, MySQL will treat it as an error to have such columns in the SELECT-list too. See the manual for more info on this subject.)

Although the detail values for each group cannot be accessed in the SELECT list directly, they can be processed by an aggregate function, like MIN, MAX, COUNT, SUM. All these functions operate on a group of objects only to return a single scalar value.

GROUP_CONCAT() is such an aggregate function. The function processes the expression passed to it by concatenating all occurrences within the group together. Apart from the expression value, an ORDER BY clause may be specified to control the order in which the detail expression values are concatenated to each other. Also, a SEPARATOR clause may be passed to specify the string that should be used to separate the multiple entries in the return value.

Adding a level


Ok, we've seen that in MySQL, generating the repeating group in the query result is really very easy.
Now, suppose the requirements for the query change again. This time, we would also like to see the list of customers living in each city. First, let's take a look at the relevant structures in the database schema:

If we would not require the repeating groups in the output, the solution is a quite straightforward extension of the first country/city query:

SELECT country.country
, city.city
, customer.last_name
, customer.first_name
FROM country
JOIN city
ON country.country_id = city.country_id
JOIN address
ON city.city_id = address.city_id
JOIN customer
ON address.address_id = customer.address_id
WHERE country.country = 'United Kingdom'
ORDER BY country.country
, city.city
, customer.last_name
, customer.first_name
;

(This time, I used a WHERE condition to limit the number of rows. It has no bearing otherwise.)
This is the result:

+----------------+-----------------+-----------+------------+
| country | city | last_name | first_name |
+----------------+-----------------+-----------+------------+
| United Kingdom | Bradford | POWELL | ANNE |
| United Kingdom | Dundee | BURNS | APRIL |
| United Kingdom | London | HOFFMAN | MATTIE |
| United Kingdom | London | VINES | CECIL |
| United Kingdom | Southampton | THORN | MARSHALL |
| United Kingdom | Southend-on-Sea | MARTIN | SANDRA |
| United Kingdom | Southport | GRUBER | ARMANDO |
| United Kingdom | Stockport | PAINE | DAN |
| United Kingdom | York | SLEDGE | GILBERT |
+----------------+-----------------+-----------+------------+
9 rows in set (0.00 sec)

However, we run into problems when we try to think of an appropriate GROUP BY clause and GROUP_CONCAT() expression to generate the desired repeating groups.

If we GROUP BY the country.country column like we did before, we would get one summary row for the country, exaclty as required. However, what remains is a set of city/customer combinations. Although we could process those with GROUP_CONCAT, that will never give us the desired result. The repeating group generated by GROUP_CONCAT will have an entry for each city/customer combination corresponding to the United Kingdom, and that is not what we want: what we want is a repeating group of cities per country and per city, a repeating group of customers.

So, we need two different repeating groups occurring for two different groupings or levels. This means that somehow, our query needs to have two corresponding GROUP_CONCAT()'s as well as two GROUP BY's. To achieve that, we need to have a '..query inside a query..' that we can equip with it's own GROUP BY clause so it can generate it's own repeating group.

A 'query inside a query' is called a subquery. Inside a SELECT expression, a subquery can appear in three different places: in the WHERE-clause, in the FROM-clause, or in SELECT-list. Let's stick as close as we can to our previous attempt, JOIN-ing all relevant tables, and write a subquery in the FROM clause.

Well, it's not really hard to see what parts we must place in the subquery. We had the repeating group for the city table in place, all we did since then is add the address and customer tables. These two tables really are all that is required to generate the repeating group of customers per city:


SELECT a.city_id
, GROUP_CONCAT(
CONCAT(
c.first_name
, ' '
, c.last_name
)
ORDER BY c.last_name
SEPARATOR ', '
) customers_per_city
FROM customer c
JOIN address a
ON c.address_id = a.address_id
WHERE a.city_id = 312
GROUP BY a.city_id

(312 is the city_id of London in the United Kingdom)
You can see that this query indeed generates the repeating group of customers for a particular value of city_id:

+---------+-----------------------------+
| city_id | customers_per_city |
+---------+-----------------------------+
| 312 | MATTIE HOFFMAN, CECIL VINES |
+---------+-----------------------------+

In fact, the repeating group corresponds to the 3rd and 4th entry in the result of the previous query: the customers living in London.

Now, we combine this with the previous query, substituting the directly joined address and customer tables and their JOIN condition with this query:


SELECT concat(
country.country
, ': '
, GROUP_CONCAT(
concat(
'\n\t', city.city, ':'
, '\n\t\t', customer_per_city.customer_per_city
)
separator ''
)
) customers_per_city
FROM country
JOIN city
ON country.country_id = city.country_id
JOIN (
SELECT a.city_id
, GROUP_CONCAT(
CONCAT(
c.first_name
, ' '
, c.last_name
)
ORDER BY c.last_name
SEPARATOR ', '
) customer_per_city
FROM customer c
JOIN address a
ON c.address_id = a.address_id
GROUP BY a.city_id
) customer_per_city
ON city.city_id = customer_per_city.city_id
WHERE country.country = 'United Kingdom'
GROUP BY country.country ASC

(In this query, some adjustments are made to the separators. Specificly, some newlines and indenting were added to make the result more readable. It's still a repeating group but this time, a new-line is output between cities instead of the semi colon used in previous examples.)
This is the result:

United Kingdom:
Bradford:
ANNE POWELL
Dundee:
APRIL BURNS
London:
MATTIE HOFFMAN, CECIL VINES
Southampton:
MARSHALL THORN
Southend-on-Sea:
SANDRA MARTIN
Southport:
ARMANDO GRUBER
Stockport:
DAN PAINE
York:
GILBERT SLEDGE


Summary


The MySQL aggregate function GROUP_CONCAT() generates a repeating group for the string expression passed as argument.

The items that appear in a repeating group can themselves be repeating groups. Nesting of repeated groups is achieved by writing distinct queries for each distinct repeating group. These distinct queries are then joined (as subqueries) to relate the repeating groups to each other.

2 comments:

Sheeri said...

Ideally, use INNER JOINS instead of implicit comma joins.

SELECT country.country
INNER JOIN city.city
FROM country
JOIN city
ON country.country_id = city.country_id
ORDER BY country.country
, city.city
LIMIT 5
;

or, better still, use "USING" instead of "ON" when the fields have the same name:

SELECT country.country
INNER JOIN city.city
FROM country
JOIN city
USING (country_id)
ORDER BY country.country
, city.city
LIMIT 5
;

------------
this:
SELECT country.country
, city.city
, customer.last_name
, customer.first_name
FROM country
JOIN city
ON country.country_id = city.country_id
JOIN address
ON city.city_id = address.city_id
JOIN customer
ON address.address_id = customer.address_id
WHERE country.country = 'United Kingdom'
ORDER BY country.country
, city.city
, customer.last_name
, customer.first_name
;


becomes

SELECT country.country
, city.city
, customer.last_name
, customer.first_name
FROM country
JOIN city
USING (country_id)
JOIN address
USING (city_id)
JOIN customer
USING (address_id)
WHERE country.country = 'United Kingdom'
ORDER BY country.country
, city.city
, customer.last_name
, customer.first_name
;


much easier to read!

And I learned about GROUP_CONCAT...thanx!

(A good article, btw)

Roland Bouman said...

Sheeri, thanks for the comment.

I think you may have misread the initial query. I'm using comma's alright, but in their proper sense, that is, as separators for the column expressions in the SELECT-list.

It's true I used the JOIN keyword instead of INNER JOIN. I usually write INNER JOIN in full. However, this time, I referred to the query exactly as it appeared in a previous blog entry .

In that entry, "intelligent join syntax"? (http://rpbouman.blogspot.com/2006/04/intelligent-sql-join-syntax_09.html) I jotted down some ideas on join concepts in general, and I chose to write JOIN instead of INNER JOIN to avoid an explanation on the differences between inner and outer joins.

In that entry, I explain why I prefer an explicit ON over a NATURAL or named columns join (the one with the USING syntax), and why I think it should be possible to refer to a foreign key constraint to specify the join condition.