- elimination of multi-valued attributes ('at most one value per column')
- elimination of redundancy ('no duplication of data')
Although these are two distinct activities that solve different problems, the methods used to perform them have a great deal in common. In both cases, some columns from the original table are moved to a new table, and a foreign key relationship is constructed to be able to relate the two tables.
The goal of the entire normalization process is to make the data as maintainable as possible to ensure integrity and validity of the data. Essentially, that goal is achieved by putting each separate 'thing' in it's own table reducing the amount of effort that is needed to add, remove or modify it.
This ususally means that retrieving and presenting data requires more effort than it would've been if the data would've been stored in an unnormalized structure. Because data is spread out across multiple tables, answering even simple questions usually involves accessing multiple tables.
More often than not, data presentation involves recreating (but not storing) a redundant structure with repeating groups. This is especially true for human-readable presentations of data (reports etc), but it may also occur when generating flat files from the database for the purpose of data-exchange (for example, XML-documents).
Although it may require considerable effort to query and retrieve a normalized structure to present data in a redundant format with repeating groups, it is doable. It's certainly a lot easier than storing the data in a redundant format and maintaining integrity and validity in case data is added, modified or removed.
GROUP_CONCAT(): repeating groups through aggregation
Consider the following structure from the Sakila sample database:
(Some of you might recall I used this structure in my previous blog entry to support my plea for a new
JOIN USING FOREIGN KEY
syntax.)This is a normalized structure. There are no multi-valued attributes (no column can have more than one value for a particular row). There is no redundancy either: each column is dependant only upon the primary key of it's table.
Suppose you'd want an overview of all the countries and their cities. As described in my previous blog entry, this is easily achieved by
JOIN
-ing the city
table to the country
table:
SELECT country.country
, city.city
FROM country
JOIN city
ON country.country_id = city.country_id
ORDER BY country.country
, city.city
LIMIT 5
;
(The
ORDER BY
and LIMIT
clauses are there only to allow me to present a small, managable list of data - it has no other bearing).This query yields the following result:
+----------------+--------+
| country | city |
+----------------+--------+
| Afghanistan | Kabul |
| Algeria | Batna |
| Algeria | Bchar |
| Algeria | Skikda |
| American Samoa | Tafuna |
+----------------+--------+
5 rows in set (0.03 sec)
Now, let's rephrase the requirements just a little bit. What if we want an overview of all countries and a list of all it's cities. It resembles the previous overview in that each country is displayed with all of it's cities; it's just the presentation that is different. Here, the output is required to use a repeating group: the result itself will be a list of countries, and each country in the list will contain a list of cities itself. That's pretty much the definition of a repeating group:
+----------------+----------------------+
| country | cities |
+----------------+----------------------+
| Afghanistan | Kabul |
| Algeria | Batna, Bchar, Skikda |
| American Samoa | Tafuna |
+----------------+----------------------+
3 rows in set (0.03 sec)
As it turns out, it is impossible to solve this with a single
SELECT
expression in Oracle or MS SQL. How convenient that this is not a problem at all in MySQL:
SELECT country.country
, GROUP_CONCAT(
city.city
ORDER BY city
SEPARATOR ', '
) cities
FROM country country
JOIN city city
ON country.country_id = city.country_id
GROUP BY country.country ASC
LIMIT 3
This query has two extra elements compared to the previous one: the
GROUP BY
clause and the GROUP_CONCAT()
function.GROUP BY
accepts a comma-separated list of expressions. GROUP BY
modifies the returned resultset: all rows from the original resultset that have a distinct combination of values in the list of GROUP BY
expressions, are bundled (grouped) and returned as one single row in the returned resultset. Columns from the original resultset that do not appear in the
GROUP BY
list ('detail'-columns, as opposed to 'grouping'- or 'summary'-columns) should not be used directly in the SELECT
list. That's because for one group in the final resultset, mutliple distinct values may occur in those columns. Because the expressions in the GROUP BY
list are returned a single row, the resultset simply doesn't have room to return all of them. (Most rdbms-es consider it an error to have a column that does not appear in the
GROUP BY
list in the SELECT
list. By default MySQL will accept such a statement. MySQL even returns the correct value too if the column has a unique value within the group. Otherwise, the returned value is undefined. This particular behaviour can be overidden using the ONLY_FULL_GROUP_BY
server SQL mode. In that particular mode, MySQL will treat it as an error to have such columns in the SELECT
-list too. See the manual for more info on this subject.)Although the detail values for each group cannot be accessed in the
SELECT
list directly, they can be processed by an aggregate function, like MIN
, MAX
, COUNT
, SUM
. All these functions operate on a group of objects only to return a single scalar value.GROUP_CONCAT()
is such an aggregate function. The function processes the expression passed to it by concatenating all occurrences within the group together. Apart from the expression value, an ORDER BY
clause may be specified to control the order in which the detail expression values are concatenated to each other. Also, a SEPARATOR
clause may be passed to specify the string that should be used to separate the multiple entries in the return value.Adding a level
Ok, we've seen that in MySQL, generating the repeating group in the query result is really very easy.
Now, suppose the requirements for the query change again. This time, we would also like to see the list of customers living in each city. First, let's take a look at the relevant structures in the database schema:
If we would not require the repeating groups in the output, the solution is a quite straightforward extension of the first country/city query:
SELECT country.country
, city.city
, customer.last_name
, customer.first_name
FROM country
JOIN city
ON country.country_id = city.country_id
JOIN address
ON city.city_id = address.city_id
JOIN customer
ON address.address_id = customer.address_id
WHERE country.country = 'United Kingdom'
ORDER BY country.country
, city.city
, customer.last_name
, customer.first_name
;
(This time, I used a
WHERE
condition to limit the number of rows. It has no bearing otherwise.)This is the result:
+----------------+-----------------+-----------+------------+
| country | city | last_name | first_name |
+----------------+-----------------+-----------+------------+
| United Kingdom | Bradford | POWELL | ANNE |
| United Kingdom | Dundee | BURNS | APRIL |
| United Kingdom | London | HOFFMAN | MATTIE |
| United Kingdom | London | VINES | CECIL |
| United Kingdom | Southampton | THORN | MARSHALL |
| United Kingdom | Southend-on-Sea | MARTIN | SANDRA |
| United Kingdom | Southport | GRUBER | ARMANDO |
| United Kingdom | Stockport | PAINE | DAN |
| United Kingdom | York | SLEDGE | GILBERT |
+----------------+-----------------+-----------+------------+
9 rows in set (0.00 sec)
However, we run into problems when we try to think of an appropriate
GROUP BY
clause and GROUP_CONCAT()
expression to generate the desired repeating groups.If we
GROUP BY
the country.country
column like we did before, we would get one summary row for the country, exaclty as required. However, what remains is a set of city/customer combinations. Although we could process those with GROUP_CONCAT
, that will never give us the desired result. The repeating group generated by GROUP_CONCAT
will have an entry for each city/customer combination corresponding to the United Kingdom, and that is not what we want: what we want is a repeating group of cities per country and per city, a repeating group of customers. So, we need two different repeating groups occurring for two different groupings or levels. This means that somehow, our query needs to have two corresponding
GROUP_CONCAT()
's as well as two GROUP BY
's. To achieve that, we need to have a '..query inside a query..' that we can equip with it's own GROUP BY
clause so it can generate it's own repeating group.A 'query inside a query' is called a subquery. Inside a
SELECT
expression, a subquery can appear in three different places: in the WHERE
-clause, in the FROM
-clause, or in SELECT
-list. Let's stick as close as we can to our previous attempt, JOIN
-ing all relevant tables, and write a subquery in the FROM
clause.Well, it's not really hard to see what parts we must place in the subquery. We had the repeating group for the
city
table in place, all we did since then is add the address
and customer
tables. These two tables really are all that is required to generate the repeating group of customers per city:
SELECT a.city_id
, GROUP_CONCAT(
CONCAT(
c.first_name
, ' '
, c.last_name
)
ORDER BY c.last_name
SEPARATOR ', '
) customers_per_city
FROM customer c
JOIN address a
ON c.address_id = a.address_id
WHERE a.city_id = 312
GROUP BY a.city_id
(312 is the city_id of London in the United Kingdom)
You can see that this query indeed generates the repeating group of customers for a particular value of
city_id
:
+---------+-----------------------------+
| city_id | customers_per_city |
+---------+-----------------------------+
| 312 | MATTIE HOFFMAN, CECIL VINES |
+---------+-----------------------------+
In fact, the repeating group corresponds to the 3rd and 4th entry in the result of the previous query: the customers living in London.
Now, we combine this with the previous query, substituting the directly joined
address
and customer
tables and their JOIN
condition with this query:
SELECT concat(
country.country
, ': '
, GROUP_CONCAT(
concat(
'\n\t', city.city, ':'
, '\n\t\t', customer_per_city.customer_per_city
)
separator ''
)
) customers_per_city
FROM country
JOIN city
ON country.country_id = city.country_id
JOIN (
SELECT a.city_id
, GROUP_CONCAT(
CONCAT(
c.first_name
, ' '
, c.last_name
)
ORDER BY c.last_name
SEPARATOR ', '
) customer_per_city
FROM customer c
JOIN address a
ON c.address_id = a.address_id
GROUP BY a.city_id
) customer_per_city
ON city.city_id = customer_per_city.city_id
WHERE country.country = 'United Kingdom'
GROUP BY country.country ASC
(In this query, some adjustments are made to the separators. Specificly, some newlines and indenting were added to make the result more readable. It's still a repeating group but this time, a new-line is output between cities instead of the semi colon used in previous examples.)
This is the result:
United Kingdom:
Bradford:
ANNE POWELL
Dundee:
APRIL BURNS
London:
MATTIE HOFFMAN, CECIL VINES
Southampton:
MARSHALL THORN
Southend-on-Sea:
SANDRA MARTIN
Southport:
ARMANDO GRUBER
Stockport:
DAN PAINE
York:
GILBERT SLEDGE
Summary
The MySQL aggregate function
GROUP_CONCAT()
generates a repeating group for the string expression passed as argument. The items that appear in a repeating group can themselves be repeating groups. Nesting of repeated groups is achieved by writing distinct queries for each distinct repeating group. These distinct queries are then joined (as subqueries) to relate the repeating groups to each other.
1 comment:
Sheeri, thanks for the comment.
I think you may have misread the initial query. I'm using comma's alright, but in their proper sense, that is, as separators for the column expressions in the SELECT-list.
It's true I used the JOIN keyword instead of INNER JOIN. I usually write INNER JOIN in full. However, this time, I referred to the query exactly as it appeared in a previous blog entry .
In that entry, "intelligent join syntax"? (http://rpbouman.blogspot.com/2006/04/intelligent-sql-join-syntax_09.html) I jotted down some ideas on join concepts in general, and I chose to write JOIN instead of INNER JOIN to avoid an explanation on the differences between inner and outer joins.
In that entry, I explain why I prefer an explicit ON over a NATURAL or named columns join (the one with the USING syntax), and why I think it should be possible to refer to a foreign key constraint to specify the join condition.
Post a Comment