UPDATE
Thanks for your interest in this article. Please note that it was written a long time ago and that MySQL has progressed enormously since. Check
MySQL 5.7.5: GROUP BY respects functional dependencies! for more info on a more current state of affairs. This original article will remain in place to provide background information.
Original article below:
Debunking GROUP BY Myths
There is a popular myth about the SQL
GROUP BY
clause. The myth holds that 'standard SQL' requires columns referenced in the
SELECT
list of a query to also appear in the
GROUP BY
clause, unless these columns appear exclusively in an aggregated expression.
MySQL is often accused of violating this standard.
In this article I will attempt to debunk this myth, and to provide a more balanced view regarding MySQL's treatment of
GROUP BY
at the same time.
To do that, I will first demonstrate that MySQL can be instructed to only accept
GROUP BY
clauses that include all non-aggregated expressions referred to in the
SELECT
list, thus making MySQL's behaviour conform more to that of other well-known rdbms-products.
Second, I will show that it is very important to clearly define which version of the SQL-standard is being referred to. The two most recent versions use a rather sophisticated way of defining the required relationships between expressions appearing in the
GROUP BY
clause and the
SELECT
list. Contrary to a popular belief, these standards do not literally require that all non-aggregated
SELECT
list columns appear in the
GROUP BY
clause.
Third, I will use a simple yet realistic example to illustrate in an informal manner what I believe is the intent expressed in the more recent versions of the SQL standard. Hopefully I will be able to convince you why it may even be better to not blindly include all non-aggregated columns from the
SELECT
list in the
GROUP BY
clause.
Before we dive into the details, I'll start with a brief introduction with regard to
GROUP BY
for those that are not too familiar at all with the construct. In the introduction, I will illustrate why most database products require all non-aggregated columns that are referenced in the
SELECT
list to appear in the
GROUP BY
clause, and why users run into trouble sometimes due to MySQL's treatment of
GROUP BY
clause.
The GROUP BY
-clause
So, what is the
GROUP BY
-clause, and what does it do?
The
GROUP BY
-clause is an optional element of SQL
SELECT
expressions. Syntactically, the
GROUP BY
-clause consists of the keyword sequence
GROUP BY
, followed by a comma-separated list of (scalar) expressions. If the
SELECT
expression contains a
GROUP BY
-clause, it must appear after the
WHERE
clause. (If the
WHERE
clause is omitted, the
GROUP BY
clause will immediately follow after the
FROM
clause.)
When included,
GROUP BY
specifies that rows from the intermediate result set are to be divided in a number of groups, returning one single row for each such group. The list of expressions provided in the
GROUP BY
list defines how the grouping takes place. All rows that have the same combination of values for all expressions specified in the
GROUP BY
are in the same group.
Let's do a few simple example queries to illustrate the effect of the
GROUP BY
-clause. (For these examples I'll use the
pet
table from the
menagerie
database.) The following query retrieves all rows from the
pet
table in the
menagerie
database:
SELECT *
FROM menagerie.pet
The query returns a result that might look like this:
+----------+--------+---------+------+------------+------------+
| name | owner | species | sex | birth | death |
+----------+--------+---------+------+------------+------------+
| Fluffy | Harold | cat | f | 1993-02-04 | NULL |
| Claws | Gwen | cat | m | 1994-03-17 | NULL |
| Buffy | Harold | dog | f | 1989-05-13 | NULL |
| Fang | Benny | dog | m | 1990-08-27 | NULL |
| Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 |
| Chirpy | Gwen | bird | f | 1998-09-11 | NULL |
| Whistler | Gwen | bird | NULL | 1997-12-09 | NULL |
| Slim | Benny | snake | m | 1996-04-29 | NULL |
| Puffball | Diane | hamster | f | 1999-03-30 | NULL |
+----------+--------+---------+------+------------+------------+
(Because we did not specify an
ORDER BY
-clause, the rows are returned in some order determined by the database, so your results might not look exactly like this. However, for this example, the actual rows are important - not the order.)
Now, suppose we want to make groups for each species. The following addition of the
GROUP BY
-clause does just that:
SELECT species
FROM menagerie.pet
GROUP BY species -- make one group for each species
The query returns this result:
+---------+
| species |
+---------+
| bird |
| cat |
| dog |
| hamster |
| snake |
+---------+
At a glance, it seems as if the
GROUP BY
clause does nothing more than scan for unique occurrences in the
species
column and return those. However, it is better to think of each row in the
GROUP BY
result as a summary row that represents a group of rows that have the same value in the
species
column. So, in this case, the
bird
row represents the group of pets that are birds, ("Chirpy" and "Whistler"); the
cat
row represents the group of pets that are cats ("Fluffy" and "Claws"), and so on and so forth.
Calculating Aggregates for a group of rows
A
GROUP BY
query allows one to apply
aggregate functions on the collection of rows associated with each group defined by the
GROUP BY
clause. An aggregate function can process expressions for each row in a group of rows to compute a single return value. A number of well-known standard aggregate functions are
COUNT
,
MIN
,
MAX
, and
SUM
.
(Aggregate functions can also be used without a
GROUP BY
clause, in which case the entire intermediate resultset is treated as one big group. Try imagining the effect of the
GROUP BY
operation with an empty
GROUP BY
list: the query will return just one row that summarizes all rows from the intermediate result set).
Expanding our previous example query for
GROUP BY
, the following example illustrates the effect of some of these aggregate functions:
SELECT species
, GROUP_CONCAT(name) -- make a list of pets per species
, COUNT(*) -- count pets per species
, MIN(birth) -- birthdate of oldest pet per species
, MAX(birth) -- birthdate of youngest pet per species
FROM menagerie.pet
GROUP BY species
This example also includes usage of the
MySQL specific
GROUP_CONCAT
aggregate function which will prove to be very useful to illustrate the effect of the
GROUP BY
-clause.
The result looks something like this:
+---------+--------------------+----------+------------+------------+
| species | GROUP_CONCAT(name) | COUNT(*) | MIN(birth) | MAX(birth) |
+---------+--------------------+----------+------------+------------+
| bird | Chirpy,Whistler | 2 | 1997-12-09 | 1998-09-11 |
| cat | Fluffy,Claws | 2 | 1993-02-04 | 1994-03-17 |
| dog | Buffy,Fang,Bowser | 3 | 1979-08-31 | 1990-08-27 |
| hamster | Puffball | 1 | 1999-03-30 | 1999-03-30 |
| snake | Slim | 1 | 1996-04-29 | 1996-04-29 |
+---------+--------------------+----------+------------+------------+
Again, we see one row for each group of rows that have an identical value in the
species
column, but this time, we also see the effect of processing the individual rows for each species using aggregate functions:
- The
GROUP_CONCAT
function was applied to the name
column. For each species in our pet table, there maybe multiple pets, and GROUP_CONCAT
concatenates their names, separating the individual names by default with a comma. Thus in this example, the GROUP_CONCAT
expression reveals the make-up of each group of pets of a single species.
- The
COUNT
function is used with a wildcard *
, instructing it to count the number of rows associated with each group. Verifying this with the previous column, we can immediately see that the number of rows in the group is consistent with the number of names concatenated by the GROUP_CONCAT
expression
- The
MIN
and MAX
functions are applied to the birth
column and for each species, respectively report the birth date of whichever pet is oldest (MIN(birth)
, the birth date that is smaller than any of the other birth dates) and youngest (MAX(birth)
, the birth date that is larger than any of the other birth dates).
So, aggregate functions are applied to expressions taken from a group of rows and have the effect of 'condensing' (aggregating) the group, yielding a single value. Most aggregate functions calculate or determine some kind of statistical metric which serves to characterize the group as a whole. The MySQL specific
GROUP_CONCAT
function is an exception: it simply enumerates all members in the group passed to the function, and as such it is not a statistical function. However, it still exposes the main property of aggregate functions, namely the ability to turn the expressions from a group of rows into a single value.
Running into trouble with GROUP BY
So far, we've seen a few examples with
GROUP BY
that make perfect sense. Yet is easy to run into trouble with
GROUP BY
. Take a look at the following query:
SELECT species
, MIN(birth) -- birthdate of oldest pet per species
, MAX(birth) -- birthdate of youngest pet per species
, birth -- birthdate of ... uh oh...!
FROM menagerie.pet
GROUP BY species
This query is similar to the previous query, where we calculated a few aggregates for each group of pets belonging to the same species. However, this time, we also include the a plain reference to the
birth
column in the
SELECT
list.
An attempt to run this query on Oracle results in an error:
SQL> SELECT species
2 , MIN(birth)
3 , MAX(birth)
4 , birth
5 FROM sakila.pet
6 GROUP BY species;
, birth
*
ERROR at line 4:
ORA-00979: not a GROUP BY expression
Running this query on MySQL however does return a result, which may look something like this:
mysql> SELECT species
-> , MIN(birth) -- birthdate of oldest pet per species
-> , MAX(birth) -- birthdate of youngest pet per species
-> , birth -- birthdate of ... uh oh...!
-> FROM menagerie.pet
-> GROUP BY species;
+---------+------------+------------+------------+
| species | MIN(birth) | MAX(birth) | birth |
+---------+------------+------------+------------+
| bird | 1997-12-09 | 1998-09-11 | 1998-09-11 |
| cat | 1993-02-04 | 1994-03-17 | 1993-02-04 |
| dog | 1979-08-31 | 1990-08-27 | 1989-05-13 |
| hamster | 1999-03-30 | 1999-03-30 | 1999-03-30 |
| snake | 1996-04-29 | 1996-04-29 | 1996-04-29 |
+---------+------------+------------+------------+
5 rows in set (0.00 sec)
What is happening here? Why do we see such different behaviours? In fact, what is MySQL's behaviour in this case? Sometimes, the
birth
column reports a value that looks like the maximum value for
birth
within the species (first row), and sometimes we see the maximum value (second row). We even see one case where the returned value is in between the minimum and maximum values (row 3). How can we explain this seemingly random behaviour?
Understanding the Problem
It's not too hard to deduce what is happening here. All we need to do is go back to our explanation of the effect of the
GROUP BY
clause, and see how it applies to our last query.
It was already explained that the
GROUP BY
clause returns one row for each group of rows in the intermediate result, and that the groups are defined by the expression list defined in the
GROUP BY
clause. So, in this case, we are creating one result row for each group of rows that belong to the same species because the
GROUP BY
list only contains the
species
column. Yet there are several pets that may belong to a specific species, so the
birth
column may have (and often has) a different value for each row in a particular species group.
Having realized that, we can now ask ourselves: assuming there are multiple values in the
birth
column for a particular value of
species
, which one should be returned? What did we mean when we specified the
birth
column in the
SELECT
list?
There is no good answer to this question. It is certainly possible to select just one of the possible values for the
birth
column: in fact, this is exactly what MySQL does. However, it is impossible to define the significance of which ever of the possible values is chosen. That is, it makes no sense to want to mix the plain values from the rows that belong to the group with the group itself. Therefore it does not make sense to even include the
birth
column in the
SELECT
list of the query.
Another way of looking at it is to say that the 'grain' of
species
values is different (and therefore, incompatible) with the grain of the
birth
values. It does not mean we cannot access the values in the
birth
column; it merely means we must use an aggregate function to compute the 'right' value from a whole group of them.
Avoiding the Problem
What about the behaviour such as exercised by Oracle? Doesn't it make more sense to issue an error message rather than returning non-sense data? Put this way, most people will probably agree. Of course, the error message itself is somewhat puzzling:
, birth
*
ERROR at line 4:
ORA-00979: not a GROUP BY expression
This seems to suggest that the problem is that the
birth
column in the
SELECT
list is not included in the
GROUP BY
clause. In turn, this raises the question whether the problem would be solved if we would have included the
birth
column in the
GROUP BY
clause.
Well, including the
birth
column in the
GROUP BY
clause certainly gets rid of the error message. However, a lot of users with entry level skills in SQL fail to understand that this yields quite a different query. The original
GROUP BY species
yields one group for each species, whereas
GROUP BY species, birth
yields a group for each combination of values in
species
and
birth
- most probably not at all what is intended.
On the other hand, we cannot expect the database management system to know what we were thinking when we included the
birth
in the
SELECT
list in the first place. So, despite that the error message may seem a bit puzzling, it is still preferable over silently returning non-sense data.
But do we really have to put up with this? The answer is "No!".
Including ONLY_FULL_GROUP_BY
in MySQL's sql_mode
Nowadays, MySQL is capable of detecting this problem too, and it is perfectly possible to make MySQL reject the previous query to avoid the problem of returning non-sense data. This is achieved by including
ONLY_FULL_GROUP_BY
in the
sql_mode
.
Like many server settings, we can specify the
sql_mode
using the
--sql-mode
command line argument to the MySQL server executable (
mysqld
), or we can include it in an
option file. For example, including the following line in the option file will enable
ONLY_FULL_GROUP_BY
when the server starts up:
sql_mode=ONLY_FULL_GROUP_BY
Beginning with MySQL 4.1, it is also possible to set the
sql_mode
at runtime using the
SET
syntax. In this way, the
sql_mode
can be set globally or for the session level. The latter is the most useful, as it allows one to setup a
sql_mode
most suited for a particular application without affecting any other applications that run on the server. (Some applications don't expect anything other than the default setting and may run into trouble with a particular
sql_mode
.)
The following snippet illustrates how to include
ONLY_FULL_GROUP_BY
in the
sql_mode
at runtime:
mysql> SET sql_mode := CONCAT('ONLY_FULL_GROUP_BY,',@@sql_mode);
Query OK, 0 rows affected (0.00 sec)
The
@@sql_mode
server variable contains a possibly empty, comma-separated string of current
sql_mode
settings. The
CONCAT
expression prepends whatever the current setting is of the
sql_mode
with
ONLY_FULL_GROUP_BY
. Note the comma immediately following
ONLY_FULL_GROUP_BY
. If the value of
@@sql_mode
is the empty string, the value of the
CONCAT
expression will have a trailing comma, but this is allowed in the assignment (trimming off the comma in the process).
When we now attempt to execute our query again, it fails with an error message:
mysql> SELECT species
-> , MIN(birth) -- birthdate of oldest pet per species
-> , MAX(birth) -- birthdate of youngest pet per species
-> , birth -- birthdate of ... uh oh...!
-> FROM menagerie.pet
-> GROUP BY species;
ERROR 1055 (42000): 'menagerie.pet.birth' isn't in GROUP BY
The error messages indicates that we did not include the
birth
column in the
GROUP BY
clause. Now, MySQL behaves similar to Oracle for this query.
(Alas, MySQL's
ONLY_FULL_GROUP_BY
is not as clever as it should be, and there are particular cases where
ONLY_FULL_GROUP_BY
is too restrictive in enforcing only full
GROUP BY
clauses. The details are described
here. The good news is that
the community can help to fix this bug! Go to MySQL Forge and check out
Worklog task 2489.)
What does 'the' SQL standard say
In the previous sections, we've seen how Oracle and MySQL react very differently to the same SQL
GROUP BY
query. But what do the standards say? How is the
GROUP BY
clause supposed to behave?
In the introduction of this article, I claimed I would debunk a popular myth that holds that
...standard SQL requires columns referenced in the SELECT
list of a query to also appear in the GROUP BY
clause, unless these columns appear exclusively in an aggregated expression.
Now, I don't want to pretend I'm an expert as far as the SQL standard (ISO/IEC 9075) is concerned. In fact, I've noticed repeatedly that the sheer volume of the documentation as well as the persistent formal wording prevent me from obtaining a clear overview of it. But, let's give it a try anyway.
The
1992 version of the standard, 7.9 - 7 states that:
If T is a grouped table, then each <column reference> in each <value expression> that references a column of T shall reference a grouping column or be specified within a <set function specification>.
As said, I'm not an expert in this area, but the way I read it it boils down to:
- Queries that include a
GROUP BY
clause can only include column references in SELECT
-ed expressions if the column appears in the GROUP BY
clause, or if that column appears as part of an aggregate.
Now, in the 7.12 - 15 of the 2003 version of the standard we find this:
If T is a grouped table, then let G be the set of grouping columns of T. In each <value expression> contained in <select list>, each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a <set function specification> whose aggregation query is QS.
The 1999 version of the standard contains a similar rule. The important thing to note here is that both versions that succeeded the 1992 version stopped requiring explicitly that all non-aggregated columns in the
SELECT
list must be present in the
GROUP BY
clause. Instead they require that any non-aggregated column appearing in the
SELECT
list is
functionally dependent upon the
GROUP BY
clause.
Functional dependencies
What would the 1999 and 2003 version of the SQL standard mean by the term "functionally dependent"? The answer to that question is also defined by the standard. Unfortunately, it cannot be illustrated by a simple quote, as the formal definition of what exactly constitutes a functional dependency according to the standard is fairly extensive and complicated.
Luckily, the concept of functional dependencies can be easily illustrated in a less formal way. Suppose we have two expressions,
A and
B. Now, B is functionally dependent upon A if B has exactly one value for a particular value of A. Consider this snippet of code:
mysql> SELECT @A:=1 AS A
-> , @B:=@A + 1 AS B;
+---+------+
| A | B |
+---+------+
| 1 | 2 |
+---+------+
Here, the column
B
is functionally dependent upon column
A
. The value of
B
can be derived from the value of
A
in a very straightforward manner, namely by adding
1
to whatever the value of
A happens to be. We know how to compute the value of B
for any given value of A
, and for any given value of A
, the corresponding value of B
will always be the same.
The functional dependency concept can also be be applied to multiple columns:
SELECT CONCAT(A,B) C
FROM someTable
Column C
is defined as the result of CONCAT(A,B)
expression. If we have the values of both A
and B
we can compute the result of CONCAT(A,B)
. Of course, for a given pair of expressions for A
and B
the result of CONCAT(A,B)
will always be the same. Therefore, C
is functionally dependent upon the column pair A
and B
.
Note that it is not enough just to only know the method to calculate B
out of A
. Consider this example:
mysql> SELECT @A:=1 AS A
-> , @B:=@A + RAND() AS B;
Here, we know the recipe to derive the value of A
from the value of A
: we have to take the value of A
and add the value returned by a call to the RAND()
function. However, the RAND()
function will return a different value each time it is called. Therefore, the value of B
will be different too every time, and hence we cannot say that B
is functionally dependent upon A
. There is thus not a single value for B
for a given value of A
, and thus B
is not functionally dependent upon A
.
Functional dependency and Normalization
The term functional depencency
is also used with regard to normalization. Part of the normalization process involves discovering functional dependencies between different groups of columns.
Normalization requires that each table has at least one key. A key is a column or group of columns that may be used to identify a single record in the table. By definition, if we have a key, all non-key columns are functionally dependent upon the key. Another way to think about this is to imagine that we look up a row using a key. If the key entry exists, this will result in exactly one row. By definition, each column in that row has exactly one value, so the value of each non-key column can be determined by the key.
Functional dependencies between a group of columns that makes up a key and any other group of columns is allowed, but functional dependencies between two groups of non-key columns are eliminated by the normalization process. (This is done by splitting off the groups of columns that expose the functional dependency to a new table, making one of the groups of columns a key in the new table.)
Functional dependency and GROUP BY
We have just seen that the 1999 and 2003 versions of the SQL standard require that the columns appearing in the SELECT
list are functionally dependent upon the groups defined by the GROUP BY
clause. In other words, if we know that a column contains only one value for any given combination of values in the columns appearing in the GROUP BY
clause, we may reference the column in the SELECT
list even if it does not appear in an aggregate expression.
We've also seen that if we have a (primary or unique) key, all columns that are not included in the key are by definition functionally dependent upon the key. This means that if we include all key columns in the GROUP BY
clause, we can reference any column we like in the SELECT
list, even if they appear outside an aggregate expression.
The following example from the sakila sample database might help to illustrate this:
mysql> SELECT film_id -- primary key
-> , title -- non-key column
-> , COUNT(*) -- one row per group
-> FROM sakila.film
-> GROUP BY film_id; -- group by on primary key
+---------+-----------------------------+----------+
| film_id | title | COUNT(*) |
+---------+-----------------------------+----------+
| 1 | ACADEMY DINOSAUR | 1 |
. . . .
. . . .
[...not showing 998 rows...]
. . . .
. . . .
| 1000 | ZORRO ARK | 1 |
+---------+-----------------------------+----------+
1000 rows in set (0.05 sec)
Here, we query the film
table. The primary key of the film
table consists of only the film_id
column. The GROUP BY
clause contains the film_id
column. As a result, the query returns a collection of groups, each of which summarizes only one row. Of course, because there is only one row per group, there can be only one value in each of the other columns of the film
table. Therefore, it is safe to include whatever column we like in the SELECT
list. For this reason, it is perfectly ok to include the film_title
column in the SELECT
list.
Of course, the GROUP BY
in the previous query does not make sense logically. Because the primary key of the film
table consists of only the film_id
column, we already know there can be only one row for any given value of film_id
. However, it becomes interesting when we include another table in the query. Consider the next example:
mysql> SELECT f.film_id
-> , f.title
-> , COUNT(fa.actor_id)
-> FROM film f
-> LEFT JOIN film_actor fa
-> ON f.film_id = fa.film_id
-> GROUP BY f.film_id;
+---------+-----------------------------+----------+
| film_id | title | COUNT(*) |
+---------+-----------------------------+----------+
| 1 | ACADEMY DINOSAUR | 10 |
. . . .
. . . .
[...not showing 998 rows...]
. . . .
. . . .
| 1000 | ZORRO ARK | 3 |
+---------+-----------------------------+----------+
1000 rows in set (0.02 sec)
Here, we have added a LEFT JOIN
to calculate the number of actors per film. This time, the GROUP BY
clause on the film_id
does make sense: we now get one group of actors that play a role in each film. At the same time, we know that all columns in the film
table are functionally dependent upon the film_id
column. Each group returned by the GROUP BY
clause corresponds to exactly one row from the film
table, and this means that for each group, there is only one value in any column of the film
table. So, it's perfectly safe to reference the columns from the film
table in the SELECT
list, even if we don't use them in an aggregate expression.
It is important to realize that we cannot reference just any column in the SELECT
list: we can only reference those columns that are functionally dependent upon the film_id
column of the film
table. This means that it is wrong to reference any column of the film_actor
table in the SELECT
list directly: we may do so only in an aggregate expression.
The previous example demonstrates a pattern. The film
table acts as the so-called master, and the film_actor
table acts as the detail. The master-detail pattern is very common: Order and Order Items, Vendor and Products, Country and Cities are all examples of this pattern.
So why would I do that?
Ok, hopefully, I've been able to explain under which circumstances it is safe for GROUP BY
queries to reference columns in the SELECT
list directly. One might wonder though what the advantages and disadvantages are. I mean, just because you can doesn't mean you should, right?
Disadvantages?
Well, there certainly are reasons for always writing a full GROUP BY
clause.
First of all, many rdbms products will only allow a full GROUP BY
clause anyway, so in those cases there really is no choice. In MySQL however, we do have the choice as long as we are not using ONLY_FULL_GROUP_BY
in the sql_mode
.
Another reason for always writing the GROUP BY
clause in full is that other developers might not understand under which circumstances it is ok to use a partial GROUP BY
clause. In many cases, they've spent considerable time to learn to blindly repeat all SELECT
columns in the GROUP BY
clause, and they will usually point out that it is wrong to not adhere to that rule.
Of course, it is impossible to distinguish between a query that intentionally omits columns from the GROUP BY
clause and one that accidentally forgot to include them. When it is the intention to always write a full GROUP BY
clause, it is easy to verify whether the column references in the GROUP BY
clause and the the SELECT
list match.
I have heard people argue that including only the key in the GROUP BY
clause will lead to problems when the definition of the key is changed. Personally, I think this is a bogus argument. When you are considering to change the definition of the key, you are most likely going to review all your queries anyway, because all your joins will need to reflect this change too. I just mean to say that changing a few GROUP BY
's here and there is probably least of your problems when you are considering to change the definition of a key.
Another argument I have heard is that it is somehow 'more clear', 'cleaner' or 'prettier' to repeat all columns referenced in the SELECT
list in the GROUP BY
clause. Personally, I think this is a bogus argument too. At the very least, these are all a matter of opinion.
Advantages
Personally, I feel it is more clear and prettier to GROUP BY
only on key columns where possible. I argued that it is a matter of opinion what is 'clear' or 'pretty', so I must discard this argument likewise.
I would argue that full GROUP BY
clauses are harder to maintain. Many changes will require two edits of the code instead of one. Of course, this might or might not outweigh any of the advantages of a full GROUP BY
clause.
A full GROUP BY
clause might be slower than a partial one. The following query finds all film titles that gathered more than 300 $ worth of payments:
mysql> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300;
Empty set (0.18 sec)
Using only the film_id
column in the GROUP BY
clause, it takes 0.18
seconds to discover there are no film titles that satisfy this criterion. Now, let's compare this to the equivalent query using a full GROUP BY
clause:
mysql> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> , f.title
-> HAVING sum_amount > 300;
Empty set (0.51 sec)
This query takes almost three times as long to complete! With EXPLAIN
we can retrieve the execution plans for these queries. Without a full GROUP BY
clause, we see a fairly normal execution plan:
mysql> EXPLAIN
-> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300
-> \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: f
type: index
possible_keys: NULL
key: PRIMARY
key_len: 2
ref: NULL
rows: 953
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: idx_fk_film_id
key: idx_fk_film_id
key_len: 2
ref: sakila.f.film_id
rows: 2
Extra: Using index
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: r
type: ref
possible_keys: idx_fk_inventory_id
key: idx_fk_inventory_id
key_len: 3
ref: sakila.i.inventory_id
rows: 1
Extra: Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: fk_payment_rental
key: fk_payment_rental
key_len: 5
ref: sakila.r.rental_id
rows: 1
Extra:
4 rows in set (0.00 sec)
With a full GROUP BY
list, we notice a difference for the film
table:
mysql> EXPLAIN
-> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> , f.title
-> HAVING sum_amount > 300
-> \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: f
type: index
possible_keys: NULL
key: idx_title
key_len: 767
ref: NULL
rows: 953
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: idx_fk_film_id
key: idx_fk_film_id
key_len: 2
ref: sakila.f.film_id
rows: 2
Extra: Using index
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: r
type: ref
possible_keys: idx_fk_inventory_id
key: idx_fk_inventory_id
key_len: 3
ref: sakila.i.inventory_id
rows: 1
Extra: Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: fk_payment_rental
key: fk_payment_rental
key_len: 5
ref: sakila.r.rental_id
rows: 1
Extra:
4 rows in set (0.01 sec)
In case you did not yet notice, the film
table now has Extra: Using index; Using temporary; Using filesort
.
What I think that is happening is that MySQL takes the GROUP BY
clause literally and performs the GROUP BY
algorithm for each of the specified expressions. MySQL implements GROUP BY
by sorting the rows according to the GROUP BY
expressions. In this particular case, adding the title
column to the GROUP BY
clause does not allow the server to sort the rows in-memory, and forces the GROUP BY
to be evaluated using a temporary table and a file sort. This requires extra IO operations which is causing a decrease in performance.
Of course, it would be nice if MySQL was smart enough to deduce that the result cannot possibly be different due to including the title
column in the GROUP BY
list. It could attempt to detect that a key of the film
table is included in the GROUP BY
list and that the title
column can be completely ignored when evaluating the GROUP BY
clause, because there will be exactly one value in the title
column for each film_id
column. But, then again, MySQL does not require us to write a full group by list. So if performance is paramount, be smart and do not write a full GROUP BY
clause.
Aggregating on functionally dependent columns
We just argued that it is safe to include columns in the SELECT
list as long as these columns are functionally dependent upon the GROUP BY
list. The reasoning is that since the columns are functionally dependent upon the GROUP BY
clause, there will be one value for each result group anyway. It was also shown that it may be a bad idea to include these columns in the GROUP BY
clause, as it can hurt performance.
For some people, it may still seem unacceptable to include functionally dependent columns in the SELECT
list without referencing these columns in the GROUP BY
clause. For example, you might be using a rdbms that requires all columns that are not referenced in the GROUP BY
to be aggregated. In those cases is might be better to apply an aggregate function to the functionally dependent column rather than including it in the GROUP BY
clause. Consider the following query:
mysql> SELECT f.film_id
-> , MAX(f.title) AS title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300;
Empty set (0.20 sec)
Again, we GROUP BY
on the film_id
column, which makes up the primary key of the film
table. This time however, we apply the MAX
aggregate function on the title
column. We know that there is only one value in the title
column for each value in the film_id
column, so the aggregation will not influence the result. In fact, we could've used MIN
equally well.
These aggregate functions will return the right value for exactly the same reason why it is safe to not include the functionally dependent column in the GROUP BY
clause. Technically the aggregation is therefore unnecessary: it is just a trick to fool the rdbms.
Most likely, applying the aggregate function will be somewhat slower than not applying the aggregate. However, it will in most cases be faster than including the functionally dependent column in the GROUP BY
clause.
Conclusion
Contrary to popular belief, the SQL standard does not require GROUP BY
queries to reference all non-aggregated columns from the SELECT
list in the GROUP BY
clause. As of the 1999 version of the SQL standard, it is explicitly allowed for the SELECT
list to reference non-aggregated expressions as long as they are functionally dependent upon the GROUP BY
list.
Each expression that has exactly one value for each group defined by the GROUP BY
clause is functionally dependent upon the GROUP BY
clause. Functional dependencies can be witnessed in a common query pattern: whenever we have a join between a master and a detail table to calculate aggregates over the detail rows for each row from the master, we can GROUP BY
over the primary or unique key from the master. All non-key columns of the master row will be functionally dependent upon the key, and can thus appear in the SELECT
-list outside an aggregate expression.
In MySQL, one can write GROUP BY
queries that reference non-aggregated columns in the SELECT
list that are not included in the GROUP BY
clause, even if these columns are not functionally dependent upon the GROUP BY
clause. This behaviour conforms to none of the SQL standard's versions. It is possible to avoid this behaviour by including ONLY_FULL_GROUP_BY
in the sql_mode
server setting, but it might make more sense to take advantage of the ability to write only partial GROUP BY
clauses.
In a nutshell:
- It is completely safe to write partial
GROUP BY
clauses as long as all non-aggregated columns in the SELECT
list are functionally dependent upon the GROUP BY
clause.
- A partial
GROUP BY
list can result in better performance, because it keeps the server from evaluating the entire GROUP BY
list.
- If one does not want to write partial
GROUP BY
clauses, consider using MIN
or MAX
to 'aggregate' the functionally dependent columns in the SELECT
list rather than moving the functionally dependent columns to the GROUP BY
clause.