UPDATE
Thanks for your interest in this article. Please note that it was written a long time ago and that MySQL has progressed enormously since. Check MySQL 5.7.5: GROUP BY respects functional dependencies! for more info on a more current state of affairs. This original article will remain in place to provide background information.Original article below:
Debunking GROUP BY Myths
There is a popular myth about the SQLGROUP BY
clause. The myth holds that 'standard SQL' requires columns referenced in the SELECT
list of a query to also appear in the GROUP BY
clause, unless these columns appear exclusively in an aggregated expression. MySQL is often accused of violating this standard.In this article I will attempt to debunk this myth, and to provide a more balanced view regarding MySQL's treatment of
GROUP BY
at the same time.To do that, I will first demonstrate that MySQL can be instructed to only accept
GROUP BY
clauses that include all non-aggregated expressions referred to in the SELECT
list, thus making MySQL's behaviour conform more to that of other well-known rdbms-products. Second, I will show that it is very important to clearly define which version of the SQL-standard is being referred to. The two most recent versions use a rather sophisticated way of defining the required relationships between expressions appearing in the
GROUP BY
clause and the SELECT
list. Contrary to a popular belief, these standards do not literally require that all non-aggregated SELECT
list columns appear in the GROUP BY
clause.Third, I will use a simple yet realistic example to illustrate in an informal manner what I believe is the intent expressed in the more recent versions of the SQL standard. Hopefully I will be able to convince you why it may even be better to not blindly include all non-aggregated columns from the
SELECT
list in the GROUP BY
clause.Before we dive into the details, I'll start with a brief introduction with regard to
GROUP BY
for those that are not too familiar at all with the construct. In the introduction, I will illustrate why most database products require all non-aggregated columns that are referenced in the SELECT
list to appear in the GROUP BY
clause, and why users run into trouble sometimes due to MySQL's treatment of GROUP BY
clause.The GROUP BY
-clause
So, what is the
GROUP BY
-clause, and what does it do? The
GROUP BY
-clause is an optional element of SQL SELECT
expressions. Syntactically, the GROUP BY
-clause consists of the keyword sequence GROUP BY
, followed by a comma-separated list of (scalar) expressions. If the SELECT
expression contains a GROUP BY
-clause, it must appear after the WHERE
clause. (If the WHERE
clause is omitted, the GROUP BY
clause will immediately follow after the FROM
clause.)When included,
GROUP BY
specifies that rows from the intermediate result set are to be divided in a number of groups, returning one single row for each such group. The list of expressions provided in the GROUP BY
list defines how the grouping takes place. All rows that have the same combination of values for all expressions specified in the GROUP BY
are in the same group.Let's do a few simple example queries to illustrate the effect of the
GROUP BY
-clause. (For these examples I'll use the pet
table from the menagerie
database.) The following query retrieves all rows from the pet
table in the menagerie
database:
SELECT *
FROM menagerie.pet
The query returns a result that might look like this:
+----------+--------+---------+------+------------+------------+
| name | owner | species | sex | birth | death |
+----------+--------+---------+------+------------+------------+
| Fluffy | Harold | cat | f | 1993-02-04 | NULL |
| Claws | Gwen | cat | m | 1994-03-17 | NULL |
| Buffy | Harold | dog | f | 1989-05-13 | NULL |
| Fang | Benny | dog | m | 1990-08-27 | NULL |
| Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 |
| Chirpy | Gwen | bird | f | 1998-09-11 | NULL |
| Whistler | Gwen | bird | NULL | 1997-12-09 | NULL |
| Slim | Benny | snake | m | 1996-04-29 | NULL |
| Puffball | Diane | hamster | f | 1999-03-30 | NULL |
+----------+--------+---------+------+------------+------------+
(Because we did not specify an
ORDER BY
-clause, the rows are returned in some order determined by the database, so your results might not look exactly like this. However, for this example, the actual rows are important - not the order.)Now, suppose we want to make groups for each species. The following addition of the
GROUP BY
-clause does just that:
SELECT species
FROM menagerie.pet
GROUP BY species -- make one group for each species
The query returns this result:
+---------+
| species |
+---------+
| bird |
| cat |
| dog |
| hamster |
| snake |
+---------+
At a glance, it seems as if the
GROUP BY
clause does nothing more than scan for unique occurrences in the species
column and return those. However, it is better to think of each row in the GROUP BY
result as a summary row that represents a group of rows that have the same value in the species
column. So, in this case, the bird
row represents the group of pets that are birds, ("Chirpy" and "Whistler"); the cat
row represents the group of pets that are cats ("Fluffy" and "Claws"), and so on and so forth.Calculating Aggregates for a group of rows
A
GROUP BY
query allows one to apply aggregate functions on the collection of rows associated with each group defined by the GROUP BY
clause. An aggregate function can process expressions for each row in a group of rows to compute a single return value. A number of well-known standard aggregate functions are COUNT
, MIN
, MAX
, and SUM
. (Aggregate functions can also be used without a
GROUP BY
clause, in which case the entire intermediate resultset is treated as one big group. Try imagining the effect of the GROUP BY
operation with an empty GROUP BY
list: the query will return just one row that summarizes all rows from the intermediate result set).Expanding our previous example query for
GROUP BY
, the following example illustrates the effect of some of these aggregate functions:
SELECT species
, GROUP_CONCAT(name) -- make a list of pets per species
, COUNT(*) -- count pets per species
, MIN(birth) -- birthdate of oldest pet per species
, MAX(birth) -- birthdate of youngest pet per species
FROM menagerie.pet
GROUP BY species
This example also includes usage of the MySQL specific
GROUP_CONCAT
aggregate function which will prove to be very useful to illustrate the effect of the GROUP BY
-clause.The result looks something like this:
+---------+--------------------+----------+------------+------------+
| species | GROUP_CONCAT(name) | COUNT(*) | MIN(birth) | MAX(birth) |
+---------+--------------------+----------+------------+------------+
| bird | Chirpy,Whistler | 2 | 1997-12-09 | 1998-09-11 |
| cat | Fluffy,Claws | 2 | 1993-02-04 | 1994-03-17 |
| dog | Buffy,Fang,Bowser | 3 | 1979-08-31 | 1990-08-27 |
| hamster | Puffball | 1 | 1999-03-30 | 1999-03-30 |
| snake | Slim | 1 | 1996-04-29 | 1996-04-29 |
+---------+--------------------+----------+------------+------------+
Again, we see one row for each group of rows that have an identical value in the
species
column, but this time, we also see the effect of processing the individual rows for each species using aggregate functions:- The
GROUP_CONCAT
function was applied to thename
column. For each species in our pet table, there maybe multiple pets, andGROUP_CONCAT
concatenates their names, separating the individual names by default with a comma. Thus in this example, theGROUP_CONCAT
expression reveals the make-up of each group of pets of a single species. - The
COUNT
function is used with a wildcard*
, instructing it to count the number of rows associated with each group. Verifying this with the previous column, we can immediately see that the number of rows in the group is consistent with the number of names concatenated by theGROUP_CONCAT
expression - The
MIN
andMAX
functions are applied to thebirth
column and for each species, respectively report the birth date of whichever pet is oldest (MIN(birth)
, the birth date that is smaller than any of the other birth dates) and youngest (MAX(birth)
, the birth date that is larger than any of the other birth dates).
So, aggregate functions are applied to expressions taken from a group of rows and have the effect of 'condensing' (aggregating) the group, yielding a single value. Most aggregate functions calculate or determine some kind of statistical metric which serves to characterize the group as a whole. The MySQL specific
GROUP_CONCAT
function is an exception: it simply enumerates all members in the group passed to the function, and as such it is not a statistical function. However, it still exposes the main property of aggregate functions, namely the ability to turn the expressions from a group of rows into a single value.Running into trouble with GROUP BY
So far, we've seen a few examples with
GROUP BY
that make perfect sense. Yet is easy to run into trouble with GROUP BY
. Take a look at the following query:
SELECT species
, MIN(birth) -- birthdate of oldest pet per species
, MAX(birth) -- birthdate of youngest pet per species
, birth -- birthdate of ... uh oh...!
FROM menagerie.pet
GROUP BY species
This query is similar to the previous query, where we calculated a few aggregates for each group of pets belonging to the same species. However, this time, we also include the a plain reference to the
birth
column in the SELECT
list.An attempt to run this query on Oracle results in an error:
SQL> SELECT species
2 , MIN(birth)
3 , MAX(birth)
4 , birth
5 FROM sakila.pet
6 GROUP BY species;
, birth
*
ERROR at line 4:
ORA-00979: not a GROUP BY expression
Running this query on MySQL however does return a result, which may look something like this:
mysql> SELECT species
-> , MIN(birth) -- birthdate of oldest pet per species
-> , MAX(birth) -- birthdate of youngest pet per species
-> , birth -- birthdate of ... uh oh...!
-> FROM menagerie.pet
-> GROUP BY species;
+---------+------------+------------+------------+
| species | MIN(birth) | MAX(birth) | birth |
+---------+------------+------------+------------+
| bird | 1997-12-09 | 1998-09-11 | 1998-09-11 |
| cat | 1993-02-04 | 1994-03-17 | 1993-02-04 |
| dog | 1979-08-31 | 1990-08-27 | 1989-05-13 |
| hamster | 1999-03-30 | 1999-03-30 | 1999-03-30 |
| snake | 1996-04-29 | 1996-04-29 | 1996-04-29 |
+---------+------------+------------+------------+
5 rows in set (0.00 sec)
What is happening here? Why do we see such different behaviours? In fact, what is MySQL's behaviour in this case? Sometimes, the
birth
column reports a value that looks like the maximum value for birth
within the species (first row), and sometimes we see the maximum value (second row). We even see one case where the returned value is in between the minimum and maximum values (row 3). How can we explain this seemingly random behaviour?Understanding the Problem
It's not too hard to deduce what is happening here. All we need to do is go back to our explanation of the effect of the
GROUP BY
clause, and see how it applies to our last query. It was already explained that the
GROUP BY
clause returns one row for each group of rows in the intermediate result, and that the groups are defined by the expression list defined in the GROUP BY
clause. So, in this case, we are creating one result row for each group of rows that belong to the same species because the GROUP BY
list only contains the species
column. Yet there are several pets that may belong to a specific species, so the birth
column may have (and often has) a different value for each row in a particular species group. Having realized that, we can now ask ourselves: assuming there are multiple values in the
birth
column for a particular value of species
, which one should be returned? What did we mean when we specified the birth
column in the SELECT
list?There is no good answer to this question. It is certainly possible to select just one of the possible values for the
birth
column: in fact, this is exactly what MySQL does. However, it is impossible to define the significance of which ever of the possible values is chosen. That is, it makes no sense to want to mix the plain values from the rows that belong to the group with the group itself. Therefore it does not make sense to even include the birth
column in the SELECT
list of the query.Another way of looking at it is to say that the 'grain' of
species
values is different (and therefore, incompatible) with the grain of the birth
values. It does not mean we cannot access the values in the birth
column; it merely means we must use an aggregate function to compute the 'right' value from a whole group of them. Avoiding the Problem
What about the behaviour such as exercised by Oracle? Doesn't it make more sense to issue an error message rather than returning non-sense data? Put this way, most people will probably agree. Of course, the error message itself is somewhat puzzling:
, birth
*
ERROR at line 4:
ORA-00979: not a GROUP BY expression
This seems to suggest that the problem is that the
birth
column in the SELECT
list is not included in the GROUP BY
clause. In turn, this raises the question whether the problem would be solved if we would have included the birth
column in the GROUP BY
clause. Well, including the
birth
column in the GROUP BY
clause certainly gets rid of the error message. However, a lot of users with entry level skills in SQL fail to understand that this yields quite a different query. The original GROUP BY species
yields one group for each species, whereas GROUP BY species, birth
yields a group for each combination of values in species
and birth
- most probably not at all what is intended.On the other hand, we cannot expect the database management system to know what we were thinking when we included the
birth
in the SELECT
list in the first place. So, despite that the error message may seem a bit puzzling, it is still preferable over silently returning non-sense data. But do we really have to put up with this? The answer is "No!".
Including ONLY_FULL_GROUP_BY
in MySQL's sql_mode
Nowadays, MySQL is capable of detecting this problem too, and it is perfectly possible to make MySQL reject the previous query to avoid the problem of returning non-sense data. This is achieved by including
ONLY_FULL_GROUP_BY
in the sql_mode
.Like many server settings, we can specify the
sql_mode
using the --sql-mode
command line argument to the MySQL server executable (mysqld
), or we can include it in an option file. For example, including the following line in the option file will enable ONLY_FULL_GROUP_BY
when the server starts up:
sql_mode=ONLY_FULL_GROUP_BY
Beginning with MySQL 4.1, it is also possible to set the
sql_mode
at runtime using the SET
syntax. In this way, the sql_mode
can be set globally or for the session level. The latter is the most useful, as it allows one to setup a sql_mode
most suited for a particular application without affecting any other applications that run on the server. (Some applications don't expect anything other than the default setting and may run into trouble with a particular sql_mode
.)The following snippet illustrates how to include
ONLY_FULL_GROUP_BY
in the sql_mode
at runtime:
mysql> SET sql_mode := CONCAT('ONLY_FULL_GROUP_BY,',@@sql_mode);
Query OK, 0 rows affected (0.00 sec)
The
@@sql_mode
server variable contains a possibly empty, comma-separated string of current sql_mode
settings. The CONCAT
expression prepends whatever the current setting is of the sql_mode
with ONLY_FULL_GROUP_BY
. Note the comma immediately following ONLY_FULL_GROUP_BY
. If the value of @@sql_mode
is the empty string, the value of the CONCAT
expression will have a trailing comma, but this is allowed in the assignment (trimming off the comma in the process).When we now attempt to execute our query again, it fails with an error message:
mysql> SELECT species
-> , MIN(birth) -- birthdate of oldest pet per species
-> , MAX(birth) -- birthdate of youngest pet per species
-> , birth -- birthdate of ... uh oh...!
-> FROM menagerie.pet
-> GROUP BY species;
ERROR 1055 (42000): 'menagerie.pet.birth' isn't in GROUP BY
The error messages indicates that we did not include the
birth
column in the GROUP BY
clause. Now, MySQL behaves similar to Oracle for this query.(Alas, MySQL's
ONLY_FULL_GROUP_BY
is not as clever as it should be, and there are particular cases where ONLY_FULL_GROUP_BY
is too restrictive in enforcing only full GROUP BY
clauses. The details are described here. The good news is that the community can help to fix this bug! Go to MySQL Forge and check out Worklog task 2489.)What does 'the' SQL standard say
In the previous sections, we've seen how Oracle and MySQL react very differently to the same SQL
GROUP BY
query. But what do the standards say? How is the GROUP BY
clause supposed to behave? In the introduction of this article, I claimed I would debunk a popular myth that holds that
...standard SQL requires columns referenced in theSELECT
list of a query to also appear in theGROUP BY
clause, unless these columns appear exclusively in an aggregated expression.
Now, I don't want to pretend I'm an expert as far as the SQL standard (ISO/IEC 9075) is concerned. In fact, I've noticed repeatedly that the sheer volume of the documentation as well as the persistent formal wording prevent me from obtaining a clear overview of it. But, let's give it a try anyway.
The 1992 version of the standard, 7.9 - 7 states that:
If T is a grouped table, then each <column reference> in each <value expression> that references a column of T shall reference a grouping column or be specified within a <set function specification>.As said, I'm not an expert in this area, but the way I read it it boils down to:
- Queries that include a
GROUP BY
clause can only include column references inSELECT
-ed expressions if the column appears in theGROUP BY
clause, or if that column appears as part of an aggregate.
Now, in the 7.12 - 15 of the 2003 version of the standard we find this:
If T is a grouped table, then let G be the set of grouping columns of T. In each <value expression> contained in <select list>, each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a <set function specification> whose aggregation query is QS.
The 1999 version of the standard contains a similar rule. The important thing to note here is that both versions that succeeded the 1992 version stopped requiring explicitly that all non-aggregated columns in the
SELECT
list must be present in the GROUP BY
clause. Instead they require that any non-aggregated column appearing in the SELECT
list is functionally dependent upon the GROUP BY
clause.Functional dependencies
What would the 1999 and 2003 version of the SQL standard mean by the term "functionally dependent"? The answer to that question is also defined by the standard. Unfortunately, it cannot be illustrated by a simple quote, as the formal definition of what exactly constitutes a functional dependency according to the standard is fairly extensive and complicated.
Luckily, the concept of functional dependencies can be easily illustrated in a less formal way. Suppose we have two expressions, A and B. Now, B is functionally dependent upon A if B has exactly one value for a particular value of A. Consider this snippet of code:
mysql> SELECT @A:=1 AS A
-> , @B:=@A + 1 AS B;
+---+------+
| A | B |
+---+------+
| 1 | 2 |
+---+------+
Here, the column
B
is functionally dependent upon column A
. The value of B
can be derived from the value of A
in a very straightforward manner, namely by adding 1
to whatever the value of B
for any given value of A
, and for any given value of A
, the corresponding value of B
will always be the same. The functional dependency concept can also be be applied to multiple columns:
SELECT CONCAT(A,B) C
FROM someTable
Column
C
is defined as the result of CONCAT(A,B)
expression. If we have the values of both A
and B
we can compute the result of CONCAT(A,B)
. Of course, for a given pair of expressions for A
and B
the result of CONCAT(A,B)
will always be the same. Therefore, C
is functionally dependent upon the column pair A
and B
.Note that it is not enough just to only know the method to calculate
B
out of A
. Consider this example:
mysql> SELECT @A:=1 AS A
-> , @B:=@A + RAND() AS B;
Here, we know the recipe to derive the value of
A
from the value of A
: we have to take the value of A
and add the value returned by a call to the RAND()
function. However, the RAND()
function will return a different value each time it is called. Therefore, the value of B
will be different too every time, and hence we cannot say that B
is functionally dependent upon A
. There is thus not a single value for B
for a given value of A
, and thus B
is not functionally dependent upon A
.Functional dependency and Normalization
The term
functional depencency
is also used with regard to normalization. Part of the normalization process involves discovering functional dependencies between different groups of columns. Normalization requires that each table has at least one key. A key is a column or group of columns that may be used to identify a single record in the table. By definition, if we have a key, all non-key columns are functionally dependent upon the key. Another way to think about this is to imagine that we look up a row using a key. If the key entry exists, this will result in exactly one row. By definition, each column in that row has exactly one value, so the value of each non-key column can be determined by the key.
Functional dependencies between a group of columns that makes up a key and any other group of columns is allowed, but functional dependencies between two groups of non-key columns are eliminated by the normalization process. (This is done by splitting off the groups of columns that expose the functional dependency to a new table, making one of the groups of columns a key in the new table.)
Functional dependency and GROUP BY
We have just seen that the 1999 and 2003 versions of the SQL standard require that the columns appearing in the
SELECT
list are functionally dependent upon the groups defined by the GROUP BY
clause. In other words, if we know that a column contains only one value for any given combination of values in the columns appearing in the GROUP BY
clause, we may reference the column in the SELECT
list even if it does not appear in an aggregate expression. We've also seen that if we have a (primary or unique) key, all columns that are not included in the key are by definition functionally dependent upon the key. This means that if we include all key columns in the
GROUP BY
clause, we can reference any column we like in the SELECT
list, even if they appear outside an aggregate expression.The following example from the sakila sample database might help to illustrate this:
mysql> SELECT film_id -- primary key
-> , title -- non-key column
-> , COUNT(*) -- one row per group
-> FROM sakila.film
-> GROUP BY film_id; -- group by on primary key
+---------+-----------------------------+----------+
| film_id | title | COUNT(*) |
+---------+-----------------------------+----------+
| 1 | ACADEMY DINOSAUR | 1 |
. . . .
. . . .
[...not showing 998 rows...]
. . . .
. . . .
| 1000 | ZORRO ARK | 1 |
+---------+-----------------------------+----------+
1000 rows in set (0.05 sec)
Here, we query the
film
table. The primary key of the film
table consists of only the film_id
column. The GROUP BY
clause contains the film_id
column. As a result, the query returns a collection of groups, each of which summarizes only one row. Of course, because there is only one row per group, there can be only one value in each of the other columns of the film
table. Therefore, it is safe to include whatever column we like in the SELECT
list. For this reason, it is perfectly ok to include the film_title
column in the SELECT
list.Of course, the
GROUP BY
in the previous query does not make sense logically. Because the primary key of the film
table consists of only the film_id
column, we already know there can be only one row for any given value of film_id
. However, it becomes interesting when we include another table in the query. Consider the next example:
mysql> SELECT f.film_id
-> , f.title
-> , COUNT(fa.actor_id)
-> FROM film f
-> LEFT JOIN film_actor fa
-> ON f.film_id = fa.film_id
-> GROUP BY f.film_id;
+---------+-----------------------------+----------+
| film_id | title | COUNT(*) |
+---------+-----------------------------+----------+
| 1 | ACADEMY DINOSAUR | 10 |
. . . .
. . . .
[...not showing 998 rows...]
. . . .
. . . .
| 1000 | ZORRO ARK | 3 |
+---------+-----------------------------+----------+
1000 rows in set (0.02 sec)
Here, we have added a
LEFT JOIN
to calculate the number of actors per film. This time, the GROUP BY
clause on the film_id
does make sense: we now get one group of actors that play a role in each film. At the same time, we know that all columns in the film
table are functionally dependent upon the film_id
column. Each group returned by the GROUP BY
clause corresponds to exactly one row from the film
table, and this means that for each group, there is only one value in any column of the film
table. So, it's perfectly safe to reference the columns from the film
table in the SELECT
list, even if we don't use them in an aggregate expression.It is important to realize that we cannot reference just any column in the
SELECT
list: we can only reference those columns that are functionally dependent upon the film_id
column of the film
table. This means that it is wrong to reference any column of the film_actor
table in the SELECT
list directly: we may do so only in an aggregate expression.The previous example demonstrates a pattern. The
film
table acts as the so-called master, and the film_actor
table acts as the detail. The master-detail pattern is very common: Order and Order Items, Vendor and Products, Country and Cities are all examples of this pattern. So why would I do that?
Ok, hopefully, I've been able to explain under which circumstances it is safe for
GROUP BY
queries to reference columns in the SELECT
list directly. One might wonder though what the advantages and disadvantages are. I mean, just because you can doesn't mean you should, right?Disadvantages?
Well, there certainly are reasons for always writing a full
GROUP BY
clause. First of all, many rdbms products will only allow a full
GROUP BY
clause anyway, so in those cases there really is no choice. In MySQL however, we do have the choice as long as we are not using ONLY_FULL_GROUP_BY
in the sql_mode
. Another reason for always writing the
GROUP BY
clause in full is that other developers might not understand under which circumstances it is ok to use a partial GROUP BY
clause. In many cases, they've spent considerable time to learn to blindly repeat all SELECT
columns in the GROUP BY
clause, and they will usually point out that it is wrong to not adhere to that rule. Of course, it is impossible to distinguish between a query that intentionally omits columns from the
GROUP BY
clause and one that accidentally forgot to include them. When it is the intention to always write a full GROUP BY
clause, it is easy to verify whether the column references in the GROUP BY
clause and the the SELECT
list match. I have heard people argue that including only the key in the
GROUP BY
clause will lead to problems when the definition of the key is changed. Personally, I think this is a bogus argument. When you are considering to change the definition of the key, you are most likely going to review all your queries anyway, because all your joins will need to reflect this change too. I just mean to say that changing a few GROUP BY
's here and there is probably least of your problems when you are considering to change the definition of a key.Another argument I have heard is that it is somehow 'more clear', 'cleaner' or 'prettier' to repeat all columns referenced in the
SELECT
list in the GROUP BY
clause. Personally, I think this is a bogus argument too. At the very least, these are all a matter of opinion.Advantages
Personally, I feel it is more clear and prettier to
GROUP BY
only on key columns where possible. I argued that it is a matter of opinion what is 'clear' or 'pretty', so I must discard this argument likewise.I would argue that full
GROUP BY
clauses are harder to maintain. Many changes will require two edits of the code instead of one. Of course, this might or might not outweigh any of the advantages of a full GROUP BY
clause.A full
GROUP BY
clause might be slower than a partial one. The following query finds all film titles that gathered more than 300 $ worth of payments:
mysql> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300;
Empty set (0.18 sec)
Using only the
film_id
column in the GROUP BY
clause, it takes 0.18
seconds to discover there are no film titles that satisfy this criterion. Now, let's compare this to the equivalent query using a full GROUP BY
clause:
mysql> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> , f.title
-> HAVING sum_amount > 300;
Empty set (0.51 sec)
This query takes almost three times as long to complete! With
EXPLAIN
we can retrieve the execution plans for these queries. Without a full GROUP BY
clause, we see a fairly normal execution plan:
mysql> EXPLAIN
-> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300
-> \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: f
type: index
possible_keys: NULL
key: PRIMARY
key_len: 2
ref: NULL
rows: 953
Extra:
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: idx_fk_film_id
key: idx_fk_film_id
key_len: 2
ref: sakila.f.film_id
rows: 2
Extra: Using index
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: r
type: ref
possible_keys: idx_fk_inventory_id
key: idx_fk_inventory_id
key_len: 3
ref: sakila.i.inventory_id
rows: 1
Extra: Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: fk_payment_rental
key: fk_payment_rental
key_len: 5
ref: sakila.r.rental_id
rows: 1
Extra:
4 rows in set (0.00 sec)
With a full
GROUP BY
list, we notice a difference for the film
table:
mysql> EXPLAIN
-> SELECT f.film_id
-> , f.title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> , f.title
-> HAVING sum_amount > 300
-> \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: f
type: index
possible_keys: NULL
key: idx_title
key_len: 767
ref: NULL
rows: 953
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: i
type: ref
possible_keys: idx_fk_film_id
key: idx_fk_film_id
key_len: 2
ref: sakila.f.film_id
rows: 2
Extra: Using index
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: r
type: ref
possible_keys: idx_fk_inventory_id
key: idx_fk_inventory_id
key_len: 3
ref: sakila.i.inventory_id
rows: 1
Extra: Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: p
type: ref
possible_keys: fk_payment_rental
key: fk_payment_rental
key_len: 5
ref: sakila.r.rental_id
rows: 1
Extra:
4 rows in set (0.01 sec)
In case you did not yet notice, the
film
table now has Extra: Using index; Using temporary; Using filesort
. What I think that is happening is that MySQL takes the
GROUP BY
clause literally and performs the GROUP BY
algorithm for each of the specified expressions. MySQL implements GROUP BY
by sorting the rows according to the GROUP BY
expressions. In this particular case, adding the title
column to the GROUP BY
clause does not allow the server to sort the rows in-memory, and forces the GROUP BY
to be evaluated using a temporary table and a file sort. This requires extra IO operations which is causing a decrease in performance.Of course, it would be nice if MySQL was smart enough to deduce that the result cannot possibly be different due to including the
title
column in the GROUP BY
list. It could attempt to detect that a key of the film
table is included in the GROUP BY
list and that the title
column can be completely ignored when evaluating the GROUP BY
clause, because there will be exactly one value in the title
column for each film_id
column. But, then again, MySQL does not require us to write a full group by list. So if performance is paramount, be smart and do not write a full GROUP BY
clause.Aggregating on functionally dependent columns
We just argued that it is safe to include columns in the
SELECT
list as long as these columns are functionally dependent upon the GROUP BY
list. The reasoning is that since the columns are functionally dependent upon the GROUP BY
clause, there will be one value for each result group anyway. It was also shown that it may be a bad idea to include these columns in the GROUP BY
clause, as it can hurt performance.For some people, it may still seem unacceptable to include functionally dependent columns in the
SELECT
list without referencing these columns in the GROUP BY
clause. For example, you might be using a rdbms that requires all columns that are not referenced in the GROUP BY
to be aggregated. In those cases is might be better to apply an aggregate function to the functionally dependent column rather than including it in the GROUP BY
clause. Consider the following query:
mysql> SELECT f.film_id
-> , MAX(f.title) AS title
-> , sum(p.amount) sum_amount
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> LEFT JOIN rental r
-> ON i.inventory_id = r.inventory_id
-> LEFT JOIN payment p
-> ON r.rental_id = p.rental_id
-> GROUP BY f.film_id
-> HAVING sum_amount > 300;
Empty set (0.20 sec)
Again, we
GROUP BY
on the film_id
column, which makes up the primary key of the film
table. This time however, we apply the MAX
aggregate function on the title
column. We know that there is only one value in the title
column for each value in the film_id
column, so the aggregation will not influence the result. In fact, we could've used MIN
equally well. These aggregate functions will return the right value for exactly the same reason why it is safe to not include the functionally dependent column in the
GROUP BY
clause. Technically the aggregation is therefore unnecessary: it is just a trick to fool the rdbms.Most likely, applying the aggregate function will be somewhat slower than not applying the aggregate. However, it will in most cases be faster than including the functionally dependent column in the
GROUP BY
clause.Conclusion
Contrary to popular belief, the SQL standard does not require
GROUP BY
queries to reference all non-aggregated columns from the SELECT
list in the GROUP BY
clause. As of the 1999 version of the SQL standard, it is explicitly allowed for the SELECT
list to reference non-aggregated expressions as long as they are functionally dependent upon the GROUP BY
list. Each expression that has exactly one value for each group defined by the
GROUP BY
clause is functionally dependent upon the GROUP BY
clause. Functional dependencies can be witnessed in a common query pattern: whenever we have a join between a master and a detail table to calculate aggregates over the detail rows for each row from the master, we can GROUP BY
over the primary or unique key from the master. All non-key columns of the master row will be functionally dependent upon the key, and can thus appear in the SELECT
-list outside an aggregate expression.In MySQL, one can write
GROUP BY
queries that reference non-aggregated columns in the SELECT
list that are not included in the GROUP BY
clause, even if these columns are not functionally dependent upon the GROUP BY
clause. This behaviour conforms to none of the SQL standard's versions. It is possible to avoid this behaviour by including ONLY_FULL_GROUP_BY
in the sql_mode
server setting, but it might make more sense to take advantage of the ability to write only partial GROUP BY
clauses. In a nutshell:
- It is completely safe to write partial
GROUP BY
clauses as long as all non-aggregated columns in theSELECT
list are functionally dependent upon theGROUP BY
clause. - A partial
GROUP BY
list can result in better performance, because it keeps the server from evaluating the entireGROUP BY
list. - If one does not want to write partial
GROUP BY
clauses, consider usingMIN
orMAX
to 'aggregate' the functionally dependent columns in theSELECT
list rather than moving the functionally dependent columns to theGROUP BY
clause.
25 comments:
i love the article!!! congratulations
one thing that is not clear -- in the conclusion, there are three paragraphs, and paragraph 2 seems to contradict paragraph 1
in particular, "This behaviour conforms to none of the SQL standard's versions."
however, in the first paragraph, you do seem to say that the SQL99 standard allows it "as long as they are functionally dependent upon the GROUP BY list."
could you please clear up this discrepancy for me?
thanks again for a wonderful article
i shall be referring people to it frequently
Rudy,
thanks for your comment! You are right, thanks for mentioning it. The sentence in the 2nd paragraph:
"In MySQL, one can write GROUP BY queries that reference non-aggregated columns in the SELECT list that are not included in the GROUP BY clause, even if these columns are not functionally dependent upon the GROUP BY clause."
I forgot the trailing: "...even if these columns are not functionally dependent upon the GROUP BY clause."
So, MySQL will allow a query like:
SELECT f.film_id
FROM sakila.film f
GROUP BY f.rating
Even though f.film_id is not functionally dependent upon f.rating.
This behaviour can be avoided by including ONLY_FULL_GROUP_BY in the server sql_mode.
wonderful article. great topic. well written.
Thanks a lot. I wrote a View last week, which was similar to your 'master' and 'detail' example. I was under the impression that the View was perfect until I read your blog.
Thanks again for clearing all doubts.
Absolute masterpiece !
This is the best article with regards to RDBMS I have ever read.
Well structured, well thought, well written, well done !
It has just landd in my favourites to share it with my friends.
Thanks a lot !
Jeff, Rishi, Serid: Thanks! I appreciate it a lot :)
I have come across this article a little late. Nevertheless. i got my concepts cleared. Thanks a lot
Just found this article - very interesting read (now I'm off to check SQL Server, as that's the dB I use). Nice work.
Dear Sir;
We have two servers production and test.
On production we are using MySQL with below specifications
RHEL: Red Hat Enterprise Linux Server release 5.3 (Tikanga)
MySQL : mysql Ver 14.12 Distrib 5.0.45, for redhat-linux-gnu (x86_64) using readline 5.0
while
On test server we are using MySQL with below specifications
RHEL: Red Hat Enterprise Linux Server release 5.6 (Tikanga)
MySQL: mysql Ver 14.12 Distrib 5.0.77, for redhat-linux-gnu (x86_64) using readline 5.1
On production server when i am issuing below sql query and it is rejecting with an error
Query: Select type, count(*) from location
Error: #1140 - Mixing of GROUP columns (MIN(),MAX(),COUNT(),...) with no GROUP columns
On test server the same query is not giving us error.
In the same line when i referred MySQL documentation ,
I was told that we need to set SQL_MODE parameter on test server to generate
same error on both servers so I set it.
SQL_MODE=ONLY_FULL_GROUP_BY
But exactly after this an another query with Having clause start giving error on test server.
The error is "non-grouping field '' is used in HAVING clause"
This is also identified in the document.
http://dev.mysql.com/doc/refman/5.0/en/server-sql-mode.html#sqlmode_only_full_group_by
We never set such parameters on Production server or test server we are using almost default
setup.
Can you please share with me why such difference is their.
What made such difference between two MySQL version.
How we can solve this issue?
thanks
Devang
Hi Devang!
"Query: Select type, count(*) from location"
ouch. This does look like a wrong query. there should be a GROUP BY type clause.
"On test server the same query is not giving us error."
No, but the result will be unreliable. Silently.
"I was told that we need to set SQL_MODE parameter on test server to generate
same error on both servers so I set it."
I agree that this should make the test server give you the same error.
"But exactly after this an another query with Having clause start giving error on test server."
"What made such difference between two MySQL version."
First let me get this straight. Are you saying that the query that errs with the "non-grouping field '' is used in HAVING clause" is performing ok on the production server? If it doesn't fail there too, there is probably some bug in the (older) version you're running on the production server. If you want to prevent these surprises you should always make sure you're running the same MySQL version on both test and production servers.
"How we can solve this issue?"
Personally I disrecommend using the ONLY_FULL_GROUP_BY sqlmode. My main reason is that it is buggy, see: http://bugs.mysql.com/bug.php?id=8510
To prevent bad queries from being used, like "Query: Select type, count(*) from location"
I suggest you have code reviews.
If that is for some reason unacceptable to you, you can take the query that is causing the "non-grouping field '' is used in HAVING clause" error, and see if it makes any difference if you use the aggregate function in the HAVING clause instead of an alias. It's kinda hard to explain without having your specific query, but going by the example in the docs you cite: if this is the original query (that errs)
SELECT name, COUNT(name) AS c
FROM orders
GROUP BY name
HAVING c = 1;
You could try rewriting that to
SELECT name, COUNT(name) AS c
FROM orders
GROUP BY name
HAVING COUNT(name) = 1;
Thanks
Beep Beep
Burp Burp.
Spent my saturday reading this post and it was well worth the read. Certainly helped my understanding of writing SQL. Amazing how this was written in 2007, but i dare say not too many people really have a good grasp of these things till date. Thanks for putting this up and also updating.
Thanks! Really appreciate the kind words. Glad it was useful to you!
Excellent article!!! Hats off Bouman.. I really appreciate kind of understanding & explanation, I stopped my work to completely read and find the purpose of "GROUP BY" clause and functional dependency. I'm the victim of GROUP BY with non-aggregate/non dependent columns, but now happy to sort it out.
Dear Bouman,
I've just mentioned table below to recreate my situation. Please go through it..
items (item_id, name, category_id, price)
categories(category_id, name)
users(user_id, name, email, parent_id, super_user_id)
orders(order_id, total_amt, order_date, user_id)
ordered_items(order_item_id, order_id, item_id, quantity, amt, category_id)
SELECT *, SUM(total_amt) tamt from (SELECT DISTINCT (o.order_id), u.name, u.user_id , IF(u.parent_id = 0, u.user_id, u.parent_id) vendor_id, SUM(oi.amt) total_amt
FROM users u INNER JOIN orders o ON o.user_id = u.user_id INNER JOIN ordered_items oi ON o.order_id = oi.order_id
INNER JOIN categories c ON oi.category_id = c.category_id
WHERE u.super_user_id = ? GROUP BY oi.category_id, vendor_id) A GROUP BY vendor_id ORDER BY tamt DESC
This query has given me an error "sql_mode ONLY_FULL_GROUP_BY for order_id column " when disabled ONLY_FULL_GROUP_BY in sql_mode it worked.
Query explaination:
1) Wanted to list parent users with their total sales
2) category_id in query is to add condition ( oi.category_id = ?) in other scenario.
I just did a work around to get the output but not happy with the query.
Can you please help me to write this query more effective and standard one.
Hi Sailesh Jaiswal,
"1) Wanted to list parent users with their total sales"
What are "parent" users? I see a condition
WHERE u.super_user_id = ?
in your query. Is that it? No need to report any data about that "parent" user?
"2) category_id in query is to add condition ( oi.category_id = ?) in other scenario."
Not sure I understand. Do you mean that in some other case you need a query almost like this one, except with the need to filter for one particular category?
If that is the case, then you have a bit of a problem. Code reuse is good, but in the case of SQL it very often leads to suboptimal queries.
In this particular case I don't think it's possible to do those 2 things in one query. See, if you want to select the category so you can filter on it later, then you'll have to do the sum of sales per category. This is at odds of your requirement to calculate total sales.
Thank you Bouman.
Sorry to confuse you. Here is little explanatory content. There is Dealer (super_user_id) who has multiple Retailers (parent_id) and Retails have multiple SalePersons (user_id). Retails can sell items as well SalesPerson can sell items.
So when Dealer logs in, he wants what retailer total sales are. As per your guidance I've separated the queries for first & second requirement.
Now the first query appears like this.
SELECT SUM(total_amt), IF(parent_id = 0, u.user_id, u.parent_id) retailer_id, u.name FROM users u INNER JOIN orders o ON o.user_id = u.user_id WHERE super_user_id = ? GROUP BY retailer_id
This time again I got error below
"Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'u.name' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by"
Second query:
SELECT SUM(amt), IF(parent_id = 0, u.user_id, u.parent_id) retailer_id, u.name, category_id FROM users u INNER JOIN orders o ON o.user_id = u.user_id INNER JOIN ordered_items oi ON oi.order_id = o.order_id WHERE super_user_id = ? AND category_id = ? GROUP BY category_id, retailer_id
This time again I got error below
"Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'u.name' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by"
Please suggest
Sailesh Jaiswal, it seems to me that in both cases you need to add u.name to the GROUP BY list.
Thank you. It fixed the issue. It is understood that every column mentioned in select should be either used in group by or aggregated.
"It is understood that every column mentioned in select should be either used in group by or aggregated."
The point of the article is to show that that is *not* the case. Expressions that are functionally dependent upon the GROUP BY list *should* be selectable, and the SQL standard after '92 allows this for column expressions (not just arbitrary expressions)
In your example the name is not functionally dependent upon the group by list, so it is not allowed.
Please checkout Guilhelms blogs here http://mysqlserverteam.com/mysql-5-7-only_full_group_by-improved-recognizing-functional-dependencies-enabled-by-default/ and http://mysqlserverteam.com/when-only_full_group_by-wont-see-the-query-is-deterministic/
One of the best article about SQL that I've ever read.
Congratulation
Hi CRC, thanks for the kind words! Glad it was useful for you :)
Best article on Group BY.
Thank you
Post a Comment