- InnoDB improvements: Simplified tablespace recovery, support for spatial indexes, dynamic configuration of the
innodb_buffer_pool_sizeparameter(!), more efficient creation and rebuilding of indexes ("sorted index build")
- Several general improvements for spatial data, such as support for open polygons and functions to manipulate geohash values and GeoJSON documents
performance_schemaadditions and improvements, such as a new
user_variables_by_threadtable, addition of
WORK_ESTIMATEDcolumns to the stage event tables, improvements to the wait event tables, instrumentation for InnoDB memory allocation in the memory summary tables.
- Quite a bunch of optimizer improvements, such as better cost estimation for semi-join materialization, configurable cost model (by way of the
mysql.engine_costsystem tables) and more exact index statistics.
- Many improvements and additions that make replication more robust
- A more sane default SQL mode and
Now, one of the changes I'd like to highlight in this post is the improved GROUP BY support.
More than 7 years ago, I wrote an article on this blog called Debunking GROUP BY Myths. The article is basically an explanation of the syntax and semantics of the SQL
GROUP BY behavior before MySQL 5.7.5
GROUP BYclause, with a focus on MySQL particular non-standard implementation.
Before MySQL 5.7.5, MySQL would by default always aggregate over the list of expressions that appear in the
GROUP BY-clause, even if the
SELECT-list contains non-aggregate expressions that do not also appear in the
GROUP BY-clause. In the final resultset, MySQL would produce one of the available values for such non-aggregate expressions and the result would basically not be deterministic from the user's point of view.
This behavior is not standard: SQL92 states that any non-aggregate expressions appearing in the
SELECT-list must appear in the
GROUP BY-clause; SQL99 and on state that any non-aggregate expressions that appear in the
SELECT-list must be functionally dependent upon the list of expressions appearing in the
GROUP BY. In this context, "functionally dependent" simply means that for each unique combination of values returned by the expressions that make up the
GROUP BY-clause, the non-aggregate expression necessarily yields exactly one value. (This concept is further explained and illustrated in my original article.)
Most RDBMS-es implement the SQL92 behavior, and generate an error in case a non-aggregate expression appears in the
SELECT-list but not the
GROUP BY-clause. Because MySQL would not generate an error at all and instead would simply allow such queries while silently producing a non-deterministic result for such expressions, many users got bitten and frustrated.
My original article offered 3 suggestions to cope with this non-standard behavior:
- One could explicitly add a
ONLY_FULL_GROUP_BYoption to the
sql_mode(since it was not included by default). This should essentially make pre-MySQL 5.7.5 behave according to SQL92. Unfortunately, this feature would often erroneously spot properly aggregated
SELECT-list expressions and reject perfectly valid queries. This is why I disrecommended this approach. (See bug #8510 for details.)
- I argued instead to be more conscious when building queries, and manually ensure that non-aggregated expressions in the
SELECT-list are functionally dependent upon the list of expressions appearing in the
GROUP BYclause. The aim of my original article was to teach a way of thinking about aggregate queries so that developers would be conditioned to do "the right thing" and avoid writing non-deterministic queries.
- The final suggestion was to artificially convert non-aggregate expressions in the
SELECT-list to aggregate expressions by wrapping them inside an appropriate "dummy" aggregate function like
SELECT-list to the
GROUP BY-clause. However, as I have shown in my original article, this is typically not the best solution: if that expression is truly functionally dependent upon the expressions in the
GROUP BY-clause it can (and often does) have a non-trivial impact on the performance. (And if it wasn't functionally dependent, you probably weren't exactly aware what you were doing when you wrote your query ;-)
At the time my recommendation was to go for option two and manually ensure that any non-aggregate expressions in the
SELECT-list are dependent upon the expressions in the
GROUP BY-clause. The bug in the former behahaviour of
ONLY_FULL_GROUP_BYwas simply too restrictive to work with, and adding dummy aggregate expressions makes it harder to maintain the query. Besides, successfully writing those dummy aggregates or add those non-aggregate expressions to the
GROUP BYlist still requires the same structural understanding of the query that was required to write it correctly, so why bother if you could just as well have written the query right in the first place?
Basically the message was that giving your query a little more thought is simply the best solution on all accounts.
In the 5.7.5m15 milestone release,
GROUP BY in MySQL 5.7.5
ONLY_FULL_GROUP_BYis included in the
sql_modeby default. Contrary to what its name might suggest, this does *not* mean that
GROUP BY-clauses must list all non-aggregated columns appearing in the
SELECT-list. Oracle and the MySQL development team, in particular Guilhelm Bichot, really went the extra mile and implemented behavior that is identical, or at least very close to what is described in SQL99 and beyond. That is, MySQL 5.7.5m15 will by default reject only those
GROUP BY-queries that include non-aggregated expressions in the
SELECT-list that are not functionally dependent upon the
This not only means that you cannot mess up your
GROUP BY-queries anymore (as MySQL will now reject an improper
GROUP BYquery), it will also not require you to write non-sensical "dummy" aggregates over expressions that can only have one value per aggregated result row. Hurrah!
Note that this does not mean that writing the query becomes any easier. It is just as hard (or easy) as before, this new feature simply means it becomes impossible to accidentally write a query that delivers non-deterministic results. I don't think anybody in their right mind can be against that.
ExamplesLet's put it to the test with the examples from my original article. First, let's check what the default
sql_modelooks like in MySQL-5.7.5-m15:
mysql> select version(), @@sql_mode; +-----------+---------------------------------------------------------------+ | version() | @@sql_mode | +-----------+---------------------------------------------------------------+ | 5.7.5-m15 | ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION | +-----------+---------------------------------------------------------------+ 1 row in set (0.00 sec)
As you can see,
ONLY_FULL_GROUP_BYis included by default. Now let's try this query:
mysql> use menagerie; mysql> SELECT species -> , MIN(birth) -- birthdate of oldest pet per species -> , MAX(birth) -- birthdate of youngest pet per species -> , birth -- birthdate of ... uh oh...! -> FROM menagerie.pet -> GROUP BY species; ERROR 1055 (42000): Expression #4 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'menagerie.pet.birth' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_byWe should have expected the query to be rejected, because the
birthcolumn appears in the
SELECT-list outside an aggregate function, but is not in the
This exact same query would have been rejected prior to MySQL 5.7.5 as well, provided the
ONLY_FULL_GROUP_BYwould have been explicitly included in the
sql_mode. However, the error message was quite a bit shorter, and furthermore, it conveyed a different message:
ERROR 1055 (42000): 'menagerie.pet.birth' isn't in GROUP BYTake a moment to consider the difference: prior to MySQL 5.7.5, the complaint was that the non-aggregated column is not part of the
GROUP BY-list, implying that we should include it; in MySQL 5.7.5, the message is that it is not functionally dependent upon the
GROUP BYlist, implying that we can include non-aggregated columns that are functionally dependent upon the
Let's consider this query to illustrate the difference:
mysql> use sakila; mysql> SELECT film_id -- primary key -> , title -- non-key column -> , COUNT(*) -- one row per group -> FROM sakila.film -> GROUP BY film_id; -- group by on primary keyIn this query, the
GROUP BY-clause contains only
film_id, which is the primary key of the film table. (Note that it doesn't really make much sense to perform such an aggregation, since each aggregate row is based on exactly one source row, but the point is to illustrate how MySQL 5.7.5 handles this query differently than prior versions.)
SELECT-list also contains the
film_idcolumn, which should be ok since it appears in the
GROUP BYlist. But the
SELECT-list also contains the
titlecolumn, which does not appear in the
GROUP BY-list. However, since both columns come from the same table, and
film_idis the primary key, it follows that for any value of
film_idthere can be only one value for
title. In other words, if the value for
film_idis known, then the corresponding value for
titleis also known. A fancy way of saying it is that
titleis functionally dependent upon
fim_id, which is to say that the value of
titleis fully determined once the value for
film_idis known. This is not surprising since
film_idis the key of the table. This is virtually identical to the very definition of what it means to be a key: some column or combination of columns upon which any other column is functionally dependent.
So, it should be perfectly alright to execute this query, and MySQL 5.7.5 does. Note that because of the much narrower semantics of the
ONLY_FULL_GROUP_BYoption in MySQL prior to 5.7.5, this query would be rejected in earlier MySQL versions if
ONLY_FULL_GROUP_BYis part of the
sql_mode. Prior to 5.7.5 MySQL could execute this query but not if the
Now, in the previous query, the functional dependency between
titleis quite straightforward and easy to detect, since both columns come from the same table. This type of functional dependency is also correctly detected by for example Postgres 9.1, and this query can be executed there too.
However, MySQL 5.7.5 is capable of detecting more complex functional dependencies. Consider this query:
mysql> SELECT i.film_id -> , f.title -> , count(i.inventory_id) inventory_count -> FROM film f -> LEFT JOIN inventory i -> ON f.film_id = i.film_id -> GROUP BY f.film_id -> HAVING inventory_count = 0This is almost a typical master detail query, which joins
film_idto find out how many copies of each film exist in the inventory. As an extra criterion, we filter out those films for which there are no copies available by writing
HAVING inventory_count = 0. The
GROUP BY-clause is again on
film.film_id, so this means we should be able to use any column from the
filmtable in the
SELECT-list, since they are functionally dependent upon the
GROUP BY-list; again we ask for the
titlecolumn from the
Again we also select a
film_idcolumn, but instead of asking for the
film_idcolumn from the
filmtable, we ask for the one from the
inventory.film_idcolumn does not appear in the
GROUP BY-list, and is also not an aggregated. But even though there may be multiple rows from the
inventorytable for one specific row in the
filmtable, the query is still valid. This is because the join condition
f.film_id = i.film_idensures the value of the
film_idcolumn in the
inventorytable is functionally dependent upon the
film_idcolumn from the
filmtable. And becaue the
film_idcolumn from the
filmtable does appear in the
GROUP BY-list, it must mean the
film_idcolumn from the
inventorytable is fully determined, and hence the query is valid.
This query will fail in Postgres 9.1, but not in MySQL 5.7.5: In Postgres 9.1, we first have to rewrite the
HAVING-clause and refer to
count(i.inventory_id)rather than its alias
inventory_count. But even then, it still considers
inventory.film_idnot to be functionally dependent upon the
GROUP BY-clause, and it will reject this query. (If you have any indication that later versions of Postgres also handle this query correctly, please let me know and I'll gladly amend this article)
(A little digression: I just argued that in the previous query,
inventoryis functionally dependent upon
filmbecause of the join condition. However, this does not necessarily mean the value of these columns is identical. In fact, in this particular example the selected
HAVING-clause, whereas the value of
filmtable is never
NULL. But it is still true that for each distinct value from
filmtable, the value of
inventorytable is fully determined, and hence, functionally dependent.)
Upgrade adviceIf you decide to upgrade to MySQL 5.7.5 (or beyond), and you used to run with a
sql_modethat did not include
ONLY_FULL_GROUP_BY, then some
GROUP BYqueries that used to work prior to the upgrade might fail after the upgrade. This sounds like a bad thing but if you think about it, it really isn't: the queries that are going to fail were in fact invalid all along, and gave you non-deterministic results. You just didn't notice.
A simple way to make these queries work again would be to remove
sql_mode. However, I would very strongly disrecommend that approach. Rather, each query that fails in MySQL 5.7.5 (or beyond) due to enabling
ONLY_FULL_GROUP_BYoption should be inspected and rewritten. If your query contains a non-aggregated expression in the
SELECT-list that is not dependent upon the
GROUP BY-list, your application was probably using bogus (or well, at least non-deterministic) results and you should decide what the proper behavior should be and modify the query (and/or the application) accordingly. If you really want to keep relying on non-deterministic results (why?), you can wrap such expressions into the new
ANY_VALUE()function. This will essentially preserve the old behaviour even if
ONLY_FULL_GROUP_BYis enabled. I suppose this is still better than running without
ONLY_FULL_GROUP_BY, because in this way it will at least be clear the result will be non-deterministic since you're literally asking for it.
(One word about the
ANY_VALUE()"function". By the way it looks and is called, you might get the impression that
ANY_VALUE()is some kind of new aggregate function that aggregates by actively picking one value out of a list. This is not the case. Proper aggregate functions, like
MAX()etc. will condense the resultset to one single row in the absence of a
ANY_VALUE()does no such thing. It is merely a placeholder that tells the optimizer to not generate an error when the
ONLY_FULL_GROUP_BYcontract is broken. Basically,
ANY_VALUE()means: this query is broken but instead of fixing it we chose to ignore the fact that it is broken.)
FinallyMySQL 5.7.5 is looking like it will bring a lot of improvements, and thanks to Guilhelm Bichot, very much improved standards compliance for
GROUP BY. I have no hesitation to recommend you start using the
ONLY_FULL_GROUP_BYoption. Of course many of the considerations to consciously writing your query still applies, and now MySQL will make it even more easy to do so.
If you're interested in these or other improvements, consider downloading and installing MySQL 5.7.5 m15 to test your current systems. MySQL 5.7.5 brings quite a number of incompatible changes and while I believe they are all improvements, one best get prepared. Happy hacking.
- My original Debunking GROUP BY myths article.
- Guilhelm Bichot, the developer of the improvements on this topic: MySQL 5.7: only_full_group_by Improved, Recognizing Functional Dependencies, Enabled by Default!