Friday, September 26, 2014

MySQL 5.7.5: GROUP BY respects functional dependencies!

Today, Oracle announced the availability of the Development Milestone Release 15 of MySQL 5.7.5. The tagline for this release promises "Enhanced Database Performance and Manageability". That may sound rather generic, the actual list of changes and improvements is simply *huge*, and includes many items that I personally find rather exciting! Perhaps I'm mistaken but I think this may be one of the largest number of changes packed into a MySQL point release that I've witnessed in a long time. The list of changes includes improvements such as:
  • InnoDB improvements: Simplified tablespace recovery, support for spatial indexes, dynamic configuration of the innodb_buffer_pool_size parameter(!), more efficient creation and rebuilding of indexes ("sorted index build")
  • Several general improvements for spatial data, such as support for open polygons and functions to manipulate geohash values and GeoJSON documents
  • performance_schema additions and improvements, such as a new user_variables_by_thread table, addition of WORK_COMPLETED and WORK_ESTIMATED columns to the stage event tables, improvements to the wait event tables, instrumentation for InnoDB memory allocation in the memory summary tables.
  • Quite a bunch of optimizer improvements, such as better cost estimation for semi-join materialization, configurable cost model (by way of the mysql.server_cost and mysql.engine_cost system tables) and more exact index statistics.
  • Many improvements and additions that make replication more robust
  • A more sane default SQL mode and GROUP BY behaviour
This is very far from an exhaustive list; It really is not an exaggeration to say that there is much much more than I can cover here. Just see for yourself.

Now, one of the changes I'd like to highlight in this post is the improved GROUP BY support.

GROUP BY behavior before MySQL 5.7.5

More than 7 years ago, I wrote an article on this blog called Debunking GROUP BY Myths. The article is basically an explanation of the syntax and semantics of the SQL GROUP BY clause, with a focus on MySQL particular non-standard implementation.

Before MySQL 5.7.5, MySQL would by default always aggregate over the list of expressions that appear in the GROUP BY-clause, even if the SELECT-list contains non-aggregate expressions that do not also appear in the GROUP BY-clause. In the final resultset, MySQL would produce one of the available values for such non-aggregate expressions and the result would basically not be deterministic from the user's point of view.

This behavior is not standard: SQL92 states that any non-aggregate expressions appearing in the SELECT-list must appear in the GROUP BY-clause; SQL99 and on state that any non-aggregate expressions that appear in the SELECT-list must be functionally dependent upon the list of expressions appearing in the GROUP BY. In this context, "functionally dependent" simply means that for each unique combination of values returned by the expressions that make up the GROUP BY-clause, the non-aggregate expression necessarily yields exactly one value. (This concept is further explained and illustrated in my original article.)

Most RDBMS-es implement the SQL92 behavior, and generate an error in case a non-aggregate expression appears in the SELECT-list but not the GROUP BY-clause. Because MySQL would not generate an error at all and instead would simply allow such queries while silently producing a non-deterministic result for such expressions, many users got bitten and frustrated.

My original article offered 3 suggestions to cope with this non-standard behavior:
  • One could explicitly add a ONLY_FULL_GROUP_BY option to the sql_mode (since it was not included by default). This should essentially make pre-MySQL 5.7.5 behave according to SQL92. Unfortunately, this feature would often erroneously spot properly aggregated SELECT-list expressions and reject perfectly valid queries. This is why I disrecommended this approach. (See bug #8510 for details.)
  • I argued instead to be more conscious when building queries, and manually ensure that non-aggregated expressions in the SELECT-list are functionally dependent upon the list of expressions appearing in the GROUP BY clause. The aim of my original article was to teach a way of thinking about aggregate queries so that developers would be conditioned to do "the right thing" and avoid writing non-deterministic queries.
  • The final suggestion was to artificially convert non-aggregate expressions in the SELECT-list to aggregate expressions by wrapping them inside an appropriate "dummy" aggregate function like MIN() or MAX().
In addition, it may seem that one can also simply add the non-aggregate expressions int the SELECT-list to the GROUP BY-clause. However, as I have shown in my original article, this is typically not the best solution: if that expression is truly functionally dependent upon the expressions in the GROUP BY-clause it can (and often does) have a non-trivial impact on the performance. (And if it wasn't functionally dependent, you probably weren't exactly aware what you were doing when you wrote your query ;-)

At the time my recommendation was to go for option two and manually ensure that any non-aggregate expressions in the SELECT-list are dependent upon the expressions in the GROUP BY-clause. The bug in the former behahaviour of ONLY_FULL_GROUP_BY was simply too restrictive to work with, and adding dummy aggregate expressions makes it harder to maintain the query. Besides, successfully writing those dummy aggregates or add those non-aggregate expressions to the GROUP BY list still requires the same structural understanding of the query that was required to write it correctly, so why bother if you could just as well have written the query right in the first place?

Basically the message was that giving your query a little more thought is simply the best solution on all accounts.

GROUP BY in MySQL 5.7.5

In the 5.7.5m15 milestone release, ONLY_FULL_GROUP_BY is included in the sql_mode by default. Contrary to what its name might suggest, this does *not* mean that GROUP BY-clauses must list all non-aggregated columns appearing in the SELECT-list. Oracle and the MySQL development team, in particular Guilhelm Bichot, really went the extra mile and implemented behavior that is identical, or at least very close to what is described in SQL99 and beyond. That is, MySQL 5.7.5m15 will by default reject only those GROUP BY-queries that include non-aggregated expressions in the SELECT-list that are not functionally dependent upon the GROUP BY-list.

This not only means that you cannot mess up your GROUP BY-queries anymore (as MySQL will now reject an improper GROUP BY query), it will also not require you to write non-sensical "dummy" aggregates over expressions that can only have one value per aggregated result row. Hurrah!

Note that this does not mean that writing the query becomes any easier. It is just as hard (or easy) as before, this new feature simply means it becomes impossible to accidentally write a query that delivers non-deterministic results. I don't think anybody in their right mind can be against that.

Examples

Let's put it to the test with the examples from my original article. First, let's check what the default sql_mode looks like in MySQL-5.7.5-m15:
mysql> select version(), @@sql_mode;
+-----------+---------------------------------------------------------------+
| version() | @@sql_mode                                                    |
+-----------+---------------------------------------------------------------+
| 5.7.5-m15 | ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION |
+-----------+---------------------------------------------------------------+
1 row in set (0.00 sec)

As you can see, ONLY_FULL_GROUP_BY is included by default. Now let's try this query:
mysql> use menagerie;
mysql> SELECT   species
    -> ,        MIN(birth)  -- birthdate of oldest pet per species
    -> ,        MAX(birth)  -- birthdate of youngest pet per species
    -> ,        birth       -- birthdate of ... uh oh...!
    -> FROM     menagerie.pet
    -> GROUP BY species;
ERROR 1055 (42000): Expression #4 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'menagerie.pet.birth' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
We should have expected the query to be rejected, because the birth column appears in the SELECT-list outside an aggregate function, but is not in the GROUP BY-clause.

This exact same query would have been rejected prior to MySQL 5.7.5 as well, provided the ONLY_FULL_GROUP_BY would have been explicitly included in the sql_mode. However, the error message was quite a bit shorter, and furthermore, it conveyed a different message:
ERROR 1055 (42000): 'menagerie.pet.birth' isn't in GROUP BY
Take a moment to consider the difference: prior to MySQL 5.7.5, the complaint was that the non-aggregated column is not part of the GROUP BY-list, implying that we should include it; in MySQL 5.7.5, the message is that it is not functionally dependent upon the GROUP BY list, implying that we can include non-aggregated columns that are functionally dependent upon the GROUP BY-clause.

Let's consider this query to illustrate the difference:
mysql> use sakila;
mysql> SELECT   film_id       -- primary key
    -> ,        title         -- non-key column
    -> ,        COUNT(*)      -- one row per group
    -> FROM     sakila.film
    -> GROUP BY film_id;      -- group by on primary key
In this query, the GROUP BY-clause contains only film_id, which is the primary key of the film table. (Note that it doesn't really make much sense to perform such an aggregation, since each aggregate row is based on exactly one source row, but the point is to illustrate how MySQL 5.7.5 handles this query differently than prior versions.)

The SELECT-list also contains the film_id column, which should be ok since it appears in the GROUP BY list. But the SELECT-list also contains the title column, which does not appear in the GROUP BY-list. However, since both columns come from the same table, and film_id is the primary key, it follows that for any value of film_id there can be only one value for title. In other words, if the value for film_id is known, then the corresponding value for title is also known. A fancy way of saying it is that title is functionally dependent upon fim_id, which is to say that the value of title is fully determined once the value for film_id is known. This is not surprising since film_id is the key of the table. This is virtually identical to the very definition of what it means to be a key: some column or combination of columns upon which any other column is functionally dependent.

So, it should be perfectly alright to execute this query, and MySQL 5.7.5 does. Note that because of the much narrower semantics of the ONLY_FULL_GROUP_BY option in MySQL prior to 5.7.5, this query would be rejected in earlier MySQL versions if ONLY_FULL_GROUP_BY is part of the sql_mode. Prior to 5.7.5 MySQL could execute this query but not if the sql_mode includes ONLY_FULL_GROUP_BY.

Now, in the previous query, the functional dependency between film_id and title is quite straightforward and easy to detect, since both columns come from the same table. This type of functional dependency is also correctly detected by for example Postgres 9.1, and this query can be executed there too.

However, MySQL 5.7.5 is capable of detecting more complex functional dependencies. Consider this query:
mysql> SELECT    i.film_id
    -> ,         f.title
    -> ,         count(i.inventory_id) inventory_count
    -> FROM      film f
    -> LEFT JOIN inventory i
    -> ON        f.film_id = i.film_id
    -> GROUP BY  f.film_id
    -> HAVING    inventory_count = 0
This is almost a typical master detail query, which joins film to inventory over film_id to find out how many copies of each film exist in the inventory. As an extra criterion, we filter out those films for which there are no copies available by writing HAVING inventory_count = 0. The GROUP BY-clause is again on film.film_id, so this means we should be able to use any column from the film table in the SELECT-list, since they are functionally dependent upon the GROUP BY-list; again we ask for the title column from the film table.

Again we also select a film_id column, but instead of asking for the film_id column from the film table, we ask for the one from the inventory table. The inventory.film_id column does not appear in the GROUP BY-list, and is also not an aggregated. But even though there may be multiple rows from the inventory table for one specific row in the film table, the query is still valid. This is because the join condition f.film_id = i.film_id ensures the value of the film_id column in the inventory table is functionally dependent upon the film_id column from the film table. And becaue the film_id column from the film table does appear in the GROUP BY-list, it must mean the film_id column from the inventory table is fully determined, and hence the query is valid.

This query will fail in Postgres 9.1, but not in MySQL 5.7.5: In Postgres 9.1, we first have to rewrite the HAVING-clause and refer to count(i.inventory_id) rather than its alias inventory_count. But even then, it still considers inventory.film_id not to be functionally dependent upon the GROUP BY-clause, and it will reject this query. (If you have any indication that later versions of Postgres also handle this query correctly, please let me know and I'll gladly amend this article)

(A little digression: I just argued that in the previous query, film_id from inventory is functionally dependent upon film_id from film because of the join condition. However, this does not necessarily mean the value of these columns is identical. In fact, in this particular example the selected film_id column from inventory will be NULL because our HAVING-clause, whereas the value of film_id from the film table is never NULL. But it is still true that for each distinct value from film_id from the film table, the value of film_id from the inventory table is fully determined, and hence, functionally dependent.)

Upgrade advice

If you decide to upgrade to MySQL 5.7.5 (or beyond), and you used to run with a sql_mode that did not include ONLY_FULL_GROUP_BY, then some GROUP BY queries that used to work prior to the upgrade might fail after the upgrade. This sounds like a bad thing but if you think about it, it really isn't: the queries that are going to fail were in fact invalid all along, and gave you non-deterministic results. You just didn't notice.

A simple way to make these queries work again would be to remove ONLY_FULL_GROUP_BY from the sql_mode. However, I would very strongly disrecommend that approach. Rather, each query that fails in MySQL 5.7.5 (or beyond) due to enabling ONLY_FULL_GROUP_BY option should be inspected and rewritten. If your query contains a non-aggregated expression in the SELECT-list that is not dependent upon the GROUP BY-list, your application was probably using bogus (or well, at least non-deterministic) results and you should decide what the proper behavior should be and modify the query (and/or the application) accordingly. If you really want to keep relying on non-deterministic results (why?), you can wrap such expressions into the new ANY_VALUE() function. This will essentially preserve the old behaviour even if ONLY_FULL_GROUP_BY is enabled. I suppose this is still better than running without ONLY_FULL_GROUP_BY, because in this way it will at least be clear the result will be non-deterministic since you're literally asking for it.

(One word about the ANY_VALUE() "function". By the way it looks and is called, you might get the impression that ANY_VALUE() is some kind of new aggregate function that aggregates by actively picking one value out of a list. This is not the case. Proper aggregate functions, like COUNT(), MIN(), MAX() etc. will condense the resultset to one single row in the absence of a GROUP BY list; ANY_VALUE() does no such thing. It is merely a placeholder that tells the optimizer to not generate an error when the ONLY_FULL_GROUP_BY contract is broken. Basically, ANY_VALUE() means: this query is broken but instead of fixing it we chose to ignore the fact that it is broken.)

Finally

MySQL 5.7.5 is looking like it will bring a lot of improvements, and thanks to Guilhelm Bichot, very much improved standards compliance for GROUP BY. I have no hesitation to recommend you start using the ONLY_FULL_GROUP_BY option. Of course many of the considerations to consciously writing your query still applies, and now MySQL will make it even more easy to do so.

If you're interested in these or other improvements, consider downloading and installing MySQL 5.7.5 m15 to test your current systems. MySQL 5.7.5 brings quite a number of incompatible changes and while I believe they are all improvements, one best get prepared. Happy hacking.

Further Reading

25 comments:

Guilhem Bichot said...

Hello Roland. Thanks for your kind words!!! Actually, your old blog "debunking GROUP BY myths" was one motivation to get this job done :-)
The 5.7 documentation is being edited as we speak, to describe the new logic of only_full_group_by, functional dependencies...

rpbouman said...

Hi Guilhem!

Hey, I'm really glad I could contribute in that way. But kudos to you for implementing it - I'm sure kind of thing could grow into enormous complexity real fast. At least, I once read the spec and it made my ears ring and my eyes bleed so I can only imagine what it must be like implementing it.

Thanks again! I'm very happy you managed it! :)

Anonymous said...

You might also look at the query rewriting support. It'll provide a mostly automatic way to rewrite queries until an application can be changed. It's the solution to many "can't change the application" issues as well as great protection for any backwards compatibility issues with the improved optimiser.

James Day, MySQL Senior Principal Support Engineer, Oracle

rpbouman said...

James,

thanks for chiming in!

I noticed the query rewrite is available through a labs plugin, is that correct?

This sounds like really valuable advice. Thanks for the suggestion!

CRC said...

Great post! Very clear and useful.
Thanks a lot.

Mikiya Okuno said...

Hello Roland,

It must be an off topic, but I come up with an idea that the new behavior us really useful when we normalize tables. It can judge if two sets of columns, one is specified in GROUP BY clause and the other is listed in SELECT list, are functionally dependent or not. It may be a really fantastic feature for database designers.

Mikiya Okuno said...

Hi Roland,

The implementation appears to be different from what I thought.

select Name, Continent, avg(Gnp) from world.Country group by Name;

This query results in an error when ONLY_FULL_GROUP_BY is enabled. However, Continent column is functionally dependent on Name. Name is not a unique key, but its cardinality is same as its PRIMARY KEY. So, it's a yet another candidate key (in relational model sense). Thus, any columns in this table are actually functionally dependent on Name.

Theoretically, column B is functionally dependent A iif any values on column B is always same for every value on column A. Note that candidate key is a matter of distinct values, not a matter of existence of underlying index(es). Functional dependencies cannot be determined without reading values in the given table. However, the query above results in error without accessing any rows.

IMHO, the current implementation is not sufficient. The truly correct implementation is to have internal aggregate function to judge if more than one value appeared over one key value specified in GROUP BY clause or not. Assume that the FD aggregate function returns true if exactly one value appears for the given key. In other words, the FD function will return true when the given column is functionally dependent on the key. When ONLY_FULL_GROUP_BY is enabled, an aggregate query should immediately terminate when the hidden FD aggregate function returns false.

Without such an implementation, I don't think that ONLY_FULL_GROUP_BY really handles functional dependencies. So, we must take care when writing a manual not to make a mistake.

Kind regards,
Mikiya

rpbouman said...

Hi Mikiya,

on a conceptual level, you are completely right - the cardinality indicates that the Name column is functionally dependent. But whether the SQL spec counts this as functionally dependent, I do not know. Though I would suspect they use a more formal appproach that requires the functional dependency to be declared in advance - for example, by the explicit definition of a key. I guess a "derived column" should count as a FD too.

You are also right that the use of the term "functional dependency" in my text does not make it clear whether I'm talking on a conceptual or a more formal level. I will take care to better make the distinction in the future.

In the mean while, you should consider explicitly marking those columns that are unique with a unique constraint. I think it makes a difference whether something simply happens to be functionally dependent, or whether we know for sure that it is functionally dependent. I think it would be very hard for any query optimizer to work with what might be possible rather than what it knows it to be the case.

Mikiya Okuno said...

Hi Roland,

Thank you for your reply.

I agree that the current implementation is really reasonable, but wording in the manual is not reasonable. BTW, as I suggested in the previous post, implementing FD aggregate function is not difficult, is it? Adding such a new feature would really make sense because:

* Unnormalized tables are popular among web developers for performance tuning technique, unfortunately. (I don't like such a technique.) Such tables are often used for analysis using aggregate functions. So, hidden FD columns are often used in GROUP BY clause.

* SQL doesn't have a mechanism to check FD inside a table. We cannot ensure FD using indexes or constraints.

I do not expect FD aggregation is added in 5.7, but I would like you dev team to realise its usefulness and consider to add it to future release.

I do not deny checking FD using unique indexes. It surely is a good thing, because it needs little cost. It's far better than FD aggregate function in query cost point of view. In addition, a query results in an error right away when no suitable unique indexes are found.

IMHO, the optimizer has two execution plans; Just like the range optimizer, which chooses table scan when no suitable indexes are found, the GROUP BY optimizer should choose scan with FD aggregate function when no suitable unique indexes are found for FD check. I think this is a reasonable strategy. (In such case loose index scan cannot be employed, though.)

If FD aggregate function is implemented, FD check and query execution are done at the same time. The query should fail in the middle of execution immediately after the first FD constraint violation is found via FD aggregate function, or should succeed with warning, depending on user preferences (maybe sql mode). This is an ideal behavior for real life analysis applications.

Kind regards,
Mikiya

rpbouman said...

Hi Mikiya,

Interesting points. I think you should go to bugs.mysql.com and make a case for each of them.

Best regards,

Roland

Anonymous said...

Unfortunately this update completely took down an entire service that I have written... Updating MySql took some of my queries offline and caused some seriously angry sneers from other teams in my company when the service no longer worked.

If I am to ask one question: Why would it make more sense to select Max(name) from books where author = name group by name; The dummy aggregate does make sense and reading it makes me wonder what is happening rather than resolve it. To quote your article:
"The final suggestion was to artificially convert non-aggregate expressions in the SELECT-list to aggregate expressions by wrapping them inside an appropriate "dummy" aggregate function like MIN() or MAX()." why why why?

rpbouman said...

@anonymous, puzzled. I'm pretty sure

select Max(name) from books where author = name group by name;

should be written as

select name from books where author = name group by name;

and this is regardless of the feature discussed in this blog entry. Did you write the correct query?

Anonymous said...

SELECT count(fileName), micro_test.id, micro_test.fileName
from test
group by fileName having count(fileName) > 1
order by fileName

This is a terrible update. It has forced me to use max aggregates when it doesn't make sense!
Above is an example of a query that worked just fine and makes sense. Below is the new revised query that doesn't make sense and enforces dummy aggregates...

SELECT count(fileName), max(micro_test.id) as id, max(micro_test.fileName) as fileName
from test
group by fileName having count(fileName) > 1
order by fileName

Why is this a better solution???

rpbouman said...

@anonymous

"This is a terrible update. It has forced me to use max aggregates when it doesn't make sense!"

Well you're not "forced" to. You could also configure your sql mode to not include the ONLY_FULL_GROUP_BY option, and everything should work exactly like it did pre 5.7.5

Looking at your rewrite, there are a bunch of things that don't make sense to me.

You wrote: MAX(micro_test.fileName). But fileName is also in the GROUP BY list. This is silly - if you're grouping by a column, it simply does not make sense to also aggregate. To be sure - you can leave out the MAX on filename, right? If fileName really is also in the GROUP BY?

Now rearding max(micro_test.id) - I'd love to comment on that, but it seem that table micro_test is not in your FROM clause. What is the structure of your tables and what is your actual query?

Anonymous said...

I removed the table name for anonymity that was just a type error on my part suffice it to be micro_test as the table. If you notice my original query does not contain the max aggregates. However with the ONLY_FULL_GROUP mode "by default" it disallows any select list to not have an aggregate function tied to it... This surprised me. My original query was as follows:

SELECT count(fileName), micro_test.id, micro_test.fileName
from micro_test
group by fileName having count(fileName) > 1
order by fileName

This in particular threw an error on the select expr #2 so doing this:

SELECT count(fileName), max(micro_test.id), micro_test.fileName
from micro_test
group by fileName having count(fileName) > 1
order by fileName

Throws and error on the select expr #3 due to it not using an aggregate.

Here is another example of the error being thrown with max around the fileName but not id:

SELECT count(fileName), micro_test.id, max(micro_test.fileName) FROM micro_test
group by fileName having count(fileName) >1
LIMIT 0, 1000 Error Code: 1055. Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'micro_test.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by 0.015 sec
NOTE: I have changed some naming for company anonymity.

My goal with the original query was to count how many times the fileName was duplicated in that table as well as provide me with the id and string literal for the fileName. So your confusion on the max aggregate function over the id and fileName is also my confusion. I am grouping by the fileName but only to count it. Due to this "default mode" of only_full_group_by it made many queries break. But to fix them it just didn't make logical sense. I guess my real question is why is this the default setting?

rpbouman said...

"Throws and error on the select expr #3 due to it not using an aggregate." < I agree that would be strange. What happens if you write

GROUP BY micro_test.fileName
HAVING COUNT(micro_test.fileName) > 1

(I mean, the fileName from the SELECT list is in fact the same as the column you're grouping by, right?)

rather then omitting the alias?

"Here is another example of the error being thrown with max around the fileName but not id:" < well that id is a problem here makes perfect sense to me. The error means that, given a particular, single value for fileName, you can have multiple values for id. This happens if fileName is not a primary key, or has no UNIQUE constraint. It might be so that your data happens to be such that for any given value of fileName, all rows that match that fileName happen to have one and the same value for id. But you can't expect the database to know about that.

The old behavior would let you report the id column in this case, and it would just pick some value. You can still achieve that behavior either by not setting the sql mode to include ONLY_FULL_GROUP_BY, or by wrapping the id column in ANY_VALUE

What does not make sense in this query is the max around fileName. fileName is in the GROUP BY list, so there is no way there can be multiple values of fileName.

"My goal with the original query was to count how many times the fileName was duplicated in that table as well as provide me with the id and string literal for the fileName."

Well, reporting the count and the filename makes sense. But reporting "the id" does not - if you have a bunch of rows that match fileName 'x', then each matching row could have a different id. So there is not "the id"- there may be a bunch of them. GROUP_CONCAT(id) would give you the list of id's, which seems the only right thing to do here.

" I guess my real question is why is this the default setting?" < because it would be non-sense to ask for "the id" whereas there may be multiple values for any given value of fileName. That's just how GROUP BY works.

If you still have trouble understanding that behavior of GROUP BY then please first read http://rpbouman.blogspot.nl/2007/05/debunking-group-by-myths.html

Unknown said...

Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'Data.List_Subscribers_Laaptu.emailaddress' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

Unknown said...

Thanks for the post. Great source of information and helpful. I am working on updating a legacy system and have been running into this issue a lot.

rpbouman said...

James, you're very welcome! Always glad to hear the blog is of some use to other. Best regards,

Roland.

Unknown said...

@Roland, thank you for your articles on grouping semantics in SQL with a special respect to functional dependency issue.
I have additional question in this context.
What SQL Standard says about conditions on mixed aggregate and non-aggregate expressions in SELECT clause? For example for simple table:

create table t
(
a integer,
b integer,
c integer
);

following queries fail with ORA-00979: not a GROUP BY expression

select t.a + count(*)
from t
group by t.a + 1;

select t.a + count(*) + t.b
from t
group by t.a + t.b;

Does 'functional depenency' apply in these cases?

Janek




rpbouman said...

Hi Janek!

hey, thanks for your question :)

To be sure you would have to analyze the precise wording of the standard your interested in.
I found the 1999 one one here so you can check for yourself:

ftp://ftp.iks-jena.de/mitarb/lutz/standards/sql/ansi-iso-9075-2-1999.pdf

Your starting point would be 7.11, 13 (query specification, p. 261) where it reads:

"If T is a grouped table, then let G be the set consisting of every column referenced by a <column reference> contained in the <group by clause> immediately contained in <table expression>. In each <value expression>, each <column reference> that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in a <set function specification>"

So when we apply this to your queries, then the entire statement is the "grouped table T" and G is a set of columns containing only column t.a. In the select list we find the value expresion t.a + count(*), which has one column reference to again t.a. Since the column references match, one would think that this condition is satisfied.

By the same logic, the second statement would satisfy the condition as well, since the only referenced columns in the value list are t.a and t.b, which both appear in the group by list as well.

However, it seems to me I must be misunderstanding the logic of SQL 1999, or there must be an error as it is too simple. Consider this query:

select a + b, count(*)
from (select 1 a, 1 b
union all
select 1 a, -1 b
union all
select -1 a, 1 b
union all
select -1 a, -1 b) q
group by a*b
;

Here, the referenced columns are a,b; and these are also the only columns appearing in the select list. But as you can see, while both -1 * -1 and 1 * 1 yields 1, -1 + -1 gives us -2 and 1+1 gives us 2. So in this case, the select list is not functionally dependent upon the group by list.

Back to your queries:

Another way of approaching it is to ask whether t.a + count(*) is functionally dependent upon t.a + 1. I would yes - we can simplify

t.a + 1 ---> t.a + count(*)

to

t.a ---> t.a - 1 + count(*)

Since t.a is by definition functionally dependent upon itself, we know only have to ask whether - 1 + count(*) is functionally dependent on t.a.

The count(*) is a count over the values from the group by list, which in this case is conveniently only t.a itself; So, we can say that count(*) is functionally dependant upon t.a. That leaves the -1; and obviously applying that is applying a "substract by one function", and the result is there for functionally dependent.

So, in short - query1 looks legit.

We can apply the same logic to the second query, and I believe we can make the same case.

Finally, if you do dig into the standard, then you'll find that it defines "known functional dependencies", that is, those functional dependencies that are explicitly defined with rules in the standard. But it does not provide a full definition for every "functional dependency". So I think this leaves room for the implementation to decide whether something really is functionally dependent.

I hope this helps!

Best regards,

Roland

Guilhem Bichot said...

Hello. In the SQL standard (I checked a recent version), things are simple: GROUP BY may only list columns (see the syntax of ), so "GROUP BY a+b" is illegal. Oracle, MySQL among others allow it, but recognizing functional dependencies like "a+1+b is dependent on a+b", is not done by any as far as I know.

rpbouman said...

Guilhelm,

thanks for chiming in! Much appreciated.

Yeah I think queries like these are of little practical value anyway.

Janek, I assume your examples were designed to better understand the concepts and the logic, or do you have some real-world queries where functional dependencies like these come into play? I would guess that for the practical cases it would in large majority be possible to rewrite them so the dependencies are clear and explicit. Writing it so you don't have functions or operations in the group by list sounds like excellent advice.

Arhant said...

Roland,

Really !! A Great Blog.I want to see the code base behind this implementation.
Can you please help me.

Thanks

rpbouman said...

@Arhant, it's on github. https://github.com/mysql/mysql-server

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...