If you mean "did this sequence of events factually take place?" then I can be brief: No. And that's a good thing too, because if a writer would describe reality, the result would be very hard to believe, if not completely inconceivable. The course taken by real life is just too crazy. A true description is bound to be seem like a constructed mannerism and no reader can be expected to believe even one word of it.
However, as far as the single events are concerned, I exclusively use only those things bourne from reality.
If you ever read The Daily WTF you will probably agree that this point of view makes a lot of sense. Each time a really crazy or insensible piece of code pops up on the Daily's there are always people commenting, saying something like:
No...this just cannot be true. I refuse to believe that software developers from large professional organisations can deliver this kind of crap..and get away with it.
Well, the odds are that indeed the pieces of code that appear there should not be taken too literally. I mean, for one thing, the code appears with indentation, and sometimes even with comments: no way this is real code. Real Code is Worse than what you'll ever find on The Daily WTF.
My take on WTF
The reason I'm choosing this topic because I was recently asked to take a look at a database query which was giving "...trouble...". Usually this means: "...not performing as fast as we'd like...".
In this case, the people having trouble were actually consultants from a professional IT service provider. They were trying to migrate some enterprise application from a big, operational Oracle database to MySQL 5.0 Server. Their efforts were all experimental, and so ar, they were quite enthusiastic about MySQL. However, there were two troublesome queries performancewise. (I will discuss only one - the other one is also a story on it's own.)
They found that in this particular case, Oracle took 30ms whereas MySQL took a little over one minute. Of course this is a huge difference, and I was immediately challenged.
The query I got was somewhat like this:
select *
from (
select film_id
, title
, release_year
, 'PG'
from film
where rating in ('PG','PG-13')
union
select film_id
, title
, release_year
, 'R'
from film
where rating in ('R','NC-17')
union
select film_id
, title
, release_year
, rating
from film
where rating not in ('PG','PG-13','R','NC-17')
) film
where release_year > 2005
and release_year <= 2006
Well, of course, their enterprise application was not built on the sakila sample database: I stylized the original query to make it tangible. But don't let that distract you - just look.
Now I'm not saying that I'm not guilty of writing bad code. I mean, we all started somewhere, that is: from nothing. We all had to learn, and alas, in a lot of cases, you can only learn by making mistakes. However, things like the Daily WTF should be valuable eye-openers for developers and software engineers to always remain focussed on monitoring the quality of code. Even, or especially if it's their own.
So, instead of taking the chuckly attitude of laughing at someone else's apparent WTF, I'll try and do a step by step refactoring of this query. This will probably not be interesting for advanced users, and they can still have their laugh. If you can't appreciate the WTF-edness, keep reading: this article is for you.
A Simple Query
What we see here is essentially a simple query. A subselect appears in the
FROM
-clause: this is sometimes referred to as a derived table or an inline-view. If we forget the complexity of the subquery for a while and pretend that it actually is a real view, we can see easily that it really is simple:
select ...
from (/*complexity goes here*/) film
where ...
So, no
JOIN
, no GROUP BY
, and not correlelated subselects. Normally, this query should be quite fast, right? So why is it so slow now? Well, we could use the basic performance and tuning instincts, and start asking questions like:
- What engine is used for the
film
table? - Is there an index on
release_year
? - What is the size of the supa_dupa_xyz buffer?
However, I don't think that's really necessary at this point. Right now, we need to know but one thing, and that is that the resultset of an inline view is computed before as a separate step.
Using EXPLAIN
We sure can witness this using
EXPLAIN
:
explain
select *
from ( -- 1
... -- 2
UNION
... -- 3
UNION
... -- 4
) film -- union result
where ...
;
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1000 | Using where |
| 2 | DERIVED | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 3 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 4 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
The explain result is not too hard to interpret: The first row has the
select_type
equal to PRIMARY
and this corresponds to the outer query. The table
mentions that it queries <derived2>
, which means it queries on the subquery (I don't know why it does not use the subquery alias - it just doesn't).The second, third and fourth row in the explain result refer to the queries that are
UNION
-ed together inside the subselect. It's a bit puzzling that the select_type
for the second row is marked DERIVED
, but it just means it's the first thing that is evaluated as part of the subquery. The third and fourth row correspond to the sets that are to be
UNION
-ed to the total result of the subquery, and that's why you see UNION
there in the select_type
column. For those that do not know:
UNION
is a so-called set operator, it takes the resultsets produced by the queries on the left and right hand side, and constructs a new resultset out of it that contains the rows from the resultset operands. So, UNION
'adds up' rows in a vertical manner. (The much used and well known JOIN
operator 'adds up' resultsets in the horizontal direction)The final row represents the actual 'adding up' of the all the rows from the three queries inside the subselect, and that's why it has
UNION RESULT
in the select_type
column. This step is not trivial, because UNION
is required to filter out any duplicate rows in the final result after applying the UNION
. More about that later.Do we really need a derived table?
Now, we were about to think whether we really need the derived table. What would happen if we'd simply expand the contents of the subquery? It will require some work, because we would have to duplicate the
WHERE
clause of the outer query into the individual UNION
-ed queries:
select film_id
, title
, release_year
, 'PG'
from film
where rating in ('PG','PG-13')
and release_year > 2005
and release_year <= 2006
union
select film_id
, title
, release_year
, 'R'
from film
where rating in ('R','NC-17')
and release_year > 2005
and release_year <= 2006
union
select film_id
, title
, release_year
, rating
from film
where rating not in ('PG','PG-13','R','NC-17')
and release_year > 2005
and release_year <= 2006
So, what would this do to the result from
EXPLAIN
? Well, we don't expect the subquery step again in oour result:
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| 1 | PRIMARY | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 2 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 3 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
The result does not show a huge difference from what we had: first our
PRIMARY
step selected from the derived table - now it does so directly from the union result. However, in reality, it can make all the difference. In our original query, the whole union result was first built, and after that, the
WHERE
-clause on release_year
was applied. This means that the whole of the union result must be scanned row by row to filter for the right release_year
. The modification 'pushed down' the
WHERE
-clause and applied it directly on the individual sets. Apart from enabling us to cut out the subselect (and thus, the extra scan), a nice side effect is that this could potentially make the UNION
result faster too, because each WHERE
-clause could filter out rows, leaving less total rows to 'sum up' by the UNION
operation.So, when do I use a derived table?
Why would anybody ever use a inline view like this? What's the point, can't we always expand it into the outer query? Well, No. There is one case where it is literally impossible to expand it. That's when you're aggregating aggregates.
An example: suppose you have a soccer team database, and you want to take the average of the maximum number of goals per season per player. That's two levels of aggregation, first
max(no_of_goals)
per player per season, and then, the average of that (also per player per season).You can find another example here on nesting repeated groups.
Why a UNION
? Can't we use UNION ALL
?
I just mentioned that the
UNION
operator discards duplicate rows from the final resultset. Although that can be a nice functionality now and then it really is rarely needed. In most cases, the sets in the union don't have any overlap. Can we say something about this particular query? Well, first of all, the three queries all select the same columns from the
film
table, so potentially there could be overlap between he resultsets. However, the first two queries each select a different constant in the SELECT
-list:
select ..
, 'PG'
from ..
where ..
union
select ..
, 'R'
from ..
where ..
So, by definition, these two queries can never yield duplicates, meaning that we don't need the
UNION
to discard any duplicate rows: on the contrary, we'd rather have it skip that step because that is most likely faster. We can do that using UNION ALL
:
select ..
, 'PG'
from ..
where ..
union all
select ..
, 'R'
from ..
where ..
It's actually a good idea to always write
UNION ALL
in case you know for sure that there is no everlap. Even if performance is not an issue, it makes the code more explicit. This makes it easier for the database to devise a query plan, and it will also be easier to fellow developers to see the exact intention.Concerning the third query in the
UNION
, what is happening with that one? Well, it also selects from the code>filmtable, and again, it uses the same columns. However this one does not select a constant, but the rating
column. Now if we only could somehow prove that the values in the
rating
column here will never contain either of the values 'PG'
or 'R'
(the constants selected by the previous two queries) we must conclude that this resultset can never overlap with the resultset from either of the other queries. In other words, we would be able to safely use the UNION ALL
trick here too.The WTF-edness
We can find out if the third query will ever select a
'PG'
or 'R'
in the rating
column by loooking at the WHERE
-clause:
where rating not in ('PG','PG-13','R','NC-17')
Mmm, the
WHERE
-clause explicitly excludes the 'PG'
and 'R'
values. At least, that means that we can use the UNION ALL
trick here too.But wait a minute...let's look at the other
WHERE
-s too, ok?
select ..
from ..
where rating in ('PG','PG-13')
and ..
union all
select ..
from ..
where rating in ('R','NC-17')
and ..
union all
select ..
from ..
where rating not in ('PG','PG-13','R','NC-17')
and ..
Now that's beatiful: the third set selects exactly all the data not selected by the first and second. This means we can skip all the mumbo-jumbo and just write:
select film_id
, title
, release_year
, case rating
when 'PG-13' then 'PG'
when 'NC-17' then 'R'
else rating
end
from film
where release_year > 2005
and release_year <= 2006
So the
CASE
expression now makes up for the rating column.Needless to say that the explain output has shrunk quite a lot:
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
Finally...
Just a short summary:
- Real-life code can be very bad
- Avoid derived tables
- Use
UNION ALL
instead ofUNION
Sadly I have not been able to test it myself. I'm quite confident that the reasoning behind this refactoring is sound, but you never know...
If you have any questions or suggestions, post a comment to the article.
5 comments:
Good article, great original query
;)
BTW, isn't:
where release_year > 2005
and release_year <= 2006
the same as:
where release_year = 2006
Or do we have fractional year parts? :)
Also I'd have thought that the optimiser would smart enough to rewrite this part but it doesn't seem to be...
Toasty:
yes, you are right, good spot! Thanks for mentioning it.
However, I chose these :) In the original query, we were dealing with real DATE fields, and I adapted it to the sakila database. Incidentally, all the films there have the release_year 2006, but i should of course've taken a more convincing period.
How ironic that I should make that mistake in an article like this. May it remain here on the web forever as a testimony to my own remarks regarding that.
Roland Bouman
By the way, Toasty:
"Also I'd have thought that the optimiser would smart enough to rewrite this part but it doesn't seem to be..."
I'm pretty sure the MySQL optimizer does not optimize the UNION here. I'd have to check and see what Oracle does.
I really did not get a chance to look thorough enough to the real system, so it's all conjecture.
However, I thought it would be nice to take a refactoring example that focusses on only logic thinking, instead of the more 'hardcore' performance tuning activities which usually dominate in query rewrite articles.
Interesting post Roland, could you have used an IF instead of a CASE? Not sure what effect that would have on performance.
Often with SQL you stuff that works but isn't exactly optimised, due in part to many developers lack of experience with the more detailed level of SQL syntax.
With regards to your soccer team example I think in Oracle you could use Analytical Queries to do that.
Hi Andrew (gilf! long time no see, how are you mate?!)
I think a CASE expression can always be translated into an IF expression, and vice versa. The example from the query:
CASE rating
WHEN 'PG-13' THEN 'PG'
WHEN 'NC-17' THEN 'R'
ELSE rating
END
is a so-called "simple case expression". It's called simple because there are no compound expressions involved in evaluating the WHEN alternatives. This is a lot like the Oracle DECODE() function (The MySQL DECODE() function is an entirely different thing!).
Anyway, let's translate this to a "searched case expression", just for fun:
CASE
WHEN rating='PG-13' THEN 'PG'
WHEN rating='NC-17' THEN 'R'
WHEN rating
END
As you can see, this is not a "simple" evaluation anymore. Each when alternative tests an entire condition. We just happen to test against the RATING column for each WHEN, but we don't have to. We can test entirely unrelated conditions with the searched case, and even use compound conditions with AND and OR.
I am doing this translation to show how much this is like the classic IF...THEN...ELSEIF...ELSE logic that everybody is familiar with:
IF rating='PG-13' THEN 'PG'
ELSEIF rating='NC-17' THEN 'R'
ELSE rating
END
Well, only one trivial step is to separate the ELSIF into ELSE and IF:
IF rating='PG-13' THEN 'PG'
ELSE
IF rating='NC-17' THEN 'R'
ELSE rating
END
And this is of course already almost like the MySQL IF() function you are referring to:
IF(
rating='PG-13'
, 'PG'
, IF(
rating='NC-17'
, 'R'
, rating
)
)
So yeah. I can't prove it formally, but I am pretty sure any CASE construct can be translatd into a series of nested IF..ELSE constructs.
(Note: in the functions section of the MySQL Manual, these CASE and IF expressions are called "Control Flow Functions", see http://dev.mysql.com/doc/refman/5.0/en/control-flow-functions.html I don't like that name at all, because to me, these are plain expressions, that is, operations that yield a value. "Flow of Control" is something I associate with a process - kind of what is described as "Flow Control Contructs" here http://dev.mysql.com/doc/refman/5.0/en/flow-control-constructs.html So I stubbornly refer to these functions as "Conditional Functions". The SQL Standard refers to all these things as "Case Expressions")
Your note on the soccer team example is true too. Oracle has analytical functions that will allow you to do this. Interestingly, the analytical functions are implemented by saving intermediate resultsets in temporary tables and requerying those to get the aggregate-of-and-aggregate effect. Of course, a subquery is also just a temptable - at least, in practice it is. It's probably more correct to say that a subquery is a recipe to create a temptable, and that by writing down the recipe, you automagically create such a temporary table.
Post a Comment