Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Wednesday, October 18, 2006

Refactoring: Derived table, UNION...WTF?

I greatly admire the works of the Dutch literary author Gerard Reve (1923 - 2006). On many occasions, he was asked whether his stories were real-life stories, and he always answered like this:

If you mean "did this sequence of events factually take place?" then I can be brief: No. And that's a good thing too, because if a writer would describe reality, the result would be very hard to believe, if not completely inconceivable. The course taken by real life is just too crazy. A true description is bound to be seem like a constructed mannerism and no reader can be expected to believe even one word of it.
However, as far as the single events are concerned, I exclusively use only those things bourne from reality.

If you ever read The Daily WTF you will probably agree that this point of view makes a lot of sense. Each time a really crazy or insensible piece of code pops up on the Daily's there are always people commenting, saying something like:

No...this just cannot be true. I refuse to believe that software developers from large professional organisations can deliver this kind of crap..and get away with it.

Well, the odds are that indeed the pieces of code that appear there should not be taken too literally. I mean, for one thing, the code appears with indentation, and sometimes even with comments: no way this is real code. Real Code is Worse than what you'll ever find on The Daily WTF.

My take on WTF


The reason I'm choosing this topic because I was recently asked to take a look at a database query which was giving "...trouble...". Usually this means: "...not performing as fast as we'd like...".

In this case, the people having trouble were actually consultants from a professional IT service provider. They were trying to migrate some enterprise application from a big, operational Oracle database to MySQL 5.0 Server. Their efforts were all experimental, and so ar, they were quite enthusiastic about MySQL. However, there were two troublesome queries performancewise. (I will discuss only one - the other one is also a story on it's own.)

They found that in this particular case, Oracle took 30ms whereas MySQL took a little over one minute. Of course this is a huge difference, and I was immediately challenged.

The query I got was somewhat like this:

select *
from (
select film_id
, title
, release_year
, 'PG'
from film
where rating in ('PG','PG-13')
union
select film_id
, title
, release_year
, 'R'
from film
where rating in ('R','NC-17')
union
select film_id
, title
, release_year
, rating
from film
where rating not in ('PG','PG-13','R','NC-17')
) film
where release_year > 2005
and release_year <= 2006

Well, of course, their enterprise application was not built on the sakila sample database: I stylized the original query to make it tangible. But don't let that distract you - just look.

Now I'm not saying that I'm not guilty of writing bad code. I mean, we all started somewhere, that is: from nothing. We all had to learn, and alas, in a lot of cases, you can only learn by making mistakes. However, things like the Daily WTF should be valuable eye-openers for developers and software engineers to always remain focussed on monitoring the quality of code. Even, or especially if it's their own.

So, instead of taking the chuckly attitude of laughing at someone else's apparent WTF, I'll try and do a step by step refactoring of this query. This will probably not be interesting for advanced users, and they can still have their laugh. If you can't appreciate the WTF-edness, keep reading: this article is for you.

A Simple Query


What we see here is essentially a simple query. A subselect appears in the FROM-clause: this is sometimes referred to as a derived table or an inline-view. If we forget the complexity of the subquery for a while and pretend that it actually is a real view, we can see easily that it really is simple:

select ...
from (/*complexity goes here*/) film
where ...

So, no JOIN, no GROUP BY, and not correlelated subselects. Normally, this query should be quite fast, right? So why is it so slow now?

Well, we could use the basic performance and tuning instincts, and start asking questions like:

  • What engine is used for the film table?

  • Is there an index on release_year?

  • What is the size of the supa_dupa_xyz buffer?


However, I don't think that's really necessary at this point. Right now, we need to know but one thing, and that is that the resultset of an inline view is computed before as a separate step.

Using EXPLAIN


We sure can witness this using EXPLAIN:

explain
select *
from ( -- 1
... -- 2
UNION
... -- 3
UNION
... -- 4
) film -- union result
where ...
;
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1000 | Using where |
| 2 | DERIVED | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 3 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 4 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+

The explain result is not too hard to interpret: The first row has the select_type equal to PRIMARY and this corresponds to the outer query. The table mentions that it queries <derived2>, which means it queries on the subquery (I don't know why it does not use the subquery alias - it just doesn't).

The second, third and fourth row in the explain result refer to the queries that are UNION-ed together inside the subselect. It's a bit puzzling that the select_type for the second row is marked DERIVED, but it just means it's the first thing that is evaluated as part of the subquery.

The third and fourth row correspond to the sets that are to be UNION-ed to the total result of the subquery, and that's why you see UNION there in the select_type column.

For those that do not know: UNION is a so-called set operator, it takes the resultsets produced by the queries on the left and right hand side, and constructs a new resultset out of it that contains the rows from the resultset operands. So, UNION 'adds up' rows in a vertical manner. (The much used and well known JOIN operator 'adds up' resultsets in the horizontal direction)

The final row represents the actual 'adding up' of the all the rows from the three queries inside the subselect, and that's why it has UNION RESULT in the select_type column. This step is not trivial, because UNION is required to filter out any duplicate rows in the final result after applying the UNION. More about that later.

Do we really need a derived table?


Now, we were about to think whether we really need the derived table. What would happen if we'd simply expand the contents of the subquery? It will require some work, because we would have to duplicate the WHERE clause of the outer query into the individual UNION-ed queries:

select film_id
, title
, release_year
, 'PG'
from film
where rating in ('PG','PG-13')
and release_year > 2005
and release_year <= 2006
union
select film_id
, title
, release_year
, 'R'
from film
where rating in ('R','NC-17')
and release_year > 2005
and release_year <= 2006
union
select film_id
, title
, release_year
, rating
from film
where rating not in ('PG','PG-13','R','NC-17')
and release_year > 2005
and release_year <= 2006

So, what would this do to the result from EXPLAIN? Well, we don't expect the subquery step again in oour result:

+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+
| 1 | PRIMARY | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 2 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| 3 | UNION | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
| | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+--------------+------+---------------+------+---------+------+------+-------------+

The result does not show a huge difference from what we had: first our PRIMARY step selected from the derived table - now it does so directly from the union result. However, in reality, it can make all the difference.

In our original query, the whole union result was first built, and after that, the WHERE-clause on release_year was applied. This means that the whole of the union result must be scanned row by row to filter for the right release_year.

The modification 'pushed down' the WHERE-clause and applied it directly on the individual sets. Apart from enabling us to cut out the subselect (and thus, the extra scan), a nice side effect is that this could potentially make the UNION result faster too, because each WHERE-clause could filter out rows, leaving less total rows to 'sum up' by the UNION operation.

So, when do I use a derived table?


Why would anybody ever use a inline view like this? What's the point, can't we always expand it into the outer query? Well, No. There is one case where it is literally impossible to expand it. That's when you're aggregating aggregates.

An example: suppose you have a soccer team database, and you want to take the average of the maximum number of goals per season per player. That's two levels of aggregation, first max(no_of_goals) per player per season, and then, the average of that (also per player per season).

You can find another example here on nesting repeated groups.

Why a UNION? Can't we use UNION ALL?


I just mentioned that the UNION operator discards duplicate rows from the final resultset. Although that can be a nice functionality now and then it really is rarely needed.

In most cases, the sets in the union don't have any overlap. Can we say something about this particular query? Well, first of all, the three queries all select the same columns from the film table, so potentially there could be overlap between he resultsets. However, the first two queries each select a different constant in the SELECT-list:

select ..
, 'PG'
from ..
where ..
union
select ..
, 'R'
from ..
where ..

So, by definition, these two queries can never yield duplicates, meaning that we don't need the UNION to discard any duplicate rows: on the contrary, we'd rather have it skip that step because that is most likely faster. We can do that using UNION ALL:

select ..
, 'PG'
from ..
where ..
union all
select ..
, 'R'
from ..
where ..

It's actually a good idea to always write UNION ALL in case you know for sure that there is no everlap. Even if performance is not an issue, it makes the code more explicit. This makes it easier for the database to devise a query plan, and it will also be easier to fellow developers to see the exact intention.

Concerning the third query in the UNION, what is happening with that one? Well, it also selects from the code>filmtable, and again, it uses the same columns. However this one does not select a constant, but the rating column.

Now if we only could somehow prove that the values in the rating column here will never contain either of the values 'PG' or 'R' (the constants selected by the previous two queries) we must conclude that this resultset can never overlap with the resultset from either of the other queries. In other words, we would be able to safely use the UNION ALL trick here too.

The WTF-edness


We can find out if the third query will ever select a 'PG' or 'R' in the rating column by loooking at the WHERE-clause:

where rating not in ('PG','PG-13','R','NC-17')

Mmm, the WHERE-clause explicitly excludes the 'PG' and 'R' values. At least, that means that we can use the UNION ALL trick here too.

But wait a minute...let's look at the other WHERE-s too, ok?

select ..
from ..
where rating in ('PG','PG-13')
and ..
union all
select ..
from ..
where rating in ('R','NC-17')
and ..
union all
select ..
from ..
where rating not in ('PG','PG-13','R','NC-17')
and ..

Now that's beatiful: the third set selects exactly all the data not selected by the first and second. This means we can skip all the mumbo-jumbo and just write:

select film_id
, title
, release_year
, case rating
when 'PG-13' then 'PG'
when 'NC-17' then 'R'
else rating
end
from film
where release_year > 2005
and release_year <= 2006

So the CASE expression now makes up for the rating column.

Needless to say that the explain output has shrunk quite a lot:

+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | film | ALL | NULL | NULL | NULL | NULL | 952 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+

Finally...


Just a short summary:

  • Real-life code can be very bad

  • Avoid derived tables

  • Use UNION ALL instead of UNION


Sadly I have not been able to test it myself. I'm quite confident that the reasoning behind this refactoring is sound, but you never know...

If you have any questions or suggestions, post a comment to the article.

5 comments:

Toasty said...

Good article, great original query
;)

BTW, isn't:

where release_year > 2005
and release_year <= 2006


the same as:
where release_year = 2006

Or do we have fractional year parts? :)

Also I'd have thought that the optimiser would smart enough to rewrite this part but it doesn't seem to be...

Roland Bouman said...

Toasty:

yes, you are right, good spot! Thanks for mentioning it.

However, I chose these :) In the original query, we were dealing with real DATE fields, and I adapted it to the sakila database. Incidentally, all the films there have the release_year 2006, but i should of course've taken a more convincing period.

How ironic that I should make that mistake in an article like this. May it remain here on the web forever as a testimony to my own remarks regarding that.

Roland Bouman

Roland Bouman said...

By the way, Toasty:

"Also I'd have thought that the optimiser would smart enough to rewrite this part but it doesn't seem to be..."

I'm pretty sure the MySQL optimizer does not optimize the UNION here. I'd have to check and see what Oracle does.

I really did not get a chance to look thorough enough to the real system, so it's all conjecture.

However, I thought it would be nice to take a refactoring example that focusses on only logic thinking, instead of the more 'hardcore' performance tuning activities which usually dominate in query rewrite articles.

gilf said...

Interesting post Roland, could you have used an IF instead of a CASE? Not sure what effect that would have on performance.

Often with SQL you stuff that works but isn't exactly optimised, due in part to many developers lack of experience with the more detailed level of SQL syntax.

With regards to your soccer team example I think in Oracle you could use Analytical Queries to do that.

Roland Bouman said...

Hi Andrew (gilf! long time no see, how are you mate?!)

I think a CASE expression can always be translated into an IF expression, and vice versa. The example from the query:

CASE rating
WHEN 'PG-13' THEN 'PG'
WHEN 'NC-17' THEN 'R'
ELSE rating
END

is a so-called "simple case expression". It's called simple because there are no compound expressions involved in evaluating the WHEN alternatives. This is a lot like the Oracle DECODE() function (The MySQL DECODE() function is an entirely different thing!).

Anyway, let's translate this to a "searched case expression", just for fun:

CASE
WHEN rating='PG-13' THEN 'PG'
WHEN rating='NC-17' THEN 'R'
WHEN rating
END

As you can see, this is not a "simple" evaluation anymore. Each when alternative tests an entire condition. We just happen to test against the RATING column for each WHEN, but we don't have to. We can test entirely unrelated conditions with the searched case, and even use compound conditions with AND and OR.

I am doing this translation to show how much this is like the classic IF...THEN...ELSEIF...ELSE logic that everybody is familiar with:

IF rating='PG-13' THEN 'PG'
ELSEIF rating='NC-17' THEN 'R'
ELSE rating
END

Well, only one trivial step is to separate the ELSIF into ELSE and IF:

IF rating='PG-13' THEN 'PG'
ELSE
IF rating='NC-17' THEN 'R'
ELSE rating
END

And this is of course already almost like the MySQL IF() function you are referring to:

IF(
rating='PG-13'
, 'PG'
, IF(
rating='NC-17'
, 'R'
, rating
)
)

So yeah. I can't prove it formally, but I am pretty sure any CASE construct can be translatd into a series of nested IF..ELSE constructs.

(Note: in the functions section of the MySQL Manual, these CASE and IF expressions are called "Control Flow Functions", see http://dev.mysql.com/doc/refman/5.0/en/control-flow-functions.html I don't like that name at all, because to me, these are plain expressions, that is, operations that yield a value. "Flow of Control" is something I associate with a process - kind of what is described as "Flow Control Contructs" here http://dev.mysql.com/doc/refman/5.0/en/flow-control-constructs.html So I stubbornly refer to these functions as "Conditional Functions". The SQL Standard refers to all these things as "Case Expressions")

Your note on the soccer team example is true too. Oracle has analytical functions that will allow you to do this. Interestingly, the analytical functions are implemented by saving intermediate resultsets in temporary tables and requerying those to get the aggregate-of-and-aggregate effect. Of course, a subquery is also just a temptable - at least, in practice it is. It's probably more correct to say that a subquery is a recipe to create a temptable, and that by writing down the recipe, you automagically create such a temporary table.