TRANSFORM...PIVOT..
)If you really need a crosstab, here are some techniques that might be of use. I'll illustrate them using the updated version of the sakila database. I really think the following stuff will be a lot more fun if you download and install this database first. If you think that's too much of a hassle, then just open the diagram of this database in a new window to help you understand queries.
Step 1: decide what objects you want to know something about
Let's say we want to know something about films. We will need something to recognize a film, and we'll be choosing the title to do that:
select f.title
from films f
This will just return a list of all films.
Step 2: decide what you want to know about those objects
Well, suppose we want to know how well our films are being rented, we could do:
select f.title
, count(r.inventory_id) rentals
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title
This query takes all films, and finds the stores that have the film in their inventory. Because we do this using an inner join, we exclude the films that aren't in any store's inventory. This is good: by definition, there can't be any rentals for those films, so it does not make sense to include them in the result.
The inventories are left joined to the rentals. This ensures we won't exclude films that have never been rented. In this case, we definitely want to include these films. If you were a DVD rental store manager, wouldn't you want to know about those films that are for rent but have never been rented before?
Step 3: decide how you want to categorize your data
Now, we do know how many times a particular film has been rented, wich is good. But there's more to a rental than a film ending up to be rented. For starters, a film isn't "just" rented: it is being rented by a particular customer. Second, the customer doesn't "just" rent the film: it's rented from a particular store. The different customers that can rent films, and the different stores where they rent film from offer a perspective that let's us 'slice' out a subset of all the rentals.
For this example, let's stick to the stores to categorize our data.
Given our previous query, it's pretty easy to get our hands on the actual information. We just need to add the store_id to the select and group by list from the previous query:
select f.title
, i.store_id
, count(r.inventory_id) rentals
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title
, i.store_id
This will give us the number of times a film was rented per store.
Hey, you promised us a crosstab, now pay up!
But wait! This is not a crosstab, this is just a tablular resultset! That's right of course. But we know all the information is there, so what's the matter then?
Actually, a crosstab is just a particular presentation of data. To present a crosstab of this resultset, we could have the films along it's vertical axis (in the direction of the rows), the stores along it's horizontal axis (in the direction of the columns). (We could just as well turn it the other way around, but because we have so much movies, it's probably more practical to have them placed along the vertical axis.) The number of rentals would then be at the row-column intersections or the cells.
Now, as for having the films along the vertical axis - that's pretty much what our queries have been doing all along. A
SELECT
expression are all about dynamically generating resultset rows according to data in database tables. We can be quite confident that this will not pose a problemn.Having the stores along the horizontal axis is an entirely different story. Columns in the resultset of a
SELECT
expression correspond to the individual select list expressions. We sure do control generation of resultset columns ourselves: the expression that we type is evaluated to data appearing in the column, and if we give that expression an alias, that will be used as the column heading. However, we can only exert this control in a static manner, by typing them as part of the statement. What we would like to have is a kind of device that would generate columns dynamically, driven by the data (in our case: the rows from the store table).Actually, we need to generate columns quite like the way a
SELECT
expression generates rows.Let's accept this situation - for the moment.
Step 4: Coding the horizontal axis
To find a way out, let's just pretend our collection of stores is not dynamic, (it doesn't change that often anyway) and try to achieve only the layout of a crosstab. This would involve us typing a separate
SELECT
list expression for each store, thus statically creating a column definition for each store.So let's inspect the store table and see for what stores we would have to type an extra column expression:
select *
from store
gives us:
+----------+---------+------------+
| store_id | manager | address_id |
+----------+---------+------------+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
+----------+---------+------------+
That's not too bad. Only two stores, wich means we'll only have to type two expressions. Let's concentrate on these expressions.
In our previous queries, we used:
count(r.inventory_id)
to calculate the number of rentals. Now, we still want to calculate the number of rentals, but for a particular column, we only want to count those rentals that occurred in the store corresponding to the column. We need something like a conditional version of
count
. So, suppose we would be writing the expression for the store with store_id = 1
, we could try something like this:
count(
if(i.store_id=1
, r.inventory_id
, null
)
)
But wait! Something's wrong here. What if a particular film would be in the Inventory of the store with
store_id = 2
, but not in the inventory of Store with store_id = 1
? Well, the if
expression would always return nnull
, because no records will be found for the store with store_id = 1
. Count
will return the count of non-null values, but zero if there are only null
values. So, this expression will return zero for the store with store_id = 1
. This is bad news! It means we cannot distinguish between a film that does not exist in the inventory of a particular store and a film that does exist in the inventory but isn't rented at all. Lucky for us, we can emulate
count
using sum
, like this:
sum(
if(i.store_id=1
, 1
, null
)
)
The
if
will return 1
for each record where the store_id = 1
, and null
in other cases.SUM
will faithfully sum these 1's while ignoring the null's, effectively yielding the count.If
sum
is fed with only null
arguments, sum
will return null
. This is good, because the result will now let us distinguish between having a film that happens not to exists in the inventory of a particular store as opposed to having the film in the inventroy without ever renting it.So, the query looks like this:
select f.title
, sum(
if(i.store_id=1
, 1
, null
)
) "store_id=1"
, sum(
if(i.store_id=2
, 1
, null
)
) "store_id=2"
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title
So, there's our crosstab. It wasn't hard at all, just a bit tedious.
Step 4 - continued: Generating the statement
Up untill now, we've put up with the nuisance of 'generating' columns by typing them. Now it's time to stop putting up with that. We've hand-coded a working Example, and it has taken us enough time. We do not want to waste more time on the same problem in the future. And really, we're bound to encounter this problem again. Sooner or later, the data in the store table will change, probably invalidating our crosstab query.
But hey, it's not so bad, is it? We know exactly wich parts of the code correspond to occurrences in the store table, right? And we know what parts of the code are independant of all that. So, let's try and devise something that will generate all that code for us.
select concat(
'select f.title','\n'
, group_concat(
concat(
', sum(','\n'
, ' if(i.store_id=',s.store_id,'\n'
, ' , 1','\n'
, ' , null','\n'
, ' )\n'
, ' )'
, ' "store_id=',s.store_id,'"\n'
)
order by s.store_id
separator ''
)
, 'from film f','\n'
, 'inner join inventory i','\n'
, 'on f.film_id = i.film_id','\n'
, 'left join rental r','\n'
, 'on i.inventory_id = r.inventory_id','\n'
, 'group by f.title','\n'
) statement
from store s
This query returns just one row. The one column will contain a string value that is exactly equal to the statement text of our previous hand-coded crosstab query. Compare the hand-coded crosstab-query
SELECT
expression with the SELECT
list of the generator. As you can see, the generator does not much more than select the concatenated string literals that make up the crosstab query. The only variable element is the GROUP_CONCAT
call wich aggregates all the rows from the store
table in the FROM
list of the generator, yielding the collection of SELECT
list expressions that makes up the horizontal axis of the crosstab, and of course the data appearing in the cells.(By the way,
GROUP_CONCAT
is a commonly recurring element in these generators. I just love this feature, wich really seems to be unique to MySQL. So much for open source products not being innovative)Now that we can generate the crosstab query
SELECT
expression, we will never have to worry about data changing in the store table. We can simply run this query to obtain an updated query for the crosstab. Step 4 - continued: wrapping generation and execution in a stored procedure
Generating the query is certainly an improvement. By generating the statement, we created a single point of definition that handles the duplication of code (the Column expression) in a structured, data driven way. That saves us time, and more important, it will always be correct. It will never contain syntax errors, and it will always cost the same amount of time to create an updated version of the Statement should the data in the store stable be modified.
It's still a bit of a nuisance that we can't just execute the crosstab query with a single statement though.
Right now, we always need at least two statements: one to generate the statement, and one to execute the generated statement.
We would really like to do that in one statement.
Lucky for us, we can actually do that.
We can create a stored procedure that performs these steps in the right order:
CREATE PROCEDURE `sakilar2`.`p_film_rentals_per_store`()
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
select concat(
'select f.title','\n'
, group_concat(
concat(
', sum(','\n'
, ' if(i.store_id=',s.store_id,'\n'
, ' , 1','\n'
, ' , null','\n'
, ' )\n'
, ' )'
, ' "store_id=',s.store_id,'"\n'
)
separator ''
)
, 'from film f','\n'
, 'inner join inventory i','\n'
, 'on f.film_id = i.film_id','\n'
, 'left join rental r','\n'
, 'on i.inventory_id = r.inventory_id','\n'
, 'group by f.title','\n'
) statement
into @film_rentals_per_store_sql
from store s
order by s.store_id
;
prepare film_rentals_per_store
from @film_rentals_per_store_sql
;
execute film_rentals_per_store
;
deallocate
prepare film_rentals_per_store;
end
Inside the procedure, we first generate the sql that defines the query, and then use the prepared statement syntax to execute the generated statement. It's not as if we really need statement preparation; we just need some dynamic SQL. The prepared statement syntax happens to be the only construct that supports this in MySQL right now.
By now, we've achieved pretty much all we wanted. We can get a crosstab query result by simply calling the procedure. I'ts a bit of a pity we cannot setup a more generalized approach, but I'll settle with this for a while.