Monday, October 31, 2005

Creating Crosstabs in MySQL

In some cases, it would be very nice to be able to get a crosstab result from a query. You can't really do that in MySQL, at least not directly; as matter of fact, you can't do it in most rdbms-es (Interestingly, MS Access does have a language construct for crosstabs: TRANSFORM...PIVOT..)
If you really need a crosstab, here are some techniques that might be of use. I'll illustrate them using the updated version of the sakila database. I really think the following stuff will be a lot more fun if you download and install this database first. If you think that's too much of a hassle, then just open the diagram of this database in a new window to help you understand queries.

Step 1: decide what objects you want to know something about


Let's say we want to know something about films. We will need something to recognize a film, and we'll be choosing the title to do that:

select f.title
from films f

This will just return a list of all films.

Step 2: decide what you want to know about those objects


Well, suppose we want to know how well our films are being rented, we could do:

select f.title
, count(r.inventory_id) rentals
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title

This query takes all films, and finds the stores that have the film in their inventory. Because we do this using an inner join, we exclude the films that aren't in any store's inventory. This is good: by definition, there can't be any rentals for those films, so it does not make sense to include them in the result.

The inventories are left joined to the rentals. This ensures we won't exclude films that have never been rented. In this case, we definitely want to include these films. If you were a DVD rental store manager, wouldn't you want to know about those films that are for rent but have never been rented before?

Step 3: decide how you want to categorize your data


Now, we do know how many times a particular film has been rented, wich is good. But there's more to a rental than a film ending up to be rented. For starters, a film isn't "just" rented: it is being rented by a particular customer. Second, the customer doesn't "just" rent the film: it's rented from a particular store. The different customers that can rent films, and the different stores where they rent film from offer a perspective that let's us 'slice' out a subset of all the rentals.

For this example, let's stick to the stores to categorize our data.

Given our previous query, it's pretty easy to get our hands on the actual information. We just need to add the store_id to the select and group by list from the previous query:

select f.title
, i.store_id
, count(r.inventory_id) rentals
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title
, i.store_id

This will give us the number of times a film was rented per store.

Hey, you promised us a crosstab, now pay up!


But wait! This is not a crosstab, this is just a tablular resultset! That's right of course. But we know all the information is there, so what's the matter then?

Actually, a crosstab is just a particular presentation of data. To present a crosstab of this resultset, we could have the films along it's vertical axis (in the direction of the rows), the stores along it's horizontal axis (in the direction of the columns). (We could just as well turn it the other way around, but because we have so much movies, it's probably more practical to have them placed along the vertical axis.) The number of rentals would then be at the row-column intersections or the cells.

Now, as for having the films along the vertical axis - that's pretty much what our queries have been doing all along. A SELECT expression are all about dynamically generating resultset rows according to data in database tables. We can be quite confident that this will not pose a problemn.

Having the stores along the horizontal axis is an entirely different story. Columns in the resultset of a SELECT expression correspond to the individual select list expressions. We sure do control generation of resultset columns ourselves: the expression that we type is evaluated to data appearing in the column, and if we give that expression an alias, that will be used as the column heading. However, we can only exert this control in a static manner, by typing them as part of the statement. What we would like to have is a kind of device that would generate columns dynamically, driven by the data (in our case: the rows from the store table).
Actually, we need to generate columns quite like the way a SELECT expression generates rows.

Let's accept this situation - for the moment.

Step 4: Coding the horizontal axis


To find a way out, let's just pretend our collection of stores is not dynamic, (it doesn't change that often anyway) and try to achieve only the layout of a crosstab. This would involve us typing a separate SELECT list expression for each store, thus statically creating a column definition for each store.
So let's inspect the store table and see for what stores we would have to type an extra column expression:

select *
from store

gives us:

+----------+---------+------------+
| store_id | manager | address_id |
+----------+---------+------------+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
+----------+---------+------------+

That's not too bad. Only two stores, wich means we'll only have to type two expressions. Let's concentrate on these expressions.
In our previous queries, we used:

count(r.inventory_id)

to calculate the number of rentals. Now, we still want to calculate the number of rentals, but for a particular column, we only want to count those rentals that occurred in the store corresponding to the column. We need something like a conditional version of count. So, suppose we would be writing the expression for the store with store_id = 1, we could try something like this:

count(
if(i.store_id=1
, r.inventory_id
, null
)
)

But wait! Something's wrong here. What if a particular film would be in the Inventory of the store with store_id = 2, but not in the inventory of Store with store_id = 1? Well, the if expression would always return nnull, because no records will be found for the store with store_id = 1. Count will return the count of non-null values, but zero if there are only null values. So, this expression will return zero for the store with store_id = 1.

This is bad news! It means we cannot distinguish between a film that does not exist in the inventory of a particular store and a film that does exist in the inventory but isn't rented at all. Lucky for us, we can emulate count using sum, like this:

sum(
if(i.store_id=1
, 1
, null
)
)


The if will return 1 for each record where the store_id = 1, and null in other cases.
SUM will faithfully sum these 1's while ignoring the null's, effectively yielding the count.
If sum is fed with only null arguments, sum will return null. This is good, because the result will now let us distinguish between having a film that happens not to exists in the inventory of a particular store as opposed to having the film in the inventroy without ever renting it.

So, the query looks like this:

select f.title
, sum(
if(i.store_id=1
, 1
, null
)
) "store_id=1"
, sum(
if(i.store_id=2
, 1
, null
)
) "store_id=2"
from film f
inner join inventory i
on f.film_id = i.film_id
left join rental r
on i.inventory_id = r.inventory_id
group by f.title

So, there's our crosstab. It wasn't hard at all, just a bit tedious.

Step 4 - continued: Generating the statement


Up untill now, we've put up with the nuisance of 'generating' columns by typing them. Now it's time to stop putting up with that. We've hand-coded a working Example, and it has taken us enough time. We do not want to waste more time on the same problem in the future. And really, we're bound to encounter this problem again. Sooner or later, the data in the store table will change, probably invalidating our crosstab query.

But hey, it's not so bad, is it? We know exactly wich parts of the code correspond to occurrences in the store table, right? And we know what parts of the code are independant of all that. So, let's try and devise something that will generate all that code for us.


select concat(
'select f.title','\n'
, group_concat(
concat(
', sum(','\n'
, ' if(i.store_id=',s.store_id,'\n'
, ' , 1','\n'
, ' , null','\n'
, ' )\n'
, ' )'
, ' "store_id=',s.store_id,'"\n'
)
order by s.store_id
separator ''
)
, 'from film f','\n'
, 'inner join inventory i','\n'
, 'on f.film_id = i.film_id','\n'
, 'left join rental r','\n'
, 'on i.inventory_id = r.inventory_id','\n'
, 'group by f.title','\n'
) statement
from store s

This query returns just one row. The one column will contain a string value that is exactly equal to the statement text of our previous hand-coded crosstab query. Compare the hand-coded crosstab-query SELECT expression with the SELECT list of the generator. As you can see, the generator does not much more than select the concatenated string literals that make up the crosstab query. The only variable element is the GROUP_CONCAT call wich aggregates all the rows from the store table in the FROM list of the generator, yielding the collection of SELECT list expressions that makes up the horizontal axis of the crosstab, and of course the data appearing in the cells.

(By the way, GROUP_CONCAT is a commonly recurring element in these generators. I just love this feature, wich really seems to be unique to MySQL. So much for open source products not being innovative)

Now that we can generate the crosstab query SELECT expression, we will never have to worry about data changing in the store table. We can simply run this query to obtain an updated query for the crosstab.

Step 4 - continued: wrapping generation and execution in a stored procedure


Generating the query is certainly an improvement. By generating the statement, we created a single point of definition that handles the duplication of code (the Column expression) in a structured, data driven way. That saves us time, and more important, it will always be correct. It will never contain syntax errors, and it will always cost the same amount of time to create an updated version of the Statement should the data in the store stable be modified.

It's still a bit of a nuisance that we can't just execute the crosstab query with a single statement though.
Right now, we always need at least two statements: one to generate the statement, and one to execute the generated statement.
We would really like to do that in one statement.

Lucky for us, we can actually do that.
We can create a stored procedure that performs these steps in the right order:

CREATE PROCEDURE `sakilar2`.`p_film_rentals_per_store`()
LANGUAGE SQL
NOT DETERMINISTIC
CONTAINS SQL
SQL SECURITY DEFINER
begin
select concat(
'select f.title','\n'
, group_concat(
concat(
', sum(','\n'
, ' if(i.store_id=',s.store_id,'\n'
, ' , 1','\n'
, ' , null','\n'
, ' )\n'
, ' )'
, ' "store_id=',s.store_id,'"\n'
)
separator ''
)
, 'from film f','\n'
, 'inner join inventory i','\n'
, 'on f.film_id = i.film_id','\n'
, 'left join rental r','\n'
, 'on i.inventory_id = r.inventory_id','\n'
, 'group by f.title','\n'
) statement
into @film_rentals_per_store_sql
from store s
order by s.store_id
;
prepare film_rentals_per_store
from @film_rentals_per_store_sql
;
execute film_rentals_per_store
;
deallocate
prepare film_rentals_per_store;
end

Inside the procedure, we first generate the sql that defines the query, and then use the prepared statement syntax to execute the generated statement. It's not as if we really need statement preparation; we just need some dynamic SQL. The prepared statement syntax happens to be the only construct that supports this in MySQL right now.

By now, we've achieved pretty much all we wanted. We can get a crosstab query result by simply calling the procedure. I'ts a bit of a pity we cannot setup a more generalized approach, but I'll settle with this for a while.

29 comments:

Anonymous said...

WOW! This is really neat! You saved my day, thanks Roland!

I was actually researching a way to get statements that cannot be executed by PREPARE to be added to a dynamically created file and read back, all from SP's and PS's. This goes well, until reading back the statements, which, I found, cannot be done from an SP or PS with something like:

SOURCE my_stmts.txt

(why would you want this? Well, statements like CREATE DATABASE and DROP TRIGGER cannot be parameterized, ie, use variables, and cannot be used with PREPARE).

Back to you crosstab: it really saved my day for generating my reports! What a work ;-)

-- Abel

rpbouman said...

Hi Abel,

thanks for your kind comments. Maybe you should take a look at Beat Vontobel's blog article: http://www.futhark.ch/mysql/106.html

There you will find a general and more flexible approach for this problem. Another very, very good resource is Giuseppe Maxia's presentation:
http://programm.froscon.de/attachments/37-pivot_tables_mysql_5.pdf#search=%22crosstab%20maxia%22

Anonymous said...

Absolutely brilliant. Thank you!!

Harshal @ Snehapadma said...

Works like charm...

Just wanted to one thing:

If your no of columns are high, you might to tweak with variable group_concat_max_len of your server instance.

Anonymous said...

The above procedure works only for numerical columns to be represented as crosstab, but it doesnt work for columns with characters data type

rpbouman said...

Hi Anonymous,

"...works only for numerical columns..."

I guess you mean the cells that appear inside the grid? Well, that's because we used SUM to aggregate. You can't use SUM to sum up strings, but I think that is pretty obvious,

If you would use MIN, MAX it would work (although the result may not be very useful). And of course, GROUP_CONCAT also works - perhaps that is the closest to a SUM operation for strings.

HTH,

Roland

Anant said...

Hi,

I need to create a crosstab query as follows:

table is as follows:
name,location
Chris,California
Chris,Texas
Isabel,Florida
Isabel,Texas
Isabel,Virginia
Isabel,Washington
Rhea,Montana

Desired Table:

Name loc1 loc2 loc3 loc4
Chris California Texas
Isabel Florida Texas Virginia Washington
Rhea Montana

Any help will be appreciated.

rpbouman said...

Hi Anant,

the article really contains all information you need. Just read, try, and tell us what you came up with. If you post some things you got stuck with, we can try and help. However, we won't just do your homework for free.

Thank you.

Anant said...

hello Roland,

i have tried the following queries, but cannot get what i want:

input table:

mysql> select * from crosstab;
+----+---------+------+
| id | name | dept |
+----+---------+------+
| 1 | anant | ME |
| 2 | anant | EE |
| 3 | darshan | MATH |
| 4 | darshan | PHYS |
| 5 | darshan | ME |
| 6 | gautam | ME |
| 7 | gautam | EE |
| 8 | vishal | EE |
| 9 | vishal | MATH |
+----+---------+------+

query:

select name,group_concat(if(dept="EE","EE",null))sub,
group_concat(if(dept="ME","ME",null))sub2,
group_concat(if(dept="MATH","MATH",null))sub3,
group_concat(if(dept="PHYS","PHYS",null))sub4
from crosstab
group by name

result:

+---------+------+------+------+------+
| name | sub | sub2 | sub3 | sub4 |
+---------+------+------+------+------+
| anant | EE | ME | NULL | NULL |
| darshan | NULL | ME | MATH | PHYS |
| gautam | EE | ME | NULL | NULL |
| vishal | EE | NULL | MATH | NULL |
+---------+------+------+------+------+

i want the subjects to start from the first column. also, i cannot figure out how can i automate the sub, sub1, sub2, sub3

another query that i tried to get subjects to start from first column:

select name,if(dept="EE",dept,if(dept="ME",dept,if(dept="MATH",dept,if(dept="PHYS",dept,null))))sub
from crosstab
group by name

+---------+------+
| name | sub |
+---------+------+
| anant | ME |
| darshan | MATH |
| gautam | ME |
| vishal | EE |
+---------+------+

but i can't figure out how do i get the remaining subjects in column.

appreciate your prompt response. trust me i have tried so many different queries. i think i am missing some small fundamental and hence i am not able to get what i want.

thanks for your help.

HD said...

Anant

To get them in one column, use the concat function. See code below...

select name,
rtrim(concat(group_concat(if(dept="EE","EE ",null)),
group_concat(if(dept="ME","ME ",null)),
group_concat(if(dept="MATH","MATH ",null)),
group_concat(if(dept="PHYS","PHYS ",null))))
from crosstab
group by name

Anant said...

hello hd,

i tried using your query. but it does not give any result.

-------------------------------------------------------+
| name | rtrim(concat(group_concat(if(dept="EE","EE ",null)),
group_concat(if(dept="ME","ME ",null)),
group_concat(if(dept="MATH","MATH ",null)),
group_concat(if(dept="PHYS","PHYS ",null)))) |
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| anant | NULL |
| darshan | NULL |
| gautam | NULL |
| vishal | NULL |
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

T. Dye said...

Roland,

This is very nice.

In the case where the variables used for column names are varchar (rather than integer) something like this works:

select concat(
'select b.context','\n'
, group_concat(
concat(
', round(sum(','\n'
, ' if(v.taxon="',z.taxon,'"\n'
, ' , v.weight','\n'
, ' , null','\n'
, ' )\n'
, ' ),0)'
, ' "',z.taxon,'"\n'
)
order by z.taxon
separator ''
)
, 'from baglist b', '\n'
, 'left join invert_id v', '\n'
, 'on b.bag = v.bag', '\n'
, 'where b.project = 112 and v.project = 112', '\n'
, 'group by 1 order by 1', '\n'
) statement
from invert_categories z

Note the extra "" in the if statement.

This might be what anonymous was after a few months ago.

Tom

rpbouman said...

Hi T Dye!

Thanks for your contribution!
I kind of forgot this question was still lingering, and I guess I never really understood what wasn't working for that poster.

Anyway, I hope this solves it. Thanks again :)

Pete said...

Nice article, although the final SQL is so complicated it's dangerous. Does kind of re-iterate the fact the SQL is not built for this kind of reporting and you are much better off using some simple procedural language to get what you want.

rpbouman said...

Hi Pete,

yeah i mostly agree. You can take the sting out of it somewhat by using a stored procedure to generatee and dynamically execute the necessary SQL. Butt even then, I think that in most cases it is indeed better to leave it up to the application.

But thten again, I must confess that I simply enjoy these types of things as a passtime.

kind regards,
Roland

T Dye said...

Hi Roland,

I've found that creating a cross-tab in Python (which I'm assuming falls in the category of simple procedural language) is complicated. Perhaps this is due to my half-hearted programming skills. I discovered your blog as I was looking for a simpler approach. To me, it seems simple enough. I am interested in Pete's notion of "dangerous" SQL code--typically, I overlook dangers when I first jump into a project. Could you spell out for me what these might be?

All the best,
Tom

Unknown said...

This continues to be a very awesome article. As up update for those that need it, the sakila database diagram can be found at http://dev.mysql.com/doc/refman/5.5/en/wb-documenting-sakila.html

Regards,

Sean P. O. MacCath-Moran
www.emanaton.com

rpbouman said...

Sean, thanks! very rewarding to see these kind words, esp. since this article is already so old.

thanks again!

Roland.

Hans Heinrich Bergfeld said...

Great piece of code!
Saved my day.

Just one question. You use "CONTAINS SQL" in CREATE PROCEDURE. Should this not rather be "READS SQL DATA" since you select from the tables "store" and "film"?

Best regards

Hans Heinrich

rpbouman said...

Hi Hans!

glad you like it. Regarding the CONTAINS SQL characteristic: you're absolutely right! READS SQL DATA is much more appropriate. Right now, it doesn't matter though as MySQL doesnt actually do anything with these characteristics. But still, if you're writing anything there, it should be READS SQL DATA so that's my bad.

Thanks for pointing that out.

kind regards,

Roland

Anonymous said...

I disagree with your comment that there's no easy way to do a crosstab in most relational databases; Oracle, SQL Server, and PostgreSQL all have their own built-in implementation of crosstab. Only MySQL lacks it.

Scales and Pom said...

Hi... is there anyway to get the files you make reference to in your first paragraph? OpenWin.com looks like it's a dead link & squaters have it now. Thanks!

rpbouman said...

@Casa de Toad, you can download the sakila database here:

http://dev.mysql.com/doc/index-other.html

Unknown said...

Sir may i get sql dump of the sakila db .. to try your script of crosstab query . ant try to apply it in my case..

Best Regard's
Syafii

rpbouman said...

Hi Abdullah syafii,

thanks for your comment. The links to the sakila database were badly outdated. I just updated them. Installation instructions and dowload link are now here: http://dev.mysql.com/doc/sakila/en/sakila-installation.html

Unknown said...

Thank you Sir...

Unknown said...

Dear Mr Bouman

I'm Syafii from Jakarta , Indonesia

Sir, I want to asking About query .. I have a problem that i can't solved
this is the result of my query
1.)
ref_number | Product | Qty_plus
SO/1208/00003 | PSP/D01/Class01/L06/BP | 5
SO/1208/00003 | PSP/D01/Class01/L06/UP | 10
SO/1208/00003 | PSP/D01/Class01/L06/BP | 15
SO/1208/00003 | PSP/D01/Class01/L06/UP | 25

2.)
ref_number | Product | Qty_minus
SO/1208/00003 | PSP/D01/Class01/L06/BP | 2
SO/1208/00003 | PSP/D01/Class01/L06/UP | 2
SO/1208/00003 | PSP/D01/Class01/L06/BP | 4
SO/1208/00003 | PSP/D01/Class01/L06/UP | 10

and how to make the result of my data on top into data like in bottom

ref_number Product Qty_plus Qty_minus Qty_summary
SO/1208/00003 "PSP/D01/Class01/L06/BP" 20 6 14
SO/1208/00003 "PSP/D01/Class01/L06/UP" 35 10 25


in the result I want to make summary report like it

can you help me sir the SQL Syntax please ..

I'm Sorry about my langguage .. i can't write english fluently

Best Regards

Abdullah Syafii

Shyamala Lakshmanan said...

Hi Roland,

Very nice coding for CrossTab. Can you add a row-wise and column-wise TOTALS and then a % to the Total. Thus there will be two more rows and two more columns. The last row and rightmost column intersection might be blank. - TVL

rpbouman said...

Shyamala Lakshmanan,

if you have the code for these totals, please post it so other readers can benefit.

Thank you.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...