Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Sunday, April 30, 2006

MySQL hack emulates BEFORE statement-level triggers

MySQL 5 supports row-level triggers. This means you can have the server execute a SQL (or SQL/PSM) statement just before or after a row is inserted, updated or deleted. Because these triggers are fired in response to an operation occurring for a single row, these are called row-level triggers.

The SQL Standard also defines the syntax and semantics for triggers that are to be fired once for an entire statement, regardless of the number of rows that is touched by that statement. Unsurprisingly, these are called statement-level triggers.

MySQL does not support statement-level triggers. However, there's a quick hack to emulate BEFORE STATEMENT triggers.

Using the ROW_COUNT function


The hack depends on the ROW_COUNT() function. This function returns the number of rows updated, inserted, or deleted by the preceding statement. An example update statement for the sakila database might illustrate that:

UPDATE sakila.category
SET name = UCASE(name)
;

Query OK, 16 rows affected (0.22 sec)
Rows matched: 16 Changed: 16 Warnings: 0

SELECT row_count();
+-------------+
| row_count() |
+-------------+
| 16 |
+-------------+
1 row in set (0.04 sec)

(BTW, you might want to backup your sakila installation and restore it or find some other way to undo the changes performed by the UPDATE statement.)

So, the update statement updated all 16 rows in the category table.

Using ROW_COUNT inside TRIGGERs


I wondered how ROW_COUNT() would react if it would be called from within a conventional MySQL row-level trigger, so I decided to test it.

First, we'll set up a table for which we want to define the triggers:

CREATE TABLE test_row_count(
id int
);


To log trigger activity for operations occurring on this table, we'll need another table:

CREATE TABLE test_row_count_log(
id int AUTO_INCREMENT PRIMARY KEY
, test_row_count_id int
, trigger_event enum('DELETE','INSERT','UPDATE')
, trigger_time enum('BEFORE','AFTER')
, row_count int
);


Now we can create triggers for each of the three DML events insert, update and delete, and for each of these events, we can write both a before and a after tigger. So, potentially, we could write up to 3 * 2 = 6 different triggers. Let's start out with the two INSERT triggers and see what happens:

delimiter //

CREATE TRIGGER bir_test_row_count
BEFORE INSERT ON test_row_count
FOR EACH ROW
INSERT
INTO test_row_count_log(
test_row_count_id
, trigger_event
, trigger_time
, row_count
) VALUES (
new.id
, 'INSERT'
, 'BEFORE'
, row_count()
);
//

CREATE TRIGGER air_test_row_count
AFTER INSERT ON test_row_count
FOR EACH ROW
INSERT
INTO test_row_count_log(
test_row_count_id
, trigger_event
, trigger_time
, row_count
) VALUES (
new.id
, 'INSERT'
, 'AFTER'
, row_count()
);
//

delimiter ;

Now let's see what happens when we insert a row into the test_row_count table.:

INSERT
INTO test_row_count(
id
) VALUES (
1
)
;

Executing this INSERT will fire both the triggers exactly once, so we expect to see two rows inside the test_row_count_log table.

We expect to see one row in test_row_count_log corresponding to the instant immediately before a row was created in test_row_count; we also expect one row corresponding to the moment immediately after the insert on test_row_count occurred. We're most interested in the value stored in the row_count column of the test_row_count_log table, as this would've captured the value returned by the ROW_COUNT() function. So:

SELECT *
FROM test_row_count_log
ORDER BY id
;

Here's what happened:

+----+-------------------+---------------+--------------+-----------+
| id | test_row_count_id | trigger_event | trigger_time | row_count |
+----+-------------------+---------------+--------------+-----------+
| 1 | 1 | INSERT | BEFORE | -1 |
| 2 | 1 | INSERT | AFTER | 1 |
+----+-------------------+---------------+--------------+-----------+
2 rows in set (0.08 sec)

So, both triggers fired once, and both inserted a row in the test_row_count_log. Actually, it's quite funny seeing the -1 value in the row_count column of the row corresponding to the firing of the BEFORE trigger. I mean, being a count, I'd expect the value 0 rather than -1.

For the row corresponding to the AFTER trigger, the value in the row_count value makes more sense. At this point, we can certainly agree that indeed exactly one row is inserted, and so the value is 1 in this case.

When we execute a statement that inserts multiple rows, we can observe another interesting phenomenon:

INSERT
INTO test_row_count
VALUES (2),(3)
;

SELECT *
FROM test_row_count_log
WHERE test_row_count_id in (2,3)
ORDER BY id
;

+----+-------------------+---------------+--------------+-----------+
| id | test_row_count_id | trigger_event | trigger_time | row_count |
+----+-------------------+---------------+--------------+-----------+
| 3 | 2 | INSERT | BEFORE | -1 |
| 4 | 2 | INSERT | AFTER | 1 |
| 5 | 3 | INSERT | BEFORE | 1 |
| 6 | 3 | INSERT | AFTER | 1 |
+----+-------------------+---------------+--------------+-----------+
4 rows in set (0.02 sec)

Even though the statement fires the BEFORE trigger multiple times, the -1 is returned only for the first row that is touched by the statement. For all subsequent rows, ROW_COUNT() returns a 1 for de row_count column.

Emulating the statement level trigger


This behaviour offers the opportunity to emulate a statement-level trigger. Let's add a level column to the test_row_count_log table and rewrite the BEFORE trigger to demonstrate this:

ALTER TABLE test_row_count_log
ADD level enum('ROW','STATEMENT')
DEFAULT 'ROW'
;
DROP TRIGGER bir_test_row_count
;

delimiter //

CREATE TRIGGER bir_test_row_count
BEFORE INSERT ON test_row_count
FOR EACH ROW
BEGIN
DECLARE v_row_count int DEFAULT row_count();
IF v_row_count!=1 THEN
INSERT
INTO test_row_count_log(
test_row_count_id
, trigger_event
, trigger_time
, row_count
, level
) VALUES (
new.id
, 'INSERT'
, 'BEFORE'
, v_row_count
, 'STATEMENT'
);
END IF;
INSERT
INTO test_row_count_log(
test_row_count_id
, trigger_event
, trigger_time
, row_count
, level
) VALUES (
new.id
, 'INSERT'
, 'BEFORE'
, v_row_count
, 'ROW'
);
END;
//

delimiter ;



INSERT
INTO test_row_count (id)
SELECT id + 3
FROM test_row_count
;

SELECT *
FROM test_row_count_log
WHERE test_row_count_id > 3
ORDER BY id
;

+----+-------------------+---------------+--------------+-----------+-----------+
| id | test_row_count_id | trigger_event | trigger_time | row_count | level |
+----+-------------------+---------------+--------------+-----------+-----------+
| 7 | 4 | INSERT | BEFORE | -1 | STATEMENT |
| 8 | 4 | INSERT | BEFORE | -1 | ROW |
| 9 | 4 | INSERT | AFTER | 1 | ROW |
| 10 | 5 | INSERT | BEFORE | 1 | ROW |
| 11 | 5 | INSERT | AFTER | 1 | ROW |
| 12 | 6 | INSERT | BEFORE | 1 | ROW |
| 13 | 6 | INSERT | AFTER | 1 | ROW |
+----+-------------------+---------------+--------------+-----------+-----------+

Using an IF statement to test the value returned by the ROW_COUNT() value, we can detect if the trigger is handling the first row, and if it is, we can do some special work. In this case, we simply insert a row in the test_row_count_log table, marking it by storing the value 'STATEMENT' in the level column.

Some closing notes


The IF statement tests for those cases where the ROW_COUNT() function returns something else than 1, rather than checking explicitly for equality with -1. This is because I suspect that the value of -1 might be a bug, and it wouldn't surprise me if that would be changed to be a 0 in a newer version of the server.

There's another observation that might convince you to be careful with this hack. At least in MySQL versions 5.0.18 and 5.1.7 (the platform I tested), ROW_COUNT() behaves differently when you wrap a procedure around the code that's now inside the triggers. I found that ROW_COUNT() always returns 0 in this case. As far as I can see now, at least three distinct contexts are involved: procedures, triggers and immediate statements. There might be even more of course, but my observations with ROW_COUNT() imply at least these three.

If you plan to use ROW_COUNT() to emulate a statement-level BEFORE trigger, it's probably a good idea to be careful when upgrading, and convince yourself that the code still works in the newer server version.

I also tested this hack for UPDATE and DELETE statements - that worked for me. Of course, if anyone has any new insights, or methods to achieve a statement-level trigger in MySQL, I'd be very interested. Just leave a comment here and tell the world about it.

One of the things that keep intriguing me if it would be possible to emulate an AFTER statement-level trigger. As far as I can see now, this is impossible, as there is now information regarding the total number of rows that is handled by the statement that fires the triggers. Of course, If you know a hack that does this, I'd be most grateful to hear about it.

Monday, April 24, 2006

Sightseeing San Francisco

As a prelude to the MySQL users conference,

Markus and I went to fetch Mike from the airport to do some sightseeing.

Well, it's like this, I'm from a country that's very small in every way. Apart from being small, it's also very flat. Can you imagine what a place like San Francisco does with someone that's been living for over 30 years in such an enviroment? No?

Check out what happened to us while we were sightseeing San Francisco:

Mike already blogged about us taking the cable cart....

But he didn't tell you how fast these things go:

...and what perils it faces when it navigates it's way through the other traffic...

...but lucky for us, they installed a guardian angel to watch over the cable cart travellers:

After that, we figured that the golden gate bridge just might maybe a little more relaxing...

...until you decide to look down!!

...It just makes you wanna reach out...

...or call for some counseling:

Thursday, April 20, 2006

JIT Hotel Reservation for the MySQL UC

I finally made hotel reservations for my stay in the U.S. for the MySQL User Conference.

I'm flying to San Fransisco tomorrow and I will arrive there in the afternoon. Then, I'm straight off to Santa Clara to check into my hotel.

I'd like to go for some sightseeing on maybe Saturday but definitely Sunday, and I'm thinking of San Fransisco itself.

I'm supposed to have WIFI access at the hotel, so if anyone would like to join me or arrange a meeting, I should be able to read my mail there too.

Have a nice trip everyone, and see you at the UC!

Be careful with \ in your my.ini

It's probably a nobrainer, but as this is the first time I'm running into this problem, I decided I'd write it down.

I nearly ran out of diskspace on the C: disk of my notebook (running windows there). You now how it goes: I knew I did not have too much space left, but that little...It just couldn't be true. So, I ran a little Windows Wizard to clean up temporary files and such, and the next thing that happened:


ERROR 126 (HY000): Incorrect key file for table 'C:\WINDOWS\TEMP\#sql_474_0.MYI'; try to repair it


Whoamm! The mysql client utility quit running my query. Seems like a temporary table got cleaned up too...
(I'm running an insane query for experimentation purposes).

The good news was that my disk space problems were indeed instantly solved.

Of course, I'm a little bit surprised that it would be possible for windows to throw away the files associated with the temporary table. I would expect them to be locked by the MySQL server so that it would be impossible for windows to delete it.

Anyway, I decided that maybe I should not be so lazy and put some effort in not running into this kind of trouble so easily. I've got lots (...well, it's never enough, right?) of space left on my D: disk, so I looked in the manual to look for some hint on how to specify the location for the MySQL temporary files.

Well, that was easy enough: http://dev.mysql.com/doc/refman/5.1/en/temporary-files.html tells you that MySQL will use the value of the TMPDIR environment variable as the path, or else a system default. You can also set the directory using the --tmpdir option of the mysqld server executable.

I guess that in my case, MySQL was using the system default, because I do not have a TMPDIR environment variable (I've got a TEMP and a TMP though, neither of which was used). Apparently, C:\WINDOWS\TEMP is the system default on windows.

I decided that it would be best to set the directory using the --tmpdir option, because I can add it as an entry in the my.ini file, which I can reuse whenever I upgrade the server software. So, I went right ahead, and added this entry to my.ini:


tmpdir=D:\MySQL\temp


I restarted the server, and re-ran the query. Almost instantly, it went:

ERROR 1 (HY000): Can't create/write to file 'D:\MySQL emp\#sql_648_0.MYI' (Errcode: 22)


LOL! It seems that the \t sequence in the path is interpreted as the horizontal tab character (ascii 0x09). I looked it up in the manual in http://dev.mysql.com/doc/refman/5.1/en/mysql-config-wizard-editing.html, but I could not find a note documenting this behaviour. Well, it's always worth trying to use forward slashes:


tmpdir=D:/MySQL/temp


Yes! Now I can run the query. You can probably also escape the \ using \\ but I'm not going to try it right now as I like the forward slashes better.

Monday, April 17, 2006

Server SQL Modes Quickref chart available at the MySQLForge Wiki

Wow!

I just created my first Wiki Page, ever.

I created a new Quickref category in the wiki of the MySQLForge site. There, I created a new article where I put the Server SQL Mode chart I blogged about yesterday in a HTML format.

I like the MySQL reference manual a lot, but especially when I'm looking for some specifics about a topic I already read before, I find that it takes too much time to quickly locate what I need. Knowing what simple Server SQL modes are member of what composite mode is such an example. Knowing the order of arguments for a function is another such problem. I hope the quickref section in the wiki will remedy this, and I hope other people will find it useful too.

Sunday, April 16, 2006

Overview of Server SQL Modes

One of the features that distinguishes MySQL 5 from it's predecessors is the increased ability to control how the server interprets SQL and how strict or relaxed it should behave towards errors. This is done by setting the server SQL mode.

Choosing an appropriate Server SQL Mode can be an important tool to ease the task of porting from database products like Oracle or Microsoft Sequal Server. Despite the terrific documentation on the MySQL Reference manual, I really felt the need to have some kind of quick reference around to make effective use of the server sql modes. So, I made up one myself:



Enjoy!

Some MySQL XML utilities

Hi everyone,

I know it's next to nothing, but in case you happen to need it, I've got some XML-related MySQL utilities.



It's all as-is of course but this time, I explicitly included a couple of lines saying that the usage is all under GPL license.

Regarding the information_schema dump, there are some more tools in the works, among which is a xslt stylesheet that generates human readable documentation for a database schema in html format. As soon as I feel these tools are usuable to others, I will make them available too.

Nested repeating groups in MySQL query output

There are a lot of articles around that explain that relations (tables) should be normalized (an article by Mike Hillyer will get you started) before you can store them in a relational database. The complete normalization process consists of two steps:

  1. elimination of multi-valued attributes ('at most one value per column')

  2. elimination of redundancy ('no duplication of data')


Although these are two distinct activities that solve different problems, the methods used to perform them have a great deal in common. In both cases, some columns from the original table are moved to a new table, and a foreign key relationship is constructed to be able to relate the two tables.

The goal of the entire normalization process is to make the data as maintainable as possible to ensure integrity and validity of the data. Essentially, that goal is achieved by putting each separate 'thing' in it's own table reducing the amount of effort that is needed to add, remove or modify it.

This ususally means that retrieving and presenting data requires more effort than it would've been if the data would've been stored in an unnormalized structure. Because data is spread out across multiple tables, answering even simple questions usually involves accessing multiple tables.

More often than not, data presentation involves recreating (but not storing) a redundant structure with repeating groups. This is especially true for human-readable presentations of data (reports etc), but it may also occur when generating flat files from the database for the purpose of data-exchange (for example, XML-documents).

Although it may require considerable effort to query and retrieve a normalized structure to present data in a redundant format with repeating groups, it is doable. It's certainly a lot easier than storing the data in a redundant format and maintaining integrity and validity in case data is added, modified or removed.

GROUP_CONCAT(): repeating groups through aggregation


Consider the following structure from the Sakila sample database:

(Some of you might recall I used this structure in my previous blog entry to support my plea for a new JOIN USING FOREIGN KEY syntax.)
This is a normalized structure. There are no multi-valued attributes (no column can have more than one value for a particular row). There is no redundancy either: each column is dependant only upon the primary key of it's table.

Suppose you'd want an overview of all the countries and their cities. As described in my previous blog entry, this is easily achieved by JOIN-ing the city table to the country table:

SELECT country.country
, city.city
FROM country
JOIN city
ON country.country_id = city.country_id
ORDER BY country.country
, city.city
LIMIT 5
;

(The ORDER BY and LIMIT clauses are there only to allow me to present a small, managable list of data - it has no other bearing).
This query yields the following result:

+----------------+--------+
| country | city |
+----------------+--------+
| Afghanistan | Kabul |
| Algeria | Batna |
| Algeria | Bchar |
| Algeria | Skikda |
| American Samoa | Tafuna |
+----------------+--------+
5 rows in set (0.03 sec)


Now, let's rephrase the requirements just a little bit. What if we want an overview of all countries and a list of all it's cities. It resembles the previous overview in that each country is displayed with all of it's cities; it's just the presentation that is different. Here, the output is required to use a repeating group: the result itself will be a list of countries, and each country in the list will contain a list of cities itself. That's pretty much the definition of a repeating group:

+----------------+----------------------+
| country | cities |
+----------------+----------------------+
| Afghanistan | Kabul |
| Algeria | Batna, Bchar, Skikda |
| American Samoa | Tafuna |
+----------------+----------------------+
3 rows in set (0.03 sec)

As it turns out, it is impossible to solve this with a single SELECT expression in Oracle or MS SQL. How convenient that this is not a problem at all in MySQL:

SELECT country.country
, GROUP_CONCAT(
city.city
ORDER BY city
SEPARATOR ', '
) cities
FROM country country
JOIN city city
ON country.country_id = city.country_id
GROUP BY country.country ASC
LIMIT 3

This query has two extra elements compared to the previous one: the GROUP BY clause and the GROUP_CONCAT() function.

GROUP BY accepts a comma-separated list of expressions. GROUP BY modifies the returned resultset: all rows from the original resultset that have a distinct combination of values in the list of GROUP BY expressions, are bundled (grouped) and returned as one single row in the returned resultset.

Columns from the original resultset that do not appear in the GROUP BY list ('detail'-columns, as opposed to 'grouping'- or 'summary'-columns) should not be used directly in the SELECT list. That's because for one group in the final resultset, mutliple distinct values may occur in those columns. Because the expressions in the GROUP BY list are returned a single row, the resultset simply doesn't have room to return all of them.

(Most rdbms-es consider it an error to have a column that does not appear in the GROUP BY list in the SELECT list. By default MySQL will accept such a statement. MySQL even returns the correct value too if the column has a unique value within the group. Otherwise, the returned value is undefined. This particular behaviour can be overidden using the ONLY_FULL_GROUP_BY server SQL mode. In that particular mode, MySQL will treat it as an error to have such columns in the SELECT-list too. See the manual for more info on this subject.)

Although the detail values for each group cannot be accessed in the SELECT list directly, they can be processed by an aggregate function, like MIN, MAX, COUNT, SUM. All these functions operate on a group of objects only to return a single scalar value.

GROUP_CONCAT() is such an aggregate function. The function processes the expression passed to it by concatenating all occurrences within the group together. Apart from the expression value, an ORDER BY clause may be specified to control the order in which the detail expression values are concatenated to each other. Also, a SEPARATOR clause may be passed to specify the string that should be used to separate the multiple entries in the return value.

Adding a level


Ok, we've seen that in MySQL, generating the repeating group in the query result is really very easy.
Now, suppose the requirements for the query change again. This time, we would also like to see the list of customers living in each city. First, let's take a look at the relevant structures in the database schema:

If we would not require the repeating groups in the output, the solution is a quite straightforward extension of the first country/city query:

SELECT country.country
, city.city
, customer.last_name
, customer.first_name
FROM country
JOIN city
ON country.country_id = city.country_id
JOIN address
ON city.city_id = address.city_id
JOIN customer
ON address.address_id = customer.address_id
WHERE country.country = 'United Kingdom'
ORDER BY country.country
, city.city
, customer.last_name
, customer.first_name
;

(This time, I used a WHERE condition to limit the number of rows. It has no bearing otherwise.)
This is the result:

+----------------+-----------------+-----------+------------+
| country | city | last_name | first_name |
+----------------+-----------------+-----------+------------+
| United Kingdom | Bradford | POWELL | ANNE |
| United Kingdom | Dundee | BURNS | APRIL |
| United Kingdom | London | HOFFMAN | MATTIE |
| United Kingdom | London | VINES | CECIL |
| United Kingdom | Southampton | THORN | MARSHALL |
| United Kingdom | Southend-on-Sea | MARTIN | SANDRA |
| United Kingdom | Southport | GRUBER | ARMANDO |
| United Kingdom | Stockport | PAINE | DAN |
| United Kingdom | York | SLEDGE | GILBERT |
+----------------+-----------------+-----------+------------+
9 rows in set (0.00 sec)

However, we run into problems when we try to think of an appropriate GROUP BY clause and GROUP_CONCAT() expression to generate the desired repeating groups.

If we GROUP BY the country.country column like we did before, we would get one summary row for the country, exaclty as required. However, what remains is a set of city/customer combinations. Although we could process those with GROUP_CONCAT, that will never give us the desired result. The repeating group generated by GROUP_CONCAT will have an entry for each city/customer combination corresponding to the United Kingdom, and that is not what we want: what we want is a repeating group of cities per country and per city, a repeating group of customers.

So, we need two different repeating groups occurring for two different groupings or levels. This means that somehow, our query needs to have two corresponding GROUP_CONCAT()'s as well as two GROUP BY's. To achieve that, we need to have a '..query inside a query..' that we can equip with it's own GROUP BY clause so it can generate it's own repeating group.

A 'query inside a query' is called a subquery. Inside a SELECT expression, a subquery can appear in three different places: in the WHERE-clause, in the FROM-clause, or in SELECT-list. Let's stick as close as we can to our previous attempt, JOIN-ing all relevant tables, and write a subquery in the FROM clause.

Well, it's not really hard to see what parts we must place in the subquery. We had the repeating group for the city table in place, all we did since then is add the address and customer tables. These two tables really are all that is required to generate the repeating group of customers per city:


SELECT a.city_id
, GROUP_CONCAT(
CONCAT(
c.first_name
, ' '
, c.last_name
)
ORDER BY c.last_name
SEPARATOR ', '
) customers_per_city
FROM customer c
JOIN address a
ON c.address_id = a.address_id
WHERE a.city_id = 312
GROUP BY a.city_id

(312 is the city_id of London in the United Kingdom)
You can see that this query indeed generates the repeating group of customers for a particular value of city_id:

+---------+-----------------------------+
| city_id | customers_per_city |
+---------+-----------------------------+
| 312 | MATTIE HOFFMAN, CECIL VINES |
+---------+-----------------------------+

In fact, the repeating group corresponds to the 3rd and 4th entry in the result of the previous query: the customers living in London.

Now, we combine this with the previous query, substituting the directly joined address and customer tables and their JOIN condition with this query:


SELECT concat(
country.country
, ': '
, GROUP_CONCAT(
concat(
'\n\t', city.city, ':'
, '\n\t\t', customer_per_city.customer_per_city
)
separator ''
)
) customers_per_city
FROM country
JOIN city
ON country.country_id = city.country_id
JOIN (
SELECT a.city_id
, GROUP_CONCAT(
CONCAT(
c.first_name
, ' '
, c.last_name
)
ORDER BY c.last_name
SEPARATOR ', '
) customer_per_city
FROM customer c
JOIN address a
ON c.address_id = a.address_id
GROUP BY a.city_id
) customer_per_city
ON city.city_id = customer_per_city.city_id
WHERE country.country = 'United Kingdom'
GROUP BY country.country ASC

(In this query, some adjustments are made to the separators. Specificly, some newlines and indenting were added to make the result more readable. It's still a repeating group but this time, a new-line is output between cities instead of the semi colon used in previous examples.)
This is the result:

United Kingdom:
Bradford:
ANNE POWELL
Dundee:
APRIL BURNS
London:
MATTIE HOFFMAN, CECIL VINES
Southampton:
MARSHALL THORN
Southend-on-Sea:
SANDRA MARTIN
Southport:
ARMANDO GRUBER
Stockport:
DAN PAINE
York:
GILBERT SLEDGE


Summary


The MySQL aggregate function GROUP_CONCAT() generates a repeating group for the string expression passed as argument.

The items that appear in a repeating group can themselves be repeating groups. Nesting of repeated groups is achieved by writing distinct queries for each distinct repeating group. These distinct queries are then joined (as subqueries) to relate the repeating groups to each other.

Sunday, April 09, 2006

Intelligent SQL JOIN syntax?

The SQL standard specifies two different intelligent variants of the JOIN syntax: the natural join and the named columns join:


<natural join> ::=
<table reference> NATURAL [ <join type> ] JOIN <table factor>

<named columns join> ::=
USING <left paren> <join column list> <right paren>


(from ISO/IEC 9075-2:2003; 7.7 <joined table>)

I call these two forms of the join syntax intelligent because they imply a non-trivial join condition. This means that such a join operation will relate the records from the two tables expressions based on some criterion without requiring that the criterion is specified in each and every detail. This is unlike the other variants of the JOIN syntax: to be meaningful, these require that the join condition in the form of a boolean expression is associated with the JOIN operation using a ON clause.

Relationships, Foreign Keys and Joins


Consider this structure taken from the MySQL sakila sample database:



Here, we have table country that has a 1 to many relationship with city. The relationship conveys the fact that a city is situated inside a particular country (I won't be discussing the validity of that here). The concept of a relationship in itself has nothing todo with databases or computers or whatever. It merely has to do with they way we, as humans, perceive the world around us. 'Relationship' is a conceptual term.

In a relational database, relationships are implemented using foreign keys. Here in the example, the city table (the 'many' end of the relationship) has a country_id column. For a particular record from the city table this column stores one of the values found in the country_id column of the country table (the 'one' end of the relationship).

The country_id column in the country table stores only unique values: this is enforced using a primary key constraint. A foreign key constraint on the city table ensures that the country_id column in the city table may only contain values that are in fact present in the country_id column of the country table.

An arbitrary record from the city table refers to exactly one particular record in the country table by storing a value from the country table's key, coutry_id. That's why it is called a foreign key: the city stores values that act as key values (identifying values) in another table, in this case the country table.

So, a foreign key is an implementation of a relationship in a relational database: foreign key is an implementation term. (In fact, it is they most common way to implement a relationship.) Usually, foreign keys are actively enforced by declaring a foreign key constraint. A foreign key constraint declares the exact column mapping involved in the referenced, and it makes the database management system prevent any data from being entered, removed or modified that might violate any of the references.

The SQL code to create this structure is shown here:

CREATE TABLE country (
country_id
SMALLINT UNSIGNED
NOT NULL
AUTO_INCREMENT
, country
VARCHAR(50)
NOT NULL
, last_update
TIMESTAMP
NOT NULL
DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP
, PRIMARY KEY (country_id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE city (
city_id
SMALLINT UNSIGNED
NOT NULL
AUTO_INCREMENT
, city
VARCHAR(50)
NOT NULL
, country_id
SMALLINT UNSIGNED
NOT NULL
, last_update
TIMESTAMP
NOT NULL
DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP
, PRIMARY KEY (city_id)
, KEY idx_fk_country_id (country_id)
, CONSTRAINT `city_resides_in_country`
FOREIGN KEY (country_id)
REFERENCES country (country_id)
ON DELETE RESTRICT
ON UPDATE CASCADE
)ENGINE=InnoDB DEFAULT CHARSET=utf8;


Joins


Now, suppose we want to have a list of all the city names together with the name of the country in which it is situated. Such a list can be conveniently produced using a JOIN operation. In a Join operation, records from different tables are combined to create new (virtual) records. So, joining is a (runtime) processing term. Usuallly, the join operation combines records that have some sort of correspondence; that is, records that maintain some kind of relationship between one another.

To understand the result of the JOIN operation we can imagine that the database server performs the following actions (Caution: the following 'algorithm' does not illustrate what a database server actually does to obtain the result of the JOIN operation - it only illustrates what the result will look like):

  1. For each record from the city table, all records from the country table are retrieved. A new record is constructed consisting of all the fields from the 'city' record and all the fields from the 'country' record, thus yielding all combinations of city and country records: the cartesian product. These newly constructed records are temporarily stored in a virtual table.

  2. A filter is applied to the virtual table, only retaining those records that have equal values for both country_id fields

  3. For each record in the filtered table, create a new record consisting only of the city and country fields (projection), and return those records as the final resultset


It is in fact possible to write a SELECT expression that corresponds quite literally to this conceptual algorithm:

--
-- step 3: column projection
--
SELECT city.city
, country.country
--
-- step 1: construct a cartesian product
--
FROM city
CROSS JOIN country
--
-- step 2: retain only the corresponding records
--
WHERE city.country_id = country.country_id


Instead of the CROSS JOIN operator, one can also write a comma (,). However, because the comma denotes a separator when used in the SELECT list, it's better to disambiguate using the keywords.

Now, the WHERE clause contains the condition that forms a criterion for filtering the records. Our condition is made up of only one element: the column comparison that specifies that only corresponding records should be retained. However, the WHERE is a general container for such conditions: we may write any kind of condition there. And, we should if we need to apply extra criteria. Suppose we want to have the list only for those countries of which the name starts with a capital A. Then, our WHERE clause could be extended to look like this:

WHERE city.country_id = country.country_id -- join condition
AND country.country like 'A%' -- just an extra criterion

So, both the join condition - the condition that is supposed to retain the related records - and our extra criterion appear mixed all in one big WHERE condition. This style of joining is called theta join.

Another way of putting it is to say that this is an unqualified join: the join condition does not seem to be associated with the join operation itself: it just happens to be a criterion that requires columns from both the tables in join operation to share equal values.

Qualified Joins


It's generally considered to be better to avoid theta join style in favour of ansi join style. The ansi join style requires the join condition to be directly associated with the join operation: we say that the join is qualified with the condition.

The join condition is separated from the join operation by a single ON keyword only:


--
-- step 3: column projection
--
SELECT city.city
, country.country
--
-- step 1: combine only corresponding records
--
FROM city
JOIN country
ON city.country_id = country.country_id
--
-- step 2: apply any criteria
--
WHERE country.country like 'A%'


We can modify the 'conceptual algorithm' presented earlier to match this more closely. (Please bear in mind that this 'algorithm' still has nothing to do with how a server might actually implement it. Also, note that this has no bearing at all on our understanding of the final resultset: the final resultset here is indistinguishable from the one we had earlier):


  1. For each record from the city table, loop through all records from the country table. A new record is constructed consisting of all the fields from the 'city' record and all the fields from the 'country' record, but oly if the values of the country_id column of both records are exactly equal. These newly constructed records are temporarily stored in a virtual table.

  2. A filter is applied to the virtual table, only retaining those records that match the WHERE clause

  3. For each record in the filtered table, create a new record consisting only of the city and country fields (projection), and return those records as the final resultset



From the developer's point of view, the advantage of the ansi join syntax is that all the elements that have to do with relating the two tables can be written on consecutive lines. This makes it much easier to maintain the code if the database structure changes, or to remove the join.

Intelligent Join constructs


There are two particular forms of qualified join syntax that do not require the precise join condition to be specified. Rather, this syntax relies on some kind of rule that implies the precise join condition. The actual join condition itself is derived from the rule, and never entered literally.

Natural Join


The natural join implies a join condition that requires that equally named pairs of columns belonging to the table expressions appearing on the left and right side of the JOIN operator have equal values. In our example, two such pairs are present: both tables have a column named country_id and a column named last_update. Therefore, a natural join between the country and the city table implies a join condition that requires that the columns from the country_id and the last_update pairs have equal values. So, this:

SELECT city.city
, country.country
FROM city
NATURAL JOIN country

is equivalent to

SELECT city.city
, country.country
FROM city
JOIN country
ON city.country_id = country.country_id
AND city.last_update = country.last_update

This example immediately illustrates an important limitation of the NATURAL JOIN construct. We now that the city table referes to the country table by means of the country_id foreign key. So, the implied join condition is 'half-right' in requiring equality for the values in the country_id columns. However, because both tables have a last_update column, equality is required for these columns too. This is of course total non-sense: the last_update column is a TIMESTAMP column, and it will contain the current date and time of the moment the record was last updated (or inserted). Any equality found for these columns will be a matter of coincidence, so altough the columns can be related to each other, this does not constitute a relationship.

One could of course argue that this natural join does not check out because the table definition is flawed. If the last_update columns would've been given a name that is more specific to the table to which they belong, the problem would not've occurred, and we could've used the natural join. However, I feel that is not a very good argument

If we were to go along with this, we will be forced to rename all columns systematically in order to avoid accidentally occurring identical column names. Although we could limit ourselves to rename only the columns that are uninentionally included in the join conditions, this would introduce an amount of inconsistency in the identifier set, which is undesirable because it makes it harder to code against the schema. Also, the columns that we don't rename now could become a problem later on when we add new tables and columns to the schema, hampering further development. So we really need some sort of systematic approach towards picking identifiers. For example, all columns that 'truly' belong to a table could be given a prefix that is derived from the table name. Although this seems doable, it is rather impractical. It would yield long column names - for some rdbms-es, a column name can easily become too long (Oracle has a maximum of 30 characters, MaxDB has a maximum of 32).

Named columns join


The other 'intelligent' join construct is the named columns join. Like the natural join, the named columns join implies a join condition based on requiring equal values for pairs of identically named columns. The difference with the natural join is that the column pairs that should be implied in the join condition must be specified explicitly with a USING clause.

Because the column pairs must be explicitly specified, we can avoid unrelated column pairs that happen to have identical names to be implied in the join condition. This solves the problem with the last_update column we just discussed for the natural join between the city and the country table:


SELECT city.city
, country.country
FROM city
JOIN country
USING(country_id)


This is equivalent to:


SELECT city.city
, country.country
FROM city
JOIN country
ON city.country_id = country.country_id


Athough the named columns join is thus slightly more useful than the natural join, it still poses a problem. No matter how we name our columns, we can never find a satisfactory solution for those cases where a particular table has more than one foreign key referring to one key in another table. In that case, each foreign key will need it's own set of columns, and only one of these can choose column names identical to the names of the referenced columns. The natural join as well as the named columns join can be used for joining along one relationship only. Joining along the other relationships always involves writing an explicit join condition.

All in all, if think that bioth the natural join and the named columns join constructs are not very useful. I never use them, because they would introduce inconsistency in the code. I feel that similar tasks should be solved in similar ways to avoid confusion, and because you will potentially need to write explicit join conditions anyway, I prefer to do so right away, even if a particular case would allow for a natural or named columns join.

What about a foreign key join?


Nevertheless, I think that the concept of an implied join condition is really elegant and charming if it could be used generically. Both the natural join and the named columns join cannot be used generically because they infer the join condition based on equality of the column names. Clearly, equality of column names just isn't the right sort of thing to use to infer join conditions.

In a lot of cases, we want to join along the columns of a foreign key, regardless of the actual column names. If we enforce the foreign key with a constraint, the rdbms could use that information to infer a join condition. So I think it's really amazing to discover that neither the SQL Standard nor any of the more renowned rdbms-es support a syntax that actually takes advantage of this.

I do know a development environment, USoft, that actually supports such a construct. I don't know if they still support it, but when I used it, you could write the example join something like this:


SELECT city.city
, country.country
FROM city RELATE country WITH city_resides_in_country


So, the foreign key name city_resides_in_country is used to specify the join condition. A variant that resembles the usual named columns join syntax could look like this:


SELECT city.city
, country.country
FROM city JOIN country USING city_resides_in_country


The nice thing about this syntax is that it solves all the problems described with the natural join and the ordinary named columns join. As a bonus, our code has become resilient to structural changes of the keys: If we were to add a column to the foreign key in the city table and corresponding primary key in the country table, our joins would not need to be modified because they never refer to a column name.

Anyone to point me in the direction of a rdbms that supports this kind of syntax?

Thursday, April 06, 2006

Being more productive with DBDesigner 4

I'm using DBDesigner 4 quite a lot. I think the successor, MySQL Workbench, is very promising, but it does not meet my requirements (yet).

Having DBDesigner Generate SQL


Those that've worked with DBDesigner have probably noticed the particular strategy it uses to generate SQL code from the model. First of all, the user needs to specify what kind of SQL script is to be generated: a DROP, CREATE or optimization script. Then, DBDesigner does it's thing. In case of a DROP or CREATE script, exactly one statement is generated for each table in the diagram (or in the current selection if you choose to).

In case of the CREATE script, each CREATE TABLE statement contains not only the column definitions, but also the PRIMARY KEY, UNIQUE and FOREIGN KEY constraint definitions corresponding to that table.

DBDesigner tries to put the statements into a particular sequence in order to end up with a script that can be run immediately without generating errors due unresolved dependencies. For example, if a particular table T1 has a foreign key referencing another table T2, DBDesigner will place the CREATE TABLE statement for T1 after the CREATE TABLE statement for T2. For the DROP script, the sequence is reversed for the same reasons.

DB Designer Problems


Usually this all works out fine. However, occasionally this approach leads to problems. Consider this example:

Here, we have to tables that are referencing each other. It's not that common, but I think the example I made up here is not that artificial. My real world case is a little bit more complicated, but the relationships are equivalent.

Well, it does not take long to figure out that a structure like this can never be realized using only CREATE TABLE statements. At least, two CREATE TABLE statements and one ALTER TABLE ADD CONSTRAINT statement are needed in order to create the tables before creating the dependency (either one of the foreign key relationships).

DBDesigner seems to take a rather principal point of view, because it decides not to generate any SQL at all. This is BTW rather nasty, because if you were so unlucky to specify the hitherto working script file as target for the SQL export action, you will discover that your script now amounts to 0 bytes...ouch!

There is another reason why DB Designer is not so good anymore for generating code. It has nothing to do with DB Designer. Rather, it has to do with mysql 5.1.7.

MySQL 5.1.7 does not accept the previously deprecated TYPE= option in CREATE TABLE statements. (You must use ENGINE= instead nowadays.)

Now, don't get me wrong: I'm not complaining - I think DBDesigner is a mighty cool tool, and I like it very much. It's just that these little quirks do take some time to adjust to, and occasionally they cost time too. And of course, DB Designer never could've foreseen that the TYPE= option would be disallowed in the future.

How to cope


All this just means that we need to think of a trick to keep on using the DB Designer GUI while generating the SQL in another way. Well, this is actually not too hard. DB Designer saves it's model in an XML format. XML is easily parsed and translated using XSLT.

So, that's what I did. I'm will discus it here briefly. If someone needs it, feel free to use it. Of couse, you will need a utility to apply the xslt stylesheet to the Db Designer xml output, but there are literally hundreds of tools that do that for you - most of them are freely obtainable. (I use a little msxml based javasript utility I - for no particular reason - hacked up myself)

The XSLT stylesheet


My stylesheet is an ordinary xslt stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:output
method="text"
/>

<!--

Some variables/parameters go here

-->
<xsl:variable name="tables" select="/DBMODEL/METADATA/TABLES/TABLE"/>
<xsl:variable name="foreign-keys" select="/DBMODEL/METADATA/RELATIONS/RELATION"/>

<xsl:param name="PARAM_SCRIPT_TYPE" select="'CREATE'"/>

<!--

Some templates go here

-->
..
</xsl:stylesheet>

The entry point for the transformation is formed by a single matching template. The template contents are controlled by a global parameter value (which is set from the environment when invoking the XSLT processor):

<xsl:param name="PARAM_SCRIPT_TYPE" select="'CREATE'"/>

<xsl:template match="/">
<xsl:choose>
<xsl:when test="$PARAM_SCRIPT_TYPE='DROP'">
<xsl:call-template name="drop-foreign-keys"/>
<xsl:call-template name="drop-tables"/>
</xsl:when>
<xsl:when test="$PARAM_SCRIPT_TYPE='CREATE'">
<xsl:call-template name="create-tables"/>
<xsl:call-template name="create-foreign-keys"/>
</xsl:when>
</xsl:choose>
</xsl:template>

What happens is that the xslt processor will start off with the node that represents the entire document - I mean the document that contains our DB Designer model. Looking in the xslt stylesheet, the xslt processor will discover that this template matches the document node: the match attribute on the xsl:template element has the XPath expression / as value, which is a shorthand nodetest for the entire document.
So, the contents of our template are processed. The xsl:choose element selects one of the options according to the value of the PARAM_SCRIPT_TYPE parameter.
If the value equals DROP, the named templates drop-foreign-keys and drop-tables are executed. If it equals CREATE, then create-tables and create-foreign-keys are executed.
Well, you can imagine what these do, can't you? That's right, these generate the SQL statments to drop foreign keys, drop tables, create tables and create foreign keys. Of course, this sequence of executing the DDL statements can never offend any dependencies: dropping a table can never fail because there are no more foreign keys at that point that might prevent it, and creating a foreign key kan never fail because all the tables have already been created.

Well, the rest is just work: putting together little XPath queries and have the data that they retrieve drive static template text. For convenience, I declared two global variables. one nodeset containing all tables in the model and one containing all foreign keys:

<xsl:variable name="tables" select="/DBMODEL/METADATA/TABLES/TABLE"/>
<xsl:variable name="foreign-keys" select="/DBMODEL/METADATA/RELATIONS/RELATION"/>

These are conveniently used by the several templates. For example, look at the implementation of the drop-foreign-keys named template:

<xsl:template name="drop-foreign-keys">
<xsl:for-each select="$foreign-keys">
<xsl:call-template name="drop-foreign-key">
<xsl:with-param name="foreign-key" select="."/>
</xsl:call-template>
</xsl:for-each>
</xsl:template>

<xsl:template name="drop-foreign-key">
<xsl:param name="foreign-key"/>
<xsl:variable name="DstTableID" select="$foreign-key/@DestTable"/>
<xsl:variable name="DstTable" select="$tables[@ID=$DstTableID]"/>
ALTER TABLE <xsl:value-of select="$DstTable/@Tablename"/>
DROP FOREIGN KEY <xsl:value-of select="$foreign-key/@RelationName"/>
;
</xsl:template>

The drop-foreign-keys template itself just loops through each of the nodes from the nodeset stored in the global foreign-keys variable, passing the current node through to the drop-foreign-key template.
This does the actual work of generating the ALTER TABLE statement.

Getting to know where all the info is stored in the DBDesigner model is just a matter of common sense, and some trial and error. Anyway, it does not take much effort. The entire stylesheet cost me about 2 hours to write, and that includes getting to know the source format (of course ignoring all the stuff in there I don't need at this point).

If anyone's interested, download the stylesheet at: http://www.xcdsql.org/MySQL/DBDesigner4/dbdesigner.xslt.
The utility I use to transform xml with xslt is available for dowload here (MS Windows only Im afraid): http://www.xcdsql.org/Misc/xslt/xslt.zip.

License info is not included (i think) but you can use all that stuff in any way you like. Always nice to give me credit should you like it, or should you want to pass it on to somebody else.