Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Sunday, April 09, 2006

Intelligent SQL JOIN syntax?

The SQL standard specifies two different intelligent variants of the JOIN syntax: the natural join and the named columns join:


<natural join> ::=
<table reference> NATURAL [ <join type> ] JOIN <table factor>

<named columns join> ::=
USING <left paren> <join column list> <right paren>


(from ISO/IEC 9075-2:2003; 7.7 <joined table>)

I call these two forms of the join syntax intelligent because they imply a non-trivial join condition. This means that such a join operation will relate the records from the two tables expressions based on some criterion without requiring that the criterion is specified in each and every detail. This is unlike the other variants of the JOIN syntax: to be meaningful, these require that the join condition in the form of a boolean expression is associated with the JOIN operation using a ON clause.

Relationships, Foreign Keys and Joins


Consider this structure taken from the MySQL sakila sample database:



Here, we have table country that has a 1 to many relationship with city. The relationship conveys the fact that a city is situated inside a particular country (I won't be discussing the validity of that here). The concept of a relationship in itself has nothing todo with databases or computers or whatever. It merely has to do with they way we, as humans, perceive the world around us. 'Relationship' is a conceptual term.

In a relational database, relationships are implemented using foreign keys. Here in the example, the city table (the 'many' end of the relationship) has a country_id column. For a particular record from the city table this column stores one of the values found in the country_id column of the country table (the 'one' end of the relationship).

The country_id column in the country table stores only unique values: this is enforced using a primary key constraint. A foreign key constraint on the city table ensures that the country_id column in the city table may only contain values that are in fact present in the country_id column of the country table.

An arbitrary record from the city table refers to exactly one particular record in the country table by storing a value from the country table's key, coutry_id. That's why it is called a foreign key: the city stores values that act as key values (identifying values) in another table, in this case the country table.

So, a foreign key is an implementation of a relationship in a relational database: foreign key is an implementation term. (In fact, it is they most common way to implement a relationship.) Usually, foreign keys are actively enforced by declaring a foreign key constraint. A foreign key constraint declares the exact column mapping involved in the referenced, and it makes the database management system prevent any data from being entered, removed or modified that might violate any of the references.

The SQL code to create this structure is shown here:

CREATE TABLE country (
country_id
SMALLINT UNSIGNED
NOT NULL
AUTO_INCREMENT
, country
VARCHAR(50)
NOT NULL
, last_update
TIMESTAMP
NOT NULL
DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP
, PRIMARY KEY (country_id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE city (
city_id
SMALLINT UNSIGNED
NOT NULL
AUTO_INCREMENT
, city
VARCHAR(50)
NOT NULL
, country_id
SMALLINT UNSIGNED
NOT NULL
, last_update
TIMESTAMP
NOT NULL
DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP
, PRIMARY KEY (city_id)
, KEY idx_fk_country_id (country_id)
, CONSTRAINT `city_resides_in_country`
FOREIGN KEY (country_id)
REFERENCES country (country_id)
ON DELETE RESTRICT
ON UPDATE CASCADE
)ENGINE=InnoDB DEFAULT CHARSET=utf8;


Joins


Now, suppose we want to have a list of all the city names together with the name of the country in which it is situated. Such a list can be conveniently produced using a JOIN operation. In a Join operation, records from different tables are combined to create new (virtual) records. So, joining is a (runtime) processing term. Usuallly, the join operation combines records that have some sort of correspondence; that is, records that maintain some kind of relationship between one another.

To understand the result of the JOIN operation we can imagine that the database server performs the following actions (Caution: the following 'algorithm' does not illustrate what a database server actually does to obtain the result of the JOIN operation - it only illustrates what the result will look like):

  1. For each record from the city table, all records from the country table are retrieved. A new record is constructed consisting of all the fields from the 'city' record and all the fields from the 'country' record, thus yielding all combinations of city and country records: the cartesian product. These newly constructed records are temporarily stored in a virtual table.

  2. A filter is applied to the virtual table, only retaining those records that have equal values for both country_id fields

  3. For each record in the filtered table, create a new record consisting only of the city and country fields (projection), and return those records as the final resultset


It is in fact possible to write a SELECT expression that corresponds quite literally to this conceptual algorithm:

--
-- step 3: column projection
--
SELECT city.city
, country.country
--
-- step 1: construct a cartesian product
--
FROM city
CROSS JOIN country
--
-- step 2: retain only the corresponding records
--
WHERE city.country_id = country.country_id


Instead of the CROSS JOIN operator, one can also write a comma (,). However, because the comma denotes a separator when used in the SELECT list, it's better to disambiguate using the keywords.

Now, the WHERE clause contains the condition that forms a criterion for filtering the records. Our condition is made up of only one element: the column comparison that specifies that only corresponding records should be retained. However, the WHERE is a general container for such conditions: we may write any kind of condition there. And, we should if we need to apply extra criteria. Suppose we want to have the list only for those countries of which the name starts with a capital A. Then, our WHERE clause could be extended to look like this:

WHERE city.country_id = country.country_id -- join condition
AND country.country like 'A%' -- just an extra criterion

So, both the join condition - the condition that is supposed to retain the related records - and our extra criterion appear mixed all in one big WHERE condition. This style of joining is called theta join.

Another way of putting it is to say that this is an unqualified join: the join condition does not seem to be associated with the join operation itself: it just happens to be a criterion that requires columns from both the tables in join operation to share equal values.

Qualified Joins


It's generally considered to be better to avoid theta join style in favour of ansi join style. The ansi join style requires the join condition to be directly associated with the join operation: we say that the join is qualified with the condition.

The join condition is separated from the join operation by a single ON keyword only:


--
-- step 3: column projection
--
SELECT city.city
, country.country
--
-- step 1: combine only corresponding records
--
FROM city
JOIN country
ON city.country_id = country.country_id
--
-- step 2: apply any criteria
--
WHERE country.country like 'A%'


We can modify the 'conceptual algorithm' presented earlier to match this more closely. (Please bear in mind that this 'algorithm' still has nothing to do with how a server might actually implement it. Also, note that this has no bearing at all on our understanding of the final resultset: the final resultset here is indistinguishable from the one we had earlier):


  1. For each record from the city table, loop through all records from the country table. A new record is constructed consisting of all the fields from the 'city' record and all the fields from the 'country' record, but oly if the values of the country_id column of both records are exactly equal. These newly constructed records are temporarily stored in a virtual table.

  2. A filter is applied to the virtual table, only retaining those records that match the WHERE clause

  3. For each record in the filtered table, create a new record consisting only of the city and country fields (projection), and return those records as the final resultset



From the developer's point of view, the advantage of the ansi join syntax is that all the elements that have to do with relating the two tables can be written on consecutive lines. This makes it much easier to maintain the code if the database structure changes, or to remove the join.

Intelligent Join constructs


There are two particular forms of qualified join syntax that do not require the precise join condition to be specified. Rather, this syntax relies on some kind of rule that implies the precise join condition. The actual join condition itself is derived from the rule, and never entered literally.

Natural Join


The natural join implies a join condition that requires that equally named pairs of columns belonging to the table expressions appearing on the left and right side of the JOIN operator have equal values. In our example, two such pairs are present: both tables have a column named country_id and a column named last_update. Therefore, a natural join between the country and the city table implies a join condition that requires that the columns from the country_id and the last_update pairs have equal values. So, this:

SELECT city.city
, country.country
FROM city
NATURAL JOIN country

is equivalent to

SELECT city.city
, country.country
FROM city
JOIN country
ON city.country_id = country.country_id
AND city.last_update = country.last_update

This example immediately illustrates an important limitation of the NATURAL JOIN construct. We now that the city table referes to the country table by means of the country_id foreign key. So, the implied join condition is 'half-right' in requiring equality for the values in the country_id columns. However, because both tables have a last_update column, equality is required for these columns too. This is of course total non-sense: the last_update column is a TIMESTAMP column, and it will contain the current date and time of the moment the record was last updated (or inserted). Any equality found for these columns will be a matter of coincidence, so altough the columns can be related to each other, this does not constitute a relationship.

One could of course argue that this natural join does not check out because the table definition is flawed. If the last_update columns would've been given a name that is more specific to the table to which they belong, the problem would not've occurred, and we could've used the natural join. However, I feel that is not a very good argument

If we were to go along with this, we will be forced to rename all columns systematically in order to avoid accidentally occurring identical column names. Although we could limit ourselves to rename only the columns that are uninentionally included in the join conditions, this would introduce an amount of inconsistency in the identifier set, which is undesirable because it makes it harder to code against the schema. Also, the columns that we don't rename now could become a problem later on when we add new tables and columns to the schema, hampering further development. So we really need some sort of systematic approach towards picking identifiers. For example, all columns that 'truly' belong to a table could be given a prefix that is derived from the table name. Although this seems doable, it is rather impractical. It would yield long column names - for some rdbms-es, a column name can easily become too long (Oracle has a maximum of 30 characters, MaxDB has a maximum of 32).

Named columns join


The other 'intelligent' join construct is the named columns join. Like the natural join, the named columns join implies a join condition based on requiring equal values for pairs of identically named columns. The difference with the natural join is that the column pairs that should be implied in the join condition must be specified explicitly with a USING clause.

Because the column pairs must be explicitly specified, we can avoid unrelated column pairs that happen to have identical names to be implied in the join condition. This solves the problem with the last_update column we just discussed for the natural join between the city and the country table:


SELECT city.city
, country.country
FROM city
JOIN country
USING(country_id)


This is equivalent to:


SELECT city.city
, country.country
FROM city
JOIN country
ON city.country_id = country.country_id


Athough the named columns join is thus slightly more useful than the natural join, it still poses a problem. No matter how we name our columns, we can never find a satisfactory solution for those cases where a particular table has more than one foreign key referring to one key in another table. In that case, each foreign key will need it's own set of columns, and only one of these can choose column names identical to the names of the referenced columns. The natural join as well as the named columns join can be used for joining along one relationship only. Joining along the other relationships always involves writing an explicit join condition.

All in all, if think that bioth the natural join and the named columns join constructs are not very useful. I never use them, because they would introduce inconsistency in the code. I feel that similar tasks should be solved in similar ways to avoid confusion, and because you will potentially need to write explicit join conditions anyway, I prefer to do so right away, even if a particular case would allow for a natural or named columns join.

What about a foreign key join?


Nevertheless, I think that the concept of an implied join condition is really elegant and charming if it could be used generically. Both the natural join and the named columns join cannot be used generically because they infer the join condition based on equality of the column names. Clearly, equality of column names just isn't the right sort of thing to use to infer join conditions.

In a lot of cases, we want to join along the columns of a foreign key, regardless of the actual column names. If we enforce the foreign key with a constraint, the rdbms could use that information to infer a join condition. So I think it's really amazing to discover that neither the SQL Standard nor any of the more renowned rdbms-es support a syntax that actually takes advantage of this.

I do know a development environment, USoft, that actually supports such a construct. I don't know if they still support it, but when I used it, you could write the example join something like this:


SELECT city.city
, country.country
FROM city RELATE country WITH city_resides_in_country


So, the foreign key name city_resides_in_country is used to specify the join condition. A variant that resembles the usual named columns join syntax could look like this:


SELECT city.city
, country.country
FROM city JOIN country USING city_resides_in_country


The nice thing about this syntax is that it solves all the problems described with the natural join and the ordinary named columns join. As a bonus, our code has become resilient to structural changes of the keys: If we were to add a column to the foreign key in the city table and corresponding primary key in the country table, our joins would not need to be modified because they never refer to a column name.

Anyone to point me in the direction of a rdbms that supports this kind of syntax?

10 comments:

mac said...

this really resonates with me. I am currently trying to find a faster, easier way to get columns from other tables without having to use the inner join. Though I have defined the relationships between the tables in the InnoDB table structure, I cannot find a fast way to reference the fields. For example, I have a table called events with primary key `id`. I also have another table called counties with primary key `id`. The `events` table has a `county_id` field that refers to the primary key of counties.

I have this dream that I will be able to just write SELECT events.name, counties.name FROM events JOIN counties WHERE events.id = 5, because I've already explained to the tables how to relate. Is there a way to do this?

Anonymous said...

hear hear!

Anonymous said...

this is good explicated!!!with examples

PeterK said...

I never understood JOINs, now i do!

Anonymous said...

As always it seems that evolving from one technique to the next, advantages are created but also lost.
The Hierarchical and Network Database Models gave us easy path retrieval of nodes either leave stuctured or owner/member like construct closely governed by the design.
Relational gave us freedom at the same time gave us JOINs to worry about.
Thanks Roland, good article.

Walter (LinkedIn)

Anonymous said...

nothing mentioned about advantages of joins.

Roland Bouman said...

@Anonymous: "nothing mentioned about advantages of joins."

I'm afraid I don't understand your remark. This article is not about joins in general, and does not discuss the advantages or disadvantages of the join concept. Rather this article is about the advantages and mainly disadvantages of particular syntax we can use to denote joins in the SQL language.

I hope this helps. kind regards,

Roland

tejas said...

The Article is very good explaining importance of different joins

It would be more helpful if it provides some general advantages so that beginners can use Joins Properly

Also their limitations would help to avoid the complexity while using them by some programmers...

Thanks for the Article.......

6X said...

I've also been thinking along this path for quite some time but haven't found anything on it until this. Have You seen any development since 2006 on this topic? Would it be possible to add as an extention to PostgreSQL or MariaDB for example?
//Best wishes

Roland Bouman said...

Hi 6X,

to the best of my knowledge, nothing has changed - there are no databases that implement a feauture like this.

I don't know about PG but I think it's highly unlikely that you can add this functionality to MySQL or MariaDB with an "extension" or plugin.

You can of course fork the code and add this feature to your own branch. If you can get that to work without changing any other functionality (ie preserve backward compatibility) you might be able to interest MariaDB to accept your patch.

That said, I suspect that changing this in MySQL requires serious hacking on both the parser and the query planner. These are very critical, complex parts of the code and I can imagine both MySQL and MariaDB to be very weary of accepting patches for these areas unless they offer substantial benefit for many people (i.e., solve a bug or offer a massive performance enhancement).

In other words I think the chances are slim they would change the parser / planner just to add an -apperently not so popular- feature.