Friday, August 26, 2005

Yet another way to model hierarchies

I'm seriously considering to use a nested set model, but I'm still not too happy about all the overhead involved.

I reread the article by Mike Hillyer, especially the bit that handles inserting new nodes. I like the function I wrote because it uses a single insert statement to handle both the rearrangement of the tree as well as the insertion of the new node. This also relieves me from the task of explicitly locking tables.
On the other hand, the wee lines Mike uses to perform an insert seem so elegant, that I can't help thinking I'm doing something horribly wrong. But then again, in Mike's article, the user that's inserting the nodes somehow magically knows where to make room in the tree. The whole point of using a hybrid model like I want to, is to allow for easy inserts (and updates for that matter, although I'd still have to implement it) as well as straighforward, efficient extraction. On the other hand, what I really want is to encapsulate all the tree maintenance within triggers, but I know already that this approach will not be easy to realize in MySQL.
Think of it, I would want to build triggers that handle insert,update and delete, but in order to insert or to update or to delete, I would have to rearrange the tree and run some updates as a side effect. This would of course fire my update trigger, which would have to be so smart as to prevent reentry. The fact that mysql only supports row triggers makes it all very cumbersome.
In Oracle or MSSQL I'd probably solve this by writing a view with INSTEAD OF triggers. These triggers would perform the necesary actions in the base table, and everything would be hunky dory.
The table would not be granted to anyone, but the view would. The view wouldn't even expose the left and right attributes upon which the nested set model relies. Anyway, I can't do this in MySQL, not now anyway, so I'm still willing to use other soltions.

Now, I have not yet investigated it thoroughly, but I think that, besides the hybrid solution, there's another way to handle hierarchies that is about as intuitive as the adjacency list model, and yet nearly as manageable as the nested set model when it comes to extracting data from it. I don't know if there's an official name for it, but I will call it the 'ordered level' model until someone prompts me for something better than that.


create table ordered_level(
id int unsigned not null auto_increment
, scope int unsigned not null
, position int unsigned not null
, level int unsigned not null
, label varchar(32) not null
, constraint pk_ordered_level
primary key (id)
, constraint uk_ordered_level
unique (
scope
, position
)
);

insert into ordered_level(scope,position,level,label)values
(1,1,1'Clothing')
,(1,2,2'Footwear')
,(1,5,2'Suits')
,(1,3,3'Shoes')
,(1,4,3'Socks')
,(2,1,1'Food')
,(2,2,2'Beverages')
,(2,5,2'Pastry')
,(2,3,3'Alcoholic')
,(2,4,3'Non-Alcoholic')


Like the nested set model, the ordered level model may be seen as a serialization of the depth first traversal of the tree underlying the hierarchy. The difference with the nested set as described by Joe Celko and Mike Hillyer is that instead of maintaining a left_node and a right_node property, we now rember only the position and level properties.

The position is quite like left_node in the nested set in that it stores a depth first traversal path through the tree. The level is needed to add depth to the tree. Together, level and position are sufficient to maintain the hierarchy.

The scope property is there to allow for multiple trees. So, the position reflects a depth first traversal within the tree identified by scope. The advantadge to me previous attemt is that it's now a lot easier to keep trees separate. This is reflected in updates: instead of having a single insert potetially update the entire forest, the worst what can happen here is only an entire tree needs to be updated.

Querying is not very hard compared to the original nested set, and some queries are actually easier:


/**
* printing an entire tree
*
*/

select scope
, id
, position
, level
, concat(
repeat(' ',4*(level-1))
, label
) label
from ordered_level ol
order by scope
, position


/*
* printing a tree starting with an arbitrary node
*
*/

select r.scope scope
, d.position position
, r.id rid
, r.level rlevel
, r.label rlabel
, d.id did
, d.level dlevel
, concat(
repeat(' '
, (d.level-1)*4)
, d.label
) label
from ordered_level r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where d.position < (
select coalesce(min(position),d.position+1)
from ordered_level n
where n.scope = r.scope
and n.level = r.level
and n.position > r.position
)
order by r.scope
, r.level
, r.position
, d.position


For the latter ,there's a variant that does not need a correlated subquery:


select r.scope scope
, d.position position
, r.id rid
, r.level rlevel
, r.label rlabel
, d.id did
, d.level dlevel
, concat(
repeat(' '
, (d.level-1)*4)
, d.label
) dlabel
from (select r.id id
, r.scope scope
, r.position position
, r.level level
, r.label label
, min(n.position) next_position
from ordered_level r
left join ordered_level n
on r.scope = n.scope
and r.level = n.level
and r.position < n.position
group by r.id
, r.scope
, r.position
, r.level
, r.label
) as r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where (r.next_position > d.position
or r.next_position is null)
order by scope
, r.level
, r.position
, position


Getting all the leafs is harder than in the nested set though:


/**
* Getting all the leafs
*
*/

select *
from ordered_level ol
where not exists (
select null
from ordered_level p
where p.scope = ol.scope
and p.level > ol.level
)


Getting all the roots is a similar query:


/**
* Getting all the roots
*
*/

select *
from ordered_level ol
where not exists (
select null
from ordered_level p
where p.scope = ol.scope
and p.level < ol.level
)



When we don't want to make the strucure redundant by combining it with the adjacency list model, we must resort to something like this to retrieve a parent or the children:


/**
*
* Getting the parent
*/

select c.label
, p.label
from ordered_level c
left join ordered_level p
on c.scope = p.scope
and c.position > p.position
and c.level > p.level
where p.position = (
select max(pp.position)
from ordered_level pp
where pp.scope = c.scope
and pp.position < c.position
and pp.level < c.level
)
or p.position is null


/**
*
* Getting the children
*
*/
select r.id
, r.level
, r.label
, d.id
, d.level
, d.position
, d.label
from ordered_level r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where d.position < coalesce((
select min(n.position)
from ordered_level n
where n.scope = r.scope
and n.level = r.level
and n.position > r.position
),d.position+1)
and d.level = r.level + 1
and r.id = 18


Getting all ancestors is similar:


/*
*
* getting the ancestors
*
*/
select a.id
, a.position
, a.level
, concat(
repeat(' ',4*(a.level-1))
, a.label
) label
from ordered_level c
inner join ordered_level a
on c.scope = a.scope
and c.position >= a.position
and c.level >= a.level
where a.position = (
select max(aa.position)
from ordered_level aa
where aa.scope = c.scope
and aa.level = a.level
and aa.position <= c.position
group by aa.level
)
and c.id = 16


For those who are wondering: no, I did not just make this model up. I picked it up working with Microsoft Project. When you save a project in the database, the task list is stored according to such a datamodel. I've known that structure for quite some time, but only now I see the resemblance to the nested set, in that this too relies on linearization of the depth first traversal of the hierarchy.

Thursday, August 25, 2005

Hierarchies...Again

Yep, handling of hierarchical data in a relational database is a popular subject lately. I never heard of the nested set model until I stumbled upon it on the mysql newbie forum.
Now, it seems as if it's popping up everywhere. MySQL even republished the article pointed to in that thread. And, it's being read, too, at least by Clive Galway, who's using it manage Google Map data.

Anyway, I'm seriously considering to use this structure in the future too, so I decided to investigate it for a bit. And while I'm at it, I thought that it would be a nice opportunity to familiarize myself with MySQL a bit.
For starters, I think I won't do away with the adjacency list all together. What I have in mind is a hybrid of the nested set model and the adjacency list:


CREATE TABLE node (
id int unsigned NOT NULL auto_increment
, parent_id int unsigned -- AL
, position int unsigned NOT NULL default 0 -- AL
, left_node int unsigned NOT NULL -- NS
, right_node int unsigned NOT NULL -- NS
, label varchar(32)
, PRIMARY KEY(id)
, CONSTRAINT fk_node
FOREIGN KEY (parent_id)
REFERENCES node (id)
)


I don't know if there's any classical relational terminology that captures it, but we definitely have some form of redundance: we can construct a hierarchy using either the parent_id and id columns or the left_node and right_node columns, and if we would do so, these hierarchies must not different. So we would most certainly have to control that redundance.
So why would I want to breed this hybrid? What reason could I have? The obvious answer is that i want the best of both models. I'll try and illustrate that.

The adjacency list is a very intuitive representation of a hierarchy. Entering data could not be simpler than with the adjacency list. All we need is the data that belongs to the node itself (here represented by the label column).
Optionally, we could provide the id of the parent node. We can create a root node simply by not specifying the parentnode at all. (By the term "root node", I simply mean a node without a parent - nothing more, nothing less)
If we want to, we can supply a value for the position column, wich governs what position the new node will occupy among it's sibling nodes. (The sibling nodes are the nodes that are also direct children of the parent node, or -in case we're creating a root node, the other root nodes).

We really want to make advantage of this property of the adjacency list model.

Now everyone that's ever made use of such a model will be quite aware of the downside of this type of model: the inability to extract data from it for an arbitrary branch and for an arbitrary depth.
A lot of people always mention that this is because of the recursive nature of this type of structure, but i disagree on that. In my opinion the underlying cause for the difficulties of this model lie in the fact that you always need some kind of multi-pass algorithm.
Say you want to find all descendants from a give node down, you'd need to repeatedly scan through all entries in the table to look for nodes who's parent_id is the current id. Once you have those nodes, you have to repeat the proces for those nodes again until you find that there are no more nodes that have a matching parent id.
Although a recursive algorithm forms a very natural solution to perform this type of multi-pass job, there's nothing that prevents you from using iteration to achieve the same thing. In an iterative aproach, one would have a loop that records the information that's used in the subsequent passes of the loop, again and again until there's no more information registered wherafter the loop ends.
Anyway, be it recursive or iterative, we need multiple passes to scan our set, and this is something that's not supported, at least not in pure SQL (save a construct like CONNECT BY).

The nested set model is just beatiful when it comes to extracting data. Because the hierarchy is expressed using numerical intervals, a simple, single-pass algorithm is sufficient to extract all the data from a single node down or from a single node up, simply by comparing numerical intervals. We only need the interval of the one node that starts the process, and compare that with the intervals of all the other nodes and a single comparison is enough to assert that a node is related or unrelated to that node. This is all very different from the adjacency list, where the information that we need to get the next level of nodes is only available at the latest possible moment.

We really want to make advantage of this property of the nested set model. I already have a view that's good for this:


CREATE VIEW tree.tree_node
AS
select rn.id AS root_id
, rn.parent_id AS root_parent_id
, rn.position AS root_position
, rn.label AS root_label
, dn.left_node AS descendant_order
, dn.id AS descendant_id
, dn.parent_id AS descendant_parent_id
, dn.position AS descendant_position
, dn.label AS descendant_label
, count(an.id) AS descendant_depth
from tree.node rn
inner join tree.node dn
on rn.left_node <= dn.left_node
and rn.right_node >= dn.left_node
inner join tree.node an
on dn.left_node >= an.left_node
and dn.left_node <= an.right_node
and an.left_node >= rn.left_node
and an.left_node <= rn.right_node
group by rn.id
, rn.parent_id
, rn.position
, rn.label
, dn.left_node
, dn.id
, dn.parent_id
, dn.position
, dn.label
;


The downside of the nested set model is of course the construction and maintenace of these intervals. When we want to add a node to our hierarchy, we need to specify an interval for our new node.
This interval needs fit in the interval of the parent node, but is must also but in between the intervals of the previous and next sibling of our new node. When the inteval of the parent node isn't wide enough, we need to stretch it to make room for our new node. The intervals of the siblings might need to be shifted to make room inbetween. Another real bummer: all this shifting and widening impacts the nodes that are directly related to the sibling nodes and the parent node in a similar way.
In a worst case scenario, the insertion of a single node in the leftmost, depthmost branch of the tree will cause the intervals of *ALL* the other nodes to be adjusted. Just one little mistake, and we'll be in a lot of trouble reconstructing the original hierarchy. Cut short, not the kind of stuff I trust the applications to handle well. If we are going to have all this interval maintenance going on, we're going to have it in one place, and that's the database. This also the most convenient place, since we potentially need to access all nodes to perform one single action.

So far, I've come up with this function to create new nodes:


DELIMITER $$

DROP FUNCTION IF EXISTS `tree`.`f_ins_node`$$
CREATE FUNCTION `tree`.`f_ins_node`(
p_label varchar(32)
, p_parent_id int unsigned
, p_position int unsigned
) RETURNS int(10) unsigned
BEGIN
insert
into node
select id
, parent_id
, position
, left_node
, right_node
, label
from (
--
-- parent node + ancestors
--
select o.id id
, o.parent_id parent_id
, o.position
+ if(
o.position = p_position
and o.parent_id = p.id
, 1
, 0
) position
, o.left_node left_node
, o.right_node + 1 right_node
, o.label label
from node p
inner join node o
on p.left_node >= o.left_node
and p.left_node <= o.right_node
where p.id = p_parent_id
--
-- right siblings
--
union all
select o.id
, o.parent_id
, o.position + 1
, o.left_node + 1
, o.right_node + 1
, o.label
from node o
where o.parent_id = p_parent_id
and o.position >= p_position
--
-- right unrelated branches 1
--
union all
select o.id
, o.parent_id
, o.position
, o.left_node + 1
, o.right_node + 1
, o.label
from node p
inner join node o
on p.right_node < o.left_node
where p.id = p_parent_id
--
-- right unrelated brances 2
--
union all
select d.id
, d.parent_id
, d.position
+ if(
d.id= o.id
and o.position >= p_position
, 1
, 0
) position
, d.left_node + 1
, d.right_node + 1
, d.label
from node o
inner join node d
on o.left_node <= d.left_node
and o.right_node >= d.left_node
where p_parent_id is null
and o.parent_id is null
and o.position >= p_position
--
-- self 1:
-- root
--
union all
select null
, null
, coalesce(p_position,p.position + 1)
, max(p.right_node) + 1
, max(p.right_node) + 1
, p_label
from node p
where p_parent_id is null
and p.parent_id is null
and p.position <= coalesce(p_position,p.position)
group by p.id
--
--
--
union all
select null
, p.id
, coalesce(
p_position
, max(c.position)+1
, 0
) as position
, coalesce(
if(count(c.id)=0,p.left_node+1,null) -- no siblings
, if(p_position is null,max(c.right_node)+1,null) -- after last sibling
, max(if(c.position = p_position,c.left_node,null)) -- at position
, max(if(c.position < p_position,c.right_node+1,null)) -- max before position
, p.left_node + 1 -- none before
) as left_node
, coalesce(
if(count(c.id)=0,p.left_node+1,null) -- no siblings
, if(p_position is null,max(c.right_node)+1,null) -- after last sibling
, max(if(c.position = p_position,c.left_node,null)) -- at position
, max(if(c.position < p_position,c.right_node+1,null)) -- max before position
, p.left_node + 1 -- none before
) as right_node
, p_label
from node p
left join node c
on p.id = c.parent_id
where p.id = p_parent_id
group by p.id
) as n
on duplicate key
update
position = n.position
, left_node = n.left_node
, right_node = n.right_node
;
return last_insert_id()
;
END$$

DELIMITER ;


My interval maintenance is a bit different than what I found in the examples I find on the web. In particular, my leaf nodes have the propery that

left_node = right_node

and not

left_node = right_node -1

Another difference is that my hierarchy can have multiple roots. I really want to be able to do this, for example to model organization structures. I really should be possible to have a couple mother organization-trees that are not related to each other. I decided not to provide a default dummy root-of-all-roots node, simply because I want my data to express that we can have a couple of hierarchies that are completely independant upon each other. Because this is still a adjacency list model, i can still identify a root easily bij requiring that

parent_id is null

I'm still working on a procedure to delete nodes (nearly finished) and a procedure to move a node from one brach to the others.
I'd also like to program a couple of commands to navigate through a hierarchy on the basis of a current node. Anyway, when I finish it all, ill post it to the MySQL forum and see what people think of this hybrid approach.

Monday, August 22, 2005

SQL Standards Download...It's useful, but is it legal?

So much as a couple of years ago, I stumbled upon this link:

http://www.ncb.ernet.in/education/modules/dbms/sql99index.html

It pops up at page 14 when googling for "ISO 9075".
I have no means of checking it it's is genuine, but everything suggests it is.

Funny thing is, I always thought that these cost money, even though this version of the standard is outdated.

Saturday, August 20, 2005

MySQL: Create Dirty Little Tricker FOR SELECT ON view

Wow, i never thought it would work, but it sure seems to be the case. I stumbled over a post at the MySQL Triggers forum by Greg Chandler.

Greg would've liked to increment a number that's stored in the database. That's not an uncommon thing to want, but what makes Greg's question interesting in my opinion, is that he would like this to happen in the event that a user only reads the database field that stores the number. Now, I don't wan't to go into details as to why he needs it, but feel free and follow the link to find out all about it.

Suppose you'd want to implement this, what you'd need is something like:


create trigger {trigger-name}
for select on {table-name}[(
{columnlist}
)]
for each row
begin
..increase that number...
end;


Of course, I made this syntax up to illustrate the resemblance with an ordinary triggered action. Like a stored procedure, a trigger is associated with a sequence of statements that's persistently stored in the database. This sequence can be executed repeatedly afterwards. Unlike a stored procedure, a trigger can't be called or evaluated directly at the user's request. Rather, a triggers is fired in response to some event occurring in the database.

The classical trigger events are associated with a database base table. These events correspond directly to the occurrence of one of the fundamental operations that can be performed to manipulate the contents of a table: INSERT, UPDATE and DELETE.

Of course, the concept of an event that fires execution of a sequence of statements can be extended to all kinds of objects, and depending upon the type of object, you could think of appropriate events for that object. For starters, quite some database products offer similar functionality for views as well as for base tables. And Oracle even supports triggers that react to operations that alter the structure of a table, as well as events that occur to the system as a whole (system shutdown etc.), not just a particular part of the system.

What I think is interesting about Gregs original problem is that there's no RDBMS product that I know that supports the trigger concept for the SELECT operation. I can think of at least one purpose for such functionality. Suppose we'd have a collection of views, for example as part of a reporting environment. Now, it would be really cool to know which views are used a lot, how many rows a query on the view generally yields and the like.

Now, there's this dirty little trick that seems to do the job. I don't think that this trick reflects intended usage of the MySQL database product, and I wouldn't be suprised if it doesn't survive the 5.0.10 version I'm currently running. But I'm so excited about it, I just want to show it off. And of course, I hope that it will be of some use to you, dear reader.

The trick is implemented by creating a function that performs the desired actions. Now, function evaluation is just a particular kind of expression. This means we can write a view that references our function, thus executing whatever statements we put in the function. Whenever the user select from the view, our function would be evaluated, and thus, the actions are performed.

Let's stick to the 'logging view usage' case. I'll be using the test database for all this, so let's make it the current database:


use test;


First, we'll create a table to log usage in:


CREATE TABLE select_log (
id int unsigned NOT NULL auto_increment PRIMARY KEY
, log_who varchar(16) NOT NULL -- the user that performs the SELECT
, log_conn int unsigned NOT NULL -- the id of the connection
, log_when timestamp NOT NULL -- the moment the operation is performed
, log_schema varchar(64) NOT NULL -- the schema in which the view resides
, log_view varchar(64) NOT NULL -- the view
)
;

Here's the function that performs the logging:

delimiter $$
create function f_log_select(
p_log_schema varchar(64)
, p_log_view varchar(64)
)
returns int unsigned
NOT DETERMINISTIC
begin
insert
into select_log(
log_who
, log_conn
, log_when
, log_schema
, log_view
) values (
current_user()
, connection_id()
, now()
, p_log_schema
, p_log_view
);
return last_insert_id();
end;
$$


So far so good. Now, we need a table that represents the actual data for wich we can write a view that 'fires' our function:


CREATE TABLE table1 (
number int(11) default
, clientid int(11) default
, datebegin datetime default
, dateend datetime default
, data char(1) default
)


And we'll need some data to populate it:


+--------+----------+---------------------+---------------------+------+
| number | clientid | datebegin | dateend | data |
+--------+----------+---------------------+---------------------+------+
| 1 | 1 | 2005-01-01 00:00:00 | 2005-01-02 00:00:00 | A |
| 2 | 1 | 2005-01-02 00:00:00 | 2005-01-04 00:00:00 | b |
| 3 | 1 | 2005-01-04 00:00:00 | 2005-01-05 00:00:00 | c |
+--------+----------+---------------------+---------------------+------+


Finally, we're arriving at the heart of the matter: the view definition. We want the view to select all the rows from table1, and we want to make sure that our f_log_select function is called in the process. This is what I came up with:


create or replace view v_table1
as
select *
from table1
where f_log_select(
'test'
, 'table1'
)


Now, lets select from our view, and see what happens:


mysql> select * from v_table1;
+--------+----------+---------------------+---------------------+------+
| number | clientid | datebegin | dateend | data |
+--------+----------+---------------------+---------------------+------+
| 1 | 1 | 2005-01-01 00:00:00 | 2005-01-02 00:00:00 | A |
| 2 | 1 | 2005-01-02 00:00:00 | 2005-01-04 00:00:00 | b |
| 3 | 1 | 2005-01-04 00:00:00 | 2005-01-05 00:00:00 | c |
+--------+----------+---------------------+---------------------+------+
3 rows in set (0.00 sec)

mysql> select * from select_log;
+----+----------------+----------+---------------------+------------+----------+
| id | log_who | log_conn | log_when | log_schema | log_view |
+----+----------------+----------+---------------------+------------+----------+
| 1 | root@localhost | 1 | 2005-08-20 09:39:03 | test | v_table1 |
+----+----------------+----------+---------------------+------------+----------+
1 row in set (0.00 sec)


Wow! It works...or does it? I was a bit puzzled at first why there's only one row in the log table, whereas there are three rows in the view. I'd expected to see three rows in the log table as well. Giving it a bit more thought, I must conclude that the results are in fact correct. The f_log_select function has about the same status as a call to a built in function like NOW(). So, whenever a select is issued against the view, f_log_select is evaluated exactly once. The evaluation is totally independant of the number of rows selected from the view:


mysql> select * from v_table1 where number = 0;
Empty set (0.01 sec)

mysql> select * from select_log;
+----+----------------+----------+---------------------+------------+----------+
| id | log_who | log_conn | log_when | log_schema | log_view |
+----+----------------+----------+---------------------+------------+----------+
| 1 | root@localhost | 1 | 2005-08-20 09:39:03 | test | v_table1 |
| 2 | root@localhost | 1 | 2005-08-20 09:47:26 | test | v_table1 |
+----+----------------+----------+---------------------+------------+----------+
2 rows in set (0.01 sec)


Having thought about this, I thought that it would be really cool to log both the occurrence of a select statement as well as the event of ann individual row being selected. So, I extended my log table a bit:


alter table select_log
add log_type enum('ROW','STATEMENT')
not null
default 'STATEMENT'
;


(I used 'STATEMENT' as default because I'm to lazy to rewrite f_log_select at this stage).

Ok, so what would we do to trick MySQL into evaluating a function for each row selected by the view? Actually, it's not that hard. We must pass a value from our record to the function, so that MySQL is forced to re-evaluate for each row. So, in addition to f_log_select, I write this:


DELIMITER $$

DROP FUNCTION IF EXISTS `test`.`f_log_select_row`$$
CREATE FUNCTION `test`.`f_log_select_row`(
p_log_schema varchar(64)
, p_log_view varchar(64)
, p_expr varchar(1)
) RETURNS int(10) unsigned
begin
insert
into select_log(
log_who
, log_conn
, log_when
, log_schema
, log_view
, log_type
) values (
current_user()
, connection_id()
, now()
, p_log_schema
, p_log_view
, 'ROW'
);
return last_insert_id();
end$$

DELIMITER ;


And, I rewrote the view:


create or replace view v_table1
as
select *
from table1
where f_log_select(
'test'
, 'table1'
)
and f_log_select_row(
'test'
, 'table1'
, data
)


When we clear the log, select from the view and then see what's in the log we get this:


+----+----------------+----------+---------------------+------------+----------+-----------+
| id | log_who | log_conn | log_when | log_schema | log_view | log_type |
+----+----------------+----------+---------------------+------------+----------+-----------+
| 3 | root@localhost | 1 | 2005-08-20 10:14:19 | test | v_table1 | STATEMENT |
| 4 | root@localhost | 1 | 2005-08-20 10:14:19 | test | v_table1 | ROW |
| 5 | root@localhost | 1 | 2005-08-20 10:14:19 | test | v_table1 | ROW |
| 6 | root@localhost | 1 | 2005-08-20 10:14:19 | test | v_table1 | ROW |
+----+----------------+----------+---------------------+------------+----------+-----------+


So, it definitely seems to work.

Well, I already said that I regard this as rather dirty little trick, but I had fun devising it, and it learnt me a bit about function evaluation within views.

Maybe this isn't a trick after all, and maybe it's even useful for a real purpose. If you find such a purpose, go ahead, and try if it works for you. I'll be most interested to hear from anyone who tries.

Friday, August 19, 2005

Slight CONNECT BY problem....SOLVED?

It's been a while since I've been to my employers HQ. Usually, I'm stationed at my client's office, but today, me and a co-worker of mine, demo-ed a portal solution to my client.

Anyway, the presenstation went fine, and I spent the rest of the afternoon telling my colleages about a nifty little trick I picked up at the MySQL Forum. It's a not really a trick, it's actually a very clever datamodel to store hierarchical data in a relational database.

Usually, this is done using the adjency list model, but because this model relies on a recursive relationship, it's a hell of a job to extract all the data relating to an entire brach of the hierarchy. The trick I picked up is the nested set model. You can read all about that over here:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Now, my coworker has developed a project database that uses a table to store the work breakdown structure; a typical hierarchical structure. It runs on Oracle.

Here's an example resembling his table:


create table PRODUCT(
ID NUMBER NOT NULL
, NAME VARCHAR2(30) NOT NULL
, POSITION NUMBER NOT NULL
, PARENT_ID NUMBER
, CONSTRAINT PK_PRODUCT
PRIMARY KEY(ID)
, CONSTRAINT FK_PRODUCT
FOREIGN KEY(PARENT_ID)
REFERENCES PRODUCT(ID)
)

(sorry 'bout the caps, but i extract it like this from the data dictionary)

And here's some sample data:


ID NAME POSITION PARENT_ID
---------- ---------------- ---------- ----------
1 project 1
2 hardware 1 1
20 windows bak 1 2
21 linux bak 2 2
3 software 2 1
30 schermpje 1 3
31 driver 2 3
4 migratie 9 1
41 extractie 1 4
42 laden 2 4
26 documentatie 4 1
5 systeem 1 26
128 gebruikers 2 26
6 html 1 128


So, what we're seeing here is a typical adjency list implementation, with a recursive relationship defined on the parent_id column. This column optionally references the id of the table. Now there's one extra feature in this table, the position column. The purpose of the position column is to define the order in which sibling rows should be sorted.

Now, part of his application wants to select an entire tree to depict the complete work breakdown structure. Oracle has some built-in features that relieves some of the burdens you encounter when you're working with the 'adjacency list model'. In particular, Oracle supports the CONNECT BY language construct which tells the database to perform a recursive query:


select p.*
, level as inner_level
from product p
connect by prior p.id = p.parent_id
start with id = 1


It looks a bit like a JOIN, but what's really odd is that there's only one instance of the product table. So, connect by literally joins a table to itself - not to another instance of itself (SELF or AUTO JOIN). Related constructs are the PRIOR operator and the START WITH clause.

The construct is best explained by assuming that execution starts with the row from the product table for wich id = 1 is true. Then, all other rows are scanned, looking for those rows where the parent_id matches the id from the original row we started out with. Once we're done with that, the level pseudocolumn is incremented by 1, and the cycle is started anew, this time staring out with the id's from the rows we gained this round. A process of cycles is executed untill there are no more rows to match. Query execution is then complete, and the result is returned.


ID NAME POSITION PARENT_ID INNER_LEVEL
---------- ------------------------------ ---------- ---------- -----------
1 project 1 1
2 hardware 1 1 2
20 windows bak 1 2 3
21 linux bak 2 2 3
3 software 2 1 2
30 schermpje 1 3 3
31 driver 2 3 3
4 migratie 9 1 2
41 extractie 1 4 3
42 laden 2 4 3
26 documentatie 4 1 2
5 systeem 1 26 3
128 gebruikers 2 26 3
6 html 1 128 4

The algoritm is like a breadth first recursive algorithm. That is, each cycle retrieves only the immediate child rows, and only when these are all retrieved, the level pseudocolumn is incremented, and the next cycle is started with the rows we just retrieved in the round. (Note that the rows are not returned in the order of depth. It all remains a relational system, and any order or sorting must be applied explicitly.)

This recursiveness is what really distinguishes CONNECT BY from an ordinary join. It's very powerful in the sense that one need not know how deep the tree is: CONNECT BY will execute and execute until there is no new level to descend to. Unfortunately, the query will throw an exception without retunring any row when there's a circulare reference in the data. As of Oracle

Now, there's a slight problem with this result. It sure does yield the right rows, but what we really want is to have Oracle present the results like, well, like a tree.
The result should be something like this:


PARENT_ID POSITION ID NAME
---------- ---------- ---------- --------------------
1 1 project
1 1 2 hardware
2 1 20 windows bak
2 2 21 linux bak
1 2 3 software
3 1 30 schermpje
3 2 31 driver
1 4 26 documentatie
26 1 5 systeem
26 2 128 gebruikers
128 1 6 html
1 9 4 migratie
4 1 41 extractie
4 2 42 laden

Well I was really surprised to hear that so far, no one had come up with a query variant that would return the rows from the original query in the desired order. In fact, my co-worker told me that it was IMPOSSIBLE to do it. So, what can I do but feel really challenged? I started of full of optimism, only to discover that everything i tried amounted to nothing. I kept coming back at the same theme: to order the nodes depth-first according to their parent_id and position, you need to access all of the ancestor rows in order to account for the positional information stored in their parent_id and position columns too.

Well, I still hope there's a better way, but finally, I came up with this:


select p.id
, p.parent_id
, p.position
, lpad(' ',inner_level)||p.name as name
, inner_level
, power(10,3-inner_level)
* (select sum(p1.position * power(10,level))
from product p1
connect by prior p1.parent_id = p1.id
start with p1.id = p.id
) sort
from (
select p.*
, level as inner_level
from product p
connect by prior p.id = p.parent_id
start with id = 1
) p order by 6


It yields this result:


ID PARENT_ID POSITION NAME INNER_LEVEL SORT
--- ---------- ---------- ---------------- ----------- ----------
1 project 1 1000
2 1 1 hardware 2 1100
20 2 1 windows bak 3 1110
21 2 2 linux bak 3 1120
3 1 2 software 2 1200
30 3 1 schermpje 3 1210
31 3 2 driver 3 1220
26 1 4 documentatie 2 1400
5 26 1 systeem 3 1410
128 26 2 gebruikers 3 1420
6 128 1 html 4 1421
4 1 9 migratie 2 1900
41 4 1 extractie 3 1910
42 4 2 laden 3 1920


Now, I know the solution is flawed. For example, you can't have more than 9 children per node, and you can't have a deeper tree than 38 (number precision). You could tinker it a bit to allow for 99 children per node with a maximum depth of 19, which seems reasonable for this particular purpose (work breakdown structure)

However, the more I'm familiarizing myself with the nested set model, the more i feel that that is a much, much better solution to handle hierarchy within a relational database.

Thursday, August 11, 2005

Blog cherry popped

Yep, I guess this is it....I never imagined it would be like this. Wow...My First Post on My First Blog.

I just wanted to post a wee comment in response to a tiny IE issue, and before you know it, I got my own blog. Now, let's hope I'm more faithful to you than I am to my http://www.xcdsql.org/ homepage, dear Blog...

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...