Thursday, August 25, 2005

Hierarchies...Again

Yep, handling of hierarchical data in a relational database is a popular subject lately. I never heard of the nested set model until I stumbled upon it on the mysql newbie forum.
Now, it seems as if it's popping up everywhere. MySQL even republished the article pointed to in that thread. And, it's being read, too, at least by Clive Galway, who's using it manage Google Map data.

Anyway, I'm seriously considering to use this structure in the future too, so I decided to investigate it for a bit. And while I'm at it, I thought that it would be a nice opportunity to familiarize myself with MySQL a bit.
For starters, I think I won't do away with the adjacency list all together. What I have in mind is a hybrid of the nested set model and the adjacency list:


CREATE TABLE node (
id int unsigned NOT NULL auto_increment
, parent_id int unsigned -- AL
, position int unsigned NOT NULL default 0 -- AL
, left_node int unsigned NOT NULL -- NS
, right_node int unsigned NOT NULL -- NS
, label varchar(32)
, PRIMARY KEY(id)
, CONSTRAINT fk_node
FOREIGN KEY (parent_id)
REFERENCES node (id)
)


I don't know if there's any classical relational terminology that captures it, but we definitely have some form of redundance: we can construct a hierarchy using either the parent_id and id columns or the left_node and right_node columns, and if we would do so, these hierarchies must not different. So we would most certainly have to control that redundance.
So why would I want to breed this hybrid? What reason could I have? The obvious answer is that i want the best of both models. I'll try and illustrate that.

The adjacency list is a very intuitive representation of a hierarchy. Entering data could not be simpler than with the adjacency list. All we need is the data that belongs to the node itself (here represented by the label column).
Optionally, we could provide the id of the parent node. We can create a root node simply by not specifying the parentnode at all. (By the term "root node", I simply mean a node without a parent - nothing more, nothing less)
If we want to, we can supply a value for the position column, wich governs what position the new node will occupy among it's sibling nodes. (The sibling nodes are the nodes that are also direct children of the parent node, or -in case we're creating a root node, the other root nodes).

We really want to make advantage of this property of the adjacency list model.

Now everyone that's ever made use of such a model will be quite aware of the downside of this type of model: the inability to extract data from it for an arbitrary branch and for an arbitrary depth.
A lot of people always mention that this is because of the recursive nature of this type of structure, but i disagree on that. In my opinion the underlying cause for the difficulties of this model lie in the fact that you always need some kind of multi-pass algorithm.
Say you want to find all descendants from a give node down, you'd need to repeatedly scan through all entries in the table to look for nodes who's parent_id is the current id. Once you have those nodes, you have to repeat the proces for those nodes again until you find that there are no more nodes that have a matching parent id.
Although a recursive algorithm forms a very natural solution to perform this type of multi-pass job, there's nothing that prevents you from using iteration to achieve the same thing. In an iterative aproach, one would have a loop that records the information that's used in the subsequent passes of the loop, again and again until there's no more information registered wherafter the loop ends.
Anyway, be it recursive or iterative, we need multiple passes to scan our set, and this is something that's not supported, at least not in pure SQL (save a construct like CONNECT BY).

The nested set model is just beatiful when it comes to extracting data. Because the hierarchy is expressed using numerical intervals, a simple, single-pass algorithm is sufficient to extract all the data from a single node down or from a single node up, simply by comparing numerical intervals. We only need the interval of the one node that starts the process, and compare that with the intervals of all the other nodes and a single comparison is enough to assert that a node is related or unrelated to that node. This is all very different from the adjacency list, where the information that we need to get the next level of nodes is only available at the latest possible moment.

We really want to make advantage of this property of the nested set model. I already have a view that's good for this:


CREATE VIEW tree.tree_node
AS
select rn.id AS root_id
, rn.parent_id AS root_parent_id
, rn.position AS root_position
, rn.label AS root_label
, dn.left_node AS descendant_order
, dn.id AS descendant_id
, dn.parent_id AS descendant_parent_id
, dn.position AS descendant_position
, dn.label AS descendant_label
, count(an.id) AS descendant_depth
from tree.node rn
inner join tree.node dn
on rn.left_node <= dn.left_node
and rn.right_node >= dn.left_node
inner join tree.node an
on dn.left_node >= an.left_node
and dn.left_node <= an.right_node
and an.left_node >= rn.left_node
and an.left_node <= rn.right_node
group by rn.id
, rn.parent_id
, rn.position
, rn.label
, dn.left_node
, dn.id
, dn.parent_id
, dn.position
, dn.label
;


The downside of the nested set model is of course the construction and maintenace of these intervals. When we want to add a node to our hierarchy, we need to specify an interval for our new node.
This interval needs fit in the interval of the parent node, but is must also but in between the intervals of the previous and next sibling of our new node. When the inteval of the parent node isn't wide enough, we need to stretch it to make room for our new node. The intervals of the siblings might need to be shifted to make room inbetween. Another real bummer: all this shifting and widening impacts the nodes that are directly related to the sibling nodes and the parent node in a similar way.
In a worst case scenario, the insertion of a single node in the leftmost, depthmost branch of the tree will cause the intervals of *ALL* the other nodes to be adjusted. Just one little mistake, and we'll be in a lot of trouble reconstructing the original hierarchy. Cut short, not the kind of stuff I trust the applications to handle well. If we are going to have all this interval maintenance going on, we're going to have it in one place, and that's the database. This also the most convenient place, since we potentially need to access all nodes to perform one single action.

So far, I've come up with this function to create new nodes:


DELIMITER $$

DROP FUNCTION IF EXISTS `tree`.`f_ins_node`$$
CREATE FUNCTION `tree`.`f_ins_node`(
p_label varchar(32)
, p_parent_id int unsigned
, p_position int unsigned
) RETURNS int(10) unsigned
BEGIN
insert
into node
select id
, parent_id
, position
, left_node
, right_node
, label
from (
--
-- parent node + ancestors
--
select o.id id
, o.parent_id parent_id
, o.position
+ if(
o.position = p_position
and o.parent_id = p.id
, 1
, 0
) position
, o.left_node left_node
, o.right_node + 1 right_node
, o.label label
from node p
inner join node o
on p.left_node >= o.left_node
and p.left_node <= o.right_node
where p.id = p_parent_id
--
-- right siblings
--
union all
select o.id
, o.parent_id
, o.position + 1
, o.left_node + 1
, o.right_node + 1
, o.label
from node o
where o.parent_id = p_parent_id
and o.position >= p_position
--
-- right unrelated branches 1
--
union all
select o.id
, o.parent_id
, o.position
, o.left_node + 1
, o.right_node + 1
, o.label
from node p
inner join node o
on p.right_node < o.left_node
where p.id = p_parent_id
--
-- right unrelated brances 2
--
union all
select d.id
, d.parent_id
, d.position
+ if(
d.id= o.id
and o.position >= p_position
, 1
, 0
) position
, d.left_node + 1
, d.right_node + 1
, d.label
from node o
inner join node d
on o.left_node <= d.left_node
and o.right_node >= d.left_node
where p_parent_id is null
and o.parent_id is null
and o.position >= p_position
--
-- self 1:
-- root
--
union all
select null
, null
, coalesce(p_position,p.position + 1)
, max(p.right_node) + 1
, max(p.right_node) + 1
, p_label
from node p
where p_parent_id is null
and p.parent_id is null
and p.position <= coalesce(p_position,p.position)
group by p.id
--
--
--
union all
select null
, p.id
, coalesce(
p_position
, max(c.position)+1
, 0
) as position
, coalesce(
if(count(c.id)=0,p.left_node+1,null) -- no siblings
, if(p_position is null,max(c.right_node)+1,null) -- after last sibling
, max(if(c.position = p_position,c.left_node,null)) -- at position
, max(if(c.position < p_position,c.right_node+1,null)) -- max before position
, p.left_node + 1 -- none before
) as left_node
, coalesce(
if(count(c.id)=0,p.left_node+1,null) -- no siblings
, if(p_position is null,max(c.right_node)+1,null) -- after last sibling
, max(if(c.position = p_position,c.left_node,null)) -- at position
, max(if(c.position < p_position,c.right_node+1,null)) -- max before position
, p.left_node + 1 -- none before
) as right_node
, p_label
from node p
left join node c
on p.id = c.parent_id
where p.id = p_parent_id
group by p.id
) as n
on duplicate key
update
position = n.position
, left_node = n.left_node
, right_node = n.right_node
;
return last_insert_id()
;
END$$

DELIMITER ;


My interval maintenance is a bit different than what I found in the examples I find on the web. In particular, my leaf nodes have the propery that

left_node = right_node

and not

left_node = right_node -1

Another difference is that my hierarchy can have multiple roots. I really want to be able to do this, for example to model organization structures. I really should be possible to have a couple mother organization-trees that are not related to each other. I decided not to provide a default dummy root-of-all-roots node, simply because I want my data to express that we can have a couple of hierarchies that are completely independant upon each other. Because this is still a adjacency list model, i can still identify a root easily bij requiring that

parent_id is null

I'm still working on a procedure to delete nodes (nearly finished) and a procedure to move a node from one brach to the others.
I'd also like to program a couple of commands to navigate through a hierarchy on the basis of a current node. Anyway, when I finish it all, ill post it to the MySQL forum and see what people think of this hybrid approach.

3 comments:

Anonymous said...

I can definately see mileage in the concept of having both nested set _and_ adjacency models in one db.
My personal idea is that the adjacency lookup is the "Master" and the nested set lft and rgt fields are an additional lookup.
There are many examples on the net of how to convert adjacency to nested set, so I just lock the table, add the new entry in with adjacency values (easy to calculate) and then run the adjacency to nested set convert.
I think this may be simpler than trying to do both at once.

Also, regarding multi-root nested sets. Not sure if this is *really* useful or worth the extra hassle.
I personally think the one "root of roots" is fine. If you want seperate trees, just make you code ignore the root node as a record. ie behave like it doesn't exist. It's only use is to tie all the trees together.
If you have multiple roots, you have no way of knowing wether a data is truly "valid" by looking at it. With one root, walk the tree and if the number of records you come across = the record count of the table, your nested set tree is good.
Just my 2p's worth...

Anonymous said...

And one other thing, I just noticed that you are not using lft=rgt-1.

OK, so I am nothing like as good at SQL as you, but this makes zero sense to me.

Granted, all I know about nested set was gleaned from: http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

but for example, if the lft and rgt value of a node are the same, how the hell can you do all those interesting calculations listed on the mysql page.

For example, given a node "parent" and another node "check":

if check.lft is between parent.lft and parent.rgt, it is a child of parent.

I dont get how the hell that's gonna work if lft=rgt...

rpbouman said...

About the multiple roots: your'e probably right that you could work around it. The reason that I dont want to is that this would involve an explicit check to exclude the 'superroot' when you query the hierarchy. Now, because insert, update and delete already require special care as it is, I'm quite willing to invest a little bit extra in those processes if that will avoid checking for this in all the SELECT's I'm going to write against the hierarchy.

Also, without the superroot, it's immediately clear to everyone that you're dealing with seperate trees that bear no relationship to each other. With the 'artificial supperroot', you'd always be asking yourself if you're dealing with a pure, single hierarchy or with multiple independant hierarchies.

As for the map data you are dealing with: i gueass that's a quite clear cut case where there is actually one superroot.

As for the lft=rgt stuff: i don't know much more than you do about the nested set. My source is also the article you mention (and the pointers mentioned there, although they don't shed any other light on it).

As far as I can see, the queries are not any different from the lft=rgt-1 implementation. That's because the relationship between a node and it's parent (or ancestors for that matters) remains unchanged. The lft of the child is still between lft and rgt of the ancestors. Only when the child itself would gain a child would we need to make room. But, this is exactly the same for the lft = rgt-1 case (or isn't it? The case you mention does not seem like a big problem to me).

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...