Roland Bouman's blog: Yet another way to model hierarchies

Friday, August 26, 2005

Yet another way to model hierarchies

I'm seriously considering to use a nested set model, but I'm still not too happy about all the overhead involved.

I reread the article by Mike Hillyer, especially the bit that handles inserting new nodes. I like the function I wrote because it uses a single insert statement to handle both the rearrangement of the tree as well as the insertion of the new node. This also relieves me from the task of explicitly locking tables.
On the other hand, the wee lines Mike uses to perform an insert seem so elegant, that I can't help thinking I'm doing something horribly wrong. But then again, in Mike's article, the user that's inserting the nodes somehow magically knows where to make room in the tree. The whole point of using a hybrid model like I want to, is to allow for easy inserts (and updates for that matter, although I'd still have to implement it) as well as straighforward, efficient extraction. On the other hand, what I really want is to encapsulate all the tree maintenance within triggers, but I know already that this approach will not be easy to realize in MySQL.
Think of it, I would want to build triggers that handle insert,update and delete, but in order to insert or to update or to delete, I would have to rearrange the tree and run some updates as a side effect. This would of course fire my update trigger, which would have to be so smart as to prevent reentry. The fact that mysql only supports row triggers makes it all very cumbersome.
In Oracle or MSSQL I'd probably solve this by writing a view with INSTEAD OF triggers. These triggers would perform the necesary actions in the base table, and everything would be hunky dory.
The table would not be granted to anyone, but the view would. The view wouldn't even expose the left and right attributes upon which the nested set model relies. Anyway, I can't do this in MySQL, not now anyway, so I'm still willing to use other soltions.

Now, I have not yet investigated it thoroughly, but I think that, besides the hybrid solution, there's another way to handle hierarchies that is about as intuitive as the adjacency list model, and yet nearly as manageable as the nested set model when it comes to extracting data from it. I don't know if there's an official name for it, but I will call it the 'ordered level' model until someone prompts me for something better than that.


create table ordered_level(
    id int unsigned not null auto_increment
,   scope int unsigned not null
,   position int unsigned not null
,   level int unsigned not null
,   label varchar(32) not null
,   constraint pk_ordered_level
        primary key (id)
,   constraint uk_ordered_level
        unique (
            scope
        ,   position
        )
);

insert into ordered_level(scope,position,level,label)values
 (1,1,1'Clothing')
,(1,2,2'Footwear')
,(1,5,2'Suits')
,(1,3,3'Shoes')
,(1,4,3'Socks')
,(2,1,1'Food')
,(2,2,2'Beverages')
,(2,5,2'Pastry')
,(2,3,3'Alcoholic')
,(2,4,3'Non-Alcoholic')

Like the nested set model, the ordered level model may be seen as a serialization of the depth first traversal of the tree underlying the hierarchy. The difference with the nested set as described by Joe Celko and Mike Hillyer is that instead of maintaining a left_node and a right_node property, we now rember only the position and level properties.

The position is quite like left_node in the nested set in that it stores a depth first traversal path through the tree. The level is needed to add depth to the tree. Together, level and position are sufficient to maintain the hierarchy.

The scope property is there to allow for multiple trees. So, the position reflects a depth first traversal within the tree identified by scope. The advantadge to me previous attemt is that it's now a lot easier to keep trees separate. This is reflected in updates: instead of having a single insert potetially update the entire forest, the worst what can happen here is only an entire tree needs to be updated.

Querying is not very hard compared to the original nested set, and some queries are actually easier:


/**
*    printing an entire tree
*
*/

select     scope
,          id
,          position
,          level
,          concat( 
               repeat(' ',4*(level-1))
           ,   label
           ) label
from       ordered_level ol
order by   scope
,          position


/*
*    printing a tree starting with an arbitrary node
*
*/

select     r.scope                   scope
,          d.position                position
,          r.id                      rid
,          r.level                   rlevel
,          r.label                   rlabel
,          d.id                      did
,          d.level                   dlevel
,          concat(
               repeat(' '
           ,  (d.level-1)*4)
           ,   d.label
           )                         label
from       ordered_level   r
inner join ordered_level   d
on         r.scope       = d.scope
and        r.position   <= d.position
and        r.level      <= d.level
where      d.position   <  (
               select coalesce(min(position),d.position+1)
               from   ordered_level n
               where  n.scope = r.scope
               and    n.level = r.level
               and    n.position > r.position
           )
order by   r.scope
,          r.level
,          r.position
,          d.position

For the latter ,there's a variant that does not need a correlated subquery:


select     r.scope                   scope
,          d.position                position
,          r.id                      rid
,          r.level                   rlevel
,          r.label                   rlabel
,          d.id                      did
,          d.level                   dlevel
,          concat(
               repeat(' '
           ,  (d.level-1)*4)
           ,   d.label
           )                         dlabel
from      (select    r.id            id
           ,         r.scope         scope
           ,         r.position      position
           ,         r.level         level
           ,         r.label         label
           ,         min(n.position) next_position
           from      ordered_level r
           left join ordered_level n
           on        r.scope    = n.scope
           and       r.level    = n.level
           and       r.position < n.position
           group by  r.id
           ,         r.scope
           ,         r.position
           ,         r.level
           ,         r.label
          ) as r
inner join ordered_level     d
on         r.scope         = d.scope
and        r.position     <= d.position
and        r.level        <= d.level
where     (r.next_position > d.position
or         r.next_position is null)
order by   scope
,          r.level
,          r.position
,          position

Getting all the leafs is harder than in the nested set though:


/**
*    Getting all the leafs
*
*/

select     *
from       ordered_level ol
where not exists (
           select null 
           from   ordered_level p
           where  p.scope = ol.scope
           and    p.level > ol.level
           )

Getting all the roots is a similar query:


/**
*    Getting all the roots
*
*/

select     *
from       ordered_level ol
where not exists (
           select null 
           from   ordered_level p
           where  p.scope = ol.scope
           and    p.level < ol.level
           )

When we don't want to make the strucure redundant by combining it with the adjacency list model, we must resort to something like this to retrieve a parent or the children:


/**
*
*    Getting the parent
*/

select     c.label
,          p.label
from       ordered_level c
left join  ordered_level p
on         c.scope    = p.scope
and        c.position > p.position
and        c.level    > p.level
where      p.position = (
               select max(pp.position)
               from   ordered_level pp
               where  pp.scope = c.scope
               and    pp.position < c.position
               and    pp.level    < c.level
           )
or         p.position is null


/**
*
*    Getting the children
*
*/
select     r.id
,          r.level
,          r.label
,          d.id
,          d.level
,          d.position
,          d.label
from       ordered_level   r
inner join ordered_level   d
on         r.scope       = d.scope
and        r.position   <= d.position
and        r.level      <= d.level
where      d.position   <  coalesce((
               select min(n.position)
               from   ordered_level n
               where  n.scope = r.scope
               and    n.level = r.level
               and    n.position > r.position
           ),d.position+1)
and        d.level = r.level + 1
and        r.id = 18

Getting all ancestors is similar:


/*
*
*    getting the ancestors
*
*/
select     a.id
,          a.position
,          a.level
,          concat( 
               repeat(' ',4*(a.level-1))
           ,   a.label
           ) label
from       ordered_level c
inner join ordered_level a
on         c.scope     = a.scope
and        c.position >= a.position
and        c.level    >= a.level
where      a.position = (
               select   max(aa.position)
               from     ordered_level aa
               where    aa.scope = c.scope
               and      aa.level = a.level
               and      aa.position <= c.position
               group by aa.level
           )
and        c.id  = 16

For those who are wondering: no, I did not just make this model up. I picked it up working with Microsoft Project. When you save a project in the database, the task list is stored according to such a datamodel. I've known that structure for quite some time, but only now I see the resemblance to the nested set, in that this too relies on linearization of the depth first traversal of the hierarchy.

3 comments:

rpbouman said...: Hi Jayman,

thanks for posting.

First of all, let me stress that the 'ordered level' technique is, like the nested set model, based on a depth-first traversal of the tree.

Second, the scope column in the example is actually something that has nothing to do with this particular model. Think of it: you can always add an extra column to any table to compartimentalize the data inside it.

I do not really agree with:

"..like most other techniques (except maybe nested interval)..to insert a new node in the middle of a given tree...many nodes in the tree may have to be recomputed, correct?"

(I'm assuming you mean "nested set" instead of "nested interval")

There are basically three (or four) ways to store trees in a database:

1) adjacency list
2a) 'ordered level'
2b) nested set
3) 'materialized path'

I've arranged this list according to the amount of overhaul (in descending order) involved when restructuring the tree.

I think I can make a good case to prove that the less effort is involved to change the structure of the tree, the more effort it is to query the tree.

At least it holds for this list, and the reason for that must be sought in the amount of redundancy contained within the model. Now don't start thinking about redundancy the way one normally does concerning data modelling issues. I mean redundancy as a informal term here.

Look, the adjacency list is the least redundant in the sense that no more than the bare minimum of information that is needed to relate one particular node to another node is used. Each node "knows" only something about it's parent node - all other information must be derived from this data. A parent node has no way of knowing what it's children are - it must do a query first.

For this reason, the adjaceny list is very good in restructuring the tree. Giving a particular node another parent node instantaneously relocates the descandant nodes because each individual node contains no information whatsoever concerning the lineage (that is, the rest of the structure), but for the single pointer to the parent.

For the 'ordered level' model, the nodes are chained together by position, that is, the depth first order. Each node "knows" about the previous and next node directly via that position. This means that at the very least, restructuring of the tree implies updating the chain downstream of the insert or delete.

The nested set is similar: each node know about both the ancestors as well as the dscendants.

In the materialized path model, each node know everything about all other nodes ancestors. Censequently, this is the easiest to query: to get to all the descendants, simply look for all the nodes that share the same prefix, and all ancestors are implied.

In the mean time, I actully worked out the hybrid solution. My comment about how difficult it would be to implement turn out to be bogus.

I did not think of it then, but the solution is indeed to have two separate tables.
I'm now using a 'public' table - people can insert,update and delete against it. This table needs to have triggers for all these actions. The triggers manipulate a second 'private' table and that stores the nested set representation. So, there's a one to one relationship between records in the adjacency list and nested set models.

In MySQL, it is maybe not impossible to maintain it in one table. However, I can live with two tables. As it turns out, it is actually convenient as I dont want people to be able to update the nested set model directly anyway. Having a separate table makes it easy to handle privileges.
Also, queries become easier to read because usually, the query you're writing almost never serves a hybrid purpose.

I found the hybrid solution to be very useful, and I can recommed it to anyone. I would probably use it in other database too, regardless of what cool construct they might have to handle hierarchies.

hope this helps,

Roland; Wednesday, July 12, 2006 at 12:26:00 AM GMT+2
rpbouman said...: Hi Jayman,

Ok, now I get it...nested interval is a distinct model, and it is dicussed in the links you provided, right? Thanks for that, I will go and get educated on that.

Listen, I'd like to blog up my code and the idea's behind the nested set/adjacency list, but I'm really swamped. However if you need a solution right now, I am more than willing to let you have it. Just send me an mail. Youll find the address with my blogger profile data.

thanks for posting comments and the nested interval links,

Roland; Wednesday, July 12, 2006 at 8:30:00 AM GMT+2
Anonymous said...: Hi, Roland.
I've been reading about methods to represent hierarchies in SQL and suddenly I came up with an idea.
My idea is a hybrid between "Adjacency list" and "Relative path" methods. To represent the tree I use 2 tables - first table (let`s call it ENTITY) contains column PARENT_ID and as many other columns as you wish. PARENT_ID is a foreign key into ENTITY itself (this is a classic adjacency list model, or also called "parent-child" relationship - I prefer that name).
The second table (let`s call it HELPER) contains 3 columns: CHILD_ID, ANCESTOR_ID, DISTANCE
CHILD_ID is a foreign key into ENTITY; ANCESTOR_ID is also a foreign key into ENTITY, such that CHILD_ID is DISTANCE levels underneath the ANCESTOR_ID; DISTANCE can be between 1 and the number of levels in tree. CHILD_ID and ANCESTOR_ID form a unique key inside table HELPER.
I have no source code for tree manipulation yet (I'm writing this post right now as I though of the idea) but it will be ready in a few days.
I'm willing to hear from you about my idea, to further continue conversation and study it. I will be very happy if this becomes usefull for other people.
My email is:
ivo underscore gelov at gmx dot net,
or my ICQ is 72406680

Best wishes,
IVO GELOV; Monday, September 15, 2008 at 12:48:00 PM GMT+2

Friday, August 26, 2005

Yet another way to model hierarchies

3 comments:

SAP HANA Trick: DISTINCT STRING_AGG