Friday, August 26, 2005

Yet another way to model hierarchies

I'm seriously considering to use a nested set model, but I'm still not too happy about all the overhead involved.

I reread the article by Mike Hillyer, especially the bit that handles inserting new nodes. I like the function I wrote because it uses a single insert statement to handle both the rearrangement of the tree as well as the insertion of the new node. This also relieves me from the task of explicitly locking tables.
On the other hand, the wee lines Mike uses to perform an insert seem so elegant, that I can't help thinking I'm doing something horribly wrong. But then again, in Mike's article, the user that's inserting the nodes somehow magically knows where to make room in the tree. The whole point of using a hybrid model like I want to, is to allow for easy inserts (and updates for that matter, although I'd still have to implement it) as well as straighforward, efficient extraction. On the other hand, what I really want is to encapsulate all the tree maintenance within triggers, but I know already that this approach will not be easy to realize in MySQL.
Think of it, I would want to build triggers that handle insert,update and delete, but in order to insert or to update or to delete, I would have to rearrange the tree and run some updates as a side effect. This would of course fire my update trigger, which would have to be so smart as to prevent reentry. The fact that mysql only supports row triggers makes it all very cumbersome.
In Oracle or MSSQL I'd probably solve this by writing a view with INSTEAD OF triggers. These triggers would perform the necesary actions in the base table, and everything would be hunky dory.
The table would not be granted to anyone, but the view would. The view wouldn't even expose the left and right attributes upon which the nested set model relies. Anyway, I can't do this in MySQL, not now anyway, so I'm still willing to use other soltions.

Now, I have not yet investigated it thoroughly, but I think that, besides the hybrid solution, there's another way to handle hierarchies that is about as intuitive as the adjacency list model, and yet nearly as manageable as the nested set model when it comes to extracting data from it. I don't know if there's an official name for it, but I will call it the 'ordered level' model until someone prompts me for something better than that.


create table ordered_level(
id int unsigned not null auto_increment
, scope int unsigned not null
, position int unsigned not null
, level int unsigned not null
, label varchar(32) not null
, constraint pk_ordered_level
primary key (id)
, constraint uk_ordered_level
unique (
scope
, position
)
);

insert into ordered_level(scope,position,level,label)values
(1,1,1'Clothing')
,(1,2,2'Footwear')
,(1,5,2'Suits')
,(1,3,3'Shoes')
,(1,4,3'Socks')
,(2,1,1'Food')
,(2,2,2'Beverages')
,(2,5,2'Pastry')
,(2,3,3'Alcoholic')
,(2,4,3'Non-Alcoholic')


Like the nested set model, the ordered level model may be seen as a serialization of the depth first traversal of the tree underlying the hierarchy. The difference with the nested set as described by Joe Celko and Mike Hillyer is that instead of maintaining a left_node and a right_node property, we now rember only the position and level properties.

The position is quite like left_node in the nested set in that it stores a depth first traversal path through the tree. The level is needed to add depth to the tree. Together, level and position are sufficient to maintain the hierarchy.

The scope property is there to allow for multiple trees. So, the position reflects a depth first traversal within the tree identified by scope. The advantadge to me previous attemt is that it's now a lot easier to keep trees separate. This is reflected in updates: instead of having a single insert potetially update the entire forest, the worst what can happen here is only an entire tree needs to be updated.

Querying is not very hard compared to the original nested set, and some queries are actually easier:


/**
* printing an entire tree
*
*/

select scope
, id
, position
, level
, concat(
repeat(' ',4*(level-1))
, label
) label
from ordered_level ol
order by scope
, position


/*
* printing a tree starting with an arbitrary node
*
*/

select r.scope scope
, d.position position
, r.id rid
, r.level rlevel
, r.label rlabel
, d.id did
, d.level dlevel
, concat(
repeat(' '
, (d.level-1)*4)
, d.label
) label
from ordered_level r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where d.position < (
select coalesce(min(position),d.position+1)
from ordered_level n
where n.scope = r.scope
and n.level = r.level
and n.position > r.position
)
order by r.scope
, r.level
, r.position
, d.position


For the latter ,there's a variant that does not need a correlated subquery:


select r.scope scope
, d.position position
, r.id rid
, r.level rlevel
, r.label rlabel
, d.id did
, d.level dlevel
, concat(
repeat(' '
, (d.level-1)*4)
, d.label
) dlabel
from (select r.id id
, r.scope scope
, r.position position
, r.level level
, r.label label
, min(n.position) next_position
from ordered_level r
left join ordered_level n
on r.scope = n.scope
and r.level = n.level
and r.position < n.position
group by r.id
, r.scope
, r.position
, r.level
, r.label
) as r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where (r.next_position > d.position
or r.next_position is null)
order by scope
, r.level
, r.position
, position


Getting all the leafs is harder than in the nested set though:


/**
* Getting all the leafs
*
*/

select *
from ordered_level ol
where not exists (
select null
from ordered_level p
where p.scope = ol.scope
and p.level > ol.level
)


Getting all the roots is a similar query:


/**
* Getting all the roots
*
*/

select *
from ordered_level ol
where not exists (
select null
from ordered_level p
where p.scope = ol.scope
and p.level < ol.level
)



When we don't want to make the strucure redundant by combining it with the adjacency list model, we must resort to something like this to retrieve a parent or the children:


/**
*
* Getting the parent
*/

select c.label
, p.label
from ordered_level c
left join ordered_level p
on c.scope = p.scope
and c.position > p.position
and c.level > p.level
where p.position = (
select max(pp.position)
from ordered_level pp
where pp.scope = c.scope
and pp.position < c.position
and pp.level < c.level
)
or p.position is null


/**
*
* Getting the children
*
*/
select r.id
, r.level
, r.label
, d.id
, d.level
, d.position
, d.label
from ordered_level r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where d.position < coalesce((
select min(n.position)
from ordered_level n
where n.scope = r.scope
and n.level = r.level
and n.position > r.position
),d.position+1)
and d.level = r.level + 1
and r.id = 18


Getting all ancestors is similar:


/*
*
* getting the ancestors
*
*/
select a.id
, a.position
, a.level
, concat(
repeat(' ',4*(a.level-1))
, a.label
) label
from ordered_level c
inner join ordered_level a
on c.scope = a.scope
and c.position >= a.position
and c.level >= a.level
where a.position = (
select max(aa.position)
from ordered_level aa
where aa.scope = c.scope
and aa.level = a.level
and aa.position <= c.position
group by aa.level
)
and c.id = 16


For those who are wondering: no, I did not just make this model up. I picked it up working with Microsoft Project. When you save a project in the database, the task list is stored according to such a datamodel. I've known that structure for quite some time, but only now I see the resemblance to the nested set, in that this too relies on linearization of the depth first traversal of the hierarchy.

3 comments:

rpbouman said...

Hi Jayman,

thanks for posting.

First of all, let me stress that the 'ordered level' technique is, like the nested set model, based on a depth-first traversal of the tree.

Second, the scope column in the example is actually something that has nothing to do with this particular model. Think of it: you can always add an extra column to any table to compartimentalize the data inside it.

I do not really agree with:

"..like most other techniques (except maybe nested interval)..to insert a new node in the middle of a given tree...many nodes in the tree may have to be recomputed, correct?"

(I'm assuming you mean "nested set" instead of "nested interval")

There are basically three (or four) ways to store trees in a database:

1) adjacency list
2a) 'ordered level'
2b) nested set
3) 'materialized path'

I've arranged this list according to the amount of overhaul (in descending order) involved when restructuring the tree.

I think I can make a good case to prove that the less effort is involved to change the structure of the tree, the more effort it is to query the tree.

At least it holds for this list, and the reason for that must be sought in the amount of redundancy contained within the model. Now don't start thinking about redundancy the way one normally does concerning data modelling issues. I mean redundancy as a informal term here.

Look, the adjacency list is the least redundant in the sense that no more than the bare minimum of information that is needed to relate one particular node to another node is used. Each node "knows" only something about it's parent node - all other information must be derived from this data. A parent node has no way of knowing what it's children are - it must do a query first.

For this reason, the adjaceny list is very good in restructuring the tree. Giving a particular node another parent node instantaneously relocates the descandant nodes because each individual node contains no information whatsoever concerning the lineage (that is, the rest of the structure), but for the single pointer to the parent.

For the 'ordered level' model, the nodes are chained together by position, that is, the depth first order. Each node "knows" about the previous and next node directly via that position. This means that at the very least, restructuring of the tree implies updating the chain downstream of the insert or delete.

The nested set is similar: each node know about both the ancestors as well as the dscendants.

In the materialized path model, each node know everything about all other nodes ancestors. Censequently, this is the easiest to query: to get to all the descendants, simply look for all the nodes that share the same prefix, and all ancestors are implied.

In the mean time, I actully worked out the hybrid solution. My comment about how difficult it would be to implement turn out to be bogus.

I did not think of it then, but the solution is indeed to have two separate tables.
I'm now using a 'public' table - people can insert,update and delete against it. This table needs to have triggers for all these actions. The triggers manipulate a second 'private' table and that stores the nested set representation. So, there's a one to one relationship between records in the adjacency list and nested set models.

In MySQL, it is maybe not impossible to maintain it in one table. However, I can live with two tables. As it turns out, it is actually convenient as I dont want people to be able to update the nested set model directly anyway. Having a separate table makes it easy to handle privileges.
Also, queries become easier to read because usually, the query you're writing almost never serves a hybrid purpose.

I found the hybrid solution to be very useful, and I can recommed it to anyone. I would probably use it in other database too, regardless of what cool construct they might have to handle hierarchies.

hope this helps,

Roland

rpbouman said...

Hi Jayman,

Ok, now I get it...nested interval is a distinct model, and it is dicussed in the links you provided, right? Thanks for that, I will go and get educated on that.

Listen, I'd like to blog up my code and the idea's behind the nested set/adjacency list, but I'm really swamped. However if you need a solution right now, I am more than willing to let you have it. Just send me an mail. Youll find the address with my blogger profile data.

thanks for posting comments and the nested interval links,

Roland

Anonymous said...

Hi, Roland.
I've been reading about methods to represent hierarchies in SQL and suddenly I came up with an idea.
My idea is a hybrid between "Adjacency list" and "Relative path" methods. To represent the tree I use 2 tables - first table (let`s call it ENTITY) contains column PARENT_ID and as many other columns as you wish. PARENT_ID is a foreign key into ENTITY itself (this is a classic adjacency list model, or also called "parent-child" relationship - I prefer that name).
The second table (let`s call it HELPER) contains 3 columns: CHILD_ID, ANCESTOR_ID, DISTANCE
CHILD_ID is a foreign key into ENTITY; ANCESTOR_ID is also a foreign key into ENTITY, such that CHILD_ID is DISTANCE levels underneath the ANCESTOR_ID; DISTANCE can be between 1 and the number of levels in tree. CHILD_ID and ANCESTOR_ID form a unique key inside table HELPER.
I have no source code for tree manipulation yet (I'm writing this post right now as I though of the idea) but it will be ready in a few days.
I'm willing to hear from you about my idea, to further continue conversation and study it. I will be very happy if this becomes usefull for other people.
My email is:
ivo underscore gelov at gmx dot net,
or my ICQ is 72406680

Best wishes,
IVO GELOV

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...