Friday, August 26, 2005

Yet another way to model hierarchies

I'm seriously considering using a nested set model, but I'm still not too happy about all the overhead involved.

I reread the article by Mike Hillyer, especially the bit that handles inserting new nodes. I like the function I wrote because it uses a single insert statement to handle both the rearrangement of the tree and the insertion of the new node. This also relieves me of the task of explicitly locking tables.
On the other hand, the wee lines Mike uses to perform an insert seem so elegant that I can't help thinking I'm doing something horribly wrong. But then again, in Mike's article, the user that's inserting the nodes somehow magically knows where to make room in the tree. The whole point of using a hybrid model like the one I want is to allow for easy inserts (and updates for that matter, although I'd still have to implement those) as well as straightforward, efficient extraction. What I really want is to encapsulate all the tree maintenance within triggers, but I already know that this approach will not be easy to realize in MySQL.
Think of it: I would want to build triggers that handle insert, update, and delete, but in order to insert, update, or delete, I would have to rearrange the tree and run some updates as a side effect. This would of course fire my update trigger, which would have to be smart enough to prevent re-entry. The fact that MySQL only supports row triggers makes it all very cumbersome.
In Oracle or MSSQL I'd probably solve this by writing a view with INSTEAD OF triggers. These triggers would perform the necessary actions on the base table, and everything would be hunky dory.
The table would not be granted to anyone, but the view would. The view wouldn't even expose the left and right attributes upon which the nested set model relies. Anyway, I can't do this in MySQL, not now anyway, so I'm still willing to use other solutions.
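
To make that concrete, here's a rough sketch of what I mean, in Oracle syntax; the view name, trigger name, and the maintenance procedure are made up for illustration:

-- hide the nested set columns behind a view
create or replace view v_tree as
select id
,      parent_id
,      position
,      label
from   node;

-- let an INSTEAD OF trigger take care of the interval bookkeeping
create or replace trigger tio_v_tree_ins
instead of insert on v_tree
for each row
begin
  -- a hypothetical procedure that rearranges the tree and inserts the node
  p_ins_node(:new.label, :new.parent_id, :new.position);
end;
/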

Now, I have not yet investigated it thoroughly, but I think that, besides the hybrid solution, there's another way to handle hierarchies that is about as intuitive as the adjacency list model, and yet nearly as manageable as the nested set model when it comes to extracting data from it. I don't know if there's an official name for it, but I will call it the 'ordered level' model until someone prompts me for something better than that.


create table ordered_level (
  id        int unsigned not null auto_increment
, scope     int unsigned not null
, position  int unsigned not null
, level     int unsigned not null
, label     varchar(32)  not null
, constraint pk_ordered_level
    primary key (id)
, constraint uk_ordered_level
    unique (
      scope
    , position
    )
);

insert into ordered_level (scope, position, level, label) values
 (1,1,1,'Clothing')
,(1,2,2,'Footwear')
,(1,5,2,'Suits')
,(1,3,3,'Shoes')
,(1,4,3,'Socks')
,(2,1,1,'Food')
,(2,2,2,'Beverages')
,(2,5,2,'Pastry')
,(2,3,3,'Alcoholic')
,(2,4,3,'Non-Alcoholic')
;


Like the nested set model, the ordered level model may be seen as a serialization of the depth-first traversal of the tree underlying the hierarchy. The difference from the nested set as described by Joe Celko and Mike Hillyer is that instead of maintaining a left_node and a right_node property, we now remember only the position and level properties.

The position is quite like left_node in the nested set in that it stores a depth-first traversal path through the tree. The level is needed to add depth to the tree. Together, level and position are sufficient to maintain the hierarchy.

The scope property is there to allow for multiple trees. So, the position reflects a depth-first traversal within the tree identified by scope. The advantage over my previous attempt is that it's now a lot easier to keep trees separate. This is reflected in updates: instead of a single insert potentially updating the entire forest, the worst that can happen here is that an entire tree needs to be updated.

Querying is not very hard compared to the original nested set, and some queries are actually easier:


/**
* printing an entire tree
*
*/

select scope
, id
, position
, level
, concat(
repeat(' ',4*(level-1))
, label
) label
from ordered_level ol
order by scope
, position
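
With the sample data above (and ids assigned in insert order by auto_increment), the output should come out roughly like this:

+-------+----+----------+-------+-----------------------+
| scope | id | position | level | label                 |
+-------+----+----------+-------+-----------------------+
|     1 |  1 |        1 |     1 | Clothing              |
|     1 |  2 |        2 |     2 |     Footwear          |
|     1 |  4 |        3 |     3 |         Shoes         |
|     1 |  5 |        4 |     3 |         Socks         |
|     1 |  3 |        5 |     2 |     Suits             |
|     2 |  6 |        1 |     1 | Food                  |
|     2 |  7 |        2 |     2 |     Beverages         |
|     2 |  9 |        3 |     3 |         Alcoholic     |
|     2 | 10 |        4 |     3 |         Non-Alcoholic |
|     2 |  8 |        5 |     2 |     Pastry            |
+-------+----+----------+-------+-----------------------+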


/*
* printing a tree starting with an arbitrary node
*
*/

select r.scope scope
, d.position position
, r.id rid
, r.level rlevel
, r.label rlabel
, d.id did
, d.level dlevel
, concat(
repeat(' '
, (d.level-1)*4)
, d.label
) label
from ordered_level r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where d.position < (
select coalesce(min(position),d.position+1)
from ordered_level n
where n.scope = r.scope
and n.level = r.level
and n.position > r.position
)
order by r.scope
, r.level
, r.position
, d.position


For the latter, there's a variant that does not need a correlated subquery:


select r.scope scope
, d.position position
, r.id rid
, r.level rlevel
, r.label rlabel
, d.id did
, d.level dlevel
, concat(
repeat(' '
, (d.level-1)*4)
, d.label
) dlabel
from (select r.id id
, r.scope scope
, r.position position
, r.level level
, r.label label
, min(n.position) next_position
from ordered_level r
left join ordered_level n
on r.scope = n.scope
and r.level = n.level
and r.position < n.position
group by r.id
, r.scope
, r.position
, r.level
, r.label
) as r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where (r.next_position > d.position
or r.next_position is null)
order by scope
, r.level
, r.position
, position


Getting all the leaves is harder than in the nested set, though:


/**
* Getting all the leaves
*
*/

select *
from ordered_level ol
where not exists (
select null
from ordered_level c
where c.scope = ol.scope
and c.level > ol.level
and c.position = (
select min(n.position)
from ordered_level n
where n.scope = ol.scope
and n.position > ol.position
)
)


Getting all the roots is a similar query:


/**
* Getting all the roots
*
*/

select *
from ordered_level ol
where not exists (
select null
from ordered_level p
where p.scope = ol.scope
and p.level < ol.level
)



When we don't want to make the structure redundant by combining it with the adjacency list model, we must resort to something like this to retrieve a parent or the children:


/**
*
* Getting the parent
*/

select c.label
, p.label
from ordered_level c
left join ordered_level p
on c.scope = p.scope
and c.position > p.position
and c.level > p.level
where p.position = (
select max(pp.position)
from ordered_level pp
where pp.scope = c.scope
and pp.position < c.position
and pp.level < c.level
)
or p.position is null


/**
*
* Getting the children
*
*/
select r.id
, r.level
, r.label
, d.id
, d.level
, d.position
, d.label
from ordered_level r
inner join ordered_level d
on r.scope = d.scope
and r.position <= d.position
and r.level <= d.level
where d.position < coalesce((
select min(n.position)
from ordered_level n
where n.scope = r.scope
and n.level = r.level
and n.position > r.position
),d.position+1)
and d.level = r.level + 1
and r.id = 18


Getting all ancestors is similar:


/*
*
* getting the ancestors
*
*/
select a.id
, a.position
, a.level
, concat(
repeat(' ',4*(a.level-1))
, a.label
) label
from ordered_level c
inner join ordered_level a
on c.scope = a.scope
and c.position >= a.position
and c.level >= a.level
where a.position = (
select max(aa.position)
from ordered_level aa
where aa.scope = c.scope
and aa.level = a.level
and aa.position <= c.position
group by aa.level
)
and c.id = 16


For those who are wondering: no, I did not just make this model up. I picked it up working with Microsoft Project. When you save a project to the database, the task list is stored according to such a data model. I've known that structure for quite some time, but only now do I see the resemblance to the nested set, in that this too relies on a linearization of the depth-first traversal of the hierarchy.

Thursday, August 25, 2005

Hierarchies...Again

Yep, handling of hierarchical data in a relational database is a popular subject lately. I never heard of the nested set model until I stumbled upon it on the MySQL Newbie forum.
Now, it seems as if it's popping up everywhere. MySQL even republished the article pointed to in that thread. And it's being read, too, at least by Clive Galway, who's using it to manage Google Maps data.

Anyway, I'm seriously considering using this structure in the future too, so I decided to investigate it for a bit. And while I'm at it, I thought it would be a nice opportunity to familiarize myself with MySQL a bit.
For starters, I think I won't do away with the adjacency list altogether. What I have in mind is a hybrid of the nested set model and the adjacency list:


CREATE TABLE node (
  id         int unsigned NOT NULL auto_increment
, parent_id  int unsigned                      -- AL
, position   int unsigned NOT NULL default 0   -- AL
, left_node  int unsigned NOT NULL             -- NS
, right_node int unsigned NOT NULL             -- NS
, label      varchar(32)
, PRIMARY KEY(id)
, CONSTRAINT fk_node
    FOREIGN KEY (parent_id)
    REFERENCES node (id)
)


I don't know if there's any classical relational terminology that captures it, but we definitely have some form of redundancy: we can construct a hierarchy using either the parent_id and id columns or the left_node and right_node columns, and if we do so, these hierarchies must not differ. So we would most certainly have to control that redundancy.
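
Controlling it would probably boil down to periodically checking that the two representations still agree. A minimal sketch, checking just one necessary condition (a child's interval must lie inside its parent's interval, which is how I maintain the intervals):

select c.id
,      c.parent_id
from node c
inner join node p
on   p.id = c.parent_id
where c.left_node  <= p.left_node
or    c.right_node >  p.right_node

Any row returned would indicate that the adjacency list columns and the nested set columns have drifted apart.
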
So why would I want to breed this hybrid? What reason could I have? The obvious answer is that I want the best of both models. I'll try and illustrate that.

The adjacency list is a very intuitive representation of a hierarchy. Entering data could not be simpler than with the adjacency list. All we need is the data that belongs to the node itself (here represented by the label column).
Optionally, we could provide the id of the parent node. We can create a root node simply by not specifying the parent node at all. (By the term "root node", I simply mean a node without a parent - nothing more, nothing less.)
If we want to, we can supply a value for the position column, which governs what position the new node will occupy among its sibling nodes. (The sibling nodes are the nodes that are also direct children of the parent node, or - in case we're creating a root node - the other root nodes.)
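
Just to illustrate how little is needed, here's what entering a small tree would look like against a pure adjacency list table (a hypothetical node_al table, for illustration only):

create table node_al (
  id        int unsigned not null auto_increment primary key
, parent_id int unsigned              -- null means: this is a root node
, position  int unsigned not null default 0
, label     varchar(32)
);

-- a root node: just the label
insert into node_al (label) values ('Clothing');

-- a child node: the label plus the parent's id (assuming the root got id 1)
insert into node_al (parent_id, label) values (1, 'Footwear');

-- another child, at an explicit position among the children of node 1
insert into node_al (parent_id, position, label) values (1, 2, 'Suits');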

We really want to take advantage of this property of the adjacency list model.

Now, everyone who's ever made use of such a model will be quite aware of the downside of this type of model: the inability to extract data from it for an arbitrary branch and an arbitrary depth.
A lot of people mention that this is because of the recursive nature of this type of structure, but I disagree. In my opinion, the underlying cause of the difficulties with this model lies in the fact that you always need some kind of multi-pass algorithm.
Say you want to find all descendants from a given node down: you'd need to repeatedly scan through all entries in the table, looking for nodes whose parent_id is the current id. Once you have those nodes, you have to repeat the process for those nodes again, until you find that there are no more nodes with a matching parent id.
Although a recursive algorithm forms a very natural solution for this type of multi-pass job, there's nothing that prevents you from using iteration to achieve the same thing. In an iterative approach, one would have a loop that records the information that's used in the subsequent passes of the loop, again and again, until there's no more information registered, after which the loop ends.
Anyway, be it recursive or iterative, we need multiple passes to scan our set, and this is something that's not supported, at least not in pure SQL (save a construct like CONNECT BY).
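
Just to sketch what such an iterative approach could look like in MySQL 5.0 stored procedure syntax (the procedure name and the temporary tables are made up for illustration; the node table is the one defined above):

DELIMITER $$

DROP PROCEDURE IF EXISTS p_descendants $$
CREATE PROCEDURE p_descendants(p_id INT UNSIGNED)
BEGIN
  DECLARE v_found INT DEFAULT 1;
  DECLARE v_level INT UNSIGNED DEFAULT 0;

  DROP TEMPORARY TABLE IF EXISTS frontier;
  DROP TEMPORARY TABLE IF EXISTS next_level;
  DROP TEMPORARY TABLE IF EXISTS result;
  CREATE TEMPORARY TABLE frontier   (id INT UNSIGNED PRIMARY KEY);
  CREATE TEMPORARY TABLE next_level (id INT UNSIGNED PRIMARY KEY);
  CREATE TEMPORARY TABLE result     (id INT UNSIGNED PRIMARY KEY, lvl INT UNSIGNED NOT NULL);

  INSERT INTO frontier VALUES (p_id);
  INSERT INTO result   VALUES (p_id, 0);

  -- one pass per level: keep going until a pass yields no new rows
  WHILE v_found > 0 DO
    SET v_level = v_level + 1;

    DELETE FROM next_level;
    INSERT INTO next_level
    SELECT n.id
    FROM   node n
    INNER JOIN frontier f
    ON     n.parent_id = f.id;

    SELECT COUNT(*) INTO v_found FROM next_level;

    INSERT INTO result SELECT id, v_level FROM next_level;

    -- the children found in this pass become the frontier of the next pass
    DELETE FROM frontier;
    INSERT INTO frontier SELECT id FROM next_level;
  END WHILE;

  SELECT r.lvl, n.*
  FROM   result r
  INNER JOIN node n
  ON     r.id = n.id
  ORDER BY r.lvl;
END $$

DELIMITER ;

It works, but note how each level of the tree costs another pass over the node table - exactly the multi-pass behaviour described above.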

The nested set model is just beautiful when it comes to extracting data. Because the hierarchy is expressed using numerical intervals, a simple, single-pass algorithm is sufficient to extract all the data from a single node down or from a single node up, simply by comparing numerical intervals. We only need the interval of the one node that starts the process, and compare that with the intervals of all the other nodes; a single comparison is enough to assert that a node is related or unrelated to that node. This is all very different from the adjacency list, where the information that we need to get the next level of nodes only becomes available at the latest possible moment.
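
For example, with the left_node and right_node columns in place, all descendants of one node come out of a single join (a sketch; the id is picked arbitrarily):

select d.*
from node r
inner join node d
on  d.left_node >= r.left_node
and d.left_node <= r.right_node
where r.id = 1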

We really want to take advantage of this property of the nested set model. I already have a view that's good for this:


CREATE VIEW tree.tree_node
AS
select rn.id AS root_id
, rn.parent_id AS root_parent_id
, rn.position AS root_position
, rn.label AS root_label
, dn.left_node AS descendant_order
, dn.id AS descendant_id
, dn.parent_id AS descendant_parent_id
, dn.position AS descendant_position
, dn.label AS descendant_label
, count(an.id) AS descendant_depth
from tree.node rn
inner join tree.node dn
on rn.left_node <= dn.left_node
and rn.right_node >= dn.left_node
inner join tree.node an
on dn.left_node >= an.left_node
and dn.left_node <= an.right_node
and an.left_node >= rn.left_node
and an.left_node <= rn.right_node
group by rn.id
, rn.parent_id
, rn.position
, rn.label
, dn.left_node
, dn.id
, dn.parent_id
, dn.position
, dn.label
;


The downside of the nested set model is of course the construction and maintenance of these intervals. When we want to add a node to our hierarchy, we need to specify an interval for our new node.
This interval needs to fit in the interval of the parent node, but it must also fit in between the intervals of the previous and next sibling of our new node. When the interval of the parent node isn't wide enough, we need to stretch it to make room for our new node. The intervals of the siblings might need to be shifted to make room in between. Another real bummer: all this shifting and widening impacts the nodes that are directly related to the sibling nodes and the parent node in a similar way.
In a worst case scenario, the insertion of a single node in the leftmost, deepest branch of the tree will cause the intervals of *ALL* the other nodes to be adjusted. Just one little mistake, and we'll be in a lot of trouble reconstructing the original hierarchy. In short: not the kind of stuff I trust the applications to handle well. If we are going to have all this interval maintenance going on, we're going to have it in one place, and that's the database. This is also the most convenient place, since we potentially need to access all nodes to perform one single action.

So far, I've come up with this function to create new nodes:


DELIMITER $$

DROP FUNCTION IF EXISTS `tree`.`f_ins_node`$$
CREATE FUNCTION `tree`.`f_ins_node`(
p_label varchar(32)
, p_parent_id int unsigned
, p_position int unsigned
) RETURNS int(10) unsigned
BEGIN
insert
into node
select id
, parent_id
, position
, left_node
, right_node
, label
from (
--
-- parent node + ancestors
--
select o.id id
, o.parent_id parent_id
, o.position
+ if(
o.position = p_position
and o.parent_id = p.id
, 1
, 0
) position
, o.left_node left_node
, o.right_node + 1 right_node
, o.label label
from node p
inner join node o
on p.left_node >= o.left_node
and p.left_node <= o.right_node
where p.id = p_parent_id
--
-- right siblings
--
union all
select o.id
, o.parent_id
, o.position + 1
, o.left_node + 1
, o.right_node + 1
, o.label
from node o
where o.parent_id = p_parent_id
and o.position >= p_position
--
-- right unrelated branches 1
--
union all
select o.id
, o.parent_id
, o.position
, o.left_node + 1
, o.right_node + 1
, o.label
from node p
inner join node o
on p.right_node < o.left_node
where p.id = p_parent_id
--
-- right unrelated branches 2
--
union all
select d.id
, d.parent_id
, d.position
+ if(
d.id= o.id
and o.position >= p_position
, 1
, 0
) position
, d.left_node + 1
, d.right_node + 1
, d.label
from node o
inner join node d
on o.left_node <= d.left_node
and o.right_node >= d.left_node
where p_parent_id is null
and o.parent_id is null
and o.position >= p_position
--
-- self 1:
-- root
--
union all
select null
, null
, coalesce(p_position,p.position + 1)
, max(p.right_node) + 1
, max(p.right_node) + 1
, p_label
from node p
where p_parent_id is null
and p.parent_id is null
and p.position <= coalesce(p_position,p.position)
group by p.id
--
-- self 2:
-- non-root (child of p_parent_id)
--
union all
select null
, p.id
, coalesce(
p_position
, max(c.position)+1
, 0
) as position
, coalesce(
if(count(c.id)=0,p.left_node+1,null) -- no siblings
, if(p_position is null,max(c.right_node)+1,null) -- after last sibling
, max(if(c.position = p_position,c.left_node,null)) -- at position
, max(if(c.position < p_position,c.right_node+1,null)) -- max before position
, p.left_node + 1 -- none before
) as left_node
, coalesce(
if(count(c.id)=0,p.left_node+1,null) -- no siblings
, if(p_position is null,max(c.right_node)+1,null) -- after last sibling
, max(if(c.position = p_position,c.left_node,null)) -- at position
, max(if(c.position < p_position,c.right_node+1,null)) -- max before position
, p.left_node + 1 -- none before
) as right_node
, p_label
from node p
left join node c
on p.id = c.parent_id
where p.id = p_parent_id
group by p.id
) as n
on duplicate key
update
position = n.position
, left_node = n.left_node
, right_node = n.right_node
;
return last_insert_id()
;
END$$

DELIMITER ;


My interval maintenance is a bit different from what I found in the examples on the web. In particular, my leaf nodes have the property that

left_node = right_node

and not

left_node = right_node - 1

Another difference is that my hierarchy can have multiple roots. I really want to be able to do this, for example to model organization structures. It really should be possible to have a couple of mother organization trees that are not related to each other. I decided not to provide a default dummy root-of-all-roots node, simply because I want my data to express that we can have a couple of hierarchies that are completely independent of each other. Because this is still an adjacency list model, I can still identify a root easily by requiring that

parent_id is null
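
Both conventions keep the corresponding queries trivial; two minimal sketches against the node table:

-- leaf nodes: the interval has collapsed into a single point
select id, label
from node
where left_node = right_node;

-- root nodes: no parent at all
select id, label
from node
where parent_id is null;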

I'm still working on a procedure to delete nodes (nearly finished) and a procedure to move a node from one branch to another.
I'd also like to program a couple of commands to navigate through a hierarchy on the basis of a current node. Anyway, when I finish it all, I'll post it to the MySQL forum and see what people think of this hybrid approach.

Monday, August 22, 2005

SQL Standards Download...It's useful, but is it legal?

As much as a couple of years ago, I stumbled upon this link:

http://www.ncb.ernet.in/education/modules/dbms/sql99index.html

It pops up on page 14 when googling for "ISO 9075".
I have no means of checking whether it's genuine, but everything suggests it is.

Funny thing is, I always thought that these cost money, even though this version of the standard is outdated.

Saturday, August 20, 2005

MySQL: Create Dirty Little Tricker FOR SELECT ON view

Wow, I never thought it would work, but it sure seems to be the case. I stumbled over a post at the MySQL Triggers forum by Greg Chandler.

Greg would've liked to increment a number that's stored in the database. That's not an uncommon thing to want, but what makes Greg's question interesting, in my opinion, is that he would like this to happen in the event that a user only reads the database field that stores the number. Now, I don't want to go into details as to why he needs it, but feel free to follow the link to find out all about it.

Suppose you'd want to implement this, what you'd need is something like:


create trigger {trigger-name}
for select on {table-name}[(
{columnlist}
)]
for each row
begin
..increase that number...
end;


Of course, I made this syntax up to illustrate the resemblance to an ordinary triggered action. Like a stored procedure, a trigger is associated with a sequence of statements that's persistently stored in the database. This sequence can be executed repeatedly afterwards. Unlike a stored procedure, a trigger can't be called or evaluated directly at the user's request. Rather, a trigger is fired in response to some event occurring in the database.

The classical trigger events are associated with a database base table. These events correspond directly to the occurrence of one of the fundamental operations that can be performed to manipulate the contents of a table: INSERT, UPDATE and DELETE.
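
For example, a classical row trigger in MySQL 5.0 looks like this (the table and its created_at column are made up for illustration):

delimiter $$

create trigger bi_orders
before insert on orders
for each row
begin
  -- a typical use: fill in an audit column on every inserted row
  set new.created_at = now();
end$$

delimiter ;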

Of course, the concept of an event that fires execution of a sequence of statements can be extended to all kinds of objects, and depending upon the type of object, you could think of appropriate events for that object. For starters, quite some database products offer similar functionality for views as well as for base tables. And Oracle even supports triggers that react to operations that alter the structure of a table, as well as events that occur to the system as a whole (system shutdown etc.), not just a particular part of the system.

What I think is interesting about Greg's original problem is that there's no RDBMS product that I know of that supports the trigger concept for the SELECT operation. I can think of at least one purpose for such functionality. Suppose we had a collection of views, for example as part of a reporting environment. Now, it would be really cool to know which views are used a lot, how many rows a query on the view generally yields, and the like.

Now, there's this dirty little trick that seems to do the job. I don't think this trick reflects intended usage of the MySQL database product, and I wouldn't be surprised if it doesn't survive the 5.0.10 version I'm currently running. But I'm so excited about it, I just want to show it off. And of course, I hope that it will be of some use to you, dear reader.

The trick is implemented by creating a function that performs the desired actions. Now, function evaluation is just a particular kind of expression. This means we can write a view that references our function, thus executing whatever statements we put in the function. Whenever the user selects from the view, our function is evaluated, and thus, the actions are performed.

Let's stick to the 'logging view usage' case. I'll be using the test database for all this, so let's make it the current database:


use test;


First, we'll create a table to log usage in:


CREATE TABLE select_log (
id int unsigned NOT NULL auto_increment PRIMARY KEY
, log_who varchar(16) NOT NULL -- the user that performs the SELECT
, log_conn int unsigned NOT NULL -- the id of the connection
, log_when timestamp NOT NULL -- the moment the operation is performed
, log_schema varchar(64) NOT NULL -- the schema in which the view resides
, log_view varchar(64) NOT NULL -- the view
)
;

Here's the function that performs the logging:

delimiter $$
create function f_log_select(
p_log_schema varchar(64)
, p_log_view varchar(64)
)
returns int unsigned
NOT DETERMINISTIC
begin
insert
into select_log(
log_who
, log_conn
, log_when
, log_schema
, log_view
) values (
current_user()
, connection_id()
, now()
, p_log_schema
, p_log_view
);
return last_insert_id();
end;
$$


So far so good. Now, we need a table that represents the actual data, for which we can write a view that 'fires' our function:


CREATE TABLE table1 (
  number    int(11)  default NULL
, clientid  int(11)  default NULL
, datebegin datetime default NULL
, dateend   datetime default NULL
, data      char(1)  default NULL
)


And we'll need some data to populate it:


+--------+----------+---------------------+---------------------+------+
| number | clientid | datebegin           | dateend             | data |
+--------+----------+---------------------+---------------------+------+
|      1 |        1 | 2005-01-01 00:00:00 | 2005-01-02 00:00:00 | A    |
|      2 |        1 | 2005-01-02 00:00:00 | 2005-01-04 00:00:00 | b    |
|      3 |        1 | 2005-01-04 00:00:00 | 2005-01-05 00:00:00 | c    |
+--------+----------+---------------------+---------------------+------+


Finally, we're arriving at the heart of the matter: the view definition. We want the view to select all the rows from table1, and we want to make sure that our f_log_select function is called in the process. This is what I came up with:


create or replace view v_table1
as
select *
from table1
where f_log_select(
'test'
, 'v_table1'
)


Now, let's select from our view, and see what happens:


mysql> select * from v_table1;
+--------+----------+---------------------+---------------------+------+
| number | clientid | datebegin           | dateend             | data |
+--------+----------+---------------------+---------------------+------+
|      1 |        1 | 2005-01-01 00:00:00 | 2005-01-02 00:00:00 | A    |
|      2 |        1 | 2005-01-02 00:00:00 | 2005-01-04 00:00:00 | b    |
|      3 |        1 | 2005-01-04 00:00:00 | 2005-01-05 00:00:00 | c    |
+--------+----------+---------------------+---------------------+------+
3 rows in set (0.00 sec)

mysql> select * from select_log;
+----+----------------+----------+---------------------+------------+----------+
| id | log_who        | log_conn | log_when            | log_schema | log_view |
+----+----------------+----------+---------------------+------------+----------+
|  1 | root@localhost |        1 | 2005-08-20 09:39:03 | test       | v_table1 |
+----+----------------+----------+---------------------+------------+----------+
1 row in set (0.00 sec)


Wow! It works...or does it? I was a bit puzzled at first why there's only one row in the log table, whereas there are three rows in the view. I'd expected to see three rows in the log table as well. Giving it a bit more thought, I must conclude that the results are in fact correct. The f_log_select function has about the same status as a call to a built-in function like NOW(). So, whenever a select is issued against the view, f_log_select is evaluated exactly once. The evaluation is totally independent of the number of rows selected from the view:


mysql> select * from v_table1 where number = 0;
Empty set (0.01 sec)

mysql> select * from select_log;
+----+----------------+----------+---------------------+------------+----------+
| id | log_who        | log_conn | log_when            | log_schema | log_view |
+----+----------------+----------+---------------------+------------+----------+
|  1 | root@localhost |        1 | 2005-08-20 09:39:03 | test       | v_table1 |
|  2 | root@localhost |        1 | 2005-08-20 09:47:26 | test       | v_table1 |
+----+----------------+----------+---------------------+------------+----------+
2 rows in set (0.01 sec)


Having thought about this, I thought it would be really cool to log both the occurrence of a select statement and the event of an individual row being selected. So, I extended my log table a bit:


alter table select_log
add log_type enum('ROW','STATEMENT')
not null
default 'STATEMENT'
;


(I used 'STATEMENT' as the default because I'm too lazy to rewrite f_log_select at this stage.)

OK, so what would we do to trick MySQL into evaluating a function for each row selected by the view? Actually, it's not that hard. We must pass a value from our record to the function, so that MySQL is forced to re-evaluate it for each row. So, in addition to f_log_select, I wrote this:


DELIMITER $$

DROP FUNCTION IF EXISTS `test`.`f_log_select_row`$$
CREATE FUNCTION `test`.`f_log_select_row`(
p_log_schema varchar(64)
, p_log_view varchar(64)
, p_expr varchar(1) -- not used inside the function: only here to force evaluation once per row
) RETURNS int(10) unsigned
begin
insert
into select_log(
log_who
, log_conn
, log_when
, log_schema
, log_view
, log_type
) values (
current_user()
, connection_id()
, now()
, p_log_schema
, p_log_view
, 'ROW'
);
return last_insert_id();
end$$

DELIMITER ;


And, I rewrote the view:


create or replace view v_table1
as
select *
from table1
where f_log_select(
'test'
, 'v_table1'
)
and f_log_select_row(
'test'
, 'v_table1'
, data
)


When we clear the log, select from the view and then see what's in the log we get this:


+----+----------------+----------+---------------------+------------+----------+-----------+
| id | log_who        | log_conn | log_when            | log_schema | log_view | log_type  |
+----+----------------+----------+---------------------+------------+----------+-----------+
|  3 | root@localhost |        1 | 2005-08-20 10:14:19 | test       | v_table1 | STATEMENT |
|  4 | root@localhost |        1 | 2005-08-20 10:14:19 | test       | v_table1 | ROW       |
|  5 | root@localhost |        1 | 2005-08-20 10:14:19 | test       | v_table1 | ROW       |
|  6 | root@localhost |        1 | 2005-08-20 10:14:19 | test       | v_table1 | ROW       |
+----+----------------+----------+---------------------+------------+----------+-----------+


So, it definitely seems to work.

Well, I already said that I regard this as a rather dirty little trick, but I had fun devising it, and it taught me a bit about function evaluation within views.

Maybe this isn't just a trick after all, and maybe it's even useful for a real purpose. If you find such a purpose, go ahead and try whether it works for you. I'll be most interested to hear from anyone who tries.

Friday, August 19, 2005

Slight CONNECT BY problem....SOLVED?

It's been a while since I've been to my employer's HQ. Usually I'm stationed at my client's office, but today a co-worker and I demoed a portal solution to my client.

Anyway, the presentation went fine, and I spent the rest of the afternoon telling my colleagues about a nifty little trick I picked up at the MySQL Forum. It's not really a trick, actually; it's a very clever data model to store hierarchical data in a relational database.

Usually, this is done using the adjacency list model, but because that model relies on a recursive relationship, it's a hell of a job to extract all the data relating to an entire branch of the hierarchy. The trick I picked up is the nested set model. You can read all about that over here:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Now, my coworker has developed a project database that uses a table to store the work breakdown structure; a typical hierarchical structure. It runs on Oracle.

Here's an example resembling his table:


create table PRODUCT(
  ID        NUMBER       NOT NULL
, NAME      VARCHAR2(30) NOT NULL
, POSITION  NUMBER       NOT NULL
, PARENT_ID NUMBER
, CONSTRAINT PK_PRODUCT
    PRIMARY KEY(ID)
, CONSTRAINT FK_PRODUCT
    FOREIGN KEY(PARENT_ID)
    REFERENCES PRODUCT(ID)
)

(Sorry about the caps, but I extracted it like this from the data dictionary.)

And here's some sample data:


        ID NAME               POSITION  PARENT_ID
---------- ---------------- ---------- ----------
         1 project                   1
         2 hardware                  1          1
        20 windows bak               1          2
        21 linux bak                 2          2
         3 software                  2          1
        30 schermpje                 1          3
        31 driver                    2          3
         4 migratie                  9          1
        41 extractie                 1          4
        42 laden                     2          4
        26 documentatie              4          1
         5 systeem                   1         26
       128 gebruikers                2         26
         6 html                      1        128


So, what we're seeing here is a typical adjacency list implementation, with a recursive relationship defined on the parent_id column. This column optionally references the id column of the table. Now, there's one extra feature in this table: the position column. The purpose of the position column is to define the order in which sibling rows should be sorted.

Now, part of his application wants to select an entire tree to depict the complete work breakdown structure. Oracle has some built-in features that relieve some of the burdens you encounter when you're working with the adjacency list model. In particular, Oracle supports the CONNECT BY language construct, which tells the database to perform a recursive query:


select p.*
, level as inner_level
from product p
connect by prior p.id = p.parent_id
start with id = 1


It looks a bit like a JOIN, but what's really odd is that there's only one instance of the product table. So, CONNECT BY literally joins a table to itself - not to another instance of itself (a SELF or AUTO JOIN). Related constructs are the PRIOR operator and the START WITH clause.

The construct is best explained by assuming that execution starts with the row from the product table for which id = 1 is true. Then, all other rows are scanned, looking for those rows where the parent_id matches the id of the original row we started out with. Once we're done with that, the level pseudocolumn is incremented by 1, and the cycle is started anew, this time starting out with the ids of the rows we gained this round. A process of cycles is executed until there are no more rows to match. Query execution is then complete, and the result is returned.


        ID NAME                             POSITION  PARENT_ID INNER_LEVEL
---------- ------------------------------ ---------- ---------- -----------
         1 project                                 1                      1
         2 hardware                                1          1           2
        20 windows bak                             1          2           3
        21 linux bak                               2          2           3
         3 software                                2          1           2
        30 schermpje                               1          3           3
        31 driver                                  2          3           3
         4 migratie                                9          1           2
        41 extractie                               1          4           3
        42 laden                                   2          4           3
        26 documentatie                            4          1           2
         5 systeem                                 1         26           3
       128 gebruikers                              2         26           3
         6 html                                    1        128           4

The algorithm is like a breadth-first recursive algorithm. That is, each cycle retrieves only the immediate child rows, and only when these are all retrieved, the level pseudocolumn is incremented and the next cycle is started with the rows we just retrieved in that round. (Note that the rows are not returned in order of depth. It all remains a relational system, and any order or sorting must be applied explicitly.)

This recursiveness is what really distinguishes CONNECT BY from an ordinary join. It's very powerful in the sense that one need not know how deep the tree is: CONNECT BY will execute and execute until there is no new level to descend to. Unfortunately, the query will throw an exception without returning any rows when there's a circular reference in the data. As of Oracle 10g, the NOCYCLE keyword can be used to make the query tolerate such cycles.
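
A sketch of what that looks like (Oracle 10g and up; I haven't needed it for this data):

select p.*
, level as inner_level
from product p
connect by nocycle prior p.id = p.parent_id
start with id = 1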

Now, there's a slight problem with this result. It sure does yield the right rows, but what we really want is to have Oracle present the results like, well, like a tree.
The result should be something like this:


 PARENT_ID   POSITION         ID NAME
---------- ---------- ---------- --------------------
                    1          1 project
         1          1          2 hardware
         2          1         20 windows bak
         2          2         21 linux bak
         1          2          3 software
         3          1         30 schermpje
         3          2         31 driver
         1          4         26 documentatie
        26          1          5 systeem
        26          2        128 gebruikers
       128          1          6 html
         1          9          4 migratie
         4          1         41 extractie
         4          2         42 laden

Well, I was really surprised to hear that so far, no one had come up with a query variant that would return the rows from the original query in the desired order. In fact, my co-worker told me that it was IMPOSSIBLE to do it. So, what can I do but feel really challenged? I started off full of optimism, only to discover that everything I tried amounted to nothing. I kept coming back to the same theme: to order the nodes depth-first according to their parent_id and position, you need access to all of the ancestor rows in order to account for the positional information stored in their parent_id and position columns too.

Well, I still hope there's a better way, but finally, I came up with this:


select p.id
, p.parent_id
, p.position
, lpad(' ',inner_level)||p.name as name
, inner_level
, power(10,3-inner_level)
* (select sum(p1.position * power(10,level))
from product p1
connect by prior p1.parent_id = p1.id
start with p1.id = p.id
) sort
from (
select p.*
, level as inner_level
from product p
connect by prior p.id = p.parent_id
start with id = 1
) p order by 6


It yields this result:


 ID  PARENT_ID   POSITION NAME             INNER_LEVEL       SORT
--- ---------- ---------- ---------------- ----------- ----------
  1                     1  project                   1       1000
  2          1          1   hardware                 2       1100
 20          2          1    windows bak             3       1110
 21          2          2    linux bak               3       1120
  3          1          2   software                 2       1200
 30          3          1    schermpje               3       1210
 31          3          2    driver                  3       1220
 26          1          4   documentatie             2       1400
  5         26          1    systeem                 3       1410
128         26          2    gebruikers              3       1420
  6        128          1     html                   4       1421
  4          1          9   migratie                 2       1900
 41          4          1    extractie               3       1910
 42          4          2    laden                   3       1920


Now, I know the solution is flawed. For example, you can't have more than 9 children per node, and you can't have a tree deeper than 38 levels (number precision). You could tinker with it a bit to allow for 99 children per node with a maximum depth of 19, which seems reasonable for this particular purpose (a work breakdown structure).
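
For what it's worth, only the weights need to change for that; a sketch of the base-100 variant (untested, same structure otherwise):

select p.id
, p.parent_id
, p.position
, lpad(' ',inner_level)||p.name as name
, inner_level
, power(100,3-inner_level)
* (select sum(p1.position * power(100,level))
from product p1
connect by prior p1.parent_id = p1.id
start with p1.id = p.id
) sort
from (
select p.*
, level as inner_level
from product p
connect by prior p.id = p.parent_id
start with id = 1
) p order by 6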

However, the more I'm familiarizing myself with the nested set model, the more I feel that it is a much, much better solution for handling hierarchy within a relational database.

Thursday, August 11, 2005

Blog cherry popped

Yep, I guess this is it....I never imagined it would be like this. Wow...My First Post on My First Blog.

I just wanted to post a wee comment in response to a tiny IE issue, and before you know it, I got my own blog. Now, let's hope I'm more faithful to you than I am to my http://www.xcdsql.org/ homepage, dear Blog...
