Friday, August 19, 2005

Slight CONNECT BY problem....SOLVED?

It's been a while since I've been to my employers HQ. Usually, I'm stationed at my client's office, but today, me and a co-worker of mine, demo-ed a portal solution to my client.

Anyway, the presenstation went fine, and I spent the rest of the afternoon telling my colleages about a nifty little trick I picked up at the MySQL Forum. It's a not really a trick, it's actually a very clever datamodel to store hierarchical data in a relational database.

Usually, this is done using the adjency list model, but because this model relies on a recursive relationship, it's a hell of a job to extract all the data relating to an entire brach of the hierarchy. The trick I picked up is the nested set model. You can read all about that over here:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Now, my coworker has developed a project database that uses a table to store the work breakdown structure; a typical hierarchical structure. It runs on Oracle.

Here's an example resembling his table:


create table PRODUCT(
ID NUMBER NOT NULL
, NAME VARCHAR2(30) NOT NULL
, POSITION NUMBER NOT NULL
, PARENT_ID NUMBER
, CONSTRAINT PK_PRODUCT
PRIMARY KEY(ID)
, CONSTRAINT FK_PRODUCT
FOREIGN KEY(PARENT_ID)
REFERENCES PRODUCT(ID)
)

(sorry 'bout the caps, but i extract it like this from the data dictionary)

And here's some sample data:


ID NAME POSITION PARENT_ID
---------- ---------------- ---------- ----------
1 project 1
2 hardware 1 1
20 windows bak 1 2
21 linux bak 2 2
3 software 2 1
30 schermpje 1 3
31 driver 2 3
4 migratie 9 1
41 extractie 1 4
42 laden 2 4
26 documentatie 4 1
5 systeem 1 26
128 gebruikers 2 26
6 html 1 128


So, what we're seeing here is a typical adjency list implementation, with a recursive relationship defined on the parent_id column. This column optionally references the id of the table. Now there's one extra feature in this table, the position column. The purpose of the position column is to define the order in which sibling rows should be sorted.

Now, part of his application wants to select an entire tree to depict the complete work breakdown structure. Oracle has some built-in features that relieves some of the burdens you encounter when you're working with the 'adjacency list model'. In particular, Oracle supports the CONNECT BY language construct which tells the database to perform a recursive query:


select p.*
, level as inner_level
from product p
connect by prior p.id = p.parent_id
start with id = 1


It looks a bit like a JOIN, but what's really odd is that there's only one instance of the product table. So, connect by literally joins a table to itself - not to another instance of itself (SELF or AUTO JOIN). Related constructs are the PRIOR operator and the START WITH clause.

The construct is best explained by assuming that execution starts with the row from the product table for wich id = 1 is true. Then, all other rows are scanned, looking for those rows where the parent_id matches the id from the original row we started out with. Once we're done with that, the level pseudocolumn is incremented by 1, and the cycle is started anew, this time staring out with the id's from the rows we gained this round. A process of cycles is executed untill there are no more rows to match. Query execution is then complete, and the result is returned.


ID NAME POSITION PARENT_ID INNER_LEVEL
---------- ------------------------------ ---------- ---------- -----------
1 project 1 1
2 hardware 1 1 2
20 windows bak 1 2 3
21 linux bak 2 2 3
3 software 2 1 2
30 schermpje 1 3 3
31 driver 2 3 3
4 migratie 9 1 2
41 extractie 1 4 3
42 laden 2 4 3
26 documentatie 4 1 2
5 systeem 1 26 3
128 gebruikers 2 26 3
6 html 1 128 4

The algoritm is like a breadth first recursive algorithm. That is, each cycle retrieves only the immediate child rows, and only when these are all retrieved, the level pseudocolumn is incremented, and the next cycle is started with the rows we just retrieved in the round. (Note that the rows are not returned in the order of depth. It all remains a relational system, and any order or sorting must be applied explicitly.)

This recursiveness is what really distinguishes CONNECT BY from an ordinary join. It's very powerful in the sense that one need not know how deep the tree is: CONNECT BY will execute and execute until there is no new level to descend to. Unfortunately, the query will throw an exception without retunring any row when there's a circulare reference in the data. As of Oracle

Now, there's a slight problem with this result. It sure does yield the right rows, but what we really want is to have Oracle present the results like, well, like a tree.
The result should be something like this:


PARENT_ID POSITION ID NAME
---------- ---------- ---------- --------------------
1 1 project
1 1 2 hardware
2 1 20 windows bak
2 2 21 linux bak
1 2 3 software
3 1 30 schermpje
3 2 31 driver
1 4 26 documentatie
26 1 5 systeem
26 2 128 gebruikers
128 1 6 html
1 9 4 migratie
4 1 41 extractie
4 2 42 laden

Well I was really surprised to hear that so far, no one had come up with a query variant that would return the rows from the original query in the desired order. In fact, my co-worker told me that it was IMPOSSIBLE to do it. So, what can I do but feel really challenged? I started of full of optimism, only to discover that everything i tried amounted to nothing. I kept coming back at the same theme: to order the nodes depth-first according to their parent_id and position, you need to access all of the ancestor rows in order to account for the positional information stored in their parent_id and position columns too.

Well, I still hope there's a better way, but finally, I came up with this:


select p.id
, p.parent_id
, p.position
, lpad(' ',inner_level)||p.name as name
, inner_level
, power(10,3-inner_level)
* (select sum(p1.position * power(10,level))
from product p1
connect by prior p1.parent_id = p1.id
start with p1.id = p.id
) sort
from (
select p.*
, level as inner_level
from product p
connect by prior p.id = p.parent_id
start with id = 1
) p order by 6


It yields this result:


ID PARENT_ID POSITION NAME INNER_LEVEL SORT
--- ---------- ---------- ---------------- ----------- ----------
1 project 1 1000
2 1 1 hardware 2 1100
20 2 1 windows bak 3 1110
21 2 2 linux bak 3 1120
3 1 2 software 2 1200
30 3 1 schermpje 3 1210
31 3 2 driver 3 1220
26 1 4 documentatie 2 1400
5 26 1 systeem 3 1410
128 26 2 gebruikers 3 1420
6 128 1 html 4 1421
4 1 9 migratie 2 1900
41 4 1 extractie 3 1910
42 4 2 laden 3 1920


Now, I know the solution is flawed. For example, you can't have more than 9 children per node, and you can't have a deeper tree than 38 (number precision). You could tinker it a bit to allow for 99 children per node with a maximum depth of 19, which seems reasonable for this particular purpose (work breakdown structure)

However, the more I'm familiarizing myself with the nested set model, the more i feel that that is a much, much better solution to handle hierarchy within a relational database.

5 comments:

rpbouman said...

Yep dudes and dudettes, it's solved alright. See
this post

Anonymous said...

The cyclical references issue is solved in Oracle 10g with the "Connect By Nocycle" command.
http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/gennick_connectby_10g.html

rpbouman said...

Hi,

and thanks for your comment.

it's good that the cycle thing has been solved too.

It would be quite interesting to set up a benchmark (in oracle) to get a grip on the differences between a nested set model and an adjacency list.

I expect the likely: for a SELECT, the nested set will virtually always be faster, and that for INSERT/UPDATE/DELETE the nested set will always be faster, but how much?

AllForMusic said...

Hi,
I was facing a problem with hierarchical queries.

I am trying to do something like this:

select level from dual
start with level = 1
connect by level < 12

what i was looking forward to was a column with name "level" with 11 rows in it starting with 1 and incrementing till 11.

I have tried this query on other versions of oracle databases in my workplace and seems to have worked
for some of them,but for the one which is installed in my pc it fails to run and gives only "no rows returned"

Is there anything that i need to do like run any scripts before i start doing these things with my hierarchical queries?

Can you plz help me with it?

Thanks in advance :-)

rpbouman said...

Hi Joe,

"what i was looking forward to was a column with name "level" with 11 rows in it starting with 1 and incrementing till 11."

mm, that would be the case only if you would have a chain of nodes, each having one child node. I mean, the level denotes the depth of the node in the tree.

The Oracle rownum pseudocolumn always gives an incrementing sequence, but it works for any query - rownum has nothing to do with CONNECT BY.

"select level from dual
start with level = 1
connect by level < 12"

I don't think your syntax is sound. It should be something like:

select id, level
from node
connect by prior id = ref_id
start with id = 1

SAP HANA Trick: DISTINCT STRING_AGG

Nowadays, many SQL implementations offer some form of aggregate string concatenation function. Being an aggregate function, it has the effe...