Friday, September 30, 2005

Why REPEAT and WHILE are usually not handy to handle MySQL CURSORs

Most procedural languages offer some kind of syntax that influences program flow so that a statement or sequence of statements can be repeated. The MySQL stored procedure language (I think I'll start calling that MySPL from now on) offers three such language constructs: LOOP..END LOOP, REPEAT..UNTIL..END REPEAT and WHILE..DO..END WHILE.

Now, each of these essentially does the same thing: it denotes a region in the stored procedure that will be repeated at runtime. The differences between these constructs have to do with the way the repetition is terminated.

Unstructured loops: LOOP..END LOOP


LOOP..END LOOP is the simplest of the three loop constructs. LOOP denotes the start of the repetitive region, END LOOP denotes the end of the repetitive region, and anything in between those two markers is repeated. Period.


procedure p_count()
begin
declare
v_counter
int unsigned
default 0
;
loop -- start a repetitive region
set v_counter := v_counter + 1; -- just keep counting
end loop; -- end the repetitive region
end;


If the programmer does not take any special precautions, the repetition will be infinite, which is not a good thing.

Luckily for the programmer, MySPL defines the LEAVE construct. At runtime, LEAVE performs a kind of jump out of a region of code, resuming program execution right after that region. When executed from inside the body of a LOOP..END LOOP, LEAVE jumps to just behind the END LOOP, effectively terminating the repetition. Of course, just writing LEAVE is no good, as it undoes the repetitive effect of the loop we wanted to achieve in the first place. But a conditionally executed LEAVE is very useful. Check out this snippet:


procedure p_count()
begin
declare
v_max_counter
int unsigned
default 10
;
declare
v_counter
int unsigned
default 0
;
myLoop: loop -- start the repetitive region named 'myLoop'
if -- check if we have counted enough
v_counter = v_max_counter
then
leave myLoop; -- terminate repetition of region 'myLoop'
end if;
set v_counter := v_counter + 1; -- just keep counting
end loop myLoop; -- end the repetitive region 'myLoop'
-- when the loop has been left, execution continues from here
end;


This snippet differs in two ways from the previous one. Apart from the introduction of the IF..THEN..LEAVE..END IF, this snippet explicitly labels the repeated region myLoop. The region is labeled by prefixing the start of the region (the LOOP keyword in this case) with the identifier for the region ('myLoop') followed by a colon. We really must label the LOOP..END LOOP, or else LEAVE does not know where to jump to, or rather, what to 'jump out of' (=leave).

Because the code that handles loop termination (IF..THEN..LEAVE..END IF in our snippet) is syntactically not part of the LOOP..END LOOP construct, LOOP..END LOOP is considered to be an unstructured loop construct.

Structured loops


The other MySQL loop constructs, REPEAT..UNTIL..END REPEAT and WHILE..DO..END WHILE, are structured loop constructs. Their syntax explicitly requires a boolean expression, a condition, that controls loop termination. This compels the programmer to think about loop termination right away, providing some protection against writing infinite loops. Also, these loops have a well-defined, immutable exit point.

For a REPEAT..UNTIL..END REPEAT loop, the condition is checked after execution of the statements between the REPEAT and UNTIL markers. If the condition evaluates to TRUE, the loop terminates, and program execution resumes after the END REPEAT marker. If the condition does not evaluate to TRUE, that is, if it evaluates to either FALSE or NULL, program execution is repeated starting at the statement after the REPEAT.


procedure p_count()
begin
declare
v_max_counter
int unsigned
default 10
;
declare
v_counter
int unsigned
default 0
;
repeat -- marks the start of the repeat loop
set v_counter := v_counter + 1; -- just keep counting
until -- marks the end of the loop body
v_counter = v_max_counter -- this boolean expression constitutes the loop control condition
end repeat; -- marks the end of the loop
end;


For a WHILE..DO..END WHILE loop, the condition is checked before execution of the statements between the DO and END WHILE markers. If the condition does not evaluate to TRUE, that is, if it evaluates to FALSE or NULL, the loop terminates, and program execution resumes after the END WHILE marker. If the condition evaluates to TRUE, the loop is entered and the statements between the DO and END WHILE markers are executed. When program execution reaches the END WHILE marker, the condition between the WHILE and DO markers is re-evaluated, and the whole process starts over again.


procedure p_count()
begin
declare
v_max_counter
int unsigned
default 10
;
declare
v_counter
int unsigned
default 0
;
while -- marks the start of the loop
v_counter <> v_max_counter -- this boolean expression constitutes the loop control condition
do -- marks the start of the loop body
set v_counter := v_counter + 1; -- just keep counting
end while; -- marks the end of the loop body
end;


Like with LOOP..END LOOP, you can still write LEAVE inside such a loop, but this is considered bad practice. In general, Good Code Writing Rules demand that code blocks, including loops, have a single point of entry and a single exit. By adhering to this rule, you ensure that the loops are easier to maintain and easier to debug. Ignoring this rule can lead to 'spaghetti' code: code that, instead of maintaining a clear sequential flow of program execution, jumps back and forth.
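To make the point concrete, here is a small sketch of my own (not part of the original examples) of what such a frowned-upon mix looks like: a structured WHILE loop with an extra LEAVE that creates a second, hidden exit point:


procedure p_count_bad()
begin
declare v_counter int unsigned default 0;
myWhile: while v_counter < 10 do -- first exit point: the WHILE condition
set v_counter := v_counter + 1;
if v_counter = 5 then -- second, hidden exit point buried in the loop body
leave myWhile; -- legal, but it defeats the purpose of a structured loop
end if;
end while myWhile;
end;
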

Why people favour structured loops


Generally, people favour structured loops over unstructured ones. The syntactic design of the structured loop constructs makes it easy to see where the repetition terminates and what condition controls termination. The only thing the programmer needs to do is make sure the condition is right. Also, the programmer must refrain from writing some additional LEAVE inside the loop. So, that's one thing programmers must do, and one thing they must not do.

For an unstructured loop, the programmer must write some kind of conditional statement that checks the condition. Apart from checking the condition, the programmer must also actually write the code that terminates the loop. In an unstructured loop, the condition check could appear anywhere inside the loop, so one really must read the entire loop code to see where the exit point is and what condition controls it. And just as with structured loops, programmers should refrain from writing multiple condition checks and exit points; that holds true for unstructured and structured loops alike.

Loops and cursors


In my previous post, I illustrated CURSOR traversal in both Oracle and MySQL. I used an unstructured LOOP..END LOOP construct in both examples there. In fact, I usually handle cursors in MySPL using the unstructured loop construct. I hope I can explain why.

Take a look at the following snippet. I didn't make it up myself; I took it from the Cursors section in the MySQL Reference Manual.
UPDATE: David Beroff kindly pointed out to me that at some point during the past decade that this post has been online, the example in the MySQL manual changed and now uses a LOOP..END LOOP construct. I don't know if this post had anything to do with that, but I'm happy it was changed anyway. Yay!
I did simplify it a little to keep the focus on the problem at hand, using a loop and a cursor:


CREATE PROCEDURE curdemo()
BEGIN
DECLARE done INT DEFAULT 0;
DECLARE a CHAR(16);
DECLARE b INT;
DECLARE cur1
CURSOR FOR
SELECT id,data
FROM test.t1;
DECLARE
CONTINUE HANDLER FOR
SQLSTATE '02000'
SET done = 1;

OPEN cur1;

REPEAT
FETCH cur1 INTO a, b;
IF NOT done THEN
INSERT
INTO test.t3
VALUES (a,b);
END IF;
UNTIL
done
END REPEAT;

CLOSE cur1;
END;


Basically, what's happening here is that we have a cursor, cur1, that selects data from the columns id and data of the table t1. The cursor is traversed using a REPEAT..UNTIL..END REPEAT loop, and for each FETCH, the data taken from the cursor is used to perform an INSERT into table t3. Of course, we could've achieved this in a single INSERT INTO..SELECT..FROM statement, but that's beside the point. I want to focus on the REPEAT..UNTIL..END REPEAT loop and the CURSOR traversal.
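Just for reference, that single set-based statement would be something along these lines (using the manual's example tables):


INSERT INTO test.t3
SELECT id, data
FROM test.t1;
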

So, how is this loop controlled? Well, being a REPEAT..UNTIL..END REPEAT loop, the loop body is entered at least once, and repeated until the condition becomes TRUE. Here, the condition itself is very simple: it consists of a simple evaluation of the INT type variable done. By default, done is zero, which is equivalent to FALSE when evaluated as a boolean value. The entire control of the loop depends on the value of the done variable being set to an equivalent of TRUE when the cursor is exhausted. When that is the case, the very next time that the UNTIL condition is evaluated, the loop will terminate.
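If you want to see that integer-to-boolean coercion for yourself, a quick check (not part of the original example) is:


SELECT 0 IS TRUE, 1 IS TRUE; -- returns 0 and 1: zero counts as FALSE, non-zero as TRUE
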

How is cursor exhaustion detected in MySPL? Well, I described in my previous post that you need a CONTINUE HANDLER for the builtin NOT FOUND condition to do this. In this snippet, this very device is used, with a slight modification: instead of the builtin NOT FOUND condition, SQLSTATE '02000' is used. This covers the same situation as the builtin NOT FOUND condition.
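In other words, as far as this procedure is concerned, the following two handler declarations are interchangeable (NOT FOUND is essentially shorthand for the SQLSTATE values that begin with '02'):


DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = 1;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
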

The handler statement itself is very simple. It just assigns a value of 1 to the done variable. Because 1 is an equivalent of TRUE, the loop will terminate the next time that the UNTIL condition is evaluated.

Superficially all this seems to make sense, but I'll try and convince you that this is actually all wrong. Maybe that's a bit too strong. Let's say that I think that this is not as good as it could be.

What is really happening here?


Take a look at the body of the loop; I mean the statements in between the REPEAT and the UNTIL markers. First, a record is fetched from the cursor. Then, an IF..THEN statement checks if the done variable is still equivalent to FALSE, and if so, the INSERT is performed.

Now, why would it be necessary to use an IF statement after the FETCH? Well, suppose the resultset retrieved by the cursor comprises exactly one record. When the cursor is opened, the cursor's internal record pointer will be positioned right before that first and only record. Then, the loop is entered, and the fetch is performed for the first time. The record's field values are transferred to the variables, and the cursor's record pointer will be moved to the next position, right after the record. Execution continues with the IF. The tested condition, NOT done, will evaluate to TRUE, because the done variable is still zero. So, in the first loop cycle, the INSERT is performed. We have arrived at the end of the loop body, reaching the UNTIL. Because the done variable is still equivalent to FALSE, the loop is re-entered, and it starts all over again right after the REPEAT marker defining the beginning of the loop.

We are now about to FETCH for the second time. However, once we attempt that, an error occurs because the cursor is exhausted. Remember, the query retrieved only one record, which was fetched in the first loop cycle. But wait, didn't we declare a HANDLER to deal with this situation? Yes, we did indeed! So, procedure execution is resumed at the handler statement, which assigns a one to the done variable. Once the assignment is done, procedure execution resumes right after the FETCH that raised the error condition. This is because we are dealing with a CONTINUE handler here. After procedure execution resumes, the IF is executed for the second time. This time, NOT done will evaluate to FALSE, so the INSERT is not performed. This makes sense; after all, we already exhausted our cursor during the first loop cycle.

Now we know why the IF is there. If we just wrote an unconditionally executed INSERT right after the FETCH, we would be inserting a record anyway, even though the cursor is already exhausted. The values that we would be inserting are in fact the same values that were inserted during the first loop cycle, because the FETCH is the only statement that assigns any values to the a and b variables.

After the IF..END IF, we encounter the UNTIL for the second time. This time, the done variable will evaluate to TRUE, which will terminate the loop.
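To spell out the problem the IF guards against, here is a sketch of my own (not from the manual) of the same loop body with the IF stripped out; with a one-record cursor, the values fetched in the first cycle would be inserted a second time:


REPEAT
FETCH cur1 INTO a, b; -- on the second cycle this raises NOT FOUND...
INSERT -- ...but the CONTINUE handler resumes right here,
INTO test.t3 -- so the values fetched during the first cycle
VALUES (a,b); -- are inserted once more
UNTIL
done
END REPEAT;
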

Sorry, but why is this so bad?


So, what's not good about this? There are two things that I'm thinking of. One is a matter of style, and not that important, although I think most programmers can appreciate it. The second has to do with efficiency, and as far as I can see it is not really open to any debate. (But hey, feel free to post a comment; as some advertising bots seem to have discovered, comments on this blog can be made anonymously.)

It's 'just' a matter of style, but I feel that REPEAT is a construct you typically use to do some work that has to be done at least once. After that, the condition may require the loop to be re-entered until all of the work has been done. REPEAT is so suitable for this type of problem because it checks the loop control condition after the loop body, so you are guaranteed to have at least one loop cycle. In my opinion, cursor loops usually do not match this pattern. Most of the time, you can't tell beforehand whether the cursor will retrieve any records at all, and in those cases, I do not want to enter the loop at all. Of course, the snippet is resilient to that case, so it's really a matter of style, not of correctness. If the cursor query would not return any rows at all, the snippet would still work, but to me, a REPEAT is like a marker saying: "Ho there, here's some work that needs to be done, and maybe we'll have to repeat it a couple of times". Not surprisingly, I almost never have the need to write a loop like this.

I feel quite a bit more strongly about the other thing. I think that it is really not very good to have two different checks that test the same condition. I mean, we have both the IF and the UNTIL checking essentially an inverse condition. And it's not just something that has to do with this particular snippet; it really has to do with this type of cursor loop in general. Because the work you do in the loop relies on a successful FETCH, you really need the IF. To me, it seems as if all the structural benefits of the REPEAT loop are superseded by essentially being forced to duplicate the loop control check. What makes it a bit worse is that we really have to check inverse conditions, because the UNTIL wants the condition to be FALSE in order to repeat the loop and keep going, whereas the IF wants the condition to be TRUE to keep going.

Can't you solve this with a WHILE loop?


Well, in part, but I still feel we can do even better. Let's give it a go, and I'll show you what I mean. Here's an equivalent snippet, with the REPEAT rewritten using WHILE:


CREATE PROCEDURE curdemo()
BEGIN
DECLARE done INT DEFAULT 0;
DECLARE a CHAR(16);
DECLARE b INT;
DECLARE cur1
CURSOR FOR
SELECT id,data
FROM test.t1;
DECLARE
CONTINUE HANDLER FOR
SQLSTATE '02000'
SET done = 1;

OPEN cur1;

FETCH cur1 INTO a, b;
WHILE NOT done DO

INSERT
INTO test.t3
VALUES (a,b);

FETCH cur1 INTO a, b;
END WHILE;

CLOSE cur1;
END;


So, how is the loop controlled in this case? Being a WHILE..DO..END WHILE loop, the loop body is entered only when the condition evaluates to TRUE. Repetition will likewise occur only if the condition is still TRUE when re-entering the loop. So this solves that "style thingie" I just mentioned. What's also good about this solution is this: because we know that the cursor is not exhausted once we're inside the loop, we can do away with the IF altogether. That's one inverse check less.

So what's the objection now then?


Although this is an improvement over REPEAT, there are still two things bothering me here.

One is really, really minor, and has more to do with my translation of the REPEAT into a WHILE than with anything else. For simplicity, I retained the original done variable. However, this makes me test NOT done in the WHILE. This is not bad, but we could do a lot better by reversing the semantics here. If we alter its declaration to, say

DECLARE hasMoreRows BOOL DEFAULT TRUE;

and likewise rewrite the handler statement to:

SET hasMoreRows = FALSE;

we could rewrite the loop start as

WHILE hasMoreRows DO

So that's quite nicely solved I think:


CREATE PROCEDURE curdemo()
BEGIN
DECLARE hasMoreRows BOOL DEFAULT TRUE;
DECLARE a CHAR(16);
DECLARE b INT;
DECLARE cur1
CURSOR FOR
SELECT id,data
FROM test.t1;
DECLARE
CONTINUE HANDLER FOR
SQLSTATE '02000'
SET hasMoreRows = FALSE;

OPEN cur1;

FETCH cur1 INTO a, b;
WHILE hasMoreRows DO

INSERT
INTO test.t3
VALUES (a,b);

FETCH cur1 INTO a, b;
END WHILE;

CLOSE cur1;
END;


What's still bothering me, though, is the handling of the FETCH statement. The snippet uses two FETCH statements. One is inside the loop body, which is good, because we set up the loop in the first place to browse through the records in the cursor. The other one is outside of the loop, just before we enter it. Why do we need two FETCH statements then? Let's concentrate on the one inside the loop body, and forget about the other one for a while. We already established that we really need one inside the loop in order to traverse the cursor, and you will see why we need one outside the loop in a few moments.

The first thing we can observe about the second FETCH is its placement. It sits at the very bottom of the loop body, right before the END WHILE marker. This is different from the situation in the REPEAT..UNTIL..END REPEAT snippet. There, the FETCH statement was at the top of the loop body, right behind the REPEAT marker. Does it make sense to put it at the bottom? When we reason from the point of view of the work that has to be done inside the loop, it certainly doesn't. Remember, we are browsing through all the records in the cursor, and INSERTing data from the cursor record into another table. So how does it make sense to FETCH the data after the INSERT?

Imagine the FETCH were at the top of the loop, right where it was when we had the REPEAT loop, and pretend that the FETCH immediately preceding the loop isn't there. Now, suppose that the query we used to define the cursor yields no records at all. The cursor would be opened, and because the hasMoreRows variable defaults to TRUE, the loop would be entered. Now, we are pretending the FETCH is right at the top of the loop body, so we do the FETCH, and this results in an error, because the cursor is exhausted. Procedure execution would then be transferred to the HANDLER statement, which SETs the hasMoreRows variable to FALSE, after which procedure execution is resumed...right after the FETCH. Because we are already inside the loop, there's nothing that prevents the INSERT from being executed, which is clearly not what we intended to do at all. Repetition is terminated only after the first loop cycle completes, which is simply one loop cycle too late.

So, now we know why the FETCH must be placed at the bottom of the loop. If it is placed there, cursor exhaustion is immediately handled by not entering the next loop cycle, and without executing any intervening statements. This also explains why we must have a FETCH immediately preceding the entire loop. If we place the FETCH that we need inside the loop at the bottom, we still need to have FETCHed at least once before entering the first loop cycle (or alternatively, not entering the loop at all) to initialize all the variables properly.

The downside is of course that we need to duplicate the FETCH. This would not be so bad if it weren't for all those columns. Suppose you have 20 columns, and you accidentally swap a few columns, fetching the wrong values into your variables...

How an unstructured loop remedies all this


Generally, I do not like to write unstructured loops, but in this case, I keep coming to the conclusion that a simple LOOP..END LOOP is just The Best Solution. Check it out:


CREATE PROCEDURE curdemo()
BEGIN
DECLARE done BOOL DEFAULT FALSE;
DECLARE a CHAR(16);
DECLARE b INT;
DECLARE cur1
CURSOR FOR
SELECT id,data
FROM test.t1;
DECLARE
CONTINUE HANDLER FOR
SQLSTATE '02000'
SET done = TRUE;

OPEN cur1;

myLoop: LOOP
FETCH cur1 INTO a, b;

IF done THEN
CLOSE cur1;
LEAVE myLoop;
END IF;

INSERT
INTO test.t3
VALUES (a,b);

END LOOP;

END;


It's not really difficult to see how this loop is controlled. As in the REPEAT example, the loop is always entered. Right after it is entered, the FETCH is executed, and should the cursor be exhausted, the HANDLER is executed, setting the done variable to TRUE. Up to this point, this is similar to the REPEAT example. Then, the IF is executed. Instead of conditionally doing the actual work when NOT done, as we saw in the REPEAT example, we use it to CLOSE the cursor and LEAVE the loop, but only when we're done. This pattern ensures that we keep all the statements that have to do with cursor handling together. After the END IF, we can do all the real work we need to perform inside the loop.

Summary


So, to summarize it all, we don't need the double check like we saw with the REPEAT; also, we don't need the extra FETCH we saw with the WHILE. The price we have to pay is that the loop is unstructured: your fellow programmers will sue you, and call you names behind your back.

Seriously, as long as you keep the OPEN statement, the start of the LOOP, and the IF..END IF together, and make sure that the IF..END IF contains only those statements that have to do with cleaning up after the cursor traversal, you should be all right. The only other concern is the loop label. When handling multiple or nested loops, you should ensure that you don't mismatch the labels, but that's about it.
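To illustrate that last remark with a sketch of my own (purely a syntax illustration, not part of the cursor example): labels make it explicit which loop a LEAVE terminates when loops are nested:


outerLoop: LOOP
innerLoop: LOOP
FETCH cur1 INTO a, b;
IF done THEN
LEAVE outerLoop; -- exits both loops; LEAVE innerLoop would only exit the inner one
END IF;
-- ... do the actual per-record work here ...
END LOOP innerLoop;
END LOOP outerLoop;
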

Wednesday, September 28, 2005

Want to write a CURSOR LOOP with MySQL?

Like all procedural database languages I know, the MySQL stored procedure language supports explicit cursors. I just wrote "explicit cursor", but from now on I'll refer to those simply as "cursors".

A cursor allows a programmer to traverse a set of records retrieved by a relational database query in a sequential, one-by-one fashion. As such, a cursor forms an interface between the relationally oriented, declarative SQL language and the procedurally oriented calling context. A database cursor is like a pointer positioned somewhere in a resultset, which can be moved programmatically to browse through all the records in the resultset. So, that's quite analogous to the word processor cursor you use to browse through the characters and words on your computer screen.

(For those that are wondering what an implicit cursor is - well, that's just a piece of memory used by the database server to work with resultsets internally. Implicit cursors are not accessible via an externally exposed API, whereas explicit cursors are.)

Lots of php programmers are probably familiar with the cursor concept:


<?php
//connect to the database server
$connection = mysql_connect(
$_SESSION['DB_SERVER']
, $_SESSION['DB_USER']
, $_SESSION['DB_PASSWORD']
);
//get the 'cursor'
$cursor = mysql_query(
'SELECT * FROM table1'
, $connection
);
//browse through the 'cursor'
while(
$row = mysql_fetch_row(
$cursor
)
){
//process records, one by one.
echo('<div>');
foreach(
$row
as $field
) {
echo('<span>');
echo($field);
echo('</span>');
}
echo('</div>');
}
?>


The cursor interface is really convenient when you need to do complex resultset processing. Processing a resultset is quite common in a client program, especially when the resultset needs to be displayed or printed, but sometimes, it can be quite useful to do resultset processing on the database server side.

Purely data-oriented problems can usually be solved without the explicit resultset processing provided by cursors. It is often much better to use implicit resultset processing using 'ordinary' SQL statements. In fact, ordinary SQL statements are very powerful, and should be used whenever possible.
However, in some cases, the cursor interface is easier to write and maintain, and in some cases, there's no ordinary SQL substitute for the cursor interface.

Let's think of an example that is screaming for explicit resultset processing.

Suppose we have a table of employees, emp, and a stored procedure, p_calculate_salary. Now, assume that this stored procedure contains all the logic and business rules we need to calculate an employee's salary for a given month, including benefits, bonuses, and withheld fines for taking part in health or educational plans. What's more, the procedure does not only calculate all that, but also makes the necessary changes elsewhere in the system, triggering autonomous processes such as sending each employee its monthly and (when applicable) annual salary reports.

For those that are still not convinced that all this logic should be embedded in a procedure, think of the following scenario. Suppose that the user that needs to calculate the salaries is not authorized to directly access the data that is needed to perform the business logic implemented by the procedure, and that the procedure is created using SQL SECURITY DEFINER. In that case, there really is no way to solve this using ordinary SQL. Resultset processing using a cursor is really the only way in this case.
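For readers unfamiliar with that option: SQL SECURITY DEFINER makes the procedure run with the privileges of its definer rather than those of the caller. A minimal sketch of what such a procedure header might look like (the parameter list here is purely hypothetical):


CREATE PROCEDURE p_calculate_salary(
p_employee_id INT
, p_salary_month DATE
)
SQL SECURITY DEFINER
BEGIN
-- the business rules would go here; the caller only needs EXECUTE on the
-- procedure, not any privileges on the underlying tables
END;
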

To conveniently do the salary run for a particular month, we could use a cursor to traverse the list of employees, getting the relevant employee data and plugging that into the procedure. This type of use case is typical for a cursor: We do not need any direct output, so an ordinary SELECT is not necessary, spurious even.

Also, we cannot call a procedure in an ordinary SELECT, and it is not always possible to wrap all the business rules into a function (which could be used inside a SELECT).

This salary run is all back-end stuff. We do not want some external script doing this. We want to keep all those business rules and logic as closely tied to the database as we can, especially when the internals of the procedure need to refer to the database. So, that's why the database procedure is really the solution we want to use to calculate the salaries. Once we admit to that, we really cannot avoid cursor-based resultset processing anymore. So let's dig into some resultset processing...



The pseudocode for this operation would look like this:


for all records in the employee table do:
get the current employee record
calculate salary for the current employee record
until all employee records are processed.


In Oracle, you could write it like this:

declare
v_col1 ....; --declare 1..N variables to access the record
v_col2 ....; --I left out the datatypes intentionally
...
v_colN ....;
cursor csr_emp --define the cursor
is
select *
from emp
;
begin
open csr_emp; --open the cursor
loop --traverse the cursor
fetch csr_emp into --fetch current record
v_col1
, v_col2
..
, v_colN
;
exit when csr_emp%notfound; --stop traversal when no more records

p_calculate_salary( --process a record
v_col1
, v_col2
...
, v_colN
);
end loop;
close csr_emp; --release resources
exception when others then --capture all errors
if csr_emp%isopen then --release resources
close csr_emp;
end if;
end;


This snippet illustrates the general pattern for working with a cursor:

  1. You define or declare the cursor.

  2. You open the cursor

  3. You fetch a record from the cursor, storing field values into variables

  4. You check if there are more records, and if so, repeat step 3. If not, proceed

  5. You close the cursor when you've fetched the last record.



Now, compare this to the equivalent MySQL syntax:


begin
declare v_col1 ....; -- define variables to store record fields
declare v_col2 ....; -- datatypes intentionally omitted
declare v_colN ....;
declare v_notfound -- define the loop control variable
BOOL default FALSE;
declare csr_emp -- define the cursor
cursor for
select *
from emp;
declare continue handler -- handle cursor exhaustion
for not found
set v_notfound := TRUE; -- mark our loop control variable
declare exit handler -- handle other errors
for sqlexception
close csr_emp; -- free resources before exit
open csr_emp; -- open cursor
cursor_loop: loop
fetch csr_emp into -- fetch record values
v_col1
, v_col2
...
, v_colN
;
if v_notfound then -- exit the loop when the cursor is exhausted
leave cursor_loop;
end if;
call p_calculate_salary( -- process a record
v_col1
, v_col2
...
, v_colN
);
end loop;
close csr_emp; -- free resources
end;


The ingredients are about the same for both examples. Both Oracle and MySQL open the cursor, fetch from it, and close it when it runs out of records.

The difference between MySQL and Oracle boils down to the difference in the way the loop control is handled.

In Oracle you could use a so-called cursor attribute. Cursor attributes are essentially builtin, readonly variables associated with the cursor. That's actually quite like member variables encountered in object oriented languages. To refer to a particular cursor attribute, you use this syntax: cursor_name%attribute_name. Oracle defines the following cursor attributes: FOUND, NOTFOUND, ROWCOUNT and ISOPEN.

In our snippet, we used the NOTFOUND attribute to check whether the cursor was exhausted in order to exit the loop. We also used the ISOPEN attribute in the general catch-all WHEN OTHERS exception handler, which allows us to explicitly close the cursor if it was left open when an exception occurred in the inner block.

Although MS SQL Server's Transact-SQL dialect does not define cursor attributes, there is a device there used to write cursor loops that in practice resembles the Oracle FOUND and NOTFOUND cursor attributes: the @@FETCH_STATUS (session-wide) global variable. There's a similar global resembling the Oracle ROWCOUNT attribute: @@CURSOR_ROWS.

The difference with the Oracle cursor attributes is that in MS SQL, there's only one instance of the globals, reflecting the status of the last cursor statement. In practice, this is of no consequence, because you usually only reference the cursor attributes in the immediate vicinity of the statements that could change the cursor status. T-SQL has one other device that provides information about the status of an arbitrary cursor (not just the current one), the CURSOR_STATUS() function, but usually, the globals are sufficient for most purposes.
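For comparison, a typical T-SQL cursor loop driven by @@FETCH_STATUS looks roughly like this (my own sketch; the column names and variables are hypothetical):


DECLARE @col1 VARCHAR(16), @col2 INT; -- hypothetical variables
DECLARE csr_emp CURSOR FOR
SELECT col1, col2 FROM emp;

OPEN csr_emp;
FETCH NEXT FROM csr_emp INTO @col1, @col2;
WHILE @@FETCH_STATUS = 0 -- 0 means the last FETCH succeeded
BEGIN
-- process the current record here
FETCH NEXT FROM csr_emp INTO @col1, @col2;
END;
CLOSE csr_emp;
DEALLOCATE csr_emp;
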

MySQL does not define any cursor attributes, nor any globals that provide cursor status information. In fact, cursor functionality is limited to opening a cursor, fetching values from its current record, and closing it. Like in other rdbms products, fetching from an exhausted cursor will raise an error situation in MySQL. Because there's no general way to know beforehand whether the cursor is exhausted, there's nothing left to do but fetch from the cursor and handle the error situation. Always.

So, in MySQL, we must declare a HANDLER for the builtin NOT FOUND condition. Usually, we want to continue our procedure when the cursor is exhausted, so this will usually be a CONTINUE handler. Inside the handler statement, we must place the statements that will update our loop control variable. Like in the Oracle case, we still have to check our loop control variable to actually exit the loop to stop the cursor traversal process.

When writing these loops for MySQL, it might seem a bit awkward at first to use these constructs, especially when you are used to cursor attributes or cursor status globals. In my opinion, it's actually quite daft to have to rely on general error handling mechanisms for handling cursor exhaustion situations. After all, there's nothing exceptional or erroneous about the cursor becoming exhausted. But for now, it's no use sulking. This is just how MySQL does these things right now.

Friday, September 16, 2005

Redirecting Oracle Mod PLSQL Gateway Pages directly to files

I have some webapps that are based upon the Oracle Mod PLSQL gateway. Nowadays, a lot of developers prefer to build Oracle webapps using some Java component to generate the pages, but for a lot of purposes, I still think the Mod PLSQL gateway is not a bad solution at all. I'm the first to agree that mod plsql is not very suitable for generating complex html pages. But when you have to generate XML data that's readily derivable from your relational data stored in the database, I think it's superior to any solution running outside the database.

Anyway, once you do have to build or maintain a webapp based on the mod plsql gateway, you are going to want to view the pages you're generating without having to GET them via an HTTP request. It could be for debugging purposes, or in case you'd want to make a part of your application accessible offline.

For example, an application I'm building for a client generates a lot of pages that are rendered as diagrams in the client browser. In this setup, I've written an oracle package that grabs all the data I need to draw diagrams of a particular type. So, I have one package to get me the data for business process model diagrams, one for diagrams describing information, one for organization charts, and so on.

The diagrams are usually interrelated, and you can click through these diagrams, zooming in and out. Together, these diagrams provide an integral description of an organization: its hierarchical structure, its business processes, and the associated data flows. The diagrams are used mainly for quality management, process reengineering, re-organization, and auditing purposes.

Anyway, these packages output the data as XML to the mod plsql gateway. I'm using nothing fancy, just htp.p. I do have one package that outputs xml elements that I need across diagram types, and that one has a little cache built in, but that's about it.

Normally, users would GET the xml documents using their internet browser. In my setup, the generated xml contains a processing instruction that can be manipulated with a parameter in the URI. I use this to have the client browser initiate an XSLT transformation of the xml data. Usually, the transformation target is some form of DHTML, that is, HTML4.0/CSS/javascript.

So, in my apps, the actual rendering is left to the client, another thing which will be frowned upon by most developers. Actually, I don't think it's so bad, because now the network between the client and the webserver is loaded about 10 times less than it would have been had the XSLT transformation been done on the server side.

Now, some groups of users do not have access to the intranet all of the time. They still need to access the diagrams, though. Also, a presentation for an audience is usually a situation where you would want to use an offline version of the system rather than an online one. So, to handle these situations, I've built some utilities to generate an offline snapshot of a part of the data. This involves traversing the graph of interrelated diagrams, generating the xml, and then storing the xml data in a file that can be accessed offline.

The latter task is very general purpose, so I decided to post the code here:


/********************************************
*
* procedure: p_take_snapshot
* purpose: store mod_plsql generated content in a file
* author: Roland Bouman (R underscore P underscore Bouman at hotmail dot com)
* parameters:
*
* p_dir: the directory to create the file in
*
* This must be a valid directory according
* to the builtin utl_file package.
* For Oracle 8 and below, the path must be
* set in the utl_file_dir database parameter.
* For Oracle 9 this is the name of an existing
* DIRECTORY schema object.
*
* p_file: the name of the file
*
* example usage:
*
* my_web_app.p_generate_web_page('page1');
* p_take_snapshot(
* 'offline_pages'
* , 'page1.html'
* );
*
* my_web_app.p_generate_web_page('page1') represents
* the generation of a webpage by the procedure
* p_generate_web_page in the package my_web_app.
* Of course, this is totally hypothetical.
*
* remarks:
* The procedure stores whatever happens
* to reside in the buffer maintained by
* the builtin htp package in the file.
* The file is created if it does not already
* exist. If it does exist, its contents are
* overwritten.
*
*********************************************/
create or replace
procedure p_take_snapshot(
p_dir varchar2 -- Path (<=v8) or name of DIRECTORY object(>=v9).
, p_file varchar2 -- Filename of the target.
)
is
v_file utl_file.file_type; -- The handle to our file.
v_idx integer; -- current index into the multi-line buffer (TABLE OF VARCHAR2(256)) maintained by the htp package
v_line varchar2(256); -- A local buffer we use to read a single line from the htp buffer
v_len integer; -- The actual length of the data in our local line buffer
v_buffsize integer := 0; -- total amount of data in the utl_file write buffer
v_newlen integer; -- helper var to compute the new size of the utl_file write buffer
v_max_buffsize integer := 32767; -- The maximum size of the utl_file write buffer
procedure p_open_file -- Utility to open our file
is
begin
v_file := utl_file.fopen( -- get file handle
p_dir -- for file in this dir
, p_file -- file has this name
, 'w' -- open for write
, v_max_buffsize -- use this maximum size for the write buffer
);
end;
procedure p_close_file -- Utility to close our file
is
begin
if utl_file.is_open(v_file) then -- check if the file is open
utl_file.fclose(v_file); -- close the open file
end if;
end;
begin
p_open_file; -- open file for write
loop -- loop through lines in htp buffer
v_line := htp.get_line(v_idx); -- get a line from the htp buffer
v_len := length(v_line); -- get amount of data in the line
v_newlen := v_buffsize + v_len; -- get total amount of data in the utl_file buffer
if v_newlen > v_max_buffsize then -- if total amount of data would overflow the utl_file buffer
utl_file.fflush(v_file); -- flush the utl_file buffer
v_buffsize := v_len; -- update cumulative buffersize
else -- we don't need to flush the utl_file buffer right now
v_buffsize := v_newlen; -- update cumulative buffersize
end if;
utl_file.put(v_file, v_line); -- (buffered) write to the file
exit when v_idx = 0; -- leave the loop when we did all lines
end loop;
utl_file.fflush(v_file); -- flush what's still in the utl_file buffer
p_close_file; -- close the file
exception -- handle runtime errors
when others then -- catch all. TODO: handle exceptions declared in utl_file
dbms_output.put_line( -- print some error info
substr(
'p_take_snapshot: '
|| sqlerrm
, 1
, 255
)
);
p_close_file; -- close the file
raise; -- propagate the exception
end;


Interestingly, Oracle does provide some functionality to do this already: htp.showpage outputs the content of the multi-line buffer maintained by the htp package...but it does so with dbms_output.put_line. Why they used dbms_output.put_line is a complete mystery to me. Why not simply use dbms_output.put? Anyway, for my purpose, it just doesn't cut it. Because they used put_line, the output contains lots of newlines, which mess up my XML data bigtime. My procedure really does what they do in htp.showpage; I just used utl_file.put instead of dbms_output.put_line.

Anyway, if you need this procedure, knock yourself out. You can have it, copy it, distribute it...whatever. I'd like it very much if you send me your comments and/or improvements. Also, I'd like it if you'd credit me. Other ideas concerning mod plsql are welcome too.

Quest Pipelines Newsletter (re)publishes "An Introduction to the MySQL Information Schema"

About one and a half months ago, I wrote two papers on the MySQL information schema for Andrew Gifrin's site, www.mysqldevelopment.com. After writing those, he tipped me off that the Quest Pipelines Newsletter team might be interested in republishing them. Andrew is quite an active participant of the Quest Pipelines Portal, and one of the few people that push MySQL related tips and articles, for example, on stored procedures.

Now, they have (re)published the first article in the September issue. I made a few adjustments to the original text, code listings, and results.
For the next issue of the newsletter, I'll be revising my second paper quite a bit. When I wrote the articles originally, I was running the latest beta release; I think it was 5.0.7. Now, we're at 5.0.12, pushing 5.0.13, and MySQL has really improved. Important features such as referencing tables directly from within triggers, and being able to use CALL from within a trigger, are just screaming to be incorporated in the code generator I present in the article.

I guess I've got my work cut out for me.

Wednesday, September 14, 2005

Connect by problem: Yep, it's solved alright

Those who have read my past few posts on this blog will probably think by now that I'm obsessed with hierarchies. Well, that's right! So, without much ado, here's another related post.

Do you remember that 'slight' CONNECT BY problem I wrote about in my post of August 18? I've got some wonderful news: it really is solved. Some revision of the Oracle SQL Reference (9.2 - Release 2) tells me that Oracle has a particular construct that deals with it.

A short recapitulation. Oracle has a special SQL construct that allows you to write a recursive query. This is used to query hierarchies that are modelled using an adjacency list model. Implemented in a relational database, you'd need a table that stores each node in the hierarchy as a row. The table would be set up to have a self-referencing relationship, so that each row would be able to have a 'pointer' to the row that represents the parent node in the hierarchy:


create table employee(
id number not null
, person_id number not null
, formationplace_id number not null
, boss_employee_id number
, constraint pk_employee
primary key(id)
, constraint fk_employee_boss
foreign key(boss_employee_id)
references employee(id)
)


Here, for a particular employee, the record that represents that employee would have the boss_employee_id hold the value of the id column of the employee record that represents its boss. So, boss_employee_id points to id, identifying the parent record in the hierarchy.

Now, with the usual SQL constructs, you can't write a query that selects an entire tree; at least, you cannot solve it for an arbitrary number of levels. This is because this structure is recursive. You'd need a recursive query construct to solve it.

Of course, you could use PL/SQL, or an external language with either recursive or iterative possibilities, to traverse an entire tree, but these are certainly not 'usual SQL constructs'. (Anyway, you could only use such a tool to traverse the tree, and not to define the resultset for use in other queries.)
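To see why the usual constructs fall short: with plain self-joins, every extra level of the hierarchy requires another join, so the maximum depth is hard-wired into the query text. A sketch of my own, covering just three levels of descendants:


select e1.id, e2.id as child_id, e3.id as grandchild_id
from employee e1
left join employee e2
on e2.boss_employee_id = e1.id
left join employee e3
on e3.boss_employee_id = e2.id
where e1.id = :id_of_interest
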

Now, Oracle has a special language construct to deal with this type of structure: CONNECT BY...PRIOR...START WITH. To select a node and all of its descendants, we'd do:


select *
from employee e
connect by
prior e.id = e.boss_employee_id
start with e.id = :id_of_interest


So, this tells Oracle to get the employee with id equal to :id_of_interest first. This represents the root of the tree we are retrieving. Then, the records are fetched that have the boss_employee_id equal to the id of the first node. This yields the direct children of our first node. Now, for each of these, we could repeat the process to fetch their direct children, until no more rows are found.

The direction (top to bottom) can be reversed by reversing the operands in the CONNECT BY PRIOR clause:


select *
from employee e
connect by
prior e.boss_employee_id = e.id
start with e.id = :id_of_interest


The initial node is the same, but the query will now retrieve its ancestors rather than its descendants.

Now, let's stick with the query that gets us all the descendants. What we'd like to be able to do is to print out the hierarchy in some sort of tree view. To be sure, what we mean is a depth-first ordering of the nodes. So, first our initial node, then after that its first child, and then the children of that node...
When you give it a couple of cracks, you will soon discover that you can't. At least, not without another 'special' language construct. The construct I'm referring to is the ORDER SIBLINGS BY clause. This keeps the siblings together, ordered by some expression:


select *
from employee e
connect by
prior e.id = e.boss_employee_id
start with e.id = :id_of_interest
order siblings by e.id


This is of course much nicer than the makeshift solution that I came up with. It's easier to read, and it does not suffer from the major flaw in my solution, which was not suitable for an arbitrary depth.

So, I'm glad that one's solved. I guess I should've read that SQL Reference better.
