Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Saturday, May 20, 2006

Checking out the Pentaho Open Source BI Platform

Recently, I've started checking out the Pentaho open source BI (Business Intelligence) platform. .

So far I'm really impressed about their demo environment.

Clearly, it reveals the ambition of being a top notch BI platform - not just the umpteenth tool. It integrates virtually all of the functionalities associated with BI Projects (ETL, Reporting, Graphing, OLAP, Business Rules Integration, Bursting, Output Formatting for Excel/PDF/Web, Dashboarding) with a workflow engine (Shark) and this really offers an edge.

Pentaho claims that traditional BI usually fails because the perspective of what entails a solution to a business process is just too narrow. A typical BI solution is a product like a report or a datawarehouse - period, end of story. The people behind Pentaho take a more organic view:

  1. a Business Problem is a Problem that has to do with the Business Processes.

  2. The solution to that problem is an alteration/modification of Business Processes

  3. Business intelligence forms just a part of that solution - NOT the solution in itself.

  4. BI can play it's part more succesfull is it integrates with the business process

  5. Pentaho has all the facilities to implement that integration



Literally everything inside Pentaho should be coupled to a business process. To support this, it's built entirely as a Service Oriented Architecture, making it easy to integrate.

I think that taking a sophisticated view like pentaho does, really candidates this platform as a true competitor to the larger Vendors, and I'm thinking of Microsoft and Cognos. I don't think Pentaho is yet as mature, but it's got a lot going for it.

I showed it off to the managers at the office, and guess what! I can present my findings, and tell the rest of our BI unit about this. I hope I can make some waves there...

Thursday, May 18, 2006

Got to get Organazized

Hola,

I really got to get organized. There's just too much little things that seem to remain unfinished for...well, like forever. So, I'm pushing some of the nearly finished or at least usable stuff to the web. Maybe someone else might benefit, and maybe I'll get the kind of feedback that will take it to the next level.


Organization Charts


Since a couple of days, I've been developing a javascript library to render Organization Charts on web pages. Dude, it's even documented... The good news is that it renders decent on Mozilla/Firefox and Internet Explorer. The bad news is that Opera seems to have a little bug or quirk that messes up the connecting lines. (Details: see my post at the opera forum)

MySQL information schema XML dump utilities


A little while ago, I blogged about some utilities I made to work with XML generated by the MySQL command line client, and I announced that I was working on an XSLT stylesheet to transform the information_schema XML to HTML documentation. Well, it's not finished, but you can get a feel of what I have in mind. And...there's notes too on how to use it. Check out this sample of generated documentation for the sakila (08) database. (BTW, check out the wiki on MySQLForge on that; Excellent, thanks Sheeri!)

Some MySQL UDF's for generating XML


Although the MySQL command line client can generate XML, it's not very flexible. The format is also non-standard (according to ISO 9075:2003, book 14). Well, I've started work on this, but it's all experimental. I'll post back on that when I can put it out in the open. What I've got working so far, is this happy little xml_attributes() function. It accepts a variable number of expressions, and automagically creates an XML attribute for them. Just as an example:

SELECT xml_attributes(
film_id
, title
, rating
)
FROM sakila.film
WHERE title LIKE 'AC%'
;
+--------------------------------------------------+
| xml_attributes(film_id,title,rating) |
+--------------------------------------------------+
| film_id="1" title="ACADEMY DINOSAUR" rating="PG" |
| film_id="2" title="ACE GOLDFINGER" rating="G" |
+--------------------------------------------------+

The funny thing is of course that you don't have to pass a name and a value; it figures it out by itself. You can pass as many expressions as you like, and it will do the job. Of course, I automatically generate proper XML names for the attributes, and alsoe the values are escaped properly. However the drawback of UDF's remains the fact that you *cannot* do anyting with character sets, wich pretty much makes them automatically unsuitable to deliver a true standards compliant XML generator.

A XML - database import facility based on annotated schemas

I've been thinking about this for some time. It should be possible to use XSLT to transform an XML Schema that contains relational mappings (as annotations) to a SAX-parser that actually reads an XML stream and imports the data in the process.

Wednesday, May 10, 2006

MySQL View Metadata: original source, algorithm, and status

Views are a great feature which is available as of MySQL 5. A view can be seen as a named and persistently stored SELECT expression. After defining the view it can be referenced just like a table in other queries.

Metadata for MySQL Views can be queried using the TABLES and VIEWS tables from the information_schema. This can be very helpful when maintaining or developing views.

However, this metadata would be even more valueable if it would provide a few more items:

Status

A view becomes broken when the database objects on which it depends (tables, functions, other views) are modified or dropped. It would be nice to be able to run a query to identify exactly those views.

The original sourcecode

Right now, the view source is parsed and becomes rewritten, stripping out any comments and indentation. The information_schema only provides access to this rewritten statement in the VIEW_DEFINITION column of the information_schema.VIEWS table; an unpleasant suprise if you spent a lot of time commenting your view code.

Algorithm

In the CREATE VIEW statement, one can specify the Algorithm that should be used to implement the view. Because the algorithm can have a big impact on the performance it would be nice to be able to query that.


The View Status


Although it's not explicitly supported, it's actually possible to create a conventional query that returns the status for a given view (or set of views). The solution relies on the value of the TABLE_COMMENT column of the information_schema.TABLES table.

For base tables, this column contains the text specified for the COMMENT table option specified in the CREATE TABLE or ALTER TABLE statement. For views, this normally contains the text 'VIEW':

mysql> USE test;
Database changed
mysql> CREATE TABLE t(
-> id int
-> )
-> ENGINE = MyISAM
-> COMMENT = 'Comments on this table'
-> ;
Query OK, 0 rows affected (0.10 sec)

mysql> CREATE VIEW v
-> AS
-> SELECT *
-> FROM t;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT table_name
-> , table_type
-> , table_comment
-> FROM information_schema.tables
-> WHERE table_schema = schema()
-> AND table_name IN ('t','v')
-> ;
+------------+------------+------------------------+
| table_name | table_type | table_comment |
+------------+------------+------------------------+
| t | BASE TABLE | Comments on this table |
| v | VIEW | VIEW |
+------------+------------+------------------------+
2 rows in set (0.00 sec)

Now, we will deliberately break the view by dropping the table upon which it depends:

mysql> DROP TABLE t;
Query OK, 0 rows affected (0.04 sec)

When we run our select again, this is what we get:

+------------+------------+----------------------------------------------------------------------------------+
| table_name | table_type | table_comment |
+------------+------------+----------------------------------------------------------------------------------+
| v | VIEW | View 'test.v' references invalid table(s) or column(s) or function(s) or defin |
+------------+------------+----------------------------------------------------------------------------------+

So, we're witnessing the fact that MySQL is using this column in it's very own way. Because you cannot specify a COMMENT table option when creating or altering a view, MySQL can pretty much decide for itself how to use this column, and it turns out we can take advantage of this.

(Another case where MySQL uses it for it's own means is seen in InnoDB tables. In those cases, some of the tail of the comment is used to log some innodb specific info, possibly truncating the actually entered comment text.)

Now that we know this, it won't be too hard to write a query that identifies broken views:

SELECT *
FROM information_schema.tables
WHERE table_type = 'VIEW'
AND table_type != table_comment

(Later on, we'll modify the query in order to have a VIEW_STATUS column report a value of either 'VALID' or 'INVALID', indicating wheter the view is broken or not. Of course, that modification is trivial.)

Under the Hood: Algorithm and Sourcecode


MySQL stores the view definition in a .frm file in the database directory beneath the data directory, just like it does for tables. Let's take a look at the v.frm file: :

TYPE=VIEW
query=select `test`.`t`.`id` AS `id` from `test`.`t`
md5=25ace41c7f6b0927b2c4ed5227e64ad0
updatable=1
algorithm=0
definer_user=root
definer_host=localhost
suid=1
with_check_option=0
revision=1
timestamp=2006-05-09 22:34:46
create-version=1
source=select *\nfrom t

I've highlighted some items of interest:

  • the query is the code after MySQL's done rewriting the query.

  • Here, the algorithm item is associated with the number 0. By creating views with explicit ALGORITHM specifications, we can quickly learn that this can have either on of the values 0,1 or 2, corresponding to the algorithms UNDEFINED, TEMPTABLE and MERGE.

  • the source item is almost exactly the code as it was entered by the user. The whitespace is still there, only the newlines are escaped with an escape sequence (\n)


Now that we know that it's stored over there, we can try to use it in a query. To do that, we need to be able to somehow access and parse (or actually scan rather than parse) the file contents...And we can, with the LOAD_FILE() function!

Using LOAD_FILE()


The LOAD_FILE() function accepts a filename as a string argument. It tries to read the specified file, and returns the contents as a string. You need the FILE privilege to do this, and the file must not be larger than max_allowed_packet bytes.

On my Windows XP system, the data directory is located at D:\MySQL\MySQL Server 5.1\data. The view v resides in the test database, so I can access the contents of the corresponding .frm file using this query:


SELECT load_file(
'D:\\MySQL\\MySQL Server 5.1\\data\\test\\v.frm'
)
;


In a string literal, a backslash introduces an escape sequence for special characters. So, in order to get a literal backslash, the backslash itself needs to be escaped by typing two backslashes. Tip for windows users: LOAD_FILE will also accept forward slashes as directory separator, so this works equally well (better, actually):


SELECT load_file(
'D:/MySQL/MySQL Server 5.1/data/test/v.frm'
);


Now, we want to be able to drive this using data from the information_schema.VIEWS table, so the following modification is quite straightforward:


SELECT load_file(
concat(
'D:/MySQL/MySQL Server 5.1/data/'
, table_schema
, '/'
, table_name
, '.frm'
)
) file_contents
FROM information_schema.VIEWS
;


Of course, we would also like to be able to parameterize the value for the data directory. At first, it looks as if you can:


mysql> show variables like '%dir%';
+----------------------------+-------------------------------------------+
| Variable_name | Value |
+----------------------------+-------------------------------------------+
| basedir | D:\MySQL\MySQL Server 5.1\ |
| bdb_logdir | |
| bdb_tmpdir | |
| character_sets_dir | D:\MySQL\MySQL Server 5.1\share\charsets\ |
| datadir | D:\MySQL\MySQL Server 5.1\Data\ |
| innodb_data_home_dir | |
| innodb_log_arch_dir | |
| innodb_log_group_home_dir | .\ |
| innodb_max_dirty_pages_pct | 90 |
| plugin_dir | D:\MySQL\MySQL Server 5.1\lib/ |
| slave_load_tmpdir | D:\MySQL\Datatemp |
| tmpdir | D:/MySQL/Datatemp |
+----------------------------+-------------------------------------------+
12 rows in set (0.00 sec)

But alas, it doesn't work that way:

mysql> select @@datadir;
ERROR 1193 (HY000): Unknown system variable 'datadir'

I'm not really sure if this behaviour constitutes a bug, but it doesn't make much sense to me. A quick search in bugs.mysql.com did find me a report about this (bug #1039). As I cannot change this behaviour, I'm settling for the literal right now. If anyone's got a tip to improve on this, I'll be most glad to hear about it.

(Actually, the minor improvement I came up with was using a user-defined variable but this got me into trouble again because of an entirely different bug alltogether)

'Parsing' the file contents


Now that we can load the view's .frm file, we can try to pick out the items of interest.

(I wouldn't call it parsing exactly, it's more of a scanning/recognition process)

For example, suppose we want to extract the algorithm item from the file contents. First, we'll create a few different views to be able to test the effect:

create algorithm=undefined view v_undefined as select * from t;
create algorithm=merge view v_merge as select * from t;
create algorithm=temptable view v_temptable as select * from t;


Now we can write the actual query. First, we want to look for the string algorithm=; then, our value is sitting there right after that string as the remainder of the line. In MySQL, we can solve this 'excision' SUBSTRING() and INSTR() and the like, but I find that a combination SUBSTRING_INDEX() is really much, much more elegant, as well as less error-prone. That's because you don't have to repeat or duplicate any of the strings that determine the excision: I need only one occurrence of the input string, one occurrence of the starting delimiter and one occurrence of the ending delimiter. (A variant of this theme let's you select the n-th item in a separated list by modifying one integer only). Anyway, this is what I'm doing now:


select v.table_schema as schema_name
, v.table_name as view_name
, substring_index(
substring_index(
load_file(
concat(
'D:/MySQL/MySQL Server 5.1/data/'
, v.table_schema
, '/'
, v.table_name
, '.frm'
)
)
, 'algorithm='
, -1
)
, '\n'
, 1
) as algorithm
from information_schema.views v
where v.table_schema = schema()
and v.table_name like 'v\_%'


and this gives us the following result:


+-------------+---------------+-----------+
| schema_name | view_name | algorithm |
+-------------+---------------+-----------+
| test | v_merge | 2 |
| test | v_temptable | 1 |
| test | v_undefined | 0 |
+-------------+---------------+-----------+


Ok, we can make the output pretty later on, so people can actually read what algorithm was used. We'll take care of that after we're done extracting the original source code.

By itself, excising the souce code is even easier than what we already saw for the algorithm item, because the source code happens to be the last item in the .frm file. There's just one complicating factor, and that's that some special characters in the orignal statement appear as plain, escaped text there. So, in order to get to the actual source code, you need to apply the REPLACE() function. This worked for me:

replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
substring_index(
load_file(
concat(
'D:/MySQL/MySQL Server 5.1/data/'
, v.table_schema
, '/'
, v.table_name
, '.frm'
)
)
, '\nsource='
, -1
)
, '\\_','\_')
, '\\%','\%')
, '\\\\','\\')
, '\\Z','\Z')
, '\\t','\t')
, '\\r','\r')
, '\\n','\n')
, '\\b','\b')
, '\\\"','\"')
, '\\\'','\'')
, '\\0','\0')

Putting it together


Now, let's finish what we've started. Suppose we want to create a query that shows both the algorithm, as well as the original source code. Well, obviously, we can take the expressions for the algorithm and source, just as described before, and SELECT both of them. This means we will be loading the file 2 times (That is, unless the optimizer is clever enough to exclude it).

I don't know if it makes a difference performance-wise, but I use a little trick to prevent duplicate loads of the file. I assign the result of LOAD_FILE() to a (session) user variable, and reference that wherever I need the file contents. Here's the final query:


select v.table_schema as schema_name
, v.table_name as view_name
, case
substring_index(
substring_index(
@frm:=load_file(
concat(
'D:/MySQL/MySQL Server 5.1/data/'
, v.table_schema
, '/'
, v.table_name
, '.frm'
)
)
, 'algorithm='
, -1
)
, '\n'
, 1
)
when 0 then 'UNDEFINED'
when 1 then 'TEMPTABLE'
when 2 then 'MERGE'
else 'UNRECOGNIZED'
end as algorithm
, replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
substring_index(
@frm
, '\nsource='
, -1
)
, '\\_','\_')
, '\\%','\%')
, '\\\\','\\')
, '\\Z','\Z')
, '\\t','\t')
, '\\r','\r')
, '\\n','\n')
, '\\b','\b')
, '\\\"','\"')
, '\\\'','\'')
, '\\0','\0') as view_source
, if(t.table_type=table_comment,'YES'
, 'NO'
) as view_status
from information_schema.views v
inner join information_schema.tables t
on v.table_schema = t.table_schema
and v.table_name = t.table_name


This includes the extra modifications to create the view_status column and the algorithm column.

Saturday, May 06, 2006

Just let SCO convince you by comparing some stacks

LOL!

Some aftermath from the MySQL Users conference I attended just a week ago: I just got some 'informative' mail from SCO, promoting their SCAMP Stack (Google it if you like).

The mail contains a link to a promotional article, advertising the SCAMP stack. Although I recall that SCO has bad standing due to them instantiating a lawsuit against IBM and RedHat (among others), claiming that these companies have violated a non-disclosure agreement with SCO by donating UNIX code to the Linux community, I read the paper anyway.

At the bottom, their article contains a graph comparing TCO for SCAMP, LAMP and WIMP stacks for over a 5 year period. The funny thing is that they totally mess up on the WIMP stuff. I thought that "WIMP" (in this context that is) usually means: Windows, IIS, Mysql and Perl or PHP (or poth :). Now, check out their graph:


  • The label underneath the WIMP bar inside the graph sais: "XP + SQL", indicating a Windows XP and Microsoft SQL Server combination

  • The caption on the right top of the graph explains WIMP as "...Window, IIS, MySQL and Perl and PHP..."

  • The legend beneath the caption defines WIMP as "...Microsoft Windows 2003 Server Standard Edition + SQL Server 2005 Standard edition + Essential support"



Now, that's at least three mistakes too much:

  1. What are we talking about, Windows XP or Windows 2003 Server? That really matters, considering that one of them is a Server product and not a homepc os

  2. Again, what are we talking about, MS SQL 2005 or MySQL?

  3. If we really are talking about MS SQL 2005 - that product has not been around for 5 years - more like 7 to 8 months if we're talking GA


Anyone to enlighten me on this subject?

Tuesday, May 02, 2006

The MySQL Forums

I feel most honoured that I have been awarded a MySQL Community award 2006: it seems I'm this year's MySQL Forum "Leader".

(Well, you should not take the "Leader" bit very literally I guess - It just means I got a big mouth that's eager to tell someone else what I think about their problem :)

I'd like to congratulate the other Community Award Winners: Giuseppe Maxia for his excellent code contributions and technical articles; Markus Popp for his terrific articles, tips, and the db4free initiative; and of course: Rasmus Lerdorf for...well, so many things! I mean - I can't write this list: it's way too long and words fail here anyway.

Here's the award we all got:



Concerning the forums, I want to use this occasion to express my regards to Arjen Lentz and all those other people at MySQL that maintain the forum site. I really think that it is amazing that a public forum that is visited so often and by so many people, together displaying such an enormous diversity in background, nationality and skill level, is such a friendly, sociable place, that manages to maintain such a high level of quality concerning subject matter while being so accessible to less experienced users at the same time. So, thanks guys for providing this service.

In a previous blog entry, I already mentioned a couple of people whose forum posts were most helpful to me: Andrew Gilfrin, Jay Pipes and Felix Geerinckx. I'd like to mention a few other people that cut my eye while I was there:

Maurizio Nardò

Maurizio has displayed some very interesting and original ideas on UDF's and dynamic SQL

Mike Kruckenberg

Mike is a really knowledgable MySQL professional and co-author of Pro MySQL (Apress books). Once he digs in to a thread, he's not leaving until the poster is satified by an answer

Bob Field

Bob Field's been around for a while, helping people out with various issues

Peter Brawley

Peter Brawley is also one of these people that try to think together with the poster, instead of thinking for them



I've learned a lot by browsing through all of your posts - I still do. Thanks for that, and see you there next time.