Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Wednesday, March 15, 2006

Beware! XML injection vulnerability

LOL!

I noticed that in one of my previous blog entries (Importing XML data into MySQL using ExtractData()
) some of the literal XML snippets did not render at the Planet MySQL blog aggregate.

Then, the trouble was limited to my own entry: the escaped tags in the string arguments passed to the XML functions were not rendered.
Yesterday I did another little entry concerning the MySQL XML functions, but this time, my entry seems to wreck the remainder of the PlanetMySQL.org page! I guess this would be called 'XML injection vulnerability'.

Ironically, I entered a little warning in my blog entry for those that would be interested in reading the source snippets. For those in doubt - It was never my intention to manipulate or otherwise influence the planet mysql, or any other aggregate for that matter.

MySQL 5.1.7: XPath asymmetry between ExtractData() and UpdateXML() ?

The new MySQL 5.1.7 XML functions let you manipulate XML-formatted text strings using XPath query expressions.

You've got string ExtractValue(XML, XPath) - this let's you query a (sub)fragment of the XML fragment and returns the text content of the node that was found (if multiple nodes apply, the text contents of all these nodes are concatenated and separated by a space character, which is then returned). And, you've got XML UpdateXML(XML, XPath, XML) - this let's you search XML fragments and allows you to replace them with another XML fragment.

The funny thing is, it looks as though the XPath query is handled very differently in either case. Consider this little snippet (you might wan't to look at it on the original blog - some blog aggregates don't render the escaped XML fragments well):


mysql> select UpdateXML('<b>ccc<b>hhh</b></b>', '//b', '<e>fff</e>') c1
-> , ExtractValue('<b>ccc<b>hhh</b></b>','//b') c2
-> ;
+----------------------+---------+
| c1 | c2 |
+----------------------+---------+
| <b>ccc<b>hhh</b></b> | ccc hhh |
+----------------------+---------+
1 row in set (0.00 sec)

As you can see in the second column, the XPath correctly identifies all the <b> elements at all levels. The content of all the <b> elements is returned, literally.
However, the same XML, and the same XPath are used as an argument to UpdateXML()...NOTHING happens....I would've expected to see something like:


<b>fff<b>fff</b></b>

or

<b>fff</b>

Anyway, I filed a bug report.

Saturday, March 11, 2006

MySQL 5.1. Information schema now clickable!

I just uploaded the first clickable version of the diagram of the MySQL 5.1. information_schema database. Its right here: http://www.xcdsql.org/MySQL/information_schema/5.1/MySQL_5_1_INFORMATION_SCHEMA.html.

Apart from adding the new information schema tables (PARTITIONS, FILES, PLUGINS, PROCESSLIST, ENGINES) I also touched up the layout, modified the pagesize (in the visio file; you can now print it on A3 without much hassle), and added a relationship between SCHEMATA and TRIGGERS.

I will be maintaining both the 5.0 and the 5.1 version of the diagram from now on.

&lt; != <

Darn, I just noticed that PlanetMySQL renders the literal equivalents of the &lt; and &gt; entity references I use to display XML in my previous blog entry "Importing XML data into MySQL using ExtractData()".

This means the xml fragments, but also the calls to MySQL XML functions won't show at all, or only partially (only the text content seem to be rendered).

I wonder how other people've solved that...

Friday, March 10, 2006

Importing XML data into MySQL using ExtractData()

As of version 5.1.5, MySQL supports some functions for handling XML Data. One can debate the subject wheter a rdbms should ship with XML capabilities. I happen to like this feature.

Working with XML tends to be very data-oriented. In alot of cases, only a trivial mapping is required between the relational data and the xml representation. In these cases, it is certainly quite convenient to have all the tools that implement the mapping close at hand.

MySQL XML capabilities up to 5.0



In prior versions, MySQL XML support was limited to exporting data in XML format using the MySQL command line client. All you need to do is start the mysql client utilitiy with the --xml or -X option.

(I'm using the sakila sample database throughout this article - be sure to install it if you want to execute the code yourself.)


D:\Data\MySQLXML>mysql -P 3307 -u root -pmysqlroot -D sakila -X
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 5.1.7-beta-nt-max

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select * from store;
<?xml version="1.0"?>

<resultset statement="select * from store">
<row>
<field name="store_id">1</field>
<field name="manager_staff_id">1</field>
<field name="address_id">1</field>
<field name="last_update">2006-02-14 21:57:12</field>
</row>

<row>
<field name="store_id">2</field>
<field name="manager_staff_id">2</field>
<field name="address_id">2</field>
<field name="last_update">2006-02-14 21:57:12</field>
</row>
</resultset>
2 rows in set (0.00 sec)

mysql>

This is quite a convenient method for quickly formatting query results in an XML format. It is especially suitable to satisfy ad-hoc query needs, but with some operating system level scripting and batch scheduling you can set it up to emit standard XML formatted reports periodically.

Another device you can use to generate XML data out of MySQL relational data is MyXML. This is a library of UDF's that enables one to generate XML from queries directly.

(BTW - lot's of you will already know the way to this site already, but for those that don't: http://tangent.org/ is quite a large list of contributions to the open source community. Most of them are Brian Aker's, and you can spot some contributions that don't include Brian...as first author. Anyway, it's got apache modules, MySQL UDF's, perl scripts, Feed utilities. Anyone wondering how to run a little java or perl form inside MySQL - go check out this site first.)

A possible disadvantage is that the mapping from relational data to XML is fixed - there is no way you can change the tagnames, control wheter data should be formatted as element content or as an attribute, set the namespace, etcetera.

MySQL XML capabilities as of 5.1.5

XML support in MySQL 5.1 has been extended with:

  • ExtractValue(): Query and Retrieve data from XML formatted text strings

  • UpdateXML(): Update data in XML formatted text strings


(These functions are described in detail here. There's also a little article that illustrates the usage of these functions.)

Both functions accept a string that represents fragment of XML, and an XPath expression. The XPath expression acts as a query operator on the xml, identifying the nodes (that is, the elements, attributes, text contents etc.) that the function needs to handle.
In the case of ExtractValue(), the text nodes specified by, or ocurring within the nodes identified by the XPath expression is returned:

mysql> select extractValue('<tag>text</tag>','tag') as `tag`
-> , extractValue('<tag>text</tag>','tag/text()') as `text()`
-> ;
+------+--------+
| tag | text() |
+------+--------+
| text | text |
+------+--------+
1 row in set (0.00 sec)

In first call to ExtractValue() the XPath expression, tag selects the <tag>text</tag> bit, but only the text content is returned. The XPath expression in second call to ExtractValue(), tag/text() explicitly specifies that text content should be selected. The result is the same in both cases.
If multiple nodes are identified, the text contents are concatenated, separated by a single space character:

mysql> select extractValue('<tag>text1</tag><tag>text2</tag>','tag') as `tag`;
+-------------+
| tag |
+-------------+
| text1 text2 |
+-------------+
1 row in set (0.00 sec)

In the case of UpdateXML(), the selected node is replaced by the third supplied argument.

(The documentation is not very clear what should happen in case the XPath expression matches multiple nodes. As far as I can see now, UpdateXML() does not do anything. Seems like a bug to me, or at least, a gap in the documentation.)

Importing XML using ExtractValue


Now that we know how to selectively extract data from an xml fragment, we can try and put it to good use. For instance, we can try to write a simple utility to import XML data back into the database.
Well, if we know the format of XML documents that we want to import, this should be an easy task. Let's start simple and try to import an xml document like the one we generated in the introduction of this article:

(Of course, I'm not pretending that this is the preferred way to load data coming from mysql back into the database. You can achieve much more performance and data quality by using the usual backup and restore utilities or the mysqlimport utility. This is all just an illustration of an approach towards loading xml data using the builtin features. We're using this example because it will be easy to reproduce)


mysql -P 3307 -u root -pmysqlroot -D sakila -X -e "SELECT * FROM store" > store.xml

Now, we know we can write a single XPath expression for each field in both of the store records, and once we have the values, it's of course a piece of cake to write an insert around that:

mysql> use test;
Database changed
mysql> create table store like sakila.store;
Query OK, 0 rows affected (0.14 sec)

mysql> set @xml := load_file('D:/Data/MySQLXML/store.xml');
Query OK, 0 rows affected (0.00 sec)

mysql> insert into store values (
-> extractValue(@xml,'/resultset/row[1]/field[1]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[2]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[3]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[4]/text()')
-> ),(
-> extractValue(@xml,'/resultset/row[2]/field[1]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[2]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[3]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[4]/text()')
-> );
Query OK, 2 rows affected (0.06 sec)
Records: 2 Duplicates: 0 Warnings: 0


However, this is an unacceptable approach because it is limited to importing only the first two records. When a new sakila videostore is opened someday, we will not be able to import that third record from an xml file. Clearly we need to be able to import all of the records. So, somehow, we need to loop through all the <row> childelements of the <resultset> document element.
Once we've established that we need a loop, we know that we can't keep on typing all our commands interactively. We need either a function or a stored procedure to hold one of the loop constructs like REPEAT or WHILE. Having said that, we realize immediately that if we are to use a loop, we must also think of a particular situation that tells us when we should exit the loop. Somehow, we need to detect if there are still more records to import.
Well, we can do just that. XPath is not limited to just extracting values from an XML fragment. XPath can perform calculations too, yielding values that were never literally present in the source XML. XPath even knows some aggregate functions; this is what we'll be using here to. Consider the following procedure definition:

delimiter //

drop procedure if exists p_import_stores
//

create procedure p_import_stores(
p_xml text
)
begin
declare v_row_index int unsigned default 0;
declare v_row_count int unsigned;
declare v_xpath_row varchar(255);

-- calculate the number of row elements.
set v_row_count := extractValue(p_xml,'count(/resultset/row)');

-- loop through all the row elements
while v_row_index < v_row_count do
set v_row_index := v_row_index + 1;
set v_xpath_row := concat(
'/resultset/row['
, v_row_index
, ']'
);
insert into store values (
extractValue(p_xml,concat(v_xpath_row,'/field[1]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[2]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[3]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[4]/text()'))
);
end while;
end;
//

First, we calculate the number of rows using the XPath function count(). We use that to drive a WHILE loop. Inside the while loop, we increment v_row_index to keep track of the current <row> element. We then use that to construct a new XPath
expression, v_xpath_row that refers to the corresponding element in the xml document. Then, we perform the insert. We can conveniently append the partial XPath expression that retrieves the text content of the <field> elements relative to the parent <row> element to v_xpath_row XPath expression to extract the column values for our insert
statement.
When we call the procedure, we can convince ourselves that it does indeed import all of the records:

mysql> delete from store;
-> //
Query OK, 2 rows affected (0.03 sec)

mysql> call p_import_stores(@xml)
-> //
Query OK, 1 row affected (0.07 sec)

mysql> select * from store;
-> //
+----------+------------------+------------+---------------------+
| store_id | manager_staff_id | address_id | last_update |
+----------+------------------+------------+---------------------+
| 1 | 1 | 1 | 2006-02-14 21:57:12 |
| 2 | 2 | 2 | 2006-02-14 21:57:12 |
+----------+------------------+------------+---------------------+
2 rows in set (0.00 sec)

This is certainly a major improvement over our initial attempt. We've just made the import algorithm independant of the number of records present in the xml. We can improve this procedure a little bit more: for example, we can create a couple of 'constants' to hold element names or even entire XPath expressions, or we can extend our approach toward the number of rows also to the fields. When we introduce the PREPARE syntax, we can even have the procedure perform one or multiple insert statements for multiple <row> elements, depending upon the total number of rows (this could mean a significant performance booster). The PREPARE syntax would also allow one to choose a target table dynamically, leading to a generalized solution for importing XML into the MySQL database.
However, there's at least one problem with this procedure that cannot be remedied using any of these approaches. Te problem is that we really must import all of the <row> elements: we can't make a selection. This problem can be solved by allowing the XPath expression that selects the <row> elements to be parameterized:

delimiter //

drop procedure if exists p_import_stores
//

create procedure p_import_stores(
p_xml text
, p_xpath_row varchar(255)
)
begin
declare v_row_index int unsigned default 0;
declare v_row_count int unsigned;
declare v_xpath_row varchar(255);

-- calculate the number of row elements.
set v_row_count := extractValue(
p_xml
, concat(
'count('
, p_xpath_row
, ')'
)
);

-- loop through all the row elements
while v_row_index < v_row_count do
set v_row_index := v_row_index + 1;
set v_xpath_row := concat(
p_xpath_row
, '['
, v_row_index
, ']'
);
insert into store values (
extractValue(p_xml,concat(v_xpath_row,'/field[1]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[2]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[3]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[4]/text()'))
);
end while;
end;
//

Now we're in a position to use the XPath expression together with a predicate, that is, to add a (set of) criteria that selects only those rows that are of interst. In the example that follows, a predicate is used to select only the row where the store_id field is 1.

mysql> delete from store;
-> //
Query OK, 2 rows affected (0.04 sec)
mysql> call p_import_stores(@xml,'/resultset/row[field[@name="store_id"]=1]')
-> //
Query OK, 1 row affected (0.04 sec)

mysql> select * from store
-> //
+----------+------------------+------------+---------------------+
| store_id | manager_staff_id | address_id | last_update |
+----------+------------------+------------+---------------------+
| 1 | 1 | 1 | 2006-02-14 21:57:12 |
+----------+------------------+------------+---------------------+
1 row in set (0.00 sec)

Of course, this approach can be used equally well for the field XPath expressions. Up to a certain point this would make the import algorithm less dependant upon the particular element names. Suppose we would implement the same trick for the field XPath expression, our procedure signature would look something like this:

create procedure p_import_stores(
p_xml text
, p_xpath_row varchar(255)
, p_xpath_field varchar(255)
)
...

Now, suppose our xml is actually an xhtml document with a <table> element inside the <body> we could then write:

call p_import_stores(
p_xml
, '/html/body/table/tbody/tr'
, '/td'
);

Now, most people will probably agree that the import functionality is gaining usefullness with each of these improvements. However, a major weakness remains. We will always be dependant upon these list or table like structures. In a lot of cases this just isn't enough. What we need is to be able to pass the procedure some structure that contains all the information to define an arbitrary mapping between the xml source and the relational target data. This idea is not new - Microsoft uses it in MS SQL Server. They employ the so-called 'annnotated xml schemas' to achieve it. (You can google that, or check out this site.)
Anyway, I think that is should be possible to come up with something. Of course, the mapping itself would be passed to the procedure in some XML format too - at least, that's how I think it should be implemented. Anyway, I've got a feeling that I will post a sequal to this post at a quite short notice.

Tuesday, March 07, 2006

Sneak Preview: Diagram for the MySQL 5.1.7 Information Schema

Yesterday, I finally installed the MySQL 5.1.7 beta. It took some help from Markus to get it up and running, and this morning, I made a start to update the MySQL information schema diagram that I've been maintaing since june 2005.

To begin with, I just added the new information_schema tables to the original visio diagram. I also reworked the layout to put the tables that are in the same functional category near to each other. (In the previous version, the privilege tables were scattered throughout the diagram).

I put the EVENTS table in same category as TRIGGERS and the ROUTINES, because these all represent MySQL Programmability. I put the PARTITIONS, the ENGINES and the FILES table in a new category, because these all have to do with physical storage of data.

I still need more info on the PLUGINS table to decide where I should put it (it seems like it contains information similar to what I find in the ENGINES table.)

Finally, I adjusted the page dimension so the entire diagram can be printed landscape on A3 paper.

I will adjust the change history soon, and make an announcement in the information schema formum. Meanwhile, a sneak preview is available here: http://www.xcdsql.org/MySQL/information_schema/5.1/INFORMATION_SCHEMA With Rolenames.gif (visio: http://www.xcdsql.org/MySQL/information_schema/5.1/INFORMATION_SCHEMA With Rolenames.vsd). When that's done, I'll give it a go with MySQL Workbench too.

Saturday, March 04, 2006

Finally, MySQL WorkBench is in Beta!

Wow, I really have been waiting for this to happen...

I've been using DBDesigner 4 for some time now. DBDesigner let's you define a datamodel using familiar diagramming techniques. Although the user interface has some quirks, I find it a very handy tool, especially for defining simple, physical data models for MySQL. The has got some reverse engineering capabilities, and of course, you can have it generate the SQL DDL statements to realize your datamodel.

Now, for some time there's been talk about the successor to DBDesigner: MySQL Workbench. The alpha has been available since 14 september 2005. I did install it then, but at the time the functionality seemed so limited, it was clear the tool was in the middle of development.

Frankly, it became quite silent after that. The tool is also not (yet?) listed among the other client side mysql tools at the download page, so the forum is pretty much the only source of news and information. And now, through that, I have just learned that the MySQL Workbench Beta release is now available! Check out the announcement to find a link where you can download a version that's appropriate for you platform.

Obviously, I did not take a very thorough look, but some of the functionalities that are added as compared to DBDesigner 4 are:


  • support for views, triggers and stored procedures

  • multiple models and in a tabbed MDI interface

  • tool to synchronize the diagram with the realized model

  • import your old DB Designer 4 models

  • collapse/expand columns, indexes, triggers etc. section by clicking inside a table in the diagram (as opposed to a global setting in dbd 4)



Other interesting things are that like in DBDesigner 4, the diagrams are stored as xml files. This is very cool, because it gives you all kinds of possibilities to port from and through other applications. I have not seen the functionality to store the model in a database yet, maybe it has to built still.

Anyway, I will start testing this tool as of now. I think I will start to draw up the information_schema diagram with MySQL Workbench and see if I can find some bugs, or features I'd like. I already spotted one: I can't reverse engineer the information_schema database. I will probably use this opportunity to start using MySQL 5.1, as there have been some additions to the information_schema database.

Friday, March 03, 2006

Heck, I can tell it's been a while

LOL!

I found out only today that the Oracle unit of my employer (I'm working for the Microsoft Unit since 1.5 years or so), Inter Access, uses Eclipse to build their java application components, and Jasper Reports as a reporting solution.

This suprises me. I always assumed they used Oracle JDeveloper and Oracle Reports.

Wednesday, March 01, 2006

Improved MySQL Error management in 5.2

I just got word that improved error handling - the stuff I blogged about previously - is in fact being adressed by the MySQL development team. Very glad to head about that.

(We should not expect it sooner than 5.2 though. This is of course totally reasonable as the 5.1 version has been in beta for some time now. In fact, I gueass it won't take too long before that one will become generally available)