As of version 5.1.5, MySQL supports some
functions for handling XML Data. One can debate the subject wheter a rdbms should ship with XML capabilities. I happen to like this feature.
Working with XML tends to be very data-oriented. In alot of cases, only a trivial mapping is required between the relational data and the xml representation. In these cases, it is certainly quite convenient to have all the tools that implement the mapping close at hand.
MySQL XML capabilities up to 5.0
In prior versions, MySQL XML support was limited to exporting data in XML format using the
MySQL command line client. All you need to do is start the mysql client utilitiy with the
--xml or
-X option.
(I'm using the
sakila sample database throughout this article - be sure to install it if you want to execute the code yourself.)
D:\Data\MySQLXML>mysql -P 3307 -u root -pmysqlroot -D sakila -X
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 5.1.7-beta-nt-max
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> select * from store;
<?xml version="1.0"?>
<resultset statement="select * from store">
<row>
<field name="store_id">1</field>
<field name="manager_staff_id">1</field>
<field name="address_id">1</field>
<field name="last_update">2006-02-14 21:57:12</field>
</row>
<row>
<field name="store_id">2</field>
<field name="manager_staff_id">2</field>
<field name="address_id">2</field>
<field name="last_update">2006-02-14 21:57:12</field>
</row>
</resultset>
2 rows in set (0.00 sec)
mysql>
This is quite a convenient method for quickly formatting query results in an XML format. It is especially suitable to satisfy ad-hoc query needs, but with some operating system level scripting and batch scheduling you can set it up to emit standard XML formatted reports periodically.
Another device you can use to generate XML data out of MySQL relational data is
MyXML. This is a library of UDF's that enables one to generate XML from queries directly.
(BTW - lot's of you will already know the way to this site already, but for those that don't:
http://tangent.org/ is quite a
large list of contributions to the open source community. Most of them are
Brian Aker's, and you can spot some contributions that don't include Brian...as first author. Anyway, it's got apache modules, MySQL UDF's, perl scripts, Feed utilities. Anyone wondering how to run a little
java or
perl form inside MySQL - go check out this site first.)
A possible disadvantage is that the mapping from relational data to XML is fixed - there is no way you can change the tagnames, control wheter data should be formatted as element content or as an attribute, set the namespace, etcetera.
MySQL XML capabilities as of 5.1.5
XML support in MySQL 5.1 has been extended with:
ExtractValue(): Query and Retrieve data from XML formatted text strings
UpdateXML(): Update data in XML formatted text strings
(These functions are described in detail
here. There's also a little
article that illustrates the usage of these functions.)
Both functions accept a string that represents fragment of XML, and an
XPath expression. The XPath expression acts as a query operator on the xml, identifying the nodes (that is, the elements, attributes, text contents etc.) that the function needs to handle.
In the case of
ExtractValue(), the text nodes specified by, or ocurring within the nodes identified by the XPath expression is returned:
mysql> select extractValue('<tag>text</tag>','tag') as `tag`
-> , extractValue('<tag>text</tag>','tag/text()') as `text()`
-> ;
+------+--------+
| tag | text() |
+------+--------+
| text | text |
+------+--------+
1 row in set (0.00 sec)
In first call to
ExtractValue() the XPath expression,
tag selects the
<tag>text</tag> bit, but only the text content is returned. The XPath expression in second call to
ExtractValue(),
tag/text() explicitly specifies that text content should be selected. The result is the same in both cases.
If multiple nodes are identified, the text contents are concatenated, separated by a single space character:
mysql> select extractValue('<tag>text1</tag><tag>text2</tag>','tag') as `tag`;
+-------------+
| tag |
+-------------+
| text1 text2 |
+-------------+
1 row in set (0.00 sec)
In the case of
UpdateXML(), the selected node is replaced by the third supplied argument.
(The documentation is not very clear what should happen in case the XPath expression matches multiple nodes. As far as I can see now,
UpdateXML() does not do
anything. Seems like a
bug to me, or at least, a gap in the documentation.)
Importing XML using ExtractValue
Now that we know how to selectively extract data from an xml fragment, we can try and put it to good use. For instance, we can try to write a simple utility to import XML data back into the database.
Well, if we know the format of XML documents that we want to import, this should be an easy task. Let's start simple and try to import an xml document like the one we generated in the introduction of this article:
(Of course, I'm not pretending that this is the preferred way to load data coming from mysql back into the database. You can achieve much more performance and data quality by using the usual backup and restore utilities or the mysqlimport utility. This is all just an illustration of an approach towards loading xml data using the builtin features. We're using this example because it will be easy to reproduce)
mysql -P 3307 -u root -pmysqlroot -D sakila -X -e "SELECT * FROM store" > store.xml
Now, we know we can write a single XPath expression for each field in both of the store records, and once we have the values, it's of course a piece of cake to write an insert around that:
mysql> use test;
Database changed
mysql> create table store like sakila.store;
Query OK, 0 rows affected (0.14 sec)
mysql> set @xml := load_file('D:/Data/MySQLXML/store.xml');
Query OK, 0 rows affected (0.00 sec)
mysql> insert into store values (
-> extractValue(@xml,'/resultset/row[1]/field[1]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[2]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[3]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[4]/text()')
-> ),(
-> extractValue(@xml,'/resultset/row[2]/field[1]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[2]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[3]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[4]/text()')
-> );
Query OK, 2 rows affected (0.06 sec)
Records: 2 Duplicates: 0 Warnings: 0
However, this is an unacceptable approach because it is limited to importing only the first two records. When a new sakila videostore is opened someday, we will not be able to import that third record from an xml file. Clearly we need to be able to import all of the records. So, somehow, we need to loop through all the
<row> childelements of the
<resultset> document element.
Once we've established that we need a loop, we know that we can't keep on typing all our commands interactively. We need either a function or a stored procedure to hold one of the loop constructs like
REPEAT or
WHILE. Having said that, we realize immediately that if we are to use a loop, we must also think of a particular situation that tells us when we should exit the loop. Somehow, we need to detect if there are still more records to import.
Well, we can do just that. XPath is not limited to just extracting values from an XML fragment. XPath can perform calculations too, yielding values that were never literally present in the source XML. XPath even knows some aggregate functions; this is what we'll be using here to. Consider the following procedure definition:
delimiter //
drop procedure if exists p_import_stores
//
create procedure p_import_stores(
p_xml text
)
begin
declare v_row_index int unsigned default 0;
declare v_row_count int unsigned;
declare v_xpath_row varchar(255);
-- calculate the number of row elements.
set v_row_count := extractValue(p_xml,'count(/resultset/row)');
-- loop through all the row elements
while v_row_index < v_row_count do
set v_row_index := v_row_index + 1;
set v_xpath_row := concat(
'/resultset/row['
, v_row_index
, ']'
);
insert into store values (
extractValue(p_xml,concat(v_xpath_row,'/field[1]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[2]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[3]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[4]/text()'))
);
end while;
end;
//
First, we calculate the number of rows using the XPath function
count(). We use that to drive a
WHILE loop. Inside the while loop, we increment
v_row_index to keep track of the current
<row> element. We then use that to construct a new XPath
expression,
v_xpath_row that refers to the corresponding element in the xml document. Then, we perform the insert. We can conveniently append the partial XPath expression that retrieves the text content of the
<field> elements relative to the parent
<row> element to
v_xpath_row XPath expression to extract the column values for our insert
statement.
When we call the procedure, we can convince ourselves that it does indeed import all of the records:
mysql> delete from store;
-> //
Query OK, 2 rows affected (0.03 sec)
mysql> call p_import_stores(@xml)
-> //
Query OK, 1 row affected (0.07 sec)
mysql> select * from store;
-> //
+----------+------------------+------------+---------------------+
| store_id | manager_staff_id | address_id | last_update |
+----------+------------------+------------+---------------------+
| 1 | 1 | 1 | 2006-02-14 21:57:12 |
| 2 | 2 | 2 | 2006-02-14 21:57:12 |
+----------+------------------+------------+---------------------+
2 rows in set (0.00 sec)
This is certainly a major improvement over our initial attempt. We've just made the import algorithm independant of the number of records present in the xml. We can improve this procedure a little bit more: for example, we can create a couple of 'constants' to hold element names or even entire XPath expressions, or we can extend our approach toward the number of rows also to the fields. When we introduce the
PREPARE syntax, we can even have the procedure perform one or multiple insert statements for multiple
<row> elements, depending upon the total number of rows (this could mean a significant performance booster). The
PREPARE syntax would also allow one to choose a target table dynamically, leading to a generalized solution for importing XML into the MySQL database.
However, there's at least one problem with this procedure that cannot be remedied using any of these approaches. Te problem is that we really must import all of the
<row> elements: we can't make a selection. This problem can be solved by allowing the XPath expression that selects the
<row> elements to be parameterized:
delimiter //
drop procedure if exists p_import_stores
//
create procedure p_import_stores(
p_xml text
, p_xpath_row varchar(255)
)
begin
declare v_row_index int unsigned default 0;
declare v_row_count int unsigned;
declare v_xpath_row varchar(255);
-- calculate the number of row elements.
set v_row_count := extractValue(
p_xml
, concat(
'count('
, p_xpath_row
, ')'
)
);
-- loop through all the row elements
while v_row_index < v_row_count do
set v_row_index := v_row_index + 1;
set v_xpath_row := concat(
p_xpath_row
, '['
, v_row_index
, ']'
);
insert into store values (
extractValue(p_xml,concat(v_xpath_row,'/field[1]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[2]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[3]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/field[4]/text()'))
);
end while;
end;
//
Now we're in a position to use the XPath expression together with a
predicate, that is, to add a (set of) criteria that selects only those rows that are of interst. In the example that follows, a predicate is used to select only the row where the store_id field is 1.
mysql> delete from store;
-> //
Query OK, 2 rows affected (0.04 sec)
mysql> call p_import_stores(@xml,'/resultset/row[field[@name="store_id"]=1]')
-> //
Query OK, 1 row affected (0.04 sec)
mysql> select * from store
-> //
+----------+------------------+------------+---------------------+
| store_id | manager_staff_id | address_id | last_update |
+----------+------------------+------------+---------------------+
| 1 | 1 | 1 | 2006-02-14 21:57:12 |
+----------+------------------+------------+---------------------+
1 row in set (0.00 sec)
Of course, this approach can be used equally well for the field XPath expressions. Up to a certain point this would make the import algorithm less dependant upon the particular element names. Suppose we would implement the same trick for the field XPath expression, our procedure signature would look something like this:
create procedure p_import_stores(
p_xml text
, p_xpath_row varchar(255)
, p_xpath_field varchar(255)
)
...
Now, suppose our xml is actually an xhtml document with a
<table> element inside the
<body> we could then write:
call p_import_stores(
p_xml
, '/html/body/table/tbody/tr'
, '/td'
);
Now, most people will probably agree that the import functionality is gaining usefullness with each of these improvements. However, a major weakness remains. We will always be dependant upon these list or table like structures. In a lot of cases this just isn't enough. What we need is to be able to pass the procedure some structure that contains all the information to define an arbitrary mapping between the xml source and the relational target data. This idea is not new - Microsoft uses it in MS SQL Server. They employ the so-called 'annnotated xml schemas' to achieve it. (You can google that, or check out
this site.)
Anyway, I think that is should be possible to come up with something. Of course, the mapping itself would be passed to the procedure in some XML format too - at least, that's how I think it should be implemented. Anyway, I've got a feeling that I will post a sequal to this post at a quite short notice.