Tuesday, April 20, 2010

Restoring XML-formatted MySQL dumps

EAVB_VFZUHIARHI To whom it may concern -

The mysqldump program can be used to make logical database backups. Although the vast majority of people use it to create SQL dumps, it is possible to dump both schema structure and data in XML format. There are a few bugs (#52792, #52793) in this feature, but these are not the topic of this post.

XML output from mysqldump

Dumping in XML format is done with the --xml or -X option. In addition, you should use the --hex-blob option otherwise the BLOB data will be dumped as raw binary data, which usually results in characters that are not valid, either according to the XML spec or according to the UTF-8 encoding. (Arguably, this is also a bug. I haven't filed it though.)

For example, a line like:

mysqldump -uroot -pmysql -X --hex-blob --databases sakila
dumps the sakila database to the following XML format:

<?xml version="1.0"?>
<mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<database name="sakila">
<table_structure name="actor">
<field Field="actor_id" Type="smallint(5) unsigned" Null="NO" Key="PRI" Extra="auto_increment" />
<field Field="first_name" Type="varchar(45)" Null="NO" Key="" Extra="" />
<field Field="last_name" Type="varchar(45)" Null="NO" Key="MUL" Extra="" />
<field Field="last_update" Type="timestamp" Null="NO" Key="" Default="CURRENT_TIMESTAMP" Extra="on update CURRENT_TIMESTAMP" />
<key Table="actor" Non_unique="0" Key_name="PRIMARY" Seq_in_index="1" Column_name="actor_id" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
<key Table="actor" Non_unique="1" Key_name="idx_actor_last_name" Seq_in_index="1" Column_name="last_name" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
<options Name="actor" Engine="InnoDB" Version="10" Row_format="Compact" Rows="200" Avg_row_length="81" Data_length="16384" Max_data_length="0" Index_length="16384" Data_free="233832448" Auto_increment="201" Create_time="2009-10-10 10:04:56" Collation="utf8_general_ci" Create_options="" Comment="" />
</table_structure>
<table_data name="actor">
<row>
<field name="actor_id">1</field>
<field name="first_name">PENELOPE</field>
<field name="last_name">GUINESS</field>
<field name="last_update">2006-02-15 03:34:33</field>
</row>

...many more rows and table structures...

</database>
</mysqldump>
I don't want to spend too much time discussing why it would be useful to make backups in this way. There are definitely a few drawbacks - for example, for sakila, the plain SQL dump, even with --hex-blob is 3.26 MB (3.429.358 bytes), whereas the XML output is 13.7 MB (14,415,665 bytes). Even after zip compression, the XML formatted dump is still one third larger than the plain SQL dump: 936 kB versus 695 kB.

Restoring XML output from mysqldump

A more serious problem is that MySQL doesn't seem to offer any tool to restore XML formatted dumps. The LOAD XML feature, kindly contributed by Erik Wetterberg could be used to some extent for this purpose. However, this feature is not yet available (it will be available in the upcoming version MySQL 5.5), and from what I can tell, it can only load data - not restore tables or databases. I also believe that this feature does not (yet) provide any way to properly restore hex-dumped BLOB data, but I really should test it to know for sure.

Anyway.

In between sessions of the past MySQL users conference I cobbled up an XSLT stylesheet that can convert mysqldump's XML output back to SQL script output. It is available under the LGPL license, and it is hosted on google code as the mysqldump-x-restore project. To get started, you need to download the mysqldump-xml-to-sql.xslt XSLT stylesheet. You also need a command line XSLT processor, like xsltproc. This utility is part of the Gnome libxslt project, and is included in packages for most linux distributions. There is a windows port available for which you can download the binaries.

Assuming that xsltproc is in your path, and the XML dump and the mysqldump-xml-to-sql.xslt are in the current working directory, you can use this command to convert the XML dump to SQL:

xsltproc mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
On Unix-based systems you should be able to directly pipline the SQL into mysql using

mysql -uroot -pmysql < `xsltproc mysqldump-xml-to-sql.xslt sakila.xml`
The stylesheet comes with a number of options, which can be set through xsltproc's --stringparam option. For example, setting the schema parameter to N will result in an SQL script that only contains DML statements:

xsltproc --stringparam schema N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
Setting the data option to N will result in an SQL script that only contains DDL statements:

xsltproc --stringparam data N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
. There are additional options to control how often a COMMIT should be issued, whether to add DROP statements, whether to generate single row INSERT statements, and to set the max_allowed_packet size.

What's next?

Nothing much really. I don't really recommend people to use mysqldump's XML output. I wrote mysqldump-x-restore for those people that inherited a bunch of XML formatted dumps, and don't know what to do with them. I haven't thouroughly tested it - please file a bug if you find one. If you actually think it's useful and you want more features, please let me know, and I'll look into it. I don't have much use for this myself, so if you have great ideas to move this forward, I'll let you have commit access.

That is all.

25 comments:

Erik said...

I'm running your script and all i get is a bunch of "Object is a string : " messages in the output.

Any ideas what might be wrong?

I'll do some more testing.

rpbouman said...

Erik, is that output from xsltproc? Please mail me the command line, a small test XML dump to reproduce and the output at roland.bouman@gmail.com

TIA, Roland.

Anonymous said...

Hi, I am getting following error

Unexpected data type .

Unexpected data type .

Unexpected data type .

rpbouman said...

@Anonymous, thanks! Strange, the name of the type should be in the message.

Can you send me the dump? Only the schema, no data please. send it to roland.bouman@gmail.com with some reasonable subject line and body text so you won't get round up by gmail spamfilter.

Anonymous said...

Hi Roland,
very nice, very informative, very competent site (not only this article). In particular this XML-Dump_back_to_SQL-Dump-XSLT hint is right what I was looking for (could have found it using a search engine but sometimes you errm... "don't find a single tree within the woods" as we use to say in Germany). I fully agree with your advice not to use XML-dumps for 1:1 backups (should be obvious... but XML sounds "good" at first, and it surely is, but not for this kind of purpose).
Also those other articles dealing with native implemented MySQL/XPath constructs are just gold! The point is that any of this may be found within the official MySQL docs - but it's nice to find it "rehashed" like this on a silver tablet.
So thanks for sharing all this (it happened I bookmarked this site twice already and I do not regret this :) ) !
Wolfram

rpbouman said...

Hi Wolfram!

thank you very, very much for your kind words! It's very rewarding to get this kind of feedback - much appreciated!

kind regards,

Roland.

Anonymous said...

The problem with using the XSLT file to transform the XML dump file to SQL is that it doesn't handle large files very well. I have an 877M XML dump file that I would like to convert. I ran out of memory on a linux box with 4G of physical memory + 6G of VM.

I was hoping the XSLT file would be better than using simplexml_load_file() in PHP (which has the same problem of opening the entire file in memory).

rpbouman said...

Hi Anonymous!

yes, most XSLT implementations first need to parse the XML and have some in-memory representation of the document to apply the stylesheet to. There is something called streaming XSLT, and I believe the latest release of Saxon implements it. However, I imagine not all types of documents and stylesheets lend themselves to this technique.

Personally, in your case, I would probably try to parse the XML using a SAX parser. To keep it simple I would probably try to use the SAX handlers to generate lines in CSV (or similar) format. You can then load the CSV with for example LOAD DATA INFILE.

If you don't want to create a custom program for this kind of thing, you could give Kettle (pentaho data integration) a go. It has a Load data from XML step, which may help here. Kettle doesn't like large XML files too (or actually, it handles them fine, but also requires the memory) however, it can use a file buffer to compensate (but of course, it costs performance). For more info on this approach, see:
http://wiki.pentaho.com/display/EAI/Get+Data+From+XML
extra info for handling large files:
http://wiki.pentaho.com/display/EAI/Get+Data+from+XML+-+Handling+Large+Files

steve said...

I have a xml that looks like this (this is what I have to work with) and all I get is a create statement for database "mediaDb". How can I get it to create tables and data rows?

http://www.4shared.com/document/B7FsPIwJ/mediaDb_Sample.html

rpbouman said...

Hi steve!

(next time, please use the issue tracker in the google code project to post issues. Thanks in advance)

I downloaded your file and I can confirm your result. But the structure of your xml is different from what I get when I use mysqldump with the -X option. What version of mysqldump did you create this dump with?

Mean while, I can cobble up something to make it work for your example, would you like that? But then it would be cool if you could test it and give me some feedback if it works. Ok?

rpbouman said...

Ok steve, I took a look and I think it can be done, but there's a huge challenge since the format you posted has zero information on data types.

Anyway, I just committed a new version (revision 4) which I hope will fix your problem. But please be aware that I don't see anyway to restore the table structures going by this format. I did a best guess to distinguish between numbers and strings, I hope this helps for now.

syamimi said...

Hi Roland,

can i know if there is any format for xml? I try to do the sqldump from my sql file but I got a syntax error..

rpbouman said...

syamimi,

this is not helpful. If you think you found a bug, please use the tracker on the google code project.

Please provide a description of what you did, and what happened, and - very important - any error messages you received. Would also be great if you could attach the input file, or even better, a small sample from it that still reproduces that error.

Thanks in advance, and kind regards,
Roland.

Anonymous said...

Hi Roland,

Your script literally saved the day for myself.

My task was to develop a script that would scramble confidential information such as personal identity numbers, names, and the like from a MySQL dump taken from production. When scrambled, this data would then be transferred to test environment for use.

The data contained both simple values in table columns such as 'personalIdentityNumber', and (for the most part) lots of xml in clob columns. The confidential fields that had to be mangled were identified from the dump file using regular expressions. This would have been very difficult against the traditional MySQL dump file. The xml, however, provided enough boiler plate to get reliable regexp matches. Indeed it works fine against simple columns such as

020372-XXXX

and fields containing xml such as

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<personalIdentityNumber>020372-XXXX</personalIdentityNumber>
...

But the problem was that we're running MySQL 5.1., which doesn't yet support xml import. And I couldn't figure out a schema-agnostic (or almost agnostic) way of doing this using MySQL's EXTRACTVALUE etc functions alone. So I was about to start writing the xslt, but luckily stumbled upon your blog first!

Thanks!

Toni

rpbouman said...

Hi Toni!

I'm glad the script works for you. But I didn't quite understand this statement:

"the problem was that we're running MySQL 5.1., which doesn't yet support xml import."

I believe 5.1 is the first version to introduce ExtractValue() and UpdateXML()

http://dev.mysql.com/doc/refman/5.1/en/xml-functions.html

Anonymous said...

Hi Roland,

Yes, EXTRACTVALUE is available in 5.1. Our database schema is still evolving rapidly, and a db import solution based on EXTRACTVALUE and stored procs would require updates to follow db schema evolution. This is much easier when operating on MySQL xml dump and a set of regular expression templates for the mangling. All I have to do is to provide a list of column names (which happen to be the same across several tables in my case) and xml tag names used, and a simple python script will create and process regular expressions (and invoke mapping functions for substitutions) out of templates such as:

"()([^<]+)()" (for simple values)

and

"(<{tagname}>)([^&]+)(</{tagname}>)" (for xml content)

BTW: I also came across this issue: http://code.google.com/p/mysqldump-x-restore/issues/detail?id=1 when running the script with larger datasets. In our case it was ok to simply substitute single quotes with double quotes, which seemed to work.

Third thing encountered was that MySQL's xml export is not able to deal with bit(1) datatypes properly. XML contained strange bytes in place of the bits: 0x0a01 for 1 and 0x0a00 for 0. Substituting these with 0 (0x0a30) and 1 (0x0a31) did the trick.

Toni

rpbouman said...

Hi Toni!

thanks for the feedback. Do you mean that MySQL generates XML that is not well-formed when outputting bit data type values? If that's the case, it's a bug and it should be reported.

Anonymous said...

Hi Roland!

In more detail, mysqldump seems to emit a byte 0x00 for "false" and 0x01 for "true" when column type is bit(1), at least when dumping in xml format. These are clearly nonprintable chars and the resulting file is no longer well-formed xml at least according to xmllint. Since MySQL allows use of 0 and 1 decimal digits in inserts to bit(1) columns, a simple tr replacement did the trick:

cat dump.xml | tr "\000" 0 | tr "\001" 1 > dump2.xml

bit(1) seems to be the default mapping for a JPA/Hibernate entities' boolean fields in MySQL.

I'll have a look at bugs.mysql.com and see if I can report the bug.

Toni

Anonymous said...

Roland,

Actually mysqldump flag --hex-blob also works for binary and varbinary datatypes (bit(1) for example), creating a hex-blob instead of breaking well-formedness of the xml. So I guess it's the same bug you mentioned in the post.

Toni

rpbouman said...

Hi Toni,

thanks for investigating and sharing the result. Yeah, I guess hex-blob really should be automatic and mandatory when dumping to xml. Good to know that it works for bit fields as well.

Lewis Thomas said...

Thanks so much Roland,

This was a great solution for me. I think the errors some received of "Unexpected data type" occur with errors parsing indexes and foreign keys. I received those errors which went away when my schema only included columns.

tekvax said...

I was upgrading mediawiki on a fedora box, and mistakenly saved the database as --xml with the mymysqldump program, and of course I noticed my mistake after I dropped all tables in the database! Too bad I was following their upgrade instructions without thinking what I was doing!

I tried your mysqldump-xml-to-sql.xslt file with the mediawiki database, and obtained the following output.

xsltproc mysqldump-xml-to-sql.xslt file.xml > file.sql
file.xml:22681: parser error : Char 0x0 out of allowed range
<field Field="ar_timestamp" Type="binary(14)" Null="NO" Key="" Default="
^
file.xml:22681: parser error : AttValue: ' expected
<field Field="ar_timestamp" Type="binary(14)" Null="NO" Key="" Default="
^
file.xml:22681: parser error : attributes construct error
<field Field="ar_timestamp" Type="binary(14)" Null="NO" Key="" Default="
^
file.xml:22681: parser error : Couldn't find end of Start Tag field line 22681
<field Field="ar_timestamp" Type="binary(14)" Null="NO" Key="" Default="
^
file.xml:22681: parser error : Premature end of data in tag table_structure line 22674
<field Field="ar_timestamp" Type="binary(14)" Null="NO" Key="" Default="
^
file.xml:22681: parser error : Premature end of data in tag database line 22673
<field Field="ar_timestamp" Type="binary(14)" Null="NO" Key="" Default="
^
file.xml:22681: parser error : Premature end of data in tag mysqldump line 2
<field Field="ar_timestamp" Type="binary(14)" Null="NO" Key="" Default="
^
unable to parse file.xml

I know this likely isn't very helpful, but given the fact that it happened to me while upgrading the mediawiki, and following their instructions, I'm sure it has happened to many others as well!

Thank you for you help. Just for the record, I was lucky enough to have a tarball backup of the entire system that was only a few days old, and was able to recover the database from the /var/lib/mysql directory in the tar backup, and only lost a *FEW* days of updates!

Live and learn!

Cheers,
Dan

rpbouman said...

Hi Dan!

"file.xml:22681: parser error : Char 0x0 out of allowed range" < looks like the dump is not valid xml? Can you check with a hex editor if the error message is true? If it is, a bug should be filed for the dump program.

tekvax said...

dumb question is that line 22681?
or character position 22681?

tekvax said...

Oddly enough, it appears that the 22681st line does in fact contain NULL characters...
My notepad++ would let me copy and paste them here, so I changed them from a to the word "NUL" just for clarity and example...
The database came from a working wikimedia installation.



Strange indeed!

Cheers,
Dan

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...