Friday, March 10, 2006

Importing XML data into MySQL using ExtractValue()

As of version 5.1.5, MySQL supports some functions for handling XML data. One can debate whether an RDBMS should ship with XML capabilities; I happen to like this feature.

Working with XML tends to be very data-oriented. In a lot of cases, only a trivial mapping is required between the relational data and the XML representation, and in those cases it is certainly convenient to have the tools that implement the mapping close at hand.

MySQL XML capabilities up to 5.0

In prior versions, MySQL XML support was limited to exporting data in XML format using the MySQL command line client. All you need to do is start the mysql client utility with the --xml or -X option.

(I'm using the sakila sample database throughout this article - be sure to install it if you want to execute the code yourself.)


D:\Data\MySQLXML>mysql -P 3307 -u root -pmysqlroot -D sakila -X
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 5.1.7-beta-nt-max

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select * from store;
<?xml version="1.0"?>

<resultset statement="select * from store">
<row>
<field name="store_id">1</field>
<field name="manager_staff_id">1</field>
<field name="address_id">1</field>
<field name="last_update">2006-02-14 21:57:12</field>
</row>

<row>
<field name="store_id">2</field>
<field name="manager_staff_id">2</field>
<field name="address_id">2</field>
<field name="last_update">2006-02-14 21:57:12</field>
</row>
</resultset>
2 rows in set (0.00 sec)

mysql>

This is quite a convenient method for quickly formatting query results as XML. It is especially suitable for ad-hoc query needs, but with some operating-system-level scripting and batch scheduling you can also set it up to emit XML formatted reports periodically.
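For example, on a Unix-like system, a crontab entry along the following lines would dump a fresh XML report every morning (just a sketch - the credentials, schedule, and output path are hypothetical, and note that % characters must be escaped in crontab entries):

# hypothetical daily report job (% must be escaped as \% in crontab)
0 6 * * * mysql -u report -preportpwd -D sakila -X -e "SELECT * FROM store" > /tmp/store_$(date +\%Y\%m\%d).xml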

Another tool you can use to generate XML from MySQL relational data is MyXML, a library of UDFs that lets you generate XML from queries directly.

(BTW - lots of you will know the way to this site already, but for those that don't: http://tangent.org/ hosts quite a large list of contributions to the open source community. Most of them are Brian Aker's, though you can spot some contributions that don't have Brian as first author. Anyway, it's got Apache modules, MySQL UDFs, Perl scripts, and feed utilities. Anyone wondering how to run a little Java or Perl from inside MySQL should check out this site first.)

A possible disadvantage is that the mapping from relational data to XML is fixed - there is no way to change the tag names, control whether data should be formatted as element content or as an attribute, set the namespace, etcetera.

MySQL XML capabilities as of 5.1.5

XML support in MySQL 5.1 has been extended with:

  • ExtractValue(): Query and Retrieve data from XML formatted text strings

  • UpdateXML(): Update data in XML formatted text strings


(These functions are described in detail here. There's also a little article that illustrates the usage of these functions.)

Both functions accept a string that represents a fragment of XML, and an XPath expression. The XPath expression acts as a query operator on the XML, identifying the nodes (that is, the elements, attributes, text contents, etc.) that the function needs to handle.
In the case of ExtractValue(), the text nodes specified by, or occurring within, the nodes identified by the XPath expression are returned:

mysql> select extractValue('<tag>text</tag>','tag') as `tag`
-> , extractValue('<tag>text</tag>','tag/text()') as `text()`
-> ;
+------+--------+
| tag  | text() |
+------+--------+
| text | text   |
+------+--------+
1 row in set (0.00 sec)

In the first call to ExtractValue(), the XPath expression tag selects the <tag>text</tag> element, but only the text content is returned. The XPath expression in the second call, tag/text(), explicitly specifies that the text content should be selected. The result is the same in both cases.
If multiple nodes are identified, the text contents are concatenated, separated by a single space character:

mysql> select extractValue('<tag>text1</tag><tag>text2</tag>','tag') as `tag`;
+-------------+
| tag         |
+-------------+
| text1 text2 |
+-------------+
1 row in set (0.00 sec)

In the case of UpdateXML(), the selected node is replaced by the third supplied argument.

(The documentation is not very clear about what should happen when the XPath expression matches multiple nodes. As far as I can see, UpdateXML() does not do anything in that case. Seems like a bug to me, or at least a gap in the documentation.)
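A minimal sketch of UpdateXML() in action (toy values, and an XPath expression that matches a single node):

mysql> select UpdateXML('<tag>old text</tag>','tag','<tag>new text</tag>') as `updated`;
+---------------------+
| updated             |
+---------------------+
| <tag>new text</tag> |
+---------------------+
1 row in set (0.00 sec)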

Importing XML using ExtractValue

Now that we know how to selectively extract data from an XML fragment, we can try to put it to good use. For instance, we can try to write a simple utility to import XML data back into the database.
Well, if we know the format of the XML documents that we want to import, this should be an easy task. Let's start simple and try to import an XML document like the one we generated in the introduction of this article:

(Of course, I'm not pretending that this is the preferred way to load data coming from MySQL back into the database. You can achieve much better performance and data quality by using the usual backup and restore utilities or the mysqlimport utility. This is all just an illustration of an approach to loading XML data using the built-in features; we're using this example because it is easy to reproduce.)


mysql -P 3307 -u root -pmysqlroot -D sakila -X -e "SELECT * FROM store" > store.xml

Now, we know we can write a single XPath expression for each field in both of the store records, and once we have the values, it's of course a piece of cake to write an insert around that:

mysql> use test;
Database changed
mysql> create table store like sakila.store;
Query OK, 0 rows affected (0.14 sec)

mysql> set @xml := load_file('D:/Data/MySQLXML/store.xml');
Query OK, 0 rows affected (0.00 sec)

mysql> insert into store values (
-> extractValue(@xml,'/resultset/row[1]/field[1]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[2]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[3]/text()')
-> , extractValue(@xml,'/resultset/row[1]/field[4]/text()')
-> ),(
-> extractValue(@xml,'/resultset/row[2]/field[1]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[2]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[3]/text()')
-> , extractValue(@xml,'/resultset/row[2]/field[4]/text()')
-> );
Query OK, 2 rows affected (0.06 sec)
Records: 2 Duplicates: 0 Warnings: 0


However, this is an unacceptable approach because it is limited to importing only the first two records. When a new sakila video store opens someday, we will not be able to import that third record from an XML file. Clearly, we need to be able to import all of the records. So, somehow, we need to loop through all the <row> child elements of the <resultset> document element.
Once we've established that we need a loop, we know that we can't keep on typing all our commands interactively. We need either a function or a stored procedure to hold one of the loop constructs, such as REPEAT or WHILE. Having said that, we realize immediately that if we are to use a loop, we must also decide on the condition that tells us when to exit the loop. Somehow, we need to detect whether there are still more records to import.
Well, we can do just that. XPath is not limited to just extracting values from an XML fragment. XPath can perform calculations too, yielding values that were never literally present in the source XML. XPath even knows some aggregate functions; this is what we'll be using here too. Consider the following procedure definition:

delimiter //

drop procedure if exists p_import_stores
//

create procedure p_import_stores(
    p_xml text
)
begin
    declare v_row_index int unsigned default 0;
    declare v_row_count int unsigned;
    declare v_xpath_row varchar(255);

    -- calculate the number of row elements.
    set v_row_count := extractValue(p_xml,'count(/resultset/row)');

    -- loop through all the row elements
    while v_row_index < v_row_count do
        set v_row_index := v_row_index + 1;
        set v_xpath_row := concat(
            '/resultset/row['
        ,   v_row_index
        ,   ']'
        );
        insert into store values (
            extractValue(p_xml,concat(v_xpath_row,'/field[1]/text()'))
        ,   extractValue(p_xml,concat(v_xpath_row,'/field[2]/text()'))
        ,   extractValue(p_xml,concat(v_xpath_row,'/field[3]/text()'))
        ,   extractValue(p_xml,concat(v_xpath_row,'/field[4]/text()'))
        );
    end while;
end;
//

First, we calculate the number of rows using the XPath function count(), and use that to drive a WHILE loop. Inside the loop, we increment v_row_index to keep track of the current <row> element, and use it to construct a new XPath expression, v_xpath_row, that refers to the corresponding element in the XML document. Then we perform the insert: to extract the column values, we simply append to v_xpath_row the partial XPath expressions that retrieve the text content of the <field> elements relative to their parent <row> element.
When we call the procedure, we can convince ourselves that it does indeed import all of the records:

mysql> delete from store;
-> //
Query OK, 2 rows affected (0.03 sec)

mysql> call p_import_stores(@xml)
-> //
Query OK, 1 row affected (0.07 sec)

mysql> select * from store;
-> //
+----------+------------------+------------+---------------------+
| store_id | manager_staff_id | address_id | last_update         |
+----------+------------------+------------+---------------------+
|        1 |                1 |          1 | 2006-02-14 21:57:12 |
|        2 |                2 |          2 | 2006-02-14 21:57:12 |
+----------+------------------+------------+---------------------+
2 rows in set (0.00 sec)

This is certainly a major improvement over our initial attempt: we've just made the import algorithm independent of the number of records present in the XML. We can improve this procedure a little more. For example, we can create a couple of 'constants' to hold element names or even entire XPath expressions, or we can extend the approach we took for the rows to the fields as well. If we introduce the PREPARE syntax, we can even have the procedure build one insert statement for multiple <row> elements, depending upon the total number of rows (this could be a significant performance booster). The PREPARE syntax would also allow one to choose the target table dynamically, leading to a generalized solution for importing XML into the MySQL database.
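To give an idea, the PREPARE-based variant could look something like this (just a sketch; v_table and v_values are hypothetical variables holding the target table name and the assembled list of value rows):

-- build one multi-row insert statement dynamically
set @sql := concat(
    'insert into ', v_table     -- hypothetical: target table name
,   ' values ', v_values        -- hypothetical: '(...),(...)' value list
);
prepare stmt from @sql;
execute stmt;
deallocate prepare stmt;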
However, there's at least one problem with this procedure that cannot be remedied using any of these approaches. The problem is that we really must import all of the <row> elements: we can't make a selection. This problem can be solved by allowing the XPath expression that selects the <row> elements to be parameterized:

delimiter //

drop procedure if exists p_import_stores
//

create procedure p_import_stores(
    p_xml text
,   p_xpath_row varchar(255)
)
begin
    declare v_row_index int unsigned default 0;
    declare v_row_count int unsigned;
    declare v_xpath_row varchar(255);

    -- calculate the number of row elements.
    set v_row_count := extractValue(
        p_xml
    ,   concat(
            'count('
        ,   p_xpath_row
        ,   ')'
        )
    );

    -- loop through all the row elements
    while v_row_index < v_row_count do
        set v_row_index := v_row_index + 1;
        set v_xpath_row := concat(
            p_xpath_row
        ,   '['
        ,   v_row_index
        ,   ']'
        );
        insert into store values (
            extractValue(p_xml,concat(v_xpath_row,'/field[1]/text()'))
        ,   extractValue(p_xml,concat(v_xpath_row,'/field[2]/text()'))
        ,   extractValue(p_xml,concat(v_xpath_row,'/field[3]/text()'))
        ,   extractValue(p_xml,concat(v_xpath_row,'/field[4]/text()'))
        );
    end while;
end;
//

Now we're in a position to use the XPath expression together with a predicate, that is, to add a (set of) criteria that selects only those rows that are of interest. In the example that follows, a predicate is used to select only the row where the store_id field is 1.

mysql> delete from store;
-> //
Query OK, 2 rows affected (0.04 sec)
mysql> call p_import_stores(@xml,'/resultset/row[field[@name="store_id"]=1]')
-> //
Query OK, 1 row affected (0.04 sec)

mysql> select * from store
-> //
+----------+------------------+------------+---------------------+
| store_id | manager_staff_id | address_id | last_update         |
+----------+------------------+------------+---------------------+
|        1 |                1 |          1 | 2006-02-14 21:57:12 |
+----------+------------------+------------+---------------------+
1 row in set (0.00 sec)

Of course, this approach can be used equally well for the field XPath expressions. Up to a point, this would make the import algorithm less dependent upon the particular element names. If we implemented the same trick for the field XPath expression, our procedure signature would look something like this:

create procedure p_import_stores(
    p_xml text
,   p_xpath_row varchar(255)
,   p_xpath_field varchar(255)
)
...

Now, suppose our XML were actually an XHTML document with a <table> element inside the <body>; we could then write:

call p_import_stores(
    p_xml
,   '/html/body/table/tbody/tr'
,   '/td'
);
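Inside the loop, the insert would then combine the row and field XPath parameters along these lines (a sketch, assuming the same loop structure as before):

-- v_xpath_row is assumed built as concat(p_xpath_row, '[', v_row_index, ']')
insert into store values (
    extractValue(p_xml,concat(v_xpath_row,p_xpath_field,'[1]/text()'))
,   extractValue(p_xml,concat(v_xpath_row,p_xpath_field,'[2]/text()'))
,   extractValue(p_xml,concat(v_xpath_row,p_xpath_field,'[3]/text()'))
,   extractValue(p_xml,concat(v_xpath_row,p_xpath_field,'[4]/text()'))
);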

Now, most people will probably agree that the import functionality gains usefulness with each of these improvements. However, a major weakness remains: we will always be dependent upon these list- or table-like structures. In a lot of cases this just isn't enough. What we need is to be able to pass the procedure some structure that contains all the information needed to define an arbitrary mapping between the XML source and the relational target data. This idea is not new - Microsoft uses it in MS SQL Server, employing so-called 'annotated XML schemas' to achieve it. (You can google that, or check out this site.)
Anyway, I think it should be possible to come up with something. Of course, the mapping itself would be passed to the procedure in some XML format too - at least, that's how I think it should be implemented. I've got a feeling that I will post a sequel to this post on quite short notice.

47 comments:

Anonymous said...

Thanks for this information. I have modified what you have done to import a different XML file. The only problem I am experiencing is that under one heading that is to be imported, the text is surrounded by paragraph markers (this blog doesn't allow p markers) and it does not import this. Any idea how to get around this? Using your example it would be like:
field xxxx text /field -- works
field xxxx p text /p /field -- does not work

thanks dennis

rpbouman said...

Hi Dennis,

(are you from Holland?) Can you please mail me your XML? You can find my email address in the profile (R_P_Bouman@hotmail.com)

Anonymous said...

Thank You Roland!
I am moving from SQL Server to MySQL and your post here has saved me hours of experimentation and frustration. Your sample worked virtually straight out of the box. Great stuff!

Alex said...

That's all nice, but it does not solve a more complicated problem:
loading (adding) new data to MySQL from outside XML sources, when data needs to be added to many related tables.
what I mean is:
we have a MySQL database with a group of related tables. Every day we have a new XML file which needs to be loaded into the database. XML nodes and attributes need to be parsed and foreign keys need to be retrieved, so we can't just use the simple approach described here. It would be nice to have an XML importer which could be configured for any table (and relation) structure. I don't want to write an XML importer myself - I think it's a pretty common task.

rpbouman said...

Alex, you are right that this solution will only be suitable for the simplest of problems.

Regarding your specific problem: you ought to take a look at Kettle. Kettle, a.k.a. Pentaho Data Integration (see www.pentaho.org), has pretty much all the functionality you need to do this (and much, much more). Kettle is free software, but requires Java.

Download kettle here: http://www.javaforge.com/displayDocument/Kettle-2.3.1.zip?doc_id=15911

BTW, a really generic XML to database importer is not trivial at all in my opinion. You can of course use XSLT for small files, but you really want a streaming XML solution - which is not at all easy to devise for the general case.

MS SQL has something like this using annotated XML schemas. However, my advice for now is to check out Kettle. I have used it for XML import. I still want to blog that, but I haven't had the time lately.

rpbouman said...

Ach, link does not display...oh well

Download kettle at:

http://www.javaforge.com/displayDocument/

Kettle-2.3.1.zip?doc_id=15911

or go to

http://www.pentaho.org/download/ga.php

Anonymous said...

I would like to know how to install MyXML or MyPerl after downloading?

thank u.

rpbouman said...

Hi Aruni,

instead of MyXML, please try lib_mysqludf_xql. It is easier to use and has more functionality.

Download it here:

http://www.xcdsql.org/MySQL/UDF/

from there, it is in:
lib_mysqludf_xql/lib_mysqludf_xql-0.9.2.tar.gz

For the MyPerl stuff, please ask Brian Aker. Contact Brian via email:

brian @ tangent.org

For all this stuff you are going to have to compile it yourself. Unzip, and do:

./configure

to get you started.

Hans Stevens said...

Hello Roland,

The post is already quite dated ,
but it's nice anyway.

I'm having serious performance issues when trying to import an XML file of 3 MB using more or less the same script as in your post.
Is this a known issue?

Imho, 3 MB is not that big, so I was wondering whether this approach is just for really small XML files? Maybe my script is suffering from a bottleneck I overlooked.

Currently I'm using the 5.1.22rc version of MySQL

Thanks in advance!
Regards
Hans

rpbouman said...

Hi Hans!

"I'm having serious performance issues when trying to import an XML file of 3Mb using more or less the same script as in your post.
Is this a know issue??"

I never tried it, but I would not be surprised. Maybe you ought to take a look at the community contributed LOAD XML INFILE patch:

http://dev.mysql.com/doc/refman/6.0/en/load-xml.html

Another viable way is to use an ETL tool like kettle (a.k.a. Pentaho data integration, http://www.pentaho.com/launch/pdi/), which is probably what I would do.

Let me know if you need more help with this.

Kind regards,

Roland

Travis Staloch said...

Another way that you can import XML data into a MySQL database is to use SportsFeed Xml Processor, which is a cross-platform service application. You can use it to automate the process of fetching remote XML data and saving portions of the data to disk or database.

Kartikeya said...

Hi Roland,
I am using MySql version 5.0.51a.
Is it possible to import an XML file into a table? I know there are many external tools for this, but I was wondering if it is possible to do it directly from MySQL.

Thanks,
Kartikeya

rpbouman said...

Hi Kartikeya,

In MySQL 5.0 and below, it is not exactly straightforward. I mean, I never tried it, but I suppose you could use LOAD_FILE() to import the file proper, and then use a stored procedure to parse and load the data.

However, I really believe this is not a very good solution. I think that an external tool will get the job done faster, more reliably, and without the burden of developing your own parser.

Unknown said...

Hi Roland,
I am using MySQL version 5.1.
I am getting Error message Code : 1064 at extractvalue line

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '(p_xml,concat(v_xpath_row,'/field[1]/text()'))
-- ' extract(p' at line 15
(0 ms taken)

Can you please suggest me.

Thanks in advance

Jegadeesan Ramu

rpbouman said...

Hi Jagadeesan,

What is your code? I can't help if you don't post the code.

Anonymous said...

Hi, I would like to know: can the code above be used to import a packet capture file which is in .xml?

Anonymous said...

Thanks man, This has helped me.

Anonymous said...

Two things here

1) really nice article. You helped me figure out how to do what I needed to do. THANKS!!!!

2) Is MySQL an actual database system? I say this because what kind of idiots would give you functionality like this where the only way you can return a result set is through a loop? Databases work best on set-based operations. So why would it not return a result SET!?!?!?

rpbouman said...

Hi Anonymous,

re 1) thanks! I'm glad this post was useful for you.

re 2) In my opinion, MySQL is in fact an RDBMS, but that said, you aren't the first one to question that. However, more to the point, I like your suggestion of having some device (not necessarily a function) that would allow you to treat the result of an XPath query as a result set. But I think the ExtractValue function is modelled after the Oracle function of the same name (which works somewhat differently, but also returns a scalar value, not a result set). So the short answer is: ExtractValue does exactly what it suggests, it extracts a -scalar- value, not a result set.

Perhaps you should file a feature request at bugs.mysql.com.

Kind regards,

Roland

Anonymous said...

From a performance standpoint, when doing multiple inserts, does it make more sense to drop the information into temporary tables and then do a single insert into the table, or does it really matter?

Note that I am looking at this from an order detail standpoint, so it would get called repeatedly.

rpbouman said...

Hi Anonymous,

"From a performance standpoint of doing multiple inserts does it make more sense to drop the information into temporary tables and the do a single insert into the table, or does it really matter?"

This is a good question! (Generally that means I only have bad answers :)

My hunch is that yes, you could gain some performance if you could somehow bunch up all those inserts. But there are a few things to consider.

- Disk: if you are using MyISAM tables or InnoDB with autocommit enabled, every INSERT statement will hit the disk, which is slow. If you are using InnoDB, I would advise disabling autocommit and doing an explicit COMMIT after processing the XML (see the sketch after these two points). Using temporary tables like you suggest could gain you some performance if you use MEMORY temporary tables.

- Even if you do use MEMORY temporary tables, you still get some overhead because you're doing it repeatedly. For example, each INSERT will cause a table lock, which takes some time. What might be faster is not storing the data in tables, but in a string as a SQL INSERT script. So what you'd do is generate one big INSERT INTO table VALUES (...), ..., (...) statement, and execute that once dynamically.
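To illustrate the autocommit point from the first item, the calling session could do something like this (a sketch, reusing the procedure from the article):

set autocommit = 0;          -- batch the inserts in one transaction
call p_import_stores(@xml);
commit;                      -- hit the disk once
set autocommit = 1;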

Anyway these are just a few thoughts - I haven't tested it but I might give it a quick go because I'm rather curious myself.

Thanks,

Roland.

Anonymous said...

If you test it, please pass along the results as I would be interested in knowing as well being as I am rather new to MySQL.

Anonymous said...

Roland,

If you did the insert dynamically as stated, would that not leave you open to SQL Injection Attacks?

Thanks

rpbouman said...

Anonymous, yes - SQL injection should always be a concern. And, if you desire to parameterize the XPath expression, there is the additional concern of XPath injection.

However, this article is simply showing the basics of working with ExtractValue function. Adding parameters or escaping would not be conducive to getting the point across.

I probably should have added a line to mention that these concerns are valid but unaddressed here. Thanks for pointing it out.

Anonymous said...

How to implement this with Bugzilla?

rpbouman said...

Anonymous, I don't understand the question. How exactly does Bugzilla relate to this at all?

Anonymous said...

Roland,

I am currently using your example for testing a new method for doing insert into a table. However, I am running into an issue with collations. I realize it is probably something on my server, but I cannot seem to figure it out.

You have the two statements below.

set v_row_count := extractValue(p_xml,'count(/resultset/row)');
extractValue(p_xml,concat(v_xpath_row,'/field[1]/text()'))

the top statement is returning as a utf8_general_ci while the bottom statement is returning as a utf8_unicode_ci.

Did you ever run across this during your testing, and if so, how did you correct it?

Thanks,

rpbouman said...

Hi Anonymous,

no, I don't recall having any collation issues.

From what I can tell, the return value of extractValue() has the same collation as that of the first argument. So if there is any problem later on comparing the values, I should think that that would be caused by the (implicitly or explicitly) declared collation of the variables that you use to store the values in.

I probably would try to explicitly declare my variables as having a specific unicode collation. For info on choosing the right unicode collation, read http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html
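For example, something like this among the procedure's declarations (a sketch - pick the character set and collation that actually match your data):

-- explicit character set and collation on the declaration
declare v_value varchar(255) character set utf8 collate utf8_unicode_ci;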


good luck,

Roland.

Anonymous said...

Hi Roland,

I am using the Pentaho Data Integration tool for reporting purposes. I am facing a problem when transforming a dynamically generated XML into table output.



The scenario is: transformation of 3 files into the DB. If the node is 'a', its value should be inserted into the 'id' field; if the node name is != 'a', it should be stored in the table as the 'Name' and 'Value' fields.

1.xml

Content

Quote:

1
b content


2.xml

Content

Quote:

1
c content



3.xml

Content

Quote:

1
d content



I need to write a transformation that parses through these 3 files 1.xml , 2.xml and 3.xml


and produces a result of table containing


DB TABLE

ID | Name | Value
_______________________________
1 | b | b content
1 | c | c content
1 | d | d content


Any Suggestions ?

Thanks,
Arthy

rpbouman said...

Hi Arthy!

yes, you can use pentaho data integration to import xml, in fact, I would recommend that method over using the MySQL functions I described in this article.

With PDI, you can use the XML input step to parse XML using an XPath expression. With the row denormalizer step you can group key/value pairs into a record. Whatever you can't solve with those steps, you can probably solve with a Javascript step and/or regexes. I have a practical example of XML import and export in my 2009 OSCON presentation (for details, slides, and code samples, go to http://rpbouman.blogspot.com/2009/07/oscon-2009-presentation-taming-your.html and follow the links to the OSCON site).

Now, I read your comment, and I suspect some info got lost in the blogger commenting system. I'd be happy to take a look if you could send your files to me by email (Roland.Bouman@gmail.com).

regards, Roland Bouman

Unknown said...

You've got to be kidding.
I have megabytes of XML files sitting here, dumped from a client, that I need to import into a database. These XML files contain the ENTIRE database schema. You mean MySQL 5.1+ DOESN'T have a function that can simply import the XML files and create the db/tables/columns/data?

rpbouman said...

James, I am not kidding. The feature is available in MySQL 5.1. There is a LOAD XML feature available in the latest beta, MySQL 5.5, though:
http://dev.mysql.com/doc/refman/5.5/en/load-xml.html

If you have megabytes of XML to import, I advise either one of the following approaches:

1) Write an XSLT stylesheet to transform your XML to SQL.
2) Use kettle, a.k.a. Pentaho data integration. It comes with examples that show how to load XML, and you can dump that data directly into the database
3) write a custom script in php/perl/python etc.

If you like, I can help you cobble up something for approach 1.

rpbouman said...

ugh...I meant to write "loading XML is *not* available in MySQL 5.1 (or lower for that matter)."

rpbouman said...

James, I just completed some stuff that should help you use your XML formatted MySQL dumps. You can find background information, as well as a link to the downloads and instructions here:

http://rpbouman.blogspot.com/2010/04/restoring-xml-formatted-mysql-dumps.html

I hope this helps.

kind regards,

Roland

Anonymous said...

Fantastic this has just saved me
Thanks

andry said...

XML is a very interesting language to use, and it contains the data object model, abbreviated DOM. A very good tutorial, and hopefully it can help me in building a web application. Thanks!

Raptorak said...

This is a pretty old post, but hopefully someone will read this... what method would one use to go about importing a very large XML file (such as 500-600 MB) into a database?

rpbouman said...
This comment has been removed by the author.
rpbouman said...

@Raptorak,

You should probably try the Streaming XML Input step in kettle (aka Pentaho Data Integration, an open source ETL tool; download at http://sourceforge.net/projects/pentaho/files/Data%20Integration/). See: http://wiki.pentaho.com/display/EAI/XML+Input+Stream+%28StAX%29

Alternatively, you could write a dedicated import program based on a SAX parser. (SAX is also the technique used by Kettle's streaming XML input step.)

Greg Tompkins said...

Hello,

I am trying to use this example in what I am doing. I am kind of a newbie with stored procedures in MySQL. I have this XML file, which looks like this, that I would like to import into MySQL:


001/22/2012 21:50:581562181189
001/22/2012 21:50:571562181189


I want the script to create a table called temp in the database ted5000

But I can't even get past the first part of my script:

mysql> create procedure p_temp(p_xml text);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1

Here is what I have as my sp so far:

delimiter //

drop procedure if exists p_temp
//

create procedure p_temp(p_xml text)
begin
declare v_row_index int unsigned default 0;
declare v_row_count int unsigned;
declare v_xpath_row varchar(255);

--calculate the number of row elements.
set v_row_count := extractValue(p_xml,'count(/History/SECOND)');

-- loop through all the row elements
while v_row_index < v_row_count do
set v_row_index := v_row_index + 1;
set v_xpath_row := concat(
'/History/SECOND['
, v_row_index
, ']'
);
insert into temp values (
extractValue(p_xml,concat(v_xpath_row,'/SECOND[1]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/SECOND[2]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/SECOND[3]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/SECOND[4]/text()'))
, extractValue(p_xml,concat(v_xpath_row,'/SECOND[5]/text()'))
);
end while;
end;
//

If you could offer me a hand, I am kind of struggling here. Thank You!

Serge said...

Hi,

Can someone help me with this. I am trying to get the results from an xml string into mysql table.

Here is a sample of the xml :





I am trying to retrieve the CountryCodeID, Canada, countryCode_2, but i am currently stuck.

Thanks

rpbouman said...

Serge, I know it's retarded, but blogger comments don't escape XML. Please post again, but escape the XML yourself so that

<tagname> becomes &lt;tagname&gt;

Anonymous said...

Thanks very much for posting this. 8 years later it's helped me import my first XML file ;)

Unknown said...

I have a form at http://www.sledor.talk4fun.net/results/results.html which shows results if a proper roll number is entered... the results are fetched from the http://results.vtu.ac.in site and only the table is shown... now I want to add the data of the table to my mysql database.

Is there a way to do it..?

for example if the roll number entered is 1db13cscs006

the page will show the results as
Semester: 8 Result: FIRST CLASS WITH DISTINCTION

Subject External Internal Total Result
Software Architectures (10IS81) 67 25 92 P
System Modeling and Simulation (10CS82) 73 24 97 P
Software Testing (10CS842) 48 24 72 P
Information and Network Security (10CS835) 64 25 89 P
Project Work (10CS85) 97 95 192 P
Seminar (10CS86) 0 47 47 P


Total Marks: 589


now I want the data from this table to be stored in my database, so that if the same roll number is searched again, the result is shown from my database and not from http://results.vtu.ac.in... help me out pls :)

rpbouman said...

Punith, this sounds like a really specific request. Please go to the mysql forums, or stackoverflow and post there.

thank you.

simplyme said...

Hi, do I need to change the original XML file to have the "resultset" tag? Thanks!
