ExtractValue()
. The reason I'm revisiting the subject now is that I recently received an email request in relation to my old blog post:I came across one of your blogs on importing XML data to MySQL using ExtractData() and I am trying to do the same using MySQL (5.5) database. However, I am new to this kind of method and would like to seek your expertise on this. I have an XML file that looks like this:Now, it is worth pointing out that it is totally feasible to get this task done, relying only on MySQL built-ins. This is in fact described in my original blog post. The steps are:<?xml version="1.0" encoding="UTF-8"?> <wovoml xmlns="http://www.wovodat.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.1.0" xsi:schemaLocation="http://www.wovodat.org phread2.xsd"> <Data> <Seismic> <SingleStationEventDataset> <SingleStationEvent code="VTAG_20160405000000" owner1="169" pubDate="2018-04-05 00:00:00" station="101"> <startTime>2016-04-05 00:00:00</startTime> <startTimeCsec>0</startTimeCsec> <startTimeUnc>0000-00-00 00:00:00</startTimeUnc> <startTimeCsecUnc>0</startTimeCsecUnc> <picksDetermination>H</picksDetermination> <SPInterval>12.5</SPInterval> <duration>50</duration> <durationUnc>0</durationUnc> <distActiveVent>0</distActiveVent> <maxAmplitude>1532.3</maxAmplitude> <sampleRate>0.01</sampleRate> <earthquakeType>TQ</earthquakeType> </SingleStationEvent> <SingleStationEvent code="VTAG_20160406000000" owner1="169" pubDate="2018-04-06 00:00:00" station="101"> <startTime>2016-04-06 00:00:00</startTime> <startTimeCsec>0</startTimeCsec> <startTimeUnc>0000-00-00 00:00:01</startTimeUnc> <startTimeCsecUnc>0</startTimeCsecUnc> <picksDetermination>H</picksDetermination> <SPInterval>5.2</SPInterval> <duration>36</duration> <durationUnc>0</durationUnc> <distActiveVent>0</distActiveVent> <maxAmplitude>9435.1</maxAmplitude> <sampleRate>0.01</sampleRate> <earthquakeType>HFVQ(LT)</earthquakeType> </SingleStationEvent> <SingleStationEvent code="VTAG_20160407000000" owner1="169" pubDate="2018-04-07 00:00:00" station="101"> <startTime>2016-04-07 00:00:00</startTime> <startTimeCsec>0</startTimeCsec> <startTimeUnc>0000-00-00 00:00:02</startTimeUnc> <startTimeCsecUnc>0</startTimeCsecUnc> <picksDetermination>H</picksDetermination> <SPInterval>2.3</SPInterval> <duration>19</duration> <durationUnc>0</durationUnc> <distActiveVent>0</distActiveVent> <maxAmplitude>549.3</maxAmplitude> <sampleRate>0.01</sampleRate> <earthquakeType>HFVQ(S)</earthquakeType> </SingleStationEvent> </SingleStationEventDataset> </Seismic> </Data> </wovoml>And my table is:CREATE TABLE IF NOT EXISTS `sd_evs` ( `sd_evs_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID', `sd_evs_code` varchar(30) NOT NULL COMMENT 'Code', `ss_id` mediumint(8) unsigned DEFAULT NULL COMMENT 'Seismic station ID', `sd_evs_time` datetime DEFAULT NULL COMMENT 'Start time', `sd_evs_time_ms` decimal(2,2) DEFAULT NULL COMMENT 'Centisecond precision for start time', `sd_evs_time_unc` datetime DEFAULT NULL COMMENT 'Start time uncertainty', `sd_evs_time_unc_ms` decimal(2,2) DEFAULT NULL COMMENT 'Centisecond precision for uncertainty in start time', `sd_evs_picks` enum('A','R','H','U') DEFAULT NULL COMMENT 'Determination of picks: A=Automatic picker, R=Ruler, H=Human using a computer-based picker, U=Unknown', `sd_evs_spint` float DEFAULT NULL COMMENT 'S-P interval', `sd_evs_dur` float DEFAULT NULL COMMENT 'Duration', `sd_evs_dur_unc` float DEFAULT NULL COMMENT 'Duration uncertainty', `sd_evs_dist_actven` float DEFAULT NULL COMMENT 'Distance from active vent', `sd_evs_maxamptrac` float DEFAULT NULL COMMENT 'Maximum amplitude of trace', `sd_evs_samp` float DEFAULT NULL COMMENT 'Sampling rate', `sd_evs_eqtype` enum('TQ','HFVQ(LT)','HFVQ(S)','LFVQ(SX)','SDH(HF)','SDH(LF)','H','E','tele','LFVQ(X)','HFVQ') DEFAULT NULL COMMENT 'The WOVOdat terminology for the earthquake type', `cc_id` smallint(5) unsigned DEFAULT NULL COMMENT 'Collector ID', `cc_id2` smallint(5) unsigned DEFAULT NULL COMMENT 'Owner 2 ID', `cc_id3` smallint(5) unsigned DEFAULT NULL COMMENT 'Owner 3 ID', `sd_evs_loaddate` datetime DEFAULT NULL COMMENT 'Load date', `sd_evs_pubdate` datetime DEFAULT NULL COMMENT 'Publish date', `cc_id_load` smallint(5) unsigned DEFAULT NULL COMMENT 'Loader ID', `cb_ids` varchar(255) DEFAULT NULL COMMENT 'List of cb_ids (linking to cb.cb_id) separated by a comma', PRIMARY KEY (`sd_evs_id`), UNIQUE KEY `sd_evs_code` (`sd_evs_code`), KEY `OWNER 1` (`cc_id`), KEY `OWNER 2` (`cc_id2`), KEY `OWNER 3` (`cc_id3`), KEY `STATION` (`ss_id`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COMMENT='Seismic event data from a single station' AUTO_INCREMENT=359143 ;I am using Linux command line to perform such tasks as I need to do an automation of extracting data and importing to MySQL database. How do I extract attributes and elements values from XML file in such a way that the following fields contain the following values in the table?
Attribute Field/Column code sd_evs_code owner1 cc_id pubDate sd_evs_pubdate station ss_id Element Field/Column startTime sd_evs_time startTimeCsec sd_evs_time_ms startTimeUnc sd_evs_time_unc startTimeCsecUnc sd_evs_time_unc_ms picksDetermination sd_evs_picks SPInterval sd_evs_spint duration sd_evs_dur durationUnc sd_evs_dur_unc distActiveVent sd_evs_dist_actven maxAmplitude sd_evs_maxamptrac sampleRate sd_evs_samp earthquakeType sd_evs_eqtype
- Identify the xpath expression that identifies those elements that correspond to database rows. In this example, that xpath expression is
/wovoml/Data/Seismic/SingleStationEventDataset/SingleStationEvent
. This is an xpath path expression, which in this case is simply the tag names of the elements in the XML document, separated by a forward slash character (/
). - Use MySQL's
ExtractValue()
function to count the number of elements that are to be loaded as rows. This looks something likeCAST(EXTRACTVALUE(v_xml, CONCAT('count(', v_xpath, ')') AS UNSIGNED)
, wherev_xml
is a string that contains our XML document, andv_xpath
a string that contains the xpath expression we just identified. The xpathcount()
function accepts a path expression and returns the number of elements identified by that path as an integer. MySQL'sEXTRACTVALUE
however is unaware of the type of the xpath result, so therefore we explicitly convert it into an integer value using MySQL'sCAST()
function. - Once we have the number of elements, we can setup a loop (using either
LOOP
,WHILE
, orREPEAT
) - Inside the loop, we can construct a new xpath expression based on our original xpath expression and the loop index, like so
CONCAT(v_xpath, '[', v_row_index, ']')
. The bit between the square brackets is called a xpath predicate. Using only an integer value as predicate will retrieve the element located at that index position. - Also inside the loop, we identify the xpath expression relative to the current row element for each field item we'd like to extract. So, for example, the
code
attribute of the current<SingleStationEvent>
element is to be used to load thesd_evs_code
column, and to extract that item we can write a call likeCAST(EXTRACTVALUE(v_xml, CONCAT(v_xpath, '[', v_row_index, ']/@code')) AS UNSIGNED)
. The@
is an xpath shorthand to indicate an attribute at whatever element is identified by the preceding path expression. Instead of extracting an attribute, we could also append a tagname; for exampleCAST(EXTRACTVALUE(v_xml, CONCAT(v_xpath, '[', v_row_index, ']/startTime')) AS DATETIME)
could be used to extract the value of the<startTime>
child element as a MySQL datetime value. - Still inside the loop, we can now assign the extracted values to our columns - either by inserting them directly for each run of the loop, or by writing either the expressions themselves, or the resulting values of these extractions to a SQL statement text, which could then be executed later as a whole.
The Stored Procedure p_load_xml()
You can find the procedure on github. Its signature is:PROCEDURE p_load_xml(IN p_spec text, IN p_xml text)
p_spec
is an XML document that specifies the mapping from XML document to the table and its columns.p_xml
is the actual XML document containing the data that is to be loaded.
The Mapping document
The format of the mapping document that is to be passed to thep_spec
parameter is best explained using an example. To achieve the load expressed in the request I received, I'm using this:<x show="true" exec="false" table="sd_evs" xpath="/wovoml/Data/Seismic/SingleStationEventDataset/SingleStationEvent"> <x column="sd_evs_code" xpath="@code" expression=""/> <x column="ss_id" xpath="@station" expression="cast(ss_id as unsigned)"/> <x column="sd_evs_time" xpath="startTime" expression=""/> <x column="sd_evs_time_ms" xpath="startTimeCsec" expression=""/> <x column="sd_evs_time_unc" xpath="startTimeUnc" expression=""/> <x column="sd_evs_time_unc_ms" xpath="startTimeCsecUnc" expression=""/> <x column="sd_evs_picks" xpath="picksDetermination" expression=""/> <x column="sd_evs_spint" xpath="SPInterval" expression=""/> <x column="sd_evs_dur" xpath="duration" expression=""/> <x column="sd_evs_dur_unc" xpath="durationUnc" expression=""/> <x column="sd_evs_dist_actven" xpath="distActiveVent" expression=""/> <x column="sd_evs_maxamptrac" xpath="maxAmplitude" expression=""/> <x column="sd_evs_samp" xpath="sampleRate" expression=""/> <x column="sd_evs_eqtype" xpath="earthquakeType" expression=""/> <x column="cc_id" xpath="@owner1" expression=""/> <x column="sd_evs_pubdate" xpath="@pubDate" expression=""/> </x>
The document element
- document element
- The top level element (the document element) of the spec defines which table is to be loaded, as well as what expression is to be used to identify those elements that are loaded as rows. The tagname of the element can be any valid XML element name. In my example above I use
x
but it really doesn't matter. table
-attribute- This mandatory attribute holds the name of the target table
schema
-attribute- This optional attribute can be used to specify the schema wherein the table resides. If present it will be used to qualify the table name. If not specified, an unqualified table name will be used.
xpath
-attribute- The
xpath
-attribute on the document element should specify the xpath path expression that identifies all elements that are to be loaded as a row into the table.
show
-attribute- This optional attribute can be used to show the SQL that is generated to load the data into the table. If its value is
"true"
the procedure will return the generated SQL as a resultset. If not present, or anything other than"true"
, no resultset is returned by the procedure. exec
-attribute- This optional attribute can be used to suppress whether the generated SQL to load the table will be executed. If its value is
"false"
, execution of the generated SQL will be surpressed. If omitted, or if it has any other value, the SQL will be executed automatically, thus actually loading the table.
Child elements
- child element
- The document element can have any number of child elements, and each specifies how to extract a value from the current row element, and which column to load it into. Again, the tagname is not significant as long as it is a valid XML element name - the attributes of the child elements specify the behavior.
xpath
-attribute- Here you must specify a (mandatory) xpath-expression relative to the current row element, and this is what will be used as argument to
EXTRACTVALUE()
to extract a value from the XML data document. column
-attribute- Here you must specify a name for the extract. Normally, this will be the name of a column of the table defined in the document element, and the value extracted by the xpath expression will be loaded into that column. The name given here can also be used to refer to the value of the extract in case a child element has an
expression
-attribute. For more details, read up on theexpression
-attribute. expression
-attribute- Sometimes, the raw extract is not suitable for loading directly into a column. In those cases, you can specify a SQL column expression in the
expression
-attribute. In these SQL expressions, you can refer to the extracts defined by any other child element simply by using the name defined in the correspondingcolumn
-attribute. exclude
-attribute- Sometimes, a particular extract does not need to be loaded into the table at all, although it may be required by an expression defined by some other child-element. In this case, you can add a
exclude
-attribute with a value of"true"
. In this case, the extract will be available and can be referenced by the name in thecolumn
-attribute of that child element, but the extract itself will not be loaded into a column.
SQL generation
The following SQL statement is generated for the example XML document and spec:INSERT INTO `sd_evs` ( `sd_evs_code` , `ss_id` , `sd_evs_time` , `sd_evs_time_ms` , `sd_evs_time_unc` , `sd_evs_time_unc_ms` , `sd_evs_picks` , `sd_evs_spint` , `sd_evs_dur` , `sd_evs_dur_unc` , `sd_evs_dist_actven` , `sd_evs_maxamptrac` , `sd_evs_samp` , `sd_evs_eqtype` , `cc_id` , `sd_evs_pubdate` ) SELECT `sd_evs_code` , cast(ss_id as unsigned) , `sd_evs_time` , `sd_evs_time_ms` , `sd_evs_time_unc` , `sd_evs_time_unc_ms` , `sd_evs_picks` , `sd_evs_spint` , `sd_evs_dur` , `sd_evs_dur_unc` , `sd_evs_dist_actven` , `sd_evs_maxamptrac` , `sd_evs_samp` , `sd_evs_eqtype` , `cc_id` , `sd_evs_pubdate` FROM ( SELECT 'VTAG_20160405000000' AS `sd_evs_code` , '101' AS `ss_id` , '2016-04-05 00:00:00' AS `sd_evs_time` , '0' AS `sd_evs_time_ms` , '0000-00-00 00:00:00' AS `sd_evs_time_unc` , '0' AS `sd_evs_time_unc_ms` , 'H' AS `sd_evs_picks` , '12.5' AS `sd_evs_spint` , '50' AS `sd_evs_dur` , '0' AS `sd_evs_dur_unc` , '0' AS `sd_evs_dist_actven` , '1532.3' AS `sd_evs_maxamptrac` , '0.01' AS `sd_evs_samp` , 'TQ' AS `sd_evs_eqtype` , '169' AS `cc_id` , '2018-04-05 00:00:00' AS `sd_evs_pubdate` UNION ALL SELECT 'VTAG_20160406000000' , '101' , '2016-04-06 00:00:00' , '0' , '0000-00-00 00:00:01' , '0' , 'H' , '5.2' , '36' , '0' , '0' , '9435.1' , '0.01' , 'HFVQ(LT)' , '169' , '2018-04-06 00:00:00' UNION ALL SELECT 'VTAG_20160407000000' , '101' , '2016-04-07 00:00:00' , '0' , '0000-00-00 00:00:02' , '0' , 'H' , '2.3' , '19' , '0' , '0' , '549.3' , '0.01' , 'HFVQ(S)' , '169' , '2018-04-07 00:00:00' ) AS derivedNote how the raw extracted values end up in the inner most
SELECT
statement: for each element identified by the xpath
-element on the document element, there is one SELECT
statement, which are united by the UNION ALL
operators. The column-expressions of these individual SELECT
-expressions that form the legs of the union are the results from applying EXTRACTVALUE()
on the xml document, using the combination of the xpath
-attributes of the document element and its child elements. These extracts are assigned the name specified in the column
-attribute using a SQL alias.
Note how the outer
SELECT
-statement selects the non-excluded columns. For the ss_id
target column, you can see how an SQL-expression specified in the corresponding child-element in the spec is applied to the values selected by the inner SELECT
-statement.
Possible Improvements
The current version ofp_load_xml()
is a very crude and straightforward implementation. It should do the job, but not much else. I do have a list of things that could be improved in the future: - Sensible error messages in case there are errors parsing the specification document
- Automatic type checking and conversion. The idea is that if the target table exists we could use the information schema to find out about the column data types, and wrap the value in a
CAST()
orCONVERT()
expression and explicitly catch any type errors, rather than postponing this until the actual data load. - Paging / Chunking. It would perhaps be nice if one could control how many statements are generated, rather than generating just one big
UNION ALL
. - Stored Routine generation. It might be nice to be able to generate a stored routine based only on the spec document, which can then be used and re-used afterwards to load xml documents that conform to that spec.
- Currently, only
INSERT
is supported. It would be nice to be able to generate SQL forINSERT ... ON DUPLICATE KEY UPDATE
,UPDATE
,DELETE
. Maybe even adjust the spec to allow formulating a condition that determines what action to perform.
Finally
You can freely use and distribute this code. If you find a bug, or have a feature request, please use github issues and pull requests to contribute. Your feedback is much appreciated!Also, now that MySQL 5.7 has all these JSON-functions, maybe something similar should be built for JSON documents? Let me know.