Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Thursday, October 22, 2015

Loading Arbitary XML documents into MySQL tables with p_load_xml

Many years ago, I wrote about importing XML data into MySQL using ExtractValue(). The reason I'm revisiting the subject now is that I recently received an email request in relation to my old blog post:
I came across one of your blogs on importing XML data to MySQL using ExtractData() and I am trying to do the same using MySQL (5.5) database. However, I am new to this kind of method and would like to seek your expertise on this. I have an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
 <wovoml xmlns="http://www.wovodat.org"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 version="1.1.0" xsi:schemaLocation="http://www.wovodat.org phread2.xsd">
 <Data>
 <Seismic>
 <SingleStationEventDataset>
  <SingleStationEvent code="VTAG_20160405000000" owner1="169" pubDate="2018-04-05 00:00:00" station="101">
   <startTime>2016-04-05 00:00:00</startTime>
   <startTimeCsec>0</startTimeCsec>
   <startTimeUnc>0000-00-00 00:00:00</startTimeUnc>
   <startTimeCsecUnc>0</startTimeCsecUnc>
   <picksDetermination>H</picksDetermination>
   <SPInterval>12.5</SPInterval>
   <duration>50</duration>
   <durationUnc>0</durationUnc>
   <distActiveVent>0</distActiveVent>
   <maxAmplitude>1532.3</maxAmplitude>
   <sampleRate>0.01</sampleRate>
   <earthquakeType>TQ</earthquakeType>
  </SingleStationEvent>
  <SingleStationEvent code="VTAG_20160406000000" owner1="169" pubDate="2018-04-06 00:00:00" station="101">
   <startTime>2016-04-06 00:00:00</startTime>
   <startTimeCsec>0</startTimeCsec>
   <startTimeUnc>0000-00-00 00:00:01</startTimeUnc>
   <startTimeCsecUnc>0</startTimeCsecUnc>
   <picksDetermination>H</picksDetermination>
   <SPInterval>5.2</SPInterval>
   <duration>36</duration>
   <durationUnc>0</durationUnc>
   <distActiveVent>0</distActiveVent>
   <maxAmplitude>9435.1</maxAmplitude>
   <sampleRate>0.01</sampleRate>
   <earthquakeType>HFVQ(LT)</earthquakeType>
  </SingleStationEvent>
  <SingleStationEvent code="VTAG_20160407000000" owner1="169" pubDate="2018-04-07 00:00:00" station="101">
   <startTime>2016-04-07 00:00:00</startTime>
   <startTimeCsec>0</startTimeCsec>
   <startTimeUnc>0000-00-00 00:00:02</startTimeUnc>
   <startTimeCsecUnc>0</startTimeCsecUnc>
   <picksDetermination>H</picksDetermination>
   <SPInterval>2.3</SPInterval>
   <duration>19</duration>
   <durationUnc>0</durationUnc>
   <distActiveVent>0</distActiveVent>
   <maxAmplitude>549.3</maxAmplitude>
   <sampleRate>0.01</sampleRate>
   <earthquakeType>HFVQ(S)</earthquakeType>
  </SingleStationEvent>
 </SingleStationEventDataset>
 </Seismic>
 </Data>
</wovoml>
And my table is:
CREATE TABLE IF NOT EXISTS `sd_evs` (
  `sd_evs_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `sd_evs_code` varchar(30) NOT NULL COMMENT 'Code',
  `ss_id` mediumint(8) unsigned DEFAULT NULL COMMENT 'Seismic station ID',
  `sd_evs_time` datetime DEFAULT NULL COMMENT 'Start time',
  `sd_evs_time_ms` decimal(2,2) DEFAULT NULL COMMENT 'Centisecond precision for start time',
  `sd_evs_time_unc` datetime DEFAULT NULL COMMENT 'Start time uncertainty',
  `sd_evs_time_unc_ms` decimal(2,2) DEFAULT NULL COMMENT 'Centisecond precision for uncertainty in start time',
  `sd_evs_picks` enum('A','R','H','U') DEFAULT NULL COMMENT 'Determination of picks: A=Automatic picker, R=Ruler, H=Human using a computer-based picker, U=Unknown',
  `sd_evs_spint` float DEFAULT NULL COMMENT 'S-P interval',
  `sd_evs_dur` float DEFAULT NULL COMMENT 'Duration',
  `sd_evs_dur_unc` float DEFAULT NULL COMMENT 'Duration uncertainty',
  `sd_evs_dist_actven` float DEFAULT NULL COMMENT 'Distance from active vent',
  `sd_evs_maxamptrac` float DEFAULT NULL COMMENT 'Maximum amplitude of trace',
  `sd_evs_samp` float DEFAULT NULL COMMENT 'Sampling rate',
  `sd_evs_eqtype` enum('TQ','HFVQ(LT)','HFVQ(S)','LFVQ(SX)','SDH(HF)','SDH(LF)','H','E','tele','LFVQ(X)','HFVQ') DEFAULT NULL COMMENT 'The WOVOdat terminology for the earthquake type',
  `cc_id` smallint(5) unsigned DEFAULT NULL COMMENT 'Collector ID',
  `cc_id2` smallint(5) unsigned DEFAULT NULL COMMENT 'Owner 2 ID',
  `cc_id3` smallint(5) unsigned DEFAULT NULL COMMENT 'Owner 3 ID',
  `sd_evs_loaddate` datetime DEFAULT NULL COMMENT 'Load date',
  `sd_evs_pubdate` datetime DEFAULT NULL COMMENT 'Publish date',
  `cc_id_load` smallint(5) unsigned DEFAULT NULL COMMENT 'Loader ID',
  `cb_ids` varchar(255) DEFAULT NULL COMMENT 'List of cb_ids (linking to cb.cb_id) separated by a comma',
  PRIMARY KEY (`sd_evs_id`),
  UNIQUE KEY `sd_evs_code` (`sd_evs_code`),
  KEY `OWNER 1` (`cc_id`),
  KEY `OWNER 2` (`cc_id2`),
  KEY `OWNER 3` (`cc_id3`),
  KEY `STATION` (`ss_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 COMMENT='Seismic event data from a single station' AUTO_INCREMENT=359143 ;
I am using Linux command line to perform such tasks as I need to do an automation of extracting data and importing to MySQL database. How do I extract attributes and elements values from XML file in such a way that the following fields contain the following values in the table?
AttributeField/Column
codesd_evs_code
owner1cc_id
pubDatesd_evs_pubdate
stationss_id
ElementField/Column
startTimesd_evs_time
startTimeCsecsd_evs_time_ms
startTimeUncsd_evs_time_unc
startTimeCsecUncsd_evs_time_unc_ms
picksDeterminationsd_evs_picks
SPIntervalsd_evs_spint
durationsd_evs_dur
durationUncsd_evs_dur_unc
distActiveVentsd_evs_dist_actven
maxAmplitudesd_evs_maxamptrac
sampleRatesd_evs_samp
earthquakeTypesd_evs_eqtype
Now, it is worth pointing out that it is totally feasible to get this task done, relying only on MySQL built-ins. This is in fact described in my original blog post. The steps are:
  1. Identify the xpath expression that identifies those elements that correspond to database rows. In this example, that xpath expression is /wovoml/Data/Seismic/SingleStationEventDataset/SingleStationEvent. This is an xpath path expression, which in this case is simply the tag names of the elements in the XML document, separated by a forward slash character (/).
  2. Use MySQL's ExtractValue() function to count the number of elements that are to be loaded as rows. This looks something like CAST(EXTRACTVALUE(v_xml, CONCAT('count(', v_xpath, ')') AS UNSIGNED), where v_xml is a string that contains our XML document, and v_xpath a string that contains the xpath expression we just identified. The xpath count() function accepts a path expression and returns the number of elements identified by that path as an integer. MySQL's EXTRACTVALUE however is unaware of the type of the xpath result, so therefore we explicitly convert it into an integer value using MySQL's CAST() function.
  3. Once we have the number of elements, we can setup a loop (using either LOOP, WHILE, or REPEAT)
  4. Inside the loop, we can construct a new xpath expression based on our original xpath expression and the loop index, like so CONCAT(v_xpath, '[', v_row_index, ']'). The bit between the square brackets is called a xpath predicate. Using only an integer value as predicate will retrieve the element located at that index position.
  5. Also inside the loop, we identify the xpath expression relative to the current row element for each field item we'd like to extract. So, for example, the code attribute of the current <SingleStationEvent> element is to be used to load the sd_evs_code column, and to extract that item we can write a call like CAST(EXTRACTVALUE(v_xml, CONCAT(v_xpath, '[', v_row_index, ']/@code')) AS UNSIGNED). The @ is an xpath shorthand to indicate an attribute at whatever element is identified by the preceding path expression. Instead of extracting an attribute, we could also append a tagname; for example CAST(EXTRACTVALUE(v_xml, CONCAT(v_xpath, '[', v_row_index, ']/startTime')) AS DATETIME) could be used to extract the value of the <startTime> child element as a MySQL datetime value.
  6. Still inside the loop, we can now assign the extracted values to our columns - either by inserting them directly for each run of the loop, or by writing either the expressions themselves, or the resulting values of these extractions to a SQL statement text, which could then be executed later as a whole.
While it is entirely possible to go through all these steps, it would be rather tedious to do so manually; Especially if you have to ingest multiple, differently formatted XML documents and write a specific routine for each type of input document. What would be really neat is if we could somehow only specify how items in the XML document are mapped to tables and columns, and then automate the work of actually extracting the data and loading it into the table. So, I wrote a stored procedure that does just that.

The Stored Procedure p_load_xml()

You can find the procedure on github. Its signature is:
PROCEDURE p_load_xml(IN p_spec text, IN p_xml text)
  • p_spec is an XML document that specifies the mapping from XML document to the table and its columns.
  • p_xml is the actual XML document containing the data that is to be loaded.

The Mapping document

The format of the mapping document that is to be passed to the p_spec parameter is best explained using an example. To achieve the load expressed in the request I received, I'm using this:
<x show="true" exec="false" table="sd_evs" xpath="/wovoml/Data/Seismic/SingleStationEventDataset/SingleStationEvent">
  <x column="sd_evs_code"         xpath="@code"               expression=""/>
  <x column="ss_id"               xpath="@station"            expression="cast(ss_id as unsigned)"/>
  <x column="sd_evs_time"         xpath="startTime"           expression=""/>
  <x column="sd_evs_time_ms"      xpath="startTimeCsec"       expression=""/>
  <x column="sd_evs_time_unc"     xpath="startTimeUnc"        expression=""/>
  <x column="sd_evs_time_unc_ms"  xpath="startTimeCsecUnc"    expression=""/>
  <x column="sd_evs_picks"        xpath="picksDetermination"  expression=""/>
  <x column="sd_evs_spint"        xpath="SPInterval"          expression=""/>
  <x column="sd_evs_dur"          xpath="duration"            expression=""/>
  <x column="sd_evs_dur_unc"      xpath="durationUnc"         expression=""/>
  <x column="sd_evs_dist_actven"  xpath="distActiveVent"      expression=""/>
  <x column="sd_evs_maxamptrac"   xpath="maxAmplitude"        expression=""/>
  <x column="sd_evs_samp"         xpath="sampleRate"          expression=""/>
  <x column="sd_evs_eqtype"       xpath="earthquakeType"      expression=""/>
  <x column="cc_id"               xpath="@owner1"             expression=""/>
  <x column="sd_evs_pubdate"      xpath="@pubDate"            expression=""/>
</x>

The document element

document element
The top level element (the document element) of the spec defines which table is to be loaded, as well as what expression is to be used to identify those elements that are loaded as rows. The tagname of the element can be any valid XML element name. In my example above I use x but it really doesn't matter.
table-attribute
This mandatory attribute holds the name of the target table
schema-attribute
This optional attribute can be used to specify the schema wherein the table resides. If present it will be used to qualify the table name. If not specified, an unqualified table name will be used.
xpath-attribute
The xpath-attribute on the document element should specify the xpath path expression that identifies all elements that are to be loaded as a row into the table.
Apart from these essential attributes on the document, a few extra are provided for convenience:
show-attribute
This optional attribute can be used to show the SQL that is generated to load the data into the table. If its value is "true" the procedure will return the generated SQL as a resultset. If not present, or anything other than "true", no resultset is returned by the procedure.
exec-attribute
This optional attribute can be used to suppress whether the generated SQL to load the table will be executed. If its value is "false", execution of the generated SQL will be surpressed. If omitted, or if it has any other value, the SQL will be executed automatically, thus actually loading the table.

Child elements

child element
The document element can have any number of child elements, and each specifies how to extract a value from the current row element, and which column to load it into. Again, the tagname is not significant as long as it is a valid XML element name - the attributes of the child elements specify the behavior.
xpath-attribute
Here you must specify a (mandatory) xpath-expression relative to the current row element, and this is what will be used as argument to EXTRACTVALUE() to extract a value from the XML data document.
column-attribute
Here you must specify a name for the extract. Normally, this will be the name of a column of the table defined in the document element, and the value extracted by the xpath expression will be loaded into that column. The name given here can also be used to refer to the value of the extract in case a child element has an expression-attribute. For more details, read up on the expression-attribute.
expression-attribute
Sometimes, the raw extract is not suitable for loading directly into a column. In those cases, you can specify a SQL column expression in the expression-attribute. In these SQL expressions, you can refer to the extracts defined by any other child element simply by using the name defined in the corresponding column-attribute.
exclude-attribute
Sometimes, a particular extract does not need to be loaded into the table at all, although it may be required by an expression defined by some other child-element. In this case, you can add a exclude-attribute with a value of "true". In this case, the extract will be available and can be referenced by the name in the column-attribute of that child element, but the extract itself will not be loaded into a column.

SQL generation

The following SQL statement is generated for the example XML document and spec:
INSERT INTO `sd_evs` (
 `sd_evs_code`
, `ss_id`
, `sd_evs_time`
, `sd_evs_time_ms`
, `sd_evs_time_unc`
, `sd_evs_time_unc_ms`
, `sd_evs_picks`
, `sd_evs_spint`
, `sd_evs_dur`
, `sd_evs_dur_unc`
, `sd_evs_dist_actven`
, `sd_evs_maxamptrac`
, `sd_evs_samp`
, `sd_evs_eqtype`
, `cc_id`
, `sd_evs_pubdate`
)
SELECT 
 `sd_evs_code`
, cast(ss_id as unsigned)
, `sd_evs_time`
, `sd_evs_time_ms`
, `sd_evs_time_unc`
, `sd_evs_time_unc_ms`
, `sd_evs_picks`
, `sd_evs_spint`
, `sd_evs_dur`
, `sd_evs_dur_unc`
, `sd_evs_dist_actven`
, `sd_evs_maxamptrac`
, `sd_evs_samp`
, `sd_evs_eqtype`
, `cc_id`
, `sd_evs_pubdate`
FROM (
SELECT 'VTAG_20160405000000' AS `sd_evs_code`
, '101' AS `ss_id`
, '2016-04-05 00:00:00' AS `sd_evs_time`
, '0' AS `sd_evs_time_ms`
, '0000-00-00 00:00:00' AS `sd_evs_time_unc`
, '0' AS `sd_evs_time_unc_ms`
, 'H' AS `sd_evs_picks`
, '12.5' AS `sd_evs_spint`
, '50' AS `sd_evs_dur`
, '0' AS `sd_evs_dur_unc`
, '0' AS `sd_evs_dist_actven`
, '1532.3' AS `sd_evs_maxamptrac`
, '0.01' AS `sd_evs_samp`
, 'TQ' AS `sd_evs_eqtype`
, '169' AS `cc_id`
, '2018-04-05 00:00:00' AS `sd_evs_pubdate`
UNION ALL
SELECT 'VTAG_20160406000000'
, '101'
, '2016-04-06 00:00:00'
, '0'
, '0000-00-00 00:00:01'
, '0'
, 'H'
, '5.2'
, '36'
, '0'
, '0'
, '9435.1'
, '0.01'
, 'HFVQ(LT)'
, '169'
, '2018-04-06 00:00:00'
UNION ALL
SELECT 'VTAG_20160407000000'
, '101'
, '2016-04-07 00:00:00'
, '0'
, '0000-00-00 00:00:02'
, '0'
, 'H'
, '2.3'
, '19'
, '0'
, '0'
, '549.3'
, '0.01'
, 'HFVQ(S)'
, '169'
, '2018-04-07 00:00:00'
) AS derived
Note how the raw extracted values end up in the inner most SELECT statement: for each element identified by the xpath-element on the document element, there is one SELECT statement, which are united by the UNION ALL operators. The column-expressions of these individual SELECT-expressions that form the legs of the union are the results from applying EXTRACTVALUE() on the xml document, using the combination of the xpath-attributes of the document element and its child elements. These extracts are assigned the name specified in the column-attribute using a SQL alias.

Note how the outer SELECT-statement selects the non-excluded columns. For the ss_id target column, you can see how an SQL-expression specified in the corresponding child-element in the spec is applied to the values selected by the inner SELECT-statement.

Possible Improvements

The current version of p_load_xml() is a very crude and straightforward implementation. It should do the job, but not much else. I do have a list of things that could be improved in the future:
  • Sensible error messages in case there are errors parsing the specification document
  • Automatic type checking and conversion. The idea is that if the target table exists we could use the information schema to find out about the column data types, and wrap the value in a CAST() or CONVERT() expression and explicitly catch any type errors, rather than postponing this until the actual data load.
  • Paging / Chunking. It would perhaps be nice if one could control how many statements are generated, rather than generating just one big UNION ALL.
  • Stored Routine generation. It might be nice to be able to generate a stored routine based only on the spec document, which can then be used and re-used afterwards to load xml documents that conform to that spec.
  • Currently, only INSERT is supported. It would be nice to be able to generate SQL for INSERT ... ON DUPLICATE KEY UPDATE, UPDATE, DELETE. Maybe even adjust the spec to allow formulating a condition that determines what action to perform.
I will probably not spend much time up front in actually creating these improvements, unless there are people that start to use this software and inform me that they would like to see those improvements. If that is the case then please use the github issue tracker to let me know your requests.

Finally

You can freely use and distribute this code. If you find a bug, or have a feature request, please use github issues and pull requests to contribute. Your feedback is much appreciated!

Also, now that MySQL 5.7 has all these JSON-functions, maybe something similar should be built for JSON documents? Let me know.

No comments: