
Sunday, March 18, 2018

A Tale of a JavaScript Memory Leak

Abstract: Matching JavaScript regular expressions against large input strings with V8 can result in memory leaks. In this post, I explain how to troubleshoot the issue using Google Chrome heap snapshots. Finally, a fix proposed by my son David (age 14) is presented.

Background

At Just-BI we developed a browser-based application for one of our customers. One way the application gets its data is by loading and parsing Microsoft Excel files. The app is successful and our customers are happy, a fact they express by attempting to load ever larger files.

On mobile Safari (iPad), our app starts crashing when the files reach a certain size. This happens around 5 to 6 MB. Arguably, that's not that large, but things are a bit more complicated: Microsoft Excel files, at least those of the .xlsx variety, are actually zip-compressed folders, containing mostly OOXML spreadsheet documents.

It is a well-known fact that XML is really verbose, and I suppose we should be grateful that our 5 to 6 MB Excel files uncompress to only 40 MB of XML.

We were able to debug the issue somewhat, and we noticed that by the time Safari crashes, it reports that it is out of heap space. We can't be sure that's the actual cause, but we think we might be able to overcome, or at least postpone, this issue by somehow cutting down on memory usage.

This brings us to the main topic of our tale.

Parsing xlsx files in the Browser

To parse xlsx files, we use a particular JavaScript library called js-xlsx. This is actually a pretty nice piece of work, and I do not hesitate to recommend it. We have used it for quite a while without major issues; it's just that the particular strategy this library uses to parse the xlsx file will temporarily spike memory usage, and we believe this triggers some bug in Safari, which eventually leads to a crash.

So, we're currently investigating a less general, less standards-compliant way to parse xlsx files. In return, this allows us to parse xlsx faster, and with a much reduced peak-memory usage.

I don't want to talk too much in detail about the xlsx parser we're developing. All I can say is that it is not meant to be a general, fully featured xlsx parser; the only requirement it is designed to fulfill is to avoid, or at least postpone, the crash we observe in Safari, and to parse portions of xlsx workbooks having a well-known structure specific to our application requirements.

I am happy to report that, with just one day of work, we managed to get an initial version of our parser to work. It has a peak-memory usage that is half that of the previous solution. It sounds even more spectacular when I put it the other way around: the old solution used 100% more memory!

As an added bonus, the new solution is just a little less than 10x as fast as the old solution. For some reason, it does not sound much cooler when I say the new solution is about an order of magnitude faster, so I won't.

JavaScript Regular Expressions

Our parser relies on JavaScript regular expressions, which are exposed through the global built-in RegExp object.

One might be aware that, for at least 100 different programming languages, there are at least 10,000 Stack Exchange answers that wittily denounce any attempt to parse XML using regular expressions. Some bloggers' entire careers seem to be built around their particular brand of scorn and disdain for this topic.

We have little to add to discussions like these, other than that in modern JavaScript runtimes, regular expressions are a very productive and powerful tool for quickly building tokenizers. These tokenizers can have amazing performance, and can serve admirably as a foundation to build parsers of many kinds, including but certainly not limited to XML-parsers.
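
As a rough sketch of the idea (and emphatically not our actual parser), a global regular expression can drive a simple tokenizer loop like this:
function tokenizeRows(xml) {
  //global regexp: each call to exec() continues where the previous match ended
  var rowStartTag = /<row\s([^>]*)>/g, match, tokens = [];
  while ((match = rowStartTag.exec(xml)) !== null) {
    //match.index is the position of the token, match[1] holds the raw attribute text
    tokens.push({position: match.index, attributes: match[1]});
  }
  return tokens;
}
tokenizeRows('<row r="1" spans="1:24"></row><row r="2" spans="1:24"></row>');
//returns two tokens, one for each <row> start-tag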

A Memory Leak

Despite the initial success, not all is well with our new xlsx parser. We found that, notwithstanding lower peak-memory consumption, it did suffer from a memory leak. We noticed this by creating a simple sample application, loading only our parser, and then making heap snapshots using Chrome developer tools during various phases of the process. See the screenshot below:



This is what happens:
  1. The first heap snapshot was made directly after loading the application, and measures 6.5 MB. Whether you think this is a lot or not, this is our baseline, and there is not much we can do about it now.
  2. Next, the user picks a xlsx workbook, and the application opens it. The snapshot is now 12.6 MB, which is an increase of 6.1 MB as compared to our baseline. The workbook file is a little less than 6 MB and accounts for most of the increase. At this point, our sample application has also extracted and parsed the list of worksheets contained within the workbook, as well as the shared string table. I haven't looked at that in detail, but for now I am satisfied to believe that this accounts for the remaining extra memory.
  3. At this point, we extracted the worksheet of interest from the workbook and uncompressed it into a JavaScript string. This made our heap snapshot increase by almost 34 MB. That is certainly a lot! However, the file size of the worksheet document itself is 34,445 kB, so it seems everything is accounted for.
  4. The next heap snapshot was taken after parsing the worksheet and building a parse tree. The snapshot weighs 77.3 MB - an increase of another 31 MB. Now, the sheet has 32,294 rows with 24 cells of data each, and most of the cells are unique decimal numbers, so it is a decent chunk of data. But even then, it still feels as if this is way too large.

    That said, things probably look worse than they really are. Our new parser is event-based: the parse method accepts a configuration object that contains only a callback, which is called every time a new row is extracted from the sheet (a minimal sketch of this interface is shown below). For our sample application, the callback is only a very naive proof of concept. I suspect there are plenty of opportunities to make the parse tree builder smarter and the parse tree smaller.
  5. The last heap snapshot was taken after the parse. At this point, the parse tree, the workbook object, and the XML string went out of scope. But we are still looking at a heap snapshot of more than 40 MB! This is bad news: we really should be back at something close to heap snapshot 1. So, there's about 34 MB unaccounted for.
In the screenshot, you can also see what's hogging the memory: in the top right pane, we find our XML document string, which indeed accounts for the retained 34 MB of memory. In the bottom right pane, we can see who's still referencing it: it's some property called parent of sliced string @15298471. And these are referenced twice in some array, which is referenced by something called regexp_last_match_info in the native context.
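
Before moving on, to give an idea of the event-based interface mentioned in point 4, here is a minimal sketch (the names parseRows and rowHandler are illustrative, not the actual API of our parser):
function parseRows(xml, config) {
  //find each <row> element and fire the callback with its contents
  var row = /<row[^>]*>([\s\S]*?)<\/row>/g, match;
  while ((match = row.exec(xml)) !== null) {
    config.rowHandler(match[1]);
  }
}
parseRows('<row r="1"><c>1</c></row><row r="2"><c>2</c></row>', {
  rowHandler: function(rowContent) {
    //called once for every extracted row; a smarter callback would build the parse tree here
    console.log(rowContent);
  }
});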

Memory Leak, Explained

Now, what I think we're looking at is the lastMatch property of the global built-in RegExp-object.

If you're not familiar with JavaScript regular expressions, it might be helpful to consider exactly how our parser uses them. We're using code like this:
function parse(){
  //regular expression to match a <row> start-tag.
  var regexp = /<row\s([^>]+)\s*>/g;

  var match, data, xml = "...this is our huge XML document, loaded into a JavaScript String...";

  match = regexp.exec(xml);
  if (match) {

    //...do something with matched substrings contained as array elements in match...
    data = match[1];
    //...etc...

  }
}
(Note that this is just an example of the concept - not literally the actual code)

The parse() function first assigns a literal regular expression to the regexp variable. Under the covers, this literal regular expression results in calling the built-in global RegExp constructor, instantiating a new RegExp instance. Then, the exec-method of the RegExp instance is called, passing the -huge- XML document string. The exec-method returns an object representing the result: if there was no match, null is returned; if there is a match, an object is returned that contains information about the match(es).

If there was a match, the match object will look a lot like an overloaded array object, having the matched parts of the string argument as elements. The element at index 0 of the match object (match[0]) is the substring matching the entire regular expression, the element at index 1 is the substring that matches the first parenthesized capturing group, and so on.
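
For example (illustrative values):
var match = /<row\s([^>]*)>/g.exec('<row r="1" spans="1:24">');
match[0]; //'<row r="1" spans="1:24">' - the substring matching the entire expression
match[1]; //'r="1" spans="1:24"' - the substring matching the first capturing group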

Now, since the match variable is a local variable in our parse function, everything should be garbage collectible after the function ends, right?

Yes. But No.

About RegExp.lastMatch

As it turns out, when an instance of the RegExp object finds a match, then the corresponding match info object containing all the matching substrings is stored in the lastMatch property of the global built-in RegExp-object. So, even if our parse-method is out of scope, the last match made by some regular expression inside it is still dangling around, attached to the global RegExp object in its lastMatch property.
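
A quick way to see this in a V8-based runtime such as Chrome or Node.js is a snippet like the following (an illustration, not our production code):
function parse(xml) {
  //a local regexp matched against a local string: nothing here should outlive the call
  /<row\s([^>]*)>/g.exec(xml);
}
parse('<row r="1"><c>42</c></row>');
//parse() has returned, yet the legacy static property still refers to the match:
console.log(RegExp.lastMatch); //'<row r="1">'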

Substrings in V8

Now, if the lastMatch-object is still around, then the substrings representing the matches are also still around. As it turns out, V8 implements these substrings as "slice" objects. From within the JavaScript environment, they act and behave like String objects, but internally, the V8 JavaScript engine implements them as objects that have a parent property to keep a reference to the original String object from which they are a substring, along with some indexes to indicate what part of the original string makes up the substring.

Now, if you think about it, this way of implementing substrings is actually pretty clever, since it allows V8 to do many string manipulations very efficiently, minimizing the overhead and memory consumption of copying parts of strings to and fro. In my case, it just becomes an atrocious memory hog because of the RegExp object, which has decided to maintain a reference to the last match object (for whatever reason).

Other people have run into issues due to V8's substring design as well, and a bug was filed here:

https://bugs.chromium.org/p/v8/issues/detail?id=2869

Solutions

My oldest son David (14 years old) came up with a pretty creative solution: what if we'd write our own substring implementation, overriding the native one? If this makes you cringe, just think of 30 MB memory leaks and crashing browsers: it puts things in perspective. If this still sounds crazy to you, you should realize that when he came up with this idea, we had already been looking at this issue together for two hours. And even though we felt the substring issue was related, we still had no way to prove that this was actually the case. His idea was feasible and might confirm our suspicions, so we went ahead and did it, and attached our own implementations of substring and substr to the __proto__ object of our XML string to override only these methods of that string instance.
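
Roughly, the experiment looked like the sketch below. This is a reconstruction for illustration purposes: it shadows the methods on a String wrapper object rather than on __proto__, and the variable names are made up, but the idea is the same: return a freshly built string, so V8 cannot hand out a slice that keeps the huge parent string alive.
var hugeXmlDocument = "...this is our huge XML document, loaded into a JavaScript String...";
var xml = new String(hugeXmlDocument);
xml.substring = function(start, end) {
  if (end === undefined) {
    end = this.length;
  }
  //build a genuinely new string character by character,
  //so the result cannot be a slice that references the parent string
  var copy = "";
  for (var i = start; i < end; i++) {
    copy += this.charAt(i);
  }
  return copy;
};
xml.substr = function(start, length) {
  if (length === undefined) {
    length = this.length - start;
  }
  return this.substring(start, start + length);
};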

As was to be expected, our own substring implementations were way slower than the native ones, and the parse took about 25 times longer than before. However, it *did* solve the memory leak. This was a strong indication that we were on the right track.

Then, David suggested another solution: why don't we simply clear out the lastMatch property of the global built-in RegExp object? We tried to do this directly, simply by assignment:
RegExp.lastMatch = null;
Unfortunately, this does not work. Although it does not throw a runtime exception, the RegExp object is protected against this kind of assignment, and the property never gets overwritten. However, it is still possible to achieve what we want, simply by instantiating a new RegExp object, and then forcing a match against a known, short string. We can then wrap that in a utility function, so we can always call it after doing some serious regular expression matching on large strings:
function freeRegExp(){
  /\s*/g.exec("");
}
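A typical usage pattern is then (a sketch, using the parse function from the earlier example):
parse();      //heavy regular expression work against the huge XML string
freeRegExp(); //the tiny match replaces the last match info, releasing the reference to the huge string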
Here's the heap snapshot after applying this fix:

Summary

  • Globals are bad.
  • Side-effects are bad, in particular if the modifications are global.
  • V8's substring implementation may lead to unexpected memory leaks.
  • Chrome heap snapshots are a powerful tool to troubleshoot them.
  • After applying regular expressions to huge strings, always force a match against a small string to prevent memory leaks.
  • David Rocks! He truly impressed me with his troubleshooting skills and his knack for pragmatic, feasible solutions.

Thursday, October 22, 2015

Loading Arbitrary XML documents into MySQL tables with p_load_xml

Many years ago, I wrote about importing XML data into MySQL using ExtractValue(). The reason I'm revisiting the subject now is that I recently received an email request in relation to my old blog post:
I came across one of your blogs on importing XML data to MySQL using ExtractData() and I am trying to do the same using MySQL (5.5) database. However, I am new to this kind of method and would like to seek your expertise on this. I have an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
 <wovoml xmlns="http://www.wovodat.org"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 version="1.1.0" xsi:schemaLocation="http://www.wovodat.org phread2.xsd">
 <Data>
 <Seismic>
 <SingleStationEventDataset>
  <SingleStationEvent code="VTAG_20160405000000" owner1="169" pubDate="2018-04-05 00:00:00" station="101">
   <startTime>2016-04-05 00:00:00</startTime>
   <startTimeCsec>0</startTimeCsec>
   <startTimeUnc>0000-00-00 00:00:00</startTimeUnc>
   <startTimeCsecUnc>0</startTimeCsecUnc>
   <picksDetermination>H</picksDetermination>
   <SPInterval>12.5</SPInterval>
   <duration>50</duration>
   <durationUnc>0</durationUnc>
   <distActiveVent>0</distActiveVent>
   <maxAmplitude>1532.3</maxAmplitude>
   <sampleRate>0.01</sampleRate>
   <earthquakeType>TQ</earthquakeType>
  </SingleStationEvent>
  <SingleStationEvent code="VTAG_20160406000000" owner1="169" pubDate="2018-04-06 00:00:00" station="101">
   <startTime>2016-04-06 00:00:00</startTime>
   <startTimeCsec>0</startTimeCsec>
   <startTimeUnc>0000-00-00 00:00:01</startTimeUnc>
   <startTimeCsecUnc>0</startTimeCsecUnc>
   <picksDetermination>H</picksDetermination>
   <SPInterval>5.2</SPInterval>
   <duration>36</duration>
   <durationUnc>0</durationUnc>
   <distActiveVent>0</distActiveVent>
   <maxAmplitude>9435.1</maxAmplitude>
   <sampleRate>0.01</sampleRate>
   <earthquakeType>HFVQ(LT)</earthquakeType>
  </SingleStationEvent>
  <SingleStationEvent code="VTAG_20160407000000" owner1="169" pubDate="2018-04-07 00:00:00" station="101">
   <startTime>2016-04-07 00:00:00</startTime>
   <startTimeCsec>0</startTimeCsec>
   <startTimeUnc>0000-00-00 00:00:02</startTimeUnc>
   <startTimeCsecUnc>0</startTimeCsecUnc>
   <picksDetermination>H</picksDetermination>
   <SPInterval>2.3</SPInterval>
   <duration>19</duration>
   <durationUnc>0</durationUnc>
   <distActiveVent>0</distActiveVent>
   <maxAmplitude>549.3</maxAmplitude>
   <sampleRate>0.01</sampleRate>
   <earthquakeType>HFVQ(S)</earthquakeType>
  </SingleStationEvent>
 </SingleStationEventDataset>
 </Seismic>
 </Data>
</wovoml>
And my table is:
CREATE TABLE IF NOT EXISTS `sd_evs` (
  `sd_evs_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `sd_evs_code` varchar(30) NOT NULL COMMENT 'Code',
  `ss_id` mediumint(8) unsigned DEFAULT NULL COMMENT 'Seismic station ID',
  `sd_evs_time` datetime DEFAULT NULL COMMENT 'Start time',
  `sd_evs_time_ms` decimal(2,2) DEFAULT NULL COMMENT 'Centisecond precision for start time',
  `sd_evs_time_unc` datetime DEFAULT NULL COMMENT 'Start time uncertainty',
  `sd_evs_time_unc_ms` decimal(2,2) DEFAULT NULL COMMENT 'Centisecond precision for uncertainty in start time',
  `sd_evs_picks` enum('A','R','H','U') DEFAULT NULL COMMENT 'Determination of picks: A=Automatic picker, R=Ruler, H=Human using a computer-based picker, U=Unknown',
  `sd_evs_spint` float DEFAULT NULL COMMENT 'S-P interval',
  `sd_evs_dur` float DEFAULT NULL COMMENT 'Duration',
  `sd_evs_dur_unc` float DEFAULT NULL COMMENT 'Duration uncertainty',
  `sd_evs_dist_actven` float DEFAULT NULL COMMENT 'Distance from active vent',
  `sd_evs_maxamptrac` float DEFAULT NULL COMMENT 'Maximum amplitude of trace',
  `sd_evs_samp` float DEFAULT NULL COMMENT 'Sampling rate',
  `sd_evs_eqtype` enum('TQ','HFVQ(LT)','HFVQ(S)','LFVQ(SX)','SDH(HF)','SDH(LF)','H','E','tele','LFVQ(X)','HFVQ') DEFAULT NULL COMMENT 'The WOVOdat terminology for the earthquake type',
  `cc_id` smallint(5) unsigned DEFAULT NULL COMMENT 'Collector ID',
  `cc_id2` smallint(5) unsigned DEFAULT NULL COMMENT 'Owner 2 ID',
  `cc_id3` smallint(5) unsigned DEFAULT NULL COMMENT 'Owner 3 ID',
  `sd_evs_loaddate` datetime DEFAULT NULL COMMENT 'Load date',
  `sd_evs_pubdate` datetime DEFAULT NULL COMMENT 'Publish date',
  `cc_id_load` smallint(5) unsigned DEFAULT NULL COMMENT 'Loader ID',
  `cb_ids` varchar(255) DEFAULT NULL COMMENT 'List of cb_ids (linking to cb.cb_id) separated by a comma',
  PRIMARY KEY (`sd_evs_id`),
  UNIQUE KEY `sd_evs_code` (`sd_evs_code`),
  KEY `OWNER 1` (`cc_id`),
  KEY `OWNER 2` (`cc_id2`),
  KEY `OWNER 3` (`cc_id3`),
  KEY `STATION` (`ss_id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 COMMENT='Seismic event data from a single station' AUTO_INCREMENT=359143 ;
I am using Linux command line to perform such tasks as I need to do an automation of extracting data and importing to MySQL database. How do I extract attributes and elements values from XML file in such a way that the following fields contain the following values in the table?
Attribute            Field/Column
code                 sd_evs_code
owner1               cc_id
pubDate              sd_evs_pubdate
station              ss_id

Element              Field/Column
startTime            sd_evs_time
startTimeCsec        sd_evs_time_ms
startTimeUnc         sd_evs_time_unc
startTimeCsecUnc     sd_evs_time_unc_ms
picksDetermination   sd_evs_picks
SPInterval           sd_evs_spint
duration             sd_evs_dur
durationUnc          sd_evs_dur_unc
distActiveVent       sd_evs_dist_actven
maxAmplitude         sd_evs_maxamptrac
sampleRate           sd_evs_samp
earthquakeType       sd_evs_eqtype
Now, it is worth pointing out that it is totally feasible to get this task done, relying only on MySQL built-ins. This is in fact described in my original blog post. The steps are:
  1. Identify the xpath expression that identifies those elements that correspond to database rows. In this example, that xpath expression is /wovoml/Data/Seismic/SingleStationEventDataset/SingleStationEvent. This is an xpath path expression, which in this case is simply the tag names of the elements in the XML document, separated by a forward slash character (/).
  2. Use MySQL's ExtractValue() function to count the number of elements that are to be loaded as rows. This looks something like CAST(EXTRACTVALUE(v_xml, CONCAT('count(', v_xpath, ')')) AS UNSIGNED), where v_xml is a string that contains our XML document, and v_xpath a string that contains the xpath expression we just identified. The xpath count() function accepts a path expression and returns the number of elements identified by that path as an integer. MySQL's EXTRACTVALUE however is unaware of the type of the xpath result, so we explicitly convert it into an integer value using MySQL's CAST() function.
  3. Once we have the number of elements, we can set up a loop (using either LOOP, WHILE, or REPEAT)
  4. Inside the loop, we can construct a new xpath expression based on our original xpath expression and the loop index, like so: CONCAT(v_xpath, '[', v_row_index, ']'). The bit between the square brackets is called an xpath predicate. Using only an integer value as the predicate will retrieve the element located at that index position.
  5. Also inside the loop, we identify the xpath expression relative to the current row element for each field item we'd like to extract. So, for example, the code attribute of the current <SingleStationEvent> element is to be used to load the sd_evs_code column, and to extract that item we can write a call like EXTRACTVALUE(v_xml, CONCAT(v_xpath, '[', v_row_index, ']/@code')). The @ is an xpath shorthand to indicate an attribute of whatever element is identified by the preceding path expression. Instead of extracting an attribute, we could also append a tag name; for example, CAST(EXTRACTVALUE(v_xml, CONCAT(v_xpath, '[', v_row_index, ']/startTime')) AS DATETIME) could be used to extract the value of the <startTime> child element as a MySQL datetime value.
  6. Still inside the loop, we can now assign the extracted values to our columns: either by inserting them directly in each iteration of the loop, or by appending either the expressions themselves or their resulting values to a SQL statement text, which can then be executed later as a whole.
While it is entirely possible to go through all these steps, it would be rather tedious to do so manually, especially if you have to ingest multiple, differently formatted XML documents and write a specific routine for each type of input document. What would be really neat is if we could somehow specify only how items in the XML document are mapped to tables and columns, and then automate the work of actually extracting the data and loading it into the table. So, I wrote a stored procedure that does just that.

The Stored Procedure p_load_xml()

You can find the procedure on github. Its signature is:
PROCEDURE p_load_xml(IN p_spec text, IN p_xml text)
  • p_spec is an XML document that specifies the mapping from XML document to the table and its columns.
  • p_xml is the actual XML document containing the data that is to be loaded.

The Mapping document

The format of the mapping document that is to be passed to the p_spec parameter is best explained using an example. To achieve the load expressed in the request I received, I'm using this:
<x show="true" exec="false" table="sd_evs" xpath="/wovoml/Data/Seismic/SingleStationEventDataset/SingleStationEvent">
  <x column="sd_evs_code"         xpath="@code"               expression=""/>
  <x column="ss_id"               xpath="@station"            expression="cast(ss_id as unsigned)"/>
  <x column="sd_evs_time"         xpath="startTime"           expression=""/>
  <x column="sd_evs_time_ms"      xpath="startTimeCsec"       expression=""/>
  <x column="sd_evs_time_unc"     xpath="startTimeUnc"        expression=""/>
  <x column="sd_evs_time_unc_ms"  xpath="startTimeCsecUnc"    expression=""/>
  <x column="sd_evs_picks"        xpath="picksDetermination"  expression=""/>
  <x column="sd_evs_spint"        xpath="SPInterval"          expression=""/>
  <x column="sd_evs_dur"          xpath="duration"            expression=""/>
  <x column="sd_evs_dur_unc"      xpath="durationUnc"         expression=""/>
  <x column="sd_evs_dist_actven"  xpath="distActiveVent"      expression=""/>
  <x column="sd_evs_maxamptrac"   xpath="maxAmplitude"        expression=""/>
  <x column="sd_evs_samp"         xpath="sampleRate"          expression=""/>
  <x column="sd_evs_eqtype"       xpath="earthquakeType"      expression=""/>
  <x column="cc_id"               xpath="@owner1"             expression=""/>
  <x column="sd_evs_pubdate"      xpath="@pubDate"            expression=""/>
</x>

The document element

document element
The top level element (the document element) of the spec defines which table is to be loaded, as well as what expression is to be used to identify those elements that are loaded as rows. The tagname of the element can be any valid XML element name. In my example above I use x but it really doesn't matter.
table-attribute
This mandatory attribute holds the name of the target table
schema-attribute
This optional attribute can be used to specify the schema wherein the table resides. If present it will be used to qualify the table name. If not specified, an unqualified table name will be used.
xpath-attribute
The xpath-attribute on the document element should specify the xpath path expression that identifies all elements that are to be loaded as a row into the table.
Apart from these essential attributes on the document element, a few extra are provided for convenience:
show-attribute
This optional attribute can be used to show the SQL that is generated to load the data into the table. If its value is "true" the procedure will return the generated SQL as a resultset. If not present, or anything other than "true", no resultset is returned by the procedure.
exec-attribute
This optional attribute can be used to control whether the generated SQL to load the table will be executed. If its value is "false", execution of the generated SQL will be suppressed. If omitted, or if it has any other value, the SQL will be executed automatically, thus actually loading the table.

Child elements

child element
The document element can have any number of child elements, and each specifies how to extract a value from the current row element, and which column to load it into. Again, the tagname is not significant as long as it is a valid XML element name - the attributes of the child elements specify the behavior.
xpath-attribute
Here you must specify a (mandatory) xpath-expression relative to the current row element, and this is what will be used as argument to EXTRACTVALUE() to extract a value from the XML data document.
column-attribute
Here you must specify a name for the extract. Normally, this will be the name of a column of the table defined in the document element, and the value extracted by the xpath expression will be loaded into that column. The name given here can also be used to refer to the value of the extract in case a child element has an expression-attribute. For more details, read up on the expression-attribute.
expression-attribute
Sometimes, the raw extract is not suitable for loading directly into a column. In those cases, you can specify a SQL column expression in the expression-attribute. In these SQL expressions, you can refer to the extracts defined by any other child element simply by using the name defined in the corresponding column-attribute.
exclude-attribute
Sometimes, a particular extract does not need to be loaded into the table at all, although it may be required by an expression defined by some other child element. In that case, you can add an exclude-attribute with a value of "true". The extract will then be available and can be referenced by the name in the column-attribute of that child element, but the extract itself will not be loaded into a column.

SQL generation

The following SQL statement is generated for the example XML document and spec:
INSERT INTO `sd_evs` (
 `sd_evs_code`
, `ss_id`
, `sd_evs_time`
, `sd_evs_time_ms`
, `sd_evs_time_unc`
, `sd_evs_time_unc_ms`
, `sd_evs_picks`
, `sd_evs_spint`
, `sd_evs_dur`
, `sd_evs_dur_unc`
, `sd_evs_dist_actven`
, `sd_evs_maxamptrac`
, `sd_evs_samp`
, `sd_evs_eqtype`
, `cc_id`
, `sd_evs_pubdate`
)
SELECT 
 `sd_evs_code`
, cast(ss_id as unsigned)
, `sd_evs_time`
, `sd_evs_time_ms`
, `sd_evs_time_unc`
, `sd_evs_time_unc_ms`
, `sd_evs_picks`
, `sd_evs_spint`
, `sd_evs_dur`
, `sd_evs_dur_unc`
, `sd_evs_dist_actven`
, `sd_evs_maxamptrac`
, `sd_evs_samp`
, `sd_evs_eqtype`
, `cc_id`
, `sd_evs_pubdate`
FROM (
SELECT 'VTAG_20160405000000' AS `sd_evs_code`
, '101' AS `ss_id`
, '2016-04-05 00:00:00' AS `sd_evs_time`
, '0' AS `sd_evs_time_ms`
, '0000-00-00 00:00:00' AS `sd_evs_time_unc`
, '0' AS `sd_evs_time_unc_ms`
, 'H' AS `sd_evs_picks`
, '12.5' AS `sd_evs_spint`
, '50' AS `sd_evs_dur`
, '0' AS `sd_evs_dur_unc`
, '0' AS `sd_evs_dist_actven`
, '1532.3' AS `sd_evs_maxamptrac`
, '0.01' AS `sd_evs_samp`
, 'TQ' AS `sd_evs_eqtype`
, '169' AS `cc_id`
, '2018-04-05 00:00:00' AS `sd_evs_pubdate`
UNION ALL
SELECT 'VTAG_20160406000000'
, '101'
, '2016-04-06 00:00:00'
, '0'
, '0000-00-00 00:00:01'
, '0'
, 'H'
, '5.2'
, '36'
, '0'
, '0'
, '9435.1'
, '0.01'
, 'HFVQ(LT)'
, '169'
, '2018-04-06 00:00:00'
UNION ALL
SELECT 'VTAG_20160407000000'
, '101'
, '2016-04-07 00:00:00'
, '0'
, '0000-00-00 00:00:02'
, '0'
, 'H'
, '2.3'
, '19'
, '0'
, '0'
, '549.3'
, '0.01'
, 'HFVQ(S)'
, '169'
, '2018-04-07 00:00:00'
) AS derived
Note how the raw extracted values end up in the innermost SELECT statements: for each element identified by the xpath-attribute on the document element, there is one SELECT statement, and these are united by the UNION ALL operators. The column expressions of these individual SELECT statements that form the legs of the union are the results of applying EXTRACTVALUE() to the xml document, using the combination of the xpath-attributes of the document element and its child elements. These extracts are assigned the name specified in the column-attribute using a SQL alias.

Note how the outer SELECT-statement selects the non-excluded columns. For the ss_id target column, you can see how an SQL-expression specified in the corresponding child-element in the spec is applied to the values selected by the inner SELECT-statement.

Possible Improvements

The current version of p_load_xml() is a very crude and straightforward implementation. It should do the job, but not much else. I do have a list of things that could be improved in the future:
  • Sensible error messages in case there are errors parsing the specification document
  • Automatic type checking and conversion. The idea is that if the target table exists we could use the information schema to find out about the column data types, and wrap the value in a CAST() or CONVERT() expression and explicitly catch any type errors, rather than postponing this until the actual data load.
  • Paging / Chunking. It would perhaps be nice if one could control how many statements are generated, rather than generating just one big UNION ALL.
  • Stored Routine generation. It might be nice to be able to generate a stored routine based only on the spec document, which can then be used and re-used afterwards to load xml documents that conform to that spec.
  • Currently, only INSERT is supported. It would be nice to be able to generate SQL for INSERT ... ON DUPLICATE KEY UPDATE, UPDATE, DELETE. Maybe even adjust the spec to allow formulating a condition that determines what action to perform.
I will probably not spend much time up front in actually creating these improvements, unless there are people that start to use this software and inform me that they would like to see those improvements. If that is the case then please use the github issue tracker to let me know your requests.

Finally

You can freely use and distribute this code. If you find a bug, or have a feature request, please use github issues and pull requests to contribute. Your feedback is much appreciated!

Also, now that MySQL 5.7 has all these JSON-functions, maybe something similar should be built for JSON documents? Let me know.

Wednesday, June 22, 2011

Working with namespaces and namespace prefixes in XSLT 1.0

To whom it may concern -

I'm currently developing an XSLT 1.0 stylesheet for analysis of XML Schema documents. As part of this work, I developed a couple of templates for working with namespace names and prefixes, and I'd like to share them via this post. The code is not incredibly hard or advanced, but it gets the job done and it may save you some time if you need something similar.
  • get-local-name: Return the local name part of a given QName.
  • get-prefix: Return the prefix part of a given QName.
  • get-ns-name: Return the namespace name associated to the given prefix.
  • get-ns-prefix: Return a prefix that can be used to denote the given namespace name.
  • resolve-ns-identifier: Return the namespace name for a given QName prefix
Before I discuss the code, I want to make a few remarks:
  1. This is all about XSLT 1.0 and its query language XPath 1.0. All these things can be solved much more conveniently in XPath 2.0, and hence in XSLT 2.0, because that builds on XPath 2.0 (and not XPath 1.0 like XSLT 1.0 does)
  2. If you're planning to use this in a web browser, and you want to target Firefox, you're out of luck. Sorry. Firefox is a great browser, but unlike Chrome, Opera and even Internet Explorer, it doesn't care enough about XSLT to fix bug #94270, which has been living in their bug tracker since as far back as August 2001 (nope - I didn't mistype 2011, that's 2001 as in almost a decade ago)

get-local-name

Return the local name part of a given QName. This is functionally equivalent to the fn:local-name-from-QName function in XPath 2.0.
<!-- get the last substring after the *last* colon (or the argument if no colon) -->
<xsl:template name="get-local-name">
<xsl:param name="qname"/>
<xsl:choose>
<xsl:when test="contains($qname, ':')">
<xsl:call-template name="get-local-name">
<xsl:with-param name="qname" select="substring-after($qname, ':')"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$qname"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

get-prefix

Return the prefix part of a given QName. This is functionally equivalent to the fn:prefix-from-QName function in XPath 2.0
<!-- get the substring before the *last* colon (or empty string if no colon) -->
<xsl:template name="get-prefix">
<xsl:param name="qname"/>
<xsl:param name="prefix" select="''"/>
<xsl:choose>
<xsl:when test="contains($qname, ':')">
<xsl:call-template name="get-prefix">
<xsl:with-param name="qname" select="substring-after($qname, ':')"/>
<xsl:with-param name="prefix" select="concat($prefix, substring-before($qname, ':'))"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$prefix"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

get-ns-name

Return the namespace name associated to the given prefix. This is functionally equivalent to the fn:namespace-uri-for-prefix function in XPath 2.0. The main difference is that this template does the lookup against the namespace definitions that are in effect in the current context, whereas the XPath 2.0 function allows the element which is used as context to be passed in as argument.
<!-- get the namespace uri for the namespace identified by the prefix in the parameter -->
<xsl:template name="get-ns-name">
<xsl:param name="ns-prefix"/>
<xsl:variable name="ns-node" select="namespace::node()[local-name()=$ns-prefix]"/>
<xsl:value-of select="$ns-node"/>
</xsl:template>

get-ns-prefix

Return a prefix that can be used to denote the given namespace name. This template is complementary to the get-ns-name template. This template assumes only one prefix will be defined for each namespace. The namespace is resolved against the current context.
<!-- get the namespace prefix for the namespace name parameter -->
<xsl:template name="get-ns-prefix">
<xsl:param name="ns-name"/>
<xsl:variable name="ns-node" select="namespace::node()[.=$ns-name]"/>
<xsl:value-of select="local-name($ns-node)"/>
</xsl:template>

resolve-ns-identifier

Return the namespace name for a given QName prefix (be it a namespace prefix or a namespace name). This template is useful to generically obtain a namespace name when feeding it the prefix part of a QName. If the prefix happens to be a namespace name, then that is returned; but if it happens to be a namespace prefix, then a lookup is performed to return the namespace name. This template also looks at the namespaces in effect in the current context.
<!-- return the namespace name -->
<xsl:template name="resolve-ns-identifier">
<xsl:param name="ns-identifier"/>
<xsl:choose>
<xsl:when test="namespace::node()[.=$ns-identifier]">
<xsl:value-of select="$ns-identifier"/>
</xsl:when>
<xsl:when test="namespace::node()[local-name()=$ns-identifier]">
<xsl:value-of select="namespace::node()[local-name()=$ns-identifier]"/>
</xsl:when>
<xsl:otherwise>
<xsl:message terminate="yes">
Error: "<xsl:value-of select="$ns-identifier"/>" is neither a valid namespace prefix nor a valid namespace name.
</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

Monday, May 30, 2011

Cleaning webpages with Pentaho Data Integration and JTidy

Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. Pentaho Data Integration (kettle) has lots of functionality on-board to make this an easy process, except one: it does not support reading data directly from HTML.

In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract their data using the conventional "Get data from XML" step. The solution hinges on two ingredients:

Standard Kettle tools for Webservices

Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in Pentaho Kettle Solutions). But when you're dealing with plain old HTML, things can get pretty hairy.

If you're lucky, the page may be in XHTML, and in that case it's worth trying the Get Data from XML step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like &nbsp; entities, which are valid in XHTML, but not in plain XML. And of course, more often than not, you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.

Workaround: JavaScript string manipulation

In the past, I usually worked around these issues. In practice, some quick and dirty string manipulation using the Modified Javascript Value step and some built-in indexOf(), substring and replace() functions go a long way.

In most cases I don't really need the entire web page, but only a <table>, <ul> or <ol> element in the <body>. Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular id attribute, we can use a JavaScript snippet like this:

//table we're looking for
var startHandle = "<table class=\"lvw\" cellpadding=0 cellspacing=0>";
var startPosition= html.indexOf(startHandle);
//look beyond the start tag to lose the invalid unquoted attributes
startPosition += startHandle.length;

//find where this table ends (lucky us, no nested table elements :)
var endHandle = "</table>";
var endPosition = html.indexOf(endHandle, startPosition);

//make a complete table fragment out of it again
var table = "<table>" + html.substring(startPosition, endPosition + endHandle.length);

//replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes
table = table.replace(/&nbsp;|<img[^>]*>|nowrap/ig, "");


There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.

A better way: using JTidy

While the JavaScript workaround may just work for your particular case, it certainly has disadvantages. Sometimes it may just be not so simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow.

Fortunately, there is a better way.

Using a user-defined Java Class step we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's Get Data from XML step.

We need to do two things to make this work: first, you have to download JTidy, unzip it, and place the jtidy-r938.jar in the libext directory, which resides immediately in your kettle installation directory. (note that if you were running spoon, you need to restart it before it will be picked up). Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the Tidy class inside the jar. With some help from the pentaho wiki and the JTidy JavaDoc documentation, I came up with the following Java snippet to make it work:

import org.w3c.tidy.Tidy;
import java.io.StringReader;
import java.io.StringWriter;

protected Tidy tidy;

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
//create and configure a Tidy instance
tidy = new Tidy();
tidy.setXmlOut(true);
return parent.initImpl(stepMetaInterface, stepDataInterface);
}

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
Object[] r;
//Get row from incoming stream.
//Bail out if its not there.
if ((r = getRow()) == null) {
setOutputDone();
return false;
}

//read the value of the html input field
//the html field happens to be the 5th field in the stream,
//because java arrays start at 0, we use index 4 to reference it
StringReader html = new StringReader((String)r[4]);

//use tidy to parse html to xml
StringWriter xml = new StringWriter();
tidy.parse(html, xml);

//assign the xml to the output row
//note we simply overwrite the original html field from the input row.
r[4] = xml.toString();

//push the output row to the outgoing stream.
putRow(data.outputRowMeta, r);
return true;
}

(Tip: for more examples and background information on the User Defined Java Class step, check out the excellent blog posts by Slawomir Chodnicki, Matt Casters and the video walk-through by Daniel Einspanjer)

The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.

Tuesday, April 20, 2010

Restoring XML-formatted MySQL dumps

To whom it may concern -

The mysqldump program can be used to make logical database backups. Although the vast majority of people use it to create SQL dumps, it is possible to dump both schema structure and data in XML format. There are a few bugs (#52792, #52793) in this feature, but these are not the topic of this post.

XML output from mysqldump

Dumping in XML format is done with the --xml or -X option. In addition, you should use the --hex-blob option otherwise the BLOB data will be dumped as raw binary data, which usually results in characters that are not valid, either according to the XML spec or according to the UTF-8 encoding. (Arguably, this is also a bug. I haven't filed it though.)

For example, a line like:

mysqldump -uroot -pmysql -X --hex-blob --databases sakila
dumps the sakila database to the following XML format:

<?xml version="1.0"?>
<mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<database name="sakila">
<table_structure name="actor">
<field Field="actor_id" Type="smallint(5) unsigned" Null="NO" Key="PRI" Extra="auto_increment" />
<field Field="first_name" Type="varchar(45)" Null="NO" Key="" Extra="" />
<field Field="last_name" Type="varchar(45)" Null="NO" Key="MUL" Extra="" />
<field Field="last_update" Type="timestamp" Null="NO" Key="" Default="CURRENT_TIMESTAMP" Extra="on update CURRENT_TIMESTAMP" />
<key Table="actor" Non_unique="0" Key_name="PRIMARY" Seq_in_index="1" Column_name="actor_id" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
<key Table="actor" Non_unique="1" Key_name="idx_actor_last_name" Seq_in_index="1" Column_name="last_name" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
<options Name="actor" Engine="InnoDB" Version="10" Row_format="Compact" Rows="200" Avg_row_length="81" Data_length="16384" Max_data_length="0" Index_length="16384" Data_free="233832448" Auto_increment="201" Create_time="2009-10-10 10:04:56" Collation="utf8_general_ci" Create_options="" Comment="" />
</table_structure>
<table_data name="actor">
<row>
<field name="actor_id">1</field>
<field name="first_name">PENELOPE</field>
<field name="last_name">GUINESS</field>
<field name="last_update">2006-02-15 03:34:33</field>
</row>

...many more rows and table structures...

</database>
</mysqldump>
I don't want to spend too much time discussing why it would be useful to make backups in this way. There are definitely a few drawbacks - for example, for sakila, the plain SQL dump, even with --hex-blob, is 3.26 MB (3,429,358 bytes), whereas the XML output is 13.7 MB (14,415,665 bytes). Even after zip compression, the XML formatted dump is still one third larger than the plain SQL dump: 936 kB versus 695 kB.

Restoring XML output from mysqldump

A more serious problem is that MySQL doesn't seem to offer any tool to restore XML formatted dumps. The LOAD XML feature, kindly contributed by Erik Wetterberg could be used to some extent for this purpose. However, this feature is not yet available (it will be available in the upcoming version MySQL 5.5), and from what I can tell, it can only load data - not restore tables or databases. I also believe that this feature does not (yet) provide any way to properly restore hex-dumped BLOB data, but I really should test it to know for sure.

Anyway.

In between sessions of the past MySQL users conference I cobbled up an XSLT stylesheet that can convert mysqldump's XML output back to SQL script output. It is available under the LGPL license, and it is hosted on google code as the mysqldump-x-restore project. To get started, you need to download the mysqldump-xml-to-sql.xslt XSLT stylesheet. You also need a command line XSLT processor, like xsltproc. This utility is part of the Gnome libxslt project, and is included in packages for most linux distributions. There is a windows port available for which you can download the binaries.

Assuming that xsltproc is in your path, and the XML dump and the mysqldump-xml-to-sql.xslt are in the current working directory, you can use this command to convert the XML dump to SQL:

xsltproc mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
On Unix-based systems you should be able to pipe the SQL directly into mysql using

xsltproc mysqldump-xml-to-sql.xslt sakila.xml | mysql -uroot -pmysql
The stylesheet comes with a number of options, which can be set through xsltproc's --stringparam option. For example, setting the schema parameter to N will result in an SQL script that only contains DML statements:

xsltproc --stringparam schema N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
Setting the data option to N will result in an SQL script that only contains DDL statements:

xsltproc --stringparam data N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql
There are additional options to control how often a COMMIT should be issued, whether to add DROP statements, whether to generate single row INSERT statements, and to set the max_allowed_packet size.

What's next?

Nothing much really. I don't really recommend that people use mysqldump's XML output. I wrote mysqldump-x-restore for those people that inherited a bunch of XML formatted dumps and don't know what to do with them. I haven't thoroughly tested it - please file a bug if you find one. If you actually think it's useful and you want more features, please let me know, and I'll look into it. I don't have much use for this myself, so if you have great ideas to move this forward, I'll let you have commit access.

That is all.

Thursday, December 13, 2007

Arnold Daniels' lib_mysqludf_xql featured on MySQL devzone

Wohoo! Jon Stephens just published an excellent article on the possibilities for working with XML and MySQL on the MySQL devzone!

I'm very proud that he chose to highlight the lib_mysqludf_xql library of MySQL User Defined functions authored by Arnold Daniels. Some time ago, Arnold joined the MySQL UDF Repository and so far he's been the most active member of the group.

Arnold also took the initiative to set up a new website for the MySQL UDF Repository, and we will soon move all content and libraries from the old location to the new site. (The main reason it hasn't happened yet is all due to me - I need to make time to move the old content to the new site.)

If you want to discuss lib_mysqludf_xql or if you want to join the UDF repository, sponsor our work or contribute in any other way, just go to our google group and post your questions.

Sunday, August 12, 2007

Transforming Japanese and Chinese text from utf8 encoded XML to ASCII Rich text (rtf)

This is one of those things that seemed extremely difficult and cumbersome, but turned out to be trivially simple, and relatively painless. Here is the case:

Using PHP to generate XML documents from a MySQL Database


I am working with a MySQL database storing lots of human-readable text. The text takes the form of semi-structured, well-formed XML. This particular XML format is based on a very small set of elements borrowed from HTML and is used mostly for semantic markup of the text.

The XML fragments are fairly small, and do not represent an end product. They are part of a larger unit: ultimately, they are a part of some document. The final document is not stored as a whole, because it is conceived and maintained on the level of its parts: all these individual XML fragments logically form distinct units.

Rather, complete documents are generated from the MySQL database using a PHP script. The PHP script sends SQL queries to the database to collect all XML fragments that belong in one document. This results in a well-formed XML document consisting of many XML fragments that belong together.
(Diagram: mysql-php-xml)
(The details of the technique used for the actual generation of the XML documents are not discussed any further in this article. It suffices to say that this can be done either in PHP, or in SQL, or a combination of both.)

Natural language text translation and Unicode


There is a variable number of documents, and one particular document can appear in many translations. So, the actual character data comes in a variety of natural languages, such as English, German, Chinese and Japanese. For this reason, the database stores the XML in unicode.

The MySQL column(s) that store this data are defined like this:

xml MEDIUMTEXT
CHARACTER SET utf8
COLLATE utf8_unicode_ci

XSLT converts XML to various formats


The XML documents that are generated from the database do not represent an end product. Rather, the XML is further processed to yield a variety of output documents. Most of these output documents contain the majority of the data in the originally generated XML document.

However, each particular type of output document serves a different purpose, such as delivery to a customer, internal text review, online web presentation etc. In most cases, a particular output document requires a distinct format that best serves its purpose. So, the same information is required in a variety of presentations, such as plain text, HTML and rtf.
(Diagram: xslt-to-several-output-formats)
XSLT is used to transform the XML documents generated from the database into the desired target format(s). XSLT is a powerful special purpose language that is especially designed to transform XML input to plain text, HTML or XML output.

XSLT transformations are themselves denoted using XML. The transformation can be defined using a wide variety of constructs. XSLT uses XPath to define patterns that are matched in the input document. These patterns are associated with templates to generate output.

XSLT templates can emit any structural XML construct; in addition, they can cause particular pattern matching rules to be applied, or perform traditional procedural processing on the input document (such as iteration, or explicitly calling another template).

Why an intermediate step?


Although it is possible to directly generate the desired target documents from the database, an intermediate step through XML has a number of advantages.

Single interface to extract data


With the intermediate XML step, there is just one piece of software that interacts with the database to extract data. This makes the solution more resilient to changes in the database schema. In many cases, a change in schema only requires the XML export functionality to be adjusted - the XSLT transformations can remain unchanged as long as equivalent input XML is generated. Instead of writing (and maintaining!) the-same-but-slightly-different queries, JOINs, ORDER BY's, GROUP BY's over and over, only the XML export script needs to be adjusted.

Translations of human readable text


There is another reason that justifies the intermediate XML format. The text stored in the database is conceived in one language -English- first, and then translated by human translators. Obviously, the translators need the original text first so they can translate it into the equivalent in their respective language. When the translators are done, it must be imported back again into the database.

This process is currently implemented by sending the XML document directly to the translators. The XML format is quite simple and intuitive, and translators are asked to replace the stretches of English text with the corresponding translated text. This is done in a straightforward manner, by directly editing the XML.
(Diagram: translations)
Importing a translation is, again, a matter of applying an XSLT transformation, this time from XML to SQL.
(Diagram: db-load-translation)
(The process is actually a little more complicated because there is also some checking involved to ensure that the structure received from the translators matches the expected structure. In this case, every structural element in the untranslated document must also be present in the translated document. Also, the translated document must not introduce any new structural elements.)

Rich Text Format


For a number of purposes, the application needs to generate output documents in the Rich Text Format (RTF).

Rich Text Format or RTF is a proprietary data format for representing text and (simple) markup. It is owned by Microsoft.

RTF is ASCII encoded: all characters in RTF documents are 7-bit ASCII characters. 8-bit characters and also unicode characters are supported in modern versions using special escape sequences. An RTF unicode character is denoted like this:

\u<decimal character code><ASCII character>

So, \u initiates a unicode escape sequence telling the RTF agent to expect an integer that specifies a character in the unicode character set. Immediately after that, there must be a normal ASCII character that can be used by RTF agents that do not implement the substitution of the unicode escape sequence with the respective unicode character.

For example, the text

日本語

can be written like this

\u26085?\u26412?\u35486?

in RTF. Unicode aware RTF agents should print the Japanese text 日本語. RTF clients that do not understand or implement the unicode escape will use the question mark instead, because that is placed immediately after each escape sequence:

???

Generating RTF from a unicode source with XSLT


As described earlier, XSLT is used to generate RTF documents (among others). So how do we ensure that the unicode characters in the input documents are converted to ASCII characters? How do we make sure the unicode characters that do not fit in ASCII are properly escaped using the RTF unicode escape syntax?

This is something that looked like a very hard problem to solve quickly and cleanly. Luckily, after only a little experimentation, it turned out to be very, very easy, and required almost no extra code. The solution consists of two steps. One step involves a very minor but extremely important modification at the XSLT level. The other step involves some processing of the transformation result.

Choosing an ASCII output encoding with <xsl:output>


XSLT transformations can optionally contain the <xsl:output> element to control various aspects of the generated output. I used <xsl:output> many times before to control whether the output of the transformation should be done using xml, plain text or HTML. The <xsl:output> element can also be used to specify the character encoding of the tranformation result.

Since RTF is essentially an ASCII format, this is the encoding we need to specify:

<xsl:output
method="text"
encoding="ascii"
/>

This ensures that only ASCII characters result from the transformation process, even if the input document is, say, utf8-encoded text (as is the case here). The first 128 unicode characters map to the equivalent ASCII characters. Characters with a character code that exceeds 127 are output using XML numeric character references. So for example,

日本語

becomes:

&#26085;&#26412;&#35486;

Looks familiar? Thought so ;)

Replacing XML entity references with RTF unicode escapes using PHP regular expressions


So after letting XSLT do the dirty work of performing the character code conversions and escaping them to entity references, we still need one final step to turn these entity references into RTF unicode escape sequences. As we are in a PHP environment, this is a perfect job for the Perl-compatible regular expression (PCRE) extension:

$rtf = preg_replace(
'/&#([0-9]+);/'
, '\u${1}?'
, $xslt_result
);

And that is all there is to it. Maybe a tad ugly, but it works ;)
