Sunday, August 12, 2007

Transforming Japanese and Chinese text from utf8 encoded XML to ASCII Rich text (rtf)

This is one of those things that seemed extremely difficult and cumbersome, but turned out to be trivially simple, and relatively painless. Here is the case:

Using PHP to generate XML documents from a MySQL Database


I am working with a MySQL database storing lots human readable text. The text takes the form of semi-structured, well-formed XML. This particular XML-format is based on a very small set of elements borrowed from HTML and is used mostly for semantic markup of the text.

The XML fragments are fairly small, and not represent an end product. They are part of a larger unit: ultimately, they are a part of some document. The final document is not stored as a whole, because it is conceived and maintained on the level of its parts: all these individual XML fragments logically form distinct units.

Rather, complete documents are generated from the MySQL database using a PHP script. The PHP script sends SQL queries to the database to collect all XML fragments that belong in one document. This results in a well-formed XML document consisting of many XML fragments that belong together.
mysql-php-xml
(The details of the actual technique used for the actual generation of the XML documents are not discussed any further in this article. Is suffices to say that this can be done either in PHP, or in SQL or a combination of both.)

Natural language text translation and Unicode


There is a variable number of documents, and one particular document can appear in many translations. So, the actual character data comes in a variety of natural languages, such as English, German, Chinese and Japanese. For this reason, the database stores the XML in unicode.

The MySQL column(s) that store this data are defined like this:

xml MEDIUMTEXT
CHARACTER SET utf8
COLLATE utf8_unicode_ci

XSLT converts XML to various formats


The XML documents that are generated from the database do not represent an end product. Rather, the XML is further processed to yield a variety of output documents. Most of these output documents contain the majority of the data in the originally generated XML document.

However, each particular type of output document serves a different purpose, such as delivery to a customer, internal text review, online web presentation etc. In most cases, a particular output document requires a distinct format that best serves its purpose. So, the same information is required in a variety of presentations, such as plain text, HTML and rtf.
xslt-to-several-ouput-formats
XSLT is used to transform the XML documents generated from the database into the desired target format(s). XSLT is a powerful special purpose language that is especially designed to transform XML input to plain text, HTML or XML output.

XSLT transformations are themselves denoted using XML. The transformation can be defined using a wide variety of constructs. XSLT uses XPath to define patterns that are matched in the input document. These patterns are associated with templates to generate output.

XSLT Templates can emit any structural XML construct, and in addition cause particular pattern matching rules to be applied, or perform traditional procedural processing on the input document (such as iteration, or explicitly call another template to be applied.)

Why an intermediate step?


Although it is possible to directly generate the desired target documents from the database, an intermediate step through XML has a number of advantages.

Single interface to extract data


With the intermediate XML step, there is just one piece of software that interacts with the database to extract data. This makes the solution more resilient to changes in the database schema. In many cases, a change in schema only requires the XML export functionality to be adjusted - the XSLT transformations can remain unchanged as long as equivalent input XML is generated. Instead of writing (and maintaining!) the-same-but-slightly-different queries, JOINs, ORDER BY's, GROUP BY's over and over, only the XML export script needs to be adjusted.

Translations of human readable text


There is another reason that justifies the intermediate XML format. The text stored in the database is conceived in one language -English- first, and then translated by human translators. Obviously, the translators need the original text first so they can translate it into the equivalent in their respective language. When the translators are done, it must be imported back again into the database.

This process is currently implemented by sending the XML document directly to the translators. The XML format is quite simple and intuitive, and translators are asked to replace the stretches of English text with the corresponding translated text. This is done in a straightforward manner, by directly editing the XML.
translations
Importing a translation is, again, a matter of applying an XSLT transformation, this time from XML to SQL.
db-load-translation
(The process is actually a little more complicated because there is also some checking involved to ensure that the structure received from the translators matches the expected structure. In this case, every structural element in the untranslated document must also be present in the translated document. Also, the translated document must not introduce any new structural elements.)

Rich Text Format


For a number of purposes, the application needs to generate output documents in the Rich Text Format format.

Rich Text Format or RTF is a proprietary data format for representing text and (simple) markup. It is owned by Microsoft

RTF is ASCII encoded: all characters in RTF documents are 7-bit ASCII characters. 8-bit characters and also unicode characters are supported in modern versions using special escape sequences. A RTF unicode character is denoted like this:

\u<decimal character code><ASCII character>

So, \u initiates a unicode escape sequence telling the RTF agent to expect an integer that specifies a character in the unicode character set. Immediately after that, there must be a normal ASCII character that can be used by RTF agents that do not implement the substitution of the unicode escape sequence with the respective unicode character.

For example, the text

日本語

can be written like this

\u26085?\u26412?\u35486?

in RTF. Unicode aware RTF agents should print the Japanese text 日本語. RTF clients that do not understand or implement the unicode escape will use the question mark instead, because that is placed immediately after each escape sequence:

???

Generating RTF from a unicode source with XSLT


As described earlier, XSLT is used to generate RTF documents (among others). So how do we ensure that the unicode characters in the input documents are converted to ASCII characters? How do we make sure the unicode characters that do not fit in ASCII are properly escaped using the RTF unicode escape syntax?

This is something that looked like a very hard problem to solve quickly and cleanly. Luckily, after only little experimentation, it turned out to be very, very easy, and if required almost no extra code. The solution consists of two steps. One step involves a very minor but extremely important modification at the XSLT level. The other step involves some processing of the transformation result.

Choosing an ASCII output encoding with <xsl:output>


XSLT transformations can optionally contain the <xsl:output> element to control various aspects of the generated output. I used <xsl:output> many times before to control whether the output of the transformation should be done using xml, plain text or HTML. The <xsl:output> element can also be used to specify the character encoding of the tranformation result.

Since RTF is essentially an ASCII format, this is the encoding we need to specify:

<xsl:output
method="text"
encoding="ascii"
/>

This ensures that only ASCII characters result from the transformation process, even if the input document is in, say an utf8 encoded text (as is the case here). The first 127 unicode characters map to the equivalent ASCII character. Characters with a character code that exceeds 127 are output using XML numeric character references. So for example,

日本語

becomes:

&#26085;&#26412;&#35486;

Looks familiar? Thought so ;)

Replacing XML entity references with RTF unicode escapes using PHP regular expressions


So after letting XSLT do the dirty work of performing the character code conversions and escaping them to entity references, we still need one final step to make RTF unicode escape sequences of these entity references. As we are in a PHP environment, this is a perfect job for the Pearl compatible regular expression extension:

$rtf = preg_replace(
'/&#([0-9]+);/'
'\u${1}?'
, $xslt_result
);

And that is all there is to it. Maybe a tad ugly, but it works ;)

3 comments:

Stevie said...

Hi Roland,

extremely bright aproach - great. It's axactly the problem I have.

One thing I don't know: how do do the conversion from utf-8 to ascii &#xxxx using xslt? I have never used xslt, allthough I consider myself to be an experienced php developer.
Can you give a code example?

Best regards

Stefan

rpbouman said...

Hi Stefan,

"how do do the conversion from utf-8 to ascii &#xxxx using xslt?"

As I described, this turned out to be the easy part.

You have the source XML document in utf8. The XSLT processor applies a XSLT stylesheet to the XML source, finding XML structures (using XPath expressions) and replacing them with RTF text content (from templates)

Now that utf8 to ascii encoding problem is solved by using an <xsl:output> element in the XSLT stylesheet that specifies that the output encoding should be ASCII. Then it automagically outputs the UTF8 characters that do not fit into ASCII using standard XML character entities in the &#xxxx; notation.

For an XSLT tutorial, look here:

http://www.w3schools.com/xsl/xsl_transformation.asp

Good luck!

Masanja said...

Thank you, Roland.
I can advice you to visit this site: php tutorials that to know about php and XML more.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...