Sunday, August 12, 2007

Transforming Japanese and Chinese text from utf8 encoded XML to ASCII Rich text (rtf)

This is one of those things that seemed extremely difficult and cumbersome, but turned out to be trivially simple, and relatively painless. Here is the case:

Using PHP to generate XML documents from a MySQL Database


I am working with a MySQL database storing lots of human-readable text. The text takes the form of semi-structured, well-formed XML. This particular XML format is based on a very small set of elements borrowed from HTML and is used mostly for semantic markup of the text.

The XML fragments are fairly small, and do not represent an end product. They are part of a larger unit: ultimately, they are a part of some document. The final document is not stored as a whole, because it is conceived and maintained on the level of its parts: all these individual XML fragments logically form distinct units.

Rather, complete documents are generated from the MySQL database using a PHP script. The PHP script sends SQL queries to the database to collect all XML fragments that belong in one document. This results in a well-formed XML document consisting of many XML fragments that belong together.
[Figure: mysql-php-xml]
(The details of the technique used to generate the XML documents are not discussed any further in this article. It suffices to say that this can be done in PHP, in SQL, or in a combination of both.)
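
To make the idea concrete, here is a minimal sketch of the assembly step. It is not the actual export script: the element names `document` and `fragment`, the helper `assemble_document()` and the `$rows` array (a stand-in for rows fetched from MySQL) are all assumptions made up for illustration.

```php
<?php
// Hypothetical helper: wrap already well-formed XML fragments in one root element.
// The element names "document" and "fragment" are assumptions, not the real format.
function assemble_document(array $fragments, string $lang): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<document lang="' . htmlspecialchars($lang, ENT_QUOTES) . '">' . "\n";
    foreach ($fragments as $id => $fragment) {
        // Each fragment is stored as well-formed XML, so it is copied verbatim.
        $xml .= '  <fragment id="' . (int) $id . '">' . $fragment . "</fragment>\n";
    }
    return $xml . "</document>\n";
}

// Stand-in for the result of the SQL queries (id => xml column).
$rows = [
    1 => '<p>Hello <em>world</em></p>',
    2 => '<p>日本語</p>',
];
$document = assemble_document($rows, 'ja');

// Parsing the result is a cheap well-formedness check.
$dom = new DOMDocument();
$ok  = $dom->loadXML($document);
```

Because every fragment is already well-formed, plain string concatenation is enough in this sketch; a fragment that is not well-formed would make `loadXML()` fail.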

Natural language text translation and Unicode


There is a variable number of documents, and one particular document can appear in many translations. So, the actual character data comes in a variety of natural languages, such as English, German, Chinese and Japanese. For this reason, the database stores the XML in Unicode.

The MySQL column(s) that store this data are defined like this:

xml MEDIUMTEXT
CHARACTER SET utf8
COLLATE utf8_unicode_ci

XSLT converts XML to various formats


The XML documents that are generated from the database do not represent an end product. Rather, the XML is further processed to yield a variety of output documents. Most of these output documents contain the majority of the data in the originally generated XML document.

However, each particular type of output document serves a different purpose, such as delivery to a customer, internal text review, online web presentation etc. In most cases, a particular output document requires a distinct format that best serves its purpose. So, the same information is required in a variety of presentations, such as plain text, HTML and rtf.
[Figure: xslt-to-several-ouput-formats]
XSLT is used to transform the XML documents generated from the database into the desired target format(s). XSLT is a powerful special purpose language that is especially designed to transform XML input to plain text, HTML or XML output.

XSLT transformations are themselves denoted using XML. The transformation can be defined using a wide variety of constructs. XSLT uses XPath to define patterns that are matched in the input document. These patterns are associated with templates to generate output.

XSLT templates can emit any structural XML construct. In addition, they can cause particular pattern matching rules to be applied, or perform traditional procedural processing on the input document (such as iterating over nodes, or explicitly calling another template).

Why an intermediate step?


Although it is possible to directly generate the desired target documents from the database, an intermediate step through XML has a number of advantages.

Single interface to extract data


With the intermediate XML step, there is just one piece of software that interacts with the database to extract data. This makes the solution more resilient to changes in the database schema. In many cases, a change in schema only requires the XML export functionality to be adjusted - the XSLT transformations can remain unchanged as long as equivalent input XML is generated. Instead of writing (and maintaining!) the-same-but-slightly-different queries, JOINs, ORDER BY's, GROUP BY's over and over, only the XML export script needs to be adjusted.

Translations of human readable text


There is another reason that justifies the intermediate XML format. The text stored in the database is conceived in one language (English) first, and then translated by human translators. Obviously, the translators need the original text first so they can translate it into the equivalent in their respective language. When the translators are done, the translated text must be imported back into the database.

This process is currently implemented by sending the XML document directly to the translators. The XML format is quite simple and intuitive, and translators are asked to replace the stretches of English text with the corresponding translated text. This is done in a straightforward manner, by directly editing the XML.
[Figure: translations]
Importing a translation is, again, a matter of applying an XSLT transformation, this time from XML to SQL.
[Figure: db-load-translation]
(The process is actually a little more complicated because there is also some checking involved to ensure that the structure received from the translators matches the expected structure. In this case, every structural element in the untranslated document must also be present in the translated document. Also, the translated document must not introduce any new structural elements.)
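
A structure check along these lines could be sketched as follows. This is an assumption about how such a check might look, not the code that is actually used: the helpers `structure_signature()` and `same_structure()` are hypothetical, and the sketch is stricter than the rule described above because it also requires the elements to appear in the same order and nesting.

```php
<?php
// Reduce a document to its element structure: nested element names, text ignored.
function structure_signature(DOMNode $node): string
{
    $sig = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $sig .= '<' . $child->nodeName . '>' . structure_signature($child) . '</>';
        }
    }
    return $sig;
}

// True when both documents are well-formed and have identical element structure.
function same_structure(string $originalXml, string $translatedXml): bool
{
    $a = new DOMDocument();
    $b = new DOMDocument();
    if (!@$a->loadXML($originalXml) || !@$b->loadXML($translatedXml)) {
        return false; // at least one document is not well-formed
    }
    return structure_signature($a) === structure_signature($b);
}

$english = '<doc><p>Hello <em>world</em></p></doc>';
$german  = '<doc><p>Hallo <em>Welt</em></p></doc>';
$broken  = '<doc><p>Hallo Welt</p></doc>';          // the <em> element got lost

$accepted = same_structure($english, $german);
$rejected = same_structure($english, $broken);
```

Only the element names take part in the comparison, so the translated character data can differ freely.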

Rich Text Format


For a number of purposes, the application needs to generate output documents in the Rich Text Format.

Rich Text Format or RTF is a proprietary data format for representing text and (simple) markup. It is owned by Microsoft.

RTF is ASCII encoded: all characters in RTF documents are 7-bit ASCII characters. 8-bit characters and Unicode characters are supported in modern versions using special escape sequences. An RTF Unicode character is denoted like this:

\u<decimal character code><ASCII character>

So, \u initiates a unicode escape sequence telling the RTF agent to expect an integer that specifies a character in the unicode character set. Immediately after that, there must be a normal ASCII character that can be used by RTF agents that do not implement the substitution of the unicode escape sequence with the respective unicode character.

For example, the text

日本語

can be written like this

\u26085?\u26412?\u35486?

in RTF. Unicode aware RTF agents should print the Japanese text 日本語. RTF clients that do not understand or implement the unicode escape will use the question mark instead, because that is placed immediately after each escape sequence:

???
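
The two behaviours can be mimicked in a few lines of PHP. This is only a sketch of the idea, not a real RTF reader: the function name `render_rtf_escapes()` is made up, and it assumes exactly one fallback character after each escape and ignores everything else in the RTF syntax.

```php
<?php
// Sketch of how two kinds of RTF agent treat "\uN?" escape sequences.
function render_rtf_escapes(string $rtf, bool $unicodeAware): string
{
    return preg_replace_callback(
        '/\\\\u(-?[0-9]+)(.)/su',   // \u, a decimal number, one fallback character
        function (array $m) use ($unicodeAware): string {
            if (!$unicodeAware) {
                return $m[2];       // keep only the ASCII fallback character
            }
            $code = (int) $m[1];
            if ($code < 0) {
                $code += 65536;     // negative numbers stand for 32768..65535
            }
            // Let PHP build the UTF-8 character for this code point.
            return html_entity_decode('&#' . $code . ';', ENT_NOQUOTES, 'UTF-8');
        },
        $rtf
    );
}

$rtf = '\u26085?\u26412?\u35486?';
$withUnicode    = render_rtf_escapes($rtf, true);   // the Japanese text
$withoutUnicode = render_rtf_escapes($rtf, false);  // the fallback characters
```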

Generating RTF from a unicode source with XSLT


As described earlier, XSLT is used to generate RTF documents (among others). So how do we ensure that the unicode characters in the input documents are converted to ASCII characters? How do we make sure the unicode characters that do not fit in ASCII are properly escaped using the RTF unicode escape syntax?

This is something that looked like a very hard problem to solve quickly and cleanly. Luckily, after only a little experimentation, it turned out to be very, very easy, and to require almost no extra code. The solution consists of two steps. One step involves a very minor but extremely important modification at the XSLT level. The other step involves some processing of the transformation result.

Choosing an ASCII output encoding with <xsl:output>


XSLT transformations can optionally contain the <xsl:output> element to control various aspects of the generated output. I used <xsl:output> many times before to control whether the output of the transformation should be XML, plain text or HTML. The <xsl:output> element can also be used to specify the character encoding of the transformation result.

Since RTF is essentially an ASCII format, this is the encoding we need to specify:

<xsl:output
method="text"
encoding="ascii"
/>

This ensures that only ASCII characters result from the transformation process, even if the input document is, say, utf8 encoded (as is the case here). The first 128 unicode characters (codes 0 through 127) map to the equivalent ASCII characters. Characters with a character code that exceeds 127 are output using XML numeric character references. So for example,

日本語

becomes:

&#26085;&#26412;&#35486;

Looks familiar? Thought so ;)

Replacing XML character references with RTF unicode escapes using PHP regular expressions


So after letting XSLT do the dirty work of performing the character code conversions and escaping them to character references, we still need one final step to turn these character references into RTF unicode escape sequences. As we are in a PHP environment, this is a perfect job for the Perl-compatible regular expression (PCRE) extension:

$rtf = preg_replace(
'/&#([0-9]+);/'   // a decimal numeric character reference
, '\\\\u${1}?'    // a literal \u, the captured number, and ? as fallback
, $xslt_result
);

And that is all there is to it. Maybe a tad ugly, but it works ;)
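
One caveat with a plain replacement: as far as I know, the RTF specification treats the number after \u as a signed 16-bit value, so codes above 32767 are supposed to be written as negative numbers, and characters beyond U+FFFF as two escapes forming a UTF-16 surrogate pair. Many RTF readers seem to accept the positive form as well, but a stricter variant could look like the sketch below. The function name `char_refs_to_rtf()` is made up, and accepting hexadecimal references (`&#x...;`) is an extra assumption in case an XSLT processor emits those.

```php
<?php
// Turn XML numeric character references into RTF \uN? escape sequences.
function char_refs_to_rtf(string $text): string
{
    return preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/',
        function (array $m): string {
            // Group 1 holds a hexadecimal reference, group 2 a decimal one.
            $code = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];

            if ($code > 0xFFFF) {
                // Outside the Basic Multilingual Plane: use a UTF-16 surrogate pair.
                $code -= 0x10000;
                $units = [0xD800 | ($code >> 10), 0xDC00 | ($code & 0x3FF)];
            } else {
                $units = [$code];
            }

            $rtf = '';
            foreach ($units as $unit) {
                // \u takes a signed 16-bit number: 32768..65535 become negative.
                $rtf .= '\\u' . ($unit > 32767 ? $unit - 65536 : $unit) . '?';
            }
            return $rtf;
        },
        $text
    );
}

$rtf = char_refs_to_rtf('&#26085;&#26412;&#35486;');
// $rtf is now \u26085?\u26412?\u-30050?
```

Note that 35486 (語) is larger than 32767, so in this variant it comes out as 35486 - 65536 = -30050.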

Tuesday, August 07, 2007

Windows Vista Testdrive: "feedback is not defined" and "open is not a function"

I was just scavenging the web for a decent picture of the Microsoft Windows logo when my eye caught the site www.windowsvistatestdrive.com. I am currently a happy (K)Ubuntu GNU/Linux user, but I decided to give it a spin anyway.

(I know some people have strong opinions on Linux vs Windows, especially Windows Vista, but to me, the computer and its operating system are just tools to get my work done. I am not a religious person, and god forbid I become religious over something as trivial as the right flavour of computing environment.)

Well, the result of my test drive is shown below:

[Screenshot: MSWinTestdrive]

As you can see, I got what looked like a web rendering of Windows Vista...except that opening the page gave me one error immediately:

feedback is not defined

which is kind of a bummer for a site that is designed to solicit feedback.

I clicked the "start" button anyway, and this gave me the following pearl:

open is not a function


I doubt irony was ever so appropriate ;)

Why web developers should use Linux

Recently, I revamped my old Organization Chart javascript widget to a new version. Since I released this code some time ago, I have received quite a few emails from people who are actually using it for their company's intranet site.

On the one hand, it makes me very happy to see people are actually using it for professional purposes. On the other hand, at the time I slapped it together in a few hours total, building on some old experience. As it turned out, users were (and are) having quite a few issues customizing their organization charts properly. I even received some code contributions to improve the functionality.

Now, a few days ago, I finally found some time to address the major issues. But something grave changed in the meantime. When I first created and released the widget, I was a Windows user, and used to create stuff that would work on Microsoft Internet Explorer, and then modify it to work on Mozilla/Firefox. Now, it has been about a year since I replaced Microsoft Windows with (K)Ubuntu Linux. Meaning, I am without Internet Explorer....or am I ;)

Enter IEs4Linux:



The IEs4Linux project offers the real, genuine Microsoft Windows Internet Explorer as a Wine emulation:



One of the compelling things about the IEs4Linux project is that it allows you to run several versions of MS Internet Explorer on the same machine - something I never got to work on my native Windows machine.

Obviously, web developers that need to test their pages on several browsers and browser versions are better off on Linux + Wine than native Windows, as Linux + Wine + IEs4Linux allows them to test their pages on several versions of Internet Explorer.

Thanks to the people from IEs4Linux and Wine!
