Using PHP to generate XML documents from a MySQL Database
I am working with a MySQL database storing lots human readable text. The text takes the form of semi-structured, well-formed XML. This particular XML-format is based on a very small set of elements borrowed from HTML and is used mostly for semantic markup of the text.
The XML fragments are fairly small, and not represent an end product. They are part of a larger unit: ultimately, they are a part of some document. The final document is not stored as a whole, because it is conceived and maintained on the level of its parts: all these individual XML fragments logically form distinct units.
Rather, complete documents are generated from the MySQL database using a PHP script. The PHP script sends SQL queries to the database to collect all XML fragments that belong in one document. This results in a well-formed XML document consisting of many XML fragments that belong together.
(The details of the actual technique used for the actual generation of the XML documents are not discussed any further in this article. Is suffices to say that this can be done either in PHP, or in SQL or a combination of both.)
Natural language text translation and Unicode
There is a variable number of documents, and one particular document can appear in many translations. So, the actual character data comes in a variety of natural languages, such as English, German, Chinese and Japanese. For this reason, the database stores the XML in unicode.
The MySQL column(s) that store this data are defined like this:
xml MEDIUMTEXT
CHARACTER SET utf8
COLLATE utf8_unicode_ci
XSLT converts XML to various formats
The XML documents that are generated from the database do not represent an end product. Rather, the XML is further processed to yield a variety of output documents. Most of these output documents contain the majority of the data in the originally generated XML document.
However, each particular type of output document serves a different purpose, such as delivery to a customer, internal text review, online web presentation etc. In most cases, a particular output document requires a distinct format that best serves its purpose. So, the same information is required in a variety of presentations, such as plain text, HTML and rtf.
XSLT is used to transform the XML documents generated from the database into the desired target format(s). XSLT is a powerful special purpose language that is especially designed to transform XML input to plain text, HTML or XML output.
XSLT transformations are themselves denoted using XML. The transformation can be defined using a wide variety of constructs. XSLT uses XPath to define patterns that are matched in the input document. These patterns are associated with templates to generate output.
XSLT Templates can emit any structural XML construct, and in addition cause particular pattern matching rules to be applied, or perform traditional procedural processing on the input document (such as iteration, or explicitly call another template to be applied.)
Why an intermediate step?
Although it is possible to directly generate the desired target documents from the database, an intermediate step through XML has a number of advantages.
Single interface to extract data
With the intermediate XML step, there is just one piece of software that interacts with the database to extract data. This makes the solution more resilient to changes in the database schema. In many cases, a change in schema only requires the XML export functionality to be adjusted - the XSLT transformations can remain unchanged as long as equivalent input XML is generated. Instead of writing (and maintaining!) the-same-but-slightly-different queries,
JOIN
s, ORDER BY
's, GROUP BY
's over and over, only the XML export script needs to be adjusted.Translations of human readable text
There is another reason that justifies the intermediate XML format. The text stored in the database is conceived in one language -English- first, and then translated by human translators. Obviously, the translators need the original text first so they can translate it into the equivalent in their respective language. When the translators are done, it must be imported back again into the database.
This process is currently implemented by sending the XML document directly to the translators. The XML format is quite simple and intuitive, and translators are asked to replace the stretches of English text with the corresponding translated text. This is done in a straightforward manner, by directly editing the XML.
Importing a translation is, again, a matter of applying an XSLT transformation, this time from XML to SQL.
(The process is actually a little more complicated because there is also some checking involved to ensure that the structure received from the translators matches the expected structure. In this case, every structural element in the untranslated document must also be present in the translated document. Also, the translated document must not introduce any new structural elements.)
Rich Text Format
For a number of purposes, the application needs to generate output documents in the Rich Text Format format.
Rich Text Format or RTF is a proprietary data format for representing text and (simple) markup. It is owned by Microsoft
RTF is ASCII encoded: all characters in RTF documents are 7-bit ASCII characters. 8-bit characters and also unicode characters are supported in modern versions using special escape sequences. A RTF unicode character is denoted like this:
\u<decimal character code><ASCII character>
So,
\u
initiates a unicode escape sequence telling the RTF agent to expect an integer that specifies a character in the unicode character set. Immediately after that, there must be a normal ASCII character that can be used by RTF agents that do not implement the substitution of the unicode escape sequence with the respective unicode character.For example, the text
日本語
can be written like this
\u26085?\u26412?\u35486?
in RTF. Unicode aware RTF agents should print the Japanese text 日本語. RTF clients that do not understand or implement the unicode escape will use the question mark instead, because that is placed immediately after each escape sequence:
???
Generating RTF from a unicode source with XSLT
As described earlier, XSLT is used to generate RTF documents (among others). So how do we ensure that the unicode characters in the input documents are converted to ASCII characters? How do we make sure the unicode characters that do not fit in ASCII are properly escaped using the RTF unicode escape syntax?
This is something that looked like a very hard problem to solve quickly and cleanly. Luckily, after only little experimentation, it turned out to be very, very easy, and if required almost no extra code. The solution consists of two steps. One step involves a very minor but extremely important modification at the XSLT level. The other step involves some processing of the transformation result.
Choosing an ASCII output encoding with <xsl:output>
XSLT transformations can optionally contain the <xsl:output> element to control various aspects of the generated output. I used <xsl:output> many times before to control whether the output of the transformation should be done using xml, plain text or HTML. The <xsl:output> element can also be used to specify the character encoding of the tranformation result.
Since RTF is essentially an ASCII format, this is the encoding we need to specify:
<xsl:output
method="text"
encoding="ascii"
/>
This ensures that only ASCII characters result from the transformation process, even if the input document is in, say an utf8 encoded text (as is the case here). The first 127 unicode characters map to the equivalent ASCII character. Characters with a character code that exceeds 127 are output using XML numeric character references. So for example,
日本語
becomes:
日本語
Looks familiar? Thought so ;)
Replacing XML entity references with RTF unicode escapes using PHP regular expressions
So after letting XSLT do the dirty work of performing the character code conversions and escaping them to entity references, we still need one final step to make RTF unicode escape sequences of these entity references. As we are in a PHP environment, this is a perfect job for the Pearl compatible regular expression extension:
$rtf = preg_replace(
'/([0-9]+);/'
'\u${1}?'
, $xslt_result
);
And that is all there is to it. Maybe a tad ugly, but it works ;)