Showing posts with label XSLT. Show all posts

Wednesday, June 22, 2011

Working with namespaces and namespace prefixes in XSLT 1.0

To whom it may concern -

I'm currently developing a Xslt 1.0 stylesheet for analysis of XML Schema documents. As part of this work, I developed a couple of templates for working with namespace names and prefixes, and I like to share them via this post. The code is not incredibly hard or advanced, but it gets the job done and it may save you some time if you need something similar.

get-local-name: Return the local name part of a given QName. function in XPath 2.0
get-prefix: Return the prefix part of a given QName.
get-ns-name: Return the namespace name associated to the given prefix.
get-ns-prefix: Return a prefix that can be used to denote the given namespace name.
resolve-ns-identifier: Return the namespace name for a given QName prefix

Before I discuss the code, I want to make a few remarks:

This s all about Xslt 1.0 and its query langue XPath 1.0. All these things can be solved much more conveniently in XPath 2.0, and hence in Xslt 2.0 because that builds on Xpath 2.0 (and not XPath 1.0 like Xslt 1.0 does)
If you're planning to use this in a web-browser, and you want to target Firefox, your're out of luck. Sorry. Firefox is a greatt browser, but unlike Chrome, Opera and even Internet Explorer, it doesn't care enough about Xslt to fix bug #94270, which has been living in their bug tracker as long as August 2001 (nope - I didn't mistype 2011, that's 2001 as in almost a decade ago)

get-local-name

Return the local name part of a given QName. This is functionally equivalent to the fn:local-name-from-QName

<!-- get the last substring after the *last* colon (or he argument if no colon) -->
<xsl:template name="get-local-name">
    <xsl:param name="qname"/>
    <xsl:choose>
        <xsl:when test="contains($qname, ':')">
            <xsl:call-template name="get-local-name">
                <xsl:with-param name="qname" select="substring-after($qname, ':')"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$qname"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

get-prefix

Return the prefix part of a given QName. This is functionally equivalent to the fn:prefix-from-QName function in XPath 2.0

<!-- get the substring before the *last* colon (or empty string if no colon) -->
<xsl:template name="get-prefix">
    <xsl:param name="qname"/>
    <xsl:param name="prefix" select="''"/>
    <xsl:choose>
        <xsl:when test="contains($qname, ':')">
            <xsl:call-template name="get-prefix">
                <xsl:with-param name="qname" select="substring-after($qname, ':')"/>
                <xsl:with-param name="prefix" select="concat($prefix, substring-before($qname, ':'))"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$prefix"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

get-ns-name

Return the namespace name associated to the given prefix. This is functionally equivalent to the fn:namespace-uri-for-prefix function in XPath 2.0. The main difference is that this template does the lookup against the namespace definitions that are in effect in the current context, whereas the XPath 2.0 function allows the element which is used as context to be passed in as argument.

<!-- get the namespace uri for the namespace identified by the prefix in the parameter -->
<xsl:template name="get-ns-name">
    <xsl:param name="ns-prefix"/>
    <xsl:variable name="ns-node" select="namespace::node()[local-name()=$ns-prefix]"/>
    <xsl:value-of select="$ns-node"/>
</xsl:template>

get-ns-prefix

Return a prefix that can be used to denote the given namespace name. This template is complementary to the get-ns-name template. This template assumes only one prefix will be defined for each namespace. The namspace is resolved against the current context.

<!-- get the namespace prefix for the namespace name parameter -->
<xsl:template name="get-ns-prefix">
    <xsl:param name="ns-name"/>
    <xsl:variable name="ns-node" select="namespace::node()[.=$ns-name]"/>
    <xsl:value-of select="local-name($ns-node)"/>
</xsl:template>

resolve-ns-identifier

Return the namespace name for a given QName prefix (be it a namespace prefix or a namspace name). This template is useful to generically obtain a namespace name when feeding it the prefix part of a QName. If the prefix happens to be a namespace name, then that is returned, but if it happens to be a namespace prefix, then a lookup is performed to return the namspace name. This template also looks at the namspaces in effect in the current context.

<!-- return the namespace name -->
<xsl:template name="resolve-ns-identifier">
    <xsl:param name="ns-identifier"/>
    <xsl:choose>
        <xsl:when test="namespace::node()[.=$ns-identifier]">
            <xsl:value-of select="$ns-identifier"/>
        </xsl:when>
        <xsl:when test="namespace::node()[local-name()=$ns-identifier]">
            <xsl:value-of select="namespace::node()[local-name()=$ns-identifier]"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:message terminate="yes">
                Error: "<xsl:value-of select="$ns-identifier"/>" is neither a valid namespace prefix nor a valid namespace name.
            </xsl:message>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

Tuesday, April 20, 2010

Restoring XML-formatted MySQL dumps

To whom it may concern -

The mysqldump program can be used to make logical database backups. Although the vast majority of people use it to create SQL dumps, it is possible to dump both schema structure and data in XML format. There are a few bugs (#52792, #52793) in this feature, but these are not the topic of this post.

XML output from mysqldump

Dumping in XML format is done with the --xml or -X option. In addition, you should use the --hex-blob option otherwise the BLOB data will be dumped as raw binary data, which usually results in characters that are not valid, either according to the XML spec or according to the UTF-8 encoding. (Arguably, this is also a bug. I haven't filed it though.)

For example, a line like:


mysqldump -uroot -pmysql -X --hex-blob --databases sakila

dumps the sakila database to the following XML format:


<?xml version="1.0"?>
<mysqldump xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <database name="sakila">
    <table_structure name="actor">
      <field Field="actor_id" Type="smallint(5) unsigned" Null="NO" Key="PRI" Extra="auto_increment" />
      <field Field="first_name" Type="varchar(45)" Null="NO" Key="" Extra="" />
      <field Field="last_name" Type="varchar(45)" Null="NO" Key="MUL" Extra="" />
      <field Field="last_update" Type="timestamp" Null="NO" Key="" Default="CURRENT_TIMESTAMP" Extra="on update CURRENT_TIMESTAMP" />
      <key Table="actor" Non_unique="0" Key_name="PRIMARY" Seq_in_index="1" Column_name="actor_id" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
      <key Table="actor" Non_unique="1" Key_name="idx_actor_last_name" Seq_in_index="1" Column_name="last_name" Collation="A" Cardinality="200" Null="" Index_type="BTREE" Comment="" />
      <options Name="actor" Engine="InnoDB" Version="10" Row_format="Compact" Rows="200" Avg_row_length="81" Data_length="16384" Max_data_length="0" Index_length="16384" Data_free="233832448" Auto_increment="201" Create_time="2009-10-10 10:04:56" Collation="utf8_general_ci" Create_options="" Comment="" />
    </table_structure>
    <table_data name="actor">
      <row>
        <field name="actor_id">1</field>
        <field name="first_name">PENELOPE</field>
        <field name="last_name">GUINESS</field>
        <field name="last_update">2006-02-15 03:34:33</field>
      </row>

...many more rows and table structures...

  </database>
</mysqldump>

I don't want to spend too much time discussing why it would be useful to make backups in this way. There are definitely a few drawbacks - for example, for sakila, the plain SQL dump, even with --hex-blob is 3.26 MB (3.429.358 bytes), whereas the XML output is 13.7 MB (14,415,665 bytes). Even after zip compression, the XML formatted dump is still one third larger than the plain SQL dump: 936 kB versus 695 kB.

Restoring XML output from mysqldump

A more serious problem is that MySQL doesn't seem to offer any tool to restore XML formatted dumps. The LOAD XML feature, kindly contributed by Erik Wetterberg could be used to some extent for this purpose. However, this feature is not yet available (it will be available in the upcoming version MySQL 5.5), and from what I can tell, it can only load data - not restore tables or databases. I also believe that this feature does not (yet) provide any way to properly restore hex-dumped BLOB data, but I really should test it to know for sure.

Anyway.

In between sessions of the past MySQL users conference I cobbled up an XSLT stylesheet that can convert mysqldump's XML output back to SQL script output. It is available under the LGPL license, and it is hosted on google code as the mysqldump-x-restore project. To get started, you need to download the mysqldump-xml-to-sql.xslt XSLT stylesheet. You also need a command line XSLT processor, like xsltproc. This utility is part of the Gnome libxslt project, and is included in packages for most linux distributions. There is a windows port available for which you can download the binaries.

Assuming that xsltproc is in your path, and the XML dump and the mysqldump-xml-to-sql.xslt are in the current working directory, you can use this command to convert the XML dump to SQL:


xsltproc mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql

On Unix-based systems you should be able to directly pipline the SQL into mysql using


mysql -uroot -pmysql < `xsltproc mysqldump-xml-to-sql.xslt sakila.xml`

The stylesheet comes with a number of options, which can be set through xsltproc's --stringparam option. For example, setting the schema parameter to N will result in an SQL script that only contains DML statements:


xsltproc --stringparam schema N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql

Setting the data option to N will result in an SQL script that only contains DDL statements:


xsltproc --stringparam data N mysqldump-xml-to-sql.xslt sakila.xml > sakila.sql

. There are additional options to control how often a COMMIT should be issued, whether to add DROP statements, whether to generate single row INSERT statements, and to set the max_allowed_packet size.

What's next?

Nothing much really. I don't really recommend people to use mysqldump's XML output. I wrote mysqldump-x-restore for those people that inherited a bunch of XML formatted dumps, and don't know what to do with them. I haven't thouroughly tested it - please file a bug if you find one. If you actually think it's useful and you want more features, please let me know, and I'll look into it. I don't have much use for this myself, so if you have great ideas to move this forward, I'll let you have commit access.

That is all.

Tuesday, July 14, 2009

open-msp-viewer: Free XSLT utilities to render MS Project files as HTML web pages

IMPORTANT NOTICE

The old xslt-based open-msp-viewer project is no longer updated. Please find a new and much improved pure HTML version here at github: https://github.com/rpbouman/open-msp-viewer. Thank you for your interest.

Original post below

For my day job, I've been working on a few things that allow you to render Microsoft Project 2003 projects on a web page.

The code I wrote for my work is proprietary, and probably not directly useful for most people. But I figured that at least some of the work might be useful for others, so I wrote an open source version from scratch and I published that as the open-msp-viewer project on google code. If you like, check out the code and give it a spin.

It works by first saving the project in the MS Project XML format using standard MS Project functionality (Menu \ Save As..., then pick .XML) and then applying an XSLT transformation to generate HTML.

Currently, the project includes an xslt stylesheet that renders MS Project XML files as a Gantt chart. To give you a quick idea, Take a look at these screenshots:

and

The web gantt chart is rendered in a HTML 4.01 variant, CSS 2.1 and uses javascript to allow the user to collapse and/or expand individual tasks in the work breakdown structure. Currently, the HTML does not validate due to a few custom attributes I introduced to support dynamic collapsing/expanding the chart with javascript. In addition, the xslt transform process introduces the msp namespace into the result document, which results in a validation error

You can either associate the xslt stylesheet directly with the MS Project XML file, or you can use an external tool like xsltproc.

In the trunk/xml subdirectory, you can find a couple of sample projects in xml format that already have the stylesheet association. I have tested these in IE8, Chrome 2, Safari 4 and Firefox 3.5, and it works well in all these browsers. In the trunk/html directory, you'll find HTML output as created by xsltproc.

In the future, more xslt stylesheets may be added to support alternative views. Things that I think I will add soon are a resources list and a calendar view.

Enjoy, and let me know if you find a bug or would like to contribute.

Sunday, August 12, 2007

Transforming Japanese and Chinese text from utf8 encoded XML to ASCII Rich text (rtf)

This is one of those things that seemed extremely difficult and cumbersome, but turned out to be trivially simple, and relatively painless. Here is the case:

Using PHP to generate XML documents from a MySQL Database

I am working with a MySQL database storing lots human readable text. The text takes the form of semi-structured, well-formed XML. This particular XML-format is based on a very small set of elements borrowed from HTML and is used mostly for semantic markup of the text.

The XML fragments are fairly small, and not represent an end product. They are part of a larger unit: ultimately, they are a part of some document. The final document is not stored as a whole, because it is conceived and maintained on the level of its parts: all these individual XML fragments logically form distinct units.

Rather, complete documents are generated from the MySQL database using a PHP script. The PHP script sends SQL queries to the database to collect all XML fragments that belong in one document. This results in a well-formed XML document consisting of many XML fragments that belong together.

(The details of the actual technique used for the actual generation of the XML documents are not discussed any further in this article. Is suffices to say that this can be done either in PHP, or in SQL or a combination of both.)

Natural language text translation and Unicode

There is a variable number of documents, and one particular document can appear in many translations. So, the actual character data comes in a variety of natural languages, such as English, German, Chinese and Japanese. For this reason, the database stores the XML in unicode.

The MySQL column(s) that store this data are defined like this:


    xml MEDIUMTEXT
        CHARACTER SET utf8
        COLLATE utf8_unicode_ci

XSLT converts XML to various formats

The XML documents that are generated from the database do not represent an end product. Rather, the XML is further processed to yield a variety of output documents. Most of these output documents contain the majority of the data in the originally generated XML document.

However, each particular type of output document serves a different purpose, such as delivery to a customer, internal text review, online web presentation etc. In most cases, a particular output document requires a distinct format that best serves its purpose. So, the same information is required in a variety of presentations, such as plain text, HTML and rtf.

XSLT is used to transform the XML documents generated from the database into the desired target format(s). XSLT is a powerful special purpose language that is especially designed to transform XML input to plain text, HTML or XML output.

XSLT transformations are themselves denoted using XML. The transformation can be defined using a wide variety of constructs. XSLT uses XPath to define patterns that are matched in the input document. These patterns are associated with templates to generate output.

XSLT Templates can emit any structural XML construct, and in addition cause particular pattern matching rules to be applied, or perform traditional procedural processing on the input document (such as iteration, or explicitly call another template to be applied.)

Why an intermediate step?

Although it is possible to directly generate the desired target documents from the database, an intermediate step through XML has a number of advantages.

Single interface to extract data

With the intermediate XML step, there is just one piece of software that interacts with the database to extract data. This makes the solution more resilient to changes in the database schema. In many cases, a change in schema only requires the XML export functionality to be adjusted - the XSLT transformations can remain unchanged as long as equivalent input XML is generated. Instead of writing (and maintaining!) the-same-but-slightly-different queries, JOINs, ORDER BY's, GROUP BY's over and over, only the XML export script needs to be adjusted.

Translations of human readable text

There is another reason that justifies the intermediate XML format. The text stored in the database is conceived in one language -English- first, and then translated by human translators. Obviously, the translators need the original text first so they can translate it into the equivalent in their respective language. When the translators are done, it must be imported back again into the database.

This process is currently implemented by sending the XML document directly to the translators. The XML format is quite simple and intuitive, and translators are asked to replace the stretches of English text with the corresponding translated text. This is done in a straightforward manner, by directly editing the XML.

Importing a translation is, again, a matter of applying an XSLT transformation, this time from XML to SQL.

(The process is actually a little more complicated because there is also some checking involved to ensure that the structure received from the translators matches the expected structure. In this case, every structural element in the untranslated document must also be present in the translated document. Also, the translated document must not introduce any new structural elements.)

Rich Text Format

For a number of purposes, the application needs to generate output documents in the Rich Text Format format.

Rich Text Format or RTF is a proprietary data format for representing text and (simple) markup. It is owned by Microsoft

RTF is ASCII encoded: all characters in RTF documents are 7-bit ASCII characters. 8-bit characters and also unicode characters are supported in modern versions using special escape sequences. A RTF unicode character is denoted like this:


    \u<decimal character code><ASCII character>

So, \u initiates a unicode escape sequence telling the RTF agent to expect an integer that specifies a character in the unicode character set. Immediately after that, there must be a normal ASCII character that can be used by RTF agents that do not implement the substitution of the unicode escape sequence with the respective unicode character.

For example, the text


日本語

can be written like this


\u26085?\u26412?\u35486?

in RTF. Unicode aware RTF agents should print the Japanese text 日本語. RTF clients that do not understand or implement the unicode escape will use the question mark instead, because that is placed immediately after each escape sequence:

???

Generating RTF from a unicode source with XSLT

As described earlier, XSLT is used to generate RTF documents (among others). So how do we ensure that the unicode characters in the input documents are converted to ASCII characters? How do we make sure the unicode characters that do not fit in ASCII are properly escaped using the RTF unicode escape syntax?

This is something that looked like a very hard problem to solve quickly and cleanly. Luckily, after only little experimentation, it turned out to be very, very easy, and if required almost no extra code. The solution consists of two steps. One step involves a very minor but extremely important modification at the XSLT level. The other step involves some processing of the transformation result.

Choosing an ASCII output encoding with <xsl:output>

XSLT transformations can optionally contain the <xsl:output> element to control various aspects of the generated output. I used <xsl:output> many times before to control whether the output of the transformation should be done using xml, plain text or HTML. The <xsl:output> element can also be used to specify the character encoding of the tranformation result.

Since RTF is essentially an ASCII format, this is the encoding we need to specify:


<xsl:output
    method="text"
    encoding="ascii"
/>

This ensures that only ASCII characters result from the transformation process, even if the input document is in, say an utf8 encoded text (as is the case here). The first 127 unicode characters map to the equivalent ASCII character. Characters with a character code that exceeds 127 are output using XML numeric character references. So for example,


日本語

becomes:


&#26085;&#26412;&#35486;

Looks familiar? Thought so ;)

Replacing XML entity references with RTF unicode escapes using PHP regular expressions

So after letting XSLT do the dirty work of performing the character code conversions and escaping them to entity references, we still need one final step to make RTF unicode escape sequences of these entity references. As we are in a PHP environment, this is a perfect job for the Pearl compatible regular expression extension:


$rtf = preg_replace(
    '/&#([0-9]+);/'
    '\u${1}?'
,   $xslt_result
);

And that is all there is to it. Maybe a tad ugly, but it works ;)

Roland Bouman's blog