Wednesday, June 22, 2011

Working with namespaces and namespace prefixes in XSLT 1.0

To whom it may concern -

I'm currently developing an XSLT 1.0 stylesheet for the analysis of XML Schema documents. As part of this work, I developed a couple of templates for working with namespace names and prefixes, and I'd like to share them via this post. The code is not incredibly hard or advanced, but it gets the job done and it may save you some time if you need something similar.
  • get-local-name: Return the local name part of a given QName.
  • get-prefix: Return the prefix part of a given QName.
  • get-ns-name: Return the namespace name associated to the given prefix.
  • get-ns-prefix: Return a prefix that can be used to denote the given namespace name.
  • resolve-ns-identifier: Return the namespace name for a given QName prefix.
Before I discuss the code, I want to make a few remarks:
  1. This is all about XSLT 1.0 and its query language XPath 1.0. All these things can be solved much more conveniently in XPath 2.0, and hence in XSLT 2.0, because that builds on XPath 2.0 (and not XPath 1.0 like XSLT 1.0 does). A sketch of the XPath 2.0 equivalents follows right after these remarks.
  2. If you're planning to use this in a web browser, and you want to target Firefox, you're out of luck. Sorry. Firefox is a great browser, but unlike Chrome, Opera and even Internet Explorer, it doesn't care enough about XSLT to fix bug #94270, which has been living in their bug tracker since August 2001 (nope - I didn't mistype 2011, that's 2001, as in almost a decade ago).
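For reference, here's roughly what those XPath 2.0 equivalents look like. This is just a sketch, assuming a context element with a hypothetical @type attribute that holds a QName:

local-name-from-QName(resolve-QName(@type, .))  (: local name part :)
prefix-from-QName(resolve-QName(@type, .))      (: prefix part :)
namespace-uri-for-prefix('xs', .)               (: namespace name for the "xs" prefix :)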

get-local-name

Return the local name part of a given QName. This is functionally equivalent to the fn:local-name-from-QName function in XPath 2.0.
<!-- get the substring after the *last* colon (or the argument itself if it contains no colon) -->
<xsl:template name="get-local-name">
<xsl:param name="qname"/>
<xsl:choose>
<xsl:when test="contains($qname, ':')">
<xsl:call-template name="get-local-name">
<xsl:with-param name="qname" select="substring-after($qname, ':')"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$qname"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
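Here's a quick usage sketch. The xs:element match and the @type attribute are just illustrations (they assume a stylesheet that declares xmlns:xs="http://www.w3.org/2001/XMLSchema" and processes XML Schema documents), but they show how to call the template and capture its result in a variable:

<!-- hypothetical usage: get the local name of an xs:element's type attribute -->
<xsl:template match="xs:element[@type]">
  <xsl:variable name="type-local-name">
    <xsl:call-template name="get-local-name">
      <xsl:with-param name="qname" select="@type"/>
    </xsl:call-template>
  </xsl:variable>
  <!-- for type="xs:string", $type-local-name now holds "string" -->
  <xsl:value-of select="$type-local-name"/>
</xsl:template>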

get-prefix

Return the prefix part of a given QName. This is functionally equivalent to the fn:prefix-from-QName function in XPath 2.0.
<!-- get the substring before the *last* colon (or empty string if no colon) -->
<xsl:template name="get-prefix">
<xsl:param name="qname"/>
<xsl:param name="prefix" select="''"/>
<xsl:choose>
<xsl:when test="contains($qname, ':')">
<xsl:call-template name="get-prefix">
<xsl:with-param name="qname" select="substring-after($qname, ':')"/>
<xsl:with-param name="prefix" select="concat($prefix, substring-before($qname, ':'))"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$prefix"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
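The calling pattern is the same as before. A sketch, again assuming a context node with a hypothetical @type attribute holding a QName:

<!-- hypothetical usage: get the prefix of the @type QName -->
<xsl:variable name="type-prefix">
  <xsl:call-template name="get-prefix">
    <xsl:with-param name="qname" select="@type"/>
  </xsl:call-template>
</xsl:variable>
<!-- for type="xs:string", $type-prefix now holds "xs" -->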

get-ns-name

Return the namespace name associated with the given prefix. This is functionally equivalent to the fn:namespace-uri-for-prefix function in XPath 2.0. The main difference is that this template does the lookup against the namespace definitions that are in effect for the current context node, whereas the XPath 2.0 function allows the element that is to be used as the context to be passed in as an argument.
<!-- get the namespace uri for the namespace identified by the prefix in the parameter -->
<xsl:template name="get-ns-name">
<xsl:param name="ns-prefix"/>
<xsl:variable name="ns-node" select="namespace::node()[local-name()=$ns-prefix]"/>
<xsl:value-of select="$ns-node"/>
</xsl:template>
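Because the lookup uses the namespace axis of the current context node, you have to call the template while an element with the right in-scope namespaces is the context. A sketch (the "xs" prefix is just an assumption about the document being processed):

<!-- hypothetical usage: look up the namespace name bound to the "xs" prefix -->
<xsl:variable name="schema-ns">
  <xsl:call-template name="get-ns-name">
    <xsl:with-param name="ns-prefix" select="'xs'"/>
  </xsl:call-template>
</xsl:variable>
<!-- in an XML Schema document this typically yields "http://www.w3.org/2001/XMLSchema" -->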

get-ns-prefix

Return a prefix that can be used to denote the given namespace name. This template is complementary to the get-ns-name template. It assumes only one prefix is defined for each namespace. The namespace is resolved against the current context.
<!-- get the namespace prefix for the namespace name parameter -->
<xsl:template name="get-ns-prefix">
<xsl:param name="ns-name"/>
<xsl:variable name="ns-node" select="namespace::node()[.=$ns-name]"/>
<xsl:value-of select="local-name($ns-node)"/>
</xsl:template>
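A usage sketch, looking up a prefix for the XML Schema namespace (the variable name is mine):

<!-- hypothetical usage: find a prefix for the XML Schema namespace -->
<xsl:variable name="schema-prefix">
  <xsl:call-template name="get-ns-prefix">
    <xsl:with-param name="ns-name" select="'http://www.w3.org/2001/XMLSchema'"/>
  </xsl:call-template>
</xsl:variable>
<!-- yields e.g. "xs" if the context element has xmlns:xs="http://www.w3.org/2001/XMLSchema" in scope -->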

resolve-ns-identifier

Return the namespace name for a given QName prefix (be it a namespace prefix or a namespace name). This template is useful for generically obtaining a namespace name when feeding it the prefix part of a QName. If the prefix happens to be a namespace name, then that is returned as-is; if it happens to be a namespace prefix, then a lookup is performed to return the namespace name. This template also looks at the namespaces in effect for the current context.
<!-- return the namespace name -->
<xsl:template name="resolve-ns-identifier">
<xsl:param name="ns-identifier"/>
<xsl:choose>
<xsl:when test="namespace::node()[.=$ns-identifier]">
<xsl:value-of select="$ns-identifier"/>
</xsl:when>
<xsl:when test="namespace::node()[local-name()=$ns-identifier]">
<xsl:value-of select="namespace::node()[local-name()=$ns-identifier]"/>
</xsl:when>
<xsl:otherwise>
<xsl:message terminate="yes">
Error: "<xsl:value-of select="$ns-identifier"/>" is neither a valid namespace prefix nor a valid namespace name.
</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
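Putting the pieces together, here's a sketch of how get-prefix and resolve-ns-identifier combine to find the namespace name of a QName (the @type attribute and the variable names are again just illustrations):

<!-- hypothetical usage: resolve the namespace name of the @type QName -->
<xsl:variable name="type-prefix">
  <xsl:call-template name="get-prefix">
    <xsl:with-param name="qname" select="@type"/>
  </xsl:call-template>
</xsl:variable>
<xsl:variable name="type-ns-name">
  <xsl:call-template name="resolve-ns-identifier">
    <xsl:with-param name="ns-identifier" select="$type-prefix"/>
  </xsl:call-template>
</xsl:variable>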

Saturday, June 18, 2011

HPCC vs Hadoop at a glance

Update

Since this article was written, HPCC has undergone a number of significant changes and updates. These address some of the criticism voiced in this blog post, such as the license (updated from AGPL to Apache 2.0) and the integration with other tools. For more information, refer to the comments placed by Flavio Villanustre and Azana Baksh.

The original article can be read unaltered below:

Yesterday I noticed a tweet by Andrei Savu. This prompted me to read the related GigaOM article and then check out the HPCC Systems website.

If you're too lazy to read the article or visit that website:
HPCC (High Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems. The platform is now Open Source!


HPCC Systems compares itself to Hadoop, which I think is completely justified in terms of functionality. Its product originated as a homegrown solution of LexisNexis Risk Solutions, allowing its customers (banks, insurance companies, law enforcement and federal government) to quickly analyze billions of records, and as such it has been in use for a decade or so. It is now open sourced, and I already heard an announcement that Pentaho is its major Business Intelligence partner.

Based on the limited information available, I made a quick analysis, which I emailed to the HPCC Systems CTO, Armando Escalante. My friend Jos van Dongen said it was a good analysis and told me I should post it. Now, I don't really have time to make a nice blog post out of it, but I figured it can't hurt to just repeat what I said in my emails. So here goes:

Just going by the documentation, I see two real unique selling points in HPCC Systems as compared to Hadoop:

  • Real-time query performance (as opposed to only analytic jobs). HPCC offers two different setups, labelled Thor and Roxie. Functionality-wise, Thor should be compared to a Map/Reduce cluster like Hadoop: it's good for doing fairly long-running analyses on large volumes of data. Roxie is a different beast, designed to offer fast data access and supporting ad-hoc real-time queries.
  • Integrated toolset (as opposed to a hodgepodge of third-party tools). We're talking about an IDE, job monitoring, a code repository, a scheduler, a configuration manager, and whatnot. These really look like big productivity boosters, which may make Big Data processing a lot more accessible to companies that don't have the kind of development teams required to work with Hadoop.

(there may be many more benefits, but these are just the ones I could clearly distill from the press release and the website)

Especially for Business Intelligence, Roxie may be a big thing. If real-time Big Data queries can be integrated with Business Intelligence OLAP and reporting tools, that would be a major step forward. I can't disclose the details, but I have trustworthy information that integration with Pentaho's analysis engine, the Mondrian ROLAP engine, is underway and will be available as an Enterprise feature.

A few things that look different but which may not matter too much when looking at HPCC and Hadoop from a distance:
  • ECL, the "Enterprise Control Language", which is a declarative query language (as opposed to just Map/Reduce). This initially seems like a big difference, but Hadoop has tools like Pig, Sqoop and Hive. Now, it could be that ECL is vastly superior to these Hadoop tools, but my hunch is you'd have to be careful in how you position that. If you choose a head-on strategy in promoting ECL as opposed to Pig, then the chances are that people will just spend their energy on discovering the things that Pig can do and ECL cannot (not sure if those features actually exist, but that is what Hadoop fanboys will look for), and in addition, the Pig developers might simply clone the unique ECL features, making the leveling of that playing field just a matter of time. This does not mean you shouldn't promote ECL - on the contrary, if you feel it is a more productive language than Pig or any other Hadoop tool, then by all means let your customers and prospects know. Just be careful and avoid downplaying the Hadoop equivalents, because that strategy could backfire.

  • Windows support. It's really nice that HPCC Systems is available for Microsoft Windows; it makes adoption a lot easier for Microsoft shops (and there are a lot of them). That said, customers that really have a Big Data problem will solve it no matter what their internal software policies are. So they'd happily start running Hadoop on Linux if that solves their problems.
  • Maturity. On paper, HPCC looks more mature than Hadoop. It's hard to tell how much that matters though, because Hadoop has all the momentum. People might choose Hadoop because they anticipate that the maturity will come thanks to the sheer number of developers committing to that platform.


The only things I can think of where HPCC looks like it has a disadvantage as compared to Hadoop are adoption rate and licensing. I hope these will prove not to be significant hurdles for HPCC, but I think they might be bigger problems than they seem. Especially the AGPL licensing seems problematic to me.

The AGPL is not well regarded by anyone I know in the open source world. The general idea seems to be that, even more than the plain GPL v3, it restricts how the software may be used. If the goal of open sourcing HPCC is to gain mindshare and a developer community (something that Hadoop has done and is doing extremely well), then a more permissive license is really the way to go.

Look at products like MySQL but also Pentaho - they are both very strongly corporately-led products. They have a good number of users, but few contributions from outside the company, and this is probably due to a combination of GPL licensing and the additional requirement to hand over the copyright of any contributions to the company. Hence these products don't really benefit from an open source development model (or at least not as much as they could). For these companies, open source may help initially to gain a lot of users, but those are mostly users that just want a free ride: conversion rates to enterprise edition customers are quite low. It might be enough to make a decent buck, but eventually you'll hit a cap on how far you can grow. I'm not saying this is bad - you only need to grow as much as you have to - but it is something to be aware of.

Contrast this with Hadoop. It has a permissive Apache 2.0 license, and this results in many individuals but also companies contributing to the project. And there are still companies like Cloudera that manage to make a good living off of the services around their distribution of Hadoop. You don't lose the ability to develop add-ons with this model either - Apache 2.0 allows all that. The difference with the GPL (and AGPL), of course, is that it allows this to other users and companies as well. So the trick to staying on top in this model is to simply offer the best product (as opposed to being the sole holder of the copyright to the code).

Anyway - that is it for now - I hope this is helpful.
