I just ran into a problem with Pentaho Data Integration, and I figured it may save others some time if I document it here.
The case is very straightforward: read a fairly small XML document directly from a URL, and parse out interesting data using the Get data from XML step.
Typically, this step works quite well for me. But I just ran into a case where it doesn't work quite as expected. I got an error when I tried it on the URL http://api.worldbank.org/en/countries?page=1, which returns this document:
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="260" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN">Latin America &amp; Caribbean (all income levels)</wb:region>
    <wb:adminregion id="" />
    <wb:incomeLevel id="NOC">High income: nonOECD</wb:incomeLevel>
    <wb:lendingType id="LNX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  ...
</wb:countries>
The error: Content is not allowed in prolog
The error I got was:
Content is not allowed in prolog.
You can encounter this error in any context where the step tries to retrieve the document from the URL, for example when you hit the "Get XPath nodes" or "Preview" buttons, as well as when you're actually running the step.
Using the w3c XML validator
The error message indicates that the XML document is in some way not valid. So I ran the URL through the w3c validator:
http://validator.w3.org/check?uri=http%3A%2F%2Fapi.worldbank.org%2Fen%2Fcountries%3Fpage%3D1&charset=%28detect+automatically%29&doctype=Inline&group=0
Interestingly, this indicated that the document is valid XML.
A rather dismal workaround
Then I tried a few things in Kettle in an attempt to work around it. I won't bother you with everything I tried. Eventually, I did find a viable workaround: by retrieving the document with the HTTP Client step, and then saving it to file using a simple Text file output step (omitting the header, separators, and quotes), I could successfully open and parse that file with the "Get data from XML" step (from within a second transformation). This was of course a bit annoying, since it involved a second transformation, which complicates things considerably. However, every attempt to skip the "Text file output" step brought me back to where I was and gave me the dreaded "Content is not allowed in prolog." error. So something was happening to the document between saving it to disk and loading it again that somehow fixed it.
Mind w3c validator Warnings!
I decided to investigate a bit more. What I didn't notice at first when I validated the XML document is that, despite passing validation, it does yield 2 warnings:
- No DOCTYPE found! Checking XML syntax only.
- Byte-Order Mark found in UTF-8 File.
UTF-8, the BOM, and java don't play nice together
I knew what a BOM was, but I didn't quite understand its implications, in particular for Java and Java-based applications. Here's a quick list of things you need to know to understand the problem:
- The byte-order mark (BOM) is a special Unicode character that indicates several details of the encoding of an input stream.
- The BOM is optional, and for UTF-8 it is actually not recommended. But, apparently, this does not mean it's never there, or even non-standard!
- The particular combination of a BOM occurring in a UTF-8 stream is not supported by Java. There are bug reports about it here and here.
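To make the points in this list a bit more tangible, here is a minimal Java sketch. It is not taken from Kettle; the class name BomDemo and the sample document are made up purely for illustration. It decodes a UTF-8 byte sequence that starts with a BOM and shows that the BOM survives decoding as the character U+FEFF, sitting in front of the XML declaration. That stray character is the kind of leading content an XML parser working on the decoded text can trip over.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // The UTF-8 encoding of the byte-order mark U+FEFF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Prepends the UTF-8 BOM to the UTF-8 bytes of the given text.
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    // Decodes bytes as UTF-8; Java's decoder keeps a leading BOM as U+FEFF.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample document, standing in for the real response.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><doc/>";
        String decoded = decode(withBom(xml));
        System.out.println(decoded.startsWith("<"));      // false
        System.out.println(decoded.charAt(0) == '\uFEFF'); // true
    }
}
```

The decoded text does not start with the less than character, even though the document looks perfectly normal when displayed.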
A better workaround
We can first retrieve the document with the "HTTP Client" step, then remove the BOM if it is present, and then process the document using the Get data from XML step. The transformation below illustrates that setup:
So, the "HTTP Client" step retrieves the XML text in the document field, and the User-defined Java Expression step simply finds the first occurrence of the less than character (<), which demarcates either the start of the XML declaration or the document element. The code for that expression is rather straightforward:
document.substring(document.indexOf("<"))
All in all, not very pretty, but it does the job. I hope this was of some use to you.
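One caveat with the expression above: it assumes the document always contains a less than character. If it does not, indexOf returns -1 and substring throws an exception. As a sketch of a slightly more defensive variant, one could remove only a leading U+FEFF character and leave everything else untouched. The class BomStripper and the method stripBom are hypothetical names for illustration, not part of PDI:

```java
class BomStripper {
    // Returns the text without a leading byte-order mark (U+FEFF), if present.
    // Null and empty input are returned unchanged.
    static String stripBom(String document) {
        if (document != null && !document.isEmpty() && document.charAt(0) == '\uFEFF') {
            return document.substring(1);
        }
        return document;
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF<doc/>";
        System.out.println(stripBom(withBom));   // <doc/>
        System.out.println(stripBom("<doc/>"));  // <doc/>
    }
}
```

Assuming the field is still called document, the same check should also fit into a single conditional expression in the User-defined Java Expression step, but I have not verified that variant in PDI.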
UPDATE1: I created PDI-12410 pertaining to this case.
UPDATE2: Apart from the BOM, there seems to be a separate, independent problem when the XML is acquired from a URL and the server uses gzip compression.
UPDATE3: I have a commit here that solves both the BOM and the gzip issues: https://github.com/rpbouman/pentaho-kettle/commit/6cf28b5e4e88022dbf356ccad01c3b949bed4731.