In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract its data using the conventional "Get data from XML" step. The solution hinges on two ingredients:
- JTtidy, the Java port of the popuar HTML Tidy program, which is an open source HTML cleaning program originally created by Dave Ragget
- The user-defined Java Class step, which lets us write a small snippet of Java that we need to interface with JTidy
Standard Kettle tools for Webservices
Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in Pentaho Kettle Solutions). But when you're dealing with plain old HTML, things can get pretty hairy.If you're lucky, the page may be in XHTML, and in that case it's worth trying the Get Data from XML step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like
entities, which are valid in XHTML, but not in plain XML. And of course, more often than not, you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.Workaround: JavaScript string manipulation
In the past, I usually worked around these issues. In practice, some quick and dirty string manipulation using the Modified Javascript Value step and some built-inindexOf()
, substring
and replace()
functions go a long way. In most cases I don't really need the entire web page, but only a
<table>
, <ul>
or <ol>
element in the <body>
. Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular id
attribute, we can use a JavaScript snippet like this:
//table we're looking for
var startHandle = "<table class=\"lvw\" cellpadding=0 cellspacing=0>";
var startPosition= html.indexOf(startHandle);
//look beyond the start tag to lose the invalid unquoted attributes
startPosition += startHandle.length;
//find where this table ends (lucky us, no nested table elements :)
var endHandle = "</table>";
var endPosition = html.indexOf(endHandle, startPosition);
//make a complete table fragment out of it again
var table = "<table>" + html.substring(startPosition, endPosition + endHandle.length);
//replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes
table = table.replace(/ |<img[^>]>|nowrap/ig, "");
There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.
A better way: using JTidy
While the JavaScript workaround may just work for your particular case, it certainly has disadvantages. Sometimes it may just be not so simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow.Fortunately, there is a better way.
Using a user-defined Java Class step we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's Get Data from XML step.
We need to do two things to make this work: first, you have to download JTidy, unzip it, and place the
jtidy-r938.jar
in the libext
directory, which resides immediately in your kettle installation directory. (note that if you were running spoon, you need to restart it before it will be picked up). Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the Tidy
class inside the jar. With some help from the pentaho wiki and the JTidy JavaDoc documentation, I came up with the following Java snippet to make it work:
import org.w3c.tidy.Tidy;
import java.io.StringReader;
import java.io.StringWriter;
protected Tidy tidy;
public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
//create and configure a Tidy instance
tidy = new Tidy();
tidy.setXmlOut(true);
return parent.initImpl(stepMetaInterface, stepDataInterface);
}
public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
Object[] r;
//Get row from incoming stream.
//Bail out if its not there.
if ((r = getRow()) == null) {
setOutputDone();
return false;
}
//read the value of the html input field
//the html field happens to be the 5th field in the stream,
//because java arrays start at 0, we use index 4 to reference it
StringReader html = new StringReader((String)r[4]);
//use tidy to parse html to xml
StringWriter xml = new StringWriter();
tidy.parse(html, xml);
//assign the xml to the output row
//note we simply overwrite the original html field from the input row.
r[4] = xml.toString();
//push the output row to the outgoing stream.
putRow(data.outputRowMeta, r);
return true;
}
(Tip: for more examples and background information on the user-defined java class step, check out the excellent blog posts by Slawomir Chodnicki, Matt Casters and the video walk-through by Dein Einspanjer)
The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.