Monday, May 30, 2011

Cleaning webpages with Pentaho Data Integration and JTidy

Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. Pentaho Data Integration (kettle) has lots of functionality on-board to make this an easy process, except one: it does not support reading data directly from HTML.

In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract its data using the conventional "Get data from XML" step. The solution hinges on two ingredients:

Standard Kettle tools for Webservices

Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in Pentaho Kettle Solutions). But when you're dealing with plain old HTML, things can get pretty hairy.

If you're lucky, the page may be in XHTML, and in that case it's worth trying the Get Data from XML step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like   entities, which are valid in XHTML, but not in plain XML. And of course, more often than not, you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.

Workaround: JavaScript string manipulation

In the past, I usually worked around these issues. In practice, some quick and dirty string manipulation using the Modified Javascript Value step and some built-in indexOf(), substring and replace() functions go a long way.

In most cases I don't really need the entire web page, but only a <table>, <ul> or <ol> element in the <body>. Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular id attribute, we can use a JavaScript snippet like this:

//table we're looking for
var startHandle = "<table class=\"lvw\" cellpadding=0 cellspacing=0>";
var startPosition= html.indexOf(startHandle);
//look beyond the start tag to lose the invalid unquoted attributes
startPosition += startHandle.length;

//find where this table ends (lucky us, no nested table elements :)
var endHandle = "</table>";
var endPosition = html.indexOf(endHandle, startPosition);

//make a complete table fragment out of it again
var table = "<table>" + html.substring(startPosition, endPosition + endHandle.length);

//replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes
table = table.replace(/&nbsp;|<img[^>]>|nowrap/ig, "");


There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.

A better way: using JTidy

While the JavaScript workaround may just work for your particular case, it certainly has disadvantages. Sometimes it may just be not so simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow.

Fortunately, there is a better way.

Using a user-defined Java Class step we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's Get Data from XML step.

We need to do two things to make this work: first, you have to download JTidy, unzip it, and place the jtidy-r938.jar in the libext directory, which resides immediately in your kettle installation directory. (note that if you were running spoon, you need to restart it before it will be picked up). Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the Tidy class inside the jar. With some help from the pentaho wiki and the JTidy JavaDoc documentation, I came up with the following Java snippet to make it work:

import org.w3c.tidy.Tidy;
import java.io.StringReader;
import java.io.StringWriter;

protected Tidy tidy;

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
//create and configure a Tidy instance
tidy = new Tidy();
tidy.setXmlOut(true);
return parent.initImpl(stepMetaInterface, stepDataInterface);
}

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
Object[] r;
//Get row from incoming stream.
//Bail out if its not there.
if ((r = getRow()) == null) {
setOutputDone();
return false;
}

//read the value of the html input field
//the html field happens to be the 5th field in the stream,
//because java arrays start at 0, we use index 4 to reference it
StringReader html = new StringReader((String)r[4]);

//use tidy to parse html to xml
StringWriter xml = new StringWriter();
tidy.parse(html, xml);

//assign the xml to the output row
//note we simply overwrite the original html field from the input row.
r[4] = xml.toString();

//push the output row to the outgoing stream.
putRow(data.outputRowMeta, r);
return true;
}

(Tip: for more examples and background information on the user-defined java class step, check out the excellent blog posts by Slawomir Chodnicki, Matt Casters and the video walk-through by Dein Einspanjer)

The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.

15 comments:

Sean said...

Hi Roland,
This is very useful blog entry (well, the most useful was the last one that asked me to check out the other mySQL blog ;-)

I tried to use these instructions but the "http client" step does not return anything into the "result" field. So I'm stuck. I have your book but in hard copy only. So I can only look it up later.

Would you please post a sample "ktr" that we can download and try?

Regards,
Sean

rpbouman said...

Hi Sean!

I can take some time to look into your issue later, but right now i'm a bit busy. Can you please check out the sample HTTP client transformation that ships with kettle? You can find that at:

${KETTLE_HOME}/samples/transformations/HTTP Client - simple retrieval example.ktr

(if that one doesn't work either, then please post again so we can take a look)

Sean said...

Roland,
Thank you. That example worked (I always forget about the samples directory). And I was able to run with jtidy library as well.

Thank you for responding.

Sean

Felipe Torres said...

Very usefull! Thanks!

Anonymous said...

Hi Roland,

I first used JTidy to analyze web page inside Kettle. But now I prefer to use JSoup, which let's me use a css selector to get the part of the page I need.

Maybe in the future I would develop a small plugin for jsoup inside Kettle.

Fabien Carrion

rpbouman said...

Hi @Fabien,

great tip! I haven't used JSoup before, but it sounds like a powerful lib. If you get round to creating the plugin, please let the community know - I think many people will welcome it.

kind regards, and keep up the good work -

Roland.

Anonymous said...

Roland,

The jsoup plugin is done:
https://github.com/gkfabs/Kettle-jsoup

Thanks

britebyte said...

Thanks a lot for this! I was tearing up my hair fiddling with Regular expressions on invalid html documents. Jtidy prepped the content for the XML step perfectly. I am going to try that jsoup plugin soon too. Great idea for a plugin!

Anonymous said...

I am using jtidy with the snippet but I get this error all the time in pentaho 5.0.1 stable version. Any advice on how to overcome this would be truly appreciated!!

2014/03/14 23:57:47 - Version checker - OK
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - ERROR (version 5.0.1-stable, build 1 from 2013-11-15_16-08-58 by buildguy) : Unable to get fields from previous steps because of an error
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - ERROR (version 5.0.1-stable, build 1 from 2013-11-15_16-08-58 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - Error initializing UserDefinedJavaClass to get fields:
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - Line 3, Column 1: Non-abstract class "Processor" must implement method "boolean org.pentaho.di.trans.steps.userdefinedjavaclass.TransformClassBase.processRow(org.pentaho.di.trans.step.StepMetaInterface, org.pentaho.di.trans.step.StepDataInterface) throws org.pentaho.di.core.exception.KettleException"
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a -
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta.getFields(UserDefinedJavaClassMeta.java:431)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.trans.TransMeta.getThisStepFields(TransMeta.java:2200)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.trans.TransMeta.getThisStepFields(TransMeta.java:2157)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.ui.trans.steps.userdefinedjavaclass.UserDefinedJavaClassDialog$9.run(UserDefinedJavaClassDialog.java:530)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at java.lang.Thread.run(Unknown Source)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - Caused by: org.codehaus.janino.CompileException: Line 3, Column 1: Non-abstract class "Processor" must implement method "boolean org.pentaho.di.trans.steps.userdefinedjavaclass.TransformClassBase.processRow(org.pentaho.di.trans.step.StepMetaInterface, org.pentaho.di.trans.step.StepDataInterface) throws org.pentaho.di.core.exception.KettleException"
2014/03/15 00:06:59 - Spoon - Transformation opened.
2014/03/15 00:06:59 - Spoon - Launching transformation [Player_Performance_Index]...
2014/03/15 00:06:59 - Spoon - Started the transformation execution.
2014/03/15 00:07:06 - Spoon - The transformation has finished!!
2014/03/15 00:14:10 - User Defined Java Class - PREVIEW - Dispatching started for transformation [User Defined Java Class - PREVIEW]
2014/03/15 00:14:11 - User Defined Java Class - PREVIEW - User Defined Java Class - PREVIEW
2014/03/15 00:14:11 - User Defined Java Class - PREVIEW - User Defined Java Class - PREVIEW

rpbouman said...

Probably some problem with the fields you specified in the step.

Not to worry - you can get a htmlt-xml plugin nowadays which offers exactly this functionality. It's available through the marketplace, or here: https://github.com/mattyb149/pdi-html-to-xml-plugin

Danilo said...

Hello Roland,
i'm using the input "get file content in memory" to load the html and i connect it with "user defined java class" where i pasted your code. If i run the transformation i get nullpointerexception, if i modify r[4] with r[0] i get a blank page. I think get file content in memory get all the file in a single row. How did you load html in memory?

rpbouman said...

Hi Danilo,

yes I think I used the load file content into memory input step. But for the html tidying, you can now install a plugin from the marketplace. I would recommend you try that.

Kike said...

Hello Roland,

I'm having the same problem as Danilo.

I either get a error or get blank.

I am using the HTTP client step to read an html. Then I send the result to the step user defined java class which contains your code. When I execute it, the result is blank with any value for R different than 4. With 4 I get the error NullPointerException (just like Danilo).

I read your reply for Danilo. The problem is that the pdi-html-to-xml-plugin doesn't work either (at least for me). The pdi-html-to-xml-plugin just puts an XML header which is useless for me since I am looking for a well-structured XML.

Any help will be greatly appreciated.

rpbouman said...

Danilo, Kike,

sorry guys I don't have time to look into this in detail.

Please debug my code for the judc step, or even better: please report your problem with the html to xml step in its respective github repository. That way you have a chance to have that fixed not just for you but for the rest of the world as well.

Thank you in advance.

Geelong cleaners said...

Your blog is absolutely fantastic. Good work.Regards.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...