Comments on Roland Bouman's blog: Cleaning webpages with Pentaho Data Integration and JTidy
Feed author: rpbouman. Last updated: 2024-03-05T11:16:00.846+01:00. 15 comments.

Date: 2015-11-05T09:49:54.265+01:00
Your blog is absolutely fantastic. Good work.
Regards.
Author: Geelong cleaners

Date: 2014-06-25T20:28:26.228+02:00
Danilo, Kike,

Sorry guys, I don't have time to look into this in detail.

Please debug my code for the UDJC (User Defined Java Class) step, or even better: please report your problem with the HTML to XML step in its respective GitHub repository. That way you have a chance to have that fixed not just for you but for the rest of the world as well.

Thank you in advance.
Author: rpbouman

Date: 2014-06-25T20:25:05.991+02:00
Hello Roland,

I'm having the same problem as Danilo.

I either get an error or get a blank result.

I am using the HTTP client step to read an HTML page. Then I send the result to the User Defined Java Class step, which contains your code. When I execute it, the result is blank for any value of R other than 4. With 4 I get a NullPointerException (just like Danilo).

I read your reply to Danilo. The problem is that the pdi-html-to-xml-plugin doesn't work either (at least for me). It just outputs an XML header, which is useless to me since I am looking for well-structured XML.

Any help will be greatly appreciated.
Author: Kike

Date: 2014-04-04T23:10:50.047+02:00
Hi Danilo,

Yes, I think I used the load file content into memory input step. But for the HTML tidying, you can now install a plugin from the marketplace. I would recommend you try that.
Author: rpbouman

Date: 2014-04-04T10:45:09.862+02:00
Hello Roland,
I'm using the "get file content in memory" input to load the HTML and I connect it to "user defined java class", where I pasted your code. If I run the transformation I get a NullPointerException; if I replace r[4] with r[0] I get a blank page. I think "get file content in memory" puts the whole file in a single row. How did you load the HTML into memory?
Author: Danilo

Date: 2014-03-15T06:10:42.661+01:00
Probably some problem with the fields you specified in the step.

Not to worry - you can get an html-to-xml plugin nowadays which offers exactly this functionality. It's available through the marketplace, or here: https://github.com/mattyb149/pdi-html-to-xml-plugin
Author: rpbouman

Date: 2014-03-15T05:28:29.287+01:00
I am using JTidy with the snippet but I get this error all the time in the Pentaho 5.0.1 stable version.
Any advice on how to overcome this would be truly appreciated!!

2014/03/14 23:57:47 - Version checker - OK
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - ERROR (version 5.0.1-stable, build 1 from 2013-11-15_16-08-58 by buildguy) : Unable to get fields from previous steps because of an error
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - ERROR (version 5.0.1-stable, build 1 from 2013-11-15_16-08-58 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - Error initializing UserDefinedJavaClass to get fields:
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - Line 3, Column 1: Non-abstract class "Processor" must implement method "boolean org.pentaho.di.trans.steps.userdefinedjavaclass.TransformClassBase.processRow(org.pentaho.di.trans.step.StepMetaInterface, org.pentaho.di.trans.step.StepDataInterface) throws org.pentaho.di.core.exception.KettleException"
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a -
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta.getFields(UserDefinedJavaClassMeta.java:431)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.trans.TransMeta.getThisStepFields(TransMeta.java:2200)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.trans.TransMeta.getThisStepFields(TransMeta.java:2157)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at org.pentaho.di.ui.trans.steps.userdefinedjavaclass.UserDefinedJavaClassDialog$9.run(UserDefinedJavaClassDialog.java:530)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - at java.lang.Thread.run(Unknown Source)
2014/03/15 00:02:52 - org.pentaho.di.trans.steps.userdefinedjavaclass.UserDefinedJavaClassMeta@3503fc0a - Caused by: org.codehaus.janino.CompileException: Line 3, Column 1: Non-abstract class "Processor" must implement method "boolean org.pentaho.di.trans.steps.userdefinedjavaclass.TransformClassBase.processRow(org.pentaho.di.trans.step.StepMetaInterface, org.pentaho.di.trans.step.StepDataInterface) throws org.pentaho.di.core.exception.KettleException"
2014/03/15 00:06:59 - Spoon - Transformation opened.
2014/03/15 00:06:59 - Spoon - Launching transformation [Player_Performance_Index]...
2014/03/15 00:06:59 - Spoon - Started the transformation execution.
2014/03/15 00:07:06 - Spoon - The transformation has finished!!
2014/03/15 00:14:10 - User Defined Java Class - PREVIEW - Dispatching started for transformation [User Defined Java Class - PREVIEW]
2014/03/15 00:14:11 - User Defined Java Class - PREVIEW - User Defined Java Class - PREVIEW
2014/03/15 00:14:11 - User Defined Java Class - PREVIEW - User Defined Java Class - PREVIEW
Author: Anonymous

Date: 2012-05-04T21:32:06.316+02:00
Thanks a lot for this! I was tearing my hair out fiddling with regular expressions on invalid HTML documents. JTidy prepped the content for the XML step perfectly. I am going to try that jsoup plugin soon too.
Great idea for a plugin!
Author: britebyte

Date: 2012-04-23T00:25:05.777+02:00
Roland,

The jsoup plugin is done:
https://github.com/gkfabs/Kettle-jsoup

Thanks
Author: Anonymous

Date: 2012-04-16T22:21:28.254+02:00
Hi @Fabien,

Great tip! I haven't used JSoup before, but it sounds like a powerful lib. If you get round to creating the plugin, please let the community know - I think many people will welcome it.

Kind regards, and keep up the good work -

Roland.
Author: rpbouman

Date: 2012-04-16T21:09:28.530+02:00
Hi Roland,

I first used JTidy to analyze web pages inside Kettle. But now I prefer to use JSoup, which lets me use a CSS selector to get the part of the page I need.

Maybe in the future I will develop a small plugin for jsoup inside Kettle.

Fabien Carrion
Author: Anonymous

Date: 2011-06-06T04:28:19.193+02:00
Very useful! Thanks!
Author: Felipe Torres

Date: 2011-05-31T20:25:53.523+02:00
Roland,
Thank you. That example worked (I always forget about the samples directory). And I was able to run it with the JTidy library as well.

Thank you for responding.

Sean
Author: Sean

Date: 2011-05-31T18:28:21.114+02:00
Hi Sean!

I can take some time to look into your issue later, but right now I'm a bit busy. Can you please check out the sample HTTP client transformation that ships with Kettle? You can find it at:

${KETTLE_HOME}/samples/transformations/HTTP Client - simple retrieval example.ktr

(If that one doesn't work either, then please post again so we can take a look.)
Author: rpbouman

Date: 2011-05-31T18:00:36.694+02:00
Hi Roland,
This is a very useful blog entry (well, the most useful was the last one that asked me to check out the other MySQL blog ;-)

I tried to use these instructions but the "http client" step does not return anything into the "result" field. So I'm stuck. I have your book but in hard copy only. So I can only look it up later.

Would you please post a sample "ktr" that we can download and try?

Regards,
Sean
Author: Sean