
Wednesday, March 25, 2015

A Generic Normalizer for Pentaho Data Integration - Revisited

A while ago, I wrote about how to create a generic normalizer for Pentaho Data Integration.

To freshen up your memory, the generic normalizer takes any input stream, and for each input row, it outputs one row for each field in the input stream. The output rows contain fields for input row number, input field number and input field value. As such it provides the same functionality as the built-in Row Normaliser step without requiring any configuration, thus allowing it to process arbitrary streams of data.
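For example, an input row with two fields, id=1 and name="Smith" (hypothetical data), would yield two output rows: one with rownum=1, fieldnum=1, value="1", and one with rownum=1, fieldnum=2, value="Smith".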

A reusable normalizer

Recently I received a comment asking for more information on how to make this normalizer more reusable:
I want to use this method to do field level auditing but I want to encapsulate it in a sub transformation to which I can pass the result rows from any step. In your image of "how all these steps work together", instead of a data grid, the input would need to be dynamic in terms of the number of fields/columns and the data types. Could you possibly provide a hint how to make the input to these steps (in your example, the datagrid) dynamic?
In the meantime, I have learned a thing or two about Kettle's internals, and it seemed like a good idea to describe how to improve on the original example and make it suitable to be used in a so-called Mapping, a.k.a. a subtransformation.

Design for use as Subtransformation

The design for the re-usable generic normalizer is shown below:
The User-defined Java class step in the middle actually implements the normalizer. The Mapping input and Mapping output specification steps allow the normalizer to be called from another transformation. They enable it to receive input data from, and return output data to, the calling transformation.

In the screenshot above, the configuration dialogs for both the Mapping input and output specification steps are shown. This is mainly to show that there is no configuration involved: the Mapping input specification step will faithfully pass all fields received from the incoming stream on to the normalizer, and the Mapping output specification will output all fields coming out of the normalizer to the outgoing stream.

Normalizer Improvements

The configuration of the user-defined Java class step differs in a number of aspects from what I used in the original normalizer example. In the original example the normalizer output consisted of three fields:
rownum
A sequential integer number identifying the position of the row to which the current output row applies.
fieldnum
A sequential integer number identifying the position of the field to which the current output row applies.
value
A string representation of the value to which the current output row applies.
The original example used a Metadata Structure of Stream step to obtain metadata of the input stream, and this metadata was then tied to the output of the normalizer with a Stream Lookup step, "joining" the output of the Metadata Structure step to the output of the normalizer on the field number.
The improved generic normalizer adds two more output fields:
fieldname
The name of the field as it appears in the input stream.
fieldtype
The name of the data type of the field as it appears in the input stream.
Arguably, these two items are the most important pieces of metadata that were previously provided by the Metadata Structure of Stream step in the original example, and I felt many people would probably prefer to have all of that work done by the normalizer itself rather than having to tie the pieces together with additional steps in the transformation.

Code

The code for the user-defined Java class is shown below:
static long rownum = 0;
static RowMetaInterface inputRowMeta;
static long numFields;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  // get the current row
  Object[] r = getRow();

  // If the row object is null, we are done processing.
  if (r == null) {
    setOutputDone();
    return false;
  }

  // If this is the first row, cache some metadata.
  // We will reuse this metadata in processing the rest of the rows.
  if (first) {
    first = false;
    inputRowMeta = getInputRowMeta();
    numFields = inputRowMeta.size();
  }

  // Generate a new id number for the current row.
  rownum += 1;

  // Generate one output row for each field in the input stream.
  int fieldnum;
  ValueMetaInterface valueMetaInterface;
  for (fieldnum = 0; fieldnum < numFields; fieldnum++) {
    // get metadata for the current field
    valueMetaInterface = inputRowMeta.getValueMeta(fieldnum);
    Object[] outputRow = new Object[5];
    outputRow[0] = rownum;
    // Assign the field id. Note that we need to cast to long to match Kettle's type system.
    outputRow[1] = (long)fieldnum + 1;
    // assign the data type name
    outputRow[2] = valueMetaInterface.getTypeDesc();
    // assign the field name
    outputRow[3] = valueMetaInterface.getName();
    // assign a string representation of the field value
    outputRow[4] = inputRowMeta.getString(r, fieldnum);
    // emit a row.
    putRow(data.outputRowMeta, outputRow);
  }

  return true;
}
The main difference from the original code is the addition of the two new output fields, fieldname and fieldtype. In order to obtain the values for these fields, the loop over the fields first obtains the ValueMetaInterface object for the current field. This is done by calling the getValueMeta() method of the RowMetaInterface object, passing the index of the desired field.
Using the ValueMetaInterface object, the field name is obtained using its getName() method. The data type name is obtained by calling its getTypeDesc() method.
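Note that the five output fields must also be declared on the Fields tab of the User-defined Java class step (rownum and fieldnum as Integer; fieldtype, fieldname and value as String), in that order: Kettle derives data.outputRowMeta from that configuration, and the indices used to fill the outputRow array in the code must match it.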

Calling the normalizer as subtransformation

Using the improved normalizer is as simple as adding a Mapping step to your transformation and pointing it to the transformation that contains the normalizer and the Mapping input and output specification steps:

Download samples

The transformations discussed in this post are available here. These transformations are in the public domain: you can use, copy, redistribute and modify them as you see fit. You are encouraged, but not obliged, to share any modifications that you make to these examples.

Monday, May 30, 2011

Cleaning webpages with Pentaho Data Integration and JTidy

Here's an issue I've come across multiple times: I need to scrape HTML websites to extract data. Pentaho Data Integration (Kettle) has lots of functionality on board to make this an easy process, except one: it does not support reading data directly from HTML.

In this short post, I provide a simple tip to clean HTML pages and convert them to XML so you can extract their data using the conventional "Get data from XML" step. The solution hinges on two ingredients:

Standard Kettle tools for Webservices

Kettle is really good at fetching data from the web and extracting data from webservices, be they in a SOAP/XML, REST/JSON or RSS flavor. (There is an extensive chapter on this subject in Pentaho Kettle Solutions). But when you're dealing with plain old HTML, things can get pretty hairy.

If you're lucky, the page may be in XHTML, and in that case it's worth trying the Get Data from XML step. However, quite often a webpage that claims to be XHTML is not well-formed XML, and even if it is, Kettle does not understand things like the &nbsp; entity, which is valid in XHTML but not in plain XML (XML itself predefines only the &lt;, &gt;, &amp;, &apos; and &quot; entities). And of course, more often than not you're not lucky, and XHTML represents only a minor fraction of all the web pages out there.

Workaround: JavaScript string manipulation

In the past, I usually worked around these issues. In practice, some quick and dirty string manipulation using the Modified Javascript Value step and the built-in indexOf(), substring() and replace() string functions goes a long way.

In most cases I don't really need the entire web page, but only a <table>, <ul> or <ol> element in the <body>. Excising only the interesting sections out of the page using plain string manipulation will often get rid of most of the cruft that prevents the data from being treated as XML. For example, if we only need to get the rows from a table with a particular id attribute, we can use a JavaScript snippet like this:

//table we're looking for
var startHandle = "<table class=\"lvw\" cellpadding=0 cellspacing=0>";
var startPosition = html.indexOf(startHandle);
//look beyond the start tag to lose the invalid unquoted attributes
startPosition += startHandle.length;

//find where this table ends (lucky us, no nested table elements :)
var endHandle = "</table>";
var endPosition = html.indexOf(endHandle, startPosition);

//make a complete table fragment out of it again
var table = "<table>" + html.substring(startPosition, endPosition + endHandle.length);

//replace nbsp entities, empty unclosed img elements, and value-less nowrap attributes
table = table.replace(/&nbsp;|<img[^>]*>|nowrap/ig, "");


There are of course no guarantees that the sections you cut out like that are in fact well-formed XML, but in my experience it's often worth a try.

A better way: using JTidy

While the JavaScript workaround may work for your particular case, it certainly has disadvantages. Sometimes it is just not so simple to clean the HTML with plain string manipulation. And of course there is a performance issue too - the JavaScript step can be quite slow.

Fortunately, there is a better way.

Using a User-defined Java class step, we can have JTidy do the dirty work of cleaning the HTML and converting it to XML, which we can then process in a sane way with Kettle's Get Data from XML step.

We need to do two things to make this work. First, download JTidy, unzip it, and place jtidy-r938.jar in the libext directory immediately inside your Kettle installation directory (note that if Spoon is running, you need to restart it before the jar is picked up). Second, you need a little bit of glue code for the User-defined Java class step so Kettle can use the Tidy class inside the jar. With some help from the Pentaho wiki and the JTidy JavaDoc documentation, I came up with the following Java snippet to make it work:

import org.w3c.tidy.Tidy;
import java.io.StringReader;
import java.io.StringWriter;

protected Tidy tidy;

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
  //create and configure a Tidy instance
  tidy = new Tidy();
  tidy.setXmlOut(true);
  return parent.initImpl(stepMetaInterface, stepDataInterface);
}

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  Object[] r;
  //Get the row from the incoming stream.
  //Bail out if it's not there.
  if ((r = getRow()) == null) {
    setOutputDone();
    return false;
  }

  //read the value of the html input field
  //the html field happens to be the 5th field in the stream;
  //because java arrays start at 0, we use index 4 to reference it
  StringReader html = new StringReader((String)r[4]);

  //use tidy to parse html to xml
  StringWriter xml = new StringWriter();
  tidy.parse(html, xml);

  //assign the xml to the output row
  //note we simply overwrite the original html field from the input row.
  r[4] = xml.toString();

  //push the output row to the outgoing stream.
  putRow(data.outputRowMeta, r);
  return true;
}

(Tip: for more examples and background information on the User-defined Java class step, check out the excellent blog posts by Slawomir Chodnicki, Matt Casters and the video walk-through by Daniel Einspanjer)

The big advantage of using Tidy is that you can be sure that the result is well-formed XML. In addition, you can have JTidy report on any errors or warnings, which makes it much more robust than any ad-hoc string manipulation you can come up with.
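For example, here is a minimal sketch of how such reporting could be added to the snippet above, assuming JTidy's standard configuration API (the setErrout(), setShowWarnings(), getParseErrors() and getParseWarnings() methods of the Tidy class):

import java.io.PrintWriter;
import java.io.StringWriter;

protected StringWriter messages;

//in init(), right after creating the Tidy instance:
//collect Tidy's diagnostic messages in a buffer instead of stderr
messages = new StringWriter();
tidy.setErrout(new PrintWriter(messages));
tidy.setShowWarnings(true);

//in processRow(), after the call to tidy.parse(html, xml):
//the counts indicate how messy the input document was
int errors = tidy.getParseErrors();
int warnings = tidy.getParseWarnings();
if (errors > 0 || warnings > 0) {
  //log the collected messages
  logError(messages.toString());
}

Here, logError() is one of the logging methods available inside the User-defined Java class step; you could just as well expose the error and warning counts as extra output fields.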

Saturday, December 18, 2010

Substituting variables in Kettle Parameter values

Kettle (a.k.a. Pentaho Data Integration) jobs and transformations offer support for named parameters (as of version 3.2.0). Named parameters form a special class of ordinary Kettle variables and are intended to clearly and explicitly define for which variables the caller should supply a value.

One of my pet projects, the Pentaho auto-documentation solution kettle-cookbook, uses two named parameters called INPUT_DIR and OUTPUT_DIR. These allow you to specify the directory that contains the BI content that is to be documented (such as Kettle transformation and job files, action sequence files and Mondrian schema files), and the directory in which to store the generated documentation.

Several kettle-cookbook users ran into problems attempting to use variable references in the values they supplied for the INPUT_DIR and OUTPUT_DIR parameters. In their case, the variables referenced in the supplied parameter values would be set by adding entries to the kettle.properties file. I just committed revision 64 of kettle-cookbook, which should fix this problem. In this article I briefly discuss the solution, as I think it may be useful to other Kettle users.

Substituting Kettle Variable References


Kettle doesn't automatically substitute variable references in parameter values (nor in ordinary variable values). So, if you need to support variable references inside parameter values, you have to substitute the variables yourself.

Variable substitution in Kettle 4.0.1 and up


As of Kettle version 4.0.1, the Calculator step supports a calculation type called "variable substitution in string A" that is intended exactly for this purpose. I have tested this, but unfortunately in 4.0.1 it doesn't seem to work, at least not for the built-in variable ${Internal.Transformation.Filename.Directory} which I used in my test. In the latest stable version, Kettle 4.1.0, it does work as advertised, so I would recommend using this method if you're a user of Kettle 4.1.0 (or later).

Variable substitution in earlier Kettle versions


I have committed myself to making kettle-cookbook work on Kettle 3.2.0, as my sources tell me that this is still an often-used version in many production environments. I'm even prepared to make kettle-cookbook work on Kettle versions earlier than 3.2.0, should there be sufficient demand for that. Anyway, the bottom line is that these versions do not support the "variable substitution in string A" calculation in the Calculator step, so you have to resort to a little trick.

A Kettle 3.2.0 transformation to substitute variables in parameters


For kettle-cookbook, I added a single transformation called substitute-variables-in-parameters.ktr as the first transformation of the main job.



The substitute-variables-in-parameters.ktr transformation uses a "Get Variables" step to read the values of the INPUT_DIR and OUTPUT_DIR parameters. The values are then processed by a JavaScript function which substitutes all variable references with their values. Finally, a "Set Variables" step overwrites the original value of the variables with the replaced value.

The code for the JavaScript step is shown below:

function replace_variables(text){
  var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,
      match, from = 0,
      variable_name, variable_value,
      replaced_text = ""
  ;

  while ((match = re.exec(text)) !== null) {
    variable_name = match[1] ? match[1] : match[2];
    variable_value = getVariable(variable_name, "");
    replaced_text += text.substring(from, match.index);
    replaced_text += variable_value;
    from = match.index + match[0].length;
  }
  replaced_text += text.substring(from, text.length);
  return replaced_text;
}

var replaced_input_dir = replace_variables(input_dir);
var replaced_output_dir = replace_variables(output_dir);

The script first defines the function replace_variables(text), which accepts the parameter value and returns the substituted value. Then it calls the function, applying it to the input_dir and output_dir fields from the incoming stream. These fields originate in the preceding "Get variables" step, which assigns them the value of the INPUT_DIR and OUTPUT_DIR variables. The output of the replace_variables() function is assigned to the replaced_input_dir and replaced_output_dir JavaScript variables, which leave the JavaScript step as fields of the outgoing stream. In the final "Set variables" step, the replaced_input_dir and replaced_output_dir fields are used to overwrite the original values of the INPUT_DIR and OUTPUT_DIR variables.

The replace_variables() function


Let's take a closer look at the replace_variables() function.

The heart of the function is formed by a while loop that repeatedly executes a JavaScript regular expression called re, which matches variable references.

The regular expression itself is defined at the top of the function:

var re = /\$\{([^\}]+)\}|%%([^%]+)%%/g,

Its intention is to recognize variable references of the form ${NAME} and %%NAME%%. The part of the pattern for the name is enclosed in parentheses to form a capturing group. As we shall see later on, this allows us to extract the actual name of the referenced variable. The trailing g (global) flag makes successive exec() calls continue matching from the end of the previous match. This is necessary because we want to replace all variable references in the input text, not just the first one.

The regular expression object is used to drive the while loop by calling its exec() method. In case of a match, exec() returns an array that describes the text matched by the regular expression. If there's no match, exec() returns null.

while ((match = re.exec(text)) !== null) {
...
}

If there is a match, we first extract the variable name:

variable_name = match[1] ? match[1] : match[2];
The first entry of the match array (at index 0) is the text that was matched by the pattern as a whole. The array contains subsequent elements for each capturing group in the regular expression. Because our regular expression re has 2 capturing groups, the match array contains two more elements. If the variable is of the form ${NAME}, the element at index=1 contains the variable name. If it's of the form %%NAME%%, it will be contained in the element at index=2.
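For example, matching the text ${INPUT_DIR} yields match[0] = "${INPUT_DIR}", match[1] = "INPUT_DIR" and match[2] = undefined; matching %%INPUT_DIR%% instead puts the name in match[2].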

Once we have the variable name, we can use the getVariable() javascript function to obtain its value:

variable_value = getVariable(variable_name, "");

The getVariable() function is not a standard JavaScript function; it is supplied by Kettle's JavaScript step.

To perform the actual substitution, we take the substring of the original text up to the location where the variable reference was matched. This location is conveniently supplied by the match array:

replaced_text += text.substring(from, match.index);
Right after that location, we need to put the variable value instead of its name:

replaced_text += variable_value;
The last action in the loop is to remember the location right behind the replaced variable reference, so we can pick up at the right location in the original value the next time we match a variable:

from = match.index + match[0].length;

Right after the loop, we need to copy the final piece of original text occurring right behind the last variable reference to yield the complete replaced text:

replaced_text += text.substring(from, text.length);
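So, assuming the INPUT_DIR variable is set to the (hypothetical) value /home/foo, calling replace_variables("${INPUT_DIR}/docs") returns "/home/foo/docs".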

Odds and Ends


While the substitute-variables-in-parameters.ktr transformation works great for its intended purpose, substituting variables in the known parameters INPUT_DIR and OUTPUT_DIR, it is not really applicable beyond kettle-cookbook. What you'd really want is a job that replaces variables in all parameters, not just those that are known in advance.

As it turns out, this is actually almost trivial to achieve; however, the solution is a bit too long-winded for this post. If anyone is interested in such a solution, please post a comment and let me know, and I'd be happy to provide it.

UPDATE: A solution that substitutes all variable references occurring in the parameter values of the containing job is now available at the kettle exchange area in the Pentaho wiki.

Thursday, December 09, 2010

Parameterizing SQL statements in the Kettle Table Input step: Variables vs Parameters

I get this question on a regular basis, so I figured I might as well blog it, in the hope it will be useful for others. Here goes:

Let's say that I want to delete all records that match an id from a set of tables. The table names come in as rows into the Execute SQL Script step (check execute for every row). Next I write:
DELETE FROM {table_name} WHERE id = {identifier}

as the SQL to execute. In the parameters grid at the bottom right, I have two fields: table_name and identifier. What is the syntax for substituting the table_name and identifier parameters in the sql script?


(Although this particular question focuses on the "Execute SQL Script" step, it also applies to the "Table Input" step, and probably a few more steps I can't recall right now.)

The parameters grid can be used for prepared statement value placeholders. In the SQL statement these placeholders are denoted as question marks (?). These are positional parameters: they get their values from the fields in the incoming stream that are entered in the parameters grid, in order.
Here's an example of the correct usage of these placeholders:
DELETE FROM myTable WHERE id = ?

Here, the ? in the WHERE clause will be bound to the value of the first field from the incoming stream entered in the parameters grid. Because there is only one such placeholder, there can be only one field in the parameters grid.

An important thing to realize is that these parameters can only be used to parameterize value expressions. This kind of parameter does not work for identifiers, nor for structural elements of the SQL statement, such as keywords. In particular, it cannot be used to parameterize the table name, which seems to be the intention in the original example posed in the question.
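For example, a statement like DELETE FROM ? WHERE id = ? would simply be rejected by the database: the first placeholder occupies the position of an identifier (the table name), not that of a value.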

There is a way to parameterize structural elements of the SQL statement as well as value expressions: you can apply variable substitution to the SQL statement.

Kettle Variables can be defined by a Set Variables step, or by specifying parameters at the transformation level. They get their value from "the environment": for example, parameters get their value initially when the transformation is started, and regular variables are typically set somewhere in the job that is calling your transformation.

In text fields, including the SQL textarea of the Table input step or the Execute SQL Script step, you denote those variables with this syntax: ${VARIABLE_NAME}. So to parameterize the table name we could use something like this:
DELETE FROM ${TABLE_NAME}


In order to force Kettle to apply variable substitution to the SQL statement, you have to check the "variable substitution" checkbox. If this checkbox is checked, then all variables are simply substituted with their (string) values during transformation initialization. This is a lot like the way macros are substituted by the preprocessor in C/C++ code.
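Both mechanisms can be combined in a single statement. For example, the following statement uses a variable for the table name and a placeholder for the value (this assumes a TABLE_NAME variable is defined and the parameters grid lists one incoming field to bind the placeholder):
DELETE FROM ${TABLE_NAME} WHERE id = ?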

When comparing variables with parameters, two important things should be mentioned here:

  • Unlike value placeholders, variables can be used to manipulate any aspect of the SQL statement, not just value expressions. The variable value simply becomes part of the text that makes up the SQL statement; it is your responsibility to ensure that the result is a syntactically valid and correct SQL statement.

  • Variables are evaluated once, during transformation initialization. So if you want to vary the variable value, you'll have to call the transformation again for the change to take effect. For the same reason, you cannot set the value of a variable and read it within the same transformation: setting the variable value occurs at runtime, but evaluating it occurs at initialization time.



Finally, here's a screenshot that summarizes these different ways to parameterize SQL statements in Kettle:

If you want to read more about this topic, it's covered in both our books Pentaho Solutions and Pentaho Kettle Solutions. Another title you might be interested in is María Carina Roldán's Pentaho 3.2 Data Integration: Beginner's Guide.
