To refresh your memory: the generic normalizer takes any input stream and, for each input row, outputs one row for each field in the input stream. The output rows contain fields for the input row number, the input field number, and the input field value. As such it provides the same functionality as the built-in Row Normaliser step, but without requiring any configuration, which allows it to process arbitrary streams of data.
A reusable normalizer
Recently I received a comment asking for more information on how to make this normalizer more reusable:

"I want to use this method to do field level auditing but I want to encapsulate it in a sub transformation to which I can pass the result rows from any step. In your image of 'how all these steps work together', instead of a data grid, the input would need to be dynamic in terms of the number of fields/columns and the data types. Could you possibly provide a hint how to make the input to these steps (in your example, the datagrid) dynamic?"

In the meantime, I learned a thing or two about Kettle's internals, and it seemed like a good idea to describe how to improve on the original example and make it suitable for use in a so-called Mapping, a.k.a. a sub-transformation.
Design for use as Subtransformation
The design for the re-usable generic normalizer is shown below:

The User-defined Java class step in the middle actually implements the normalizer. The Mapping input and Mapping output specification steps allow the normalizer to be called from another transformation: they enable it to receive input data from, and return output data to, the calling transformation.
In the screenshot above, the configuration dialogs for both the Mapping input and output specification steps are shown. This is mainly to show that there is no configuration involved: the Mapping input specification step will faithfully pass all fields received from the incoming stream on to the normalizer, and the Mapping output specification will output all fields coming out of the normalizer to the outgoing stream.
Normalizer Improvements
The configuration of the user-defined Java class step differs in a number of aspects from what I used in the original normalizer example. In the original example the normalizer output consisted of three fields:
- rownum - a sequential integer number identifying the position of the row to which the current output row applies
- fieldnum - a sequential integer number identifying the position of the field to which the current output row applies
- value - a string representation of the value to which the output row applies
The improved generic normalizer adds two more output fields:
- fieldname - the name of the field as it appears in the input stream
- fieldtype - the name of the data type of the field as it appears in the input stream
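To make the shape of the output concrete, here is a small plain-Java sketch (no Kettle dependency; the class and method names are mine, not Kettle's) that mimics what the normalizer emits for a single input row. The column order of each output row follows the array positions used in the user-defined Java class shown below: row number, field number, data type name, field name, and the string value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the normalizer logic, for illustration only.
public class NormalizerSketch {

    // Each output row: {rownum, fieldnum, fieldtype, fieldname, value}.
    static List<Object[]> normalize(long rownum, String[] fieldNames,
                                    String[] fieldTypes, Object[] row) {
        List<Object[]> out = new ArrayList<>();
        for (int i = 0; i < row.length; i++) {
            out.add(new Object[] {
                rownum,                 // input row number
                (long) i + 1,           // input field number (1-based)
                fieldTypes[i],          // data type name of the field
                fieldNames[i],          // name of the field
                String.valueOf(row[i])  // string representation of the value
            });
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"id", "name"};
        String[] types = {"Integer", "String"};
        Object[] row = {42, "Smith"};
        for (Object[] r : normalize(1L, names, types, row)) {
            System.out.println(java.util.Arrays.toString(r));
        }
        // Prints:
        // [1, 1, Integer, id, 42]
        // [1, 2, String, name, Smith]
    }
}
```

So a single two-field input row yields two five-field output rows, one per input field.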
Code
The code for the user-defined Java class is shown below:

static long rownum = 0;
static RowMetaInterface inputRowMeta;
static long numFields;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
  // get the current row
  Object[] r = getRow();
  // If the row object is null, we are done processing.
  if (r == null) {
    setOutputDone();
    return false;
  }
  // If this is the first row, cache some metadata.
  // We will reuse this metadata in processing the rest of the rows.
  if (first) {
    inputRowMeta = getInputRowMeta();
    numFields = inputRowMeta.size();
  }
  // Generate a new id number for the current row.
  rownum += 1;
  // Generate one output row for each field in the input stream.
  int fieldnum;
  ValueMetaInterface valueMetaInterface;
  for (fieldnum = 0; fieldnum < numFields; fieldnum++) {
    // get metadata for the current field
    valueMetaInterface = inputRowMeta.getValueMeta(fieldnum);
    Object[] outputRow = new Object[5];
    // assign the row id
    outputRow[0] = rownum;
    // Assign the field id. Note that we need to cast to long to match Kettle's type system.
    outputRow[1] = (long) fieldnum + 1;
    // assign the data type name
    outputRow[2] = valueMetaInterface.getTypeDesc();
    // assign the field name
    outputRow[3] = valueMetaInterface.getName();
    // assign a string representation of the field value
    outputRow[4] = inputRowMeta.getString(r, fieldnum);
    // emit a row
    putRow(data.outputRowMeta, outputRow);
  }
  return true;
}

The main difference from the original code is the addition of the two new output fields, fieldname and fieldtype. To obtain the values for these fields, the loop over the fields first obtains the ValueMetaInterface object for the current field by calling the getValueMeta() method of the RowMetaInterface object, passing the index of the desired field. Using the ValueMetaInterface object, the field name is obtained with its getName() method, and the data type name with its getTypeDesc() method.
Calling the normalizer as subtransformation
Using the improved normalizer is as simple as adding a Mapping step to your transformation and pointing it to the transformation that contains the normalizer and the Mapping input and output specification steps:
Download samples
The transformations discussed in this post are available here:
- generic-normalizer-mapping.ktr - the reusable generic normalizer.
- reusable-generic-normalizer.ktr - example illustrating how to use the reusable generic normalizer.