
Wednesday, April 15, 2015

MDX: retrieving the entire hierarchy path with Ancestors()

A couple of days ago I wrote about one of my forays into MDX land (Retrieving denormalized tabular results with MDX). The topic of that post was how to write MDX so as to retrieve the kind of flat, tabular results one gets from SQL queries. An essential point of that solution was the MDX Ancestor() function.

I stumbled upon the topic of my previous blogpost while I was researching something else entirely. Creating flat tables and looking up individual ancestors is actually a rather specific application of a much more general solution I found initially.

Pivot tables and the "Show Parents" functionality

GUI OLAP tools typically offer a pivot table query interface. They let you drag and drop measures and dimension items, like members and levels to create a pivot table. The cells of the pivot table are aggregated values of the measures, and the row and column headers of the pivot table are dimension members, which are typically derived from a level that was dragged into the pivot table.

Please recall the sales cube example I introduced in my previous post:



Now, suppose we would drag the Sales quantity measure onto the columns axis of our pivot table, and drag the Quarters level from the Time dimension onto the rows axis. The GUI tool might generate an MDX query quite like the one I introduced in my previous post:
SELECT  Measures.Quantity     ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
Here's how this is rendered in Saiku Analytics:



And here's how it looks in Pivot4J:



Now, as I pointed out in my previous post, the problem with this result is that we don't see any context: we cannot see to which year the quarters belong. Both tools have a very useful feature called "Show Parents". This is a toggle button that changes the view so that the headers show the values of the corresponding higher levels. For example, this is what the previous result looks like in Pivot4J when "Show Parents" is toggled:

As you can see, the Year level and even the All level are now visible.

In Saiku we can achieve something similar, but the other way around: you can add the Year level and the All level, at which point totals are shown for these higher levels:



And you can then choose "Hide Parents" to get rid of the rows for the higher-level aggregates, leaving you with essentially the same view of the data as shown in the last Pivot4J screenshot.

Implementing Show/Hide Parents

In Saiku, the "Hide Parents" functionality is achieved by post-processing the resultset: when the result is iterated to render the table, rows for all but the lowest level are filtered away and discarded.
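That filtering step might be sketched like this (illustrative Python of my own, not Saiku's actual implementation; the row structure is made up for the example):

```python
# Hypothetical sketch of the "Hide Parents" post-processing step:
# each result row is tagged with the level depth of its row member,
# and only rows at the deepest level are kept.

def hide_parents(rows):
    """Keep only the rows whose member sits at the lowest (deepest) level."""
    deepest = max(depth for depth, _ in rows)
    return [row for row in rows if row[0] == deepest]

rows = [
    (1, ("2003", 36439)),  # Year-level aggregate row
    (2, ("QTR1", 4561)),   # Quarter-level detail rows
    (2, ("QTR2", 5695)),
]

detail_rows = hide_parents(rows)  # only the quarter rows remain
```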

In Pivot4J, it works a little differently. Here's how Xavier Cho describes it:
the information of the parents is obtained from the members present on the axes. Pivot4J accesses it through the Olap4J API, which exposes a member's parent and ancestors via Member.getParentMember() and Member.getAncestorMembers() respectively:

http://www.olap4j.org/api/org/olap4j/metadata/Member.html

References to the member instances in a given MDX query can be obtained from its CellSet interface, which is the equivalent of what ResultSet is for JDBC. In addition, Pivot4J exposes the member instance for each cell to the expression language context, so you can reference the member itself, or its parent or ancestors, in a property expression too.

In summary, if you are trying to access the parent of a member included in MDX, you'll first need to execute the query using Olap4J and then get it from the resulting CellSet instance.

A pure MDX expression

I thought it would be fun to try and rewrite our original query in such a way that its result would give us this information.

The Ancestors() function

As it turns out, we can do this for one particular hierarchy in our query by creating a Calculated Member on the Measures hierarchy that applies the Ancestors() function to the current member of the hierarchy for which we want the path.

The Ancestors() function takes 2 arguments:
  1. A member for which to find ancestor members (members at a higher level that contain the argument member)
  2. An argument that specifies how many levels to traverse up.
The function returns a set of members that are an ancestor of the member passed as first argument.

Specifying the first argument is easy: we simply want to find ancestors for whatever member, so we can specify it as <Hierarchy Name>.CurrentMember and it will just work.

The second argument can be specified in 2 ways:
  • As a level: the second argument specifies a level and all ancestors up to that level will be retrieved
  • As an integer representing a distance: the second argument specifies the number of levels that will be traversed upwards
The first form is useful if you want to retrieve ancestors up to a specific level. I want to retrieve all ancestors, so the number of levels I want the function to traverse is in fact equal to the level number of the first argument. We can conveniently specify this with the LEVEL_NUMBER property using an expression like:

<Hierarchy Name>.CurrentMember.Properties("LEVEL_NUMBER")

But this is not yet entirely right, since this form of the Properties() function always returns a string, even though the LEVEL_NUMBER property is actually of the integer type. The standard MDX Properties() function allows an optional second argument TYPED. When this is passed, the property will be returned as a value having its declared type.

Unfortunately, Mondrian, a.k.a. Pentaho Analysis Services does not support that form of the Properties() function (see: MONDRIAN-1795). So, in order to retrieve the level number as an integer value, we have to apply the CInt() function to convert the string representation of the level number to an integer.

So, our call to the Ancestors() function will look like this:

Ancestors(<Hierarchy Name>.CurrentMember, CInt(<Hierarchy Name>.CurrentMember.Properties("LEVEL_NUMBER")))

A simpler alternative: Ascendants()

If it is acceptable to also include the CurrentMember itself, then we can even simplify this quite a bit by using the Ascendants() function. The Ascendants() function takes a single member as argument, and returns the set of ancestor members as well as the argument member, all the way up to the member at the top level. With Ascendants(), our expression would simply be: Ascendants(<Hierarchy Name>.CurrentMember)

We will continue this post using Ancestors(), but the approach can be easily applied to Ascendants() instead.

Converting the set of Ancestor members to a scalar value

However, we can't just use the bare Ancestors() expression in our query, nor can we use it as-is to create a calculated member. That's because Ancestors() returns a set of members, while we want something that we can retrieve from the cells in the result.

As an initial attempt we can try and see if we can use the SetToStr() function, which takes a set as argument and returns a string representation of that set. We can now finally write a query, and it would look something like this:
WITH
MEMBER  Measures.[Time Ancestors]
AS      SetToStr(
          Ancestors(
            Time.CurrentMember, 
            CInt(
              Time.CurrentMember.Properties("LEVEL_NUMBER")
            )
          )
        )
SELECT  Measures.[Time Ancestors] ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
The results might look something like this:
Time Time Ancestors
QTR1 {[Time].[2003], [Time].[All Years]}
QTR2 {[Time].[2003], [Time].[All Years]}
QTR3 {[Time].[2003], [Time].[All Years]}
QTR4 {[Time].[2003], [Time].[All Years]}
QTR1 {[Time].[2004], [Time].[All Years]}
QTR2 {[Time].[2004], [Time].[All Years]}
QTR3 {[Time].[2004], [Time].[All Years]}
QTR4 {[Time].[2004], [Time].[All Years]}
QTR1 {[Time].[2005], [Time].[All Years]}
QTR2 {[Time].[2005], [Time].[All Years]}
Well, this certainly looks like we're on the right track! However, there are at least two things that are not quite right:
  • The string representation returned by SetToStr() looks very much like how one would write the set down as an MDX set literal (is that a thing? It should be :-). While entirely correct, it does not look very friendly and it is certainly quite a bit different from what our GUI tools present to end-users
  • The order of the members. It looks like Ancestors() returns the members in order of upward traversal, that is to say, from lower levels (=higher level numbers) to higher levels (=lower level numbers). The fancy way of saying that is that our result suggests that Ancestors() returns its members in post-natural order. We'd like the members to be in natural order, that is to say, in descending order of level (from high to low). Note that the specification of Ancestors() does not specify or require any particular order. So in the general case we should not rely on the results to be in any particular order.
First, let's see if we can fix the order of ancestor members. There's two different MDX functions that seem to apply here:
  • Order() is a general-purpose function that can be used to order the members of a set by an arbitrary numeric expression.
  • Hierarchize() is designed to order members into hierarchical order, that is to say, the members are ordered by their level number and by the level numbers of their ancestors.
While Order() is a nice and reasonable choice, Hierarchize() seems tailored exactly for our purpose so that's what we'll use:
WITH
MEMBER  Measures.[Time Ancestors]
AS      SetToStr(
          Hierarchize(
            Ancestors(
              Time.CurrentMember, 
              CInt(
                Time.CurrentMember.Properties("LEVEL_NUMBER")
              )
            )
          )
        )
SELECT  Measures.[Time Ancestors] ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
And the result will now look like:
Time Time Ancestors
QTR1 {[Time].[All Years], [Time].[2003]}
QTR2 {[Time].[All Years], [Time].[2003]}
QTR3 {[Time].[All Years], [Time].[2003]}
QTR4 {[Time].[All Years], [Time].[2003]}
QTR1 {[Time].[All Years], [Time].[2004]}
QTR2 {[Time].[All Years], [Time].[2004]}
QTR3 {[Time].[All Years], [Time].[2004]}
QTR4 {[Time].[All Years], [Time].[2004]}
QTR1 {[Time].[All Years], [Time].[2005]}
QTR2 {[Time].[All Years], [Time].[2005]}
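To see what Hierarchize() does to a single chain of ancestors, here is a small sketch (plain Python of my own, not Mondrian internals): putting the ancestor list into natural order amounts to sorting it by ascending level number, top level first.

```python
# Illustrative sketch: a member chain as (caption, level_number) pairs.
# Ancestors() handed us the members from deep to shallow (post-natural
# order); natural order is ascending level number, top level first.

post_natural = [("2003", 1), ("All Years", 0)]

natural = sorted(post_natural, key=lambda member: member[1])
```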
Now, as for obtaining a more friendly, human-readable string representation of the set, this is a considerably more open requirement. On the one hand there is the matter of how to represent each member in the ancestor set; on the other hand there is the matter of extracting this information from the resultset and using it in the GUI.

To represent members we have a handful of options: we could use the member name, or we could use its key value; however, since we want to expose the information to the user, the only thing that seems really suitable is the member caption. Placing that data into the GUI is an implementation detail that need not concern us too much at this point. Let's say we aim to return the data as a comma-separated list, and assume our GUI tool is capable of extracting that data and then using it to render a result.

The function that seems to suit our need is called Generate(). There are actually 2 forms of Generate(), which frankly seem to suit completely different purposes. The form we're interested in is functionally quite similar to the MySQL-builtin aggregate function GROUP_CONCAT().

The arguments to this form of Generate() are:
  1. A set. This is where we'll feed the Ancestors() expression in
  2. A string expression. This expression will be evaluated for each member in the set passed as first argument. We'll use this to retrieve the caption of the current member of the hierarchy for which we're generating the ancestors list.
  3. A separator. Generate() concatenates the result values returned by the string expression passed as second argument, and this string will be used to separate those values. Since we want to obtain a comma-separated list, we'll use the literal string "," for this argument.
The result is a single string value.
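Functionally, this form of Generate() resembles string-joining in a general-purpose language. A rough Python analogy (names are my own; this is not an OLAP API):

```python
# Rough analogy of Generate(set, string_expression, separator):
# evaluate the string expression for each member of the set and
# concatenate the results, separated by the separator string.

def generate(members, string_expression, separator):
    return separator.join(string_expression(m) for m in members)

# the hierarchized ancestor set for QTR1 of 2003, as captions
members = [{"caption": "All Years"}, {"caption": "2003"}]

result = generate(members, lambda m: m["caption"], ",")  # "All Years,2003"
```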

Putting it together, our query becomes:
WITH
MEMBER  Measures.[Time Ancestors]
AS      Generate(
          Hierarchize(
            Ancestors(
              Time.CurrentMember, 
              CInt(
                Time.CurrentMember.Properties("LEVEL_NUMBER")
              )
            )
          )
        , Time.CurrentMember.Properties("MEMBER_CAPTION")
        , ","
        )
SELECT  Measures.[Time Ancestors] ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
And the result:
Time Time Ancestors
QTR1 All Years,2003
QTR2 All Years,2003
QTR3 All Years,2003
QTR4 All Years,2003
QTR1 All Years,2004
QTR2 All Years,2004
QTR3 All Years,2004
QTR4 All Years,2004
QTR1 All Years,2005
QTR2 All Years,2005
And we can repeat this process for every hierarchy on every axis, just like we did with the Ancestor() function in the previous post.

Sunday, April 12, 2015

Retrieving denormalized tabular results with MDX

I've been trying to learn the MultiDimensional Expression language (MDX) for quite a while. Unfortunately, I haven't been very successful at it. MDX has a few conceptual and some (superficial) syntactical similarities to SQL, but I have arrived at the conclusion that it is probably better to treat it as a completely different kind of language.

I do not intend to write a post entirely about the similarities and differences between MDX and SQL. There are tons of articles that do that already. For example, this Microsoft Technet article puts it like this:
The Multidimensional Expressions (MDX) syntax appears, at first glance, to be remarkably similar to the syntax of Structured Query Language (SQL). In many ways, the functionality supplied by MDX is also similar to that of SQL; with effort, you can even duplicate some of the functionality provided by MDX in SQL.

However, there are some striking differences between SQL and MDX, and you should be aware of these differences at a conceptual level. The following information is intended to provide a guide to these conceptual differences between SQL and MDX, from the point of view of an SQL developer.
A discussion regarding the similarities and differences between MDX and SQL is, to me, quite like a discussion on the similarities and differences between crabs and spiders. They're similar as they both have way too many legs, and because neither speak English, you'll have a hard time getting them to behave according to your instructions. But there are differences too, since one of them is by any measure way more hairy than could ever be called reasonable, and traps you in its web before poisoning you with its venomous fangs, whereas the other is only hard-headed and pinches you with its razor-sharp claws; at least, if it's not walking away from you backwards.

The reason why I included the quote from the Microsoft article is because it illustrates an important point about MDX: it is often regarded as something that extends SQL - something that allows you to do things that are very hard or even impossible to do in SQL.

Frankly, that notion is mostly true. You don't even need to spend a lot of time playing with MDX to realize it. Unfortunately, it would take me considerable time and effort to describe with words why this is so. And I wouldn't be able to do it without falling into the trap of describing similarities and mostly, lots of differences between MDX and SQL, and by extension, the similarities and differences between the tables that SQL operates upon and the multi-dimensional datasets called OLAP-Cubes, which is the type of data structure that MDX operates upon.

What might be less apparent from the quote from the Microsoft article is that sometimes, it may not be so straightforward to make MDX behave like SQL. I should point out right away that it is much easier to make MDX behave like SQL than the other way around. However, since most people do not desire to use MDX to do the kind of stuff that is normally done in SQL, it might be just a little bit obscure and hard to find examples that show you exactly how. That is what this blogpost aims to provide.

The inevitable example - a Sales cube

Let's assume we have a Sales Cube that describes the sales process of some sort of retail business. The cube might have quantitative measures like Sales Quantity and Sales price, and dimensions like Customer (the person that bought something), Product (the item that was sold), and Time (the date/time when the Sale took place). The image below illustrates this setup:



The image illustrates a (very basic) physical star schema, that is, a database model of a fact table surrounded by and linked to its dimension tables. In the top half of each box that represents a table we find the key column(s): in the fact table, the key is composite, and consists of foreign keys pointing to the dimension tables; in the dimension tables, the key is non-composite and referenced by the fact table via a foreign key. In the bottom half of each table box we find any non-key columns.

Although the image illustrates a database star schema, we can, with some effort, also read it as an OLAP-cube design. To do that, just pretend the key columns (those columns appear in the top section of each table) aren't there. For the fact table, think of the non-key columns as measures. For the dimensions, try to think of the non-key columns as the levels of a single hierarchy.

When working with a ROLAP engine like Mondrian (a.k.a. Pentaho Analysis) we can, in principle, literally maintain this 1:1 mapping between dimension tables and dimensions, dimension table columns and hierarchy levels, and fact table columns and measures.

In practical, real-world situations, the mapping between database schema and OLAP-cube doesn't necessarily have to be so straightforward. But I'm keeping things as simple as possible on purpose because it helps to illustrate the point I want to make.

MDX Queries against the Pentaho SteelWheels example Cube

For actual MDX query examples I will use the Pentaho SteelWheels sample database and Cube. While both that Cube and its underlying physical star schema are very simple, they are still a good deal more complex than my example. Just don't let that distract you.

To execute MDX queries, I like to use the Pentaho Analysis shell (Pash). Pentaho users can install Pash directly from the Pentaho marketplace. Instructions for installing and running Pash on other OLAP platforms, like icCube, Jasper Reports and others can be found in the Pash project README on github.

Sales Quantity per Quarter in MDX and SQL

Consider the following MDX-query against the SteelWheels Sales Cube:
SELECT  Measures.Quantity     ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
This very basic query says: get us the Sales Quantity per quarter. The result might be represented like this:
Time Quantity
QTR1 4561
QTR2 5695
QTR3 6629
QTR4 19554
QTR1 8694
QTR2 8443
QTR3 11311
QTR4 20969
QTR1 10995
QTR2 8480
If you're not familiar with MDX, it is worth pointing out that although the result looks just like an ordinary table like you would get as a result from a SQL query, it is actually not quite so simple. I'll get back to that in a bit.

If one were to write a SQL query that provides a similar result, it would look something like this:
SELECT      DateDimension.QuarterLevel
,           SUM(SalesFact.Quantity)     AS SalesQuantityMeasure
FROM        SalesFact
INNER JOIN  DateDimension
ON          SalesFact.DateDimensionKey = DateDimension.DateDimensionKey
GROUP BY    DateDimension.QuarterLevel

Sales Quantity per Quarter with the year in SQL

Now there's one problem with this query, or rather with its result. While the results are correct, they are hard to interpret since we do not have any indication what year the quarters belong to. In SQL, we might add an expression that retrieves the year to the SELECT and GROUP BY clauses to fix this:
SELECT      DateDimension.YearLevel
,           DateDimension.QuarterLevel
,           SUM(SalesFact.Quantity)     AS SalesQuantityMeasure
FROM        SalesFact
INNER JOIN  DateDimension
ON          SalesFact.DateDimensionKey = DateDimension.DateDimensionKey
GROUP BY    DateDimension.YearLevel
,           DateDimension.QuarterLevel
This extension of our previous SQL query basically says: give me the sales quantity per quarter, and also show the year to which the quarter belongs. This would give us a result like this:
YearLevel QuarterLevel SalesQuantityMeasure
2003 QTR1 4561
2003 QTR2 5695
2003 QTR3 6629
2003 QTR4 19554
2004 QTR1 8694
2004 QTR2 8443
2004 QTR3 11311
2004 QTR4 20969
2005 QTR1 10995
2005 QTR2 8480

Trying the same thing in MDX

Now, considering the superficial syntactical similarities between the MDX and the SQL statement, we might be tempted to add an expression for the Year level to our MDX query as well in order to get a similar result:
SELECT  Measures.Quantity ON COLUMNS
,       {Time.Years.Members, Time.Quarters.Members} ON ROWS
FROM    SteelWheelsSales
For now, don't worry too much about the curly braces. Instead, take a look at the result:
Time Quantity
2003 36439
2004 49417
2005 19475
QTR1 4561
QTR2 5695
QTR3 6629
QTR4 19554
QTR1 8694
QTR2 8443
QTR3 11311
QTR4 20969
QTR1 10995
QTR2 8480
Well, the result certainly contains years, but not quite in the way you would've expected. At least, not if MDX behaved like SQL, which it clearly does not.

While adding the year expression to the SQL query caused a new Year column to be added to the result, doing the - superficially - similar thing in MDX did not change the number of "columns" of the result, but added a number of "rows" instead.

If you examine the value of the Quantity "column" for the first three rows, you might notice that the value there is quite a bit larger than for any of the quarters. And you even might notice that the value for 2003 is in fact QTR1: 4561 + QTR2: 5695 + QTR3: 6629 + QTR4: 19554 = 36439, and that the values for the other years are similarly the sum of the subsequent quarters.
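The rollup is easy to verify with a quick bit of arithmetic:

```python
# The 2003 cell equals the sum of its four quarter cells:
quarters_2003 = [4561, 5695, 6629, 19554]
year_2003 = sum(quarters_2003)  # 36439
```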

This is of course no coincidence. The first 4 quarters belong to 2003, just like the second group of 4 quarters belong to 2004, and MDX "knows" this because of the way the cube is structured. But this is also the key to solving our problem: MDX offers functions that allow us to look up related items of data. In this particular case, we can use the Ancestor() function to look up the Year that corresponds to the quarter.

Using the Ancestor() function in a Calculated Measure

The following query uses the Ancestor() function in a Calculated Member to lookup the value at the Year level for the current member on the Time hierarchy:
WITH
MEMBER  Measures.[Time.Year]
AS      Ancestor(
          Time.CurrentMember,
          Time.Years
        ).Properties("MEMBER_CAPTION")
SELECT  {Measures.[Time.Year]
        ,Measures.Quantity}   ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
The result is shown below:
Time Time.Year Quantity
QTR1 2003 4561
QTR2 2003 5695
QTR3 2003 6629
QTR4 2003 19554
QTR1 2004 8694
QTR2 2004 8443
QTR3 2004 11311
QTR4 2004 20969
QTR1 2005 10995
QTR2 2005 8480
Since the Calculated Member was declared to be part of the Measures hierarchy, it will be used to generate values to fill cells. For each quarter on the ROWS axis, it is evaluated. The first argument to our Ancestor() function is Time.CurrentMember and this will be the item for which we are looking for an Ancestor; in other words, we are looking up an ancestor of a quarter. The second argument is a level expression Time.Years, which tells the Ancestor() function that we want whatever ancestor item exists at the Year level for the first argument.

The remainder of the expression for the Calculated member, .Properties("MEMBER_CAPTION"), serves to extract the human readable friendly label for the found Ancestor, and this is the value that will finally be used to populate the cell.

Note: Many MDX engines, including Mondrian, support a shortcut syntax to retrieve the caption: instead of writing .Properties("MEMBER_CAPTION"), you can also simply write .Caption. Unfortunately, this shortcut syntax is not universally supported, while .Properties("MEMBER_CAPTION") should always be supported.

A general recipe for extracting denormalized tables with MDX

By comparing our last 2 MDX queries we can distill a general workflow to construct denormalized result tables using MDX:
  1. Make a selection of all levels across all hierarchies that you want to see in your result. Let's take for example Product Line, Product Vendor, Years and Quarters
  2. The levels you selected in #1 each belong to a particular hierarchy. In this case: Product and Time. For each hierarchy, determine the lowest level in your selection. In this case: Product Vendor from the Product hierarchy and Quarters from the Time hierarchy.
  3. You can now use the levels you found in #2 to write the expression for the ROWS-axis of your MDX query. Append .Members to each level name, and combine these expressions with each other using the CrossJoin() function. In this case, that expression looks like CrossJoin(Product.Vendor.Members, Time.Quarters.Members)
  4. Take the remaining levels from your selection in #1 (that is, those levels that you didn't put on the ROWS axis in step #3). In our case, those levels are Product.Line and Time.Years. Write a Calculated member on the Measures hierarchy that uses the Ancestor() function. To keep things clear, derive the name of the calculated member from the name of the hierarchy and the name of the level. So for example, the Calculated Member for the Product Line level will be Measures.[Product.Line] or something like it. As first argument to Ancestor(), write the name of the hierarchy, followed by .CurrentMember. For the second argument, specify the level itself. To extract the Caption, append .Properties("MEMBER_CAPTION") to the call to Ancestor(). In our case we get: MEMBER Measures.[Product.Line] AS Ancestor(Product.CurrentMember, Product.Line).Properties("MEMBER_CAPTION") and MEMBER Measures.[Time.Years] AS Ancestor(Time.CurrentMember, Time.Years).Properties("MEMBER_CAPTION").
  5. Construct a set for the COLUMNS axis of your query, consisting of a comma-separated list of the names of the calculated members. In our case it would be {Measures.[Product.Line], Measures.[Time.Years]} ON COLUMNS.
  6. Finally, if you want to also select any "real" measure values, include the appropriate measures in the list on the COLUMNS axis. Remember, each measure will be aggregated at the levels you chose in step #2. Suppose we wanted to include Sales Quantity in our example, we'd have to change the COLUMNS code we constructed in step #5 to {Measures.[Product.Line], Measures.[Time.Years], Measures.[Quantity]} ON COLUMNS
This is the actual complete MDX statement:
WITH
MEMBER  Measures.[Product.Line]
AS      Ancestor(
          Product.CurrentMember,
          Product.Line
        ).Properties("MEMBER_CAPTION")
MEMBER  Measures.[Time.Years]
AS      Ancestor(
          Time.CurrentMember,
          Time.Years
        ).Properties("MEMBER_CAPTION")
SELECT  {Measures.[Product.Line]
        ,Measures.[Time.Years]
        ,Measures.Quantity}   ON COLUMNS
,       CrossJoin(
         Product.Vendor.Members
        ,Time.Quarters.Members
        ) ON ROWS
FROM    SteelWheelsSales
And the result looks something like:
Product Time Product.Line Time.Years Quantity
Autoart Studio Design QTR1 Classic Cars 2003 33
Autoart Studio Design QTR2 Classic Cars 2003 42

...many more rows...

Welly Diecast Productions QTR1 Vintage Cars 2005 76
Welly Diecast Productions QTR2 Vintage Cars 2005 113
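The recipe is mechanical enough to automate. As a sketch (my own illustration, not an existing tool; real cubes would need proper bracketing and quoting of names), the clauses could be generated from a selection of levels like this:

```python
# Sketch: generate the WITH/SELECT clauses of the denormalizing recipe
# from a selection of hierarchies and their levels (ordered top-down).
# Names and structure are illustrative only.

def denormalizing_mdx(selection, measures, cube):
    rows, calc_members, columns = [], [], []
    for hierarchy, levels in selection.items():
        lowest = levels[-1]  # the deepest level goes on the ROWS axis
        rows.append(f"{hierarchy}.{lowest}.Members")
        for level in levels[:-1]:  # higher levels become calculated members
            name = f"Measures.[{hierarchy}.{level}]"
            calc_members.append(
                f"MEMBER {name} AS Ancestor({hierarchy}.CurrentMember, "
                f"{hierarchy}.{level}).Properties(\"MEMBER_CAPTION\")"
            )
            columns.append(name)
    columns += [f"Measures.{m}" for m in measures]
    return (
        "WITH " + " ".join(calc_members)
        + " SELECT {" + ", ".join(columns) + "} ON COLUMNS, "
        + "CrossJoin(" + ", ".join(rows) + ") ON ROWS FROM " + cube
    )

mdx = denormalizing_mdx(
    {"Product": ["Line", "Vendor"], "Time": ["Years", "Quarters"]},
    ["Quantity"],
    "SteelWheelsSales",
)
```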

Wednesday, April 08, 2015

CSS tricks for (conditional) formatting of numbers and dates

Here's a bunch of CSS tricks that can help to format numbers and dates in HTML. You can even use it to achieve (basic) conditional formatting!

A Stackoverflow question: conditionally hiding zero values in a table

Today I stumbled upon this question on stackoverflow:
Is there a way to hide a data cell based on a specific value using just HTML/CSS? For example I have this code:
<table>

 <caption>Test</caption>

 <tr>
  <th>Values</th>
  <td>$100</td>
 </tr>

 <tr>
  <th>Initial value</th>
  <td>$0</td>
 </tr>

</table>
Is there a way to hide the cells that are equal to $0 using HTML/CSS only? Let's say instead of $0 I have a variable called fee that can be a variety of values: $0, $5, $20, etc. Is there a way to check what value it is and if it is found to be $0 can I then hide it?
As it turns out, this is actually possible with HTML5 data attributes, the CSS :before or :after pseudo-class, a CSS content property using a value of the type attr(), and attribute-value selector syntax to control conditional formatting:
<!doctype html>
<html>

 <head>

  <style type="text/css">
    
    /* make the cells output the value of their data-value attribute */
    td:after {
      content: attr(data-value);
    }
    
    /* hide the output if the data-value is equal to "$0" */
    td[data-value="$0"]:after {
      content: "";
    }

  </style>

 </head>

 <body>

  <table>

   <caption>Test</caption>

   <tr>
    <th>Values</th>
    <td data-value="$100"></td>
   </tr>

   <tr>
    <th>Initial value</th>
    <td data-value="$0"></td>
   </tr>

  </table>

 </body>

</html>
In summary, the ingredients of the solution are:
  • Encode the cell values as a custom data attribute, for example: data-value. The actual cells are empty.
  • Make the cell value show up using the :after pseudo-class of the td element. This is done by setting the CSS content property to a value of the type attr(). Values of this type take an attribute name between the parentheses, so in our example this becomes attr(data-value).
  • Use the attribute-value selector syntax for conditional formatting. In our example the requirement was to "hide" the value of cells with an amount equal to "$0". We can express this as td[data-value="$0"]. And since we display the value through the content property of the :after pseudo-class, we have to add :after to our td selector and specify a content property of "" to override the previous rule that outputs the value using attr().
The result looks something like this:

Values $100
Initial value


Browser compatibility

When I first tried to implement the idea I tested with latest chrome (41) and firefox (37) where it worked just fine. Much to my surprise and joy, it works without modification in IE8 as well! I'm so happy that I don't want to spoil it by testing other IE versions, but if anyone dares to try then I'd be grateful if you could post the result in the comments. Now personally, I'm not interested in hiding cell values. But this little trick does offer some possibilities for some basic conditional formatting.

Monetary amount formatting: red vs black

Consider a balance sheet of monetary amounts. Amounts should be right aligned, and we want the positive amounts to be displayed in black, and negative amounts in red:
<!doctype html>
<html>

 <head>

  <style type="text/css">
    
    /* right-align monetary amounts */
    td[data-monetary-amount] {
      text-align: right;
    }

    /* make the cells output their value */
    td[data-monetary-amount]:after {
      content: attr(data-monetary-amount);
    }

    /* make debit amounts show up in red */
    td[data-monetary-amount^="-"]:after {
      color: red;
    }
    
  </style>

 </head>

 <body>

  <table border="1">

   <tr>
    <th>Gain</th>
    <td data-monetary-amount="$100"></td>
   </tr>

   <tr>
    <th>Loss</th>
    <td data-monetary-amount="-$100"></td>
   </tr>

  </table>

 </body>

</html>
Note the data-monetary-amount^="-" syntax. This is a so-called substring matching attribute selector, which is specified in CSS 3. The comparison operator ^= tests whether the attribute value starts with a particular string, in this case the minus sign "-", which indicates we have a negative amount.

CSS 3 specifies similar comparison operators for a postfix match ($=) and an instring or "mid" match (*=).

The result looks something like this:

Gain $100
Loss -$100

Browser compatibility

As if I hadn't been blessed enough, this solution too works in IE8 (as well as Chrome and Firefox of course). Yay!
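As an aside, the other two substring-matching operators mentioned above work in exactly the same way. A few hypothetical rules to illustrate (the data-amount attribute here is made up for the sake of the example, not part of the solution above):

```css
/* prefix match (^=): amounts starting with a minus sign */
td[data-amount^="-"] { color: red; }

/* postfix match ($=): amounts ending in a percent sign */
td[data-amount$="%"] { font-style: italic; }

/* instring match (*=): amounts containing a decimal point */
td[data-amount*="."] { font-weight: bold; }
```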

A slightly less nice solution that works in CSS 2.1

You can achieve the same effect with CSS 2.1 if you encode the value in 2 data attributes, one for the sign and one for the actual absolute amount:
<!doctype html>
<html>

 <head>

  <style type="text/css">

    /* right-align monetary amounts */
    td[data-monetary-amount] {
      text-align: right;
    }

    /* make the value show up */
    td[data-monetary-amount]:after {
      content: attr(data-monetary-amount);
    }

    /* make negative amounts show up in red, prefixed by the sign */
    td[data-sign="-"]:after {
      color: red;
      content: attr(data-sign) attr(data-monetary-amount);
    }

  </style>

 </head>

 <body>

  <table border="1">

   <tr>
    <th>Debit</th>
    <td data-sign="+" data-monetary-amount="$100"></td>
   </tr>

   <tr>
    <th>Credit</th>
    <td data-sign="-" data-monetary-amount="$100"></td>
   </tr>

  </table>

 </body>

</html>
An interesting bit of this last example is that it shows you can compose the value of the content property out of multiple pieces of content, in this case two attr() values: attr(data-sign), to ensure that in case of negative values we display the minus sign, and attr(data-monetary-amount), to output the absolute value of the amount.

Locale dependent date formatting

The elements we saw in the previous example can be used for basic locale dependent date formatting. Let's keep it simple and format dates either in USA format, mon/d/yyyy, or in a format that is more easily understood outside the USA, d/mon/yyyy:
<!doctype html>
<html>

 <head>

  <style type="text/css">
      /* year comes last */
      time[datetime]:after {
        float: right;
        content: attr(datetime);
      }

      /* month and day come first */
      time[datetime*="-"] {
        float: left;
      }

      /* Months (non-USA) */
      time[datetime^="01-"]:after {
        content: "jan/";
      }

      ...rules for the other months go here...

      time[datetime^="12-"]:after {
        content: "dec/";
      }

      /* Days (non-USA) */
      time[datetime$="-01"]:before {
        content: "1/";
      }

      ...rules for the other days go here...

      time[datetime$="-31"]:before {
        content: "31/";
      }

      /* Months (USA) */
      *[lang="en-US"] time[datetime^="01-"]:before {
        content: "jan/";
      }

      ...rules for the other months go here...

      *[lang="en-US"] time[datetime^="12-"]:before {
        content: "dec/";
      }

      /* Days (USA) */
      *[lang="en-US"] time[datetime$="-01"]:after {
        content: "1/";
      }

      ...rules for the other days go here...

      *[lang="en-US"] time[datetime$="-31"]:after {
        content: "31/";
      }

  </style>

 </head>

 <body>

  <table border="1">

   <tr>
    <td lang="en-US">
     <time datetime="2015">
      <time datetime="04-08"></time>
     </time>
    </td>
   </tr>

   <tr>
    <td lang="en-GB">
     <time datetime="2015">
      <time datetime="04-08"></time>
     </time>
    </td>
   </tr>

  </table>

 </body>

</html>
This solution uses the HTML5 time-element. The time element can have a datetime attribute that contains the date/time in a machine readable format, and it may contain text content, which should be a human-readable representation of the date/time.

Now, personally, I do not think the HTML5 time element is an example of good or convenient design. At least, not from the perspective of the HTML5 author.

It is a great idea to require a machine-readable representation of the date. This potentially allows user agents to do useful things with the content. And allowing the user to manually specify the human-readable representation is also not a bad idea per se. But IMO, the time-element would have been much more useful if authors would be allowed to only specify the machine-readable representation of the date/time and, in absence of a manually entered human representation of the date/time, let the browser figure out how that date appears in the output in a human-readable representation. That way the browser could use information about the language of the document or document section to auto-format the date, or otherwise apply some user-preference. Another idea would be to allow the HTML author to control the output format using another attribute for a format string.

Anyway, this is not the case so we can try and see what we can do on our end. The solution above is as follows:
  • In the example above, a date is expressed using two time elements: one for the year part and one for the month and day parts of the date. The year part uses a non-negative integer for the datetime attribute, indicating a year. The month/day part uses a datetime attribute that represents a valid yearless date string. I nested the time element that represents the month and day part inside the one that represents the year. That said, it would have been much nicer if I could've just used a single time element with a single datetime attribute containing all date parts, but I couldn't figure out how to manipulate such a value with CSS. So I settled for a less optimal solution, which is certainly more verbose. At least, it does not duplicate any data, which seems a requirement that we should never let go of.
  • The first two CSS rules ensure that month and day appear first (using float: left) and the year appears last (using float: right). The first CSS rule makes the :after pseudo-element of any time element that has a datetime attribute float right and output the attribute value; since the more specific month rules further down override the content for the month/day elements, this rule effectively renders the year part. The second CSS rule uses the substring-matching attribute selector *= to check whether the datetime attribute contains a hyphen. Since the hyphen separates the day and month parts in the yearless date string format, this rule matches all time elements that represent the month/day part of a date.
  • The remaining rules are required for formatting the month and day parts, as well as the separator (which is a slash, /).
  • The prefix matching attribute selector ^= is used to test which month is identified by the prefix of the value of the datetime attribute. For each month, with prefixes 01 through 12, there is a rule, and its content property is used to output the month abbreviation like jan, feb, mar etc.
  • The postfix matching attribute selector $= is used to test which day is identified by the postfix of the value of the datetime attribute. For each day, with postfixes 01 through 31, there is a rule, and its content property is used to output the day number.
  • The upper set of rules matching the month prefix and day postfix are used to generate :after and :before pseudo-elements respectively, to ensure that by default, the day part is displayed before the month part.
  • To accommodate the USA date format, the bottom set of rules was added. These are essentially a duplication of the prefix- and postfix-matching rules for the month and day part respectively, but these rules have an initial selector part like *[lang="en-US"] to ensure that they are active only if the time element is contained in a section that was marked as being localized for the USA. For these rules, the month parts are used to generate :before pseudo-elements, and the day parts are used to generate :after pseudo-elements, thus reversing the default order of displaying the month and day parts.
The result looks something like this:

apr/8/2015
8/apr/2015

Browser compatibility

This solution again works fine in Chrome and Firefox, but does not render in IE8. Support for the time element was added in IE9, and the example works just fine there. Of course, if you really want it to work in IE8, you can: just don't use a time element but something generic such as span, and use a custom data- attribute for the datetime value, like data-datetime or similar.

Finally...

I hope this was useful information. Personally I think we still have a long way to go before we can use a pure CSS solution to solve all our data formatting problems, and I believe that, especially for web applications, programmatic solutions (either on the server or on the client side) are still essential to deliver an acceptable result.

That said, every little bit of functionality in CSS can help you build a solution, even if such a solution is still controlled or generated by a programmatic backend.

Any comments and feedback are greatly appreciated.

Wednesday, March 25, 2015

A Generic Normalizer for Pentaho Data integration - Revisited

A while ago, I wrote about how to create a generic normalizer for Pentaho Data integration.

To freshen up your memory, the generic normalizer takes any input stream, and for each input row, it outputs one row for each field in the input stream. The output rows contain fields for input row number, input field number and input field value. As such it provides the same functionality as the built-in Row Normaliser step without requiring any configuration, thus allowing it to process arbitrary streams of data.

A reusable normalizer

Recently I received a comment asking for more information on how to make this normalizer more reusable:
I want to use this method to do field level auditing but I want to encapsulate it in a sub transformation to which I can pass the result rows from any step. In your image of "how all these steps work together", instead of a data grid, the input would need to be dynamic in terms of the number of fields/columns and the data types. Could you possibly provide a hint how to make the input to these steps (in your example, the datagrid) dynamic?
In the meantime, I learned a thing or two about Kettle's internals, and it seemed like a good idea to describe how to improve on the original example and make it suitable to be used in a so-called Mapping, a.k.a. a sub-transformation.

Design for use as Subtransformation

The design for the re-usable generic normalizer is shown below:
The User-defined Java class step in the middle actually implements the normalizer. The Mapping input and Mapping output specification steps allow the normalizer to be called from another transformation. They enable it to respectively receive input data from, and return output data to the calling transformation.

In the screenshot above, the configuration dialogs for both the Mapping input and output specification steps are shown. This is mainly to show that there is no configuration involved: the Mapping input specification step will faithfully pass all fields received from the incoming stream on to the normalizer, and the Mapping output specification will output all fields coming out of the normalizer to the outgoing stream.

Normalizer Improvements

The configuration of the user-defined Java class step differs in a number of aspects from what I used in the original normalizer example. In the original example the normalizer output consisted of three fields:
rownum
A sequential integer number identifying the position of the row to which the current output row applies.
fieldnum
A sequential integer number identifying the position of the field to which the current output row applies.
value
A string representation of the value to which the output applies
The original example used a Metadata Structure of Stream step to obtain metadata of the input stream, and this metadata was then tied to the output of the normalizer using a Stream Lookup step, "joining" the output of the Metadata Structure step with the output of the normalizer using the field number.
The improved generic normalizer adds two more output fields:
fieldname
The name of the field as it appears in the input stream
fieldtype
The name of the data type of the field as it appears in the input stream
Arguably, these two items are the most important pieces of metadata that were previously provided by the Metadata Structure of Stream step in the original example, and I felt many people would probably prefer to have all of that work done by the normalizer itself, rather than having to tie all the pieces together in the transformation using additional steps.

Code

The code for the user-defined Java class is shown below:
// row counter, and input metadata cached when the first row is processed
long rownum = 0;
RowMetaInterface inputRowMeta;
long numFields;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  // get the current row
  Object[] r = getRow();

  // If the row object is null, we are done processing.
  if (r == null) {
    setOutputDone();
    return false;
  }

  // If this is the first row, cache some metadata.
  // We will reuse this metadata in processing the rest of the rows.
  if (first) {
    first = false;
    inputRowMeta = getInputRowMeta();
    numFields = inputRowMeta.size();
  }
    
  // Generate a new id number for the current row.
  rownum += 1;

  // Generate one output row for each field in the input stream.
  int fieldnum;
  ValueMetaInterface valueMetaInterface;
  for (fieldnum = 0; fieldnum < numFields; fieldnum++) {
    //get metadata for the current field
    valueMetaInterface = inputRowMeta.getValueMeta(fieldnum);
    Object[] outputRow = new Object[5];
    outputRow[0] = rownum;
    // Assign the field id. Note that we need to cast to long to match Kettle's type system.
    outputRow[1] = (long)fieldnum+1;
    //assign the data type name
    outputRow[2] = valueMetaInterface.getTypeDesc();
    //assign the field name
    outputRow[3] = valueMetaInterface.getName();
    //assign a string representation of the field value
    outputRow[4] = inputRowMeta.getString(r, fieldnum);
    //emit a row.
    putRow(data.outputRowMeta, outputRow);
  }
  
  return true;
}
The main difference with the original code is the addition of the two new output fields, fieldname and fieldtype. In order to obtain the values for these fields, the loop over the fields first obtains the ValueMetaInterface object for the current field. This is done by calling the getValueMeta() method of the RowMetaInterface object and passing the index of the desired field.
Using the ValueMetaInterface object, the field name is obtained using its getName() method. The data type name is obtained by calling its getTypeDesc() method.

Calling the normalizer as subtransformation

Using the improved normalizer is as simple as adding a Mapping-step to your transformation and pointing it to the transformation that contains the normalizer and Mapping input and output specifications:

Download samples

The transformations discussed in this post are available here. These transformations are in the public domain: you can use, copy, redistribute and modify them as you see fit. You are encouraged but not obliged to share any modifications that you make to these examples.

Saturday, October 18, 2014

Performing administrative tasks on Pentaho 5.x Business Analytics Server using RESTful webservices and PHP/cURL

Yesterday, I noticed a discussion in the Pentaho Business Analytics group on linkedin: Using RESTful services to add users, add roles, add solutions, add datasources. In this discussion, Capital Markets Analyst/Consultant Rob Tholemeier writes:
We built some code in PHP that performs most if the 3.x admin console functions. Now with 5.x there appears to be RESTful services to do the same. Does anyone have code examples they are willing to share that uses RESTful services to add users, add roles, add solutions, add datasources? Change the same, assign roles, deletes the same?
I think it is an interesting question. I haven't seen many people integrating Pentaho in their PHP web applications so I decided to do a quick write up to demonstrate that this is not only possible but actually quite easy.

For this write up, I used the following software: Because everything is more fun with pictures, here's a high level overview of this setup:

Pentaho 5.x RESTful Webservices

Pentaho 5.x featured major refactoring to modernize its webservices to be more RESTful. Here's an overview of all the services.

All these webservices reside under the /api path beneath the path of the Pentaho web application, which is by default in the /pentaho path at the root of the server. Each service has a distinct path beneath the /api path. So assuming the pentaho server is running on the localhost (at the default port of 8080), you can access all calls offered by a particular service beneath http://localhost:8080/pentaho/api/service-specific-path.

The Pentaho 5.x webservices are, to some extent, self-documenting. You can get an overview of the available calls for a specific service by doing an HTTP OPTIONS request to the root path of the service. For example, an OPTIONS request to http://localhost:8080/pentaho/api/session might return a document like this to describe the service:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<application xmlns="http://wadl.dev.java.net/2009/02">
  <doc xmlns:jersey="http://jersey.java.net/" jersey:generatedBy="Jersey: 1.16 11/28/2012 03:18 PM"/>
  <grammars>
    <include href="http://localhost:8080/pentaho/api/application.wadl/xsd0.xsd">
      <doc title="Generated" xml:lang="en"/>
    </include>
  </grammars>
  <resources base="http://localhost:8080/pentaho/api/">
    <resource path="session">
      <resource path="/setredirect">
        <method id="setredirect" name="GET">
          <response>
            <representation mediaType="text/plain"/>
          </response>
        </method>
      </resource>
      <resource path="/userWorkspaceDir">
        <method id="doGetCurrentUserDir" name="GET">
          <response>
            <representation mediaType="text/plain"/>
          </response>
        </method>
      </resource>
      <resource path="/workspaceDirForUser/{user}">
        <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="user" style="template" type="xs:string"/>
        <method id="doGetUserDir" name="GET">
          <response>
            <representation mediaType="text/plain"/>
          </response>
        </method>
      </resource>
    </resource>
  </resources>
</application>


The documentation provides a categorized overview of the service calls, as well as the necessary technical information, such as:
The HTTP method
In general, with very few exceptions, Pentaho's web services use either GET (to retrieve information) or PUT (to modify or change data). There are a few calls that use POST (for example, the publish service). As far as I know, DELETE is not used.
Parameters
If a service accepts parameters, they are either passed through the query string or as a document in the request message body. This is colloquially known as the POST data, regardless of whether the POST method is actually used.

As per the design of the HTTP protocol, GET calls only accept parameters via the query string. The PUT calls to the Pentaho webservices sometimes accept parameters via the query string. This happens only if the information passed in the parameter has a relatively simple structure - think of a single key/value pair, or maybe a list of values. More complicated data is typically conveyed via the message body.
Data type information
In general Pentaho's web services support both XML and JSON. You can control the data format of the response by specifying the Accept request header. To specify the format of the request message body, you should use the Content-Type header in the request.
Unfortunately the documentation does not have any human readable descriptive information, so sometimes a little inspection and experimentation is required to figure out exactly how things work.

For this write-up, I decided to focus on the administrative interface around users, roles and privileges.

NOTE: The link above only lists the web services that are built into the Pentaho Business Analytics Platform. The platform can, and usually does, have multiple plugins that offer extra functionality, and some of these plugins ship by default with the server, so that one might consider those also as builtins. Each plugin can, and often does, offer its own service calls, but these are not listed in the documentation referred to above. A discussion of these is also out of scope for this write-up, but I will try and get back to this topic in a future blog post. If you don't feel like waiting that long, you can try and figure out the webservice calls offered by a particular plugin by doing an OPTIONS request to the root path corresponding to that plugin. The root path of a particular plugin is http://localhost:8080/pentaho/plugin/plugin-id/api. You can obtain a list of all installed plugins by doing a GET request to the ids service of the PluginManagerResource services.

Suggested Tools

I found the following tools to be very useful in inspecting and experimenting with the Pentaho webservices api.
  • Administration perspective in the Pentaho user console. This web application ships with Pentaho and is normally used by Pentaho administrators to manage users, roles and privileges. Basically, we want to offer the functionality provided by this web application, but then with PHP.
  • The network tab in Google Chrome's developer tools. I used the network tab to monitor calls from Pentaho's administration perspective to the services provided by the Pentaho server.
  • Postman REST client. This is a handy extension from the Chrome Web Store. It allows you to do all kinds of REST calls directly from within your Chrome browser. This was useful to test any assumptions on the details of how to exactly construct requests to the Pentaho web services.

The UserRoleDaoResource services

All functionality to work with users, roles and privileges is bundled in the UserRoleDaoResource service. The services specific path of this service is /userroledao. So, all calls that belong to the UserRoleDaoResource service can be accessed through the path http://localhost:8080/pentaho/api/userroledao.

The following categorized overview illustrates the different kinds of calls that belong to the UserRoleDaoResource service:
Users
Named accounts. An account is typically associated with a particular person that needs to work with the Pentaho server. The following calls are specific to working with users:
GET users
Retrieve the list of existing Pentaho user accounts.
PUT createUser
Create a new Pentaho user account.
PUT updatePassword
Modify the password of the specified user.
PUT deleteUsers
Remove one or more Pentaho user accounts.
Roles
Roles are basically a package of privileges (system roles), which can be assigned to one or more users. Any given user can be assigned multiple roles.
GET roles
Retrieve the list of existing roles.
PUT createRole
Create a new role. After the role is created, privileges can be assigned to it, and the role can then be assigned to one or more users, effectively granting the associated privileges to those users.
PUT deleteRoles
Remove one or more roles.
Assignments
Users can be assigned multiple roles, and many users can be assigned a particular role. The following calls can be used to create or remove these associations:
PUT assignAllRolesToUser
Assign all available roles to the specified user.
PUT assignAllUsersToRole
Assign the specified role to all available users.
PUT assignRoleToUser
Assign a specific role to a particular user.
PUT assignUserToRole
Assign a specific role to a particular user.
PUT removeAllRolesFromUser
Unassign whatever roles were assigned to a specific user.
PUT removeAllUsersFromRole
Take the specified role away from all users that were assigned the role.
PUT removeRoleFromUser
Unassign a specific role from a particular user.
PUT removeUserFromRole
Unassign a specific role from a particular user.
System Roles (also called Logical Roles)
These are essentially privileges: the ability to perform a particular action. Examples of such actions are Read BI content, Publish BI content, Schedule a job etc. In Pentaho, system roles cannot be assigned directly to users; instead, they have to be assigned to a role. Roles can then be associated to users to effectively grant them the privileges associated with the role.
GET logicalRoleMap
This returns a document containing two separate bits of information: the list of available system roles, as well as the association between regular roles and system roles.
PUT roleAssignments
Specify which system roles are associated with a particular regular role. Note that there is no separate call to add or remove individual associations between a role and a system role: rather, an entire set of system roles is assigned to a role at once, replacing whatever set was previously assigned to that role.
Remember, you can always obtain the entire set of calls available for the UserRoleDaoResource service for your server by doing an OPTIONS request at the root of the /pentaho/api/userroledao path.

Webservice calls in PHP with cURL

Typically, calling out to HTTP (or FTP) servers from within PHP is done using the cURL library. A full discussion of cURL in PHP is well out of scope; you can refer to the - excellent - official PHP documentation instead. I will only discuss the basic pattern and only in as far as it applies to calling the Pentaho webservices.

Basic cURL calling sequence

The basic cURL calling sequence may be summarized as follows:
  1. Obtain a cURL handle by calling curl_init(). You should save the handle to a variable so you can use it in subsequent calls to the cURL library.
  2. Configure the cURL request by doing various calls to curl_setopt($handle, $option, $value). Each curl_setopt call basically sets a property ("option") on the cURL handle that is passed as first argument to curl_setopt(). The library defines a large number of property keys to control the various aspects of the HTTP request, such as the HTTP method, the request headers, message body etcetera.
  3. Call curl_exec() to send the request. This function will also return the response if a prior call to curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE) was made.
  4. Optionally, information about the response can be obtained by calling curl_getinfo()
  5. Finally, curl_close($handle) should be called to clean up the cURL handle and free any underlying resources used by the library.

Basic GET request to Pentaho with PHP/cURL

The following snippet shows how to do a GET request to Pentaho using PHP/cURL:
<?php
//obtain a cURL handle
$c = curl_init();

//specify the url and the HTTP method
curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/users');
curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'GET');

//supply credentials to authenticate against pentaho
curl_setopt($c, CURLOPT_USERPWD, 'admin:password');

//tell cURL to return the response as a string
curl_setopt($c, CURLOPT_RETURNTRANSFER, TRUE);

//obtain the response
$response = curl_exec($c);

//get the HTTP status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);

//clean up the cURL handle
curl_close($c);
?>
As you can see the snippet follows the general cURL calling sequence. The options CURLOPT_URL and CURLOPT_CUSTOMREQUEST are used to specify the url and the HTTP method respectively, and CURLOPT_RETURNTRANSFER is set to TRUE to obtain the response as a string result when calling curl_exec.

The CURLOPT_USERPWD option is used to specify the credentials for basic HTTP authentication. The value is a string consisting of the username and password, separated by a colon, and the example uses the default built-in administrator's account called admin with the password password.

Note: The web service requests described in this blog post require authentication with the admin account, or at least an account that is privileged to perform administrative actions. Other webservices may work when authenticated with less privileged accounts.

No specific request headers were set in this example. Because there is no specific Accept header to specify a format for the response, the default format will be used, which happens to be XML.

After executing this snippet, the variable $response will have a string value equivalent to the following document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<userList>
    <users>suzy</users>
    <users>pat</users>
    <users>tiffany</users>
    <users>admin</users>
</userList>
We could have added an extra call to curl_setopt() to explicitly set the Accept header to specify that we want to receive the response in the JSON format:
<?php
//obtain a cURL handle
$c = curl_init();

...other calls to curl_setopt...

curl_setopt($c, CURLOPT_HTTPHEADER, array(
  'Accept: application/json'
));

$response = curl_exec($c);

curl_close($c); 
?>
Note that all request headers must be passed as an array of strings using a single call to curl_setopt($handle, CURLOPT_HTTPHEADER, $array_of_headers). Each element of the array should be a single string, consisting of the header name, followed by a colon, followed by the header value.

After executing this snippet, the variable $response will contain a string equivalent to the following JSON document:
{
    "users": [
        "suzy",
        "pat",
        "tiffany",
        "admin"
    ]
}
While the format of the response defaults to XML, it is generally a good idea to always specify it explicitly. In order to explicitly request XML, change the value of the Accept header to application/xml.
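PUT calls that carry a message body follow the same pattern; the main additions are a Content-Type header and the CURLOPT_POSTFIELDS option to supply the body. The sketch below creates a user via the /userroledao/createUser service. The account name jdoe and password secret are made-up values, and the exact structure of the <user> document may differ between Pentaho versions, so verify it with the OPTIONS/WADL technique described earlier:

```php
<?php
//obtain a cURL handle
$c = curl_init();

//specify the url and the HTTP method
curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/createUser');
curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'PUT');

//supply credentials to authenticate against pentaho
curl_setopt($c, CURLOPT_USERPWD, 'admin:password');

//pass the new account as an XML document in the message body,
//and declare its format with the Content-Type header
$body = '<user><userName>jdoe</userName><password>secret</password></user>';
curl_setopt($c, CURLOPT_POSTFIELDS, $body);
curl_setopt($c, CURLOPT_HTTPHEADER, array(
  'Content-Type: application/xml'
));

//tell cURL to return the response as a string
curl_setopt($c, CURLOPT_RETURNTRANSFER, TRUE);

//send the request and obtain the response
$response = curl_exec($c);

//check the HTTP status code to see whether the call succeeded
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);

//clean up the cURL handle
curl_close($c);
?>
```

Deleting users works analogously with a PUT request to the /userroledao/deleteUsers service, passing the accounts to remove in the message body.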

Processing the response

PHP has support for both XML and JSON. In this write-up I'll only use XML, but it is good to realize that it would have worked just as well with JSON. Whereas PHP offers exactly one library for working with JSON, there are many options for processing XML. (I'm inclined to say, way too many.)

Fortunately, for this particular task, the XML documents are always simple and never very large, and I have had good results working with the SimpleXML library. I believe this is included and enabled by default, which makes it a safe choice. Another reason why I like SimpleXML is that it offers exceptionally convenient access to the data in the XML document using property access operators and iterators.

It would be outside the scope of this write-up to discuss SimpleXML in detail but the following snippet may illustrate how easy it is to process an XML document like the <userList> response obtained from the GET to the /userroledao/users API call described above:
<?php
  $userlist = ...response from /userroledao/users...

  //parse xml string
  $doc = simplexml_load_string($userlist);

  //iterate elements
  foreach ($doc as $user) {
    //do something with the <user> element.
    //even though $user is an object we can easily extract its value by treating it as a string
    echo('<div>'.$user.'</div>');
  }  
?>
As you can see, it doesn't ever get much simpler than this: one call to simplexml_load_string to parse the xml document, and we can directly traverse the elements using foreach. Plus accessing the text content of the elements is also very easy: no need for a separate call to extract the text, just treat the element as a string. Note that if you do need more advanced ways to traverse the structure of the document and access the data, the SimpleXML library still goes a long way. You can even use XPath expressions, if you really need that.

Putting it together: a simple Pentaho Admin application in PHP

Using the techniques described above, and an arguably minimal amount of client-side javascript, I put together a simple, yet fully functional administrative application for managing Pentaho roles, users and privileges in PHP. It is a single, self-contained script (PHP, HTML template, and javascript - no CSS) of just over 600 lines, including whitespace and comments. Here's what the application looks like:



User management features

The left-hand side of the page is dedicated to user management. From top to bottom we have:
User form
In the left top we have a form with fields for "Username" and "Password", and a "Create User" button. Hitting the button results in a PUT request to /userroledao/createUser to create a new user using the values in the "Username" and "Password" fields.
Existing Users list
Halfway down the page, below the user form, there's a list showing the existing users. This is populated with the data from the response of a GET request to /userroledao/users.
Delete selected users button
Hitting this button fires a javascript function that collects the selection from the existing user list. This is used to do a PUT request to the /userroledao/deleteUsers service in order to delete those user accounts.
User roles list
When a single user is selected in the existing users list a GET request is made to the /userroledao/roles service to create a list of all available roles. Another GET request is made to the /userroledao/userRoles service and the data from the response is used to set the state of the checkboxes in front of the role names, indicating which roles are assigned to the user. If such a checkbox is checked by user interaction, a PUT request is made to the /userroledao/assignRoleToUser service, which will assign the corresponding role to the currently selected user. If the checkbox gets unchecked through user interaction, a PUT request is made to the /userroledao/removeRoleFromUser service, which will unassign the corresponding role from the currently selected user.

Role management features

The right-hand side of the page is dedicated to role management. From top to bottom we have:
Role form
In the right top we have a form with a "Rolename" field and a "Create Role" button. Hitting the button results in a PUT request to /userroledao/createRole to create a new role with the specified role name.
Existing Roles list
Halfway down the page, below the role form, there's a list showing the existing roles. This is populated with the data from the response of a GET request to /userroledao/roles.
Delete selected roles button
Hitting this button fires a javascript function that collects the selection from the existing roles list. This is used to do a PUT request to the /userroledao/deleteRoles service in order to delete those roles.
Role members list
When a single role is selected in the existing roles list a GET request is made to the /userroledao/users service to create a list of all available users. Another GET request is made to the /userroledao/userRoles service and the data from the response is used to check the appropriate checkboxes in front of the user names to indicate which users have the current role assigned. If such a checkbox is checked through user interaction, a PUT request is made to the /userroledao/assignUserToRole service to assign the currently selected role to the checked user. If such a checkbox gets unchecked due to user interaction, a PUT request is made to the /userroledao/removeUserFromRole service to unassign the currently selected role from the unchecked user.
Privileges (logical roles) list
If a single role is selected in the existing roles list, a GET request is done to the /userroledao/logicalRoleMap service. The data from the response is used to create a list of all available privileges. From the same response, the list of logical roles assigned to the role selected in the existing role list is used to check the checkboxes in front of the logical role names in order to indicate which logical role names are assigned to the currently selected role. When such a checkbox is checked, or unchecked, a PUT request is done to the /userroledao/roleAssignments service to associate the appropriate set of logical roles with the currently selected role.

With a few (arguably non-essential) exceptions, this application covers all services of the UserRoleDaoResource.

Implementation details

For reference I will now discuss the implementation details of this application.

User form

The user form can be used to create new users. Here's its corresponding HTML code:
<form method="POST">
  <table>
    <tr>
      <td>Username:</td>
      <td><input type="text" name="user" /></td>
    </tr>
    <tr>
      <td>Password:</td>
      <td><input type="password" name="password" /></td>
    </tr>
    <tr>
      <td colspan="2">
        <input type="submit" name="action" value="Create User"/>
      </td>
    </tr>
  </table>
</form>
Hitting the "Create User" button submits the form. But since the form element does not specify an action url, it will simply refresh the page, setting the form fields as POST data. In the top of the PHP script, this is handled with the following PHP code:
if (isset($_POST['action'])) {
  $action = strtolower($_POST['action']);
}
else {
  $action = NULL;
}
switch ($action) {
  case 'create user':
    $status = create_user($_POST['user'], $_POST['password']);
    break;
  case '...':
    ...
    break;
  
  ... many more case branches ...
}
In fact, all of the actions that the user can initiate result in a POST request that refreshes the page, setting a specific value for the action field to select the appropriate backend action. In case of the user form, this results in a call to the PHP function create_user(), passing the values of the POST data fields user and password, which originate from the HTML user form.

The PHP code of the create_user() function is shown below:
//create a user with specified name and password.
function create_user($user, $password){
  $c = curl_init();

  curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'PUT');
  curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/createUser');

  curl_setopt($c, CURLOPT_USERPWD, 'admin:password');

  curl_setopt($c, CURLOPT_POSTFIELDS, 
    '<user>'.
      '<userName>'.$user.'</userName>'.
      '<password>'.$password.'</password>'.
    '</user>'
  );
  curl_setopt($c, CURLOPT_HTTPHEADER, array(
    'Content-Type: application/xml'
  ));

  curl_exec($c);
  $status = curl_getinfo($c, CURLINFO_HTTP_CODE);

  curl_close($c);
  return $status;
}
The create_user function follows the basic pattern of the cURL calling sequence. The difference with the preceding generic example is that this is a PUT request, and thus the value for the CURLOPT_CUSTOMREQUEST option is 'PUT' rather than 'GET'.

The /userroledao/createUser service is specified to take a <user>-element in the request message body, which is used to convey the user name and password. This element is constructed as an XML document and passed to the message body using a curl_setopt call using the CURLOPT_POSTFIELDS option.

Because we are passing a message body in the request, we also need to set the Content-Type header to application/xml to specify that the data passed in the message body is an XML document.

Finally, after the call to curl_exec, we use a call to curl_getinfo() using the CURLINFO_HTTP_CODE constant to obtain the HTTP status of the request. This should be 200 if the PUT request succeeds. If there is some problem with the request we should receive a code in the 400 range (if the request itself has some problem) or the 500 range (in case the server is experiencing some problem that is not related to this particular request). For example, if the user already exists, one gets a 403 (Forbidden) status instead of 200.

Note: The Pentaho REST services do not seem to have a robust way to convey the exact nature of the problem in case the request could not be met. At least, I have not noticed any useful information being conveyed in the response except for the HTTP status code. I checked the Administration perspective in the Pentaho user console to see what would happen in case an existing user is entered, and there the action just silently fails. It would be nice to get a recommendation about how to deal with error situations when using the Pentaho web service calls.
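To give the end user at least some feedback, the application could translate the status code returned by functions like create_user() into a generic message. A minimal sketch (the describe_status() helper and its messages are my own invention, not part of the Pentaho API):
```php
//translate an HTTP status code into a rough, human-readable diagnosis.
//the exact wording here is illustrative only.
function describe_status($status) {
  if ($status === 200) {
    return 'OK';
  }
  else if ($status >= 400 && $status < 500) {
    //4xx: something is wrong with the request itself,
    //e.g. 403 (Forbidden) when creating a user that already exists
    return 'Request failed (HTTP '.$status.')';
  }
  else if ($status >= 500) {
    //5xx: the server has a problem of its own
    return 'Server error (HTTP '.$status.')';
  }
  return 'Unexpected status (HTTP '.$status.')';
}
```
This could be used like echo(describe_status(create_user($user, $password)));.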

The Existing Users list

The existing users list is one of the simplest items in the interface. The HTML / php code is shown below:
<select multiple="true" id="Users" onchange="userSelectionChanged(this)">
<?php
  $users = get_users();
  foreach ($users as $user) {
?>
   <option><?php echo($user); ?></option>
<?php
  }
?>
</select>
The get_users() function is a simple GET request to /userroledao/users, followed by a parse of the XML response. Since both these aspects have been discussed already I will not repeat that code here.
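Still, for readers who landed here directly, this is roughly what get_users() could look like, following the same cURL calling sequence as create_user(). The parse_name_list() helper name is mine, and note the CURLOPT_RETURNTRANSFER option, which makes curl_exec return the response body as a string:
```php
//extract the text content of each child element into a plain array of names
function parse_name_list($xml) {
  $names = array();
  $doc = simplexml_load_string($xml);
  foreach ($doc as $element) {
    array_push($names, (string)$element);
  }
  return $names;
}

//GET the list of existing users from the /userroledao/users service
function get_users() {
  $c = curl_init();

  curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'GET');
  curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/users');
  curl_setopt($c, CURLOPT_USERPWD, 'admin:password');
  //have curl_exec return the response body as a string
  curl_setopt($c, CURLOPT_RETURNTRANSFER, true);

  $userlist = curl_exec($c);
  curl_close($c);

  return parse_name_list($userlist);
}
```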

The <select> element has an onchange event handler which calls the javascript function userSelectionChanged(). Whenever the selection of the user list changes, that function will be called. This function will then determine if a single user is selected, and if that is the case, it will refresh the user role list by explicitly reloading the userRolesFrame:
function userSelectionChanged(list){

  //get the list selection as an array of user names 
  var selection = [];
  for (var i = 0, options = list.options, n = options.length; i < n; i++) {
    if (options[i].selected) {
      selection.push(options[i].value);
    }
  }

  //get the user roles frame
  var frame = document.getElementById("userRolesFrame");

  if (selection.length === 1) {
    //if there's exactly one selected user, then load its assigned roles in the frame
    frame.src = "?view=userRoles&user=" + selection[0];
  }
  else {
    //blank the frame
    frame.src = "about:blank";
  }
}

The user roles list

We just discussed how selecting a single item in the existing user list refreshes the user role list by loading the userRolesFrame. The frame will load the current php script, passing values for the view and user parameters via the query string. The PHP script handles this by checking for the value of the view parameter in the query string. If no view parameter is present, the default interface will load as shown in the screenshot. But specifying a value of userRoles for view will render only a list of roles, checking the roles that are assigned to the user specified by the user parameter, which is also passed via the query string:
<?php
  //see if a specific view was requested
  if (isset($_GET['view'])) {
    $view = $_GET['view'];
  }
  else {
    $view = NULL;
  }

  //select and render the requested view 
  switch ($view) {

    //render the user roles view
    case 'userRoles':

      //get the current user
      $user = $_GET['user'];

      //get the current user's assigned roles
      $roles = get_user_roles($user);

      //store the user's roles as rolenames in an array
      $assigned_roles = array();
      foreach ($roles as $role) {
        array_push($assigned_roles, ''.$role);
      }

      //get all roles
      $roles = get_roles();
      
      //render all roles as a list of divs with a checkbox
      foreach ($roles as $role) {

        //if the current role appears in the array of assigned roles, check the checkbox.
        $checked = in_array(''.$role, $assigned_roles);
?>
        <div>
          <input
            onchange="changeUserRoleAssignment(this)"
            name="<?php echo($role) ?>"
            type="checkbox"
            <?php echo ($checked ? 'checked="true"' : '')?>
          />
          <?php echo($role) ?>
        </div>
<?php
      }
      break;
    case '...':

      ...code to handle other views here...

    default:

      ...code for the regular interface (no specific view) goes here...
  }
?>
First, get_user_roles($user) is called to GET a response from the /userroledao/userRoles service, which is a list of roles for the specified user. From the php side of things, nothing new is really happening here. The only difference with regard to getting the list of users is the url, which is now /userroledao/userRoles rather than /userroledao/users and which includes a querystring parameter to specify the user:
  curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/userRoles?userName='.$user);
The get_user_roles($user) call returns an XML document containing <role>-elements representing the roles assigned to the specified user. We use the foreach loop to iterate them and we store their string values (i.e., the actual role names) in the array $assigned_roles.

The remainder of the code is very similar to how the existing user list was rendered, except that we now use a call to get_roles() rather than get_users(). This does a GET request to /userroledao/roles and returns an XML document containing all available roles. We then iterate through that list to create an input-element of type checkbox along with the actual role name. The checkbox is checked according to whether the current role name is found in the previously populated $assigned_roles array.

Each checkbox is given an onchange handler which is implemented by the changeUserRoleAssignment() javascript function. This function sets a few variables in a form to indicate whether the corresponding role is to be assigned or unassigned, and then submits the form. The code for the form and the function are shown below:
<form 
  name="userRoleAssignment" method="POST" 
  action="?view=<?php echo($view)?>&user=<?php echo(urlencode($user))?>"
>
  <input type="hidden" name="action"/>
  <input type="hidden" name="role"/>
  <input type="hidden" name="user" value="<?php echo($user)?>"/>
</form>

<script type="text/javascript">
  function changeUserRoleAssignment(checkbox) {
    var form = document.forms["userRoleAssignment"];
    form.elements["action"].value = checkbox.checked ? "assign role to user" : "unassign role from user";
    form.elements["role"].value = checkbox.name;
    form.submit();
  }
</script>
The changeUserRoleAssignment() function writes its associated role name (stored in its name property) in the role field of the form, and it uses its checked state to set the value of the action field to assign role to user or unassign role from user. It then submits the form.

Since all of this code appears in the user role view, it has the effect of refreshing only the frame wherein the view is contained. Because the form sets the action value, it triggers a PHP backend action before rendering the view (just like we saw in the implementation of the create user action):
switch ($action) {
  case 'create user':
    $status = create_user($_POST['user'], $_POST['password']);
    break;
  case 'assign role to user':
    assign_role_to_user($_POST['role'], $_POST['user']);
    break;
  case 'unassign role from user':
    unassign_role_from_user($_POST['role'], $_POST['user']);
    break;
  
  ... many more case branches ...
}
The PHP functions assign_role_to_user() and unassign_role_from_user() both perform a straightforward PUT request to the /userroledao/assignRoleToUser and /userroledao/removeRoleFromUser services respectively. For each of these requests, the values of the user and role fields are passed to the service via the query string parameters userName and roleNames respectively.
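For reference, here is a sketch of what assign_role_to_user() could look like; unassign_role_from_user() would be identical except for the service name. The role_assignment_url() helper is my own addition, and the host and credentials are assumed to match the earlier examples:
```php
//build the url for a role (un)assignment service call;
//$roles may be an array of role names, which are joined with a tab character
function role_assignment_url($service, $user, $roles) {
  if (is_array($roles)) {
    $roles = implode("\t", $roles);
  }
  return 'http://localhost:8080/pentaho/api/userroledao/'.$service.
         '?userName='.urlencode($user).
         '&roleNames='.urlencode($roles);
}

//assign a role to a user via a PUT request to /userroledao/assignRoleToUser
function assign_role_to_user($role, $user) {
  $c = curl_init();

  curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'PUT');
  curl_setopt($c, CURLOPT_URL, role_assignment_url('assignRoleToUser', $user, $role));
  curl_setopt($c, CURLOPT_USERPWD, 'admin:password');

  curl_exec($c);
  $status = curl_getinfo($c, CURLINFO_HTTP_CODE);

  curl_close($c);
  return $status;
}
```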

Note that these two services support multiple role names; however, only one is passed at any time by our application. Should you wish to pass multiple role names, you should separate the role names with a tab character (ascii character 0x09). Note that since the names are passed in the query string, they must be url-encoded.
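For example, to pass the two hypothetical roles "Power User" and "Report Author" in a single roleNames parameter, the value could be built like this:
```php
//join multiple role names with a tab character, then url-encode the result
//(the role names are made up for this example)
$roles = array('Power User', 'Report Author');
$roleNames = urlencode(implode("\t", $roles));
//yields: Power+User%09Report+Author
```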

Finally

Although I haven't covered all implementation details, the rest is simply more of the same stuff. If you want to play with the code yourself, you can download the entire PHP script here.