Monday, August 31, 2015

MDX: "Show Parents" - Redux: A generic and systematic MDX query transformation to obtain the lineage of each member

A couple of months ago I wrote about how you can use the MDX functions Ancestors() and Ascendants to retrieve the full lineage of members. (See: "MDX: retrieving the entire hierarchy path with Ancestors()".)

As you might recall, the immediate reason to write about those functions was to find a pure MDX solution to implement the "Show Parents" / "Hide Parents" functionality offered by OLAP cube browsers. To recap, developers of MDX-based pivot tables face a challenge when rendering the result of a query like this:
SELECT    CrossJoin(
            [Product].[Product].Members,
            [Measures].[Sales]
          ) 
ON COLUMNS,
          [Time].[Months].Members
ON ROWS
FROM      [SteelWheelsSales]
In plain English: Sales of products across months.

The raw result might look something like this:
  

1968 Ford Mustang 1958 Chevy Corvette Limited Edition ...more Classic Cars... 1997 BMW R 1100 S 2002 Yamaha YZR M1 ...more products...
Time Sales Sales Sales Sales Sales Sales
Jan 1,742
Feb 2,846
...more months... ... ... ... ... ... ...
Jan 7,499 ... 2,254 3,222 ...
Feb 4,518 847 ... 2,921 3,865 ...
...more months... ... ... ... ... ... ...
The challenge is that if this result would be presented to an end-user, they might find it hard to interpret. For example, we see two rows labeled Jan, and two rows labeled Feb. Since we asked for time in months, these labels probably indicate the months January and February. But which "Jan" or "Feb" exactly? Since each year cycles through all the months we need to know the year as well.

The column labels are also challenging to interpret. What kind of thing is a "1968 Ford Mustang", and what kind of thing is a "1997 BMW R 1100 S"? A domain expert might know the former is a Classic car and the latter a motorcycle, but the labels themselves do not make this clear.

We could of course change the query and add members for the [Time].[Years] level and the [Product].[Line] level and wrap the sets in the Hierarchize() function:
SELECT    Hierarchize(
            CrossJoin(
              {[Product].[Line].Members
              ,[Product].[Product].Members},
              [Measures].[Sales]
            )
          ) 
ON COLUMNS,
          Hierarchize(
            {[Time].[Years].Members
            ,[Time].[Months].Members}
          )
ON ROWS
FROM      [SteelWheelsSales]
This would at least add extra rows and columns respectively, which would appear as a "break" value announcing a list of subsequent items at the lower level: (Please Note: I'm referring to [Time].[Years] level as a "higher" level than [Time].[Months], whereas the ordinal number of level, also known as the "level number" is actually lower.):
  

Classic Cars 1968 Ford Mustang 1958 Chevy Corvette Limited Edition ... Motorcycles 1997 BMW R 1100 S 2002 Yamaha YZR M1 ...
Time Sales Sales Sales ... Sales Sales Sales ...
2003 1,514,407 62,140 11,553 ... 397,220 37,016 27,730 ...
Jan 41,192 1,742 ...


...
Feb 20,464

... 25,784 2,846
...
... ... ... ... ... ... ... ... ...
2004 1,838,275 67,155 20,145 ... 590,580 42,138 42,483 ...
Jan 122,792 7,499 ... 41,201 2,254 3,222 ...
Feb 137,641 4,518 847 ... 49,067 2,921 3,865 ...
... ... ... ... ... ... ... ... ...
While that approach may be the right one in many cases, there may be situations where this solution is not so ideal. For instance, this solution doesn't just add the label for the member at the higher level, it also adds cells for the measure value belonging to those levels. The values for these cells need to be calculated, which is a waste if we're not really interested in the value at the higher level.

Simply adding members at higher levels could even add more confusion if our query was designed so as to not select all members at the [Time].[Months] and [Product].[Product] levels, but only a selection of particular members. In that case, the values presented for the higher levels will still report the aggregate measure for all children of that member, and not those that happen to be selected in the current query. In this case the numbers may appear not to add up to the total at a higher level.

So, we'd really like an altogether different solution that simply allows us to see the extra labels belonging to the higher level members, without actually seeing extra rows, columns and values corresponding to those higher level members.

In that prior post I also took a look at how open source OLAP cube browsers like Saiku and Pivot4j implement this. It turned out that Saiku actually does add higher level members, and simply filters those out of the final result, whereas Pivot4J uses the Olap4J API to retrieve ancestor members (and their captions).

I wanted to solve this problem with a solution that only requires MDX and that does not require adding higher level members (and forcing calculation of those values), and I explained how we can find the captions of higher levels using the Ascendants() function. But what I failed to do is to provide are clear, generally applicable recipe that one can apply to any arbitrary MDX query.

I think I have now found a way to do this, and so I'm writing this post to share what I found, in the hope of getting feedback, and maybe help others that face the same challenge.

How to add lineage information to arbitrary MDX COLUMNS vs ROWS queries

In this section I will describe a systematic transformation to turn an arbitrary MDX query and add the lineage information.

Before I explain the query transformation, please note that it applies to MDX queries that have both a COLUMNS and a ROW axis in the SELECT-list, and no other axes. The existence of a so-called "slicer"-axis (which appears in the WHERE-clause, not in the SELECT-list) does not affect the recipe and may or may not be present.
  1. For each hierarchy on each axis, create a calculated member to retrieve the lineage for whatever member belongs to that hierarchy. (For details on how to do this, please refer to my prior blogpost on Ancestors() and Ascendants().)

    If the hierarchy is on the COLUMNS-axis, then put this calculated member in whatever is the last hierarchy of the ROWS-axis; Vice versa, if the hierarchy is on the ROWS axis, then put this calculated member in whatever is the last hierarchy of the COLUMNS-axis. As member name, you can use some descriptive text like [Lineage of <hierarchy>].

    (Note that you only need to make a calculated member if there is a lineage at all. For example, the [Measures]-hierarchy only has one level, and hence there is no lineage. Therefore it is perfectly ok to omit the [Measures]-hierarchy, since a calculated member to retrieve the lineage of its members simply would not provide any extra information.)

    So, for our original input query we have the [Product]-hierarchy on the COLUMNS-axis, and the [Time-hierarchy on the ROWS-axis; the last hierarchy on the COLUMNS-axis is [Measures] and the last hierarchy on the ROWS axis is [Time]. So we get:
    WITH
    MEMBER  [Measures].[Ancestors of Time]
    AS      Generate(
              Order(
                Ascendants([Time].CurrentMember),
                [Time].CurrentMember.Ordinal,
                ASC
              ),
              [Time].CurrentMember.Properties("MEMBER_CAPTION"),
              ","
            )
    MEMBER  [Time].[Ancestors of Product]
    AS      Generate(
              Order(
                Ascendants([Product].CurrentMember),
                [Product].CurrentMember.Ordinal,
                ASC
              ),
              [Product].CurrentMember.Properties("MEMBER_CAPTION"),
              ","
            )
    
  2. For all but the last hierarchy on each axis, create a dummy calculated member of the form:
    MEMBER   [<hierarchy>].[Lineage]
    AS       1
    
    We need these dummy members to create tuples to hold the functional calculated members that report the lineage. We could have used an existing member of these hierarchies instead, but the advantage of creating a dummy calculated member is that we get to give it a name that clearly indicates its purpose. We don't actually need the value of these members at all, which is why assigned the constant 1.

    So, in our example, the COLUMNS-axis contains 2 hierarchies, of which [Product] is the first one, so we create its dummy calculated member:
    MEMBER  [Product].[Lineage]
    AS      1
    
    Since our ROWS-axis only contains one hierarchy we don't need to add any dummy calculated members for that.
  3. For each axis, wrap the original set expression in a Union() function. The second argument to that Union() will be the original axis expression. The first argument should be a newly constructed set of our calculated members to report the lineage.

    In the case of our example, we only have one hierarchy on each axis for which we want the lineage, so instead of a full blown set, the first argument to Union() can be simply a tuple consisting of the [Lineage] dummy calculated members (that is, if they exist) plus the [Lineage of <hierarchy>] calculated member that retrieves the lineage. It's rather more difficult to explain than to simply show, so here's the entire query:
    WITH
    MEMBER  [Measures].[Ancestors of Time]
    AS      Generate(
              Order(
                Ascendants([Time].CurrentMember),
                [Time].CurrentMember.Ordinal,
                ASC
              ),
              [Time].CurrentMember.Properties("MEMBER_CAPTION"),
              ","
            )
    MEMBER  [Time].[Ancestors of Product]
    AS      Generate(
              Order(
                Ascendants([Product].CurrentMember),
                [Product].CurrentMember.Ordinal,
                ASC
              ),
              [Product].CurrentMember.Properties("MEMBER_CAPTION"),
              ","
            )
    MEMBER  [Product].[Lineage]
    AS      1
    SELECT  Union(
              ([Product].[Lineage], [Measures].[Ancestors of Time]),
              CrossJoin(
                [Product].[Product].Members,
                [Measures].[Sales]
              )
            )
    ON COLUMNS,
            Union(
              [Time].[Ancestors of Product],
              [Time].[Months].Members
            )
    ON ROWS
    FROM  [SteelWheelsSales]
    
This is it really. If we run this query, we get the following result:
  

Lineage 1968 Ford Mustang 1958 Chevy Corvette Limited Edition ...
Time Ancestors of Time Sales Sales ...
Ancestors of Product Ancestors of Product All Products,Classic Cars,Autoart Studio Design,1968 Ford Mustang All Products,Classic Cars,Carousel DieCast Legends,1958 Chevy Corvette Limited Edition ...
Jan All Years,2003,QTR1,Jan 1,742 ...
Feb All Years,2003,QTR1,Feb ...
... ... ... ... ...
Jan All Years,2004,QTR1,Jan 7,499
...
Feb All Years,2004,QTR1,Feb 4,518 847 ...
... ... ... ... ...
Basically, this is our original result, except that it has one extra Ancestors of Time column glued at the front in between the original row labels and the cellset, and one extra Ancestors of Product row added on top, in between the original column labels and the cellset. Beyond that extra row and column, we find our original cellset, exactly like it was in the original.

Inside the extra column, we find values like All Years,2003,QTR1,Jan for the first original Jan row - in other words, the entire lineage up to the root level of the [Time]-hierarchy. Note that now the second Jan row is clearly distinguishable from the first one, since the value in the lineage column there is All Years,2004,QTR1,Jan

Similarly, the extra row contains values like All Products,Classic Cars,Autoart Studio Design,1968 Ford Mustang. So for each tuple along the COLUMNS axis we get the full hierarchy of the [Product]-hierarchy here in this extra row's values.

Of course, if a GUI tool were to render this in a way that makes sense to the end-user, some post-processing is required to extract the information from the extra column and row. But all that extra information is in one place, and it didn't require any extra but unused calculations or aggregations.

Dealing with multiple hierarchies

Our example query is rather simple, having only one hierarchy on each axis for which we need the lineage. But the approach is generic and works just as well if you have multiple hierarchies on the axes. While the actual query transformation remains essentially the same, I found it useful to add a few intermediate steps to the recipe in case of multiple hierarchies. This makes it easier (I think) to recognize the original query and the different steps of the query transformation transformation.

So, let's do the transformation again, but now with this more complex query:
SELECT    CrossJoin(
            CrossJoin(
              [Product].[Product].Members,
              [Order Status].Members
            ),
            [Measures].[Sales]
          )
ON COLUMNS,
          CrossJoin(
            [Time].[Months].Members,
            [Markets].[Country].Members
          )
ON ROWS
FROM      [SteelWheelsSales]
  1. Create a named set for the expression on each axis, and give them some descriptive name like [Original <axis>]:
    WITH
    SET     [Original Columns]
    AS      CrossJoin(
              CrossJoin(
                [Product].[Product].Members,
                [Order Status].Members
              ),
              [Measures].[Sales]
            )
    SET     [Original Rows]
    AS      CrossJoin(
              [Time].[Months].Members,
              [Markets].[Country].Members
            )
    
  2. Create the calculated members for the lineage as well as the dummy calculated members exactly like I explained earlier for the simple original query. If you like, you can keep the calculated members that are to appear on the COLUMNS-axis separate from those for the ROWS-axis, and together with the named sets that are to appear on that axis.

    The calculated members that are to appear on the COLUMNS-axis (and which will thus report the lineage information for the hierarchies on the ROWS-axis) are:
    MEMBER    [Product].[Lineage] AS 1
    MEMBER    [Order Status].[Lineage] AS 1
    MEMBER    [Measures].[Lineage of Time]
    AS        Generate(
                Order(
                  Ascendants([Time].CurrentMember),
                  [Time].CurrentMember.Level.Ordinal,
                  ASC
                ),
                [Time].CurrentMember.Properties("MEMBER_CAPTION"),
                ", "
              )
    MEMBER    [Measures].[Lineage of Markets]
    AS        Generate(
                Order(
                  Ascendants([Markets].CurrentMember),
                  [Markets].CurrentMember.Level.Ordinal,
                  ASC
                ),
                [Markets].CurrentMember.Properties("MEMBER_CAPTION"),
                ", "
              )
    
    The calculated members that are to appear on the ROWS-axis (and which will thus report the lineage information for the hierarchies on the COLUMNS-axis) are:
    MEMBER    [Time].[Lineage] AS 1
    MEMBER    [Markets].[Lineage of Product]
    AS        Generate(
                Order(
                  Ascendants([Product].CurrentMember),
                  [Product].CurrentMember.Level.Ordinal,
                  ASC
                ),
                [Product].CurrentMember.Properties("MEMBER_CAPTION"),
                ", "
              )
    MEMBER    [Markets].[Lineage of Order Status]
    AS        Generate(
                Order(
                  Ascendants([Order Status].CurrentMember),
                  [Order Status].CurrentMember.Level.Ordinal,
                  ASC
                ),
                [Order Status].CurrentMember.Properties("MEMBER_CAPTION"),
                ", "
              )
    
  3. Because our axes now have multiple calculated members to retrieve the lineage, it makes sense to put those in a set to simplify creation of the set that we want to glue to the original set of the input query. We can name these sets [Lineage of <axis>]:
    SET       [Lineage of Rows]
    AS        {[Measures].[Lineage of Time]
              ,[Measures].[Lineage of Markets]}
    
    and
    SET       [Lineage of Columns]
    AS        {[Markets].[Lineage of Product]
              ,[Markets].[Lineage of Order Status]}
    
  4. The final step is just as described earlier - using a Union() to add the extra calculated members to the original set expression on the axes. The difference with the earlier simple example is that instead of writing out the literal tuples, we now construct one tuple per axis consisting of all the dummy calculated members, which we then CrossJoin() with its respective [Lineage of <axis>]. This then gives use the set that we can use as the first argument to Union(). The second argument to the union will of course be the [Original <axis>] named sets we created for the original axis sets.

    So, we get:
    Union(
      CrossJoin(
        ([Product].[Lineage], [Order Status].[Lineage]),
        [Lineage of Rows]
      ),
      [Original Columns]
    ) 
    ON COLUMNS
    
    and
    Union(
      CrossJoin(
        [Time].[Lineage],
        [Lineage of Columns]
      ),
      [Original Rows]
    ) 
    ON ROWS
    
Thus, the final query becomes:
WITH
SET       [Original Columns]
AS        CrossJoin(
            CrossJoin(
              [Product].[Product].Members,
              [Order Status].Members
            ),
            [Measures].[Sales]
          )
MEMBER    [Product].[Lineage] AS 1
MEMBER    [Order Status].[Lineage] AS 1
MEMBER    [Measures].[Lineage of Time]
AS        Generate(
            Order(
              Ascendants([Time].CurrentMember),
              [Time].CurrentMember.Level.Ordinal,
              ASC
            ),
            [Time].CurrentMember.Properties("MEMBER_CAPTION"),
            ", "
          )
MEMBER    [Measures].[Lineage of Markets]
AS        Generate(
            Order(
              Ascendants([Markets].CurrentMember),
              [Markets].CurrentMember.Level.Ordinal,
              ASC
            ),
            [Markets].CurrentMember.Properties("MEMBER_CAPTION"),
            ", "
          )
SET       [Lineage of Rows]
AS        {[Measures].[Lineage of Time]
          ,[Measures].[Lineage of Markets]}
SET       [Original Rows]
AS        CrossJoin(
            [Time].[Months].Members,
            [Markets].[Country].Members
          )
MEMBER    [Time].[Lineage] AS 1
MEMBER    [Markets].[Lineage of Product]
AS        Generate(
            Order(
              Ascendants([Product].CurrentMember),
              [Product].CurrentMember.Level.Ordinal,
              ASC
            ),
            [Product].CurrentMember.Properties("MEMBER_CAPTION"),
            ", "
          )
MEMBER    [Markets].[Lineage of Order Status]
AS        Generate(
            Order(
              Ascendants([Order Status].CurrentMember),
              [Order Status].CurrentMember.Level.Ordinal,
              ASC
            ),
            [Order Status].CurrentMember.Properties("MEMBER_CAPTION"),
            ", "
          )
SET       [Lineage of Columns]
AS        {[Markets].[Lineage of Product]
          ,[Markets].[Lineage of Order Status]}
SELECT    Union(
            CrossJoin(
              ([Product].[Lineage], [Order Status].[Lineage]),
              [Lineage of Rows]
            ),
            [Original Columns]
          ) 
ON COLUMNS,
          Union(
            CrossJoin(
              [Time].[Lineage],
              [Lineage of Columns]
            ),
            [Original Rows]
          ) 
ON ROWS
FROM      [SteelWheelsSales]
Nou, the query transformation result looks positively daunting. But it's all the result of the mechanical application of just a few rules. In that sense, it's not particularly difficult. The main thing is not to lose focus and not to forget a calculated member or a hierarchy here or there. For the original use-case, getting a query tool to retrieve the data in order to render results with ancestor information it's not really a problem since that can easily do this transformation by just looping through the axis and the hierarchies.

Note that you can use this slightly more elaborate second recipe just as well for a simple query like the one from our first example. It will work just as well and you'll end up functionally with the same result. It's just that for reasons of presentation and explanation it would be better to start with the simpler 3-step recipe before expanding the approach to multiple hierarchies.

Finally

I hope you found this article useful. As I mentioned before I am still learning MDX and it's certainly possible that I made a mistake or that my approach is more complicated than necessary. If that is the case please let me know - feel free to drop a line.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...