Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Friday, April 24, 2015

MDX: Grouping on non-unique levels

Diethard Steiner, allround open source BI consultant recently tempted me to a MDX challenge:
I’ve got a question for you. It’s actually a simple question, but it doesn’t seem that simple to solve - unless I am missing something.

In the SteelWheelsSale Cube you’ll find hierarchy called Product, with levels Line - Vendor - Product.

The hierarchy is setup in such a way, that you can use e.g. Vendor on its own, but you will see duplicated values, because the key keeps the context to Line.

So imagine we cannot change this Schema. Our task is to show a unique list of Vendors (on their own, without any other hierarchy levels and without All). I googled a bit for such a solution, but there isn’t much showing up. One article focused on DISTINCT(), but this I guess doesn’t work, because our Vendor Level still keeps the context to Line (I added ORDER() to make it easier to spot the duplicates):
SELECT
NON EMPTY Measures.Sales ON COLUMNS,
NON EMPTY
  ORDER(
    Product.Vendor.Members
  , Product.CurrentMember.Properties("MEMBER_CAPTION")
  , BASC
  ) ON ROWS
FROM SteelWheelsSales
Is there some kind of function to break the Vendor out of the hierarchy context? I’d be interested in hearing your thoughts on this.

The data

If we take a moment to analyze Diethards query, we notice that it gets the Sales (which represents money transferred in a sales transaction) for each member in the Vendor level out of the Product hierarchy. In addition to selecting the Vendor members, Diethard's query uses the Order() function to sort the Vendor members by caption, ensuring the sales results for the same vendor appear subsequently in the result.

If we run the query (I'm using Pash for that) we get a result that looks like this:
Product Sales
Autoart Studio Design 153268.09
Autoart Studio Design 196781.21999999997
Autoart Studio Design 66023.59999999999
Autoart Studio Design 67592.24999999999
Autoart Studio Design 131108.81999999998
Autoart Studio Design 184868.24000000002
Carousel DieCast Legends 200123.57999999993
Carousel DieCast Legends 208583.22

...many more rows...

Welly Diecast Productions 136692.72
Welly Diecast Productions 145128.12
As you can see, thanks to the Order() function we can easily notice lots of results for what appears to be duplicate vendors. That's because of the structure of the product dimension in the SteelWheels sample data. The Product hierarchy has the levels Product Line, Vendor, and Product.

Here's how that looks in the Pentaho Analysis Editor (Phase):



From a MDX point of view, the vendors aren't really duplicates though, which would become clear if we would change the query to include the ancestors of the product vendors. Alternatively, in Pash we can print the result of the previous query using member names instead of member captions. Pash lets you do that by entering the following SET command:
MDX> SET MEMBER_PROPERTY NAME;
This command tells pash to use the MEMBER_NAME property rather than MEMBER_CAPTION to render the headers of the dataset. So when we re-execute the query, we get a result that looks like this:
Product [Measures].[Sales]
[Product].[Classic Cars].[Autoart Studio Design] 153268.09
[Product].[Motorcycles].[Autoart Studio Design] 196781.21999999997
[Product].[Planes].[Autoart Studio Design] 66023.59999999999
[Product].[Ships].[Autoart Studio Design] 67592.24999999999
[Product].[Trucks and Buses].[Autoart Studio Design] 131108.81999999998
[Product].[Vintage Cars].[Autoart Studio Design] 184868.24000000002
[Product].[Classic Cars].[Carousel DieCast Legends] 200123.57999999993
[Product].[Ships].[Carousel DieCast Legends] 208583.22

...many more rows...

[Product].[Trucks and Buses].[Welly Diecast Productions] 136692.72
[Product].[Vintage Cars].[Welly Diecast Productions] 145128.12

Cheating - modifying the cube and creating a Vendor dimension

Before digging into any solutions for Diethard's challenge, it is useful to point out that the entire problem would not have existed in the first place if the cube would have provided an alternate hierarchy for the vendor, with the levels Vendor, Product Line, Product.

If you think about it a little more, you could even question whether a vendor level makes sense at all in a product hierarchy. In some businesses, vendors deliver unique products, but in case of the SteelWheels sample this is not the case. The Vendor in this case is more like a shop, and clearly, many shops sell the same products.

So, what we really need is a separate Vendor dimension. I think this makes sense, since a Vendor is really a distinct kind of thing as compared to product. In fact, the concept of a Vendor is completely orthogonal to a Product and I think many business users would agree.

I don't know why the SteelWheels example was setup with a Vendor level midway the Product hierarchy. But it illustrates nicely why and what to refactor.

I don't know if there is a proper term for a hierarchy like our Product hierarchy, that mixes and mingles levels that deal with more than one entirely different concept within the same hierarchy. By lack of better terms, I will henceforth call this a bastard-hierarchy. By extension, the Vendor level is a bastard-level. I think the terms are appropriate, since the Vendor level appears in a line of ancestry where it really doesn't fit. (Plus, I find it relieving to cuss at situations I don't like.)

Fortunately with Phase we can really, obscenely quickly refactor this hierarchy without even messing up our original SteelWheels cube. Phase has a nifty clone-button which allows you to make a deep copy of just about any schema or schema element. We can use this to clone the SteelWheels schema, and within the cloned schema, clone the Product dimension. We can then rename it to "Vendor" and modify its hierarchy, removing the Product Line and Product levels and leaving only the Vendor level:
  1. In the treeview, click the "SteelWheels" schema to select it, and hit the clone button. That's the first button on the toolbar above the schema form. You now have a new schema called "SteelWheels1".
  2. Expand the new SteelWheels1" schema and expand the "SteelWheelsSales" cube to find the Product dimension.
  3. Click the "Product" dimension to select it, and hit the clone button again. You now have a new dimension called "Product1", which is also automatically selected
  4. In the form, change the name of the dimension from "Product1" to "Vendor".
  5. Remove the levels "Line" and "Product". To do that, select the level and click the button with the red X - the delete button.
  6. Hit the save button to save the new cloned and modified schema.
You should now have something like this:



We can now immediately try out the new schema and Vendor dimension design in Pash:
MDX> USE SteelWheels1;
Current catalog set to "SteelWheels1".
MDX> SELECT Measures.Sales ON COLUMNS,
   2        Vendor.Vendor.Members ON ROWS
   3 FROM   SteelWheelsSales;
And we'll get a result like this:
Vendor Sales
Autoart Studio Design 799642.2199999999
Carousel DieCast Legends 749795.7799999999
Classic Metal Creations 1023667.4800000001
Exoto Designs 879854.2200000001
Gearbox Collectibles 912923.6599999999
Highway 66 Mini Classics 747959.1799999999
Min Lin Diecast 764228.96
Motor City Art Classics 809277.5399999999
Red Start Diecast 730593.4400000001
Second Gear Diecast 857851.2500000001
Studio M Art Models 567335.9299999999
Unimax Art Galleries 971571.68
Welly Diecast Productions 831247.8400000001
If you take a moment to go back to the result of Diethard's initial query and manually calculate the sum of sales for all products sold by Vendor "Autoart Studio Design" then you'll notice that this query delivers the correct result.

Now that we have seen that this approach works we could consider making it permanent. We could overwrite the old SteelWheels schema, and we could optimize the Vendor dimension a little bit by marking the Vendor level as having unique members. Finally, after clearing it with the report authors we could clean up the original Product dimensions and remove the Vendor level from that hierarchy altogether. This is something that could even be done gradually - you could create a new Product hierarchy by cloning the old one, and removing the Vendor level only there, and then, once all reports are modified, remove the old Product hierarchy. All these options are open and up to you.

A first attempt: named sets and Aggregate()

The brief intermezzo that concludes the previous section is just to inform you that you should always at least consider whether any trouble you have retrieving the results you require are maybe due to the design of the schema. In this particular case I feel it a very clear cut case that we actually should change the cube design. Especially since the changes do not require any new database structures or ETL - all we need to do is add a logical definition to our cube, and we can do so without taking away the user's ability to navigate the data using the old Product hierarchy.

You do not always have the ability or authority to change the schema, but if you have, and you can make the business case for it, then you should in my opinion always take that route. The remainder of this blog however is what you can in case you're not in such a position. So lets get on with that.

I googled a bit and bumped into this question on stackoverflow by Travis: "How can I merge two members into one in a query?".

The answer provided by user findango is modeled after a typical "sales per country" example, and shows how to combine the sales of a group of selected countries and compare that as a whole to the group of all other countries. This seems quite appropriate, since what I want to do is merge all members at the Vendor level that happen to have the same "local" vendor name, regardless of their ancestry into one member that represents the vendor.

I adapted that idea to fit Diethard's challenge and came up with this solution:
WITH
SET [Set of Autoart Studio Design] AS {
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
}
MEMBER [Product].[Autoart Studio Design] AS Aggregate([Set of Autoart Studio Design])
SET [Set of Carousel DieCast Legends] AS {
  [Product].[Classic Cars].[Carousel DieCast Legends],
  [Product].[Ships].[Carousel DieCast Legends],
  [Product].[Trains].[Carousel DieCast Legends],
  [Product].[Trucks and Buses].[Carousel DieCast Legends],
  [Product].[Vintage Cars].[Carousel DieCast Legends]
}
MEMBER [Product].[Carousel DieCast Legends] AS Aggregate([Set of Carousel DieCast Legends])

...more SET and MEMBER clauses for the other vendors...
SET [Set of Welly Diecast Productions] AS { [Product].[Classic Cars].[Welly Diecast Productions], [Product].[Motorcycles].[Welly Diecast Productions], [Product].[Ships].[Welly Diecast Productions], [Product].[Trucks and Buses].[Welly Diecast Productions], [Product].[Vintage Cars].[Welly Diecast Productions] } MEMBER [Product].[Welly Diecast Productions] AS Aggregate([Set of Welly Diecast Productions]) SELECT [Measures].[Sales] ON COLUMNS, {[Product].[Autoart Studio Design] ,[Product].[Carousel DieCast Legends] ...names of other calculated members go here... ,[Product].[Welly Diecast Productions]} ON ROWS FROM SteelWheelsSales
This solution relies on two structural elements:
  1. A query-scoped named set for each Vendor grouping we'd like to see in our result. These named sets are constructed in the WITH-clause using the SET keyword. In the previous query, the definition of the sets themselves consist of a simple enumeration of member literals that we'd like to treat as a single group.
  2. For each of the named sets created in #1, a query-scoped calculated member that folds the members of each named set into a single new member. This is achieved by applying the Aggregate() function to the set. The Aggregate() function is passed the name of the set as first argument and then the calculated member acts as a new member that represents the set as a whole.
You might notice I marked up the Vendor name for Autoart Studio Design in bold in the previous query. I hope it helps you to reveal how this achieves a grouping of members that belong to the same Vendor. The same process applies to all other Vendors.

With these things in place, we can now select our measure on the COLUMNS axis, and put all of our caclulated members in a new set on the ROWS axis to get the required result, which looks something like this:

      
Product Sales
Autoart Studio Design 799642.2199999999
Carousel DieCast Legends 749795.7799999999
...more vendor sales results...
Unimax Art Galleries 971571.68
Welly Diecast Productions 831247.8400000001
You can crosscheck this result with the result we got from querying the sales over our Vendor dimension and you'll notice that they are identical (well, except for the caption, since we're still working with a Product hierarchy here, and not with a Vendor hierarchy). So, this certainly looks like we're on the right track.

Now, if you take a moment to analyze this query you might notice that we didn't really need to explicitly create a named set for each distinct vendor. The only really essential element is the calculated member based on the Aggregate() function, and instead of first creating a named set and then the calculated member that applies the Aggregate() function to it, we could've passed the definition of the set immediately as first argument to Aggregate().

For example, the calculated member [Product].[Autoart Studio Design] could just as well have been defined as
MEMBER [Product].[Autoart Studio Design] AS Aggregate({
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
})
That said, the explicitly named sets do help to clarify how the solution works by separating the grouping of the members from the actual aggregation of the measure.

Drawbacks

The obvious drawback to this approach is that it is not dynamic, and thus not flexible. There are at least two glaring sources of inflexibility:
  1. An explicit definition for each group. We only knew which named sets to create because we ran Diethard's original query and looked at the result. We had to manually de-deplicate the Vendor list and create an explicit named set for each of them.
  2. The enumeration of members for each group. Again we had to look at the query result to determine the composition of each named set.
If you're a little bit familiar with MDX, you might've noticed right away that the explicit enumeration of members for each Vendor set could've been written a lot smarter. Once we know the caption of each distinct Vendor, we can construct the named sets dynamically using the Filter() function.

The Filter() function takes a set as first argument, and a condition (a logical expression) as second argument. The condition is applied to each member in the set and the function returns a subset, containing only those members for which the condition holds true. So instead of:
WITH
SET [Set of Autoart Studio Design] AS {
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
}
We could have written:
WITH
SET [Set of Autoart Studio Design] AS 
    Filter(
      Product.Vendor.Members
    , Product.CurrentMember.Properties("MEMBER_CAPTION") = "Autoart Studio Design"
    )
So, instead of enumerating all the individual "Autoart Studio Design" members at the Vendor level, we write Product.Vendor.Members to take the entire set of members at the Vendor level, and then apply the condition Product.CurrentMember.Properties("MEMBER_CAPTION") = "Autoart Studio Design" to single out those members that belong to the particular vendor called "Autoart Studio Design".

This solution is surely better than what we had before: it is much, much less verbose, and more importantly, we now only need to have a list of unique vendors to construct our query, regardless of what members might or might not exist for that vendor. Constructing the set by explictly enumerating individual members is risky because we might accidentally leave out a member, or mix up members of different vendors in a single vendor group. More importantly: if the data changes in the future, and members are added for a particular vendor, our query will not be correct anymore. All these problems are solved by using a Filter() expression.

While Filter() allows us to solve one source of inflexibility, we are still stuck with regard to having to create a separate calculated member for each individual Vendor. The most important objection to enumerating all individual members that make up a set for one particular Vendor remains: by requiring advance knowledge of the list of unique vendors, our query is vulnerable to future data changes. Any vendors that might be added to the product dimension in the future will not be taken into account automatically by our query, and hence we run the risk of delivering incomplete (and thus, incorrect) results.

A more dynamic solution

I googled a bit more and ran into a fairly recent article on Richard Lees' blog, MDX - Aggregating by member_caption. In this article Richard explains how to aggregate over all cities with the same name in a geography dimension that has a country, state and a city level. So, quite similar to our Vendor problem!

Dynamically retrieving a unique list of Vendors

Unfortunately, Richard's code is way above my head. But I did manage to pick up one really neat idea from it: If we have a set of members ordered by vendor, then we can apply a Filter() such that members are retained only if their caption is not equal to that of the member that precedes it. In other words, we can filter the ordered set such that we keep only every first occurrence of a particular vendor.

This query does exactly that:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members
        , Product.CurrentMember.Properties("MEMBER_CAPTION")
        , BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <> 
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
SELECT  Measures.Sales ON COLUMNS
,       UniqueVendors ON ROWS
FROM    SteelWheelsSales
And the result:
Product [Measures].[Sales]
[Product].[Classic Cars].[Autoart Studio Design] 153268.09
[Product].[Classic Cars].[Carousel DieCast Legends] 200123.57999999993
[Product].[Classic Cars].[Classic Metal Creations] 742694.2000000002
[Product].[Classic Cars].[Exoto Designs] 265792.4400000001
[Product].[Classic Cars].[Gearbox Collectibles] 585119.6699999999
[Product].[Classic Cars].[Highway 66 Mini Classics] 190488.55000000002
[Product].[Classic Cars].[Min Lin Diecast] 335771.35000000003
[Product].[Classic Cars].[Motor City Art Classics] 120339.81000000003
[Product].[Classic Cars].[Red Start Diecast] 110501.80000000002
[Product].[Classic Cars].[Second Gear Diecast] 506032.90000000014
[Product].[Classic Cars].[Studio M Art Models] 128409.65999999996
[Product].[Classic Cars].[Unimax Art Galleries] 351828.50000000006
[Product].[Classic Cars].[Welly Diecast Productions] 401049.32000000007
If you analyze the results and compare them with Diethard's original query, you will notice that it does indeed report Sales for only the first occurrence of each Vendor (which coincidentally happen all to be children of the "Classic Cars" product line).

So, this is still only a partial solution, since we aren't currently getting the correct Sales figures. But it's an important step nonetheless, since this does give us a unique list of vendors, and it does so in a dynamic way. In other words, this might be the key to getting rid of our requirement to explicitly write code for each disctinct vendor.

This partial solution hinges on two elements:
  1. A set of members at the Vendor level that uses the Order() function to order the members according to their caption. We've seen this already in Diethard's original query. The only difference now is that we put this in a named set called OrderedVendors, instead of putting the Order() expression immediately on a query axis.
  2. A Filter() expression which uses an expression like OrderedVendors.Item(OrderedVendors.CurrentOrdinal) to compare the caption of the currently evaluated member from the OrderedVendors set with that of the previously evaluated member, which is captured using a similar but slightly different expression OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1)
The Item() function can be applied to a set to retrieve a particular tuple by ordinal position. The ordinal position is specified as argument to the Item() function. To retrieve the current tuple we apply CurrentOrdinal to the set. So OrderedVendors.Item(OrderedVendors.CurrentOrdinal) simply means: get us the current tuple from the OrderedVendors set. Since its tuples contain only one member, we can immediately apply Properties("MEMBER_CAPTION") to retrieve its caption.

Similarly, OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION") gets the caption of the previous member in the OrderedVendors set, because substracting 1 from the current ordinal means we are looking at the previous tuple. So, the entire expression:
Filter(
  OrderedVendors
, OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <> 
  OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
)
simply means: keep those members in the OrderedVendors set which happen to have a different caption than the previous member. And since the set was ordered by caption, this must mean we end up with only one member for each unique Vendor caption. (To be precise, we end up with only the first member for each distinct Vendor.)

Calculating totals for each unique Vendor

In order to calculate the correct totals for the Vendors we got in our previous result, we just have to add a calculated measure that takes the current value of the Vendor into account:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members,
          Product.CurrentMember.Properties("MEMBER_CAPTION"),
          BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <>
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
MEMBER  Measures.S
AS      SUM(
          Filter(
            OrderedVendors
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
SELECT  Measures.S ON COLUMNS
,       UniqueVendors ON ROWS
FROM    SteelWheelsSales
The intention of the Calculated measure is to get the SUM() of Measures.Sales, but only for those members at the Vendor level which happen to have a caption that is equal to the current member of our UniqueVendors set. If you look at the calculated measure, it looks quite logical: We filter the OrderedVendors set, which is a set of members at the Vendor level of the Product hierarchy. The expression Product.CurrentMember.Properties("MEMBER_CAPTION") is meant to refer to the current member of the OrderedVendors set in this Filter expression, and UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") is meant to refer to whatever member is current in the unique vendor set, and since we compare by caption, this should give us the set of all Vendor members with the same caption, for each unique caption.

The results of the query are:
Product [Measures].[S]
[Product].[Classic Cars].[Autoart Studio Design] 799642.2199999999
[Product].[Classic Cars].[Carousel DieCast Legends] 749795.7799999999
[Product].[Classic Cars].[Classic Metal Creations] 1023667.4800000001
[Product].[Classic Cars].[Exoto Designs] 879854.2200000001
[Product].[Classic Cars].[Gearbox Collectibles] 912923.6599999999
[Product].[Classic Cars].[Highway 66 Mini Classics] 747959.1799999999
[Product].[Classic Cars].[Min Lin Diecast] 764228.96
[Product].[Classic Cars].[Motor City Art Classics] 809277.5399999999
[Product].[Classic Cars].[Red Start Diecast] 730593.4400000001
[Product].[Classic Cars].[Second Gear Diecast] 857851.2500000001
[Product].[Classic Cars].[Studio M Art Models] 567335.9299999999
[Product].[Classic Cars].[Unimax Art Galleries] 971571.68
[Product].[Classic Cars].[Welly Diecast Productions] 831247.8400000001
If you compare it to our previous results you'll notice that this is indeed the correct result.

Now, even though it seems to work, and even though it does satisfy Diethard's challenge, there are a couple of things about this solution that aren't quite to my liking.
  1. For starters, the result is strange because the row headers are clearly a single member whereas the result is definitely not that of a single member. If all we care about is the label on the ROWS axis, then it doesn't matter, and indeed we wouldn't notice if we'd print the member captions instead of the full member name. But our initial solution based on Aggregate() was more pure in this respect, since that actually allowed us to explicitly create a new member to represent our group of Vendor members. (But of course, the drawback there was that we were unable to generate those groups dynamically)
  2. The solution with Aggregate() had another significant advantage: it knew automagically how to calculate the measure. Apparently Aggregate() is aware of the underlying aggregator that is used to define the measure, and without instruction it calculated the right result, whereas my last solution requires me to explicitly define SUM() to aggregate the values of the measure across the vendors. This might not seem a problem, but what if our measure was supposed to be aggregated by taking the average? In short, we have to have external knowledge about the nature of the measure and choose an appropriate aggregate function ourselves. I'd much prefer it if that could be avoided.
  3. Finally, what worries me about this solution is that, despite the explanation I gave of how it works, I don't really fully understand it.
If my explanation is correct, then I should be able to write:
MEMBER  Measures.S
AS      SUM(
          Filter(
            Product.Vendor.Members
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
instead of:
MEMBER  Measures.S
AS      SUM(
          Filter(
            OrderedVendors
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
After all, what difference does it make whether we filter Vendor level members out of an ordered or an ordered set?

Well, it turns out that it does make a difference: If I use Product.Vendor.Members instead of OrderedMembers then the measure just keeps repeating the value 799642.219999999, which is the total for the vendor "Autoart Studio Design".

Frankly I have zero clue why. I put up a question about this on Stackoverflow, so far, with not many encouraging answers. Please, chime on if you can shed some light on this. I would really appreciate any insights on this matter.

Aggregate(): Redux

So, first we had one problem, and no solutions. Now we have 2 solutions, and at least 3 problems, just different problems. Whether this can be called progess, I'd rather not decide. There is this proverb:
when you're in a hole, stop digging.
Blogs like these would not be written if I'd heed such sensible advice. Instead, I came up with yet another solution that, sort of, combines the elements of my current two solutions into something that is so terrific, we can truly call it the best worst of two worlds.

First, lets consider this summary of my two solutions:

Good Bad
1st solution (using Aggregate())
  • Proper grouping into explicit new members
  • Implicit calculation of measures
  • Completely static
  • Requires knowledge of vendors and vendor members in advance
2nd solution (using UniqueVendors)
  • Completely dynamic
  • Does not require knowledge in advance of vendors and its members
  • Strange grouping to first occurrence of Vendor
  • Explicit calculation of measures


So obviously we'd like to have a solution that has all of the good and none of the bad. The problem we have to overcome is the dynamic generation of calculated members to represent the custom grouping of our members to represent individual vendors. It turns out, there is a way. Just not in one query.

(Please, somebody, anybody, prove me wrong on this!)

What we can do though, is use a query that uses the essential elements of my second solution that generates output that is exactly like my first query. This result can then be run to obtain the desired result. In other words, we enter the domain of dynamic MDX or, with a really posh term, higher order MDX.

So, here goes:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members
        , Product.CurrentMember.Properties("MEMBER_CAPTION"),
          BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <>
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
MEMBER  Measures.S
AS      "WITH"||Chr(10)||
        Generate(
          UniqueVendors
        , "MEMBER Product.["||Product.CurrentMember.Properties("MEMBER_CAPTION")||"]"||Chr(10)||
          "AS Aggregate("||
            "Filter("||
            "  Product.Vendor.Members"||
            ", Product.CurrentMember.Properties('MEMBER_CAPTION') = "||
            "'"||Product.CurrentMember.Properties("MEMBER_CAPTION")||"'"||
            ")"||
          ")"
        , Chr(10)
        )
        ||Chr(10)||"SELECT Measures.Sales ON COLUMNS,"
        ||Chr(10)||"{"||
        Generate(
          UniqueVendors
        , "Product.["||Product.CurrentMember.Properties("MEMBER_CAPTION")||"]"
        , Chr(10)||","
        )
        ||"} ON ROWS"
        ||Chr(10)||"FROM SteelWheelsSales"
SELECT  Measures.S ON COLUMNS
FROM    SteelWheelsSales
The query might be easier to analyze if you see its exact result:
WITH
MEMBER Product.[Autoart Studio Design]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Autoart Studio Design'))
MEMBER Product.[Carousel DieCast Legends]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Carousel DieCast Legends'))
MEMBER Product.[Classic Metal Creations]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Classic Metal Creations'))
MEMBER Product.[Exoto Designs]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Exoto Designs'))
MEMBER Product.[Gearbox Collectibles]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Gearbox Collectibles'))
MEMBER Product.[Highway 66 Mini Classics]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Highway 66 Mini Classics'))
MEMBER Product.[Min Lin Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Min Lin Diecast'))
MEMBER Product.[Motor City Art Classics]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Motor City Art Classics'))
MEMBER Product.[Red Start Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Red Start Diecast'))
MEMBER Product.[Second Gear Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Second Gear Diecast'))
MEMBER Product.[Studio M Art Models]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Studio M Art Models'))
MEMBER Product.[Unimax Art Galleries]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Unimax Art Galleries'))
MEMBER Product.[Welly Diecast Productions]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Welly Diecast Productions'))
SELECT Measures.Sales ON COLUMNS,
{Product.[Autoart Studio Design]
,Product.[Carousel DieCast Legends]
,Product.[Classic Metal Creations]
,Product.[Exoto Designs]
,Product.[Gearbox Collectibles]
,Product.[Highway 66 Mini Classics]
,Product.[Min Lin Diecast]
,Product.[Motor City Art Classics]
,Product.[Red Start Diecast]
,Product.[Second Gear Diecast]
,Product.[Studio M Art Models]
,Product.[Unimax Art Galleries]
,Product.[Welly Diecast Productions]} ON ROWS
FROM SteelWheelsSales
The generating query relies on the Generate() function, which I discussed in my previous blog post, MDX: retrieving the entire hierarchy path with Ancestors(). The Generate() function is used twice, both over the UniqueVendors set. The first Generate() function is used to create the code that defines the calculated members, which serve to group the Vendor members based on caption. The second Generate() function generates the code that defines the set which references these calculated members and which appears on the ROWS axis of the generated query.

There's a few elements in this query that might or might not be familiar:
  • String constants, which are denoted using either double or single quotes
  • The String concatenation operator ||. Note that this is Mondrian specific syntax - The MDX standard defines the plus sign (+) as string concatenation operator. (If someone could point out the proper way to discover which operator is used for string concatenation, I'd be really grateful!) You can easily rewrite this query to standard MDX by replacing each occurrence of || with +.
  • The Chr() function, which MDX inherits from VBA generates a character that corresponds to the character code passed as argument. The generator uses Chr(10) to create newlines in the generated query.
You might recall that I discussed a number of variants of my first query. The version generated here is the most compact one. It doesn't bother to generate separate named sets for the Vendors, and it uses the Filter() expression to define the contents of the named set instead of enumerating all the individual members for the Vendors.

It is quite easy to rewrite the query so that it generates one of the other forms of the query. In particular, it might be useful to generate a query that enumerates the members rather than using the Filter() function. You might recall I initially argued against that on the basis that it is error prone and not resilient to future data changes, but since we can now generate a correct and up to date version of the query anytime, these objections are lifted.

To obtain the query such that it enumerates all members explicitly, one might use the SetToStr() function, which I also discussed briefly in my previous blog. Such a solution would something like this to generate the Aggregate() expressions for the calculated members:
"AS Aggregate("||
  SetToStr(
    Filter(
      Product.Vendor.Members
    , Product.CurrentMember.Properties("MEMBER_CAPTION") = 
      UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION")
    )
  )||")"

Conclusion

If you have a requirement like Diethard's, you really should first consider if you can achieve it by refactoring the schema. Remember, this is not only about making life easy for the MDX author; If there really is a need to make top-level groupings on lower levels of a hierarchy, you might be dealing with a bastard-hierarchy, and the general state of things will be much improved if you put it in a different hierarchy.

There may be other cases where there is a need to make groupings on members of non-uniqe levels. An example that comes to mind is querying Sales quarters against years. This is different from our Vendor example for two reasons. Quarters and Years clearly could very well belong to the same hierarchy (i.e., we're not dealing with a bastard-hierarchy in this case). But if we accept that, we then have to figure out how to put members from the same hierarchy on two different axes of a MDX query. This is a completely different kettle of fish.

However, it is still good to know that this situation too, could, in principle, be solved by creating two separate hierarchies: one with Year as top level and one with the Quarter as top level. (And this is a design that I have observed.)

If it is not possible or appropriate to change or add the hierarchy, you're going to have to work your way around it. I offered three different solutions that can help you out. None of them is perfect, and I did my best to point out the advantages and disadvantages of each method. I hope you find this information useful and I hope it will help you decide how to meet your particular requirements.

Finally, as always, I happily welcome any comments, critique and suggestions. I'm still quite new to MDX so it is entirely possible that I missed a solution or that my solutions could be simplified. I would be very grateful if you could point it out so I can learn from your insights.

1 comment:

Diethard Steiner said...

Hi Roland,
This is a very impressive article! Thanks for sharing these three solutions to my challenging question. I'd say we learnt a lot out of this. Certainly if we can change the Mondrian schema, we should be practical and create a dedicated hierarchy. We knew this - but we were also up for a good MDX challenge, which you passed with flying colors!
Best regards,
Diethard