Friday, April 24, 2015

MDX: Grouping on non-unique levels

Diethard Steiner, allround open source BI consultant recently tempted me to a MDX challenge:
I’ve got a question for you. It’s actually a simple question, but it doesn’t seem that simple to solve - unless I am missing something.

In the SteelWheelsSale Cube you’ll find hierarchy called Product, with levels Line - Vendor - Product.

The hierarchy is setup in such a way, that you can use e.g. Vendor on its own, but you will see duplicated values, because the key keeps the context to Line.

So imagine we cannot change this Schema. Our task is to show a unique list of Vendors (on their own, without any other hierarchy levels and without All). I googled a bit for such a solution, but there isn’t much showing up. One article focused on DISTINCT(), but this I guess doesn’t work, because our Vendor Level still keeps the context to Line (I added ORDER() to make it easier to spot the duplicates):
SELECT
NON EMPTY Measures.Sales ON COLUMNS,
NON EMPTY
  ORDER(
    Product.Vendor.Members
  , Product.CurrentMember.Properties("MEMBER_CAPTION")
  , BASC
  ) ON ROWS
FROM SteelWheelsSales
Is there some kind of function to break the Vendor out of the hierarchy context? I’d be interested in hearing your thoughts on this.

The data

If we take a moment to analyze Diethards query, we notice that it gets the Sales (which represents money transferred in a sales transaction) for each member in the Vendor level out of the Product hierarchy. In addition to selecting the Vendor members, Diethard's query uses the Order() function to sort the Vendor members by caption, ensuring the sales results for the same vendor appear subsequently in the result.

If we run the query (I'm using Pash for that) we get a result that looks like this:
Product Sales
Autoart Studio Design 153268.09
Autoart Studio Design 196781.21999999997
Autoart Studio Design 66023.59999999999
Autoart Studio Design 67592.24999999999
Autoart Studio Design 131108.81999999998
Autoart Studio Design 184868.24000000002
Carousel DieCast Legends 200123.57999999993
Carousel DieCast Legends 208583.22

...many more rows...

Welly Diecast Productions 136692.72
Welly Diecast Productions 145128.12
As you can see, thanks to the Order() function we can easily notice lots of results for what appears to be duplicate vendors. That's because of the structure of the product dimension in the SteelWheels sample data. The Product hierarchy has the levels Product Line, Vendor, and Product.

Here's how that looks in the Pentaho Analysis Editor (Phase):



From a MDX point of view, the vendors aren't really duplicates though, which would become clear if we would change the query to include the ancestors of the product vendors. Alternatively, in Pash we can print the result of the previous query using member names instead of member captions. Pash lets you do that by entering the following SET command:
MDX> SET MEMBER_PROPERTY NAME;
This command tells pash to use the MEMBER_NAME property rather than MEMBER_CAPTION to render the headers of the dataset. So when we re-execute the query, we get a result that looks like this:
Product [Measures].[Sales]
[Product].[Classic Cars].[Autoart Studio Design] 153268.09
[Product].[Motorcycles].[Autoart Studio Design] 196781.21999999997
[Product].[Planes].[Autoart Studio Design] 66023.59999999999
[Product].[Ships].[Autoart Studio Design] 67592.24999999999
[Product].[Trucks and Buses].[Autoart Studio Design] 131108.81999999998
[Product].[Vintage Cars].[Autoart Studio Design] 184868.24000000002
[Product].[Classic Cars].[Carousel DieCast Legends] 200123.57999999993
[Product].[Ships].[Carousel DieCast Legends] 208583.22

...many more rows...

[Product].[Trucks and Buses].[Welly Diecast Productions] 136692.72
[Product].[Vintage Cars].[Welly Diecast Productions] 145128.12

Cheating - modifying the cube and creating a Vendor dimension

Before digging into any solutions for Diethard's challenge, it is useful to point out that the entire problem would not have existed in the first place if the cube would have provided an alternate hierarchy for the vendor, with the levels Vendor, Product Line, Product.

If you think about it a little more, you could even question whether a vendor level makes sense at all in a product hierarchy. In some businesses, vendors deliver unique products, but in case of the SteelWheels sample this is not the case. The Vendor in this case is more like a shop, and clearly, many shops sell the same products.

So, what we really need is a separate Vendor dimension. I think this makes sense, since a Vendor is really a distinct kind of thing as compared to product. In fact, the concept of a Vendor is completely orthogonal to a Product and I think many business users would agree.

I don't know why the SteelWheels example was setup with a Vendor level midway the Product hierarchy. But it illustrates nicely why and what to refactor.

I don't know if there is a proper term for a hierarchy like our Product hierarchy, that mixes and mingles levels that deal with more than one entirely different concept within the same hierarchy. By lack of better terms, I will henceforth call this a bastard-hierarchy. By extension, the Vendor level is a bastard-level. I think the terms are appropriate, since the Vendor level appears in a line of ancestry where it really doesn't fit. (Plus, I find it relieving to cuss at situations I don't like.)

Fortunately with Phase we can really, obscenely quickly refactor this hierarchy without even messing up our original SteelWheels cube. Phase has a nifty clone-button which allows you to make a deep copy of just about any schema or schema element. We can use this to clone the SteelWheels schema, and within the cloned schema, clone the Product dimension. We can then rename it to "Vendor" and modify its hierarchy, removing the Product Line and Product levels and leaving only the Vendor level:
  1. In the treeview, click the "SteelWheels" schema to select it, and hit the clone button. That's the first button on the toolbar above the schema form. You now have a new schema called "SteelWheels1".
  2. Expand the new SteelWheels1" schema and expand the "SteelWheelsSales" cube to find the Product dimension.
  3. Click the "Product" dimension to select it, and hit the clone button again. You now have a new dimension called "Product1", which is also automatically selected
  4. In the form, change the name of the dimension from "Product1" to "Vendor".
  5. Remove the levels "Line" and "Product". To do that, select the level and click the button with the red X - the delete button.
  6. Hit the save button to save the new cloned and modified schema.
You should now have something like this:



We can now immediately try out the new schema and Vendor dimension design in Pash:
MDX> USE SteelWheels1;
Current catalog set to "SteelWheels1".
MDX> SELECT Measures.Sales ON COLUMNS,
   2        Vendor.Vendor.Members ON ROWS
   3 FROM   SteelWheelsSales;
And we'll get a result like this:
Vendor Sales
Autoart Studio Design 799642.2199999999
Carousel DieCast Legends 749795.7799999999
Classic Metal Creations 1023667.4800000001
Exoto Designs 879854.2200000001
Gearbox Collectibles 912923.6599999999
Highway 66 Mini Classics 747959.1799999999
Min Lin Diecast 764228.96
Motor City Art Classics 809277.5399999999
Red Start Diecast 730593.4400000001
Second Gear Diecast 857851.2500000001
Studio M Art Models 567335.9299999999
Unimax Art Galleries 971571.68
Welly Diecast Productions 831247.8400000001
If you take a moment to go back to the result of Diethard's initial query and manually calculate the sum of sales for all products sold by Vendor "Autoart Studio Design" then you'll notice that this query delivers the correct result.

Now that we have seen that this approach works we could consider making it permanent. We could overwrite the old SteelWheels schema, and we could optimize the Vendor dimension a little bit by marking the Vendor level as having unique members. Finally, after clearing it with the report authors we could clean up the original Product dimensions and remove the Vendor level from that hierarchy altogether. This is something that could even be done gradually - you could create a new Product hierarchy by cloning the old one, and removing the Vendor level only there, and then, once all reports are modified, remove the old Product hierarchy. All these options are open and up to you.

A first attempt: named sets and Aggregate()

The brief intermezzo that concludes the previous section is just to inform you that you should always at least consider whether any trouble you have retrieving the results you require are maybe due to the design of the schema. In this particular case I feel it a very clear cut case that we actually should change the cube design. Especially since the changes do not require any new database structures or ETL - all we need to do is add a logical definition to our cube, and we can do so without taking away the user's ability to navigate the data using the old Product hierarchy.

You do not always have the ability or authority to change the schema, but if you have, and you can make the business case for it, then you should in my opinion always take that route. The remainder of this blog however is what you can in case you're not in such a position. So lets get on with that.

I googled a bit and bumped into this question on stackoverflow by Travis: "How can I merge two members into one in a query?".

The answer provided by user findango is modeled after a typical "sales per country" example, and shows how to combine the sales of a group of selected countries and compare that as a whole to the group of all other countries. This seems quite appropriate, since what I want to do is merge all members at the Vendor level that happen to have the same "local" vendor name, regardless of their ancestry into one member that represents the vendor.

I adapted that idea to fit Diethard's challenge and came up with this solution:
WITH
SET [Set of Autoart Studio Design] AS {
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
}
MEMBER [Product].[Autoart Studio Design] AS Aggregate([Set of Autoart Studio Design])
SET [Set of Carousel DieCast Legends] AS {
  [Product].[Classic Cars].[Carousel DieCast Legends],
  [Product].[Ships].[Carousel DieCast Legends],
  [Product].[Trains].[Carousel DieCast Legends],
  [Product].[Trucks and Buses].[Carousel DieCast Legends],
  [Product].[Vintage Cars].[Carousel DieCast Legends]
}
MEMBER [Product].[Carousel DieCast Legends] AS Aggregate([Set of Carousel DieCast Legends])

...more SET and MEMBER clauses for the other vendors...
SET [Set of Welly Diecast Productions] AS { [Product].[Classic Cars].[Welly Diecast Productions], [Product].[Motorcycles].[Welly Diecast Productions], [Product].[Ships].[Welly Diecast Productions], [Product].[Trucks and Buses].[Welly Diecast Productions], [Product].[Vintage Cars].[Welly Diecast Productions] } MEMBER [Product].[Welly Diecast Productions] AS Aggregate([Set of Welly Diecast Productions]) SELECT [Measures].[Sales] ON COLUMNS, {[Product].[Autoart Studio Design] ,[Product].[Carousel DieCast Legends] ...names of other calculated members go here... ,[Product].[Welly Diecast Productions]} ON ROWS FROM SteelWheelsSales
This solution relies on two structural elements:
  1. A query-scoped named set for each Vendor grouping we'd like to see in our result. These named sets are constructed in the WITH-clause using the SET keyword. In the previous query, the definition of the sets themselves consist of a simple enumeration of member literals that we'd like to treat as a single group.
  2. For each of the named sets created in #1, a query-scoped calculated member that folds the members of each named set into a single new member. This is achieved by applying the Aggregate() function to the set. The Aggregate() function is passed the name of the set as first argument and then the calculated member acts as a new member that represents the set as a whole.
You might notice I marked up the Vendor name for Autoart Studio Design in bold in the previous query. I hope it helps you to reveal how this achieves a grouping of members that belong to the same Vendor. The same process applies to all other Vendors.

With these things in place, we can now select our measure on the COLUMNS axis, and put all of our caclulated members in a new set on the ROWS axis to get the required result, which looks something like this:

      
Product Sales
Autoart Studio Design 799642.2199999999
Carousel DieCast Legends 749795.7799999999
...more vendor sales results...
Unimax Art Galleries 971571.68
Welly Diecast Productions 831247.8400000001
You can crosscheck this result with the result we got from querying the sales over our Vendor dimension and you'll notice that they are identical (well, except for the caption, since we're still working with a Product hierarchy here, and not with a Vendor hierarchy). So, this certainly looks like we're on the right track.

Now, if you take a moment to analyze this query you might notice that we didn't really need to explicitly create a named set for each distinct vendor. The only really essential element is the calculated member based on the Aggregate() function, and instead of first creating a named set and then the calculated member that applies the Aggregate() function to it, we could've passed the definition of the set immediately as first argument to Aggregate().

For example, the calculated member [Product].[Autoart Studio Design] could just as well have been defined as
MEMBER [Product].[Autoart Studio Design] AS Aggregate({
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
})
That said, the explicitly named sets do help to clarify how the solution works by separating the grouping of the members from the actual aggregation of the measure.

Drawbacks

The obvious drawback to this approach is that it is not dynamic, and thus not flexible. There are at least two glaring sources of inflexibility:
  1. An explicit definition for each group. We only knew which named sets to create because we ran Diethard's original query and looked at the result. We had to manually de-deplicate the Vendor list and create an explicit named set for each of them.
  2. The enumeration of members for each group. Again we had to look at the query result to determine the composition of each named set.
If you're a little bit familiar with MDX, you might've noticed right away that the explicit enumeration of members for each Vendor set could've been written a lot smarter. Once we know the caption of each distinct Vendor, we can construct the named sets dynamically using the Filter() function.

The Filter() function takes a set as first argument, and a condition (a logical expression) as second argument. The condition is applied to each member in the set and the function returns a subset, containing only those members for which the condition holds true. So instead of:
WITH
SET [Set of Autoart Studio Design] AS {
  [Product].[Classic Cars].[Autoart Studio Design],
  [Product].[Motorcycles].[Autoart Studio Design],
  [Product].[Planes].[Autoart Studio Design],
  [Product].[Ships].[Autoart Studio Design],
  [Product].[Trucks and Buses].[Autoart Studio Design],
  [Product].[Vintage Cars].[Autoart Studio Design]
}
We could have written:
WITH
SET [Set of Autoart Studio Design] AS 
    Filter(
      Product.Vendor.Members
    , Product.CurrentMember.Properties("MEMBER_CAPTION") = "Autoart Studio Design"
    )
So, instead of enumerating all the individual "Autoart Studio Design" members at the Vendor level, we write Product.Vendor.Members to take the entire set of members at the Vendor level, and then apply the condition Product.CurrentMember.Properties("MEMBER_CAPTION") = "Autoart Studio Design" to single out those members that belong to the particular vendor called "Autoart Studio Design".

This solution is surely better than what we had before: it is much, much less verbose, and more importantly, we now only need to have a list of unique vendors to construct our query, regardless of what members might or might not exist for that vendor. Constructing the set by explictly enumerating individual members is risky because we might accidentally leave out a member, or mix up members of different vendors in a single vendor group. More importantly: if the data changes in the future, and members are added for a particular vendor, our query will not be correct anymore. All these problems are solved by using a Filter() expression.

While Filter() allows us to solve one source of inflexibility, we are still stuck with regard to having to create a separate calculated member for each individual Vendor. The most important objection to enumerating all individual members that make up a set for one particular Vendor remains: by requiring advance knowledge of the list of unique vendors, our query is vulnerable to future data changes. Any vendors that might be added to the product dimension in the future will not be taken into account automatically by our query, and hence we run the risk of delivering incomplete (and thus, incorrect) results.

A more dynamic solution

I googled a bit more and ran into a fairly recent article on Richard Lees' blog, MDX - Aggregating by member_caption. In this article Richard explains how to aggregate over all cities with the same name in a geography dimension that has a country, state and a city level. So, quite similar to our Vendor problem!

Dynamically retrieving a unique list of Vendors

Unfortunately, Richard's code is way above my head. But I did manage to pick up one really neat idea from it: If we have a set of members ordered by vendor, then we can apply a Filter() such that members are retained only if their caption is not equal to that of the member that precedes it. In other words, we can filter the ordered set such that we keep only every first occurrence of a particular vendor.

This query does exactly that:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members
        , Product.CurrentMember.Properties("MEMBER_CAPTION")
        , BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <> 
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
SELECT  Measures.Sales ON COLUMNS
,       UniqueVendors ON ROWS
FROM    SteelWheelsSales
And the result:
Product [Measures].[Sales]
[Product].[Classic Cars].[Autoart Studio Design] 153268.09
[Product].[Classic Cars].[Carousel DieCast Legends] 200123.57999999993
[Product].[Classic Cars].[Classic Metal Creations] 742694.2000000002
[Product].[Classic Cars].[Exoto Designs] 265792.4400000001
[Product].[Classic Cars].[Gearbox Collectibles] 585119.6699999999
[Product].[Classic Cars].[Highway 66 Mini Classics] 190488.55000000002
[Product].[Classic Cars].[Min Lin Diecast] 335771.35000000003
[Product].[Classic Cars].[Motor City Art Classics] 120339.81000000003
[Product].[Classic Cars].[Red Start Diecast] 110501.80000000002
[Product].[Classic Cars].[Second Gear Diecast] 506032.90000000014
[Product].[Classic Cars].[Studio M Art Models] 128409.65999999996
[Product].[Classic Cars].[Unimax Art Galleries] 351828.50000000006
[Product].[Classic Cars].[Welly Diecast Productions] 401049.32000000007
If you analyze the results and compare them with Diethard's original query, you will notice that it does indeed report Sales for only the first occurrence of each Vendor (which coincidentally happen all to be children of the "Classic Cars" product line).

So, this is still only a partial solution, since we aren't currently getting the correct Sales figures. But it's an important step nonetheless, since this does give us a unique list of vendors, and it does so in a dynamic way. In other words, this might be the key to getting rid of our requirement to explicitly write code for each disctinct vendor.

This partial solution hinges on two elements:
  1. A set of members at the Vendor level that uses the Order() function to order the members according to their caption. We've seen this already in Diethard's original query. The only difference now is that we put this in a named set called OrderedVendors, instead of putting the Order() expression immediately on a query axis.
  2. A Filter() expression which uses an expression like OrderedVendors.Item(OrderedVendors.CurrentOrdinal) to compare the caption of the currently evaluated member from the OrderedVendors set with that of the previously evaluated member, which is captured using a similar but slightly different expression OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1)
The Item() function can be applied to a set to retrieve a particular tuple by ordinal position. The ordinal position is specified as argument to the Item() function. To retrieve the current tuple we apply CurrentOrdinal to the set. So OrderedVendors.Item(OrderedVendors.CurrentOrdinal) simply means: get us the current tuple from the OrderedVendors set. Since its tuples contain only one member, we can immediately apply Properties("MEMBER_CAPTION") to retrieve its caption.

Similarly, OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION") gets the caption of the previous member in the OrderedVendors set, because substracting 1 from the current ordinal means we are looking at the previous tuple. So, the entire expression:
Filter(
  OrderedVendors
, OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <> 
  OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
)
simply means: keep those members in the OrderedVendors set which happen to have a different caption than the previous member. And since the set was ordered by caption, this must mean we end up with only one member for each unique Vendor caption. (To be precise, we end up with only the first member for each distinct Vendor.)

Calculating totals for each unique Vendor

In order to calculate the correct totals for the Vendors we got in our previous result, we just have to add a calculated measure that takes the current value of the Vendor into account:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members,
          Product.CurrentMember.Properties("MEMBER_CAPTION"),
          BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <>
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
MEMBER  Measures.S
AS      SUM(
          Filter(
            OrderedVendors
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
SELECT  Measures.S ON COLUMNS
,       UniqueVendors ON ROWS
FROM    SteelWheelsSales
The intention of the Calculated measure is to get the SUM() of Measures.Sales, but only for those members at the Vendor level which happen to have a caption that is equal to the current member of our UniqueVendors set. If you look at the calculated measure, it looks quite logical: We filter the OrderedVendors set, which is a set of members at the Vendor level of the Product hierarchy. The expression Product.CurrentMember.Properties("MEMBER_CAPTION") is meant to refer to the current member of the OrderedVendors set in this Filter expression, and UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") is meant to refer to whatever member is current in the unique vendor set, and since we compare by caption, this should give us the set of all Vendor members with the same caption, for each unique caption.

The results of the query are:
Product [Measures].[S]
[Product].[Classic Cars].[Autoart Studio Design] 799642.2199999999
[Product].[Classic Cars].[Carousel DieCast Legends] 749795.7799999999
[Product].[Classic Cars].[Classic Metal Creations] 1023667.4800000001
[Product].[Classic Cars].[Exoto Designs] 879854.2200000001
[Product].[Classic Cars].[Gearbox Collectibles] 912923.6599999999
[Product].[Classic Cars].[Highway 66 Mini Classics] 747959.1799999999
[Product].[Classic Cars].[Min Lin Diecast] 764228.96
[Product].[Classic Cars].[Motor City Art Classics] 809277.5399999999
[Product].[Classic Cars].[Red Start Diecast] 730593.4400000001
[Product].[Classic Cars].[Second Gear Diecast] 857851.2500000001
[Product].[Classic Cars].[Studio M Art Models] 567335.9299999999
[Product].[Classic Cars].[Unimax Art Galleries] 971571.68
[Product].[Classic Cars].[Welly Diecast Productions] 831247.8400000001
If you compare it to our previous results you'll notice that this is indeed the correct result.

Now, even though it seems to work, and even though it does satisfy Diethard's challenge, there are a couple of things about this solution that aren't quite to my liking.
  1. For starters, the result is strange because the row headers are clearly a single member whereas the result is definitely not that of a single member. If all we care about is the label on the ROWS axis, then it doesn't matter, and indeed we wouldn't notice if we'd print the member captions instead of the full member name. But our initial solution based on Aggregate() was more pure in this respect, since that actually allowed us to explicitly create a new member to represent our group of Vendor members. (But of course, the drawback there was that we were unable to generate those groups dynamically)
  2. The solution with Aggregate() had another significant advantage: it knew automagically how to calculate the measure. Apparently Aggregate() is aware of the underlying aggregator that is used to define the measure, and without instruction it calculated the right result, whereas my last solution requires me to explicitly define SUM() to aggregate the values of the measure across the vendors. This might not seem a problem, but what if our measure was supposed to be aggregated by taking the average? In short, we have to have external knowledge about the nature of the measure and choose an appropriate aggregate function ourselves. I'd much prefer it if that could be avoided.
  3. Finally, what worries me about this solution is that, despite the explanation I gave of how it works, I don't really fully understand it.
If my explanation is correct, then I should be able to write:
MEMBER  Measures.S
AS      SUM(
          Filter(
            Product.Vendor.Members
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
instead of:
MEMBER  Measures.S
AS      SUM(
          Filter(
            OrderedVendors
          , UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") =
            Product.CurrentMember.Properties("MEMBER_CAPTION")
          )
        , Measures.Sales
        )
After all, what difference does it make whether we filter Vendor level members out of an ordered or an ordered set?

Well, it turns out that it does make a difference: If I use Product.Vendor.Members instead of OrderedMembers then the measure just keeps repeating the value 799642.219999999, which is the total for the vendor "Autoart Studio Design".

Frankly I have zero clue why. I put up a question about this on Stackoverflow, so far, with not many encouraging answers. Please, chime on if you can shed some light on this. I would really appreciate any insights on this matter.

Aggregate(): Redux

So, first we had one problem, and no solutions. Now we have 2 solutions, and at least 3 problems, just different problems. Whether this can be called progess, I'd rather not decide. There is this proverb:
when you're in a hole, stop digging.
Blogs like these would not be written if I'd heed such sensible advice. Instead, I came up with yet another solution that, sort of, combines the elements of my current two solutions into something that is so terrific, we can truly call it the best worst of two worlds.

First, lets consider this summary of my two solutions:

Good Bad
1st solution (using Aggregate())
  • Proper grouping into explicit new members
  • Implicit calculation of measures
  • Completely static
  • Requires knowledge of vendors and vendor members in advance
2nd solution (using UniqueVendors)
  • Completely dynamic
  • Does not require knowledge in advance of vendors and its members
  • Strange grouping to first occurrence of Vendor
  • Explicit calculation of measures


So obviously we'd like to have a solution that has all of the good and none of the bad. The problem we have to overcome is the dynamic generation of calculated members to represent the custom grouping of our members to represent individual vendors. It turns out, there is a way. Just not in one query.

(Please, somebody, anybody, prove me wrong on this!)

What we can do though, is use a query that uses the essential elements of my second solution that generates output that is exactly like my first query. This result can then be run to obtain the desired result. In other words, we enter the domain of dynamic MDX or, with a really posh term, higher order MDX.

So, here goes:
WITH
SET     OrderedVendors
AS      Order(
          Product.Vendor.Members
        , Product.CurrentMember.Properties("MEMBER_CAPTION"),
          BASC
        )
SET     UniqueVendors
AS      Filter(
          OrderedVendors
        , OrderedVendors.Item(OrderedVendors.CurrentOrdinal).Properties("MEMBER_CAPTION") <>
          OrderedVendors.Item(OrderedVendors.CurrentOrdinal - 1).Properties("MEMBER_CAPTION")
        )
MEMBER  Measures.S
AS      "WITH"||Chr(10)||
        Generate(
          UniqueVendors
        , "MEMBER Product.["||Product.CurrentMember.Properties("MEMBER_CAPTION")||"]"||Chr(10)||
          "AS Aggregate("||
            "Filter("||
            "  Product.Vendor.Members"||
            ", Product.CurrentMember.Properties('MEMBER_CAPTION') = "||
            "'"||Product.CurrentMember.Properties("MEMBER_CAPTION")||"'"||
            ")"||
          ")"
        , Chr(10)
        )
        ||Chr(10)||"SELECT Measures.Sales ON COLUMNS,"
        ||Chr(10)||"{"||
        Generate(
          UniqueVendors
        , "Product.["||Product.CurrentMember.Properties("MEMBER_CAPTION")||"]"
        , Chr(10)||","
        )
        ||"} ON ROWS"
        ||Chr(10)||"FROM SteelWheelsSales"
SELECT  Measures.S ON COLUMNS
FROM    SteelWheelsSales
The query might be easier to analyze if you see its exact result:
WITH
MEMBER Product.[Autoart Studio Design]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Autoart Studio Design'))
MEMBER Product.[Carousel DieCast Legends]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Carousel DieCast Legends'))
MEMBER Product.[Classic Metal Creations]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Classic Metal Creations'))
MEMBER Product.[Exoto Designs]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Exoto Designs'))
MEMBER Product.[Gearbox Collectibles]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Gearbox Collectibles'))
MEMBER Product.[Highway 66 Mini Classics]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Highway 66 Mini Classics'))
MEMBER Product.[Min Lin Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Min Lin Diecast'))
MEMBER Product.[Motor City Art Classics]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Motor City Art Classics'))
MEMBER Product.[Red Start Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Red Start Diecast'))
MEMBER Product.[Second Gear Diecast]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Second Gear Diecast'))
MEMBER Product.[Studio M Art Models]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Studio M Art Models'))
MEMBER Product.[Unimax Art Galleries]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Unimax Art Galleries'))
MEMBER Product.[Welly Diecast Productions]
AS Aggregate(Filter(Product.Vendor.Members, Product.CurrentMember.Properties('MEMBER_CAPTION') = 'Welly Diecast Productions'))
SELECT Measures.Sales ON COLUMNS,
{Product.[Autoart Studio Design]
,Product.[Carousel DieCast Legends]
,Product.[Classic Metal Creations]
,Product.[Exoto Designs]
,Product.[Gearbox Collectibles]
,Product.[Highway 66 Mini Classics]
,Product.[Min Lin Diecast]
,Product.[Motor City Art Classics]
,Product.[Red Start Diecast]
,Product.[Second Gear Diecast]
,Product.[Studio M Art Models]
,Product.[Unimax Art Galleries]
,Product.[Welly Diecast Productions]} ON ROWS
FROM SteelWheelsSales
The generating query relies on the Generate() function, which I discussed in my previous blog post, MDX: retrieving the entire hierarchy path with Ancestors(). The Generate() function is used twice, both over the UniqueVendors set. The first Generate() function is used to create the code that defines the calculated members, which serve to group the Vendor members based on caption. The second Generate() function generates the code that defines the set which references these calculated members and which appears on the ROWS axis of the generated query.

There's a few elements in this query that might or might not be familiar:
  • String constants, which are denoted using either double or single quotes
  • The String concatenation operator ||. Note that this is Mondrian specific syntax - The MDX standard defines the plus sign (+) as string concatenation operator. (If someone could point out the proper way to discover which operator is used for string concatenation, I'd be really grateful!) You can easily rewrite this query to standard MDX by replacing each occurrence of || with +.
  • The Chr() function, which MDX inherits from VBA generates a character that corresponds to the character code passed as argument. The generator uses Chr(10) to create newlines in the generated query.
You might recall that I discussed a number of variants of my first query. The version generated here is the most compact one. It doesn't bother to generate separate named sets for the Vendors, and it uses the Filter() expression to define the contents of the named set instead of enumerating all the individual members for the Vendors.

It is quite easy to rewrite the query so that it generates one of the other forms of the query. In particular, it might be useful to generate a query that enumerates the members rather than using the Filter() function. You might recall I initially argued against that on the basis that it is error prone and not resilient to future data changes, but since we can now generate a correct and up to date version of the query anytime, these objections are lifted.

To obtain the query such that it enumerates all members explicitly, one might use the SetToStr() function, which I also discussed briefly in my previous blog. Such a solution would something like this to generate the Aggregate() expressions for the calculated members:
"AS Aggregate("||
  SetToStr(
    Filter(
      Product.Vendor.Members
    , Product.CurrentMember.Properties("MEMBER_CAPTION") = 
      UniqueVendors.Item(UniqueVendors.CurrentOrdinal).Properties("MEMBER_CAPTION")
    )
  )||")"

Conclusion

If you have a requirement like Diethard's, you really should first consider if you can achieve it by refactoring the schema. Remember, this is not only about making life easy for the MDX author; If there really is a need to make top-level groupings on lower levels of a hierarchy, you might be dealing with a bastard-hierarchy, and the general state of things will be much improved if you put it in a different hierarchy.

There may be other cases where there is a need to make groupings on members of non-uniqe levels. An example that comes to mind is querying Sales quarters against years. This is different from our Vendor example for two reasons. Quarters and Years clearly could very well belong to the same hierarchy (i.e., we're not dealing with a bastard-hierarchy in this case). But if we accept that, we then have to figure out how to put members from the same hierarchy on two different axes of a MDX query. This is a completely different kettle of fish.

However, it is still good to know that this situation too, could, in principle, be solved by creating two separate hierarchies: one with Year as top level and one with the Quarter as top level. (And this is a design that I have observed.)

If it is not possible or appropriate to change or add the hierarchy, you're going to have to work your way around it. I offered three different solutions that can help you out. None of them is perfect, and I did my best to point out the advantages and disadvantages of each method. I hope you find this information useful and I hope it will help you decide how to meet your particular requirements.

Finally, as always, I happily welcome any comments, critique and suggestions. I'm still quite new to MDX so it is entirely possible that I missed a solution or that my solutions could be simplified. I would be very grateful if you could point it out so I can learn from your insights.

Wednesday, April 15, 2015

MDX: retrieving the entire hierarchy path with Ancestors()

A couple of days ago I wrote about one of my forays into MDX land (Retrieving denormalized tabular results with MDX). The topic of that post was how to write MDX so as to retrieve the kind of flat, tabular results one gets from SQL queries. An essential point of that solution was the MDX Ancestor() function.

I stumbled upon the topic of my previous blogpost while I was researching something else entirely. Creating flat tables and looking up individual ancestors is actually a rather specific application of a much more general solution I found initially.

Pivot tables and the "Show Parents" functionality

GUI OLAP tools typically offer a pivot table query interface. They let you drag and drop measures and dimension items, like members and levels to create a pivot table. The cells of the pivot table are aggregated values of the measures, and the row and column headers of the pivot table are dimension members, which are typically derived from a level that was dragged into the pivot table.

Please recall the sales cube example I introduced in my previous post:



Now, suppose we would drag the Sales quantity measure unto the columns axis of our pivot table, and drag the Quarters level from the Time dimension unto the rows axis. The GUI tool might generate an MDX query quite like the one I introduced in my previous post:
SELECT  Measures.Quantity     ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
Here's how this is rendered in Saiku Analytics:



And here's how it looks in Pivot4J:



Now, as I pointed out in my previous post, the problem with this result is that we don't see any context: we cannot see to which year the quarters belong. Both tools have a very useful feature called "Show parents". This is a toggle button that changes the view so that the headers show the values of the corresponding higher levels. For example, this is what the previous result looks like in Pivot4J when "Show Parents" is toggled:

As you can see, the year level and even the "All level" is now visible.

In Saiku we can achieve a similar thing, but the other way around: you can add the year and the all level, at which point totals are shown for these higher levels:



And you can then choose "Hide Parents" to get rid of the rows for the higher level aggregats, leaving you with essentially the same view of the data as shown in the last Pivot4J screenshot.

Implementing Show/Hide Parents

In Saiku, the "Hide Parents" functionality is achieved by post-processing the resultset: when the result is iterated to render the table, rows for all but the lowest level are filtered away and discarded.

In Pivot4J, it works a little bit different. Here's how Xavier Cho describes it:
the information of the parents is obtained by the members present on the axes. Pivot4J accesses it through Olap4J API which exposes a member's parent and ancestors via Member.getParentMember() and Member.getAncestorMembers() respectively:

http://www.olap4j.org/api/org/olap4j/metadata/Member.html

References to the member instances in a given MDX can be obtained by its CellSet interface, which is equivalent to what is ResultSet for JDBC. In addition, Pivot4J exposes the member instance for each cells to the expression language context, so you can reference itself, or its parent or ancestors in a property expression too.

In summary, if you are trying to access the parent of a member included in MDX, you'll first need to execute the query using the Olap4J then get it from the resulting CellSet instance.

A pure MDX expression

I thought it would be fun to try and rewrite our original query in such a way that its result would give us this information.

The Ancestors() function

As it turns out, we can do this for one particular hierarchy in our query by creating a Calculated Member on the Measures hierarchy that applies the Ancestors() function to the current member of the hierarchy for which we want the path.

The Ancestors() function takes 2 arguments
  1. A member for which to find ancestor members (members at a higher level that contain the argument member)
  2. An argument that specifies how many levels to traverse up.
The function returns a set of members that are an ancestor of the member passed as first argument.

Specifying the first argument is easy: we simply want to find ancestors for whatever member, so we can specify it as <Hierarchy Name>.CurrentMember and it will just work.

The second argument can be specified in 2 ways:
  • As a level: the second argument specifies a level and all ancestors up to that level will be retrieved
  • As a integer representing a distance: the second argument specifies the number of levels that will be traversed upwards
The first form is useful if you want to retrieve ancestors up to a specific level. I want to retrieve all ancestors, so the number of levels I want the function to traverse is in fact equal to the level number of the first argument. We can conveniently specify this with the LEVEL_NUMBER property using an expression like:

<Hierarchy Name>.CurrentMember.Properties("LEVEL_NUMBER")

But this is not yet entirely right, since this form of the Properties() function always returns a string, even though the LEVEL_NUMBER property is actually of the integer type. The standard MDX Properties() function allows an optional second argument TYPED. When this is passed, the property will be returned as a value having its declared type.

Unfortunately, Mondrian, a.k.a. Pentaho Analysis Services does not support that form of the Properties() function (see: MONDRIAN-1795). So, in order to retrieve the level number as an integer value, we have to apply the CInt() function to convert the string representation of the level number to an integer.

So, our call to the Ancestors() function will look like this:

Ancestors(<Hierarchy Name>.CurrentMember, CInt(<Hierarchy Name>.CurrentMember.Properties("LEVEL_NUMBER")))

A simpler alternative: Ascendants()

If it is acceptable to also include the CurrentMember itself, then we can even simplify this quite a bit by using the Ascendants() function. The Ascendants() function takes a single member as argument, and returns the set of ancestor members as well as the argument member, all the way up to the member at the top level. With Ascendants(), our expression would simply be: Ascendants(<Hierarchy Name>.CurrentMember)

We will continue this post using Ancestors(), but the approach can be easily applied to Ascendants() instead.

Converting the set of Ancestor members to a scalar value

However we can't just use the bare Ancestors() expression in our query, nor can we use it as is to create a calculated member. That's because Ancestors() returns a set of members, while we want something that we can retrieve from the cells in the result.

As an initial attempt we can try and see if we can use the SetToStr() function, which takes a set as argument and returns a string representation of that set. So We can now finally write a query and it would look something like this:
WITH
MEMBER  Measures.[Time Ancestors]
AS      SetToStr(
          Ancestors(
            Time.CurrentMember, 
            CInt(
              Time.CurrentMember.Properties("LEVEL_NUMBER")
            )
          )
        )
SELECT  Measures.[Time Ancestors] ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
The results might look something like this:
Time Time Ancestors
QTR1 {[Time].[2003], [Time].[All Years]}
QTR2 {[Time].[2003], [Time].[All Years]}
QTR3 {[Time].[2003], [Time].[All Years]}
QTR4 {[Time].[2003], [Time].[All Years]}
QTR1 {[Time].[2004], [Time].[All Years]}
QTR2 {[Time].[2004], [Time].[All Years]}
QTR3 {[Time].[2004], [Time].[All Years]}
QTR4 {[Time].[2004], [Time].[All Years]}
QTR1 {[Time].[2005], [Time].[All Years]}
QTR2 {[Time].[2005], [Time].[All Years]}
Well this certainly looks like we're on the right track! However, there are at least two things that are not quite right:
  • The string representation returned by SetToStr() looks very much like how one would write the set down as a MDX set literal (is that a thing? It should be :-). While entirely correct, it does not look very friendly and it is certainly quite a bit different from what our GUI tools present to end-users
  • The order of the members. It looks like Ancestors() returns the members in order of upward traversal, that is to say, from lower levels (=higher level numbers) to higher levels (=lower level numbers). The fancy way of saying that is that our result suggests that Ancestors() returns its members in post-natural order. We'd like the members to be in natural order, that is to say, in descending order of level (from high to low). Note that the specification of Ancestors() does not specify or require any particular order. So in the general case we should not rely on the results to be in any particular order.
First, let's see if we can fix the order of ancestor members. There's two different MDX functions that seem to apply here:
  • Order() is general purpose function that can be used to order the members of a set by an arbitrary numberic expression.
  • Hierarchize() is designed to order members into hierarchical order, that is to say, the members are ordered by their level number and by the level number of any of its ancestors.
While Order() is a nice and reasonable choice, Hierarchize() seems tailored exactly for our purpose so that's what we'll use:
WITH
MEMBER  Measures.[Time Ancestors]
AS      SetToStr(
          Hierarchize(
            Ancestors(
              Time.CurrentMember, 
              CInt(
                Time.CurrentMember.Properties("LEVEL_NUMBER")
              )
            )
          )
        )
SELECT  Measures.[Time Ancestors] ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
And the result will now look like:
Time Time Ancestors
QTR1 {[Time].[All Years], [Time].[2003]}
QTR2 {[Time].[All Years], [Time].[2003]}
QTR3 {[Time].[All Years], [Time].[2003]}
QTR4 {[Time].[All Years], [Time].[2003]}
QTR1 {[Time].[All Years], [Time].[2004]}
QTR2 {[Time].[All Years], [Time].[2004]}
QTR3 {[Time].[All Years], [Time].[2004]}
QTR4 {[Time].[All Years], [Time].[2004]}
QTR1 {[Time].[All Years], [Time].[2005]}
QTR2 {[Time].[All Years], [Time].[2005]}
Now, as for obtaining a more friendly, human-readable string representation of the set, this is a considerably more open requirement. One the one hand there is the matter of how to represent each member in the ancestor set; on the other hand there is the matter of extracting this information from the resultset and using it in the GUI.

To represent members we have a handful of options: we could use the member name, or we could use its key value; however since we want to expose the information to the user, the only thing that seems really suitable is the member caption. Placing that data into the GUI is an implementation detail that need not concern us too much at this point. Let's say we aim to return the data as a comma-separated list, and assume our GUI tool is capable of extracting that data and then use it to render a result.

The function that seems to suit our need is called Generate(). There are actually 2 forms of Generate(), which frankly seem to suit completely different purposes. The form we're interested in is functionally quite similar to the MySQL-builtin aggregate function GROUP_CONCAT().

The arguments to this form of Generate() are:
  1. A set. This is where we'll feed the Ancestors() expression in
  2. A string expression. This expression will be evaluated for each member in the set passed as first argument. We'll use this to retrieve the caption of the current member of the hiearchy for which we're generating the ancestors list.
  3. A separator. Generate() concatenates the result values returned by the string expression passed as second argument, and this string will be used to separate those values. Since we want to obtain a comma-separated list, we'll use the literal string ", " for this argument.
The result is a single string value.

Putting it together, our query becomes:
WITH
MEMBER  Measures.[Time Ancestors]
AS      Generate(
          Hierarchize(
            Ancestors(
              Time.CurrentMember, 
              CInt(
                Time.CurrentMember.Properties("LEVEL_NUMBER")
              )
            )
          )
        , Time.CurrentMember.Properties("MEMBER_CAPTION")
        , ","
        )
SELECT  Measures.[Time Ancestors] ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
And the result:
Time Time Ancestors
QTR1 All Years,2003
QTR2 All Years,2003
QTR3 All Years,2003
QTR4 All Years,2003
QTR1 All Years,2004
QTR2 All Years,2004
QTR3 All Years,2004
QTR4 All Years,2004
QTR1 All Years,2005
QTR2 All Years,2005
And we can repeat this process for every hierarchy on every axis, just like we did with the Ancestor() function in the previous post.

Sunday, April 12, 2015

Retrieving denormalized tabular results with MDX

I've been trying to learn the MultiDimensional Expression language (MDX) for quite a while. Unfortunately, I haven't been very successful at it. MDX has a few conceptual and some (superficial) syntactical similarities to SQL, but I have arrived at the conclusion that it is probably better to treat it as a completely different kind of language.

I do not intend to write a post entirely about the similarities and differences between MDX and SQL. There are tons of articles that do that already. For example, this Microsoft Technet article puts it like this:
The Multidimensional Expressions (MDX) syntax appears, at first glance, to be remarkably similar to the syntax of Structured Query Language (SQL). In many ways, the functionality supplied by MDX is also similar to that of SQL; with effort, you can even duplicate some of the functionality provided by MDX in SQL.

However, there are some striking differences between SQL and MDX, and you should be aware of these differences at a conceptual level. The following information is intended to provide a guide to these conceptual differences between SQL and MDX, from the point of view of an SQL developer.
A discussion regarding the similarities and differences between MDX and SQL is, to me, quite like a discussion on the similarities and differences between crabs and spiders. They're similar as they both have way too many legs, and because neither speak English, you'll have a hard time getting them to behave according to your instructions. But there are differences too, since one of them is by any measure way more hairy than could ever be called reasonable, and traps you in its web before poisoning you with its venomous fangs, whereas the other is only hard-headed and pinches you with its razor-sharp claws; at least, if its not walking away from you backwards.

The reason why I included the quote from the Microsoft article is because it illustrates an important point about MDX: it is often regarded as something that extends SQL - something that allows you to do things that are very hard or even impossible to do in SQL.

Frankly, that notion is mostly true. You don't even need to spend a lot of time playing with MDX to realize that. Unfortunately, it would take me considerable time and effort to describe with words why this is so. And I wouldn't be able to do that without falling into the trap of describing similarities and mostly, lots of differences between MDX and SQL. A lot of it can be reduced by explaining the similarities and differences between the tables that SQL operates upon, and the multi-dimensional datasets called OLAP-Cubes, which is the type of data structure that MDX operates upon.

What might be less apparent from the quote from the Microsoft article is that sometimes, it may not be so straightforward to make MDX behave like SQL. I should point out right away that it is much easier to make MDX behave like SQL than the other way around. However, since most people do not desire to use MDX to do the kind of stuff that is normally done in SQL, it might be just a little bit obscure and hard to find examples that show you exactly how. That is what this blogpost aims to provide.

The inevitable example - a Sales cube

Let's assume we have a Sales Cube that describes the sales process of some sort of retail business. The cube might have quantitative measures like Sales Quantity and Sales price, and dimensions like Customer (the person that bought something), Product (the item that was sold), and Time (the date/time when the Sale took place). The image below illustrates this setup:



The image illustrates a (very basic) physical star schema, that is, a database model of a fact table surrounded by and linked to its dimension tables. In the top half of each box that represents a table we find the key column(s): in the fact table, the key is composite, and consists of foreign keys pointing to the dimension tables; in the dimension tables, the key is non-composite and referenced by the fact table via a foreign key. In the bottom half of each table box we find any non-key columns.

Although the image illustrates a database star schema, we can, with some effort, also read it as an OLAP-cube design. To do that, just pretend the key columns (those columns appear in the top section of each table) aren't there. For the fact table, think of the non-key columns as measures. For the dimensions, try to think of the non-key columns as the levels of a single hierarchy.

When working with a ROLAP engine like Mondrian (a.k.a. Pentaho Analysis) we can, in principle, literally maintain this 1:1 mapping between dimension tables and dimensions, dimension table columns and hierarchy levels, and fact table columns and measures.

In practical, real-world situations, the mapping between database schema and OLAP-cube doesn't necessarily have to be so straightforward. But I'm keeping things as simple as possible on purpose because it helps to illustrate the point I want to make.

MDX Queries against the Pentaho SteelWheels example Cube

For actual MDX query examples I will use the Pentaho SteelWheels sample database and Cube. While both that Cube and its underlying physical star schema are very simple, it is still a good deal more complex than my example. Just don't let that distract you: the essential elements, like a customer, date and product dimension are all there, as well as the measures for the sales quantity and sales amount.

To execute MDX queries, I like to use the Pentaho Analysis shell (Pash). Pentaho users can install Pash directly from the Pentaho marketplace. Instructions for installing and running Pash on other OLAP platforms, like icCube, Jasper Reports and others can be found in the Pash project README on github.

Sales Quantity per Quarter in MDX and SQL

Consider the following MDX-query against the SteelWheels Sales Cube:
SELECT  Measures.Quantity     ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
This very basic query basically says: get us the Sales Quantity per quarter. The result might be represented like this:
Time Quantity
QTR1 4561
QTR2 5695
QTR3 6629
QTR4 19554
QTR1 8694
QTR2 8443
QTR3 11311
QTR4 20969
QTR1 10995
QTR2 8480
If you're not familiar with MDX, it is worth pointing out that although the result looks just like an ordinary table like you would get as a result from a SQL query, it is actually not quite so simple. I'll get back to that in a bit.

If one were to write a SQL query that provides a similar result, it would look something like this:
SELECT      DateDimension.QuarterLevel
,           SUM(SalesFact.Quantity)     AS SalesQuantityMeasure
FROM        SalesFact
INNER JOIN  DateDimension
ON          SalesFact.DateDimensionKey = DateDimension.DateDimensionKey
GROUP BY    DateDimension.QuarterLevel

Sales Quantity per Quarter with the year in SQL

Now there's one problem with this query, or rather with its result. While the results are correct, they are hard to interpret since we do not have any indication what year the quarters belong to. In SQL, we might add a expression that retrieves the year to the SELECT and GROUP BY clauses to fix this:
SELECT      DateDimension.YearLevel
,           DateDimension.QuarterLevel
,           SUM(SalesFact.Quantity)     AS SalesQuantityMeasure
FROM        SalesFact
INNER JOIN  DateDimension
ON          SalesFact.DateDimensionKey = DateDimension.DateDimensionKey
GROUP BY    DateDimension.YearLevel
,           DateDimension.QuarterLevel
This extension of our previous SQL query basically says: give me the sales quantity per quarter, and also show the year to which the quarter belongs. It this would give us a result like this:
YearLevel QuarterLevel SalesQuantityMeasure
2003 QTR1 4561
2003 QTR2 5695
2003 QTR3 6629
2003 QTR4 19554
2004 QTR1 8694
2004 QTR2 8443
2004 QTR3 11311
2004 QTR4 20969
2005 QTR1 10995
2005 QTR2 8480

Trying the same thing in MDX

Now, considering the superficial syntactical similarities between the MDX and the SQL statement, we might be tempted to add an expression for the Year level to our MDX query as well in order to get a similar result:
SELECT  Measures.Quantity ON COLUMNS
,       {Time.Years.Members, Time.Quarters.Members} ON ROWS
FROM    SteelWheelsSales
For now, don't worry too much about the curly braces. Instead, take a look at the result:
Time Quantity
2003 36439
2004 49417
2005 19475
QTR1 4561
QTR2 5695
QTR3 6629
QTR4 19554
QTR1 8694
QTR2 8443
QTR3 11311
QTR4 20969
QTR1 10995
QTR2 8480
Well, there result certainly contains years, but not quite in the way you would've expected. At least, not if MDX would behave like SQL, which it clearly does not.

While adding the year expression the SQL query caused a new Year column to be added to the result. But doing the - superficially - similar thing in MDX did not change the number of "columns" of the result. Instead, it resulted in adding a number of new "rows" instead.

If you examine the value of the Quantity "column" for the first three rows, you might notice that the value there is quite a bit larger than for any of the quarters. And you even might notice that the value for 2003 is in fact QTR1: 4561 + QTR2: 5695 + QTR3: 6629 + QTR4: 19554 = 36439, and that the values for the other years are similarly the sum of the subsequent quarters.

This is of course no coincidence. The first 4 quarters belong to 2003, just like the second group of 4 quarters belong to 2004, and MDX "knows" this because of the way the cube is structured. But this is also the key to solving our problem: MDX offers functions that allow us to lookup related items of data. In this particular case, we can use the Ancestor() function to lookup the Year that corresponds to the the quarter.

Using the Ancestor() function in a Calculated Measure

The following query uses the Ancestor() function in a Calculated Member to lookup the value at the Year level for the current member on the Time hierarchy:
WITH
MEMBER  Measures.[Time.Year]
AS      Ancestor(
          Time.CurrentMember,
          Time.Years
        ).Properties("MEMBER_CAPTION")
SELECT  {Measures.[Time.Year]
        ,Measures.Quantity}   ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
The result is shown below:
Time Time.Year Quantity
QTR1 2003 4561
QTR2 2003 5695
QTR3 2003 6629
QTR4 2003 19554
QTR1 2004 8694
QTR2 2004 8443
QTR3 2004 11311
QTR4 2004 20969
QTR1 2005 10995
QTR2 2005 8480
Since the Calculated Member was declared to be part of the Measures hierarchy, it will be used to generate values to fill cells. For each quarter on the ROWS axis, it is evaluated. The first argument to our Ancestor() function is Time.CurrentMember and this will be the item for which we are looking for an Ancestor; in other words, we are looking up an ancestor of a quarter. The second argument is a level expression Time.Years, which tells the Ancestor() function that we want whatever ancestor item exists at the Year level for the first argument.

The remainder of the expression for the Calculated member, .Properties("MEMBER_CAPTION"), serves to extract the human readable friendly label for the found Ancestor, and this is the value that will finally be used to populate the cell.

Note: Many MDX engines, including Mondrian, support a shortcut syntax to retrieve the caption: instead of writing .Properties("MEMBER_CAPTION"), you can also simply write .Caption. Unfortunately, this shortcut syntax is not universally supported, while .Properties("MEMBER_CAPTION") should always be supported.

A general recipe for extracting denormalized tables with MDX

By comparing our last 2 MDX queries we can distill a general workflow to construct denormalized result tables using MDX
  1. Make a selection of all levels across all hierarchies that you want to see in your result. Let's take for example Product Line, Product Vendor and Months and Years
  2. The levels you selected in #1 each belong to a particular hierarchy. In this case: Product and Time. For each hierarchy, determine the lowest level in your selection. In this case: Product Vendor from the Product hierarchy and Months from the Time hierarchy.
  3. You can now use the levels you found in #2 to write the expression for the ROWS-axis of your MDX query. Append .Members to each the level name, and combine these expressions with each other using the CrossJoin() function. In this case, that expression looks like CrossJoin(Product.Vendor.Members, Time.Months.Members)
  4. Take the remaining levels from your selection in #1 (that is, those levels that you didn't put on the ROWS axis in step #3). In our case, those levels are Product.Line and Time.Years. Write a Calculated member on the Measures hierarchy that uses the Ancestor() function. To keep things clear, derive the name of the calculated member from the name of the hierarchy and the name of the level. So for example, the Calculated Member for the Product Line level will be Measures.[Product.Line] or something like it. As first argument to Ancestor(), write the name of the hierarchy, followed by .CurrentMember. For the second argument, specify the level itself. To extract the Caption, append .Properties("MEMBER_CAPTION") to the call to Ancestor(). In our case we get: MEMBER Measures.[Product.Line] AS Ancestor(Product.CurrentMember, Product.Line).Properties("MEMBER_CAPTION") and MEMBER Measures.[Time.Years] AS Ancestor(Time.CurrentMember, Time.Years).Properties("MEMBER_CAPTION").
  5. Construct a set for the the COLUMNS axis of your query, consisting of a comma-separated list of the names of the calculated members. In our case it would be {Measures.[Product.Vendor], Measures.[Time.Years]} ON COLUMNS.
  6. Finally, if you want to also select any "real" measure values, include the appropriate measure values into the list on the COLUMNS axis. Remember, the measure will be aggregated on the level you chose in step #2. Suppose we would want to include Sales Quantity on our example, we'd have to change the COLUMNS code we constructed in 5 to {Measures.[Product.Line], Measures.[Time.Years], Measures.[Quantity]} ON COLUMNS
This is the actual complete MDX statement:
WITH
MEMBER  Measures.[Product.Line]
AS      Ancestor(
          Product.CurrentMember,
          Product.Line
        ).Properties("MEMBER_CAPTION")
MEMBER  Measures.[Time.Years]
AS      Ancestor(
          Time.CurrentMember,
          Time.Years
        ).Properties("MEMBER_CAPTION")
SELECT  {Measures.[Product.Line]
        ,Measures.[Time.Years]
        ,Measures.Quantity}   ON COLUMNS
,       CrossJoin(
         Product.Vendor.Members
        ,Time.Quarters.Members
        ) ON ROWS
FROM    SteelWheelsSales
And the result looks something like:
Product Time Product.Line Time.Years Quantity
Autoart Studio Design QTR1 Classic Cars 2003 33
Autoart Studio Design QTR2 Classic Cars 2003 42

...many more rows...

Welly Diecast Productions QTR1 Vintage Cars 2005 76
Welly Diecast Productions QTR2 Vintage Cars 2005 113

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...