Sunday, April 12, 2015

Retrieving denormalized tabular results with MDX

I've been trying to learn the MultiDimensional Expression language (MDX) for quite a while. Unfortunately, I haven't been very successful at it. MDX has a few conceptual and some (superficial) syntactical similarities to SQL, but I have arrived at the conclusion that it is probably better to treat it as a completely different kind of language.

I do not intend to write a post entirely about the similarities and differences between MDX and SQL. There are tons of articles that do that already. For example, this Microsoft Technet article puts it like this:
The Multidimensional Expressions (MDX) syntax appears, at first glance, to be remarkably similar to the syntax of Structured Query Language (SQL). In many ways, the functionality supplied by MDX is also similar to that of SQL; with effort, you can even duplicate some of the functionality provided by MDX in SQL.

However, there are some striking differences between SQL and MDX, and you should be aware of these differences at a conceptual level. The following information is intended to provide a guide to these conceptual differences between SQL and MDX, from the point of view of an SQL developer.
A discussion regarding the similarities and differences between MDX and SQL is, to me, quite like a discussion on the similarities and differences between crabs and spiders. They're similar as they both have way too many legs, and because neither speak English, you'll have a hard time getting them to behave according to your instructions. But there are differences too, since one of them is by any measure way more hairy than could ever be called reasonable, and traps you in its web before poisoning you with its venomous fangs, whereas the other is only hard-headed and pinches you with its razor-sharp claws; at least, if its not walking away from you backwards.

The reason why I included the quote from the Microsoft article is because it illustrates an important point about MDX: it is often regarded as something that extends SQL - something that allows you to do things that are very hard or even impossible to do in SQL.

Frankly, that notion is mostly true. You don't even need to spend a lot of time playing with MDX to realize that. Unfortunately, it would take me considerable time and effort to describe with words why this is so. And I wouldn't be able to do that without falling into the trap of describing similarities and mostly, lots of differences between MDX and SQL. A lot of it can be reduced by explaining the similarities and differences between the tables that SQL operates upon, and the multi-dimensional datasets called OLAP-Cubes, which is the type of data structure that MDX operates upon.

What might be less apparent from the quote from the Microsoft article is that sometimes, it may not be so straightforward to make MDX behave like SQL. I should point out right away that it is much easier to make MDX behave like SQL than the other way around. However, since most people do not desire to use MDX to do the kind of stuff that is normally done in SQL, it might be just a little bit obscure and hard to find examples that show you exactly how. That is what this blogpost aims to provide.

The inevitable example - a Sales cube

Let's assume we have a Sales Cube that describes the sales process of some sort of retail business. The cube might have quantitative measures like Sales Quantity and Sales price, and dimensions like Customer (the person that bought something), Product (the item that was sold), and Time (the date/time when the Sale took place). The image below illustrates this setup:



The image illustrates a (very basic) physical star schema, that is, a database model of a fact table surrounded by and linked to its dimension tables. In the top half of each box that represents a table we find the key column(s): in the fact table, the key is composite, and consists of foreign keys pointing to the dimension tables; in the dimension tables, the key is non-composite and referenced by the fact table via a foreign key. In the bottom half of each table box we find any non-key columns.

Although the image illustrates a database star schema, we can, with some effort, also read it as an OLAP-cube design. To do that, just pretend the key columns (those columns appear in the top section of each table) aren't there. For the fact table, think of the non-key columns as measures. For the dimensions, try to think of the non-key columns as the levels of a single hierarchy.

When working with a ROLAP engine like Mondrian (a.k.a. Pentaho Analysis) we can, in principle, literally maintain this 1:1 mapping between dimension tables and dimensions, dimension table columns and hierarchy levels, and fact table columns and measures.

In practical, real-world situations, the mapping between database schema and OLAP-cube doesn't necessarily have to be so straightforward. But I'm keeping things as simple as possible on purpose because it helps to illustrate the point I want to make.

MDX Queries against the Pentaho SteelWheels example Cube

For actual MDX query examples I will use the Pentaho SteelWheels sample database and Cube. While both that Cube and its underlying physical star schema are very simple, it is still a good deal more complex than my example. Just don't let that distract you: the essential elements, like a customer, date and product dimension are all there, as well as the measures for the sales quantity and sales amount.

To execute MDX queries, I like to use the Pentaho Analysis shell (Pash). Pentaho users can install Pash directly from the Pentaho marketplace. Instructions for installing and running Pash on other OLAP platforms, like icCube, Jasper Reports and others can be found in the Pash project README on github.

Sales Quantity per Quarter in MDX and SQL

Consider the following MDX-query against the SteelWheels Sales Cube:
SELECT  Measures.Quantity     ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
This very basic query basically says: get us the Sales Quantity per quarter. The result might be represented like this:
Time Quantity
QTR1 4561
QTR2 5695
QTR3 6629
QTR4 19554
QTR1 8694
QTR2 8443
QTR3 11311
QTR4 20969
QTR1 10995
QTR2 8480
If you're not familiar with MDX, it is worth pointing out that although the result looks just like an ordinary table like you would get as a result from a SQL query, it is actually not quite so simple. I'll get back to that in a bit.

If one were to write a SQL query that provides a similar result, it would look something like this:
SELECT      DateDimension.QuarterLevel
,           SUM(SalesFact.Quantity)     AS SalesQuantityMeasure
FROM        SalesFact
INNER JOIN  DateDimension
ON          SalesFact.DateDimensionKey = DateDimension.DateDimensionKey
GROUP BY    DateDimension.QuarterLevel

Sales Quantity per Quarter with the year in SQL

Now there's one problem with this query, or rather with its result. While the results are correct, they are hard to interpret since we do not have any indication what year the quarters belong to. In SQL, we might add a expression that retrieves the year to the SELECT and GROUP BY clauses to fix this:
SELECT      DateDimension.YearLevel
,           DateDimension.QuarterLevel
,           SUM(SalesFact.Quantity)     AS SalesQuantityMeasure
FROM        SalesFact
INNER JOIN  DateDimension
ON          SalesFact.DateDimensionKey = DateDimension.DateDimensionKey
GROUP BY    DateDimension.YearLevel
,           DateDimension.QuarterLevel
This extension of our previous SQL query basically says: give me the sales quantity per quarter, and also show the year to which the quarter belongs. It this would give us a result like this:
YearLevel QuarterLevel SalesQuantityMeasure
2003 QTR1 4561
2003 QTR2 5695
2003 QTR3 6629
2003 QTR4 19554
2004 QTR1 8694
2004 QTR2 8443
2004 QTR3 11311
2004 QTR4 20969
2005 QTR1 10995
2005 QTR2 8480

Trying the same thing in MDX

Now, considering the superficial syntactical similarities between the MDX and the SQL statement, we might be tempted to add an expression for the Year level to our MDX query as well in order to get a similar result:
SELECT  Measures.Quantity ON COLUMNS
,       {Time.Years.Members, Time.Quarters.Members} ON ROWS
FROM    SteelWheelsSales
For now, don't worry too much about the curly braces. Instead, take a look at the result:
Time Quantity
2003 36439
2004 49417
2005 19475
QTR1 4561
QTR2 5695
QTR3 6629
QTR4 19554
QTR1 8694
QTR2 8443
QTR3 11311
QTR4 20969
QTR1 10995
QTR2 8480
Well, there result certainly contains years, but not quite in the way you would've expected. At least, not if MDX would behave like SQL, which it clearly does not.

While adding the year expression the SQL query caused a new Year column to be added to the result. But doing the - superficially - similar thing in MDX did not change the number of "columns" of the result. Instead, it resulted in adding a number of new "rows" instead.

If you examine the value of the Quantity "column" for the first three rows, you might notice that the value there is quite a bit larger than for any of the quarters. And you even might notice that the value for 2003 is in fact QTR1: 4561 + QTR2: 5695 + QTR3: 6629 + QTR4: 19554 = 36439, and that the values for the other years are similarly the sum of the subsequent quarters.

This is of course no coincidence. The first 4 quarters belong to 2003, just like the second group of 4 quarters belong to 2004, and MDX "knows" this because of the way the cube is structured. But this is also the key to solving our problem: MDX offers functions that allow us to lookup related items of data. In this particular case, we can use the Ancestor() function to lookup the Year that corresponds to the the quarter.

Using the Ancestor() function in a Calculated Measure

The following query uses the Ancestor() function in a Calculated Member to lookup the value at the Year level for the current member on the Time hierarchy:
WITH
MEMBER  Measures.[Time.Year]
AS      Ancestor(
          Time.CurrentMember,
          Time.Years
        ).Properties("MEMBER_CAPTION")
SELECT  {Measures.[Time.Year]
        ,Measures.Quantity}   ON COLUMNS
,       Time.Quarters.Members ON ROWS
FROM    SteelWheelsSales
The result is shown below:
Time Time.Year Quantity
QTR1 2003 4561
QTR2 2003 5695
QTR3 2003 6629
QTR4 2003 19554
QTR1 2004 8694
QTR2 2004 8443
QTR3 2004 11311
QTR4 2004 20969
QTR1 2005 10995
QTR2 2005 8480
Since the Calculated Member was declared to be part of the Measures hierarchy, it will be used to generate values to fill cells. For each quarter on the ROWS axis, it is evaluated. The first argument to our Ancestor() function is Time.CurrentMember and this will be the item for which we are looking for an Ancestor; in other words, we are looking up an ancestor of a quarter. The second argument is a level expression Time.Years, which tells the Ancestor() function that we want whatever ancestor item exists at the Year level for the first argument.

The remainder of the expression for the Calculated member, .Properties("MEMBER_CAPTION"), serves to extract the human readable friendly label for the found Ancestor, and this is the value that will finally be used to populate the cell.

Note: Many MDX engines, including Mondrian, support a shortcut syntax to retrieve the caption: instead of writing .Properties("MEMBER_CAPTION"), you can also simply write .Caption. Unfortunately, this shortcut syntax is not universally supported, while .Properties("MEMBER_CAPTION") should always be supported.

A general recipe for extracting denormalized tables with MDX

By comparing our last 2 MDX queries we can distill a general workflow to construct denormalized result tables using MDX
  1. Make a selection of all levels across all hierarchies that you want to see in your result. Let's take for example Product Line, Product Vendor and Months and Years
  2. The levels you selected in #1 each belong to a particular hierarchy. In this case: Product and Time. For each hierarchy, determine the lowest level in your selection. In this case: Product Vendor from the Product hierarchy and Months from the Time hierarchy.
  3. You can now use the levels you found in #2 to write the expression for the ROWS-axis of your MDX query. Append .Members to each the level name, and combine these expressions with each other using the CrossJoin() function. In this case, that expression looks like CrossJoin(Product.Vendor.Members, Time.Months.Members)
  4. Take the remaining levels from your selection in #1 (that is, those levels that you didn't put on the ROWS axis in step #3). In our case, those levels are Product.Line and Time.Years. Write a Calculated member on the Measures hierarchy that uses the Ancestor() function. To keep things clear, derive the name of the calculated member from the name of the hierarchy and the name of the level. So for example, the Calculated Member for the Product Line level will be Measures.[Product.Line] or something like it. As first argument to Ancestor(), write the name of the hierarchy, followed by .CurrentMember. For the second argument, specify the level itself. To extract the Caption, append .Properties("MEMBER_CAPTION") to the call to Ancestor(). In our case we get: MEMBER Measures.[Product.Line] AS Ancestor(Product.CurrentMember, Product.Line).Properties("MEMBER_CAPTION") and MEMBER Measures.[Time.Years] AS Ancestor(Time.CurrentMember, Time.Years).Properties("MEMBER_CAPTION").
  5. Construct a set for the the COLUMNS axis of your query, consisting of a comma-separated list of the names of the calculated members. In our case it would be {Measures.[Product.Vendor], Measures.[Time.Years]} ON COLUMNS.
  6. Finally, if you want to also select any "real" measure values, include the appropriate measure values into the list on the COLUMNS axis. Remember, the measure will be aggregated on the level you chose in step #2. Suppose we would want to include Sales Quantity on our example, we'd have to change the COLUMNS code we constructed in 5 to {Measures.[Product.Line], Measures.[Time.Years], Measures.[Quantity]} ON COLUMNS
This is the actual complete MDX statement:
WITH
MEMBER  Measures.[Product.Line]
AS      Ancestor(
          Product.CurrentMember,
          Product.Line
        ).Properties("MEMBER_CAPTION")
MEMBER  Measures.[Time.Years]
AS      Ancestor(
          Time.CurrentMember,
          Time.Years
        ).Properties("MEMBER_CAPTION")
SELECT  {Measures.[Product.Line]
        ,Measures.[Time.Years]
        ,Measures.Quantity}   ON COLUMNS
,       CrossJoin(
         Product.Vendor.Members
        ,Time.Quarters.Members
        ) ON ROWS
FROM    SteelWheelsSales
And the result looks something like:
Product Time Product.Line Time.Years Quantity
Autoart Studio Design QTR1 Classic Cars 2003 33
Autoart Studio Design QTR2 Classic Cars 2003 42

...many more rows...

Welly Diecast Productions QTR1 Vintage Cars 2005 76
Welly Diecast Productions QTR2 Vintage Cars 2005 113

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...