Monday, February 22, 2021

Year-to-Date on Synapse Analytics 2: Using a self-JOIN and GROUP BY

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 2nd post in a series. For sample data and setup, please see the first post in this series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a self-JOIN


The recipe for the set-oriented approach can be directly translated to SQL:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           max(SalesOriginal.SalesAmount) as SalesAmount
,           sum(SalesYtd.SalesAmount)      as SalesYtd
from        SalesYearMonth as SalesOriginal
inner join  SalesYearMonth as SalesYtd
on          SalesOriginal.SalesYear  = SalesYtd.SalesYear
and         SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
group by    SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth

The self-JOIN


In our discussion of the set-oriented approach we mentioned combining the rows from the table with each other to produce all different combinations. In the code sample about, the JOIN-clause takes care of that aspect.

As you can see, the SalesYearMonth table appears twice: on the left hand and on the right hand of the JOIN-keyword, but using different aliases: SalesOriginal and SalesYtd. It is a so-called self-join.

Even though both aliases refer to an instance of the same SalesYearMonth base table, each has a very different role. We can think of the one with the SalesOriginal alias as really the SalesYearMonth table itself. The SalesYtd alias refers to an instance of the SalesYearMonth table that, for any given row from SalesOriginal, represents a subset of rows that chronologically precedes the row from SalesOriginal.

The ON-clause that follows controls which combinations should be retained: for each particular row of SalesOriginal we only want to consider rows from SalesYtd from the same year, which is why the first predicate in the ON-clause is:
SalesOriginal.SalesYear  = SalesYtd.SalesYear
Within that year, we only want to consider rows that precede it chronologically, and that explains the second predicate:
SalesOriginal.SalesMonth >= SalesYtd.SalesMonth

GROUP BY and SUM()


It is is important to realize the JOIN is only half of the solution.

While the JOIN takes care of gathering and combining all related rows necessary to compute the YTD value, the actual calculation is done by the SUM() function in the SELECT-list, and the GROUP BY defines which rows should be taken together to be summed.

In summary:
  • the JOIN generates new rows by combining rows from its left-hand table with the rows from its right-hand table, bound by the condition in the ON-clause.
  • The GROUP BY partitions the rows into subsets having the same combinations of values for SalesYear and SalesMonth.
  • The SUM() aggregates the rows in each SalesYear, SalesMonth partition, turning its associated set of rows into one single row, while adding the values of the SalesAmount column together.
Note that the columns in the GROUP BY list are qualified by the SalesOriginal alias - and not SalesYtd. Also note that the GROUP BY columns form the key of the original SalesYearMonth table - together they uniquely identify a single row from the SalesYearMonth table. This is not a coincidence: it expresses precisely that SalesOriginal really has the role of being just itself - the SalesYearMonth table.

What about the other columns?


The GROUP BY affects treatment of the non-key columns as well. In this overly simple example, we had only one other column - OriginalSales.SalesAmount.

(Note that this is different from YtdSales.SalesAmount, which we aggregated using SUM() to calculate the YTD value)

Since OriginalSales.SalesAmount comes from the SalesOriginal instance of the SalesYearMonth table, we can reason that after the GROUP BY on the key columns SalesYear and SalesMonth, there must be exactly one SalesAmount value for each distinct combination of SalesYear and SalesMonth. In other words, SalesAmount is functionally dependent on SalesYear and SalesMonth.

Some SQL engines are smart enough to realize this and will let you refer to any expression that is functionally dependent upon the expressions in the GROUP BY-list in the SELECT-list. Unfortunately, Synapse and MS SQL Server are not among these and if we try we will get an Error:
Msg 8120, Level 16, State 1, Line 11
Column 'Sales.SalesAmount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
The error message suggets we can do two things to solve it:
  • either we aggregate by wrapping the SalesOriginal.SalesAmount-expression into some aggregate function
  • or we expand the GROUP BY-list and add the SalesOriginal.SalesAmount-expression there.
To me, neither feels quite right.

SalesAmount is clearly intended as a measure, and it feels weird to treat them the same as the attributes SalesYear and SalesMonth. So adding it to the GROUP BY-list feels like the wrong choice. Besides, it also makes the code less maintainable, as each such column will now appear twice: once in the SELECT-list, where we need it no matter what, and once again in the GROUP BY-list, just to satisfy the SQL engine.

So, if we don't want to put it in the GROUP BY-list, we are going to need to wrap it in an aggregate function. We just mentioned that SalesAmount is a measure and therefore that does not sound unreasonable. However, we have to be careful which one we choose.

One would normally use SalesAmount as an additive measure and be able to use SUM() for that. But here, in this context, SUM() is definitily the wrong choice!

All we want to do is to "get" back" whatever value we had for SalesAmount, in other words, unaffected by the whole routine of join-and-then-aggregate, which we did only to calculate the YTD value. The "extra" rows generated by the JOIN are only needed to do the YTD calculation and should not affect any of the other measures. Using SUM() would simply add the SalesAmount just as many times as there are preceding rows in the current year, which simply does not have any meaningful application.

What we want instead is to report back the original SalesAmount for any given SalesYear, SalesMonth combination. We just reasoned that there will be just one distinct SalesOriginal.SalesAmount value for any combination of values in SalesOriginal.SalesYear, SalesOriginal.SalesMonth, and it would be great if we had an aggregate function that would simply pick the SalesOriginal.SalesAmount value from any of those rows. To the best of my knowledge, no such aggregate function exists in MS SQL Server or Synapse Analytics.

We can use MAX() or MIN(), or even AVG(). While this would all work and deliver the intended result, it still feels wrong as it seems wasteful to ask the SQL engine to do some calculation on a set of values while it could pick just any value.

Next installment: Solution 2 - using a subquery


In the next installment we will present and discuss a solution based on a subquery.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...