Monday, February 22, 2021

Year-to-Date on Synapse Analytics 3: Using a Subquery

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 3rd post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a subquery

We can also think of YTD calculation as a separate query that we perform for each row of the SalesYearMonth table. While this does imply a row-by-row approach, we can still translate this easily to pure SQL by creating an expression in the SELECT-list, which uses a subquery to calculate the YTD value for the current row:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           SalesOriginal.SalesAmount
,           (
                select sum(SalesYtd.SalesAmount)
                from   SalesYearMonth as SalesYtd
                where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
                and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
            ) as SalesYtd
from        SalesYearMonth as SalesOriginal
There's a similarity with the JOIN-solution, in that we use the SalesYearMonth table twice, but in different roles. In the JOIN-solution both appeared on one side of the JOIN keyword and we used the aliases OriginalSales and YtdSales to be able to keep them apart. In the subquery approach, the distinction between these two different instances of the SalesYearMonth table is more explicit: the main instance of the SalesYearMonth table occurs in the FROM-clause, and the one for the YTD calculation occurs in the SELECT-list.

Also similar to the JOIN solution is the condition to tie the set for the YTD calculation to the main query using the SalesYear and SalesMonth columns. Such a subquery is referred to as a correlated subquery.

As for any differences with the JOIN solution: In the condition, the only difference is the left/right placement of SalesOriginal and SalesYtd, which is chosen only by order of appearance in the query but functionally completely equivalent. The most striking difference between the JOIN solution and the subquery is the absence of the GROUP BY-list in the latter.

Drawbacks of the subquery

As we had much to complain about the GROUP BY-list in the JOIN solution, it might seem that the subquery solution is somehow "better". However, a solution with a correlated subquery in general tends to be slower than a JOIN solution. Whether this is actually the case depends on on many variables and you'd really have to check it against your SQL engine and datasets.

Another drawback of the subquery solution becomes clear when we want to calculate the YTD for multiple measures. Our example only has one SalesAmount measure, but in this same context we can easily imagine that we also want to know about price, discount amounts, tax amounts, shipping costs, and so on.

In the JOIN solution, we would simply add any extra measures to the select list, using MAX() (or MIN() or AVG()) to obtain the original value, and SUM() to calculate its respective YTD value: As long as it's over the same set, the JOIN, its condition, and even the GROUP BY-list would remain the same, no matter for how many different measures we would add a YTD calculation.

This is very different in the subquery case. Each measure for which you need a YTD calculation would get its own subquery. Even though the condition would be the same for each such YTD calculation, you would still need to repeat the subquery code - one for each YTD measure.

Next installment: Solution 3 - using a UNION

In the next installment we will present and discuss a solution based on a UNION and a GROUP BY.

No comments:

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...