Monday, February 22, 2021

Year-to-Date on Synapse Analytics 4: Using UNION and GROUP BY

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 4th post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a UNION


We mentioned how the solution with the JOIN relates each row of the main set with a subset of "extra" rows over which the YTD value is calculated by aggregating over the key of the main set using a GROUP BY.

It may not be immediately obvious, but we can also use the SQL UNION (or rather, UNION ALL) operator to generate such a related subset. Just like with the JOIN-solution, this can then be aggregated using GROUP BY. An example will help to explain this:
select      SalesYear
,           SalesMonth
,           sum(SumOfSalesAmount)      as SumOfSalesAmount
,           sum(YtdOfSumOfSalesAmount) as YtdOfSumOfSalesAmount
from (
    select  SalesYear
    ,       SalesMonth
    ,       SumOfSalesAmount
    ,       SumOfSalesAmount           as YtdOfSumOfSalesAmount
    from    SalesYearMonth
    union all
    -- JANUARY
    select  SalesYear
    ,       SalesMonth + 1            -- february
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    select  SalesYear
    ,       SalesMonth + 2            -- march
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    
    ... and so on, all for JANUARY ...

    union all
    select  SalesYear
    ,       SalesMonth + 11            -- december
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    -- FEBRUARY
    select  SalesYear
    ,       SalesMonth + 1            -- march
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 2
    union all
    
    ... and so on, for the rest of FEBRUARY, 
        and then again for MARCH, APRIl, MAY, JUNE, JULY, AUGUST, SEPTEMBER, OCTOBER...   

    -- NOVEMBER
    select  SalesYear
    ,       SalesMonth + 1            -- december
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 11
) Sales
group by    SalesYear
,           SalesMonth

Duplicating metric-data so it contributes to the following months


In the top of the UNION we simply provide the entire resultset from SalesYearMonth, reporting the SumOfSalesAmount-metric as is, but also copying it to YtdSumOfSalesAmount. The other parts of the UNION are used to selectively duplicate the data for the SumOfSalesAmount-metric into the YtdSumOfSalesAmount, so that its data contributes to the YtdSumOfSalesAmount for all following months.

We start by grabbing january's data by applying the condition that demands that the SalesMonth equals 1:
-- JANUARY
select  SalesYear
,       SalesMonth + 1            -- february
,       null
,       SumOfSalesAmount
from    SalesYearMonth
where   SalesMonth = 1
union all
... repeat to duplicate january's data into february, march, and so on all the way up to december...
. This is done for a total of 11 times, each time adding 1, 2, 3 and so on - all the way up to 11 - to the SalesMonth attribute. This ensures january's data, as captured by the condition in the WHERE clause, is reported also in february (SalesMonth + 1), march (SalesMonth + 2), and so on, all the way up to december (SalesMonth + 11).

After the string of UNIONs for january appear more parts to duplicate the data also for february and all following months: -- FEBRUARY select SalesYear , SalesMonth + 1 -- march , null , SumOfSalesAmount from SalesYearMonth where SalesMonth = 02 union all ... repeat to duplicate february's data into march, april, and so on all the way up to december... . Again, february's data is selected by applying the condition
where   SalesMonth = 2
, and this happens now 10 times, again adding a number to the SalesMonth so it is duplicated to march, april, may, all the way up to december - in other words, all months following february.

What we thus did for january and februry is repeated for march, april, and so on for all months up to november. November is the last month we need to do this for: November's data still needs to be copied to december, but as december is the last month, that data only needs to be counted in december itself.

While it may seem wasteful to duplicate all this data, it really is not that different in that respect from the other solutions we've seen so far. It's just that now it's really in your face, because there is a pretty direct correspondence between the SQL code and the data sets that are being handled. The JOIN and subquery solutions hande similar amounts of data, it's just achieved with way less code, and in a far more implicit manner.

Original metric is retained


Note that the original metric also computes correctly, because the parts of the union only duplicate the data to the YTD column. The union parts that duplicate the data to the subsequent months select a NULL for the original metric. So the data for the original metric is never duplicated, and thus retains its normal value.

Drawbacks to the UNION-solution


The main drawback to the UNION-solution is its maintainability. A lot of code is required, far more than for any of the methods we have seen so far. Despite the indiviual patterns are simple (condition to get one month, adding a number to project that data to a future month), it is suprisingly easy to make a little mistake somewhere

We just argued that this solution is not so much different from the JOIN solution, but that remark only pertains to how the calculation is performed. The JOIN-solution generates the data it operates upon dynamically and declaratively; the UNION solution does this statically and explicitly. This is also why it is impossible to generalize ths approach for any arbitrary JOIN: YTD is a special case, because we know exactly how often we should duplicate the data as this is dictated by the cyclical structure of our calendar.

Next installment: Solution 4 - window functions


In the next installment we will present and discuss a solution based on a window functions.

No comments:

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...