Monday, February 22, 2021

Year-to-Date on Synapse Analytics 1: Background

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

In this Installment

Context


Our customer is using an Azure Data Lake to store data from all kinds of source systems, including its SAP ERP system. Azure Synapse Analytics sits on top of the Data Lake and is used as analytics workhorse, but also to integrate various data sets present in the data lake. Front-end BI tools, such as Microsoft PowerBI, can then connect to Synapse and import or query the data from there.

In many cases, the datamarts presented by Synapse are pretty straightforward. Calculations and derived measures needed to build dashboards and data visualizations can typically be developed rather quickly inside the Power BI data model. Once the front-end development has stabilized, one can consider to refactor the solution and move parts away from the front-end and push them down to the backend for performance or maintainability.

(There are all kinds of opinions regarding data architecture and on when to put what where. We do not pretend to have the final answer to that, but the current workflow allows us to very quickly deliver solutions that can be used and verified by the users. At present I do not think we could achieve the same productivity if we would demand that everything be designed and built on the backend right from the get go.)

So, today we were refactoring some of the logic in a PowerBI model, including a Year-to-Date calculation. The solution we ended up implementing to solve it seems to work rather nicely so I figured to share it.

Year-to-Date value


What's a year to date (YTD) value? Basically it's the cumulative value over a metric in time, which resets once a year. In other words, the year to date value is the per-year total of the value achieved up to the current date.

This is best explained with an example. Consider the following dataset:

Date Value YTD Value
2012-01-1035,401.1435,401.14
2012-01-2015,012.1850,413.32
2012-02-0125,543.7175,957.03
2012-02-1032,115.41108,072.43
2012-02-2017,688.07125,760.50
2012-03-0110,556.53136,317.03
.........
2013-01-0119,623.9019,623.90
2013-01-108,351.1827,975.08
2013-01-2020,287.6548,262.73
2013-02-0133,055.6981,318.42

In the table above we have dates from two years - 2012 and 2013 - and for each date a Value.

For the first date encountered within a year, the YTD Value is equal to the Value itself; For each subsequent Date, the YTD Value is maintained as a running total of the values that appeared at the earlier dates.

So, 2012-01-10 is the first date we encounter in 2012 and therefore its YTD Value is equal to the Value at that date (35,401.14). The next date is 2012-01-20 and its Value is 15,012.18; therefore its YTD Value is 50,413.32, which is 15,012.18 + 35,401.14. The accumulation continues until we reach the last date of 2012.

At 2013-01-01 the first date of the next year, the YTD Value resets again to be equal to the Value, and then in the subsequent dates of 2013, the YTD Value again accumulates the current Value by adding it to the preceding YTD-value.

How to use YTD


You can use YTD values to analyze how well actual trends are developing over time as compared to a planning or predicition. By comparing the calculated YTD of a measure to a projected value (for example, a sales target), we can see how far off we are at any point in time.

If you gather these comparisons for a couple of moments in time, you can get a sense of the pace in which the actual situation is deviating from on converging to the target or the projected situation. These insights allow you to intervene in some way: maybe you need to adjust your planning, or change your expectations. Or maybe you need to adjust your efforts in order to more closely approximate your target.

Thinking about YTD as iteration


From the way we explained what a year-to-date value is, you might think about it as an actual "rolling sum". By that I mean, you might think about it as an iterative problem, that you solve by going through the rows, one by one. In pseudocode, such a solution would do something like:
    declare year, ytd
  
    loop through rows:
    
        if year equals row.year then 
        
          assign ytd + row.value to ytd
          
        else
        
          assign row.value to ytd
          assign row.year to year
          
        end if
        
    end loop through rows
While this approach would apparently give you the desired result, it does not help you to solve the problem in SQL directly. Pure SQL does not let you iterate rows like that, and it also does not let you work with variables like that.

Even with the iterative approach there is a hidden problem: the reset of the ytd variable and the update of the year variable that occurs whenever the row.year is different from the current value of the year variable will only work properly if the rows of one particular year are next to each other (like when the rows are ordered by year prior to iteration). The same applies within the year: the rows need to be sorted in chronological order, as the YTD value should reflect how much of the value was accumulated at that date within that year.

It may seem like a waste of time to think about an approach that is of no use to solving the problem. But this simple iterative approach provides a very simple recipe for quickly checking whether an actual solution behaves as expected. We'll use it later to veryify some results.

A set-oriented approach


To implement it in SQL we have to think in a set-oriented way. Conceptually, we can think about it as if we combine each row in the set with all of the other rows, forming a cartesian product, and then retain only those combinations that have identical values for year, but a smaller or equal value for the month.

This way, each row will combine with itself, and with all the other rows that chronologically precede it within the same year. The YTD value is then obtained by aggregating the rows over year and month value, summing the value to become the YTD value.

Sample Data


To play around a bit with the problem in SQL, let's set up a simple table:
create table SalesYearMonth (
  SalesYear   int
, SalesMonth  int
, SalesAmount decimal(20,2)
, primary key(SalesYear, SalesMonth)
);
And, here's some data:
insert into SalesYearMonth (
  SalesYear
, SalesMonth
, SalesAmount
) values (
 (2011,5,503805.92)
,(2011,6,458910.82)
,(2011,7,2044600.00)
,(2011,8,2495816.73)
,(2011,9,502073.85)
,(2011,10,4588761.82)
,(2011,11,737839.82)
,(2011,12,1309863.25)
,(2012,1,3970627.28)
,(2012,2,1475426.91)
,(2012,3,2975748.24)
,(2012,4,1634600.80)
,(2012,5,3074602.81)
,(2012,6,4099354.36)
,(2012,7,3417953.87)
,(2012,8,2175637.22)
,(2012,9,3454151.94)
,(2012,10,2544091.11)
,(2012,11,1872701.98)
,(2012,12,2829404.82)
,(2013,1,2087872.46)
,(2013,2,2316922.15)
,(2013,3,3412068.97)
,(2013,4,2532265.91)
,(2013,5,3245623.76)
,(2013,6,5081069.13)
,(2013,7,4896353.74)
,(2013,8,3333964.07)
,(2013,9,4532908.71)
,(2013,10,4795813.29)
,(2013,11,3312130.25)
,(2013,12,4075486.63)
,(2014,1,4289817.95)
,(2014,2,1337725.04)
,(2014,3,7217531.09)
,(2014,4,1797173.92)
,(2014,5,5366674.97)
,(2014,6,49005.84);
This setup is slightly different from the original problem statement. Instead of a column with DATE data type, we have separate SalesYear and SalesMonth columns. This is fine - it doesn't change the problem or the solution in any way.

In fact, this setup allows us to think about the essential elements of the problem without having to worry about the details of getting to that point. Once we done that, we can apply the approach to a more realistic case.

Next installment: Solution 1 - a self-JOIN


In the next installment we will present and discuss a solution based on a self-JOIN and a GROUP BY.

No comments:

SAP HANA Trick: DISTINCT STRING_AGG

Nowadays, many SQL implementations offer some form of aggregate string concatenation function. Being an aggregate function, it has the effe...