(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)
TL;DR: A Year-to-Date solution based on a
SUM()
window function is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a self-JOIN
(combined with a GROUP BY
), a subquery, and a UNION
(also combined with a GROUP BY
).
In this Installment
- A definition of what Year-to-Date (YTD) is
- Some background on why we are interested in calculating the YTD using Synapse Analytics / MS SQL Server
- How to calculate YTD using an iterative and a set-oriented approach
- A sample table and dataset which will be used in the next installments to demonstrate our sample code.
Context
Our customer is using an Azure Data Lake to store data from all kinds of source systems, including its SAP ERP system. Azure Synapse Analytics sits on top of the Data Lake and is used as analytics workhorse, but also to integrate various data sets present in the data lake. Front-end BI tools, such as Microsoft PowerBI, can then connect to Synapse and import or query the data from there.
In many cases, the datamarts presented by Synapse are pretty straightforward. Calculations and derived measures needed to build dashboards and data visualizations can typically be developed rather quickly inside the Power BI data model. Once the front-end development has stabilized, one can consider to refactor the solution and move parts away from the front-end and push them down to the backend for performance or maintainability.
(There are all kinds of opinions regarding data architecture and on when to put what where. We do not pretend to have the final answer to that, but the current workflow allows us to very quickly deliver solutions that can be used and verified by the users. At present I do not think we could achieve the same productivity if we would demand that everything be designed and built on the backend right from the get go.)
So, today we were refactoring some of the logic in a PowerBI model, including a Year-to-Date calculation. The solution we ended up implementing to solve it seems to work rather nicely so I figured to share it.
Year-to-Date value
What's a year to date (YTD) value? Basically it's the cumulative value over a metric in time, which resets once a year. In other words, the year to date value is the per-year total of the value achieved up to the current date.
This is best explained with an example. Consider the following dataset:
Date
Value
YTD Value
2012-01-10 35,401.14 35,401.14
2012-01-20 15,012.18 50,413.32
2012-02-01 25,543.71 75,957.03
2012-02-10 32,115.41 108,072.43
2012-02-20 17,688.07 125,760.50
2012-03-01 10,556.53 136,317.03
... ... ...
2013-01-01 19,623.90 19,623.90
2013-01-10 8,351.18 27,975.08
2013-01-20 20,287.65 48,262.73
2013-02-01 33,055.69 81,318.42
In the table above we have dates from two years -
2012
and 2013
- and for each date a Value
.
For the first date encountered within a year, the
YTD Value
is equal to the Value
itself;
For each subsequent Date
, the YTD Value
is maintained as a running total of the values that appeared at the earlier dates.
So,
2012-01-10
is the first date we encounter in 2012
and therefore its YTD Value
is equal to the Value
at that date (35,401.14
).
The next date is 2012-01-20
and its Value
is 15,012.18
; therefore its YTD Value
is 50,413.32
, which is 15,012.18 + 35,401.14
.
The accumulation continues until we reach the last date of 2012
.
At
2013-01-01
the first date of the next year, the YTD Value
resets again to be equal to the Value
, and then in the subsequent dates of 2013
, the YTD Value
again accumulates the current Value
by adding it to the preceding YTD-value
.
How to use YTD
You can use YTD values to analyze how well actual trends are developing over time as compared to a planning or predicition. By comparing the calculated YTD of a measure to a projected value (for example, a sales target), we can see how far off we are at any point in time.
If you gather these comparisons for a couple of moments in time, you can get a sense of the pace in which the actual situation is deviating from on converging to the target or the projected situation. These insights allow you to intervene in some way: maybe you need to adjust your planning, or change your expectations. Or maybe you need to adjust your efforts in order to more closely approximate your target.
Thinking about YTD as iteration
From the way we explained what a year-to-date value is, you might think about it as an actual "rolling sum". By that I mean, you might think about it as an iterative problem, that you solve by going through the rows, one by one. In pseudocode, such a solution would do something like:
declare year, ytd loop through rows: if year equals row.year then assign ytd + row.value to ytd else assign row.value to ytd assign row.year to year end if end loop through rowsWhile this approach would apparently give you the desired result, it does not help you to solve the problem in SQL directly. Pure SQL does not let you iterate rows like that, and it also does not let you work with variables like that.
Even with the iterative approach there is a hidden problem: the reset of the
ytd
variable and the update of the year
variable that occurs whenever the row.year
is different from the current value of the year
variable will only work properly if the rows of one particular year are next to each other (like when the rows are ordered by year prior to iteration).
The same applies within the year: the rows need to be sorted in chronological order, as the YTD value should reflect how much of the value was accumulated at that date within that year.
It may seem like a waste of time to think about an approach that is of no use to solving the problem. But this simple iterative approach provides a very simple recipe for quickly checking whether an actual solution behaves as expected. We'll use it later to veryify some results.
A set-oriented approach
To implement it in SQL we have to think in a set-oriented way. Conceptually, we can think about it as if we combine each row in the set with all of the other rows, forming a cartesian product, and then retain only those combinations that have identical values for
year
, but a smaller or equal value for the month
.
This way, each row will combine with itself, and with all the other rows that chronologically precede it within the same year. The YTD value is then obtained by aggregating the rows over
year
and month
value, summing the value to become the YTD value.
Sample Data
To play around a bit with the problem in SQL, let's set up a simple table:
create table SalesYearMonth ( SalesYear int , SalesMonth int , SalesAmount decimal(And, here's some data:20
,2
) , primary key(SalesYear, SalesMonth) );
insert into SalesYearMonth ( SalesYear , SalesMonth , SalesAmount ) values ( (This setup is slightly different from the original problem statement. Instead of a column with2011
,5
,503805.92
) ,(2011
,6
,458910.82
) ,(2011
,7
,2044600.00
) ,(2011
,8
,2495816.73
) ,(2011
,9
,502073.85
) ,(2011
,10
,4588761.82
) ,(2011
,11
,737839.82
) ,(2011
,12
,1309863.25
) ,(2012
,1
,3970627.28
) ,(2012
,2
,1475426.91
) ,(2012
,3
,2975748.24
) ,(2012
,4
,1634600.80
) ,(2012
,5
,3074602.81
) ,(2012
,6
,4099354.36
) ,(2012
,7
,3417953.87
) ,(2012
,8
,2175637.22
) ,(2012
,9
,3454151.94
) ,(2012
,10
,2544091.11
) ,(2012
,11
,1872701.98
) ,(2012
,12
,2829404.82
) ,(2013
,1
,2087872.46
) ,(2013
,2
,2316922.15
) ,(2013
,3
,3412068.97
) ,(2013
,4
,2532265.91
) ,(2013
,5
,3245623.76
) ,(2013
,6
,5081069.13
) ,(2013
,7
,4896353.74
) ,(2013
,8
,3333964.07
) ,(2013
,9
,4532908.71
) ,(2013
,10
,4795813.29
) ,(2013
,11
,3312130.25
) ,(2013
,12
,4075486.63
) ,(2014
,1
,4289817.95
) ,(2014
,2
,1337725.04
) ,(2014
,3
,7217531.09
) ,(2014
,4
,1797173.92
) ,(2014
,5
,5366674.97
) ,(2014
,6
,49005.84
);
DATE
data type, we have separate SalesYear
and SalesMonth
columns.
This is fine - it doesn't change the problem or the solution in any way.
In fact, this setup allows us to think about the essential elements of the problem without having to worry about the details of getting to that point. Once we done that, we can apply the approach to a more realistic case.
Next installment: Solution 1 - a self-JOIN
In the next installment we will present and discuss a solution based on a self-
JOIN
and a GROUP BY
.
No comments:
Post a Comment