(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

**TL;DR**: A Year-to-Date solution based on a

**SUM**()

window function is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a self-`JOIN`

(combined with a `GROUP BY`

), a subquery, and a `UNION`

(also combined with a `GROUP BY`

).
### In this Installment

- A definition of what Year-to-Date (YTD) is
- Some background on why we are interested in calculating the YTD using Synapse Analytics / MS SQL Server
- How to calculate YTD using an iterative and a set-oriented approach
- A sample table and dataset which will be used in the next installments to demonstrate our sample code.

### Context

Our customer is using an Azure Data Lake to store data from all kinds of source systems, including its SAP ERP system. Azure Synapse Analytics sits on top of the Data Lake and is used as analytics workhorse, but also to integrate various data sets present in the data lake. Front-end BI tools, such as Microsoft PowerBI, can then connect to Synapse and import or query the data from there.

In many cases, the datamarts presented by Synapse are pretty straightforward. Calculations and derived measures needed to build dashboards and data visualizations can typically be developed rather quickly inside the Power BI data model. Once the front-end development has stabilized, one can consider to refactor the solution and move parts away from the front-end and push them down to the backend for performance or maintainability.

(There are all kinds of opinions regarding data architecture and on when to put what where. We do not pretend to have the final answer to that, but the current workflow allows us to very quickly deliver solutions that can be used and verified by the users. At present I do not think we could achieve the same productivity if we would demand that everything be designed and built on the backend right from the get go.)

So, today we were refactoring some of the logic in a PowerBI model, including a Year-to-Date calculation. The solution we ended up implementing to solve it seems to work rather nicely so I figured to share it.

### Year-to-Date value

What's a year to date (YTD) value? Basically it's the cumulative value over a metric in time, which resets once a year. In other words, the year to date value is the per-year total of the value achieved up to the current date.

This is best explained with an example. Consider the following dataset:

Date
Value
YTD Value
2012-01-10 35,401.14 35,401.14
2012-01-20 15,012.18 50,413.32
2012-02-01 25,543.71 75,957.03
2012-02-10 32,115.41 108,072.43
2012-02-20 17,688.07 125,760.50
2012-03-01 10,556.53 136,317.03
... ... ...
2013-01-01 19,623.90 19,623.90
2013-01-10 8,351.18 27,975.08
2013-01-20 20,287.65 48,262.73
2013-02-01 33,055.69 81,318.42

In the table above we have dates from two years -

`2012`

and `2013`

- and for each date a `Value`

.
For the first date encountered within a year, the

`YTD Value`

is equal to the `Value`

itself;
For each subsequent `Date`

, the `YTD Value`

is maintained as a running total of the values that appeared at the earlier dates.
So,

`2012-01-10`

is the first date we encounter in `2012`

and therefore its `YTD Value`

is equal to the `Value`

at that date (`35,401.14`

).
The next date is `2012-01-20`

and its `Value`

is `15,012.18`

; therefore its `YTD Value`

is `50,413.32`

, which is `15,012.18 + 35,401.14`

.
The accumulation continues until we reach the last date of `2012`

.
At

`2013-01-01`

the first date of the next year, the `YTD Value`

resets again to be equal to the `Value`

, and then in the subsequent dates of `2013`

, the `YTD Value`

again accumulates the current `Value`

by adding it to the preceding `YTD-value`

.
#### How to use YTD

You can use YTD values to analyze how well actual trends are developing over time as compared to a planning or predicition. By comparing the calculated YTD of a measure to a projected value (for example, a sales target), we can see how far off we are at any point in time.

If you gather these comparisons for a couple of moments in time, you can get a sense of the pace in which the actual situation is deviating from on converging to the target or the projected situation. These insights allow you to intervene in some way: maybe you need to adjust your planning, or change your expectations. Or maybe you need to adjust your efforts in order to more closely approximate your target.

### Thinking about YTD as iteration

From the way we explained what a year-to-date value is, you might think about it as an

*actual*"rolling sum". By that I mean, you might think about it as an iterative problem, that you solve by going through the rows, one by one. In pseudocode, such a solution would do something like:

While this approach would apparently give you the desired result, it does not help you to solve the problem in SQL directly. Pure SQL does not let you iterate rows like that, and it also does not let you work with variables like that.declareyear, ytdloop throughrows:ifyearequalsrow.yearthenassignytd + row.valuetoytdelseassignrow.valuetoytdassignrow.yeartoyearend ifend loop throughrows

Even with the iterative approach there is a hidden problem: the reset of the

`ytd`

variable and the update of the `year`

variable that occurs whenever the `row.year`

is different from the current value of the `year`

variable will only work properly if the rows of one particular year are next to each other (like when the rows are ordered by year prior to iteration).
The same applies within the year: the rows need to be sorted in chronological order, as the YTD value should reflect how much of the value was accumulated at that date within that year.
It may seem like a waste of time to think about an approach that is of no use to solving the problem. But this simple iterative approach provides a very simple recipe for quickly checking whether an actual solution behaves as expected. We'll use it later to veryify some results.

### A set-oriented approach

To implement it in SQL we have to think in a set-oriented way. Conceptually, we can think about it as if we combine each row in the set with all of the other rows, forming a cartesian product, and then retain only those combinations that have identical values for

`year`

, but a smaller or equal value for the `month`

.
This way, each row will combine with itself, and with all the other rows that chronologically precede it within the same year. The YTD value is then obtained by aggregating the rows over

`year`

and `month`

value, summing the value to become the YTD value.
### Sample Data

To play around a bit with the problem in SQL, let's set up a simple table:

And, here's some data:create tableSalesYearMonth ( SalesYearint, SalesMonthint, SalesAmountdecimal(`20`

,`2`

) ,primary key(SalesYear, SalesMonth) );

This setup is slightly different from the original problem statement. Instead of a column withinsert intoSalesYearMonth ( SalesYear , SalesMonth , SalesAmount )values( (`2011`

,`5`

,`503805.92`

) ,(`2011`

,`6`

,`458910.82`

) ,(`2011`

,`7`

,`2044600.00`

) ,(`2011`

,`8`

,`2495816.73`

) ,(`2011`

,`9`

,`502073.85`

) ,(`2011`

,`10`

,`4588761.82`

) ,(`2011`

,`11`

,`737839.82`

) ,(`2011`

,`12`

,`1309863.25`

) ,(`2012`

,`1`

,`3970627.28`

) ,(`2012`

,`2`

,`1475426.91`

) ,(`2012`

,`3`

,`2975748.24`

) ,(`2012`

,`4`

,`1634600.80`

) ,(`2012`

,`5`

,`3074602.81`

) ,(`2012`

,`6`

,`4099354.36`

) ,(`2012`

,`7`

,`3417953.87`

) ,(`2012`

,`8`

,`2175637.22`

) ,(`2012`

,`9`

,`3454151.94`

) ,(`2012`

,`10`

,`2544091.11`

) ,(`2012`

,`11`

,`1872701.98`

) ,(`2012`

,`12`

,`2829404.82`

) ,(`2013`

,`1`

,`2087872.46`

) ,(`2013`

,`2`

,`2316922.15`

) ,(`2013`

,`3`

,`3412068.97`

) ,(`2013`

,`4`

,`2532265.91`

) ,(`2013`

,`5`

,`3245623.76`

) ,(`2013`

,`6`

,`5081069.13`

) ,(`2013`

,`7`

,`4896353.74`

) ,(`2013`

,`8`

,`3333964.07`

) ,(`2013`

,`9`

,`4532908.71`

) ,(`2013`

,`10`

,`4795813.29`

) ,(`2013`

,`11`

,`3312130.25`

) ,(`2013`

,`12`

,`4075486.63`

) ,(`2014`

,`1`

,`4289817.95`

) ,(`2014`

,`2`

,`1337725.04`

) ,(`2014`

,`3`

,`7217531.09`

) ,(`2014`

,`4`

,`1797173.92`

) ,(`2014`

,`5`

,`5366674.97`

) ,(`2014`

,`6`

,`49005.84`

);

`DATE`

data type, we have separate `SalesYear`

and `SalesMonth`

columns.
This is fine - it doesn't change the problem or the solution in any way.
In fact, this setup allows us to think about the essential elements of the problem without having to worry about the details of getting to that point. Once we done that, we can apply the approach to a more realistic case.

### Next installment: Solution 1 - a self-**JOIN**

**JOIN**

In the next installment we will present and discuss a solution based on a self-

**JOIN**

and a **GROUP BY**

.
## No comments:

Post a Comment