Monday, February 22, 2021

Year-to-Date on Synapse Analytics 5: Using Window Functions

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 5th post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using window functions

Nowadays, many SQL engines and virtually all major RDBMSes support window functions (sometimes called analytic functions). A window function looks like a classic aggregate function. In some respects it also behaves like one, but at the same time there are essential differences.

Aggregate functions

Consider the following example:
select sum(SalesAmount) as SumOfSalesAmount
,      count(*)         as RowCount
from   SalesYearMonth
The example uses two aggregate functions, SUM() and COUNT(). It returns a result like this:

SumOfSalesAmount RowCount
109,846,381.43 38

Two things are happening here:
  • Even though there are multiple rows in the SalesYearMonth table, the result consists of just one row. In other words, a collection of source rows have been aggregated into fewer (in this case, only one) result row.
  • The functions have caclculated a value based on some aspect of the individual rows in the source collection. In the case of SUM(SalesAmount), the value of the SalesAmount column of each individual row was added to obtain a total. In the case of COUNT(*), each row was counted, adding up to the total number of rows.
Because the previous example uses aggregate functions, we cannot also select any non-aggregated columns. For example, while SalesYear and SalesMonth are present in the individual underlying rows, we cannot simpy select them, because they do not exist in the result row, which is an aggregate.

Window functions

Now, SUM() and COUNT() also exist as window functions. Consider the following query:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over() as TotalOfSalesAmount
,      count(*)         over() as RowCount
from   SalesYearMonth
You might notice the last two expressions in the SELECT-list look almost identical to the aggregate functions in the previous example. The difference is that in this query, the function call is followed by an OVER()-clause. Syntactically, this is what distinguishes ordinary aggregate functions from window functions.

Here is its result:

SalesYear SalesMonth SalesAmount TotalOfSalesAmount RowCount
2011 5 503,805.92 109,846,381.43 38
2011 6 458,910.82 109,846,381.43 38
...more rows...
2014 6 49,005.84 109,846,381.43 38

Note that we now get all the rows from the underlying SalesYearMonth table: no aggregation has ocurred. But the window functions do return a result that is identical to the one we got when using them as aggregate functions, and that for each row of the SalesYearMonth table.

It's as if for each row of the underlying table, the respective aggregate function was called over all rows in the entire table. Conceptually this is quite like the construct we used in the subquery-solution. The following example illustrates this:
select SalesYear
,      SalesMonth
,      SalesAmount
,      (
           select sum(SalesAmount) 
           from   SalesYearMonth
       ) as TotalOfSalesAmount
,      (
           select count(*)
           from   SalesYearMonth
       ) as RowCount
from   SalesYearMonth

The window and the OVER()-clause

Thinking about window functions as a shorthand for a subquery helps to understand how they work and also explains their name: a window function returns the result of an aggregate function on a particular subset of the rows in the query scope. This subset is called the window and it is defined by the OVER()-clause.

The parenthesis after the OVER-keyword can be used to define which rows will be considered as window. When left empty (like in the example above) all rows are considered.

Controlling the window using the PARTITION BY-clause

If you compare the previous example with our prior subquery-solution, you'll notice that here, we do not have a WHERE-clause to tie the subquery to the current row of the outer query. That's why our result is calculated over the entire table, rather than with respect to the current year and preceding months, as in our prior subquery-solution. This is equivalent to the empty parenthesis following the OVER-keyword in the corresponding window-function example.

In the subquery-solution we wrote a WHERE-clause to specify a condition to tie the rows of the subquery to the current row. For window functions, we can control which rows make up the window window by writing a PARTITION BY-clause inside the parenthesis following the OVER-keyword.

The PARTITION BY-clause does not let you specify an arbitrary condition, like we could in a subquery. Instead, the relationship between the current row and rows in the window must be expressed through one or more attributes for which they share a common value. The following example may illustrate this:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over(partition by SalesYear) as YearTotalOfSalesAmount
from   SalesYearMonth
In the example above, sum(SalesAmount) over(partition by SalesYear) means: calculate the total of SalesAmount over all rows where the value of SalesYear is equal to the value of the SalesYear in the current row.

The equivalent query using subqueries would be:
select OriginalSales.SalesYear
,      OriginalSales.SalesMonth
,      OriginalSales.SalesAmount
,      (
           select sum(YearSales.SalesAmount) 
           from   SalesYearMonth as YearSales
           where  YearSales.SalesYear = OriginalSales.SalesYear
       ) as YearTotalOfSalesAmount
from   SalesYearMonth as OriginalSales
The result is shown below:

SalesYear SalesMonth SalesAmount YearTotalOfSalesAmount
2011 5 503,805.92 12,641,672.21
2011 6 458,910.82 12,641,672.21
...more rows...
2014 6 49,005.84 20,057,928.81
(Note that 12,641,672.21 is the sum of the SalesAmount for SalesYear 2011; 20,057,928.81 is the total for 2014.)

A partition for the preceding months?

It's great that the PARTITION BY-clause allows us to specify a window for relevant year, but it's still too wide: we want the window to contain only the rows from the current year, but only for this month and its preceding months. In the subquery-solution this was easy, as we could write whatever condition we want in the WHERE-clause. So we wrote:
where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
Specifying SalesYear in the window functions' PARTITION BY-clause is equivalent to the first part of the subquery's WHERE-clause condition.

It's less clear what our partition expression should look like to select all months preceding the current month. It's not impossible though. For example, we can write an expression to mark whether the current SalesMonth is equal to or less than a specific month. For example:
-- every month up to and including june is 1, all months beyond june is 0
    when SalesMonth <= 6 then 1 
    else 0
If we can write such an expression, then of course, we can also use it in a PARTITION BY-clause, like so:
sum(SalesAmount) over (
  partition by 
  , case
      when SalesMonth <= 6 then 1 
      else 0
Let's try and think what this brings us.

Suppose the value for SalesMonth is 6 (june), or less? The CASE expression would return 1, and the window function would take all rows into account for which this is the case. So january, february, march and so on, up to june would all get the total of those six months - that is, the YTD value for june.

On the other hand, if SalesMonth is larger than 6, the CASE expression evaluates to 0. So all months beyond june (that is: july, august, and so on up to december) form a partition as well, and for those months, whatever is the sum over those months would be returned.

Now, it's not really clear what the outcome means in case the month is beyond june. But it doesn't really matter - what is important, is that we now know how to calculate the correct YTD value for a given month. And, what we did for june, we can do for any other month. So, once we have the YTD expressions for each individual month, we can set up yet another CASE-expression to pick the right one according to the current SalesMonth.

Putting all that together, we get:
select SalesYear
,      SalesMonth
,      SalesAmount
,      case SalesMonth
         -- january
         when 1 then SalesAmount
         -- february
         when 2 then
           sum(SalesAmount) over(
             partition by 
             , case when SalesMonth <= 2 then 1 else 0 end
         ...more cases for the other months...
         -- december
         when 12 then
           sum(SalesAmount) over(
             partition by 
             , case when SalesMonth <= 12 then 1 else 0 end
       end as YtDOfSalesAmount
from   SalesYearMonth
Like with the UNION-solution, we are taking advantage of our knowledge of the calendar, which allows us to create these static expressions. We would not be able to do this in a general case, or where the number of distinct values is very large. But for 12 months, we can manage.

While it's nice to know that this is possible, there is a much, much nicer way to achieve the same effect - the frame specification.

Frame Specification

The frame specification lets you specify a subset of rows within the partition. The way you can specify the frame feels a bit odd (to me at least), as it is specified in terms of the current row's position in the window. Hopefully the following example will make this more clear:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over(
         partition by SalesYear
         order by SalesMonth
         rows between unbounded preceding
         and current row
       ) as SalesYtd
from   SalesYearMonth
We already discussed the PARTITION BY-clause, all the clause after that are new.

The ORDER BY-clause sorts the rows within the window, in this case by SalesMonth. We need to rows to be ordered because of how the frame specification works: it lets you pick rows by position, relative to the current row. The position of the rows is undetermined unless we sort them explicitly, so if we want to pick rows reliably we need the ORDER BY-clause to guarantee the order.

The frame specification follows the ORDER BY-clause. There are a number of possible options here, but I will only discuss the one in the example. In this case, it almost explains itself: we want to use the current row, and all rows that precede it. Since we ordered by SalesMonth, this means all the rows that chronologically precede it. As this selection applies to the current partition, we will only encounter months here that are within the current year.

So here we have it: a YTD calculation implemented using a window functions. It's about the same amount of code as compared to the subquery solution, but more delcarative, as we do not need to specify the details of a condition. On the other hand, it is also less flexible than a subquery, but in general one should expect the window functions to perform better than the equivalent subquery.

Generalizing the solutions

So far all our examples were based on the SalesYearMonth table, which provides SalesYear and SalesMonth as separate columns. One might wonder what would it would take to apply these various methods to a realistic use case.

For example, it is likely that in a real dataset, the time would be available as a single column of a DATE or DATETIME data type. A single date column potentially affects the YTD calculation in two ways:
  • Year: As the YTD is calculated over a period of a year and almost all solutions we described used the SalesYear column explicitly to implement that logic.
  • Preceding rows: To calculate the YTD for a specific row, there has to be a clear definition of what rows are in the same year, but which precede it. In our examples we could use the SalesMonth column for that, but this might be a but different in a realistic case.
  • Lowest Granularity: The lowest granularity of the SalesMonthYear table is at the month level, and we collected the YTD values at that level. (If we'd want to be precise we'd have to call that year-to-month).
Apart from the time aspect, the definition of the key affects all solutions that generate "extra" rows and require a GROUP BY to re-aggregate to the original granularity.

The Year

The ON-condition of the JOIN-solution and the WHERE-condition of the subquery-solution both rely on a condition that finds other rows in the same year, and the window function-solution uses the year in its PARTITION BY-clause.

It is usually quite simple to extract the year from a date, date/time or timestamp. In Synapse Analytics or MS SQL one can use the DATEPART or YEAR function to do this.

The UNION-solution has no direct dependency on the year.

The preceding rows

The need to find the preceding rows applies to all solutions that use the year to find the rows to apply the YTD calculation on. In our samples, this could all be solved using the SalesMonth column.

Again, it are the JOIN-solution and subquery-solution that used it in their condition, whereas the window function-solution uses it in its ORDER BY-clause.

In this case, the fix is more straighforward then with the year: instead of the month column, these solutions can simply use the date or date/time column directly. No conversion or datepart extraction is required.

Lowest granularity

The granularity is of special concern to the UNION-solution. The solution relies on an exhaustive and static enumeration of all possible future dates within the year. Already at the month level, this already required a lot of manual code.

Below the month, the next level would be day. While it would in theory be possible to extend the solution to that level, it is already bordering the impractible at the month level.

The Key

The key definition affects both the JOIN-solution and the UNION-solution, as that both require a GROUP BY over the key.

No comments:


Nowadays, many SQL implementations offer some form of aggregate string concatenation function. Being an aggregate function, it has the effe...