Monday, February 22, 2021

Year-to-Date on Synapse Analytics 5: Using Window Functions

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 5th post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using window functions


Nowadays, many SQL engines and virtually all major RDBMSes support window functions (sometimes called analytic functions). A window function looks like a classic aggregate function. In some respects it also behaves like one, but at the same time there are essential differences.

Aggregate functions


Consider the following example:
select sum(SalesAmount) as SumOfSalesAmount
,      count(*)         as RowCount
from   SalesYearMonth
The example uses two aggregate functions, SUM() and COUNT(). It returns a result like this:

SumOfSalesAmount RowCount
109,846,381.43 38

Two things are happening here:
  • Even though there are multiple rows in the SalesYearMonth table, the result consists of just one row. In other words, a collection of source rows have been aggregated into fewer (in this case, only one) result row.
  • The functions have caclculated a value based on some aspect of the individual rows in the source collection. In the case of SUM(SalesAmount), the value of the SalesAmount column of each individual row was added to obtain a total. In the case of COUNT(*), each row was counted, adding up to the total number of rows.
Because the previous example uses aggregate functions, we cannot also select any non-aggregated columns. For example, while SalesYear and SalesMonth are present in the individual underlying rows, we cannot simpy select them, because they do not exist in the result row, which is an aggregate.

Window functions


Now, SUM() and COUNT() also exist as window functions. Consider the following query:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over() as TotalOfSalesAmount
,      count(*)         over() as RowCount
from   SalesYearMonth
You might notice the last two expressions in the SELECT-list look almost identical to the aggregate functions in the previous example. The difference is that in this query, the function call is followed by an OVER()-clause. Syntactically, this is what distinguishes ordinary aggregate functions from window functions.

Here is its result:

SalesYear SalesMonth SalesAmount TotalOfSalesAmount RowCount
2011 5 503,805.92 109,846,381.43 38
2011 6 458,910.82 109,846,381.43 38
...more rows...
2014 6 49,005.84 109,846,381.43 38

Note that we now get all the rows from the underlying SalesYearMonth table: no aggregation has ocurred. But the window functions do return a result that is identical to the one we got when using them as aggregate functions, and that for each row of the SalesYearMonth table.

It's as if for each row of the underlying table, the respective aggregate function was called over all rows in the entire table. Conceptually this is quite like the construct we used in the subquery-solution. The following example illustrates this:
select SalesYear
,      SalesMonth
,      SalesAmount
,      (
           select sum(SalesAmount) 
           from   SalesYearMonth
       ) as TotalOfSalesAmount
,      (
           select count(*)
           from   SalesYearMonth
       ) as RowCount
from   SalesYearMonth

The window and the OVER()-clause


Thinking about window functions as a shorthand for a subquery helps to understand how they work and also explains their name: a window function returns the result of an aggregate function on a particular subset of the rows in the query scope. This subset is called the window and it is defined by the OVER()-clause.

The parenthesis after the OVER-keyword can be used to define which rows will be considered as window. When left empty (like in the example above) all rows are considered.

Controlling the window using the PARTITION BY-clause


If you compare the previous example with our prior subquery-solution, you'll notice that here, we do not have a WHERE-clause to tie the subquery to the current row of the outer query. That's why our result is calculated over the entire table, rather than with respect to the current year and preceding months, as in our prior subquery-solution. This is equivalent to the empty parenthesis following the OVER-keyword in the corresponding window-function example.

In the subquery-solution we wrote a WHERE-clause to specify a condition to tie the rows of the subquery to the current row. For window functions, we can control which rows make up the window window by writing a PARTITION BY-clause inside the parenthesis following the OVER-keyword.

The PARTITION BY-clause does not let you specify an arbitrary condition, like we could in a subquery. Instead, the relationship between the current row and rows in the window must be expressed through one or more attributes for which they share a common value. The following example may illustrate this:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over(partition by SalesYear) as YearTotalOfSalesAmount
from   SalesYearMonth
In the example above, sum(SalesAmount) over(partition by SalesYear) means: calculate the total of SalesAmount over all rows where the value of SalesYear is equal to the value of the SalesYear in the current row.

The equivalent query using subqueries would be:
select OriginalSales.SalesYear
,      OriginalSales.SalesMonth
,      OriginalSales.SalesAmount
,      (
           select sum(YearSales.SalesAmount) 
           from   SalesYearMonth as YearSales
           where  YearSales.SalesYear = OriginalSales.SalesYear
       ) as YearTotalOfSalesAmount
from   SalesYearMonth as OriginalSales
The result is shown below:

SalesYear SalesMonth SalesAmount YearTotalOfSalesAmount
2011 5 503,805.92 12,641,672.21
2011 6 458,910.82 12,641,672.21
...more rows...
2014 6 49,005.84 20,057,928.81
(Note that 12,641,672.21 is the sum of the SalesAmount for SalesYear 2011; 20,057,928.81 is the total for 2014.)

A partition for the preceding months?


It's great that the PARTITION BY-clause allows us to specify a window for relevant year, but it's still too wide: we want the window to contain only the rows from the current year, but only for this month and its preceding months. In the subquery-solution this was easy, as we could write whatever condition we want in the WHERE-clause. So we wrote:
where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
Specifying SalesYear in the window functions' PARTITION BY-clause is equivalent to the first part of the subquery's WHERE-clause condition.

It's less clear what our partition expression should look like to select all months preceding the current month. It's not impossible though. For example, we can write an expression to mark whether the current SalesMonth is equal to or less than a specific month. For example:
-- every month up to and including june is 1, all months beyond june is 0
case
    when SalesMonth <= 6 then 1 
    else 0
end
If we can write such an expression, then of course, we can also use it in a PARTITION BY-clause, like so:
sum(SalesAmount) over (
  partition by 
    SalesYear
  , case
      when SalesMonth <= 6 then 1 
      else 0
    end
) 
Let's try and think what this brings us.

Suppose the value for SalesMonth is 6 (june), or less? The CASE expression would return 1, and the window function would take all rows into account for which this is the case. So january, february, march and so on, up to june would all get the total of those six months - that is, the YTD value for june.

On the other hand, if SalesMonth is larger than 6, the CASE expression evaluates to 0. So all months beyond june (that is: july, august, and so on up to december) form a partition as well, and for those months, whatever is the sum over those months would be returned.

Now, it's not really clear what the outcome means in case the month is beyond june. But it doesn't really matter - what is important, is that we now know how to calculate the correct YTD value for a given month. And, what we did for june, we can do for any other month. So, once we have the YTD expressions for each individual month, we can set up yet another CASE-expression to pick the right one according to the current SalesMonth.

Putting all that together, we get:
select SalesYear
,      SalesMonth
,      SalesAmount
,      case SalesMonth
         -- january
         when 1 then SalesAmount
         -- february
         when 2 then
           sum(SalesAmount) over(
             partition by 
               SalesYear
             , case when SalesMonth <= 2 then 1 else 0 end
           ) 
           
         ...more cases for the other months...
           
         -- december
         when 12 then
           sum(SalesAmount) over(
             partition by 
               SalesYear
             , case when SalesMonth <= 12 then 1 else 0 end
           ) 
       end as YtDOfSalesAmount
from   SalesYearMonth
Like with the UNION-solution, we are taking advantage of our knowledge of the calendar, which allows us to create these static expressions. We would not be able to do this in a general case, or where the number of distinct values is very large. But for 12 months, we can manage.

While it's nice to know that this is possible, there is a much, much nicer way to achieve the same effect - the frame specification.

Frame Specification


The frame specification lets you specify a subset of rows within the partition. The way you can specify the frame feels a bit odd (to me at least), as it is specified in terms of the current row's position in the window. Hopefully the following example will make this more clear:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over(
         partition by SalesYear
         order by SalesMonth
         rows between unbounded preceding
         and current row
       ) as SalesYtd
from   SalesYearMonth
We already discussed the PARTITION BY-clause, all the clause after that are new.

The ORDER BY-clause sorts the rows within the window, in this case by SalesMonth. We need to rows to be ordered because of how the frame specification works: it lets you pick rows by position, relative to the current row. The position of the rows is undetermined unless we sort them explicitly, so if we want to pick rows reliably we need the ORDER BY-clause to guarantee the order.

The frame specification follows the ORDER BY-clause. There are a number of possible options here, but I will only discuss the one in the example. In this case, it almost explains itself: we want to use the current row, and all rows that precede it. Since we ordered by SalesMonth, this means all the rows that chronologically precede it. As this selection applies to the current partition, we will only encounter months here that are within the current year.

So here we have it: a YTD calculation implemented using a window functions. It's about the same amount of code as compared to the subquery solution, but more delcarative, as we do not need to specify the details of a condition. On the other hand, it is also less flexible than a subquery, but in general one should expect the window functions to perform better than the equivalent subquery.

Generalizing the solutions


So far all our examples were based on the SalesYearMonth table, which provides SalesYear and SalesMonth as separate columns. One might wonder what would it would take to apply these various methods to a realistic use case.

For example, it is likely that in a real dataset, the time would be available as a single column of a DATE or DATETIME data type. A single date column potentially affects the YTD calculation in two ways:
  • Year: As the YTD is calculated over a period of a year and almost all solutions we described used the SalesYear column explicitly to implement that logic.
  • Preceding rows: To calculate the YTD for a specific row, there has to be a clear definition of what rows are in the same year, but which precede it. In our examples we could use the SalesMonth column for that, but this might be a but different in a realistic case.
  • Lowest Granularity: The lowest granularity of the SalesMonthYear table is at the month level, and we collected the YTD values at that level. (If we'd want to be precise we'd have to call that year-to-month).
Apart from the time aspect, the definition of the key affects all solutions that generate "extra" rows and require a GROUP BY to re-aggregate to the original granularity.

The Year


The ON-condition of the JOIN-solution and the WHERE-condition of the subquery-solution both rely on a condition that finds other rows in the same year, and the window function-solution uses the year in its PARTITION BY-clause.

It is usually quite simple to extract the year from a date, date/time or timestamp. In Synapse Analytics or MS SQL one can use the DATEPART or YEAR function to do this.

The UNION-solution has no direct dependency on the year.

The preceding rows


The need to find the preceding rows applies to all solutions that use the year to find the rows to apply the YTD calculation on. In our samples, this could all be solved using the SalesMonth column.

Again, it are the JOIN-solution and subquery-solution that used it in their condition, whereas the window function-solution uses it in its ORDER BY-clause.

In this case, the fix is more straighforward then with the year: instead of the month column, these solutions can simply use the date or date/time column directly. No conversion or datepart extraction is required.

Lowest granularity


The granularity is of special concern to the UNION-solution. The solution relies on an exhaustive and static enumeration of all possible future dates within the year. Already at the month level, this already required a lot of manual code.

Below the month, the next level would be day. While it would in theory be possible to extend the solution to that level, it is already bordering the impractible at the month level.

The Key


The key definition affects both the JOIN-solution and the UNION-solution, as that both require a GROUP BY over the key.

Year-to-Date on Synapse Analytics 4: Using UNION and GROUP BY

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 4th post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a UNION


We mentioned how the solution with the JOIN relates each row of the main set with a subset of "extra" rows over which the YTD value is calculated by aggregating over the key of the main set using a GROUP BY.

It may not be immediately obvious, but we can also use the SQL UNION (or rather, UNION ALL) operator to generate such a related subset. Just like with the JOIN-solution, this can then be aggregated using GROUP BY. An example will help to explain this:
select      SalesYear
,           SalesMonth
,           sum(SumOfSalesAmount)      as SumOfSalesAmount
,           sum(YtdOfSumOfSalesAmount) as YtdOfSumOfSalesAmount
from (
    select  SalesYear
    ,       SalesMonth
    ,       SumOfSalesAmount
    ,       SumOfSalesAmount           as YtdOfSumOfSalesAmount
    from    SalesYearMonth
    union all
    -- JANUARY
    select  SalesYear
    ,       SalesMonth + 1            -- february
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    select  SalesYear
    ,       SalesMonth + 2            -- march
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    
    ... and so on, all for JANUARY ...

    union all
    select  SalesYear
    ,       SalesMonth + 11            -- december
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    -- FEBRUARY
    select  SalesYear
    ,       SalesMonth + 1            -- march
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 2
    union all
    
    ... and so on, for the rest of FEBRUARY, 
        and then again for MARCH, APRIl, MAY, JUNE, JULY, AUGUST, SEPTEMBER, OCTOBER...   

    -- NOVEMBER
    select  SalesYear
    ,       SalesMonth + 1            -- december
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 11
) Sales
group by    SalesYear
,           SalesMonth

Duplicating metric-data so it contributes to the following months


In the top of the UNION we simply provide the entire resultset from SalesYearMonth, reporting the SumOfSalesAmount-metric as is, but also copying it to YtdSumOfSalesAmount. The other parts of the UNION are used to selectively duplicate the data for the SumOfSalesAmount-metric into the YtdSumOfSalesAmount, so that its data contributes to the YtdSumOfSalesAmount for all following months.

We start by grabbing january's data by applying the condition that demands that the SalesMonth equals 1:
-- JANUARY
select  SalesYear
,       SalesMonth + 1            -- february
,       null
,       SumOfSalesAmount
from    SalesYearMonth
where   SalesMonth = 1
union all
... repeat to duplicate january's data into february, march, and so on all the way up to december...
. This is done for a total of 11 times, each time adding 1, 2, 3 and so on - all the way up to 11 - to the SalesMonth attribute. This ensures january's data, as captured by the condition in the WHERE clause, is reported also in february (SalesMonth + 1), march (SalesMonth + 2), and so on, all the way up to december (SalesMonth + 11).

After the string of UNIONs for january appear more parts to duplicate the data also for february and all following months: -- FEBRUARY select SalesYear , SalesMonth + 1 -- march , null , SumOfSalesAmount from SalesYearMonth where SalesMonth = 02 union all ... repeat to duplicate february's data into march, april, and so on all the way up to december... . Again, february's data is selected by applying the condition
where   SalesMonth = 2
, and this happens now 10 times, again adding a number to the SalesMonth so it is duplicated to march, april, may, all the way up to december - in other words, all months following february.

What we thus did for january and februry is repeated for march, april, and so on for all months up to november. November is the last month we need to do this for: November's data still needs to be copied to december, but as december is the last month, that data only needs to be counted in december itself.

While it may seem wasteful to duplicate all this data, it really is not that different in that respect from the other solutions we've seen so far. It's just that now it's really in your face, because there is a pretty direct correspondence between the SQL code and the data sets that are being handled. The JOIN and subquery solutions hande similar amounts of data, it's just achieved with way less code, and in a far more implicit manner.

Original metric is retained


Note that the original metric also computes correctly, because the parts of the union only duplicate the data to the YTD column. The union parts that duplicate the data to the subsequent months select a NULL for the original metric. So the data for the original metric is never duplicated, and thus retains its normal value.

Drawbacks to the UNION-solution


The main drawback to the UNION-solution is its maintainability. A lot of code is required, far more than for any of the methods we have seen so far. Despite the indiviual patterns are simple (condition to get one month, adding a number to project that data to a future month), it is suprisingly easy to make a little mistake somewhere

We just argued that this solution is not so much different from the JOIN solution, but that remark only pertains to how the calculation is performed. The JOIN-solution generates the data it operates upon dynamically and declaratively; the UNION solution does this statically and explicitly. This is also why it is impossible to generalize ths approach for any arbitrary JOIN: YTD is a special case, because we know exactly how often we should duplicate the data as this is dictated by the cyclical structure of our calendar.

Next installment: Solution 4 - window functions


In the next installment we will present and discuss a solution based on a window functions.

Year-to-Date on Synapse Analytics 3: Using a Subquery

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 3rd post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a subquery


We can also think of YTD calculation as a separate query that we perform for each row of the SalesYearMonth table. While this does imply a row-by-row approach, we can still translate this easily to pure SQL by creating an expression in the SELECT-list, which uses a subquery to calculate the YTD value for the current row:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           SalesOriginal.SalesAmount
,           (
                select sum(SalesYtd.SalesAmount)
                from   SalesYearMonth as SalesYtd
                where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
                and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
            ) as SalesYtd
from        SalesYearMonth as SalesOriginal
There's a similarity with the JOIN-solution, in that we use the SalesYearMonth table twice, but in different roles. In the JOIN-solution both appeared on one side of the JOIN keyword and we used the aliases OriginalSales and YtdSales to be able to keep them apart. In the subquery approach, the distinction between these two different instances of the SalesYearMonth table is more explicit: the main instance of the SalesYearMonth table occurs in the FROM-clause, and the one for the YTD calculation occurs in the SELECT-list.

Also similar to the JOIN solution is the condition to tie the set for the YTD calculation to the main query using the SalesYear and SalesMonth columns. Such a subquery is referred to as a correlated subquery.

As for any differences with the JOIN solution: In the condition, the only difference is the left/right placement of SalesOriginal and SalesYtd, which is chosen only by order of appearance in the query but functionally completely equivalent. The most striking difference between the JOIN solution and the subquery is the absence of the GROUP BY-list in the latter.

Drawbacks of the subquery


As we had much to complain about the GROUP BY-list in the JOIN solution, it might seem that the subquery solution is somehow "better". However, a solution with a correlated subquery in general tends to be slower than a JOIN solution. Whether this is actually the case depends on on many variables and you'd really have to check it against your SQL engine and datasets.

Another drawback of the subquery solution becomes clear when we want to calculate the YTD for multiple measures. Our example only has one SalesAmount measure, but in this same context we can easily imagine that we also want to know about price, discount amounts, tax amounts, shipping costs, and so on.

In the JOIN solution, we would simply add any extra measures to the select list, using MAX() (or MIN() or AVG()) to obtain the original value, and SUM() to calculate its respective YTD value: As long as it's over the same set, the JOIN, its condition, and even the GROUP BY-list would remain the same, no matter for how many different measures we would add a YTD calculation.

This is very different in the subquery case. Each measure for which you need a YTD calculation would get its own subquery. Even though the condition would be the same for each such YTD calculation, you would still need to repeat the subquery code - one for each YTD measure.

Next installment: Solution 3 - using a UNION


In the next installment we will present and discuss a solution based on a UNION and a GROUP BY.

Year-to-Date on Synapse Analytics 2: Using a self-JOIN and GROUP BY

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 2nd post in a series. For sample data and setup, please see the first post in this series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a self-JOIN


The recipe for the set-oriented approach can be directly translated to SQL:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           max(SalesOriginal.SalesAmount) as SalesAmount
,           sum(SalesYtd.SalesAmount)      as SalesYtd
from        SalesYearMonth as SalesOriginal
inner join  SalesYearMonth as SalesYtd
on          SalesOriginal.SalesYear  = SalesYtd.SalesYear
and         SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
group by    SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth

The self-JOIN


In our discussion of the set-oriented approach we mentioned combining the rows from the table with each other to produce all different combinations. In the code sample about, the JOIN-clause takes care of that aspect.

As you can see, the SalesYearMonth table appears twice: on the left hand and on the right hand of the JOIN-keyword, but using different aliases: SalesOriginal and SalesYtd. It is a so-called self-join.

Even though both aliases refer to an instance of the same SalesYearMonth base table, each has a very different role. We can think of the one with the SalesOriginal alias as really the SalesYearMonth table itself. The SalesYtd alias refers to an instance of the SalesYearMonth table that, for any given row from SalesOriginal, represents a subset of rows that chronologically precedes the row from SalesOriginal.

The ON-clause that follows controls which combinations should be retained: for each particular row of SalesOriginal we only want to consider rows from SalesYtd from the same year, which is why the first predicate in the ON-clause is:
SalesOriginal.SalesYear  = SalesYtd.SalesYear
Within that year, we only want to consider rows that precede it chronologically, and that explains the second predicate:
SalesOriginal.SalesMonth >= SalesYtd.SalesMonth

GROUP BY and SUM()


It is is important to realize the JOIN is only half of the solution.

While the JOIN takes care of gathering and combining all related rows necessary to compute the YTD value, the actual calculation is done by the SUM() function in the SELECT-list, and the GROUP BY defines which rows should be taken together to be summed.

In summary:
  • the JOIN generates new rows by combining rows from its left-hand table with the rows from its right-hand table, bound by the condition in the ON-clause.
  • The GROUP BY partitions the rows into subsets having the same combinations of values for SalesYear and SalesMonth.
  • The SUM() aggregates the rows in each SalesYear, SalesMonth partition, turning its associated set of rows into one single row, while adding the values of the SalesAmount column together.
Note that the columns in the GROUP BY list are qualified by the SalesOriginal alias - and not SalesYtd. Also note that the GROUP BY columns form the key of the original SalesYearMonth table - together they uniquely identify a single row from the SalesYearMonth table. This is not a coincidence: it expresses precisely that SalesOriginal really has the role of being just itself - the SalesYearMonth table.

What about the other columns?


The GROUP BY affects treatment of the non-key columns as well. In this overly simple example, we had only one other column - OriginalSales.SalesAmount.

(Note that this is different from YtdSales.SalesAmount, which we aggregated using SUM() to calculate the YTD value)

Since OriginalSales.SalesAmount comes from the SalesOriginal instance of the SalesYearMonth table, we can reason that after the GROUP BY on the key columns SalesYear and SalesMonth, there must be exactly one SalesAmount value for each distinct combination of SalesYear and SalesMonth. In other words, SalesAmount is functionally dependent on SalesYear and SalesMonth.

Some SQL engines are smart enough to realize this and will let you refer to any expression that is functionally dependent upon the expressions in the GROUP BY-list in the SELECT-list. Unfortunately, Synapse and MS SQL Server are not among these and if we try we will get an Error:
Msg 8120, Level 16, State 1, Line 11
Column 'Sales.SalesAmount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
The error message suggets we can do two things to solve it:
  • either we aggregate by wrapping the SalesOriginal.SalesAmount-expression into some aggregate function
  • or we expand the GROUP BY-list and add the SalesOriginal.SalesAmount-expression there.
To me, neither feels quite right.

SalesAmount is clearly intended as a measure, and it feels weird to treat them the same as the attributes SalesYear and SalesMonth. So adding it to the GROUP BY-list feels like the wrong choice. Besides, it also makes the code less maintainable, as each such column will now appear twice: once in the SELECT-list, where we need it no matter what, and once again in the GROUP BY-list, just to satisfy the SQL engine.

So, if we don't want to put it in the GROUP BY-list, we are going to need to wrap it in an aggregate function. We just mentioned that SalesAmount is a measure and therefore that does not sound unreasonable. However, we have to be careful which one we choose.

One would normally use SalesAmount as an additive measure and be able to use SUM() for that. But here, in this context, SUM() is definitily the wrong choice!

All we want to do is to "get" back" whatever value we had for SalesAmount, in other words, unaffected by the whole routine of join-and-then-aggregate, which we did only to calculate the YTD value. The "extra" rows generated by the JOIN are only needed to do the YTD calculation and should not affect any of the other measures. Using SUM() would simply add the SalesAmount just as many times as there are preceding rows in the current year, which simply does not have any meaningful application.

What we want instead is to report back the original SalesAmount for any given SalesYear, SalesMonth combination. We just reasoned that there will be just one distinct SalesOriginal.SalesAmount value for any combination of values in SalesOriginal.SalesYear, SalesOriginal.SalesMonth, and it would be great if we had an aggregate function that would simply pick the SalesOriginal.SalesAmount value from any of those rows. To the best of my knowledge, no such aggregate function exists in MS SQL Server or Synapse Analytics.

We can use MAX() or MIN(), or even AVG(). While this would all work and deliver the intended result, it still feels wrong as it seems wasteful to ask the SQL engine to do some calculation on a set of values while it could pick just any value.

Next installment: Solution 2 - using a subquery


In the next installment we will present and discuss a solution based on a subquery.

Year-to-Date on Synapse Analytics 1: Background

For one of our Just-BI customers we implemented a Year-to-Date calculation in a Azure Synapse Backend. We encountered a couple of approaches and in this series I'd like to share some sample code, and discuss some of the merits and benefits of each approach.

(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain as well as efficient to execute. This as compared to a number of alternative implementations, namely a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

In this Installment

Context


Our customer is using an Azure Data Lake to store data from all kinds of source systems, including its SAP ERP system. Azure Synapse Analytics sits on top of the Data Lake and is used as analytics workhorse, but also to integrate various data sets present in the data lake. Front-end BI tools, such as Microsoft PowerBI, can then connect to Synapse and import or query the data from there.

In many cases, the datamarts presented by Synapse are pretty straightforward. Calculations and derived measures needed to build dashboards and data visualizations can typically be developed rather quickly inside the Power BI data model. Once the front-end development has stabilized, one can consider to refactor the solution and move parts away from the front-end and push them down to the backend for performance or maintainability.

(There are all kinds of opinions regarding data architecture and on when to put what where. We do not pretend to have the final answer to that, but the current workflow allows us to very quickly deliver solutions that can be used and verified by the users. At present I do not think we could achieve the same productivity if we would demand that everything be designed and built on the backend right from the get go.)

So, today we were refactoring some of the logic in a PowerBI model, including a Year-to-Date calculation. The solution we ended up implementing to solve it seems to work rather nicely so I figured to share it.

Year-to-Date value


What's a year to date (YTD) value? Basically it's the cumulative value over a metric in time, which resets once a year. In other words, the year to date value is the per-year total of the value achieved up to the current date.

This is best explained with an example. Consider the following dataset:

Date Value YTD Value
2012-01-1035,401.1435,401.14
2012-01-2015,012.1850,413.32
2012-02-0125,543.7175,957.03
2012-02-1032,115.41108,072.43
2012-02-2017,688.07125,760.50
2012-03-0110,556.53136,317.03
.........
2013-01-0119,623.9019,623.90
2013-01-108,351.1827,975.08
2013-01-2020,287.6548,262.73
2013-02-0133,055.6981,318.42

In the table above we have dates from two years - 2012 and 2013 - and for each date a Value.

For the first date encountered within a year, the YTD Value is equal to the Value itself; For each subsequent Date, the YTD Value is maintained as a running total of the values that appeared at the earlier dates.

So, 2012-01-10 is the first date we encounter in 2012 and therefore its YTD Value is equal to the Value at that date (35,401.14). The next date is 2012-01-20 and its Value is 15,012.18; therefore its YTD Value is 50,413.32, which is 15,012.18 + 35,401.14. The accumulation continues until we reach the last date of 2012.

At 2013-01-01 the first date of the next year, the YTD Value resets again to be equal to the Value, and then in the subsequent dates of 2013, the YTD Value again accumulates the current Value by adding it to the preceding YTD-value.

How to use YTD


You can use YTD values to analyze how well actual trends are developing over time as compared to a planning or predicition. By comparing the calculated YTD of a measure to a projected value (for example, a sales target), we can see how far off we are at any point in time.

If you gather these comparisons for a couple of moments in time, you can get a sense of the pace in which the actual situation is deviating from on converging to the target or the projected situation. These insights allow you to intervene in some way: maybe you need to adjust your planning, or change your expectations. Or maybe you need to adjust your efforts in order to more closely approximate your target.

Thinking about YTD as iteration


From the way we explained what a year-to-date value is, you might think about it as an actual "rolling sum". By that I mean, you might think about it as an iterative problem, that you solve by going through the rows, one by one. In pseudocode, such a solution would do something like:
    declare year, ytd
  
    loop through rows:
    
        if year equals row.year then 
        
          assign ytd + row.value to ytd
          
        else
        
          assign row.value to ytd
          assign row.year to year
          
        end if
        
    end loop through rows
While this approach would apparently give you the desired result, it does not help you to solve the problem in SQL directly. Pure SQL does not let you iterate rows like that, and it also does not let you work with variables like that.

Even with the iterative approach there is a hidden problem: the reset of the ytd variable and the update of the year variable that occurs whenever the row.year is different from the current value of the year variable will only work properly if the rows of one particular year are next to each other (like when the rows are ordered by year prior to iteration). The same applies within the year: the rows need to be sorted in chronological order, as the YTD value should reflect how much of the value was accumulated at that date within that year.

It may seem like a waste of time to think about an approach that is of no use to solving the problem. But this simple iterative approach provides a very simple recipe for quickly checking whether an actual solution behaves as expected. We'll use it later to veryify some results.

A set-oriented approach


To implement it in SQL we have to think in a set-oriented way. Conceptually, we can think about it as if we combine each row in the set with all of the other rows, forming a cartesian product, and then retain only those combinations that have identical values for year, but a smaller or equal value for the month.

This way, each row will combine with itself, and with all the other rows that chronologically precede it within the same year. The YTD value is then obtained by aggregating the rows over year and month value, summing the value to become the YTD value.

Sample Data


To play around a bit with the problem in SQL, let's set up a simple table:
create table SalesYearMonth (
  SalesYear   int
, SalesMonth  int
, SalesAmount decimal(20,2)
, primary key(SalesYear, SalesMonth)
);
And, here's some data:
insert into SalesYearMonth (
  SalesYear
, SalesMonth
, SalesAmount
) values (
 (2011,5,503805.92)
,(2011,6,458910.82)
,(2011,7,2044600.00)
,(2011,8,2495816.73)
,(2011,9,502073.85)
,(2011,10,4588761.82)
,(2011,11,737839.82)
,(2011,12,1309863.25)
,(2012,1,3970627.28)
,(2012,2,1475426.91)
,(2012,3,2975748.24)
,(2012,4,1634600.80)
,(2012,5,3074602.81)
,(2012,6,4099354.36)
,(2012,7,3417953.87)
,(2012,8,2175637.22)
,(2012,9,3454151.94)
,(2012,10,2544091.11)
,(2012,11,1872701.98)
,(2012,12,2829404.82)
,(2013,1,2087872.46)
,(2013,2,2316922.15)
,(2013,3,3412068.97)
,(2013,4,2532265.91)
,(2013,5,3245623.76)
,(2013,6,5081069.13)
,(2013,7,4896353.74)
,(2013,8,3333964.07)
,(2013,9,4532908.71)
,(2013,10,4795813.29)
,(2013,11,3312130.25)
,(2013,12,4075486.63)
,(2014,1,4289817.95)
,(2014,2,1337725.04)
,(2014,3,7217531.09)
,(2014,4,1797173.92)
,(2014,5,5366674.97)
,(2014,6,49005.84);
This setup is slightly different from the original problem statement. Instead of a column with DATE data type, we have separate SalesYear and SalesMonth columns. This is fine - it doesn't change the problem or the solution in any way.

In fact, this setup allows us to think about the essential elements of the problem without having to worry about the details of getting to that point. Once we done that, we can apply the approach to a more realistic case.

Next installment: Solution 1 - a self-JOIN


In the next installment we will present and discuss a solution based on a self-JOIN and a GROUP BY.

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...