TL;DR: A Year-to-Date solution based on a
SUM()
window function is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a self-JOIN
(combined with a GROUP BY
), a subquery, and a UNION
(also combined with a GROUP BY
).
Note: this is the 2nd post in a series. For sample data and setup, please see the first post in this series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)
Using a self-JOIN
The recipe for the set-oriented approach can be directly translated to SQL:
select SalesOriginal.SalesYear , SalesOriginal.SalesMonth , max(SalesOriginal.SalesAmount) as SalesAmount , sum(SalesYtd.SalesAmount) as SalesYtd from SalesYearMonth as SalesOriginal inner join SalesYearMonth as SalesYtd on SalesOriginal.SalesYear = SalesYtd.SalesYear and SalesOriginal.SalesMonth >= SalesYtd.SalesMonth group by SalesOriginal.SalesYear , SalesOriginal.SalesMonth
The self-JOIN
In our discussion of the set-oriented approach we mentioned combining the rows from the table with each other to produce all different combinations. In the code sample about, the
JOIN
-clause takes care of that aspect.
As you can see, the
SalesYearMonth
table appears twice: on the left hand and on the right hand of the JOIN
-keyword, but using different aliases: SalesOriginal
and SalesYtd
.
It is a so-called self-join.
Even though both aliases refer to an instance of the same
SalesYearMonth
base table, each has a very different role.
We can think of the one with the SalesOriginal
alias as really the SalesYearMonth
table itself.
The SalesYtd
alias refers to an instance of the SalesYearMonth
table that, for any given row from SalesOriginal
, represents a subset of rows that chronologically precedes the row from SalesOriginal
.
The
ON
-clause that follows controls which combinations should be retained: for each particular row of SalesOriginal
we only want to consider rows from SalesYtd
from the same year, which is why the first predicate in the ON
-clause is:SalesOriginal.SalesYear = SalesYtd.SalesYearWithin that year, we only want to consider rows that precede it chronologically, and that explains the second predicate:
SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
GROUP BY
and SUM()
It is is important to realize the
JOIN
is only half of the solution.
While the
JOIN
takes care of gathering and combining all related rows necessary to compute the YTD value,
the actual calculation is done by the SUM()
function in the SELECT
-list, and the GROUP BY
defines which rows should be taken together to be summed.
In summary:
- the
JOIN
generates new rows by combining rows from its left-hand table with the rows from its right-hand table, bound by the condition in theON
-clause. - The
GROUP BY
partitions the rows into subsets having the same combinations of values forSalesYear
andSalesMonth
. - The
SUM()
aggregates the rows in eachSalesYear, SalesMonth
partition, turning its associated set of rows into one single row, while adding the values of theSalesAmount
column together.
GROUP BY
list are qualified by the SalesOriginal
alias - and not SalesYtd
.
Also note that the GROUP BY
columns form the key of the original SalesYearMonth
table - together they uniquely identify a single row from the SalesYearMonth
table.
This is not a coincidence: it expresses precisely that SalesOriginal
really has the role of being just itself - the SalesYearMonth
table.
What about the other columns?
The
GROUP BY
affects treatment of the non-key columns as well.
In this overly simple example, we had only one other column - OriginalSales.SalesAmount
.
(Note that this is different from
YtdSales.SalesAmount
, which we aggregated using SUM()
to calculate the YTD value)
Since
OriginalSales.SalesAmount
comes from the SalesOriginal
instance of the SalesYearMonth
table, we can reason that after the GROUP BY
on the key columns SalesYear
and SalesMonth
, there must be exactly one SalesAmount
value for each distinct combination of SalesYear
and SalesMonth
.
In other words, SalesAmount
is functionally dependent on SalesYear
and SalesMonth
.
Some SQL engines are smart enough to realize this and will let you refer to any expression that is functionally dependent upon the expressions in the
GROUP BY
-list in the SELECT
-list.
Unfortunately, Synapse and MS SQL Server are not among these and if we try we will get an Error:
Msg 8120, Level 16, State 1, Line 11 Column 'Sales.SalesAmount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.The error message suggets we can do two things to solve it:
- either we aggregate by wrapping the
SalesOriginal.SalesAmount
-expression into some aggregate function - or we expand the
GROUP BY
-list and add theSalesOriginal.SalesAmount
-expression there.
SalesAmount
is clearly intended as a measure, and it feels weird to treat them the same as the attributes SalesYear
and SalesMonth
.
So adding it to the GROUP BY
-list feels like the wrong choice. Besides, it also makes the code less maintainable, as each such column will now appear twice: once in the SELECT
-list, where we need it no matter what, and once again in the GROUP BY
-list, just to satisfy the SQL engine.
So, if we don't want to put it in the
GROUP BY
-list, we are going to need to wrap it in an aggregate function.
We just mentioned that SalesAmount
is a measure and therefore that does not sound unreasonable.
However, we have to be careful which one we choose.
One would normally use
SalesAmount
as an additive measure and be able to use SUM()
for that.
But here, in this context, SUM()
is definitily the wrong choice!
All we want to do is to "get" back" whatever value we had for
SalesAmount
, in other words, unaffected by the whole routine of join-and-then-aggregate, which we did only to calculate the YTD value.
The "extra" rows generated by the JOIN
are only needed to do the YTD calculation and should not affect any of the other measures.
Using SUM()
would simply add the SalesAmount
just as many times as there are preceding rows in the current year, which simply does not have any meaningful application.
What we want instead is to report back the original
SalesAmount
for any given SalesYear, SalesMonth
combination.
We just reasoned that there will be just one distinct SalesOriginal.SalesAmount
value for any combination of values in SalesOriginal.SalesYear, SalesOriginal.SalesMonth
,
and it would be great if we had an aggregate function that would simply pick the SalesOriginal.SalesAmount
value from any of those rows.
To the best of my knowledge, no such aggregate function exists in MS SQL Server or Synapse Analytics.
We can use
MAX()
or MIN()
, or even AVG()
.
While this would all work and deliver the intended result, it still feels wrong as it seems wasteful to ask the SQL engine to do some calculation on a set of values while it could pick just any value.
Next installment: Solution 2 - using a subquery
In the next installment we will present and discuss a solution based on a subquery.
No comments:
Post a Comment