TL;DR: A Year-to-Date solution based on a
SUM()
window function is simple to code and maintain as well as efficient to execute.
This as compared to a number of alternative implementations, namely a self-JOIN
(combined with a GROUP BY
), a subquery, and a UNION
(also combined with a GROUP BY
).
Note: this is the 3rd post in a series.
- For sample data and setup, please see the 1st post in this series.
- For a solution based on a self-
JOIN
andGROUP BY
, please find the 2nd post in this series.
Using a subquery
We can also think of YTD calculation as a separate query that we perform for each row of the
SalesYearMonth
table.
While this does imply a row-by-row approach, we can still translate this easily to pure SQL by creating an expression in the SELECT
-list, which uses a subquery to calculate the YTD value for the current row:select SalesOriginal.SalesYear , SalesOriginal.SalesMonth , SalesOriginal.SalesAmount , ( select sum(SalesYtd.SalesAmount) from SalesYearMonth as SalesYtd where SalesYtd.SalesYear = SalesOriginal.SalesYear and SalesYtd.SalesMonth <= SalesOriginal.SalesMonth ) as SalesYtd from SalesYearMonth as SalesOriginalThere's a similarity with the
JOIN
-solution, in that we use the SalesYearMonth
table twice, but in different roles.
In the JOIN
-solution both appeared on one side of the JOIN
keyword and we used the aliases OriginalSales
and YtdSales
to be able to keep them apart.
In the subquery approach, the distinction between these two different instances of the SalesYearMonth
table is more explicit: the main instance of the SalesYearMonth
table occurs in the FROM
-clause, and the one for the YTD calculation occurs in the SELECT
-list.
Also similar to the
JOIN
solution is the condition to tie the set for the YTD calculation to the main query using the SalesYear
and SalesMonth
columns.
Such a subquery is referred to as a correlated subquery.
As for any differences with the
JOIN
solution:
In the condition, the only difference is the left/right placement of SalesOriginal
and SalesYtd
, which is chosen only by order of appearance in the query but functionally completely equivalent.
The most striking difference between the JOIN
solution and the subquery is the absence of the GROUP BY
-list in the latter.
Drawbacks of the subquery
As we had much to complain about the
GROUP BY
-list in the JOIN
solution, it might seem that the subquery solution is somehow "better".
However, a solution with a correlated subquery in general tends to be slower than a JOIN
solution.
Whether this is actually the case depends on on many variables and you'd really have to check it against your SQL engine and datasets.
Another drawback of the subquery solution becomes clear when we want to calculate the YTD for multiple measures. Our example only has one
SalesAmount
measure, but in this same context we can easily imagine that we also want to know about price, discount amounts, tax amounts, shipping costs, and so on.
In the
JOIN
solution, we would simply add any extra measures to the select list, using MAX()
(or MIN()
or AVG()
) to obtain the original value, and SUM()
to calculate its respective YTD value:
As long as it's over the same set, the JOIN
, its condition, and even the GROUP BY
-list would remain the same, no matter for how many different measures we would add a YTD calculation.
This is very different in the subquery case. Each measure for which you need a YTD calculation would get its own subquery. Even though the condition would be the same for each such YTD calculation, you would still need to repeat the subquery code - one for each YTD measure.
Next installment: Solution 3 - using a UNION
In the next installment we will present and discuss a solution based on a UNION and a GROUP BY.
No comments:
Post a Comment