Monday, February 22, 2021

Year-to-Date on Synapse Analytics 5: Using Window Functions

For one of our Just-BI customers we implemented a Year-to-Date calculation in an Azure Synapse backend. We encountered a couple of approaches, and in this series I'd like to share some sample code and discuss the merits and drawbacks of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain, as well as efficient to execute, compared to a number of alternative implementations: a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 5th post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using window functions


Nowadays, many SQL engines and virtually all major RDBMSes support window functions (sometimes called analytic functions). A window function looks like a classic aggregate function. In some respects it also behaves like one, but at the same time there are essential differences.

Aggregate functions


Consider the following example:
select sum(SalesAmount) as SumOfSalesAmount
,      count(*)         as RowCount
from   SalesYearMonth
The example uses two aggregate functions, SUM() and COUNT(). It returns a result like this:

SumOfSalesAmount RowCount
109,846,381.43 38

Two things are happening here:
  • Even though there are multiple rows in the SalesYearMonth table, the result consists of just one row. In other words, a collection of source rows have been aggregated into fewer (in this case, only one) result row.
  • The functions have calculated a value based on some aspect of the individual rows in the source collection. In the case of SUM(SalesAmount), the value of the SalesAmount column of each individual row was added to obtain a total. In the case of COUNT(*), each row was counted, adding up to the total number of rows.
Because the previous example uses aggregate functions, we cannot also select any non-aggregated columns. For example, while SalesYear and SalesMonth are present in the individual underlying rows, we cannot simply select them, because they do not exist in the result row, which is an aggregate.
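If we do want to see SalesYear next to an aggregated value, we have to aggregate explicitly at that level with a GROUP BY. A minimal sketch against the same SalesYearMonth table:
select   SalesYear
,        sum(SalesAmount) as SumOfSalesAmount
,        count(*)         as RowCount
from     SalesYearMonth
group by SalesYear
This returns one aggregated row per SalesYear, rather than a single row for the whole table.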

Window functions


Now, SUM() and COUNT() also exist as window functions. Consider the following query:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over() as TotalOfSalesAmount
,      count(*)         over() as RowCount
from   SalesYearMonth
You might notice the last two expressions in the SELECT-list look almost identical to the aggregate functions in the previous example. The difference is that in this query, the function call is followed by an OVER()-clause. Syntactically, this is what distinguishes ordinary aggregate functions from window functions.

Here is its result:

SalesYear SalesMonth SalesAmount TotalOfSalesAmount RowCount
2011 5 503,805.92 109,846,381.43 38
2011 6 458,910.82 109,846,381.43 38
...more rows...
2014 6 49,005.84 109,846,381.43 38

Note that we now get all the rows from the underlying SalesYearMonth table: no aggregation has occurred. But the window functions do return a result that is identical to the one we got when using them as aggregate functions, and they do so for each row of the SalesYearMonth table.

It's as if for each row of the underlying table, the respective aggregate function was called over all rows in the entire table. Conceptually this is quite like the construct we used in the subquery-solution. The following example illustrates this:
select SalesYear
,      SalesMonth
,      SalesAmount
,      (
           select sum(SalesAmount) 
           from   SalesYearMonth
       ) as TotalOfSalesAmount
,      (
           select count(*)
           from   SalesYearMonth
       ) as RowCount
from   SalesYearMonth

The window and the OVER()-clause


Thinking about window functions as a shorthand for a subquery helps to understand how they work and also explains their name: a window function returns the result of an aggregate function on a particular subset of the rows in the query scope. This subset is called the window and it is defined by the OVER()-clause.

The parentheses after the OVER-keyword can be used to define which rows will be considered to be the window. When left empty (as in the example above), all rows are considered.

Controlling the window using the PARTITION BY-clause


If you compare the previous example with our earlier subquery-solution, you'll notice that here we do not have a WHERE-clause to tie the subquery to the current row of the outer query. That's why the result is calculated over the entire table, rather than with respect to the current year and preceding months as in the subquery-solution. This corresponds to the empty parentheses following the OVER-keyword in the window-function example.

In the subquery-solution we wrote a WHERE-clause to specify a condition to tie the rows of the subquery to the current row. For window functions, we can control which rows make up the window by writing a PARTITION BY-clause inside the parentheses following the OVER-keyword.

The PARTITION BY-clause does not let you specify an arbitrary condition, like we could in a subquery. Instead, the relationship between the current row and rows in the window must be expressed through one or more attributes for which they share a common value. The following example may illustrate this:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over(partition by SalesYear) as YearTotalOfSalesAmount
from   SalesYearMonth
In the example above, sum(SalesAmount) over(partition by SalesYear) means: calculate the total of SalesAmount over all rows where the value of SalesYear is equal to the value of the SalesYear in the current row.

The equivalent query using subqueries would be:
select OriginalSales.SalesYear
,      OriginalSales.SalesMonth
,      OriginalSales.SalesAmount
,      (
           select sum(YearSales.SalesAmount) 
           from   SalesYearMonth as YearSales
           where  YearSales.SalesYear = OriginalSales.SalesYear
       ) as YearTotalOfSalesAmount
from   SalesYearMonth as OriginalSales
The result is shown below:

SalesYear SalesMonth SalesAmount YearTotalOfSalesAmount
2011 5 503,805.92 12,641,672.21
2011 6 458,910.82 12,641,672.21
...more rows...
2014 6 49,005.84 20,057,928.81
(Note that 12,641,672.21 is the sum of the SalesAmount for SalesYear 2011; 20,057,928.81 is the total for 2014.)

A partition for the preceding months?


It's great that the PARTITION BY-clause allows us to specify a window for the relevant year, but that window is still too wide: we want it to contain only the rows from the current year for the current month and its preceding months. In the subquery-solution this was easy, as we could write whatever condition we wanted in the WHERE-clause. So we wrote:
where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
Specifying SalesYear in the window functions' PARTITION BY-clause is equivalent to the first part of the subquery's WHERE-clause condition.

It's less clear what our partition expression should look like to select the current month and all months preceding it. It's not impossible though: we can write an expression that marks whether the SalesMonth is equal to or less than a specific month. For example:
-- every month up to and including june maps to 1, every month beyond june maps to 0
case
    when SalesMonth <= 6 then 1 
    else 0
end
If we can write such an expression, then of course, we can also use it in a PARTITION BY-clause, like so:
sum(SalesAmount) over (
  partition by 
    SalesYear
  , case
      when SalesMonth <= 6 then 1 
      else 0
    end
) 
Let's try and think what this brings us.

Suppose the value of SalesMonth is 6 (June) or less. The CASE expression would return 1, and the window function would take into account all rows for which this is the case. So January, February, March and so on, up to June, would all get the total of those six months - that is, the YTD value for June.

On the other hand, if SalesMonth is larger than 6, the CASE expression evaluates to 0. So all months beyond June (that is: July, August, and so on, up to December) form a partition as well, and for those months the sum over that partition would be returned.

Now, it's not really clear what the outcome means when the month is beyond June. But it doesn't really matter - what is important is that we now know how to calculate the correct YTD value for a given month. And what we did for June, we can do for any other month. So, once we have the YTD expressions for each individual month, we can set up yet another CASE-expression to pick the right one according to the current SalesMonth.

Putting all that together, we get:
select SalesYear
,      SalesMonth
,      SalesAmount
,      case SalesMonth
         -- january
         when 1 then SalesAmount
         -- february
         when 2 then
           sum(SalesAmount) over(
             partition by 
               SalesYear
             , case when SalesMonth <= 2 then 1 else 0 end
           ) 
           
         ...more cases for the other months...
           
         -- december
         when 12 then
           sum(SalesAmount) over(
             partition by 
               SalesYear
             , case when SalesMonth <= 12 then 1 else 0 end
           ) 
       end as YtDOfSalesAmount
from   SalesYearMonth
Like with the UNION-solution, we are taking advantage of our knowledge of the calendar, which allows us to create these static expressions. We would not be able to do this in a general case, or where the number of distinct values is very large. But for 12 months, we can manage.

While it's nice to know that this is possible, there is a much, much nicer way to achieve the same effect - the frame specification.

Frame Specification


The frame specification lets you specify a subset of rows within the partition. The way you can specify the frame feels a bit odd (to me at least), as it is specified in terms of the current row's position in the window. Hopefully the following example will make this more clear:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over(
         partition by SalesYear
         order by SalesMonth
         rows between unbounded preceding
         and current row
       ) as SalesYtd
from   SalesYearMonth
We already discussed the PARTITION BY-clause; all the clauses after that are new.

The ORDER BY-clause sorts the rows within the window, in this case by SalesMonth. We need the rows to be ordered because of how the frame specification works: it lets you pick rows by position, relative to the current row. The position of the rows is undetermined unless we sort them explicitly, so if we want to pick rows reliably we need the ORDER BY-clause to guarantee the order.

The frame specification follows the ORDER BY-clause. There are a number of possible options here, but I will only discuss the one in the example. In this case, it almost explains itself: we want to use the current row, and all rows that precede it. Since we ordered by SalesMonth, this means all the rows that chronologically precede it. As this selection applies to the current partition, we will only encounter months here that are within the current year.
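Just to illustrate that other frames are possible (we don't need this for the YTD calculation), a sketch of a rolling three-month total would only change the frame specification:
select SalesYear
,      SalesMonth
,      SalesAmount
,      sum(SalesAmount) over(
         partition by SalesYear
         order by SalesMonth
         rows between 2 preceding
         and current row
       ) as RollingThreeMonthTotal
from   SalesYearMonth
For the first two months of each year the frame simply contains fewer than three rows, so the total only covers the months available so far.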

So here we have it: a YTD calculation implemented using a window function. It's about the same amount of code as the subquery solution, but more declarative, as we do not need to spell out the details of a condition. It is also less flexible than a subquery, but in general one should expect the window function to perform better than the equivalent subquery.

Generalizing the solutions


So far all our examples were based on the SalesYearMonth table, which provides SalesYear and SalesMonth as separate columns. One might wonder what it would take to apply these various methods to a realistic use case.

For example, it is likely that in a real dataset, the time would be available as a single column of a DATE or DATETIME data type. A single date column potentially affects the YTD calculation in a number of ways:
  • Year: The YTD is calculated over a period of a year, and almost all of the solutions we described use the SalesYear column explicitly to implement that logic.
  • Preceding rows: To calculate the YTD for a specific row, there has to be a clear definition of which rows are in the same year but precede it. In our examples we could use the SalesMonth column for that, but this might be a bit different in a realistic case.
  • Lowest Granularity: The lowest granularity of the SalesYearMonth table is at the month level, and we calculated the YTD values at that level. (If we wanted to be precise, we'd have to call that year-to-month.)
Apart from the time aspect, the definition of the key affects all solutions that generate "extra" rows and require a GROUP BY to re-aggregate to the original granularity.

The Year


The ON-condition of the JOIN-solution and the WHERE-condition of the subquery-solution both rely on a condition that finds other rows in the same year, and the window function-solution uses the year in its PARTITION BY-clause.

It is usually quite simple to extract the year from a date, date/time or timestamp. In Synapse Analytics or MS SQL one can use the DATEPART or YEAR function to do this.
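For example, assuming a hypothetical table with a SalesDate column of the DATE data type, either of the following expressions returns the year as an integer:
-- Sales is a hypothetical table with a SalesDate column of the DATE type
select SalesDate
,      year(SalesDate)           as SalesYear
,      datepart(year, SalesDate) as SalesYearToo
from   Sales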

The UNION-solution has no direct dependency on the year.

The preceding rows


The need to find the preceding rows applies to all solutions that use the year to find the rows to apply the YTD calculation on. In our samples, this could all be solved using the SalesMonth column.

Again, it is the JOIN-solution and the subquery-solution that use it in their conditions, whereas the window function-solution uses it in its ORDER BY-clause.

In this case, the fix is more straightforward than with the year: instead of the month column, these solutions can simply use the date or date/time column directly. No conversion or datepart extraction is required.
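Putting these two points together, a sketch of the window function-solution against a hypothetical DailySales table with a single SalesDate column (of the DATE type) could look like this:
select SalesDate
,      SalesAmount
,      sum(SalesAmount) over(
         partition by year(SalesDate)
         order by SalesDate
         rows between unbounded preceding
         and current row
       ) as SalesYtd
from   DailySales
The year is derived from the date in the PARTITION BY-clause, and the date itself provides the chronological order within the year.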

Lowest granularity


The granularity is of special concern to the UNION-solution. The solution relies on an exhaustive and static enumeration of all possible future dates within the year. Already at the month level, this required a lot of manual code.

Below the month, the next level would be the day. While it would in theory be possible to extend the solution to that level, it already borders on the impractical at the month level.

The Key


The key definition affects both the JOIN-solution and the UNION-solution, as both require a GROUP BY over the key.
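For the JOIN-solution, a sketch against the same hypothetical DailySales table would then group by the date itself, since the date is now the key:
select      SalesOriginal.SalesDate
,           max(SalesOriginal.SalesAmount) as SalesAmount
,           sum(SalesYtd.SalesAmount)      as SalesYtd
from        DailySales as SalesOriginal
inner join  DailySales as SalesYtd
on          year(SalesOriginal.SalesDate) = year(SalesYtd.SalesDate)
and         SalesOriginal.SalesDate      >= SalesYtd.SalesDate
group by    SalesOriginal.SalesDate
The UNION-solution would need the same GROUP BY over the date key, on top of the enumeration problem described above.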

Year-to-Date on Synapse Analytics 4: Using UNION and GROUP BY

For one of our Just-BI customers we implemented a Year-to-Date calculation in an Azure Synapse backend. We encountered a couple of approaches, and in this series I'd like to share some sample code and discuss the merits and drawbacks of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain, as well as efficient to execute, compared to a number of alternative implementations: a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 4th post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a UNION


We mentioned how the JOIN-solution relates each row of the main set to a subset of "extra" rows, over which the YTD value is then calculated by aggregating over the key of the main set using a GROUP BY.

It may not be immediately obvious, but we can also use the SQL UNION (or rather, UNION ALL) operator to generate such a related subset. Just like with the JOIN-solution, this can then be aggregated using GROUP BY. An example will help to explain this:
select      SalesYear
,           SalesMonth
,           sum(SumOfSalesAmount)      as SumOfSalesAmount
,           sum(YtdOfSumOfSalesAmount) as YtdOfSumOfSalesAmount
from (
    select  SalesYear
    ,       SalesMonth
    ,       SumOfSalesAmount
    ,       SumOfSalesAmount           as YtdOfSumOfSalesAmount
    from    SalesYearMonth
    union all
    -- JANUARY
    select  SalesYear
    ,       SalesMonth + 1            -- february
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    select  SalesYear
    ,       SalesMonth + 2            -- march
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    
    ... and so on, all for JANUARY ...

    union all
    select  SalesYear
    ,       SalesMonth + 11            -- december
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 1
    union all
    -- FEBRUARY
    select  SalesYear
    ,       SalesMonth + 1            -- march
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 2
    union all
    
    ... and so on, for the rest of FEBRUARY, 
        and then again for MARCH, APRIL, MAY, JUNE, JULY, AUGUST, SEPTEMBER, OCTOBER...   

    -- NOVEMBER
    select  SalesYear
    ,       SalesMonth + 1            -- december
    ,       null
    ,       SumOfSalesAmount
    from    SalesYearMonth
    where   SalesMonth = 11
) Sales
group by    SalesYear
,           SalesMonth

Duplicating metric-data so it contributes to the following months


At the top of the UNION we simply provide the entire resultset from SalesYearMonth, reporting the SumOfSalesAmount-metric as-is, but also copying it to YtdOfSumOfSalesAmount. The other parts of the UNION selectively duplicate the data of the SumOfSalesAmount-metric into YtdOfSumOfSalesAmount, so that it contributes to the YtdOfSumOfSalesAmount of all following months.

We start by grabbing January's data by applying the condition that demands that SalesMonth equals 1:
-- JANUARY
select  SalesYear
,       SalesMonth + 1            -- february
,       null
,       SumOfSalesAmount
from    SalesYearMonth
where   SalesMonth = 1
union all
... repeat to duplicate january's data into february, march, and so on all the way up to december...
This is done a total of 11 times, each time adding 1, 2, 3 and so on - all the way up to 11 - to the SalesMonth attribute. This ensures that January's data, as captured by the condition in the WHERE-clause, is also reported in February (SalesMonth + 1), March (SalesMonth + 2), and so on, all the way up to December (SalesMonth + 11).

After the string of UNIONs for January, more parts appear that duplicate the data also for February and all following months:
-- FEBRUARY
select  SalesYear
,       SalesMonth + 1            -- march
,       null
,       SumOfSalesAmount
from    SalesYearMonth
where   SalesMonth = 2
union all
... repeat to duplicate february's data into march, april, and so on all the way up to december...
Again, February's data is selected by applying the condition
where   SalesMonth = 2
This happens 10 times in total, again adding a number to the SalesMonth so that February's data is duplicated into March, April, May, and so on, all the way up to December - in other words, into all months following February.

What we thus did for January and February is repeated for March, April, and so on, for all months up to November. November is the last month for which we need to do this: its data still needs to be copied into December. December itself is the last month of the year, so its data does not need to be duplicated anywhere.

While it may seem wasteful to duplicate all this data, it really is not that different in that respect from the other solutions we've seen so far. It's just that now it's really in your face, because there is a pretty direct correspondence between the SQL code and the data sets that are being handled. The JOIN and subquery solutions handle similar amounts of data; it's just achieved with far less code, and in a far more implicit manner.

Original metric is retained


Note that the original metric also computes correctly, because the parts of the union only duplicate the data to the YTD column. The union parts that duplicate the data to the subsequent months select a NULL for the original metric. So the data for the original metric is never duplicated, and thus retains its normal value.
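To see this in action, consider what the outer GROUP BY receives for 2011, month 6 when we use the sample amounts from the first post of this series (where 2011 starts in May):
-- rows arriving at the outer GROUP BY for (2011, 6):
-- (2011, 6, 458910.82, 458910.82)   -- june's own row, from the first part of the UNION
-- (2011, 6, null     , 503805.92)   -- may's amount, duplicated into june
--
-- sum(SumOfSalesAmount)      =  458910.82  -- the NULL is ignored, so the original value is retained
-- sum(YtdOfSumOfSalesAmount) =  962716.74  -- the YTD value for june 2011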

Drawbacks to the UNION-solution


The main drawback of the UNION-solution is its maintainability. A lot of code is required, far more than for any of the methods we have seen so far. Although the individual patterns are simple (a condition to get one month, adding a number to project that data to a future month), it is surprisingly easy to make a little mistake somewhere.

We just argued that this solution is not that different from the JOIN-solution, but that remark only pertains to how the calculation is performed. The JOIN-solution generates the data it operates upon dynamically and declaratively; the UNION-solution does this statically and explicitly. This is also why it is impossible to generalize this approach for an arbitrary JOIN: YTD is a special case, because we know exactly how often we should duplicate the data, as this is dictated by the cyclical structure of our calendar.

Next installment: Solution 4 - window functions


In the next installment we will present and discuss a solution based on window functions.

Year-to-Date on Synapse Analytics 3: Using a Subquery

For one of our Just-BI customers we implemented a Year-to-Date calculation in an Azure Synapse backend. We encountered a couple of approaches, and in this series I'd like to share some sample code and discuss the merits and drawbacks of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain, as well as efficient to execute, compared to a number of alternative implementations: a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 3rd post in a series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a subquery


We can also think of YTD calculation as a separate query that we perform for each row of the SalesYearMonth table. While this does imply a row-by-row approach, we can still translate this easily to pure SQL by creating an expression in the SELECT-list, which uses a subquery to calculate the YTD value for the current row:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           SalesOriginal.SalesAmount
,           (
                select sum(SalesYtd.SalesAmount)
                from   SalesYearMonth as SalesYtd
                where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
                and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
            ) as SalesYtd
from        SalesYearMonth as SalesOriginal
There's a similarity with the JOIN-solution, in that we use the SalesYearMonth table twice, but in different roles. In the JOIN-solution the two instances appeared on either side of the JOIN-keyword, and we used the aliases SalesOriginal and SalesYtd to keep them apart. In the subquery approach, the distinction between these two instances of the SalesYearMonth table is more explicit: the main instance occurs in the FROM-clause, and the one for the YTD calculation occurs in the SELECT-list.

Also similar to the JOIN solution is the condition to tie the set for the YTD calculation to the main query using the SalesYear and SalesMonth columns. Such a subquery is referred to as a correlated subquery.

As for the differences with the JOIN-solution: in the condition, the only difference is the left/right placement of SalesOriginal and SalesYtd, which merely follows their order of appearance in the query and is functionally completely equivalent. The most striking difference between the JOIN-solution and the subquery is the absence of the GROUP BY-list in the latter.

Drawbacks of the subquery


Since we had much to complain about regarding the GROUP BY-list in the JOIN-solution, it might seem that the subquery solution is somehow "better". However, a solution with a correlated subquery generally tends to be slower than a JOIN-solution. Whether this is actually the case depends on many variables, and you'd really have to check it against your own SQL engine and datasets.
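A simple way to do such a check on MS SQL Server is to run both variants with statistics enabled and compare the reported figures; on Synapse dedicated SQL pools, the EXPLAIN statement and the query plan views can serve a similar purpose. A sketch:
-- MS SQL Server: report elapsed time and I/O per statement for this session
set statistics time on;
set statistics io on;
-- now run the subquery variant and the JOIN variant of the YTD query,
-- and compare the elapsed times and logical reads they report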

Another drawback of the subquery solution becomes clear when we want to calculate the YTD for multiple measures. Our example only has one SalesAmount measure, but in this same context we can easily imagine that we also want to know about price, discount amounts, tax amounts, shipping costs, and so on.

In the JOIN-solution, we would simply add any extra measures to the SELECT-list, using MAX() (or MIN() or AVG()) to obtain the original value and SUM() to calculate its respective YTD value. As long as it's over the same set, the JOIN, its condition, and even the GROUP BY-list would remain the same, no matter how many measures we add a YTD calculation for.
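For example, assuming a hypothetical extra TaxAmount column in the SalesYearMonth table, the JOIN-solution would only grow by two expressions in the SELECT-list:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           max(SalesOriginal.SalesAmount) as SalesAmount
,           sum(SalesYtd.SalesAmount)      as SalesYtd
,           max(SalesOriginal.TaxAmount)   as TaxAmount   -- hypothetical extra measure
,           sum(SalesYtd.TaxAmount)        as TaxYtd
from        SalesYearMonth as SalesOriginal
inner join  SalesYearMonth as SalesYtd
on          SalesOriginal.SalesYear   = SalesYtd.SalesYear
and         SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
group by    SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth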

This is very different in the subquery case. Each measure for which you need a YTD calculation would get its own subquery. Even though the condition would be the same for each such YTD calculation, you would still need to repeat the subquery code - once for each YTD measure.
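For the same hypothetical TaxAmount column, the subquery solution needs a second, almost identical subquery:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           SalesOriginal.SalesAmount
,           (
                select sum(SalesYtd.SalesAmount)
                from   SalesYearMonth as SalesYtd
                where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
                and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
            ) as SalesYtd
,           SalesOriginal.TaxAmount        -- hypothetical extra measure
,           (
                select sum(SalesYtd.TaxAmount)
                from   SalesYearMonth as SalesYtd
                where  SalesYtd.SalesYear   = SalesOriginal.SalesYear
                and    SalesYtd.SalesMonth <= SalesOriginal.SalesMonth
            ) as TaxYtd
from        SalesYearMonth as SalesOriginal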

Next installment: Solution 3 - using a UNION


In the next installment we will present and discuss a solution based on a UNION and a GROUP BY.

Year-to-Date on Synapse Analytics 2: Using a self-JOIN and GROUP BY

For one of our Just-BI customers we implemented a Year-to-Date calculation in an Azure Synapse backend. We encountered a couple of approaches, and in this series I'd like to share some sample code and discuss the merits and drawbacks of each approach.

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain, as well as efficient to execute, compared to a number of alternative implementations: a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Note: this is the 2nd post in a series. For sample data and setup, please see the first post in this series. (While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

Using a self-JOIN


The recipe for the set-oriented approach can be directly translated to SQL:
select      SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth
,           max(SalesOriginal.SalesAmount) as SalesAmount
,           sum(SalesYtd.SalesAmount)      as SalesYtd
from        SalesYearMonth as SalesOriginal
inner join  SalesYearMonth as SalesYtd
on          SalesOriginal.SalesYear  = SalesYtd.SalesYear
and         SalesOriginal.SalesMonth >= SalesYtd.SalesMonth
group by    SalesOriginal.SalesYear
,           SalesOriginal.SalesMonth

The self-JOIN


In our discussion of the set-oriented approach we mentioned combining the rows from the table with each other to produce all different combinations. In the code sample above, the JOIN-clause takes care of that aspect.

As you can see, the SalesYearMonth table appears twice: once on the left-hand side and once on the right-hand side of the JOIN-keyword, but with different aliases: SalesOriginal and SalesYtd. This is a so-called self-join.

Even though both aliases refer to an instance of the same SalesYearMonth base table, each has a very different role. We can think of the one with the SalesOriginal alias as really the SalesYearMonth table itself. The SalesYtd alias refers to an instance of the SalesYearMonth table that, for any given row from SalesOriginal, represents the subset of rows that chronologically precede, or coincide with, the row from SalesOriginal.

The ON-clause that follows controls which combinations should be retained: for each particular row of SalesOriginal we only want to consider rows from SalesYtd from the same year, which is why the first predicate in the ON-clause is:
SalesOriginal.SalesYear  = SalesYtd.SalesYear
Within that year, we only want to consider rows up to and including the current month, which explains the second predicate:
SalesOriginal.SalesMonth >= SalesYtd.SalesMonth

GROUP BY and SUM()


It is important to realize that the JOIN is only half of the solution.

While the JOIN takes care of gathering and combining all related rows necessary to compute the YTD value, the actual calculation is done by the SUM() function in the SELECT-list, and the GROUP BY defines which rows should be taken together to be summed.

In summary:
  • the JOIN generates new rows by combining rows from its left-hand table with the rows from its right-hand table, bound by the condition in the ON-clause.
  • The GROUP BY partitions the rows into subsets having the same combinations of values for SalesYear and SalesMonth.
  • The SUM() aggregates the rows in each SalesYear, SalesMonth partition, turning its associated set of rows into one single row, while adding the values of the SalesAmount column together.
Note that the columns in the GROUP BY list are qualified by the SalesOriginal alias - and not SalesYtd. Also note that the GROUP BY columns form the key of the original SalesYearMonth table - together they uniquely identify a single row from the SalesYearMonth table. This is not a coincidence: it expresses precisely that SalesOriginal really has the role of being just itself - the SalesYearMonth table.

What about the other columns?


The GROUP BY affects treatment of the non-key columns as well. In this overly simple example, we had only one other column - SalesOriginal.SalesAmount.

(Note that this is different from SalesYtd.SalesAmount, which we aggregated using SUM() to calculate the YTD value.)

Since SalesOriginal.SalesAmount comes from the SalesOriginal instance of the SalesYearMonth table, we can reason that after the GROUP BY on the key columns SalesYear and SalesMonth, there must be exactly one SalesAmount value for each distinct combination of SalesYear and SalesMonth. In other words, SalesAmount is functionally dependent on SalesYear and SalesMonth.

Some SQL engines are smart enough to realize this and will let you refer to any expression that is functionally dependent upon the expressions in the GROUP BY-list in the SELECT-list. Unfortunately, Synapse and MS SQL Server are not among these and if we try we will get an Error:
Msg 8120, Level 16, State 1, Line 11
Column 'Sales.SalesAmount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
The error message suggests we can do two things to solve it:
  • either we aggregate by wrapping the SalesOriginal.SalesAmount-expression into some aggregate function
  • or we expand the GROUP BY-list and add the SalesOriginal.SalesAmount-expression there.
To me, neither feels quite right.

SalesAmount is clearly intended as a measure, and it feels weird to treat it the same as the attributes SalesYear and SalesMonth. So adding it to the GROUP BY-list feels like the wrong choice. Besides, it also makes the code less maintainable, as each such column will now appear twice: once in the SELECT-list, where we need it no matter what, and once again in the GROUP BY-list, just to satisfy the SQL engine.

So, if we don't want to put it in the GROUP BY-list, we are going to need to wrap it in an aggregate function. We just mentioned that SalesAmount is a measure and therefore that does not sound unreasonable. However, we have to be careful which one we choose.

One would normally use SalesAmount as an additive measure and be able to use SUM() for that. But here, in this context, SUM() is definitely the wrong choice!

All we want to do is "get back" whatever value we had for SalesAmount - in other words, a value unaffected by the whole routine of join-and-then-aggregate, which we did only to calculate the YTD value. The "extra" rows generated by the JOIN are only needed for the YTD calculation and should not affect any of the other measures. Using SUM() would add the SalesAmount as many times as there are preceding rows in the current year, which does not have any meaningful application.

What we want instead is to report back the original SalesAmount for any given SalesYear, SalesMonth combination. We just reasoned that there will be just one distinct SalesOriginal.SalesAmount value for any combination of values in SalesOriginal.SalesYear, SalesOriginal.SalesMonth, and it would be great if we had an aggregate function that would simply pick the SalesOriginal.SalesAmount value from any of those rows. To the best of my knowledge, no such aggregate function exists in MS SQL Server or Synapse Analytics.

We can use MAX() or MIN(), or even AVG(). While this would all work and deliver the intended result, it still feels wrong as it seems wasteful to ask the SQL engine to do some calculation on a set of values while it could pick just any value.

Next installment: Solution 2 - using a subquery


In the next installment we will present and discuss a solution based on a subquery.

Year-to-Date on Synapse Analytics 1: Background

For one of our Just-BI customers we implemented a Year-to-Date calculation in an Azure Synapse backend. We encountered a couple of approaches, and in this series I'd like to share some sample code and discuss the merits and drawbacks of each approach.

(While our use case deals with Azure Synapse, most of the code will be directly compatible with other SQL Engines and RDBMS-es.)

TL;DR: A Year-to-Date solution based on a SUM() window function is simple to code and maintain, as well as efficient to execute, compared to a number of alternative implementations: a self-JOIN (combined with a GROUP BY), a subquery, and a UNION (also combined with a GROUP BY).

Context


Our customer is using an Azure Data Lake to store data from all kinds of source systems, including its SAP ERP system. Azure Synapse Analytics sits on top of the Data Lake and is used as analytics workhorse, but also to integrate various data sets present in the data lake. Front-end BI tools, such as Microsoft PowerBI, can then connect to Synapse and import or query the data from there.

In many cases, the datamarts presented by Synapse are pretty straightforward. Calculations and derived measures needed to build dashboards and data visualizations can typically be developed rather quickly inside the Power BI data model. Once the front-end development has stabilized, one can consider refactoring the solution, moving parts away from the front-end and pushing them down to the backend for performance or maintainability.

(There are all kinds of opinions regarding data architecture and on when to put what where. We do not pretend to have the final answer to that, but the current workflow allows us to very quickly deliver solutions that can be used and verified by the users. At present I do not think we could achieve the same productivity if we would demand that everything be designed and built on the backend right from the get go.)

So, today we were refactoring some of the logic in a PowerBI model, including a Year-to-Date calculation. The solution we ended up implementing seems to work rather nicely, so I figured I'd share it.

Year-to-Date value


What's a year-to-date (YTD) value? Basically, it's the cumulative value of a metric over time, which resets once a year. In other words, the year-to-date value is the running total, within the year, of the value achieved up to the current date.

This is best explained with an example. Consider the following dataset:

Date        Value      YTD Value
2012-01-10  35,401.14  35,401.14
2012-01-20  15,012.18  50,413.32
2012-02-01  25,543.71  75,957.03
2012-02-10  32,115.41  108,072.43
2012-02-20  17,688.07  125,760.50
2012-03-01  10,556.53  136,317.03
...         ...        ...
2013-01-01  19,623.90  19,623.90
2013-01-10  8,351.18   27,975.08
2013-01-20  20,287.65  48,262.73
2013-02-01  33,055.69  81,318.42

In the table above we have dates from two years - 2012 and 2013 - and for each date a Value.

For the first date encountered within a year, the YTD Value is equal to the Value itself; for each subsequent date, the YTD Value is the running total of the Value at that date and at all earlier dates within the year.

So, 2012-01-10 is the first date we encounter in 2012 and therefore its YTD Value is equal to the Value at that date (35,401.14). The next date is 2012-01-20 and its Value is 15,012.18; therefore its YTD Value is 50,413.32, which is 15,012.18 + 35,401.14. The accumulation continues until we reach the last date of 2012.

At 2013-01-01, the first date of the next year, the YTD Value resets to be equal to the Value again, and then at the subsequent dates of 2013, the YTD Value again accumulates the current Value by adding it to the preceding YTD Value.

How to use YTD


You can use YTD values to analyze how well actual trends are developing over time as compared to a planning or prediction. By comparing the calculated YTD of a measure to a projected value (for example, a sales target), we can see how far off we are at any point in time.

If you gather these comparisons for a couple of moments in time, you can get a sense of the pace at which the actual situation is deviating from, or converging to, the target or projected situation. These insights allow you to intervene in some way: maybe you need to adjust your planning or change your expectations, or maybe you need to adjust your efforts in order to more closely approximate your target.

Thinking about YTD as iteration


From the way we explained what a year-to-date value is, you might think about it as an actual "rolling sum". By that I mean, you might think about it as an iterative problem, that you solve by going through the rows, one by one. In pseudocode, such a solution would do something like:
    declare year, ytd
  
    loop through rows:
    
        if year equals row.year then 
        
          assign ytd + row.value to ytd
          
        else
        
          assign row.value to ytd
          assign row.year to year
          
        end if
        
    end loop through rows
While this approach would apparently give you the desired result, it does not help you to solve the problem in SQL directly. Pure SQL does not let you iterate rows like that, and it also does not let you work with variables like that.

Even with the iterative approach there is a hidden problem: the reset of the ytd variable and the update of the year variable that occurs whenever the row.year is different from the current value of the year variable will only work properly if the rows of one particular year are next to each other (like when the rows are ordered by year prior to iteration). The same applies within the year: the rows need to be sorted in chronological order, as the YTD value should reflect how much of the value was accumulated at that date within that year.

It may seem like a waste of time to think about an approach that is of no use to solving the problem. But this simple iterative approach provides a very simple recipe for quickly checking whether an actual solution behaves as expected. We'll use it later to verify some results.

A set-oriented approach


To implement it in SQL we have to think in a set-oriented way. Conceptually, we can think about it as if we combine each row in the set with all of the other rows, forming a cartesian product, and then retain only those combinations that have identical values for year, but a smaller or equal value for the month.

This way, each row will combine with itself, and with all the other rows that chronologically precede it within the same year. The YTD value is then obtained by aggregating the rows over year and month value, summing the value to become the YTD value.

Sample Data


To play around a bit with the problem in SQL, let's set up a simple table:
create table SalesYearMonth (
  SalesYear   int
, SalesMonth  int
, SalesAmount decimal(20,2)
, primary key(SalesYear, SalesMonth)
);
And, here's some data:
insert into SalesYearMonth (
  SalesYear
, SalesMonth
, SalesAmount
) values
 (2011,5,503805.92)
,(2011,6,458910.82)
,(2011,7,2044600.00)
,(2011,8,2495816.73)
,(2011,9,502073.85)
,(2011,10,4588761.82)
,(2011,11,737839.82)
,(2011,12,1309863.25)
,(2012,1,3970627.28)
,(2012,2,1475426.91)
,(2012,3,2975748.24)
,(2012,4,1634600.80)
,(2012,5,3074602.81)
,(2012,6,4099354.36)
,(2012,7,3417953.87)
,(2012,8,2175637.22)
,(2012,9,3454151.94)
,(2012,10,2544091.11)
,(2012,11,1872701.98)
,(2012,12,2829404.82)
,(2013,1,2087872.46)
,(2013,2,2316922.15)
,(2013,3,3412068.97)
,(2013,4,2532265.91)
,(2013,5,3245623.76)
,(2013,6,5081069.13)
,(2013,7,4896353.74)
,(2013,8,3333964.07)
,(2013,9,4532908.71)
,(2013,10,4795813.29)
,(2013,11,3312130.25)
,(2013,12,4075486.63)
,(2014,1,4289817.95)
,(2014,2,1337725.04)
,(2014,3,7217531.09)
,(2014,4,1797173.92)
,(2014,5,5366674.97)
,(2014,6,49005.84);
This setup is slightly different from the original problem statement. Instead of a column with DATE data type, we have separate SalesYear and SalesMonth columns. This is fine - it doesn't change the problem or the solution in any way.

In fact, this setup allows us to think about the essential elements of the problem without having to worry about the details of getting to that point. Once we've done that, we can apply the approach to a more realistic case.

Next installment: Solution 1 - a self-JOIN


In the next installment we will present and discuss a solution based on a self-JOIN and a GROUP BY.

Monday, December 02, 2019

Building a UI5 Demo for SAP HANA Text Analysis: Part 4

This is the last of a series of blog posts describing a simple web front-end tool to explore SAP HANA's Text Analysis features on documents uploaded by the user. In the previous post we presented the sample application, explained its functionality, and concluded by pointing to the github repository and the installation instructions, so that you can run the application on your own HANA system.

In this post, we'll explain in detail how the upload functionality works from the UI5 side of things.

Uploading Files and Binary data to HANA .xsodata services using UI5

In the first installment of the series, we discussed options and concerns for loading the (binary) document content into the SAP HANA database. We chose to use an OData service. In this installment, we'll go into fairly deep detail on how to implement a file upload feature backed by a HANA .xsodata service, using the UI5 front-end framework.

Some notes on the UI5 Application Implementation

Before we discuss any particular details of the implementation of the UI5 application, it is necessary to point out that this particular application is demoware. Many typical patterns of UI5 application development were omitted here: there is no internationalization, and no modules or dependency injection with sap.ui.define(). There is not even an MVC architecture, so no XML views or controllers, no Component configuration, and no application descriptor (manifest.json).

Instead, the application consists of just a single index.html, which contains 2 <script> tags:
<script 
  src="https://sapui5.hana.ondemand.com/1.71.5/resources/sap-ui-core.js"
  id="sap-ui-bootstrap"
  data-sap-ui-libs="
    sap.ui.core,
    sap.ui.layout,
    sap.ui.unified,
    sap.ui.table,
    sap.ui.commons, 
    sap.m
  "
  data-sap-ui-theme="sap_bluecrystal"
>
</script>
<script src="index.js" type="text/javascript"></script>
The first one bootstraps ui5, and the second one loads index.js, which contains the implementation.

The main reason for this rather spartan approach is that the primary goal of me and my colleagues Arjen and Mitchell was to quickly come up with a functional prototype that demonstrates the file upload feature. Although I have grown used to a more orthodox UI5 boilerplate, it was a distraction when it came to just quickly illustrating an upload feature. Once we built the upload feature, I wanted to see how easy it would be to augment it and make it a somewhat useful application, and I was kind of interested to experience how it would be to carry on using this unorthodox, pure-javascript approach.

There's much more that could be said about this approach but it's another topic. So for now: if you're new to UI5 and want to learn more: don't take this application as an example, it's atypical. And if you are an experienced UI5 developer: now you have the background, let's move on to the essence.

Communication with the backend using an OData Model

Before we get to the topic of building and controlling the upload feature, a couple of words should be said about how UI5 applications can communicate with their (xsodata) backend.

In UI5, we needn't worry about the exact details of doing the backend call directly. Rather, UI5 offers an object that provides javascript methods that take care of this. This object is the model. In our application, the model is an instance of the sap.ui.model.odata.v2.ODataModel, which we instantiated somewhere in the top of our index.js:
var pkg = document.location.pathname.split('/').slice(1, -2);
var odataService = [].concat(pkg, ['service', 'ta']);  

/**
 * OData
 */
var modelName = 'data';
var model = new sap.ui.model.odata.v2.ODataModel('/' + odataService.join('/') + '.xsodata', {
  disableHeadRequestForToken: true
});
It's not necessary to go over the model instantiation in detail - for now it is enough to know that upon instantiation, the model is passed the uri of the .xsodata service we already built. We obtain the url in the code preceding the model instantiation by taking the url of the current webpage and building a path to service/ta.xsodata relative to that location:
var pkg = document.location.pathname.split('/').slice(1, -2);
var odataService = [].concat(pkg, ['service', 'ta']);  

Uploading a file: high level client-side tasks

From a functional point of view, there are two distinct tasks for the web app (client) to consider:
  • Building the user interface so the user can select the file to upload.
  • Loading the file contents into the database.
The first high-level task is strictly a matter of user interaction and is more or less independent from how the second high level task is implemented. For the second high-level task, we already have the backend in place - this is the OData service we built in the second installment of this blogpost series. What remains is how to do this from within UI5.

But we can already break this task down into two subtasks:
  • Extracting the content from the chosen file. Once the user has chosen a file, they have only identified the thing they want to upload. The web app does not need to parse or understand the file content, but it does need to extract the data (file content) so it can send it to the server.
  • Sending the right request to the backend. The request will somehow include the contents extracted from the file, and it will have such a form that the server understands what to do with those contents - in this case, store it in the a table for text analysis.

A UI5 File Upload control

For the file upload user interaface, we settled on the sap.ui.unified.FileUploader control. Here's the relevant code from index.js that instantiates the control:
var fileUploader = new sap.ui.unified.FileUploader({
  buttonText: 'Browse File...',
  change: onFileToUploadChanged,
  busyIndicatorDelay: 0
});
The sap.ui.unified.FileUploader control is presented to the user as an input field and a button to open a file chooser. This lets the user browse and pick a file from their client device.

In addition, the sap.ui.unified.FileUploader control provides events, configuration options and methods to validate the user's choice, and to send the file off to a server. For example, you can set the uploadUrl property to specify where to send the file to, and there's an upload() method to let the control do the request.

As it turns out, most of this additional functionality did not prove very useful for the task at hand, because the request we need to make is quite specific, and we didn't really find a clear way of configuring the control to send just the right request. Perhaps it is possible, and if so, we would be most obliged to learn how.

What we ended up doing instead is to only use the file choosing capabilities of the sap.ui.unified.FileUploader control. To keep track of the user's choice, we configured a handler for the change event, which gets called whenever the user chooses a file, or cancels the choice.

The handler does a couple of things:
  • Determine whether a file was chosen. If not, the Upload confirmation button gets disabled so the user can only either retry choosing a file, or close the upload dialog.
  • If a file is chosen, a request is sent to the backend to figure out if the file already exists.
  • Depending upon whether the file already exists, the state of the upload dialog is set to inform the user of what action will be taken if they confirm the upload.
Let's go over these tasks in detail. First, validating the user's choice by checking if the user did in fact choose a file:
var fileToUpload;
var fileToUploadExists;
function onFileToUploadChanged(event){
  fileToUpload = null;
  fileToUploadExists = false;
  var files = event.getParameter('files');
  if (files.length === 0) {
    initFileUploadDialog();
    return;
  }
  fileToUpload  = files[0];
  
  ...more code here...
}
Note that we set up the fileToUpload variable to keep track of the user's choice. We need to keep track of it somewhere, since the choosing of the file and the upload are separate tasks with regards to the UI: choosing the file happens when the user hits the Browse button provided by the sap.ui.unified.FileUploader control, whereas the upload is triggered by hitting the confirm button of the upload dialog.

When the user is done choosing the file, the sap.ui.unified.FileUploader will fire the change event, and our handler onFileToUploadChanged() gets called and passed the event as an argument. This event provides access to the FileList object associated with the file chooser:
  var files = event.getParameter('files');
Note: the FileList is not part of UI5. Rather, it is one of a number of browser built-in objects which together form the Web File API. We would have loved to obtain the FileList or the File object from our sap.ui.unified.FileUploader control directly, by using a getter or something like that, but at the time we found no such method, and settled for a handler for the change event.

Once we have the FileList, we can check whether the user selected any files, and either disable the upload confirmation button (if no file was selected), or assign the chosen file to our fileToUpload variable so we can refer to it when the upload is confirmed:
function onFileToUploadChanged(event){

  ...

  if (files.length === 0) {
    initFileUploadDialog();
    return;
  }
  fileToUpload = files[0];

  ....

}
If we pass the check, our variable fileToUpload will now contain the File object reflecting the user's choice. (Note that this object too is not a UI5 object, it's also part of the Web File API.)

Note that in theory, the list of files associated with the sap.ui.unified.FileUploader could have contained more than one file. But the default behavior is to let the user choose only one file. You can override that behavior by setting the sap.ui.unified.FileUploader's multiple property to true. Because we know that in this case, there can be at most one file, we only need to check whether there is a file or not - there's no need to consider multiple files.

Checking whether the File was already Uploaded

Once we know for sure the user has chosen a file, it remains to be determined what should be done with it should the user decide to confirm the upload. To help the user decide whether they should confirm, we send a request to the backend to find out if the file was already uploaded:
function onFileToUploadChanged(event){

  ...

  fileToUpload  = files[0];
  fileUploader.setBusy(true);
  model.read('/' + filesEntityName, {
    filters: [new sap.ui.model.Filter({
      path: fileNamePath,
      operator: sap.ui.model.FilterOperator.EQ,
      value1: fileToUpload.name
    })],
    urlParameters: {
      $select: [fileNamePath, 'FILE_LAST_MODIFIED']
    },            
    success: function(data){

      ...update state depending upon whether the file exists...

    },
    error: function(error){

      ...update state to inform the user of an error...

    }
  });

  ....

}
The model provides a read() method which can be used to query the backend OData service. The first argument to the read() method is the so-called path, which identifies the OData EntitySet we want to query. In this case, we are interested in the Files EntitySet, as this corresponds to the CT_FILE database table in our backend. Because we use the name of the Files EntitySet in a lot of places, we stored it in the filesEntityName variable. So, our path becomes:
'/' + filesEntityName
Apart from the path, the read() method takes a second argument, which is an object of query options. We'll highlight the few we need here.

Because we only want to know whether the backend already has a file with the same name as the one the user just selected, we add a parameter to restrict the search. This is done with the filters option:
    filters: [new sap.ui.model.Filter({
      path: fileNamePath,
      operator: sap.ui.model.FilterOperator.EQ,
      value1: fileToUpload.name
    })],
The filters option takes an array of sap.ui.model.Filter objects. When we instantiate the sap.ui.model.Filter object, we pass an object with the following configuration options:
  • path - this should get a value that refers to a property defined by the OData entity type of this Entity Set. It corresponds to the name of a column of our database table. In this case, it is set to fileNamePath, which is a variable we initialized with 'FILE_NAME', i.e., the name of the column in the CT_FILE table that holds the name of our files.
  • value1 - this should be the literal value that we want to use in our filter. In this case, we want to look for files with the same name as the file chosen by the user, so we set it to the name property of the File object that the user selected - fileToUpload.name
  • operator - this should be one of the values defined by the sap.ui.model.FilterOperator object, which defines how the given filter value should be compared to the value of the column. In this case the operator is sap.ui.model.FilterOperator.EQ, which stands for an equals comparison. By using this operator, we demand that the value of the column should be exactly the same as the name of the chosen file.
There is one other option specified that affects the request:
    urlParameters: {
      $select: [fileNamePath, 'FILE_LAST_MODIFIED']
    },            
This specifies for which columns we want to retrieve the values from the backend. It may be omitted, but in that case, all columns would be returned. Often this will not be a problem, but in this case, we really want to prevent the server from returning the values for the FILE_CONTENT column. Always retrieving the file contents would be an unnecessary burden for both the front- and the backend, so we actively suppress the default behavior. The only columns requested here are FILE_NAME and FILE_LAST_MODIFIED. The latter is currently unused but might come in handy to provide even more information to the user so they can better decide whether they want to re-upload an existing file.

The remaining options in the call to the model's read() method have nothing to do with the request, but are callback functions for handling the result of the read request. The error callback gets called if there is some kind of issue with the request itself - maybe the backend has gone away, or maybe the structure of the service changed. The success callback is called when the read request executes normally, and any results are then passed to it as an argument. This is even true if no results are found - the callback then simply gets passed an empty list of results.
In our example, the main purpose of the success callback is to flag whether the file already exists, and to update the state of the file uploader accordingly to inform the user. The existence of the file is flagged by assigning the fileToUploadExists variable, and we will see its significance in the next section where we discuss the implementation of the upload of the file contents.

Handling the Upload

We've just seen exactly how the UI5 application can let the user choose a file, and we even used our model to check whether the chosen file is already present in the backend. Once these steps are done, we have successfully initialized two variables, fileToUpload and fileToUploadExists. This is all we need to handle the upload.

In the application, the user initiates the upload by clicking the Confirmation Button of the uploadDialog. This then triggers the button's press event, where we've attached the function uploadFile as handler.

So, in this handler, we must examine the value of the fileToUploadExists variable and take the appropriate action:
  • If fileToUploadExists is false, we should tell our model to add a new item. This is done by calling the createEntry()-method
  • If fileToUploadExists is true, we should tell our model to update the existing item. This is done by calling the update()-method

The path argument

Both methods take a path as first argument to indicate where the new item should be added, or which item to update.

When adding a new item, the path is simply the path of the Files Entity Set in the model:
'/' + filesEntityName
(Note: this is exactly the same as the path we used in the read() call to figure out whether the file already exists.)

The path for updating an existing item also starts with the path of the Entity Set, but includes the key to identify the item that is to be updated. Lucky for us, the sap.ui.model.odata.v2.ODataModel model provides the createKey() method which constructs such a path, including the key part, based on the values of the properties that make up the key. So, the code to construct the path for the update method becomes:
'/' + model.createKey(filesEntityName, {"FILE_NAME": fileToUpload.name})
(For more detailed information about how OData keys and paths work, see OData URI Conventions, in particular the section on "Addressing Entries".)
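To make this concrete: assuming the user chose a (hypothetical) file named report.pdf, the createKey() call would produce a key predicate along these lines:
// hypothetical example - the key predicate is built from the FILE_NAME key property
model.createKey(filesEntityName, {"FILE_NAME": "report.pdf"})
// returns something like: "Files('report.pdf')"
// so the path passed to update() becomes: "/Files('report.pdf')"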

The payload

In addition to the path, we also need to pass the data, which is sometimes referred to as the payload. While the path tells the model where to add or update an item, the payload specifies what should be added or updated.

Now, even though the UI5 documentation is not very specific about how to construct the payload, we have used the createEntry() and update() methods of the sap.ui.model.odata.v2.ODataModel in the past without any problems. It is normally quite intuitive and hassle-free: you simply create an Object whose keys match the property names of the target entity set, and assign plain JavaScript values to them, just as-is. So, if we disregard the FILE_CONTENT field for a moment, the payload for the Files entity set could be something like this:
var payload = {
  "FILE_NAME": fileToUpload.name,
  "FILE_TYPE": fileToUpload.type,
  "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
  "FILE_SIZE": fileToUpload.size,
  "FILE_LAST_UPLOADED": new Date(Date.now())
};
Let's compare this to the data types of the corresponding properties in the entity type of the entity set:
<EntityType Name="FilesType">
  <Key>
    <PropertyRef Name="FILE_NAME"/>
  </Key>
  <Property Name="FILE_NAME" Type="Edm.String" Nullable="false" MaxLength="256"/>
  <Property Name="FILE_TYPE" Type="Edm.String" Nullable="false" MaxLength="256"/>
  <Property Name="FILE_LAST_MODIFIED" Type="Edm.DateTime" Nullable="false"/>
  <Property Name="FILE_SIZE" Type="Edm.Int32" Nullable="false"/>
  <Property Name="FILE_CONTENT" Type="Edm.Binary" Nullable="false"/>
  <Property Name="FILE_LAST_UPLOADED" Type="Edm.DateTime" Nullable="false"/>
</EntityType>
(Note: this entity type is taken from the $metadata document of our service.)

So, in short - there is a pretty straightforward mapping between the JavaScript runtime values and the Edm Type System (see: "6. Primitive Data Types") used by OData: JavaScript Strings may be assigned to Edm.Strings, JavaScript Date objects may be assigned to Edm.DateTimes, and JavaScript Numbers may be assigned to Edm.Int32s.

This is less trivial than one might think when one considers what happens here: on the one hand, we have the types as declared by the OData service, which are Edm Types. Then, we have to consider the content type used to transport the payload in the HTTP request: OData services may support several content types, and by default OData supports both application/atom+xml and application/json. So, when starting with a payload as a JavaScript runtime object, this first needs to be converted to an equivalent representation in one of these content types (UI5 uses the JSON representation) by the client before it can be sent off to the OData service in an HTTP request.

It bears repeating that this is not a simple, standard JSON serialization, since the type system used by the JSON standard only knows how to represent JavaScript Strings, Numbers, Booleans (and arrays and objects containing values of those types). The native JSON type system is simply too minimal to represent all the types in the Edm Type system used by OData, hence the need for an extra JSON representation format. The sap.ui.model.odata.v2.ODataModel does a pretty remarkable job in hiding all this complexity and making sure things work relatively painlessly.
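To illustrate what that extra representation looks like (an assumption about the wire format, not code from the application): an Edm.DateTime value, for example, does not travel as a plain JSON value.
// Illustration (assumed OData V2 JSON wire format, not application code):
// a JavaScript Date assigned to an Edm.DateTime property, such as
//   "FILE_LAST_MODIFIED": new Date(1612345678000)
// would typically be serialized by the client as
//   "FILE_LAST_MODIFIED": "\/Date(1612345678000)\/"
// rather than as a plain JSON number or an ISO 8601 string.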

Representing FILE_CONTENT in the payload

Now for the FILE_CONTENT property. In the entity type, we notice that the data type is Edm.Binary. What would be the proper JavaScript runtime type to construct the payload?

We just mentioned that the mapping from JavaScript runtime types is normally taken care of by the sap.ui.model.odata.v2.ODataModel. So we might be tempted to simply pass the File object itself directly as the value for the FILE_CONTENT property. But when we call either the createEntry or update method with a payload like this:
var payload = {
  "FILE_NAME": fileToUpload.name,
  "FILE_TYPE": fileToUpload.type,
  "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
  "FILE_SIZE": fileToUpload.size,
  "FILE_LAST_UPLOADED": new Date(Date.now()),
  "FILE_CONTENT": fileToUpload
};
we get an error in the response:
The serialized resource has an invalid value in member 'FILE_CONTENT'.
So clearly, the sap.ui.model.odata.v2.ODataModel needs some help here.

One might assume that the problem has to do with the File object being a little bit too specific for UI5 - after all, a File object is not just some binary value, but a subclass of the Blob object, which has all kinds of file-specific properties of its own. However, assigning a proper, plain Blob object in the payload yields exactly the same result, so that's not it either.

Instead of continuing to experiment with different values and types, we took a step back and took a look at the OData specification to see if we could learn a bit more about the Edm.Binary type. In the part about the JSON representation (See: "4. Primitive Types") we found this:
Base64 encoded value of an EDM.Binary value represented as a JSON string
It seems to suggest that whatever represents the Edm.Binary value needs to be Base64 encoded, which yields a string value at runtime, and this string may then be serialized to a JSON string. So, if we could make a Base64 encoded string value of our binary value, we could assign that in the payload. (We already saw that sap.ui.model.odata.v2.ODataModel will turn JavaScript String values into a JSON representation, so we don't have to do that step ourselves.)

Fortunately, it's easy to create Base64 encoded values. The browser built-in function btoa() does this for us.
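For example (not from the application code):
// btoa() takes a string in which every character code is in the 0-255 range
// and returns its Base64 encoding:
btoa('Hello');      // "SGVsbG8="
// atob() performs the reverse operation:
atob('SGVsbG8=');   // "Hello"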

However, we're not there yet, as the spec starts with a binary value, and JavaScript does not have a binary type (and hence, no binary values).

We then took a look at the specification to find out exactly what an Edm.Binary value is. We found something in the section about Primitive Data Types on how to create literal Edm.Binary values:
binary'[A-Fa-f0-9][A-Fa-f0-9]*' OR X '[A-Fa-f0-9][A-Fa-f0-9]*' 
NOTE: X and binary are case sensitive. 
Spaces are not allowed between binary and the quoted portion. 
Spaces are not allowed between X and the quoted portion. 
Odd pairs of hex digits are not allowed.

Example 1: X'23AB' 
Example 2: binary'23ABFF'
At this point the thinking was that we could take the bytes that make up the binary value and convert them to their hexadecimal string representation, single-quote the resulting hex string, and finally prepend either X or binary to it. At runtime, this would then be a JavaScript string value representing an Edm.Binary literal, which we could then turn into its Base64 encoded value and assign to the payload.
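A conversion along those lines might have looked something like this (a sketch with a hypothetical toEdmBinaryLiteral() helper; it is shown only to illustrate the detour, not the final solution):
// Hypothetical sketch: turn a binary string (one character per byte)
// into an Edm.Binary literal such as X'23ABFF'
function toEdmBinaryLiteral(binaryString) {
  var hex = '';
  for (var i = 0; i < binaryString.length; i++) {
    var byteHex = binaryString.charCodeAt(i).toString(16).toUpperCase();
    if (byteHex.length < 2) {
      byteHex = '0' + byteHex;
    }
    hex += byteHex;
  }
  return "X'" + hex + "'";
}
// the literal was then Base64 encoded and assigned to the payload:
// "FILE_CONTENT": btoa(toEdmBinaryLiteral(binaryString))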

When we went this route, the error message went away, and sure enough, documents started to show up in our backend table. Unfortunately, the documents ended up there as Edm.Binary literals, that is, as strings that are an accurate Edm.Binary literal representation of the document but otherwise useless.

At this point the solution was clear though - just leave out the intermediate step of converting the original value to an Edm.Binary literal, and Base64 encode the binary string directly.

The uploadFile function

Remember, at this point we have the File object stored in the fileToUpload variable, and the flag fileToUploadExists is set to true or false depending upon whether the file is already stored in the backend table. This is the code we ended up with for uploading the file:
function uploadFile(){
  var fileReader = new FileReader();
  fileReader.onload = function(event){
    // the file contents, read as a binary string (one character per byte)
    var binaryString = event.target.result;
    var payload = {
      "FILE_NAME": fileToUpload.name,
      "FILE_TYPE": fileToUpload.type,
      "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
      "FILE_SIZE": fileToUpload.size,
      // Base64 encode the binary string for the Edm.Binary FILE_CONTENT property
      "FILE_CONTENT": btoa(binaryString),
      "FILE_LAST_UPLOADED": new Date(Date.now())
    };
    if (fileToUploadExists) {
      // the file already exists in the backend: update the existing entry
      model.update(
        '/' + model.createKey(filesEntityName, {
          "FILE_NAME": fileToUpload.name
        }),
        payload
      );
    }
    else {
      // the file is new: create a new entry
      model.createEntry('/' + filesEntityName, {
        properties: payload
      });
    }
    // send the pending change to the backend and close the dialog on success
    model.submitChanges({
      success: function(){
        closeUploadDialog();
      }
    });
  };
  fileReader.readAsBinaryString(fileToUpload);
}
As explained earlier, uploading the file breaks down into two subtasks, and this handler takes care of both:
  • First, we use the FileReader to read the contents of the File object
  • Then, we send it to the backend. To do that, we construct the path and the payload, and call either the createEntry or the update method, depending on whether the file already exists, passing both the path and the payload.

Using the FileReader to read the contents of a File object

First, we need to read the contents of the file. We do that using a FileReader, which is also part of the Web File API. To get the contents of a File object, we can call one of the FileReader's read methods.

The FileReader's read methods do not return the contents of the file directly: the Web File API is mostly asynchronous. Instead, we have to attach an event handler to the FileReader which can respond to the FileReader's events. In this case we assigned our handler to the FileReader's onload property, which gets called when the FileReader is done reading a File. (Instead of assigning to onload, we could also have attached the handler with addEventListener, but it really doesn't matter too much how the handler is attached.)
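For reference, the addEventListener variant would look something like this:
// equivalent to assigning a function to fileReader.onload:
fileReader.addEventListener('load', function(event){
  var binaryString = event.target.result;
  // ...handle the file contents...
});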

Once set up, we can now call a read() method and wait for the reader to call our onload() handler.

So the general structure to read the file is as follows:
function uploadFile(){
  var fileReader = new FileReader();
  fileReader.onload = function(event){

    var binaryString = event.target.result;

    ...do something with the file contents...

  };
  fileReader.readAsBinaryString(fileToUpload);
}


We already mentioned that the FileReader provides a number of different read methods, and the chosen method determines the type of the value that will be available in event.target.result by the time the load handler is called. Today, the FileReader provides:
  • readAsArrayBuffer()
  • readAsBinaryString()
  • readAsDataURL()
  • readAsText()
To figure out which method we should use, we should consider how our backend expects to receive the data. Or rather, how our sap.ui.model.odata.v2.ODataModel wants us to pass the data so it can make the appropriate call to the backend. In a previous section we already explained our struggle to figure out how to represent an Edm.Binary value in the payload, and based on those findings, readAsBinaryString() is the appropriate method. With this read method, the FileReader turns each individual byte of the file contents into a JavaScript character, much like the fromCharCode() method of the String object would do. The resulting value is a JavaScript binary string: each character represents a byte.

Note that this is very different from what the readAsText() method would do: that would attempt to decode the bytes as UTF-8 encoded text; in other words, it would result in a character string, not a binary string.
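To make the difference concrete, consider a (hypothetical) file whose content is the three bytes 0xE2 0x82 0xAC - the UTF-8 encoding of the euro sign:
// readAsBinaryString(): one JavaScript character per byte
// event.target.result === "\u00E2\u0082\u00AC"   (3 characters - a binary string)
// readAsText(): the bytes are decoded as UTF-8 text
// event.target.result === "€"                    (1 character - a character string)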

After obtaining the file contents as binary string, we can apply the Base64 encoding and assign it to the payload:
    var payload = {
      "FILE_NAME": fileToUpload.name,
      "FILE_TYPE": fileToUpload.type,
      "FILE_LAST_MODIFIED": new Date(fileToUpload.lastModified),
      "FILE_SIZE": fileToUpload.size,
      "FILE_CONTENT": btoa(binaryString),
      "FILE_LAST_UPLOADED": new Date(Date.now())
    };

Summary

This concludes the last installment in this blog series. In this post we learned how to use:
  • the Web File API - in particular the File object and the FileReader with its readAsBinaryString() method - to obtain the contents of a file chosen by the user
  • the sap.ui.model.odata.v2.ODataModel and its read(), createKey(), createEntry(), update() and submitChanges() methods to check whether a file already exists in the backend and to store its contents there
  • the browser built-in btoa() function to Base64 encode the binary file contents so they can be assigned to an Edm.Binary property in the payload
And by putting these elements together we created a File upload feature for a UI5 application, backed by a HANA OData service.

Odds and Ends

It's nice that we finally found out how to write the file contents to our OData Service. But something does not feel quite right. Although we happened to find a way to write the binary data that satisfies both the sap.ui.model.odata.v2.ODataModel and the SAP HANA .xsodata service that backs it, we still haven't found any official documentation, either from the OData specification or from SAP, that confirms that this is really the correct way. We would hope that SAP HANA's .xsodata implementation is a faithful implementation of the standard, but for the Edm.Binary type, I'm just not 100% sure. If anybody could chime in and confirm this - preferably by pointing me to something in the OData specification - then I would be most grateful.
