Tuesday, December 29, 2015

jjsml: a Module Loader for the Nashorn JavaScript Shell

jjs is a JavaScript shell that ships with Oracle Java 1.8. I recently found myself in a situation where it seemed worth while to check it out, so I did. I do not want to use this post to elaborate too much on why I started looking at jjs, but I intend to write about that shortly. For now I just want to share a few observations, as well as a solution to a particular obstacle I encountered.

What is jjs?

Java 1.8 (both SDK and JRE) ships a new executable called jjs. This executable implements a command-line JavaScript shell - that is, a program that runs from the command line and which "speaks" JavaScript. It is based on Nashorn - a JavaScript engine written in Java and first released with Java 1.8.

Nashorn can be used to offer embedded JavaScript functionality in Java applications; jjs simply is such an application that happens to be desigend as shell.

Why do I think jjs is cool?

The concept of a JavaScript shell is not at all new. See this helpful page from Mozilla for a comprehensive list. The concept of a JavaScript engine written in Java is also not new: The Rhino engine has been shipping with Java since version 1.6.

Even so, I think jjs has got a couple of things going for it that makes it uniquely useful:
jjs is an executable shell
Even though Java shipped a javascript engine since Java 1.6, (the Rhino engine), you'd still have to get a shell program externally.
Ships with Java
This means it will be available literally everywhere. By "available everywhere" I don't just mean that it will be supported on many different operating systems - I mean that it will in fact be installed on many actual systems, and in many cases it will even be in the path so that you can simply run it with no extra configuration or setup. This is not the case for any other javascript shell.
Java Integration
jjs comes with extensions that make it really quite simple to instantiate and work with Java classes and objects (but of course, you get to do it using JavaScript syntax). This may not mean much if you're not familiar with the Java platform. But if you are it means you get to use tons and tons of high-level functionality basically for free, and you get to use it through an interface you are already familiar with. I'm sure other JavaScript shells allow you to create your own extensions so you can make external functionality scriptable, but then you'd still need to create such a module, and you'd need to invent the way how to bind that external functionality to shell built-ins. Because Java and JavaScript both support an objected-oriented language paradigm the integration is virtually seamless.
It's JavaScript!
To some, this may sound hardly as an argument in favor, but to me, it is. While Java is great in that it offers so much functionality, I just can't comfortably use it for quick and dirty programs. There are many reasons why this is so - very verbose and explicit syntax, compilation, static typing etc. I'm not saying these language features are categorically bad - just that they get in the way when you just want to make a small program real quick. I really think JavaScript helps to take a lot of that pain away.
Scripting extensions
In addition to offering JavaScript with Java language integration, jjs throws in a couple of features that are neither Java nor JavaScript: String Interpolation and Here document support. Together, these features make it very easy to use jjs for templating and text output generation. Again such features themselves are very far from unique. But in my opinion, they complement the Java/JavaScript language environment in a pretty significant way and greatly increase its ease of use.
So, in short, I'm not saying any particular feature is unique about jss. But I do think the particular combination of features make it a very interesting tool, especially for those that are used to Java and/or JavaScript and that have the need for ubiquitous and reliable shell scripting.

Quick example

The Nashorn Manual already provides pretty complete (but concise) documentation with regard to both the Java language integration and the typical shell-scripting features, so I won't elaborate too much on that. Instead, I just want to provide a really simple example to immediately give you an idea of how jjs works, and what it would be like to write scripts for it.

Here's my example script called "fibonacci.js":
//get the number of arguments passed at the command-line
var n = $ARG.length;
if (n === 0) {
  //if no arguments were passed, we can't calculate anything, so exit.
  print("Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.");
  exit();
}

//a function to calculate the n-th number in the fibonacci sequence
function fibonacci(n){
  var i, fibPrev = 0, fibNext = 1;
  if (n === 1) return fibPrev;
  if (n === 2) return fibNext;
  for (i = 2; i < n; i++, fibNext = fibPrev + (fibPrev = fibNext));
  return fibNext;
}

var i, arg;

//process the arguments passed at the command line:
for (i = 0; i < n; i++) {
  arg = $ARG[i];

  //validate the argument. Is it a positive integer?
  if (!/^[1-9]\d*$/g.test(arg)) {

    //Not valid, skip this one and report.
    print(<<EOD
Skipping "${arg}": not a positive integer.
    EOD);
    continue;
  }

  //Calculate and print the fibonacci number.
  print(<<EOD
fibonacci(${arg}): ${fibonacci(parseInt(arg, 10))}
  EOD);
}
As you might have guessed the script calculates the fibonacci number corresponding to the sequence number(s) entered by the user on the command-line. While this may not be a very useful task, this example does illustrate the most important elements that are key to typical shell scripts:
  • Collect and validate command-line arguments passed by the user
  • Process the arguments - use them to do some work deemed useful
  • Print out information and results
Assuming that jjs is in the path, and fibonacci.js is in the current working directory, then we can execute it with this command:
$ jjs -scripting fibonacci.js
There are three parts to this command: jjs, the actual executable program, the -scripting option, and finally fibonacci.js, which is the path to our script file.

I think the executable and the script file parts are self-explanatory. The -scripting option may need some explanation: without this, jjs only supports a plain, standard javascript environment without any extensions that make it behave like a proper shell-scripting environment. Personally I think it's a bit annoying we have to explicitly pass it, but such is life. Just make sure you always pass it.

On *nix based systems, you can even make the script itself executable by prepending the code with a shebang, like so:
#!/usr/bin/jjs
When we run it, the output is:
Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.
Which is not surprising, since we didn't actually provide any input arguments. So let's fix that:
$ jjs -scripting fibonacci.js -- 1 2 3 4 5 6
fibonacci(1): 0
fibonacci(2): 1
fibonacci(3): 1
fibonacci(4): 2
fibonacci(5): 3
fibonacci(6): 5
Note the -- delimiter in the command line. This is a special token that tells the jjs executable to treat anything that appears after the -- token as command line arguments targeted at the script. We will see in a bit how you can access those arguments programmatically from within your JavasSript code.

Now, it's not necessary to analyze every aspect of this script. Instead, let's just highlight the features that are specific to the jjs environment itself. In the code sample above, I marked up those elements that are specific to Nashorn/jjs in lovely fuchsia
  • We can use the built-in global $ARG property to refer to the arguments passed at the command line. $ARG is essentially an array that will contain any token appearing after the -- delimiter on the jjs command-line is an element. Since $ARG can be treated as an array, we can use its length property to see how many arguments we have, and we can use the array access operator (square braces) to get an element at a particular index.
  • We can use the built-in print() function to write to standard out.
  • We can call the built-in function exit() to terminate the jjs program. This will exit the shell, and return you to whatever program spawned jjs.
  • Here documents are strings delimited by <<EOD and EOD delimiters. At runtime, Here documents are simply javascript strings, but their syntaxis allows the string to span multiple lines, and to contain unescaped regular string literal delimiters (single and double quotes, " and '). This makes Here documents extremely useful for generating largish pieces of text.
  • Literal strings and here documents can contain ${<expression>} placeholders, and jjs will automatically substitute these with the value of the expression. Expressions can be simple variable names, but also more complex expressions, like function calls, or operations.

Next step: reusing existing functionality with load()

We already mentioned that the program does three distinct things (collect input, process input to create output, and print output). Obviously, these different tasks are related or else they wouldn't have been in the same script. But at the same time, we can acknowledge that the fibonacci() function that does the actual work of calculating numbers from the Finonacci sequence might be usable by other scripts, whereas collecting input and printing output (let's call it the user-interface) are quite specific for exactly this program.

The desire to separate these concerns into distinct scripts is only natural and will become stronger as programs grow bigger and more complex. So let's cut it up.

For now let's assume we only want to be able to reuse the fibonacci() function in other scripts. In order to do that we at least have to separate that in put it in a new script. Let's call that fibonacci.js, and let's store it in some central directory where we store all of our reusable scripts - say, scripts/fibonacci.js. Handling of both input and output can stay together in the same script for now - let's just call that fibonacci-ui.js.

So, in order to actually make this happen we can rename the orignal file to fibonacci-ui.js and create a new and empty fibonacci.js which we store in the scripts directory. We can then cut the code for the entire fiboncci() from fibonacci-ui.js and paste it into fibonacci.js.

Of course, the code in fibonacci-ui.js is now missing its definition of the fibonacci() function. But since it is still calling the function, we somehow need to import the function definition from scripts/fibonacci.js. To do that, the Nashorn scripting environment provides a built-in function called load().

The load() function takes a single string argument, which represents a file path or a URL that identifies a script. Load will evaluate the script immediately, and it returns the evaluation result. (It may be bit strange to think of the evaluation result of script, but this is simply the value of the last expression. It need not concern us right now since the fibonacci() function is global anyway - we can call it regardless of whether we have the evaluation result of scripts/fibonacci.js or not.)

So, putting it all together, we now have scripts/fibonacci.js:
//a function to calculate the n-th number in the fibonacci sequence
function fibonacci(n){
  var i, fibPrev = 0, fibNext = 1;
  if (n === 1) return fibPrev;
  if (n === 2) return fibNext;
  for (i = 2; i < n; i++, fibNext = fibPrev + (fibPrev = fibNext));
  return fibNext;
}
and fibonacci-ui.js:
//get the number of arguments passed at the command-line
var n = $ARG.length;
if (n === 0) {
  //if no arguments were passed, we can't calculate anything, so exit.
  print("Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.");
  exit();
}

//acquire the fibonacci function
load("scripts/fibonacci.js");

var i, arg;

//process the arguments passed at the command line:
for (i = 0; i < n; i++) {
  arg = $ARG[i];

  //validate the argument. Is it a positive integer?
  if (!/^[1-9]\d*$/g.test(arg)) {

    //Not valid, skip this one and report.
    print(<<EOD
Skipping "${arg}": not a positive integer.
    EOD);
    continue;
  }

  //Calculate and print the fibonacci number.
  print(<<EOD
fibonacci(${arg}): ${fibonacci(parseInt(arg, 10))}
  EOD);
}
Assuming that jjs is in the path, and the directory where we stored fibonacci-ui.js is also the current working directory, we can now run it like this:
$ jjs -scripting fibonacci-ui.js -- 1 2 3 4 5 6
If all went well, we would get the same output as we got before.

The problem with load()

While it is cool that we can load external scripts with the built-in load() function, not all is well. We just ran our fibonacci-ui.js script while our current working directory was also the directory where our script resides. Suppose that, for some reason, our current working directory is the scripts directory - i.e. in a subdirectory of the directory where fibonacci-ui.js resides.

Obviously, we need to modify our command line to point to the right location of fibonacii-ui.js, like so:
$ jjs -scripting ../fibonacci-ui.js -- 1 2 3 4 5 6
But if we run this, we get an error message:
../fibonacci-ui.js:7 TypeError: Cannot load script from scripts/fibonacci.js
This tells us is that in our entry point script, ../fibonacci-ui.js, an error occurred at line 7. This is our call to load() and the complaint is that it cannot seem to find the external script file as it was passed to load, scripts/fibonacci.js.

So, it looks like load() resolves relative paths against the current working directory. This is a bit of a bummer since it poses a serious challenge to creating reusable, portable scripts. What I would find more intuitive is if load() would resolve relative paths against the directory of the current script (that is, the directory where the script that calls out to load() resides).

Alas, it does not so we have to find a solution.

UPDATE: @sundararajan_a kindly pointed me to a page in the Open JDK wiki which explains that there actually does exist a solution. Nashorn/jjs provides the built-ins __DIR__ and __FILE__ which hold the directory and file of the current script. There's also a __LINE__ built-in which holds the current linenumber in the script. So we can write load(__DIR__ + relative_path) to resolve a relative path against the directory of the current script.


Without any extra measures, I think there are only two possible "solutions":
  • Abstain from using relative paths - only pass absolute paths to load().
  • Write your entry point scripts with one fixed working directory in mind, and ensure that is the current working directory before running your script with jss.
Neither of these "solutions" should make you happy: requiring a fixed directory for all reusable scripts means our reusable scripts aren't portable. We can't just demand that the location of our reusable scripts are the same, regardless of platform. And even if we could, we would be unable to have multiple copies (say, one copy for active development, one for production purposes) of our reusable scripts on one system. So clearly this is solution is not without its own share of problems. Requiring a fixed working directory for our top-level entry point scripts is maybe a little bit better, but would require some external solution to ensure the correct working directory for our script.

Despite the possibility of using these workarounds, there is one particular scenario that I think just can't be solved - at least, not without a way to resolve relative paths against the location of the "current" script. Suppose you would create a script that is intended for reuse. But now suppose that this script itself relies on functionality from other reusable scripts. Either this script chooses to use absolute paths to acquire the scripts it depends upon, in which case the entire solution becomes unportable. Or this script would require whichever script that needs to use it to first set the current working directory so that correct loading of its dependencies is ensured.

No matter which way I look at it, load() puts us between a rock and a hard place.

A better load()?

Let's imagine for a moment we can write "a better load()" - that is, a function that works just like the built-in load(), but which resolves any relative paths against "the current script". This function would also need to have some sort of starting point - that is, it needs to know the current working directory. For now let's call this function a_better_load().

As it turns out, all the elements that we need to actually build a_better_load() are present in the jjs/Nashorn scripting environment. In pseudo code it would look something like this:
//only create the function if it doesn't already exist
if (!a_better_load) {

  //create a stack to keep track of the current working directory
  //$PWD is provided by jjs and contains the current working directory
  cwd = [$PWD];

  function a_better_load(path){

    //separate path in a directory and a file
    dir = getDir(path);
    file = getFile(path);

    if (path is absolute) {

      //path is absolute - push it unto the cwd stack so that any load requests 
      //that might occur inside the script we are now loading get resolved against 
      //the absolute path of this current script.
      cwd.push(dir)

    }
    else {
  
      //path is relative - take the path that is currently on top of the cwd stack,
      //and append the relative dir path
      //This should be the dir of the current script.
      cwd.peek() += dir

    }
 
    //use the built-in load to actually acquire the external script;
    ret = load(cwd.peek() + file);

    //after loading, we have to restore the state of the cwd stack

    if (path is absolute) {

      //the absolute path of the current script is on top of the stack.
      //remove it.
      cwd.pop()

    }
    else {
  
      //path is relative - take the path on top of the stack and remove
      //the part of the path that was added by the relative dir of the 
      //current script
      cwd.peek() -= dir

    }

    //return the evaluated script, just like load() would do.
    return ret;
  }

}

jjsml: A module loader for jjs

I actually implemented something like a_better_load() and then I started to think about the problem a bit more. Sure, a_better_load() sovles the script portability problem. But it adds a new global, and it would require all scripts to use this instead of plain, built-in load(). Alternatively, I could modify the script for a_better_load() a bit more and actually overwrite the built-in load(). But this would actually be worse in a way, since I would then need to distinguish between those scripts that know about this modified behavior of load() and those that, for some reason or other rely on the default behavior of load() (which would basically be any third party script).

I ended up creating a solution familiar from the browser world - a module loader like RequireJS. The solution is called jjsml and you can find it here on github: https://github.com/rpbouman/jjsutils/blob/master/src/jjsml/jjsml.js.

You might wonder what a module loader is, and why you'd need it - indeed, it may seem we already solved the problem by creating a better load(), why would we need to introduce a new concept?

I just identified a couple of problems by introducing a_better_load(). It seems self-evident to me that no matter what solution we end up with, new scripts would need to use it and become dependent upon it to actually benefit from it. This would even be the case if we'd overwrite the built-in load() function, which seems like a good argument to me to not do such a thing at all, ever, since a visible dependency is much better than a magical, invisible one.

So, if we're going to need to accept that we'd have to write our scripts to explicitly use this solution, then we'd better make sure the solution is attractive as possible, and offers some extra functionality that makes sense in the context of external script and dependency loading. I'm not sure if I succeeded, but check out the next section and let me know in the comments what you think.

Using jjsml

jjsml.js provides a single new global function called define(). The name and design was borrowed directly from RequireJS and it is called like that because it's purpose is to define a module. A module is simply some bag of related functionalities and it may take the form of an object, or a function. Instead of being a referenceable value itself, the module may even manifest itself simply by exposing new globals. Defining a module simply means creating it, and doing everything that presumes creating it, such as loading any dependencies and running any initialization code.

The define() function has the following signature:
<module> define([<scriptPath1> ,..., <scriptPathN>], <moduleConstructor>) 
  • <module> is a module - typically an object or a function that provides some functionality.
  • <scriptPath> is either a relative or an absolute path that points to a script. Ideally loading this script would itself return a module. Note that the square braces in the signature indicate that the argument is optional, and can be multiple - In other words, you can pass as many (or as little) of these as you like. Currently, define() does not handle an array of dependencies (but I should probably modify it so that it does. At least, RequireJS does it that way too.)
  • <moduleConstructor> is a module constructor - some thing that actually creates the module, and typically runs the code necessary to initialize the module.
While this describes it in a very generalized way, there are a few more things to it to get the most out of this design:
  • An important point to make is that the various <scriptPath>'s are guaranteed to be loaded prior to evaluating the <moduleConstructor>. The idea is that each <scriptPath> represents a dependency for the module that is being defined.
  • In many cases, the <moduleConstructor> is a callback function. If that is the case, then this callback will be called after all dependencies are loaded, and those dependencies would be passed to the callback function as arguments, in the same order that the dependencies were passed. Of course, to actually use the dependencies in this way, evaluating the <scriptPath> should result in a referenceable value. So this works best if the <scriptPath> are themselves proper modules. The immediate advantage of this design is that no module ever really needs to add any new globals: any functionality provided by a module is, in principle, managed in isolation from any functionalities provided by any other modules.
  • Proper modules are loaded only once, and are then cached. Subsequent references (typically, because another module tries to load it as a dependency) to an already loaded module will be served from the cache. This ensures that no time is wasted loading and running modules, or worse, to mess up module initialization.

Modular fibobacci example

This may sound a bit abstract so here's a concrete example, based on our earlier Fibonacci example. Let's first turn scripts/fibonacci.js into a proper module:
(function(){

  //a function to calculate the n-th number in the fibonacci sequence
  function fibonacci(n){
    var i, fibPrev = 0, fibNext = 1;
    if (n === 1) return fibPrev;
    if (n === 2) return fibNext;
    for (i = 2; i < n; i++, fibNext = fibPrev + (fibPrev = fibNext));
    return fibNext;
  }

  return fibonacci;

})();
Note that the script itself is an anonymous function, that is called immediately. The purpose of this construct is to establish a scope wherein we can create any private variables and functions - things only our script can see but which are not visible from the outside. The only interaction this script has with its surroundings is via its return value. In this case, the return value is simply the fibonacci() function itself. Also note that this script does not actually run any code, apart from defining the function. That's because this module does not require any initialization, and does not rely on any private data. It simply provides a pure function, and that's that.

To modify our fibonacci-ui.js example accordingly, we could simply change this line:
//acquire the fibonacci function
load("scripts/fibonacci.js");
to this:
//acquire the fibonacci function
var fibonacci = load("scripts/fibonacci.js");
This is a small but crucial difference: in the setup we had earlier, scripts/fibonacci.js would create a new fibonacci() function as a global. Since scripts/fibonacci.js is now a proper module, it returns the function rather than adding a new global itself. So in order to use the modularized fibonacci() function, we capture the result of our call to load(), and store it in a local fibonacci variable.

However, this change only has to do with the modular design of our modified scripts/finbonacci.js script. It still uses the old, built-in load() function, and is this not portable.

To actually benefit from define(), we should slightly rewrite the fibonacci-ui.js script in this way:
(function(){

  define(
    "scripts/fibonacci.js",
    function(fibonacci){
      //get the number of arguments passed at the command-line
      var n = $ARG.length;
      if (n === 0) {
        //if no arguments were passed, we can't calculate anything, so exit.
        print("Enter a space-separated list of integers to calculate their corresponding fibonacci numbers.");
        exit();
      }

      var i, arg;

      //process the arguments passed at the command line:
      for (i = 0; i < n; i++) {
        arg = $ARG[i];

        //validate the argument. Is it a positive integer?
        if (!/^[1-9]\d*$/g.test(arg)) {

          //Not valid, skip this one and report.
          print(<<EOD
Skipping "${arg}": not a positive integer.
          EOD);
          continue;
        }

        //Calculate and print the fibonacci number.
        print(<<EOD
fibonacci(${arg}): ${fibonacci(parseInt(arg, 10))}
        EOD);
      }
    }
  );

})();
Just like the modified scripts/fibonacci.js script, we wrapped the original code in an anonyumous function that is immediately called, thus keeping any variable and function definitions completely private and isolated from the global space. Inside that anonymous function, we have a single call to define(), passing the relative path to our reusable and modularized scripts/fibonacci.js script.

The last argument to define() is the module constructor. In this case, it is a callback function that will get called after the scripts/fibonacci.js dependency is loaded, and which is responsible for creating the actual program. The callback has a single argument, that directly corresponds to the dependency - when the callback is called, the fibonacci() function that was returned by the scripts/fibonacci.js script will be passed via this argument, and will this become available to the module constructor code.

Running scripts with jjsml

Suppose we acquired the jjsml.js script and stored it in the same directory as fibonacci-ui.js, then we can run the program using the following command line:
$ jjs -scripting -Djjsml.main.module=fibonacci-ui.js jjsml.js -- 1 2 3 4 5 6
You'll notice the same command line elements as we used before, plus one extra: -Djjsml.main.module=fibonacci-ui.js.

As you can see, in the command line, the actual script that jjs gets to run is jjsml.js, and not fibonacci-ui.js. Instead, fibonacci-ui.js is passed via the jjsml.main.module property of the java.lang.System object. You may recognize the -D prefix from other java programs: this is what you use to set a so-called system property, and this is what the jjsml.js script looks out for after attaching the define() function to the global object. If specified, the jjsml.js will attempt to load that as the initial script, i.e. the entry point of the shell scripted program.

Now, at this point, you may wonder - was all this really worth it? I would say it was, for the following reasons:
  • Modularization improved the quality of our scripts. Since they run in complete isolation there is no chance of undesired side effects
  • The fibonacci() function is now truly reusable, and any other script that may need it can order it as a dependency via a call to define(), and no matter how often it will be pulled in, it will be loaded exactly once
  • Last but not least, any relative paths used to identify dependencies will be resolved against the location of the script that pulls in the dependency, thus making all solutions completely portable
Arguably, our example was so trivially simple that we may not notice these benefits, but as you start to rely more on reusable scripts, and those scripts themselves start to depend on other scripts, you most surely will be happy that these things are managed for you jjsml.js.

Finally

I created jjsml.js not beacause of some theorethical principle, but because I really do need to write scripts that have dependencies, and I cannot afford to assume an absolute, fixed location for my scripts. You may have noticed that jjsml.js itself is part of a larger project called jjsutils. Inside this project are already a few reusable components (for example, for JDBC database access) as well as some top-level utilities. I plan to write more about jjsutils in the future.

In the mean while - let me know what you think! I really appreciate your comments, and you're free to check out and use both jjsml.js as well as the entire jjsutils project. There's API documentation, a few examples, and since the entire project is on github you can file issues or send me pull requests. Be my guest!

DuckDB Bag of Tricks: Reading JSON, Data Type Detection, and Query Performance

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is about a particular challenge...