Saturday, October 18, 2014

Performing administrative tasks on Pentaho 5.x Business Analytics Server using RESTful webservices and PHP/cURL

Yesterday, I noticed a discussion in the Pentaho Business Analytics group on linkedin: Using RESTful services to add users, add roles, add solutions, add datasources. In this discussion, Capital Markets Analyst/Consultant Rob Tholemeier writes:
We built some code in PHP that performs most if the 3.x admin console functions. Now with 5.x there appears to be RESTful services to do the same. Does anyone have code examples they are willing to share that uses RESTful services to add users, add roles, add solutions, add datasources? Change the same, assign roles, deletes the same?
I think it is an interesting question. I haven't seen many people integrating Pentaho in their PHP web applications so I decided to do a quick write up to demonstrate that this is not only possible but actually quite easy.

For this write up, I used the following software: Because everything is more fun with pictures, here's a high level overview of this setup:

Pentaho 5.x RESTful Webservices

Pentaho 5.x featured major refactoring to modernize its webservices to be more RESTful. Here's an overview of all the services.

All these webservices reside under the /api path beneath the path of the Pentaho web application, which is by default in the /pentaho path at the root of the server. Each service has a distinct path beneath the /api path. So assuming the pentaho server is running on the localhost (at the default port of 8080), you can access all calls offered by a particular service beneath http://localhost:8080/pentaho/api/service-specific-path.

The Pentaho 5.x webservices are, to some extent, self-documenting. You can get an overview of the available call for a specific service by doing a HTTP OPTIONS request to the root path of the service. For example, an OPTIONS request to http://localhost:8080/pentaho/api/session might return a document like this to describe the service:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<application xmlns="http://wadl.dev.java.net/2009/02">
  <doc xmlns:jersey="http://jersey.java.net/" jersey:generatedBy="Jersey: 1.16 11/28/2012 03:18 PM"/>
  <grammars>
    <include href="http://localhost:8080/pentaho/api/application.wadl/xsd0.xsd">
      <doc title="Generated" xml:lang="en"/>
    </include>
  </grammars>
  <resources base="http://localhost:8080/pentaho/api/">
    <resource path="session">
      <resource path="/setredirect">
        <method id="setredirect" name="GET">
          <response>
            <representation mediaType="text/plain"/>
          </response>
        </method>
      </resource>
      <resource path="/userWorkspaceDir">
        <method id="doGetCurrentUserDir" name="GET">
          <response>
            <representation mediaType="text/plain"/>
          </response>
        </method>
      </resource>
      <resource path="/workspaceDirForUser/{user}">
        <param xmlns:xs="http://www.w3.org/2001/XMLSchema" name="user" style="template" type="xs:string"/>
        <method id="doGetUserDir" name="GET">
          <response>
            <representation mediaType="text/plain"/>
          </response>
        </method>
      </resource>
    </resource>
  </resources>
</application>


The documentation provides a categorized overview of the service calls, as well as the necessary technical information, such as:
The HTTP method
In general, with very few exceptions, Pentho's web services use either GET (to retrieve information) or PUT (to modify or change data). There are a few calls that use POST (for example, the publish service). As far as I know, DELETE is not used.
Parameters
If a service accepts parameters, they are either passed through the query string or as a document in the request message body. This is colloquially known as the POST data, regardless of whether the POST method is actually used.

As per the design of the HTTP protocol, GET calls only accept parameters via the query string. The PUT calls to the Pentaho webservices sometimes accept parameters via the query string. This happens only if the information passed in the parameter has a relatively simple structure - think of a single key/value pair, or maybe a list of values. More complicated data is typically conveyed via the message body.
Data type information
In general Pentaho's web services support both XML and JSON. You can control the data format of the response by specifying the Accept request header. To specify the format of the request message body, you should use the Content-Type header in the request.
Unfortunately the documentation does not have any human readable descriptive information, so sometimes a little inspection and experimentation is required to figure out exactly how things work.

For this write-up, I decided to focus on the administrative interface around users, roles and privileges.

NOTE: The link above only lists the web services that are built into the Pentaho Business Analytics Platform. The platform can, and ususally does, have multiple plugins that offer extra functionality and some of these plugins ship by default with the server so that one might consider those also as builtins. Each plugin can, and often does, offer its own service calls, but these are not listed in the documentation referred to above. A discussion of these is also out of scope for this write-up but I will try and get back on this topic in a future blog post. If you don't feel like waiting that long, you can try and figure out the webservice calls offered by a particular plugin by doing an OPTIONS request to the root path corresponding to a particular plugin. The root path of a particular plugin is http://localhost:8080/pentaho/plugin/plugin-id/api. You can obtain a list of all installed plugins by doing a GET request the ids service of the PluginManagerResource services.

Suggested Tools

I found the following tools to be very useful in inspecting and experimenting with the Pentaho webservices api.
  • Administration perspective in the Pentaho user console. This web application ships with Pentaho and is normally used by Pentaho administrators to manage users, roles and privileges. Basically, we want to offer the functionality provided by this web application, but then with PHP.
  • The network tab in google chrome's developer tools. I used the network tab to monitor calls from Pentaho's administration perspective to the services provided by the Pentaho server.
  • Postman REST client. This is a handy extension from the chrome webstore. It allows you to do all kinds of REST calls directly from within your chrome browser. This was useful to test any assumptions on the details of how to exactly construct requests to the Pentaho web services.

The UserRoleDaoResource services

All functionality to work with users, roles and privileges is bundled in the UserRoleDaoResource service. The services specific path of this service is /userroledao. So, all calls that belong to the UserRoleDaoResource service can be accessed through the path http://localhost:8080/pentaho/api/userroledao.

The following categorized overview illustrates the different kinds of calls that belong to the UserRoleDaoResource service:
Users
Named accounts. An account is typically associated with a particular person that needs to work with the Pentaho server. The following calls are specific to working with users:
GET users
Retrieve the list of existing Pentaho user accounts.
PUT createUser
Create a new Pentaho user account.
PUT updatePassword
Modify the password of the specified user.
PUT deleteUsers
Remove one or more Pentaho user accounts.
Roles
Roles are basically a package of privileges (system roles), which can be assigned to one or more users. Any given user can be assigned multiple roles.
GET roles
Retrieve the list of existing roles.
PUT createRole
Create a new role. After the role is created, privileges can be assigned to it, and the role can then be assigned to one or more users, effectively granting the associated privileges to those users.
PUT deleteRoles
Remove one or more roles.
Assignments
Users can get assigned multiple roles and many users can be assigned a particular role. The following calls can be used to create or remove these assocations:
PUT assignAllRolesToUser
Assign all available roles to the specified user.
PUT assignAllUsersToRole
Assign the specified role to all available users.
PUT assignRoleToUser
Assign a specific role to a particular user.
PUT assignUserToRole
Assign a specific role to a particular user.
PUT removeAllRolesFromUser
Unassign whatever roles were assigned to a specific user.
PUT removeAllUsersFromRole
Take the specified role away from all users that were assigned the role.
PUT removeRoleFromUser
Unassign a specific role from a particular user.
PUT removeUserFromRole
Unassign a specific role from a particular user.
System Roles (also called Logical Roles)
These are essentially privileges: the ability to perform a particular action. Examples of such actions are Read BI content, Publish BI content, Schedule a job etc. In Pentaho, system roles cannot be assigned directly to users; instead, they have to be assigned to a role. Roles can then be associated to users to effectively grant them the privileges associated with the role.
GET logicalRoleMap
This returns a document containing two separate bits of information: the list of available system roles, as well as the association between regular roles and system roles.
PUT roleAssignments
Specify which system roles are associated with a particular regular role. Note that there is no separate call to add or remove individual associations between a role and a system role: rather, an entire set of system roles is assigned to a role at once, removing whatever set was assigned prior to that role.
Remember, you can always obtain the entire set of calls available for the UserRoleDaoResource service for your server by doing a OPTIONS request at the root of the /pentaho/api/userroledao path.

Webservice calls in PHP with cURL

Typically, calling out to HTTP (or FTP) servers from within PHP is done using the cURL library. A full discussion of cURL in PHP is well out of scope; you can refer to the - excellent - official PHP documentation instead. I will only discuss the basic pattern and only in as far as it applies to calling the Pentaho webservices.

Basic cURL calling sequence

The basic cURL calling sequence may be summarized as follows:
  1. Obtain a cURL handle by calling curl_init(). You should save the handle to a variable so you can use it in subsequent calls to the cURL library.
  2. Configure the cURL request by doing various calls to curl_setopt($handle, $option, $value). Each curl_setopt call basically sets a property ("option") on the cURL handle that is passed as first argument to curl_setopt(). The library defines a large number of property keys to control the various aspects of the HTTP request, such as the HTTP method, the request headers, message body etcetera.
  3. Call curl_exec() to send the request. This function will also return the response if a prior call to curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE) was made.
  4. Optionally, information about the response can be obtained by calling curl_getinfo()
  5. Finally, curl_close($handle) should be called to clean up the cURL handle and free any underlying resources used by the library.

Basic GET request to Pentaho with PHP/cURL

The following snippet shows how to do a GET request to Pentaho using PHP/cURL:
<?php
//obtain a cURL handle
$c = curl_init();

//specify the url and the HTTP method
curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/users');
curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'GET');

//supply credentials to authenticate against pentaho
curl_setopt($curl_handle, CURLOPT_USERPWD, 'admin:password');

//tell cURL to return the response as a string
curl_setopt($c, CURLOPT_RETURNTRANSFER, TRUE);

//obtain the response
$response = curl_exec($c);

//get the HTTP status code
$status = curl_getinfo($c, CURLINFO_HTTP_CODE);

//clean up the cURL handle
curl_close($c);
?>
As you can see the snippet follows the general cURL calling sequence. The options CURLOPT_URL and CURLOPT_CUSTOMREQUEST are used to specify the url and the HTTP method respectively, and CURLOPT_RETURNTRANSFER is set to TRUE to obtain the response as a string result when calling curl_exec.

The CURLOPT_USERPWD option is used to specify the credentials for basic HTTP authentication. The value is a string consisting of the username and password, separated by a colon, and the example uses the default built-in administrator's account called admin with the password password.

Note: The web service requests described in this blog post require authentication with the admin account, or at least an account that is privileged to perform administrative actions. Other webservices my work while being authenticated with less privileged accounts.

No specific request headers were set in this example. Because there is no specific Accept header to specify a format for the response, the default format will be used, which happens to be XML.

After executing this snippet, the variable $response will have a string value equivalent to the following document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<userList>
    <users>suzy</users>
    <users>pat</users>
    <users>tiffany</users>
    <users>admin</users>
</userList>
We could have added an extra call to curl_setopt() to explicitly set the Accept header to specify that we want to receive the response in the JSON format:
<?php
//obtain a cURL handle
$c = curl_init();

...other calls to curl_setopt...

curl_setopt($c, CURLOPT_HTTPHEADER, array(
  'Accept: application/json'
));

$response = curl_exec($c);

curl_close($c); 
?>
Note that all request headers must be passed as an array of strings using a single call to curl_setopt($handle, CURLOPT_HTTP_HEADER, $array_of_header). Each element of the array of headers should be a single string, consisting of the header name followed by a colon followed by the header value.

After executing this snippet, the variable $response will contain a string equivalent to the following JSON document:
{
    "users": [
        "suzy",
        "pat",
        "tiffany",
        "admin"
    ]
}
While the format for the response defaults to XML, it is generally a good idea to always explicitly specify it. In order to explicitly request for XML, change the value for the Accept header to application/xml.

Processing the response

PHP has support for both XML as well as JSON. In this write-up I'll only use XML but it is good to realize that it would have worked just as well if we had used JSON. As PHP offers exactly one library for working with JSON, there many options for processing XML. (I'm inclined to say, way too many.)

Fortunately, for this particular task, the XML documents are always simple and never very large, and I have had good results working with the SimpleXML library. I believe this is included and enabled by default, which makes it a safe choice. Another reason why I like SimpleXML is that it offers exceptionally convenient access to the data in the XML document using property access operators and iterators.

It would be outside the scope of this write-up to discuss SimpleXML in detail but the following snippet may illustrate how easy it is to process an XML document like the <userList> response obtained from the GET to the /userroledao/users API call described above:
<?php
  $userlist = ...response from /userroledao/users...

  //parse xml string
  $doc = simplexml_load_string($userlist);

  //iterate elements
  foreach ($doc as $user) {
    //do something with the <user> element.
    //even though $user is an object we can easily extract its value by treating it as a string
    echo('<div>'.$user.'</div>');
  }  
?>
As you can see, it doesn't ever get much simpler than this: one call to simplexml_load_string to parse the xml document, and we can directly traverse the elements using foreach. Plus accessing the text content of the elements is also very easy: no need for a separate call to extract the text, just treat the element as a string. Note that if you do need more advanced ways to traverse the structure of the document and access the data, the SimpleXML library still goes a long way. You can even use XPath expressions, if you really need that.

Putting it together: a simple Pentaho Admin application in PHP

Using the techniques described above, and an argueably minimal amount of client side javascript, I put together a simple, yet fully functional administrative application for managing Pentaho roles, users and privileges in PHP. It is a single, self-contained script (php, html template, and javascript - no css) of just over 600 lines, including whitespace and comments. Here's what the application looks like:



User management features

The left had side of the page is dedicated to user management. From top to bottom we have:
User form
In the left top we have a form with fields for "Username" and "Password", and a "Create User" button. Hitting the button results in a PUT request to /userroledao/createUser to create a new user using the values in the "Username" and "Password" fields.
Existing Users list
Halfway the page below the user form there's a list showing the existing users. This is populated with the data from the response of a GET request to /userroledao/users.
Delete selected users button
Hitting this button fires a javascript function that collects the selection from the existing user list. This is used to do a PUT request to the /userroledao/deleteUsers service in order to delete those user accounts.
User roles list
When a single user is selected in the existing users list a GET request is made to the /userroledao/roles service to create a list of all available roles. Another GET request is made to the /userroledao/userRoles service and the data from the response is used to set the state of the checkboxes in front of the role names, indicating which roles are assigned to the user. If such a checkbox is checked by user interaction, a PUT request is made to the /userroledao/assignRoleToUser service, which will assign the corresponding role to the currently selected user. If the checkbox gets unchecked through user interaction, a PUT request is made to the /userroledao/removeRoleFromUser service, which will unassign the corresponding role from the currently selected user.

Role management features

The right had side of the page is dedicated to role management. From top to bottom we have:
Role form
In the right top we have a form with a "Rolename" field and a "Create Role" button. Hitting the button results in a PUT request to /userroledao/createRole to create a new role with the specified role name.
Existing Roles list
Halfway the page below the role form there's a list showing the existing roles. This is populated with the data from the response of a GET request to /userroledao/roles.
Delete selected roles button
Hitting this button fires a javascript function that collects the selection from the existing roles list. This is used to do a PUT request to the /userroledao/deleteRoles service in order to delete those roles.
Role members list
When a single role is selected in the existing roles list a GET request is made to the /userroledao/users service to create a list of all available users. Another GET request is made to the /userroledao/userRoles service and the data from the response is used to check the appropriate checkboxes in front of the users names to indicate which users got the current role assigned. If such a checkbox is checked through user interaction, a PUT request is made to the /userroledao/assignUserToRole service to assign the currently selected role to the checked user. If such a checkbox gets unchecked due to user interaction, a PUT request is made to the /userroledao/removeUserFromRole service to unassign the currently selected role from the unchecked user.
Privileges (logical roles) list
If a single role is selected in the existing roles list, a GET request is done to the /userroledao/logicalRoleMap service. The data from the response is used to create a list of all available privileges. From the same response, the list of logical roles assigned to the role selected in the existing role list is used to check the checkboxes in front of the logical role names in order to indicate which logical role names are assigned to the currently selected role. When such a checkbox is checked, or unchecked, a PUT request is done to the /userroledao/roleAssignments service to associate the appropriate set of logical roles with the currently selected role

With a few (arguably non-essential) exceptions, this application covers all services of the UserRoleDaoResource.

Implementation details

For reference I will now discuss the implementation details of this application.

User form

The user form can be used to create new users. Here's its corresponding HTML code:
<form method="POST">
  <table>
    <tr>
      <td>Username:</td>
      <td><input type="text" name="user" /></td>
    </tr>
    <tr>
      <td>Password:</td>
      <td><input type="password" name="password" /></td>
    </tr>
    <tr>
      <td colspan="2">
        <input type="submit" name="action" value="Create User"/>
      </td>
    </tr>
  </table>
</form>
Hitting the "Create User" button submits the form. But since the form element does not specify a specific action url, it will simply refresh the page, setting the form fields as POST data. In the top of the PHP script, this is handled with the following PHP code:
if (isset($_POST['action'])) {
  $action = strtolower($_POST['action']);
}
else {
  $action = NULL;
}
switch ($action) {
  case 'create user':
    $status = create_user($_POST['user'], $_POST['password']);
    break;
  case '...':
    ...
    break;
  
  ... many more case branches ...
}
In fact, all of the actions that the user can initiate result in a POST request that refreshes the page, setting a specific value for the action field to select the appropriate backend action. In case of the user form, this results in a call to the PHP function create_user(), passing the values of the POST data fields user and password, which originate from the HTML user form.

The PHP code of the create_user() function is shown below:
//create a user with specified name and password.
function create_user($user, $password){
  $c = curl_init();

  curl_setopt($c, CURLOPT_CUSTOMREQUEST, 'PUT');
  curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/createUser');

  curl_setopt($curl_handle, CURLOPT_USERPWD, 'admin:password');

  curl_setopt($c, CURLOPT_POSTFIELDS, 
    '<user>'.
      '<userName>'.$user.'</userName>'.
      '<password>'.$password.'</password>'.
    '</user>'
  );
  curl_setopt($c, CURLOPT_HTTPHEADER, array(
    'Content-Type: application/xml'
  ));

  curl_exec($c);
  $status = curl_getinfo($c, CURLINFO_HTTP_CODE);

  curl_close($c);
  return $status;
}
The create_user function follows the basic pattern of the cURL calling sequence. The difference with the preceding generic example is that this is a PUT request, and thus the value for the CURLOPT_CUSTOMREQUEST option is 'PUT' rather than 'GET'.

The /userroledao/createUser is specified to take a user-element in the request message body, which is used to convey the user name and password. This element is constructed as a XML document and passed to the message body using a curl_setopt call using the CURLOPT_POSTFIELDS option.

Because we are passing a message body in the request, we also need to set the Content-Type header to application/xml to specify that the data passed in the message body is an XML document.

Finally, after the call to curl_exec, we use a call to curl_getinfo() using the CURLINFO_HTTP_CODE constant to obtain the HTTP-status of the request. This should be 200 if the PUT request succeeds. If there is some problem with the request we should receive a code in the 400 range (if the request itself has some problem) or the the 500 range (in case the server is experiencing some problem that is not related to this particular request). For example, if the user already exists, one gets a 403 (Forbidden) status instead of 200.

Note: The Pentaho REST services do not seem to have a robust way to convey the exact nature of the problem in case the request could not be met. At least, I have not noticed any useful information being conveyed in the response except for the HTTP status code. I checked the Administration perspective in the Pentaho user console to see what would happen in case an existing users is entered, and there the action just silently fails. It would be nice to get a recommendation about how to deal with error situations when using the Pentaho web service calls.

The Existing Users list

The existing users list is one of the simplest items in the interface. The HTML / php code is shown below:
<select multiple="true" id="Users" onchange="userSelectionChanged(this)">
>?php
  $users = get_users();
  foreach ($users as $user) {
?>
   <option><?php echo($user); ?></option>
<?php
  }
?>
</select>
The get_users() function is a simple GET request to /userroledao/users, followed by a parse of the XML response. Since both these aspects have been discussed already I will not repeat that code here.

The <select> element has an onchange event handler which calls to the javascript function userSelectionChanged(). Whenever the selection of the user list changes, that function will be called. This function will then determine if a single user is selected, and if that is the case, it will refresh the user role list by explicitly reloading the userRolesFrame:
function userSelectionChanged(list){

  //get the list selection as an array of user names 
  var selection = [];
  for (var i = 0, options = list.options, n = options.length; i < n; i++) {
    if (options[i].checked) {
      selection.push(options[i].value);
    }
  }

  //get the user roles frame
  var frame = document.getElementById("userRolesFrame");

  if (selection.length === 1) {
    //if there's exaclty one selected user, then load its assigned roles in the frame
    frame.src = "?view=userRoles&user=" + selection[0];
  }
  else {
    //blank the frame
    frame.src = "about://blank";
  }
}

The user roles list

We just disussed how selecting a single item in the existing user list refreshes the user role list by loading the userRolesFrame. The frame will be loading the current php script, passing values for the view and user parameters via the query string. The PHP script handles this by checking for the value of the view parameter in the query string. If now view parameter is present, the default interface will load as shown in the screenshot. But if specifying a value of userRoles for view will render only a list of roles, checking the roles that are assigned to the user specified by the user parameter, which is also passed via the query string:
<?php
  //see if a specific view was requested
  if (isset($_GET['view'])) {
    $view = $_GET['view'];
  }
  else {
    $view = NULL;
  }

  //select and render the requested view 
  switch ($view) {

    //render the user roles view
    case 'userRoles':

      //get the current user
      $user = $_GET['user'];

      //get the current user's assigned roles
      $roles = get_user_roles($user);

      //store the user's roles as rolenames in an array
      $assigned_roles = array();
      foreach ($roles as $role) {
        array_push($assigned_roles, ''.$role);
      }

      //get all roles
      $roles = get_roles();
      
      //render all roles as a list of divs with a checkbox
      foreach ($roles as $role) {

        //if the current role appears in the array of assigned roles, check the checkbox.
        $checked = in_array(''.$role, $assigned_roles);
?>
        <div>
          <input
            onchange="changeUserRoleAssignment(this)"
            name="<?php echo($role) ?>"
            type="checkbox"
            <?php echo ($checked ? 'checked="true"' : '')?>
          />
          <?php echo($role) ?>
        </div>
<?php
      }
      break;
    case '...':

      ...code to handle other views here...

    default:

      ...code for the regular interface (no specific view) goes here...
  }
?>
First, get_user_roles($user) is called to GET a response from the /userroledao/userRoles service, which is a list of roles for the specified user. From the php side of things, nothing new is really happening here. The only difference with regard to getting the list of users is the url, which is now /userroledao/userRoles rather than /userroledao/users and which includes a querystring parameter to specify the user:
  curl_setopt($c, CURLOPT_URL, 'http://localhost:8080/pentaho/api/userroledao/userRoles?userName='.$user);
The get_user_roles($user) function calls returns an XML document containing <role>-elements representing the roles assigned to the specified users. We use the foreach loop to iterate them and we store their string values (i.e., the actual role names) in the array $assigned_roles.

The remainder of the code is very similar to how the existing user list was rendered, except that we now use a call to get_roles() rather than get_users(). This does a GET request to /userroledao/roles and returns an XML document containing all available roles. We then iterate through that list to create an input-element of type checkbox along with the actual role name. The checkbox is checked according to whether the current role name is found in the previously populated $assigned_roles array.

Each checkbox is given an onchange handler which is implemented by the changeUserRoleAssignment() javascript function. This function sets a few variables in a form to indicate whether the corresponding role is to be assigned or unassigned, and the submits the form. The code for the form and the function are shown below:
<form 
  name="userRoleAssignment" method="POST" 
  action="?view=<?php echo($view)?>&user=<?php echo(urlencode($user))?>"
>
  <input type="hidden" name="action"/>
  <input type="hidden" name="role"/>
  <input type="hidden" name="user" value="<?php echo($user)?>"/>
</form>

<script type="text/javascript">
  function changeUserRoleAssignment(checkbox) {
    var form = document.forms["userRoleAssignment"];
    form.elements["action"].value = checkbox.checked ? "assign role to user" : "unassign role from user";
    form.elements["role"].value = checkbox.name;
    form.submit();
  }
</script>
The changeUserRoleAssignment() function writes its associated role name (stored in its name property) in the role field of the form, and it uses its checked state to set the value of the action field to assign role to user or unassign role from user. It then submits the form.

Since this code appears all in the user role view, it has the effect of refreshing only the frame wherein the view is contained. Because the form sets the action value, it triggers a PHP backend action before rendering the view (just like we saw in the implementation of the create user action):
switch ($action) {
  case 'create user':
    $status = create_user($_POST['user'], $_POST['password']);
    break;
  case 'assign role to user':
    assign_role_to_user($_POST['role'], $_POST['user']);
    break;
  case 'unassign role from user':
    unassign_role_from_user($_POST['role'], $_POST['user']);
    break;
  
  ... many more case branches ...
}
The PHP functions assign_role_to_user() and unassign_role_from_user() both perform a straightforward PUT request to the /userroledao/assignRoleToUser and /userroledao/removeRoleFromUser services respectively. For each these requests, the values of the user and role fields are passed to the service via the query string parameters userName and roleNames respectively.

Note that these two services support multiple role names; however only one is passed at any time by our application. Should you wish to pass multiple role names, then you should separate rolenames by a tab-character (ascii character 0x09). Note that since the names are passed in the query string, they must be url-encoded.

Finally

Although I haven't covered all implementation details, the rest is simply more of the same stuff. If you want to play with the code yourself, you can download the entire PHP script here.

Friday, September 26, 2014

MySQL 5.7.5: GROUP BY respects functional dependencies!

Today, Oracle announced the availability of the Development Milestone Release 15 of MySQL 5.7.5. The tagline for this release promises "Enhanced Database Performance and Manageability". That may sound rather generic, the actual list of changes and improvements is simply *huge*, and includes many items that I personally find rather exciting! Perhaps I'm mistaken but I think this may be one of the largest number of changes packed into a MySQL point release that I've witnessed in a long time. The list of changes includes improvements such as:
  • InnoDB improvements: Simplified tablespace recovery, support for spatial indexes, dynamic configuration of the innodb_buffer_pool_size parameter(!), more efficient creation and rebuilding of indexes ("sorted index build")
  • Several general improvements for spatial data, such as support for open polygons and functions to manipulate geohash values and GeoJSON documents
  • performance_schema additions and improvements, such as a new user_variables_by_thread table, addition of WORK_COMPLETED and WORK_ESTIMATED columns to the stage event tables, improvements to the wait event tables, instrumentation for InnoDB memory allocation in the memory summary tables.
  • Quite a bunch of optimizer improvements, such as better cost estimation for semi-join materialization, configurable cost model (by way of the mysql.server_cost and mysql.engine_cost system tables) and more exact index statistics.
  • Many improvements and additions that make replication more robust
  • A more sane default SQL mode and GROUP BY behaviour
This is very far from an exhaustive list; It really is not an exaggeration to say that there is much much more than I can cover here. Just see for yourself.

Now, one of the changes I'd like to highlight in this post is the improved GROUP BY support.

GROUP BY behavior before MySQL 5.7.5

More than 7 years ago, I wrote an article on this blog called Debunking GROUP BY Myths. The article is basically an explanation of the syntax and semantics of the SQL GROUP BY clause, with a focus on MySQL particular non-standard implementation.

Before MySQL 5.7.5, MySQL would by default always aggregate over the list of expressions that appear in the GROUP BY-clause, even if the SELECT-list contains non-aggregate expressions that do not also appear in the GROUP BY-clause. In the final resultset, MySQL would produce one of the available values for such non-aggregate expressions and the result would basically not be deterministic from the user's point of view.

This behavior is not standard: SQL92 states that any non-aggregate expressions appearing in the SELECT-list must appear in the GROUP BY-clause; SQL99 and on state that any non-aggregate expressions that appear in the SELECT-list must be functionally dependent upon the list of expressions appearing in the GROUP BY. In this context, "functionally dependent" simply means that for each unique combination of values returned by the expressions that make up the GROUP BY-clause, the non-aggregate expression necessarily yields exactly one value. (This concept is further explained and illustrated in my original article.)

Most RDBMS-es implement the SQL92 behavior, and generate an error in case a non-aggregate expression appears in the SELECT-list but not the GROUP BY-clause. Because MySQL would not generate an error at all and instead would simply allow such queries while silently producing a non-deterministic result for such expressions, many users got bitten and frustrated.

My original article offered 3 suggestions to cope with this non-standard behavior:
  • One could explicitly add a ONLY_FULL_GROUP_BY option to the sql_mode (since it was not included by default). This should essentially make pre-MySQL 5.7.5 behave according to SQL92. Unfortunately, this feature would often erroneously spot properly aggregated SELECT-list expressions and reject perfectly valid queries. This is why I disrecommended this approach. (See bug #8510 for details.)
  • I argued instead to be more conscious when building queries, and manually ensure that non-aggregated expressions in the SELECT-list are functionally dependent upon the list of expressions appearing in the GROUP BY clause. The aim of my original article was to teach a way of thinking about aggregate queries so that developers would be conditioned to do "the right thing" and avoid writing non-deterministic queries.
  • The final suggestion was to artificially convert non-aggregate expressions in the SELECT-list to aggregate expressions by wrapping them inside an appropriate "dummy" aggregate function like MIN() or MAX().
In addition, it may seem that one can also simply add the non-aggregate expressions int the SELECT-list to the GROUP BY-clause. However, as I have shown in my original article, this is typically not the best solution: if that expression is truly functionally dependent upon the expressions in the GROUP BY-clause it can (and often does) have a non-trivial impact on the performance. (And if it wasn't functionally dependent, you probably weren't exactly aware what you were doing when you wrote your query ;-)

At the time my recommendation was to go for option two and manually ensure that any non-aggregate expressions in the SELECT-list are dependent upon the expressions in the GROUP BY-clause. The bug in the former behahaviour of ONLY_FULL_GROUP_BY was simply too restrictive to work with, and adding dummy aggregate expressions makes it harder to maintain the query. Besides, successfully writing those dummy aggregates or add those non-aggregate expressions to the GROUP BY list still requires the same structural understanding of the query that was required to write it correctly, so why bother if you could just as well have written the query right in the first place?

Basically the message was that giving your query a little more thought is simply the best solution on all accounts.

GROUP BY in MySQL 5.7.5

In the 5.7.5m15 milestone release, ONLY_FULL_GROUP_BY is included in the sql_mode by default. Contrary to what its name might suggest, this does *not* mean that GROUP BY-clauses must list all non-aggregated columns appearing in the SELECT-list. Oracle and the MySQL development team, in particular Guilhelm Bichot, really went the extra mile and implemented behavior that is identical, or at least very close to what is described in SQL99 and beyond. That is, MySQL 5.7.5m15 will by default reject only those GROUP BY-queries that include non-aggregated expressions in the SELECT-list that are not functionally dependent upon the GROUP BY-list.

This not only means that you cannot mess up your GROUP BY-queries anymore (as MySQL will now reject an improper GROUP BY query), it will also not require you to write non-sensical "dummy" aggregates over expressions that can only have one value per aggregated result row. Hurrah!

Note that this does not mean that writing the query becomes any easier. It is just as hard (or easy) as before, this new feature simply means it becomes impossible to accidentally write a query that delivers non-deterministic results. I don't think anybody in their right mind can be against that.

Examples

Let's put it to the test with the examples from my original article. First, let's check what the default sql_mode looks like in MySQL-5.7.5-m15:
mysql> select version(), @@sql_mode;
+-----------+---------------------------------------------------------------+
| version() | @@sql_mode                                                    |
+-----------+---------------------------------------------------------------+
| 5.7.5-m15 | ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION |
+-----------+---------------------------------------------------------------+
1 row in set (0.00 sec)

As you can see, ONLY_FULL_GROUP_BY is included by default. Now let's try this query:
mysql> use menagerie;
mysql> SELECT   species
    -> ,        MIN(birth)  -- birthdate of oldest pet per species
    -> ,        MAX(birth)  -- birthdate of youngest pet per species
    -> ,        birth       -- birthdate of ... uh oh...!
    -> FROM     menagerie.pet
    -> GROUP BY species;
ERROR 1055 (42000): Expression #4 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'menagerie.pet.birth' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
We should have expected the query to be rejected, because the birth column appears in the SELECT-list outside an aggregate function, but is not in the GROUP BY-clause.

This exact same query would have been rejected prior to MySQL 5.7.5 as well, provided the ONLY_FULL_GROUP_BY would have been explicitly included in the sql_mode. However, the error message was quite a bit shorter, and furthermore, it conveyed a different message:
ERROR 1055 (42000): 'menagerie.pet.birth' isn't in GROUP BY
Take a moment to consider the difference: prior to MySQL 5.7.5, the complaint was that the non-aggregated column is not part of the GROUP BY-list, implying that we should include it; in MySQL 5.7.5, the message is that it is not functionally dependent upon the GROUP BY list, implying that we can include non-aggregated columns that are functionally dependent upon the GROUP BY-clause.

Let's consider this query to illustrate the difference:
mysql> use sakila;
mysql> SELECT   film_id       -- primary key
    -> ,        title         -- non-key column
    -> ,        COUNT(*)      -- one row per group
    -> FROM     sakila.film
    -> GROUP BY film_id;      -- group by on primary key
In this query, the GROUP BY-clause contains only film_id, which is the primary key of the film table. (Note that it doesn't really make much sense to perform such an aggregation, since each aggregate row is based on exactly one source row, but the point is to illustrate how MySQL 5.7.5 handles this query differently than prior versions.)

The SELECT-list also contains the film_id column, which should be ok since it appears in the GROUP BY list. But the SELECT-list also contains the title column, which does not appear in the GROUP BY-list. However, since both columns come from the same table, and film_id is the primary key, it follows that for any value of film_id there can be only one value for title. In other words, if the value for film_id is known, then the corresponding value for title is also known. A fancy way of saying it is that title is functionally dependent upon fim_id, which is to say that the value of title is fully determined once the value for film_id is known. This is not surprising since film_id is the key of the table. This is virtually identical to the very definition of what it means to be a key: some column or combination of columns upon which any other column is functionally dependent.

So, it should be perfectly alright to execute this query, and MySQL 5.7.5 does. Note that because of the much narrower semantics of the ONLY_FULL_GROUP_BY option in MySQL prior to 5.7.5, this query would be rejected in earlier MySQL versions if ONLY_FULL_GROUP_BY is part of the sql_mode. Prior to 5.7.5 MySQL could execute this query but not if the sql_mode includes ONLY_FULL_GROUP_BY.

Now, in the previous query, the functional dependency between film_id and title is quite straightforward and easy to detect, since both columns come from the same table. This type of functional dependency is also correctly detected by for example Postgres 9.1, and this query can be executed there too.

However, MySQL 5.7.5 is capable of detecting more complex functional dependencies. Consider this query:
mysql> SELECT    i.film_id
    -> ,         f.title
    -> ,         count(i.inventory_id) inventory_count
    -> FROM      film f
    -> LEFT JOIN inventory i
    -> ON        f.film_id = i.film_id
    -> GROUP BY  f.film_id
    -> HAVING    inventory_count = 0
This is almost a typical master detail query, which joins film to inventory over film_id to find out how many copies of each film exist in the inventory. As an extra criterion, we filter out those films for which there are no copies available by writing HAVING inventory_count = 0. The GROUP BY-clause is again on film.film_id, so this means we should be able to use any column from the film table in the SELECT-list, since they are functionally dependent upon the GROUP BY-list; again we ask for the title column from the film table.

Again we also select a film_id column, but instead of asking for the film_id column from the film table, we ask for the one from the inventory table. The inventory.film_id column does not appear in the GROUP BY-list, and is also not an aggregated. But even though there may be multiple rows from the inventory table for one specific row in the film table, the query is still valid. This is because the join condition f.film_id = i.film_id ensures the value of the film_id column in the inventory table is functionally dependent upon the film_id column from the film table. And becaue the film_id column from the film table does appear in the GROUP BY-list, it must mean the film_id column from the inventory table is fully determined, and hence the query is valid.

This query will fail in Postgres 9.1, but not in MySQL 5.7.5: In Postgres 9.1, we first have to rewrite the HAVING-clause and refer to count(i.inventory_id) rather than its alias inventory_count. But even then, it still considers inventory.film_id not to be functionally dependent upon the GROUP BY-clause, and it will reject this query. (If you have any indication that later versions of Postgres also handle this query correctly, please let me know and I'll gladly amend this article)

(A little digression: I just argued that in the previous query, film_id from inventory is functionally dependent upon film_id from film because of the join condition. However, this does not necessarily mean the value of these columns is identical. In fact, in this particular example the selected film_id column from inventory will be NULL because our HAVING-clause, whereas the value of film_id from the film table is never NULL. But it is still true that for each distinct value from film_id from the film table, the value of film_id from the inventory table is fully determined, and hence, functionally dependent.)

Upgrade advice

If you decide to upgrade to MySQL 5.7.5 (or beyond), and you used to run with a sql_mode that did not include ONLY_FULL_GROUP_BY, then some GROUP BY queries that used to work prior to the upgrade might fail after the upgrade. This sounds like a bad thing but if you think about it, it really isn't: the queries that are going to fail were in fact invalid all along, and gave you non-deterministic results. You just didn't notice.

A simple way to make these queries work again would be to remove ONLY_FULL_GROUP_BY from the sql_mode. However, I would very strongly disrecommend that approach. Rather, each query that fails in MySQL 5.7.5 (or beyond) due to enabling ONLY_FULL_GROUP_BY option should be inspected and rewritten. If your query contains a non-aggregated expression in the SELECT-list that is not dependent upon the GROUP BY-list, your application was probably using bogus (or well, at least non-deterministic) results and you should decide what the proper behavior should be and modify the query (and/or the application) accordingly. If you really want to keep relying on non-deterministic results (why?), you can wrap such expressions into the new ANY_VALUE() function. This will essentially preserve the old behaviour even if ONLY_FULL_GROUP_BY is enabled. I suppose this is still better than running without ONLY_FULL_GROUP_BY, because in this way it will at least be clear the result will be non-deterministic since you're literally asking for it.

(One word about the ANY_VALUE() "function". By the way it looks and is called, you might get the impression that ANY_VALUE() is some kind of new aggregate function that aggregates by actively picking one value out of a list. This is not the case. Proper aggregate functions, like COUNT(), MIN(), MAX() etc. will condense the resultset to one single row in the absence of a GROUP BY list; ANY_VALUE() does no such thing. It is merely a placeholder that tells the optimizer to not generate an error when the ONLY_FULL_GROUP_BY contract is broken. Basically, ANY_VALUE() means: this query is broken but instead of fixing it we chose to ignore the fact that it is broken.)

Finally

MySQL 5.7.5 is looking like it will bring a lot of improvements, and thanks to Guilhelm Bichot, very much improved standards compliance for GROUP BY. I have no hesitation to recommend you start using the ONLY_FULL_GROUP_BY option. Of course many of the considerations to consciously writing your query still applies, and now MySQL will make it even more easy to do so.

If you're interested in these or other improvements, consider downloading and installing MySQL 5.7.5 m15 to test your current systems. MySQL 5.7.5 brings quite a number of incompatible changes and while I believe they are all improvements, one best get prepared. Happy hacking.

Further Reading

Friday, June 13, 2014

MySQL: Extracting timstamp and MAC address from UUIDs

To whom it may concern.

Surrogate keys: auto-increment or UUID?

I recently overheard a statement about whether to use auto-incrementing id's (i.e, a sequence managed by the RDBMS) or universal unique identifiers (UUIDs) as method for generating surrogate key values.

Leakiness

Much has been written about this subject with regard to storage space, query performance and so on, but in this particular case the main consideration was leakiness. Leakiness in this case means that key values convey information about the state of the system that we didn't intend to disclose.

Auto-incrementing id's are leaky

For example, suppose you would subscribe to a new social media site, and you get assigned a personal profile page which looks like this:
http://social.media.site/user/67638
Suppose that 67638 is the auto-incrementing key value that was uniquely assigned to the profile. If that were the case then we could wait a day and create a new profile. We could then compare the key values and use it to estimate how many new profiles were created during that day. This might not necessarily be very sensitive information, but the point here is that by exposing the key values, the system exposes information that it didn't intend to disclose (or at least not in that way).

Are UUIDs leaky?

So clearly, auto-incrementing keys are leaky. The question is, are UUIDs less leaky? Because if that's the case, then that might weigh in on your consideration to choose for a UUID surrogate key. As it turns out, this question can be answered with the universal but always unsatisfactory answer that "it depends". Not all UUIDs are created equal, and wikipedia lists 5 different variants. This is not an exhaustive list, since vendors can (and so, probably will) invent their own variants.

MySQL UUIDs

In this article I want to focus on MySQL's implementation. MySQL has two different functions that generate UUIDs: UUID() and UUID_SHORT().

Are MySQL UUIDs leaky?

If you follow the links and read the documentation, then we can easily give a definitive answer, which is: yes, MySQL UUIDs are leaky: It is not my role to judge whether this leakiness is better or worse than the leakiness of auto-incrementing keys, I'm just providing the information so you can decide whether it affects you or not.

Hacking MySQL UUID values

Now, on to the fun bit. Let's hack MySQL UUIDs and extract meaningful information. Because we can.

Credit where credit's due: Although the documentation and MySQL source code contain all the information, I had a lot of benefit from the inconspicuously-looking but otherwise excellent website from the Kruithof family. It provides a neat recipe for extracting timestamp and MAC address from type 1 UUIDs. Thanks!

Here's a graphical representation of the recipe:

Without further ado, here come the hacks:

Extracting the timestamp from a MySQL UUID

Here's how:
select  uid                           AS uid
,       from_unixtime(
          (conv(                      
            concat(                   -- step 1: reconstruct hexadecimal timestamp
              substring(uid, 16, 3)
            , substring(uid, 10, 4)
            , substring(uid, 1, 8)
            ), 16, 10)                -- step 2: convert hexadecimal to decimal
            div 10 div 1000 div 1000  -- step 3: go from nanoseconds to seconds
          ) - (141427 * 24 * 60 * 60) -- step 4: substract timestamp offset (October 15,  
        )                             AS uuid_to_timestamp
,       current_timestamp()           AS timestamp
from    (select uuid() uid)           AS alias;
Here's an example result:
+--------------------------------------+---------------------+---------------------+
| uid                                  | uuid_to_timestamp   | timestamp           |
+--------------------------------------+---------------------+---------------------+
| a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28 | 2014-06-13 13:20:00 | 2014-06-13 13:20:00 |
+--------------------------------------+---------------------+---------------------+
1 row in set (0.01 sec)
The query works by first obtaining the value from UUID(). I use a subquery in the from clause for that, which aliases the UUID() function call to uid. This allows other expressions to manipulate the same uid value. You cannot simply call the UUID() multiple times, since it generates a new unique value each time you call it. The raw value of uid is shown as well, which is:a89e6d7b-f2ec-11e3-bcfb-5c514fe65f28. Most people will recognize this as 5 hexadecimal fields, separated by dashes. The first step is to extract and re-order parts of the uid to reconstruct a valid timestamp:
  • Characters 16-18 form the most significant part of the timestamp. In our example that's 1e3; the last 3 characters of the third field in the uid.
  • Characters 10-13 form the middle part timestamp. In our example that's f2ec; this corresponds to the second field
  • Characters 1-8 form the least significant part of the timestamp. In our example that's a89e6d7b; this is the first field of the uid.

Extracting the parts is easy enough with SUBSTRING(), and we can use CONCAT() to glue the parts into the right order; that is, putting the most to least significant parts of the timestamp in a left-to-right order. The hexadecimal timestamp is now 1e3f2eca89e6d7b.

The second step is to convert the hexadecimal timestamp to a decimal value. We can do that using CONV(hextimestamp, 16, 10), where 16 represents the number base of the hexadecimal input timestamp, and 10 represents the number base of output value.

We now have a timestamp, but it is in a 100-nanosecond resolution. So the third step is to divide so that we get back to seconds resolution. We can safely use a DIV integer division. First we divide by 10 to go from 100-nanosecond resolution to microseconds; then by 1000 to go to milliseconds, and then again by 1000 to go from milliseconds to seconds.

We now have a timestamp expressed as the number of seconds since the date of Gregorian reform to the Christian calendar, which is set at October 15, 1582. We can easily convert this to unix time by subtracting the number of seconds between that date and January 1, 1970 (i.e. the start date for unix time). I suppose there are nicer ways to express that, but 141427 * 24 * 60 * 60 is the value we need to do the conversion.

We now have a unix timestamp, and MySQL offers the FROM_UNIXTIME() function to go from unix time to a MySQL timestamp value.

Extracting the MAC address from a MySQL UUID

The last field of type 1 UUID's is the so-called node id. On BSD and Linux platforms, MySQL uses the MAC address to create the node id. The following query extracts the MAC address in the familiar colon-separated representation:
select  uid                           AS uid
,       concat(
                substring(uid, 25,2)
        , ':',  substring(uid, 27,2)
        , ':',  substring(uid, 29,2)
        , ':',  substring(uid, 31,2)
        , ':',  substring(uid, 33,2)
        , ':',  substring(uid, 35,2)
        )                             AS uuid_to_mac
from    (select uuid() uid)           AS alias;
Here's the result:
+--------------------------------------+-------------------+
| uid                                  | uuid_to_mac       |
+--------------------------------------+-------------------+
| 78e5e7c0-f2f5-11e3-bcfb-5c514fe65f28 | 5c:51:4f:e6:5f:28 |
+--------------------------------------+-------------------+
1 row in set (0.01 sec)
I checked on Ubuntu with ifconfig and found that this actually works.

What about UUID_SHORT()?

The UUID_SHORT() function is implemented thus:
(server_id & 255) << 56
+ (server_startup_time_in_seconds << 24)
+ incremented_variable++;
This indicates we could try and apply right bitshifting to extract server id and start time.

Since the server_id can be larger (much larger) than 255, we cannot reliably extract it. However, you can give it a try; assuming there are many mysql replication clusters with less than 255 nodes, and assuming admins will often use a simple incrementing number scheme for the server id. you might give it a try.

The start time is also easy to extract with bitshift. Feel free to post queries for that in the comments.

Conclusions

I do not pretend to present any novel insights here, this is just a summary of well-known principles. The most important take-away is that you should strive to not expose system implementation details. Surrogate key values are implementation details so should never have been exposed in the first place. If you cannot meet that requirement (or you need to compromise because of some other requirement) then you, as system or application designer should be aware of the leakiness of your keys. In order to achieve that awareness, you must have insight at the implementation-level of how the keys are generated. Then you should be able to explain, in simple human language, to other engineers, product managers and users, which bits of information are leaking, and what would be the worst possible scenario of abuse of that information. Without that analysis you just cannot decide to expose the keys and hope for the best.

Wednesday, June 11, 2014

When kettle's "Get data From XML" is bombed by the BOM

To whom it may concern...
I just ran into a problem with Pentaho Data Integration, and I figured it may save others some time if I document it here.
The case is very straightforward: read a fairly small XML document directly from a URL, and parse out interesting data using the Get data from XML step.
Typically, this steps works quite well for me. But I just ran into a case where it doesn't work quite as expected. I ran into an error when I tried it on this URL:
http://api.worldbank.org/en/countries?page=1
If you follow the URL you'll find it returns a normal looking document:
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="260" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN">Latin America &amp; Caribbean (all income levels)</wb:region>
    <wb:adminregion id="" />
    <wb:incomeLevel id="NOC">High income: nonOECD</wb:incomeLevel>
    <wb:lendingType id="LNX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  ...
</wb:countries>

The error: Content is not allowed in prolog

The error I got was:
Content is not allowed in prolog.
You can encounter this error in any context where the step tries to retrieve the document from the URL, for example when you hit the "Get XPAth nodes" or "Preview" buttons, as well as when you're actually running the step.

Using the w3c XML validator

The error message indicates that the XML document is in some way not valid. So I ran the URL through the w3c validator:
http://validator.w3.org/check?uri=http%3A%2F%2Fapi.worldbank.org%2Fen%2Fcountries%3Fpage%3D1&charset=%28detect+automatically%29&doctype=Inline&group=0
Interestingly, this indicated that the document is valid XML.

A rather dismal workaround

Then I tried a few things in kettle in an attempt to work around it. I won't bother you with everything I tried. Eventually, I did find a viable work-around: By retrieving the document with the HTTP Client step, and then saving that to file using a simple Text file output step (omitting the header, separators, and quotes), I could then successfully open and parse that file with the "Get data from XML" step (from within a second transformation). This was of course a bit annoying since it involved a second transformation, which complicates things considerably. However all attempts to skip the "Text file output" step all brought me back to where I was and gave me the dreaded Content is not allowed in prolog. error. So something was happening to the document between saving and loading from disk that somehow fixed it.

Mind w3c validator Warnings!

I decided to investigate a bit more. What I didn't notice at first when I validated the XML document is that, despite passing validation, it does yield 2 warnings:
  • No DOCTYPE found! Checking XML syntax only.
  • Byte-Order Mark found in UTF-8 File.
As it turns out, this second warning conveys a very important tidbit of information.

UTF-8, the BOM, and java don't play nice together

I knew what a BOM was, but I didn't quite understand it's implications in particular for java and java-based applications. Here's a quick list of things you need to know to understand the problem:
  • The byte-order mark (BOM) is a special unicode character that indicates several details of the encoding of an inputstream.
  • The BOM is optional, and for UTF-8 it is actually disrecommended. But, apparently, this does not mean it's never there, or even non-standard!
  • The particular combination of a BOM occurring in a UTF-8 stream is not supported by java. There are bug reports about it here and here.
Maybe the Get data from XML step should be more forgiving, and take care of the BOM for us if it does occur. It sure would have saved me time. Anyway, it currently doesn't, and I came up with the following solution that is reasonably straightforward and does solve the problem:

A better workaround

We can first retrieve the document with the "Http Client" step, and then remove the BOM if it is present, and then process the document using the Get data from XML step. The transformation below illustrates that setup: So, the "HTTP client" step retrieves the XML text in the document field, and the User-defined Java Expression step simply finds the first occurrence of the less than character (<), which demarcates either the start of the xml declaration, or the document element. The code for that expression is rather straightforward:
document.substring(document.indexOf("<"))
All in all, not very pretty, but it does the job. I hope this was of some use to you.
UPDATE1: I created PDI-12410 pertaining to this case.
UPDATE2: Apart from the BOM, there seems to be a separate, independent problem when the XML is acquired from a URL and the server uses gzip compression.
UPDATE3: I have a commit here that solves both the BOM and the gzip issues: https://github.com/rpbouman/pentaho-kettle/commit/6cf28b5e4e88022dbf356ccad01c3b949bed4731.

DuckDB bag of tricks: Processing PGN chess games with DuckDB - Rolling up each game's lines into a single game row (6/6)

DuckDB bag of tricks is the banner I use on this blog to post my tips and tricks about DuckDB . This post is the sixth installment of a s...