Book: "Pentaho Kettle Solutions", Matt Casters, Roland Bouman, & Jos van Dongen, Wiley 2010 Book: "Pentaho Solutions", Roland Bouman & Jos van Dongen, Wiley 2009

Sunday, March 18, 2018

A Tale of a JavaScript Memory Leak

Abstract: Matching JavaScript regular expressions against large input strings with V8 can result in memory leaks. In this post, I explain how to troubleshoot the issue using Google Chrome heap snapshots. Finally, a fix proposed by my son David (age 14) is presented.

Background

At Just-BI we developed a browser-based application for one of our customers. One way the application gets its data is by loading and parsing Microsoft Excel files. The app is succesful and our customers are happy, a fact they express by attempting to load ever larger files.

On mobile Safari (iPad), our app starts crashing when the files reach a certain size. This happens around 5 to 6 MB. Argueably, that's not that large, but things are a bit more complicated: Microsft Excel files, at least those of the .xlsx variety, are actually zip-compressed folders, containing mostly OOXML spreadsheet documents.

It is a well known fact that XML is really verbose, and I suppose we should be grateful that our 5 - 6 MB excel files uncompress to only 40 MB of XML.

We could debug the issue somewhat, and we noticed that by the time Safari crashes, it does so reporting it is out of heap space. We can't be sure if that's the actual cause, but we think we might be able to overcome or at least postpone this issue by somehow cutting down on memory usage.

This brings us to the main topic of our tale.

Parsing xlsx files in the Browser

To parse xlsx files, we use a particular javascript library called js-xlsx. This is actually a pretty nice piece of work, and I do not hesitate to recommend it. We have used it for quite a while without major issues; It's just that the particular strategey that this library uses to parse the xlsx file will temporarily spike memory usage, and we believe this triggers some bug in Safari, which eventually leads to a crash.

So, we're currently investigating a less general, less standards-compliant way to parse xlsx files. In return, this allows us to parse xlsx faster, and with a much reduced peak-memory usage.

I don't want to talk too much in detail about the xlsx parser we're developing. All I can say it is not meant to be a general, fully featured xlsx parser; the only requirement it is designed to fulfill is to avoid or at least postpone the crash we observe in Safari, and to parse portions of xlsx workbooks having a well-known structure specific to our application requirements.

I am happy to report that, with just one day of work, we managed to get an initial version of our parser to work. It has a peak-memory usage that is half that of the previous solution. And it sounds even more spectacular when I write: 100% less memory!

As an added bonus, the new solution is just a little less than 10x as fast as the old solution. For some reason, it does not sound much cooler when I say the new solution is about an order of magnitude faster, so I won't.

JavaScript Regular Expressions

Our parser relies on JavaScript regular expressions, which are exposed through the global built-in RegExp object.

One might be aware that for at least 100 different programming languages there at least 10,000 stackexchange answers to wittily denounce any attempt to parse XML using regular expressions. Some bloggers' entire careers seems built entirely around their particular brand of scorn and disdain about this topic.

We have little to add to discussions like these, other than that in modern JavaScript runtimes, regular expressions are a very productive and powerful tool for quickly building tokenizers. These tokenizers can have amazing performance, and can serve admirably as a foundation to build parsers of many kinds, including but certainly not limited to XML-parsers.

A Memory Leak

Despite the initial success, not all is well with our new xlsx parser. We found that, notwithstanding lower peak-memory consumption, it did suffer from a memory leak. We noticed this by creating a simple sample application, loading only our parser, and then making heap snapshots using Chrome developer tools during various phases of the process. See the screenshot below:



This is what happens:
  1. First heap snapshot was made directly after loading the application, and measures 6.5 MB. Whether you think this is a lot or not, this is our baseline, and there is not much we can do about it now.
  2. Next, the user picks a xlsx workbook, and the application opens it. The snapshot is now 12.6 MB, which is an increase of 6.1 MB as compared to our baseline. The workbook file is a little less than 6 MB and accounts for most of the increase. At this point, our sample application has also extracted and parsed the list of worksheets contained within the workbook, as well as the shared string table. I haven't looked at that in detail, but for now I am satisfied to believe that this accounts for the remaining extra memory.
  3. At this point, we extracted the worksheet of interest from the workbook and uncompressed it into a javascript string. This made our heap snapshot increase by almost 34 MB. That is certainly a lot! However, the filesize of the worksheet document itself is 34,445 kB, so it seems everything is accounted for.
  4. The next heap snapshot was taken after parsing the worksheet and building a parse tree. The snapshot weighs 77.3 MB - an increase another 31 MB. Now, the sheet has 32,294 rows with 24 cells of data each, and most of the cells are unique decimal numbers, so it is a decent chunk of data. But even then, it still feels as if this is way too large.

    That said, things probably look worse than they really are. Our new parser is event based: the parse method accepts a configuration object that contains only a callback, which is called every time a new row is extracted from the sheet. For our sample application, the callback is only a very naive proof of concept. I suspect there are plenty opportunities to make the parse tree builder smarter and the parse tree smaller.
  5. The last heap snapshot was taken after the parse. At this point, the parse tree, the workbook object, and the XML string went out of scope. But we are still looking at a heap snapshot of more than 40 MB! This is bad news: we really should be back at something close to heap snapshot 1. So, there's about 34 MB unaccounted for.
In the screenshot, you can also see what's hogging the memory: in the top right pane, we find our XML document string, which indeed accounts for the retained 34 MB of memory. In the bottom right pane, we can see who's still referencing it: it's some property called parent of sliced string @15298471. And these are referenced twice in some array, which is referenced by something called regexp_last_match_info in the native context.

Memory Leak, Explained

Now, what I think we're looking at is the lastMatch property of the global built-in RegExp-object.

If you're not familiar with JavaScript regular expressions, it might be helpful to consider exactly how our parser uses them. We're using code like this:
function parse(){
  //regular expression to match a <row> start-tag.
  var regexp = /<row\s([^>]+)\s*>/g;

  var match, data, xml = "...this is our huge XML document, loaded into a JavaScript String...";

  match = regexp.exec(xml);
  if (match) {

    //...do something with matched substrings contained as array elements in match...
    data = match[1];
    ...etc...

  }
}
(Note that this is just an example of the concept - not literally the actual code)

The parse() function first assigns a literal regular expression to the regexp variable. Under the covers, this literal regular expression results in calling the built-in global RegExp constructor, instantiating a new RegExp instance. Then, the exec-method of the RegExp instance is called, passing the -huge- XML document string. The exec-method returns a object representing the result: if there was no match, null is returned; if there is a match an object is returned that contains information about the match(es).

If there was a match, the match object will look a lot like an overloaded array object, having the matched parts of the string argument as elements. The element at index 0 of the match object (match[0]) is the substring matching the entire regular expression, the element at index 1 is the substring that matches the first parenthesized capturing group, and so on.

Now, since the match variable is a local variable in our parse function, everything should be garbage collectible after the function ends, right?

Yes. But No.

About RegExp.lastMatch

As it turns out, when an instance of the RegExp object finds a match, then the corresponding match info object containing all the matching substrings is stored in the lastMatch property of the global built-in RegExp-object. So, even if our parse-method is out of scope, the last match made by some regular expression inside it is still dangling around, attached to the global RegExp object in its lastMatch property.

Substrings in V8

Now, if the the lastMatch-object is still around, then the substrings representing the matches are also still around. As it turns out, V8 implements these substrings as "slice" objects. From within the JavaScript environment, they act and behave like String objects, but internally, the V8 javascript engine implements them as objects that have a parent property to keep a reference to the original String object from which they are a substring, along with some indexes to indicate what part of the original string makes up the substring.

Now, if you think about it, this way of implementing substrings is actually pretty clever, since it allows V8 to do many string manipulations very efficiently, minimizing overhead and memory consumption due to copying parts of strings to and through. In my case, it just becomes an atrocious memory hog because of the RegExp object, that has decided to maintain a reference to the last match object (for whatever reason).

Other people have ran into issues due to V8's substring design as well, and a bug was filed here:

https://bugs.chromium.org/p/v8/issues/detail?id=2869

Solutions

My oldest son, David of 14 years old, came up with a pretty creative solution: what if we'd write our own substring implementation, overriding the native one? If this makes you cringe, just think of 30 MB memory leaks and crashing browsers: it puts things in perspective. If this still sounds crazy to you, you should realize that when he came up with this idea, we had been looking together at this issue for two hours already. And even though we felt we knew the substring issue was related, we still had no way to prove that this was actually the case. His idea was feasible and might possibly confirm our suscpicions, so we went ahead and did it, and attached our own implementations of substring and substr to the __proto__ object of our XML string to override only these methods of that string instance.

As was to be expected, our own substring implementations were way slower than the native ones, and the parse took about 25 times longer than before. However, it *did* solve the memory leak. This was a strong indication that we were on the right track.

Then, David suggested another solution: why don't we simply clear out the lastMatch property of the global built-in RegExp object? We tried to do this directly, simply by assignment:
RegExp.lastMatch = null;
Unfortunately, this does not work. Although it does not throw a runtime exception, the RegExp object is protected against this kind of assignment, and the property never gets overwritten. However, it is still possible to achieve what we want, simply by instantiating a new RegExp object, and then forcing a match against a known, short string. We can then wrap that in a utility function, so we can always call it after doing some serious regular expression matching on large strings:
function freeRegExp(){
  /\s*/g.exec("");
}
Here's the heap snapshot after applying this fix:

Summary

  • Globals are bad.
  • Side-effects are bad, in particular if the modifications are global.
  • V8's substring implementation may lead to unexpected memory leaks.
  • Chrome heap snapshots are a powerful tool to troubleshoot them.
  • After applying regular expressions to huge strings, always force a match against a small string to prevent memory leaks.
  • David Rocks! He truly impressed me with his troubleshooting skills and his knack for pragmatic, feasible solutions.

6 comments:

Thad Guidry said...

Could your customer just use OpenRefine ?

Great Job ... DAVID !

Reiner said...

Awesome...
Learned a lot!

Anonymous said...

Thax alot.. Easy to understand but. If these are enough to cover this unit?
AndHRM notes pls...

Roland Bouman said...

Thad, thanks for suggesting OpenRefine. Our customers are C-level end users - Refine looks more like an analyst/developer tool.

Cheers!

Roland Bouman said...

@Reiner, cheers! Thanks and glad this was helpful for you!

Roland Bouman said...

@Anonymous,

"If these are enough to cover this unit? AndHRM notes pls..."

sorry, you lost me. What do you mean exactly?