Sunday, June 21, 2020

Java: trying out String deduplication and the G1 garbage collector

As of 8u20, java supports automatic String deduplication.

-XX:+UseG1GC -XX:+UseStringDeduplication

You need to use the G1 garbage collector, and it will do the dedup as you scan the heap. Essentially, it checks each String and if the backing char[] array is the same as one it's already got, it merges the references.

Obviously, this could save memory if you have a lot of repeated strings.

Consider my illuminate utility. One of the thing it does is parse the old SVR4 packaging contents file. That's a big file, and there's a huge amount of duplication - while the file names are obviously unique, things like the type of file, permissions, owner, group, and names of packages are repeated many times. So, does turning this thing on make a difference?

Here's the head of the class histogram (produced by jcmd pid GC.class_histogram).

First without:

 num     #instances         #bytes  class name
   1:       2950682      133505088  [C
   2:       2950130       70803120  java.lang.String
   3:        862390       27596480  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

and now with deduplication:

 num     #instances         #bytes  class name
   1:       2950165       70803960  java.lang.String
   2:        557004       60568944  [C
   3:        862431       27597792  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

Note that there's the same number of entries in the contents file (there's one ContentsFileDetail for each line), and essentially the same number of String objects. But the [C, which is the char[] backing those Strings, has fallen dramatically. You're saving about a third of the memory used to store all that String data.

This also clearly demonstrates that the deduplication isn't on the String objects, those are unchanged, but on the char[] arrays backing those Strings.

Even more interesting is the performance. This is timing of a parser before:

real        1.730556446
user        7.977604040
sys         0.251854581

and afterwards:

real        1.469453551
user        6.054787878
sys         0.407259095

That's actually a bit of a surprise: G1GC is going to have to do work to do the comparisons to see if the strings are the same, and do some housekeeping if they are. However, with just the G1GC on its own, without deduplication, we get a big performance win:

real        1.217800287
user        3.944160155
sys         0.362586413

Therefore, for this case, G1GC is a huge performance benefit, and the deduplication takes some of that performance gain and trades it for memory efficiency.

For the illuminate GUI, without G1GC:

user       10.363291056
sys         0.393676741

and with G1GC:

user        8.151806315
sys         0.401426176

(elapsed time isn't meaningful here as you're waiting for interaction to shut it down)

The other thing you'll sometime see in this context is interning Strings. I tried that, it didn't help at all.

Next, with a little more understanding of what was going on, I tried some modifications to the code to reduce the cost of storing all those Strings.

I did tweak my contents file reader slightly, to break lines up using a simple String.split() rather than using a StringTokenizer. (The java docs recommend you don't use StringTokenizer any more, so this is also a bit of modernization.) I don't think the change of itself makes any difference, but it's slightly less work to simply ignore fields in an array from String.split() than call nextToken() to skip over the ones you don't want.

Saving the size and mtime as long - primitive types - saves a fair amount of memory too. Each String object is 24 bytes plus the content, so the saving is significant. And given that any uses will be of the numerical value, we may as well convert up front.

The ftype is only a single character. So storing that as a char avoids an object, saving space, and they're automatically interned for us.

That manual work gave me about another 10% speedup. What about memory usage?

Using primitive types rather than String gives us the following class histogram:

 num     #instances         #bytes  class name
   1:       1917289      102919512  [C
   2:       1916938       46006512  java.lang.String
   3:        862981       27615392  java.util.HashMap$Node
   4:        388532       24866048  org.tribblix.illuminate.pkgview.ContentsFileDetail
So, changing the code gives almost the same memory saving as turning on String deduplication, without any performance hit.

There are 3 lessons here:

  1. Don't use Strings to store what could be primitive types if you can help it
  2. Under some (not all) circumstances, the G1 garbage collector can be a huge win
  3. When you're doing optimization occasionally the win you get isn't the one you were looking for