-XX:+UseG1GC -XX:+UseStringDeduplication
You need to use the G1 garbage collector, which does the deduplication as it scans the heap. Essentially, it checks each String, and if the backing char[] array has the same contents as one it has already seen, it points both Strings at a single shared array.
Obviously, this could save memory if you have a lot of repeated strings.
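As a rough illustration of the situation deduplication targets, the sketch below builds many Strings with identical content but separate storage. The class and method names are made up for this example, and nothing in it can observe whether the JVM actually shares the backing arrays; it only shows that the String objects themselves stay distinct.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: builds many Strings with equal content but separate
// backing storage, which is the case string deduplication is aimed at.
final class DuplicateStrings {

    // Builds 'count' distinct String objects that all hold the same text.
    // new String(char[]) copies the array, so each String gets its own storage.
    static List<String> makeDuplicates(String text, int count) {
        char[] chars = text.toCharArray();
        List<String> result = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            result.add(new String(chars));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> owners = makeDuplicates("root", 1_000_000);
        String first = owners.get(0);
        String last = owners.get(owners.size() - 1);
        // Equal content, but two separate objects; deduplication
        // leaves the String objects themselves alone.
        System.out.println(first.equals(last)); // true
        System.out.println(first == last);      // false
    }
}
```

Running something like this with and without the two flags, and comparing class histograms, would be one way to see the effect on a synthetic workload.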
Consider my illuminate utility. One of the things it does is parse the old SVR4 packaging contents file. That's a big file, and there's a huge amount of duplication - while the file names are obviously unique, things like the type of file, permissions, owner, group, and names of packages are repeated many times. So, does turning this thing on make a difference?
Here's the head of the class histogram (produced by jcmd pid GC.class_histogram).
First without:
num #instances #bytes class name
----------------------------------------------
1: 2950682 133505088 [C
2: 2950130 70803120 java.lang.String
3: 862390 27596480 java.util.HashMap$Node
4: 388539 21758184 org.tribblix.illuminate.pkgview.ContentsFileDetail
and now with deduplication:
num #instances #bytes class name
----------------------------------------------
1: 2950165 70803960 java.lang.String
2: 557004 60568944 [C
3: 862431 27597792 java.util.HashMap$Node
4: 388539 21758184 org.tribblix.illuminate.pkgview.ContentsFileDetail
Note that there's the same number of entries in the contents file (there's one ContentsFileDetail for each line), and essentially the same number of String objects. But the [C entries, which are the char[] arrays backing those Strings, have fallen dramatically: from about 2.95 million instances to 557 thousand, and from about 133MB to about 61MB. You're saving about a third of the memory used to store all that String data.
This also clearly demonstrates that the deduplication isn't on the String objects, those are unchanged, but on the char[] arrays backing those Strings.
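The "about a third" figure can be checked from the two histograms above by adding the [C and java.lang.String byte counts in each case. The small class below is just that arithmetic; its name and method are made up for this example.

```java
// Checks the "about a third" estimate using the byte counts from the
// two class histograms above ([C plus java.lang.String).
final class DedupSaving {

    // Fraction of memory saved going from 'before' bytes to 'after' bytes.
    static double savedFraction(long before, long after) {
        return (double) (before - after) / before;
    }

    public static void main(String[] args) {
        long before = 133_505_088L + 70_803_120L; // [C + String, no deduplication
        long after  = 60_568_944L + 70_803_960L;  // [C + String, with deduplication
        System.out.printf("saved %d bytes (%.1f%%)%n",
                before - after, 100.0 * savedFraction(before, after));
    }
}
```

That works out to roughly 73MB saved out of about 204MB, a little over a third. All of the saving comes from the char[] arrays, since the String total is essentially unchanged.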
Even more interesting is the performance. Here's the timing of the parser before:
real 1.730556446
user 7.977604040
sys 0.251854581
and afterwards:
real 1.469453551
user 6.054787878
sys 0.407259095
That's actually a bit of a surprise: G1GC is going to have to do work to do the comparisons to see if the strings are the same, and do some housekeeping if they are. However, with just the G1GC on its own, without deduplication, we get a big performance win:
real 1.217800287
user 3.944160155
sys 0.362586413
Therefore, for this case, G1GC is a huge performance benefit, and the deduplication takes some of that performance gain and trades it for memory efficiency.
For the illuminate GUI, without G1GC:
user 10.363291056
sys 0.393676741
and with G1GC:
user 8.151806315
sys 0.401426176
(elapsed time isn't meaningful here as you're waiting for interaction to shut it down)
The other thing you'll sometimes see in this context is interning Strings. I tried that, and it didn't help at all.
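For anyone who hasn't met it, interning is something the code has to ask for explicitly, and it works on the String objects rather than on the backing arrays. A minimal sketch (the class and method names are made up for this example) looks like this:

```java
// Illustrative only: String.intern() returns one canonical String per
// distinct content, so equal strings end up as the same object.
final class InternDemo {

    // Returns the canonical copy of 's' from the JVM's string pool.
    static String canonical(String s) {
        return s.intern();
    }

    public static void main(String[] args) {
        // Two separate String objects with the same content.
        String a = new String("root");
        String b = new String("root");
        System.out.println(a == b);                       // false
        System.out.println(canonical(a) == canonical(b)); // true
    }
}
```

Every string you want shared has to pass through intern(), which is extra work on each one. That may be part of why it didn't pay off here, but that's a guess rather than something the timings show.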
Next, with a little more understanding of what was going on, I tried some modifications to the code to reduce the cost of storing all those Strings.
I did tweak my contents file reader slightly, to break lines up using a simple String.split() rather than using a StringTokenizer. (The Java docs recommend you don't use StringTokenizer any more, so this is also a bit of modernization.) I don't think the change in itself makes any difference, but it's slightly less work to simply ignore fields in an array from String.split() than to call nextToken() to skip over the ones you don't want.
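To show the difference in shape between the two approaches, here's a sketch that pulls the owner out of a single line both ways. The sample line, the field position, and the class and method names are all assumptions for illustration, not the actual illuminate code.

```java
import java.util.StringTokenizer;

// Illustrative comparison of skipping unwanted fields with
// StringTokenizer versus indexing into the array from String.split().
final class SplitVsTokenizer {

    // Hypothetical sample line; the field layout is assumed for illustration.
    static final String LINE =
        "/usr/bin/ls f none 0555 root bin 12345 6789 1234567890 SUNWcsu";

    // StringTokenizer: every unwanted field still needs a nextToken() call.
    static String ownerWithTokenizer(String line) {
        StringTokenizer st = new StringTokenizer(line, " ");
        st.nextToken(); // path
        st.nextToken(); // ftype
        st.nextToken(); // class
        st.nextToken(); // mode
        return st.nextToken(); // owner
    }

    // String.split(): go straight to the field you want by index.
    static String ownerWithSplit(String line) {
        String[] fields = line.split(" ");
        return fields[4];
    }

    public static void main(String[] args) {
        System.out.println(ownerWithTokenizer(LINE)); // root
        System.out.println(ownerWithSplit(LINE));     // root
    }
}
```

One difference to be aware of: with consecutive spaces, split(" ") produces empty strings where StringTokenizer would silently skip them, so the two only agree on single-space-delimited input like this.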
Storing the size and mtime as long - primitive types - saves a fair amount of memory too. Each String object is 24 bytes plus the content, so the saving is significant. And given that any uses will be of the numerical value, we may as well convert up front.
The ftype is only a single character. So storing that as a char avoids an object, saving space, and as it's a primitive value there's nothing left to intern or deduplicate.
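Put together, the idea is something like the sketch below. This is a hypothetical stand-in rather than the real ContentsFileDetail class, and the field positions in the sample line are assumed for illustration.

```java
// Hypothetical stand-in for a contents file entry: the numeric fields are
// parsed once into longs and the file type is held as a char, so none of
// them needs a String object of its own.
final class CompactDetail {
    final String path;
    final char ftype;  // single character, stored as a primitive
    final long size;   // converted up front
    final long mtime;  // converted up front

    CompactDetail(String path, char ftype, long size, long mtime) {
        this.path = path;
        this.ftype = ftype;
        this.size = size;
        this.mtime = mtime;
    }

    // Assumed layout: path ftype class mode owner group size cksum modtime pkg
    static CompactDetail parse(String line) {
        String[] f = line.split(" ");
        return new CompactDetail(
            f[0],
            f[1].charAt(0),
            Long.parseLong(f[6]),
            Long.parseLong(f[8]));
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from a real contents file.
        CompactDetail d = parse(
            "/usr/bin/ls f none 0555 root bin 12345 6789 1234567890 SUNWcsu");
        System.out.println(d.path + " " + d.ftype + " " + d.size + " " + d.mtime);
    }
}
```

The three primitive fields live inside the object itself, which is why you'd expect the per-entry object to grow a little while the String and char[] counts drop.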
That manual work gave me about another 10% speedup. What about memory usage?
Using primitive types rather than String gives us the following class histogram:
num #instances #bytes class name
----------------------------------------------
1: 1917289 102919512 [C
2: 1916938 46006512 java.lang.String
3: 862981 27615392 java.util.HashMap$Node
4: 388532 24866048 org.tribblix.illuminate.pkgview.ContentsFileDetail
So, changing the code gives almost the same memory saving as turning on String deduplication, without any performance hit.
There are 3 lessons here:
- Don't use Strings to store what could be primitive types if you can help it
- Under some (not all) circumstances, the G1 garbage collector can be a huge win
- When you're doing optimization, occasionally the win you get isn't the one you were looking for