Wednesday, July 08, 2020

Customizing EC2 instance storage and networking with the AWS CLI

I use AWS to run illumos quite a bit, either with Tribblix or OmniOS.

Creating EC2 instances with the console is fine for one-offs, but gets a bit tedious. So using the AWS CLI offers a better route, with the ec2 run-instances command.

Yes, there are things like templates and terraform and all sorts of other options. For whatever reason, they don't work in all cases.

In particular, the reasons you might want to customize an instance if you're running illumos might be slightly different from those of a more traditional usage model.

For storage, there are a couple of customizations we might want. The first is that the AMI has a fairly small root disk, which we might want to make larger. We may be adding zones, with their root filesystems installed on the system pool. We may be adding swap (while anonymous reservation means applications like java don't need to write to swap, you still need the space backing the swap to be available). The second is that we might actually want to use EBS to provide local storage (so we can use ZFS, for example, which has data integrity and manageability benefits).
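As an aside on the swap point: on illumos, swap is normally backed by a ZFS volume, so growing it (once the pool has the space) might look something like this. The 4g size is just an example:

```shell
# Grow swap backed by a ZFS volume (illumos; the size is an example)
swap -d /dev/zvol/dsk/rpool/swap    # remove the current swap device
zfs set volsize=4g rpool/swap       # enlarge the backing volume
swap -a /dev/zvol/dsk/rpool/swap    # add it back at the new size
```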

To automate the enlargement of the root pool, I create a mapping file that looks like this:

    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 12,
      "Encrypted": true

The size is in gigabytes. The /dev/xvda is the normal device name (as seen from EC2; in illumos, clearly, we have different naming). If that's in a file called storage.json, then the argument to the ec2 run-instances command is:

--block-device-mappings file://storage.json

Once the instance is running, that will normally (on my instances) show up as c2t0d0, and the rpool can be expanded to use all the available space with the following command:

zpool online -e rpool c2t0d0

To add an additional device, to keep application storage separate, in addition to that enlargement, the json file would look like:

    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 12,
      "Encrypted": true
    "DeviceName": "/dev/sdf",
    "Ebs": {
      "VolumeSize": 256,
      "DeleteOnTermination": false,
      "Encrypted": true

On my instances, I always use /dev/sdf, which comes out as c2t5d0.
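If the extra volume is destined for a separate ZFS pool, setting it up once the instance is running is then a one-liner; the pool name here is just an example:

```shell
# Create a separate pool for application data on the extra EBS volume
zpool create store c2t5d0
```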

For networking, I often end up with multiple IP addresses. This is because we have zones - rather than create multiple EC2 instances, it's far more efficient to run applications in zones on a single system, but then you want to assign each zone its own IP address.

You would think - supported by the documentation - that the --secondary-private-ip-addresses flag to ec2 run-instances would do the job. You would be wrong. That flag is actually supposed to be just a convenient shortcut for what I'm about to describe, but it doesn't actually work. (And terraform doesn't support this customization either - it can handle additional IP addresses, but not on the same interface as the primary.)

To configure multiple IP addresses we again turn to a json file. This looks like:

    "DeviceIndex": 0,
    "DeleteOnTermination": true,
    "SubnetId": "subnet-0abcdef1234567890",
    "Groups": ["sg-01234567890abcdef"],
    "PrivateIpAddresses": [
        "Primary": true,
        "PrivateIpAddress": ""
        "Primary": false,
        "PrivateIpAddress": ""

You have to define the subnet you're going to use (SubnetId) and the security group that will be applied (Groups) - these belong to the network interface, not to the instance (in the trivial case there's no difference), so you don't specify the security group(s) or the subnet as regular arguments. Then I define two IP addresses (you can have as many as you like): one is set as the primary ("Primary": true), and all the others will be secondary ("Primary": false). Again, if this is in a file network.json, you feed it to the command like so:

--network-interfaces file://network.json
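Once the instance is up, you can check that all the addresses were actually assigned; this uses the standard describe-instances query syntax, and the instance id here is a placeholder:

```shell
# List the private IP addresses attached to the instance's interfaces
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].NetworkInterfaces[].PrivateIpAddresses[].PrivateIpAddress'
```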

One other thing I found is that you can add tags to the instance (and to EBS volumes) at creation, saving you the effort of having to go through and tag things later. It's slightly annoying that it doesn't seem to allow you to apply different tags to different volumes: you can just say "apply these tags to the instance" and "apply these tags to the volumes". The catch is that the example in the documentation is wrong (it has single quotes, which you don't need and which don't work).

So the tag specification looks like:

--tag-specifications \
ResourceType=instance,Tags=[{Key=Name,Value=aws123a}] \
ResourceType=volume,Tags=[{Key=Name,Value=aws123a}]

In the square brackets, you can have multiple comma-separated key-value pairs. We have tags marking projects and roles so you have a vague idea of what's what.
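For example, adding project and role tags alongside the name might look like this (the Project and Role keys and their values here are illustrative, not the tags I actually use):

```shell
--tag-specifications \
ResourceType=instance,Tags=[{Key=Name,Value=aws123a},{Key=Project,Value=tribblix},{Key=Role,Value=builder}]
```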

Putting this all together you end up with a command like:

aws ec2 run-instances \
--region eu-west-2 \
--image-id ami-01a1a1a1a1a1a1a1a \
--instance-type t2.micro \
--key-name peter-key \
--network-interfaces file://network.json \
--count 1 \
--block-device-mappings file://storage.json \
--disable-api-termination \
--tag-specifications \
ResourceType=instance,Tags=[{Key=Name,Value=aws123a}] \
ResourceType=volume,Tags=[{Key=Name,Value=aws123a}]

Of course, I don't write either the json files or the command invocation by hand. I have a script that knows what all my AMIs and availability zones and subnets and security groups are and does the right thing for each instance I want to build.
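The script itself isn't shown here, but the JSON-templating part of such a script can be sketched with a simple heredoc; all the IDs and addresses below are placeholders, not real values:

```shell
#!/bin/sh
# Sketch: emit a network.json for ec2 run-instances from a few variables.
# Subnet, security group, and addresses are placeholders.
SUBNET="subnet-0abcdef1234567890"
SG="sg-01234567890abcdef"
PRIMARY="10.0.0.10"
SECONDARY="10.0.0.11"

cat > network.json <<EOF
[
  {
    "DeviceIndex": 0,
    "DeleteOnTermination": true,
    "SubnetId": "${SUBNET}",
    "Groups": ["${SG}"],
    "PrivateIpAddresses": [
      { "Primary": true, "PrivateIpAddress": "${PRIMARY}" },
      { "Primary": false, "PrivateIpAddress": "${SECONDARY}" }
    ]
  }
]
EOF
```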

Sunday, June 21, 2020

Java: trying out String deduplication and the G1 garbage collector

As of 8u20, Java supports automatic String deduplication.

-XX:+UseG1GC -XX:+UseStringDeduplication

You need to use the G1 garbage collector, and it will do the dedup as it scans the heap. Essentially, it checks each String, and if the backing char[] array is the same as one it's already got, it merges the references.

Obviously, this could save memory if you have a lot of repeated strings.

Consider my illuminate utility. One of the things it does is parse the old SVR4 packaging contents file. That's a big file, and there's a huge amount of duplication - while the file names are obviously unique, things like the type of file, permissions, owner, group, and names of packages are repeated many times. So, does turning this thing on make a difference?

Here's the head of the class histogram (produced by jcmd <pid> GC.class_histogram).

First without:

 num     #instances         #bytes  class name
   1:       2950682      133505088  [C
   2:       2950130       70803120  java.lang.String
   3:        862390       27596480  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

and now with deduplication:

 num     #instances         #bytes  class name
   1:       2950165       70803960  java.lang.String
   2:        557004       60568944  [C
   3:        862431       27597792  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

Note that there's the same number of entries in the contents file (there's one ContentsFileDetail for each line), and essentially the same number of String objects. But the [C, which is the char[] backing those Strings, has fallen dramatically. You're saving about a third of the memory used to store all that String data.

This also clearly demonstrates that the deduplication isn't on the String objects, those are unchanged, but on the char[] arrays backing those Strings.

Even more interesting is the performance. This is timing of a parser before:

real        1.730556446
user        7.977604040
sys         0.251854581

and afterwards:

real        1.469453551
user        6.054787878
sys         0.407259095

That's actually a bit of a surprise: G1GC is going to have to do work to do the comparisons to see if the strings are the same, and do some housekeeping if they are. However, with just the G1GC on its own, without deduplication, we get a big performance win:

real        1.217800287
user        3.944160155
sys         0.362586413

Therefore, for this case, G1GC is a huge performance benefit, and the deduplication takes some of that performance gain and trades it for memory efficiency.

For the illuminate GUI, without G1GC:

user       10.363291056
sys         0.393676741

and with G1GC:

user        8.151806315
sys         0.401426176

(elapsed time isn't meaningful here as you're waiting for interaction to shut it down)

The other thing you'll sometimes see suggested in this context is interning Strings. I tried that; it didn't help at all.

Next, with a little more understanding of what was going on, I tried some modifications to the code to reduce the cost of storing all those Strings.

I did tweak my contents file reader slightly, to break lines up using a simple String.split() rather than using a StringTokenizer. (The java docs recommend you don't use StringTokenizer any more, so this is also a bit of modernization.) I don't think the change of itself makes any difference, but it's slightly less work to simply ignore fields in an array from String.split() than call nextToken() to skip over the ones you don't want.

Saving the size and mtime as long - primitive types - saves a fair amount of memory too. Each String object is 24 bytes plus the content, so the saving is significant. And given that any uses will be of the numerical value, we may as well convert up front.

The ftype is only a single character. So storing that as a char avoids an object, saving space, and they're automatically interned for us.
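Those three changes together can be sketched like this; the field layout follows the contents(4) format, but the class and variable names are illustrative, not the exact code in illuminate:

```java
// Sketch: parse a contents-file line with String.split(), storing size
// and mtime as primitive longs and the file type as a char.
public class ContentsParse {
    public static void main(String[] args) {
        String line =
            "/usr/bin/ls f none 0555 root bin 18888 4567 1234567890 SUNWcs";
        String[] fields = line.split(" ");
        String path = fields[0];               // unique, so stays a String
        char ftype = fields[1].charAt(0);      // a char, not a String
        long size = Long.parseLong(fields[6]); // primitive, not a String
        long mtime = Long.parseLong(fields[8]);
        System.out.println(path + " " + ftype + " " + size + " " + mtime);
    }
}
```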

That manual work gave me about another 10% speedup. What about memory usage?

Using primitive types rather than String gives us the following class histogram:

 num     #instances         #bytes  class name
   1:       1917289      102919512  [C
   2:       1916938       46006512  java.lang.String
   3:        862981       27615392  java.util.HashMap$Node
   4:        388532       24866048  org.tribblix.illuminate.pkgview.ContentsFileDetail

So, changing the code gives almost the same memory saving as turning on String deduplication, without any performance hit.

There are 3 lessons here:

  1. Don't use Strings to store what could be primitive types if you can help it
  2. Under some (not all) circumstances, the G1 garbage collector can be a huge win
  3. When you're doing optimization occasionally the win you get isn't the one you were looking for

Tuesday, January 28, 2020

Some hardware just won't die

One of the machines I use to build SPARC packages for Tribblix is an old SunBlade 2000.

It's 18 years old, and is still going strong. Sun built machines like tanks, and this has outlasted the W2100z (Metropolis) that replaced it, and the Ultra 20 M2 that replaced that, and the Dell that replaced that.

It's had an interesting history. I used to work for the MRC, and our department used Sun desktops because they were the best value. Next to no maintenance costs, just worked, never failed, and were compatible with the server fleet. And having good machines more than paid back the extra upfront investment. (People are really expensive - giving them better equipment is very quickly rewarded through extra productivity.)

That compatibility gave us a couple of advantages. One was that we could simply chuck the desktops into the compute farm when they weren't being used, to give us extra power. The other was that when we turned up one morning to find 80% of our servers missing, we simply rounded up a bunch of desktops and promoted them, restoring full service within a couple of hours.

When the department was shut down all the computers were dispersed to other places. Anything over 3 years old had depreciated as an asset, and those were just given away. The SB2000 wasn't quite, but a research group went off to another university taking a full rack of gear and some of the desktops, found they weren't given anything like as much space as they expected, and asked me to keep the SB2000 in exchange for helping out with advice if they had a problem.

The snag with a SunBlade 2000 is that it's both huge and heavy. The domestic authorities weren't terribly enthusiastic when I came home with it and a pair of massive monitors.

The SB2000 ended up following me to my next job, where it was used for patch testing and then as a graphical station in the 2nd datacenter.

And it followed me to the job after, too. They gave me an entry-level SunBlade 1500; I brought in the SB2000 and its two 22-inch Sony CRTs.

After a while, we upgraded to Ultra 20 M2 workstations, which released the monster, initially again as a patch test box.

At around this time we were replacing production storage, which was a load of Sun fibre arrays hooked up to V880s, with either SAS raid arrays connected to X4200M2s, or thumpers for bulk image data. Which meant we had a number of random fibre arrays kicking around doing nothing.

And then someone shows up with an urgent project. They need to store and serve a pile of extra data, and it had been forgotten when the budget was put together. Could we help them out?

Half an hour later I had found some QLogic cards in the storeroom, borrowed some fibre cables from networking, shoved the cards in the free slots in the SB2000, hooked up the arrays, told ZFS to sort everything out, and we had a couple of terabytes of storage ready for the project.

It was actually a huge success and worked really well. Some time later a VP from the States was visiting and saw this Heath Robinson contraption we had put together. They were a little bit shocked (to put it mildly) to discover that a mission-critical customer-facing project worth millions of dollars was held together by a pile of refurbished rubbish and a member of staff's cast-offs; shortly thereafter the proper funding for the project magically appeared.

After changing jobs again it came home. By this time the kids were off to University and I had a decent amount of space for a home office, so it could stay. It immediately had Tribblix loaded on it and has been used once or twice a week ever since. And it might be a little slower than a modern machine, but it's definitely viable, and it's still showing no signs of shuffling off this mortal coil.