Thursday, December 31, 2020

Running Eclipse on current illumos

You can run a slightly old version of Eclipse on illumos. The download is here.

Later Eclipse (and SWT) versions didn't have Solaris support. Or any other unix variant such as AIX or HP-UX, or certain hardware platforms, come to that.

However, if you try to run that Eclipse on current Tribblix or OpenIndiana, you'll find it doesn't work - you get a Java crash.

The way I found to fix this is to edit the eclipse.ini file to force the use of GTK2, by adding the following 2 lines in the middle
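For the record, the option involved is the Equinox launcher's GTK version switch; to the best of my recollection the two lines to add (they need to appear before the -vmargs section of the file) are:

```
--launcher.GTK_version
2
```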


so that the full eclipse.ini looks like:


This is necessary because we now have GTK3, and if you don't explicitly force GTK2 you end up with both GTK2 and GTK3 loaded, and chaos ensues.

The failure with GTK3 is slightly ironic, given that the reason for dropping Solaris support in eclipse was that we were stuck on GTK2.

Wednesday, December 23, 2020

Installing OmniOS on Vultr with iPXE

A while ago, I wrote about booting Tribblix on Vultr using iPXE.

Naturally, the question arises - is this possible for other illumos distributions, like OmniOS?

First, though, a note about why this might be interesting. Imagine I have some servers in a remote datacenter, and I need to install an OS. Most servers have some sort of remote management capability that includes mounting an ISO image. That's fine, but if you're at home then you're having to transfer the ISO image the wrong way across an asymmetric network connection. It's very slow, quite unreliable, and sometimes plain doesn't work. On the other hand, booting iPXE is very quick and very reliable, because the bootable image is so small. And the same benefits mean you can repeat the process until you've got it right.

Based on that, Vultr is just a very convenient way to test this out.

Because of the way that the OmniOS media are built, a straight copy of the Tribblix procedure won't work. This is what you need to do instead.

First, you need an accessible web server. I'm using the Tribblix repo server, but it really can be any server. It doesn't need to be illumos, as long as it responds to http requests you're good.

In the document root, create a directory. I'm going to call it 'kayak', because that's what the OmniOS network installer is called, and that's what we're going to leverage.

There are 3 files from the OmniOS distribution or media that we need.

You can't use the boot archive from the ISO; you need a custom boot archive for this. Download the miniroot file from the OmniOS download area - if you look in each release directory, you should find a gzipped miniroot file. Put that in the kayak directory, and gunzip it. I normally rename this to platform/i86pc/amd64/boot_archive. This is going to be the initrd file in your iPXE configuration.

The next thing you need is the corresponding kernel, aka unix. You need to extract this either from the ISO image or the miniroot. The ISO image might be easier; the miniroot contains a UFS filesystem. You could mount this up using lofi and copy the file out of it. You need the file /platform/i86pc/kernel/amd64/unix, and it needs to be located at platform/i86pc/kernel/amd64/unix.

Another way, if you have the iso-read command installed (on Tribblix it's in the TRIBlibcdio package), is

mkdir -p platform/i86pc/kernel/amd64
iso-read -i omniosce-r151030.iso \
  -e platform/i86pc/kernel/amd64/unix \
  -o platform/i86pc/kernel/amd64/unix

The third thing you need is the image to install. This is a compressed zfs send stream. For most releases you can download these - look for the .zfs.xz file. Make sure not to use the ones with ngz in the name - they're for zones.

If you're trying r151030, then there isn't a downloadable file, but you can pull this file out of the ISO too. For example

iso-read -i omniosce-r151030.iso \
  -e kayak/kayak_r151030.zfs.xz \
  -o kayak_r151030.zfs.xz

Then you need to create a client configuration script. The kayak instructions are a good start. For example, it might be:

BuildRpool c1t0d0
RootPW '$5$JQkyMDvv$pPzEUsvP/rLwURyrpwz5i1SfVqx2QiEoIdDA9ZrG271'
SetHostname omnios30
SetTimezone UTC
Postboot '/sbin/ipadm create-if vioif0'
Postboot '/sbin/ipadm create-addr -T dhcp vioif0/v4'

Obviously, you need to use the correct disk device (c1t0d0) and network interface (vioif0) appropriate to your system. These are the right values for Vultr.

The installer will look for this file using its (uppercased) MAC address, and then fall back to progressively shorter prefixes of the MAC address. As I don't know what that is yet, I simply create a file and create symlinks 0-9 and A-F pointing to it, as the search will eventually get to the first letter of the MAC address and one of them will match. If you know the actual MAC address of the system, you can create a file with the exact match.
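As a concrete sketch, the whole thing can be set up like this in the kayak directory (omnios30.cfg is just my name for the file; the contents are the example values from above):

```shell
# Write the client configuration script.
cat > omnios30.cfg <<'EOF'
BuildRpool c1t0d0
RootPW '$5$JQkyMDvv$pPzEUsvP/rLwURyrpwz5i1SfVqx2QiEoIdDA9ZrG271'
SetHostname omnios30
SetTimezone UTC
Postboot '/sbin/ipadm create-if vioif0'
Postboot '/sbin/ipadm create-addr -T dhcp vioif0/v4'
EOF
# Fallback symlinks: the installer's search works down to
# single-character names, so one of 0-9/A-F will always match the
# first hex digit of the MAC address.
for c in 0 1 2 3 4 5 6 7 8 9 A B C D E F; do
  ln -sf omnios30.cfg "$c"
done
```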

What you end up with is an ipxe.txt file that looks like this:

kernel /kayak/platform/i86pc/kernel/amd64/unix -B install_media=,install_config=
initrd /kayak/platform/i86pc/amd64/boot_archive

To recap:

  • The initrd file points at the location of the gunzipped miniroot file you downloaded.
  • The kernel line refers to the unix file you extracted.
  • The kernel line supplies boot arguments with the -B flag. There are two of them.
  • NOTE: the kernel line is just one line, despite how it gets formatted here.
  • The install_media argument is a URL to the ZFS send stream file you downloaded. Note that I'm using the IP address of the server rather than its DNS name, as you can't guarantee that DNS resolution will work.
  • The install_config argument is the URL of the directory that the client configuration scripts are put in. The installer will look for files based on (a substring of) the client's MAC address in this directory.
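Putting hypothetical values into the blanks (203.0.113.10 stands in for my web server's address, and the r151030 file names are the ones used earlier), the complete ipxe.txt would look something like:

```
#!ipxe
kernel /kayak/platform/i86pc/kernel/amd64/unix -B install_media=http://203.0.113.10/kayak/kayak_r151030.zfs.xz,install_config=http://203.0.113.10/kayak
initrd /kayak/platform/i86pc/amd64/boot_archive
boot
```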

With that all set, you're ready to boot a machine on Vultr.

Go to instances, and click on 'Deploy instance'.

Choose a server. Cloud Compute is the basic starter, I normally use London as it's close to me to minimize latency.

For Server Type, go to the Upload ISO tab, choose iPXE instead of My ISOs, and enter the URL of your ipxe file, ipxe.txt.


Then select your server size - even the smallest 1024MB instance will work fine, and hit the Deploy Now button.

It should show as Installing for a few moments, and then change to Running. If you click on Running you get to the page for that instance, and the 'view console' link does exactly that, over VNC.

The install is pretty quick:


(You can see it trying the variant client configuration scripts.)

And it will then reboot into the installed system.

This is a typical minimalist OmniOS install. You'll probably have to set DNS up (if you didn't do it in your client configuration script), add a user, and then you'll be able to ssh in remotely.

Tuesday, November 17, 2020

Adventures in Server Rooms

I'm a fan of the cloud. Honest. Providing someone else is paying for it, that is, given how ridiculously expensive it is.

But running stuff in the cloud is a whole lot easier for me. Someone else fixing hardware and infrastructure is a big win. No more driving to the datacenter at 3 in the morning. No more wandering into a room that's baking at 70C because the AC failed. No crawling about under the raised floor trying to trace that cable. No having a disk array in pieces because vendor A's SCSI adapter doesn't quite work right with vendor B's SCSI disk drive.

That said, some of the things that happened while I was fixing hardware do stand out.

Like the time I was putting together one of these stupid cable management arms. Feeling very satisfied with myself, I tried to step back only to discover that I had managed to velcro and cable tie myself into the damn thing. Unable to actually move the one arm, or get round it with the other, I eventually wriggled out of the stuck sweater and spent another 10 minutes rescuing my clothing from the clutches of the evil server.

Or, one morning, when I came in to find an SGI Challenge dead as a dodo. This was the under the desk model, back in the day when every machine had its own VT320 or similar perched on the desk above or alongside. I speak to SGI, and they assure me that everything's fine, I just need to replace the fuse. I dutifully do so and crawl under the desk to turn it back on, with my boss (the benefits of rank!) standing as far away as possible. One huge flash and enormous bang were followed by another bang as my head slammed into the underside of the desk. Some sharp words with the SGI support line followed.

Before we had datacenters, we had server rooms. Before that, we had cupboards. As a temporary measure, the campus router (oh, the days when you could find a Cisco AGS+ anywhere) was in a cupboard. The air conditioning unit extracted moisture from the air into what was effectively a bucket on a scale. When the bucket was full, it was heavy enough to tip the scale and power off the AC unit - after all, you don't want puddles all over the floor. Depending on humidity, this wouldn't last overnight, let alone a weekend, resulting in extra trips across the county to empty the thing out of hours.

Servers are relatively benign beasts compared to the unreliable monstrosities responsible for datacenter power and cooling. In the early days at Hinxton, we had regular power cuts. The generator would decide to cut in, but there were several seconds before it had spun up to speed. It turned out that the thresholds on the generator for detecting low power had been set back in the days when mains voltage was 240V. Now, it's 220V, and the official range is technically the same, but when all the local farms brought the cows in for milking there was enough of a drop in voltage that the generator thought there was going to be loss of power and kicked in unnecessarily.

Once, when it cut over, the bus bars stuck halfway, fused, and the generator shed had a pool of molten copper on the floor. It was some hours before that had cooled far enough for the electricians to safely enter the building.

When we got a new building, it was much bigger, so the generators they put in were sized for critical equipment only. Along comes the first big mains outage, and whoever had assigned the "critical" circuits had done so essentially at random. Specifically, our servers (which had UPS protection to tide them through blips long enough for the generators to come on stream) clearly weren't getting any juice. There was a lot of yelling, and we had to shut half the stuff down to buy time, before they connected up the right circuits.

Someone higher up decided we needed fire protection. So that was put in and wired up; the idea is that it kills all power and cooling. There's also a manual switch, which was put by the door so you can hit the big red button as you're legging it out of there. Too close to the door, as it turns out - on the wall exactly where the door handle goes when you open the door.

They installed a status panel on the wall outside too. The electrician wiring this up never satisfactorily explained how he managed to cut all power, although he must have almost had a heart attack when a posse of system administrators came flying out of offices demanding to know what had happened.

Numeracy isn't known as a builder's strong point. Common sense likewise. So the datacenter has an outside door, then a ramp up to an inner door. We believed in having a decent amount of room under the raised floor - enough to crawl under if necessary. So they put in the doorframe, but hadn't allowed for the raised floor. When we went for the initial inspection we had to crawl through the door, it was that low.

Even the final version wasn't really high enough. The old Sun 42U racks are actually quite short - the power supply at the base eats into the dimensions, so the external height is just over 42U and there was really only 36U available for servers. When we ordered our first Dell 42U rack, we had to take all the servers out, half dismantle it, lay it down, and it took six of us to wangle it through the dogleg and the door (avoiding the aforementioned big red emergency cut off button!). After that, we went out and ordered custom 39U racks to be sure they would fit.

When you run a network connection to a campus, you use multiple routes - so we had one fibre running to Cambridge, another to London. Unfortunately, as it turned out, there's about 100 yards from the main building ingress to the main road, and someone cheated and laid the two fibres next to each other. There was some building work one day, and it had carefully been explained where the hole should and shouldn't be dug. Mid-afternoon, Sun Net Manager (remember that?) went totally mental. We go to the window, and there's a large hole, a JCB, and the obligatory gaggle of men in hard hats staring into the hole. My boss says "They're not supposed to be digging there!" and, indeed, they had managed to dig straight through the cables. If it had been a clean cut it wouldn't have been so bad (there's always a little slack so they can splice the fibres in the event of a break), but they had managed to pull the fibres out completely so they had to be blown back from London/Cambridge before we got back on the air.

Sunday, October 25, 2020

Tribblix on Vultr with Block Storage

I wrote a while back about installing Tribblix on Vultr using iPXE.

That all still works perfectly with the m23.2 release. And it's pretty quick too. It's even quicker on the new "High Frequency" servers, that appear to be a 3.8GHz Skylake chip and NVME storage.

One of the snags with many of the smaller cloud providers is that they don't necessarily have a huge choice of instance configurations. I'm not saying I necessarily want the billions of choices that AWS provide, but the instances are fairly one dimensional - CPU, memory, and storage aren't adjustable independently. This is one of the reasons I use AWS, because I can add storage without having to pay for extra CPU or memory.

One option here is to take a small instance, and add extra storage to it. That's what I do on AWS, having a small root volume and adding a large EBS volume to give me the space I need. This isn't - yet - all that commonly available on other providers.

Vultr do have it as an option, but at the moment it's only available in New York (NJ). Let's deploy an instance, and select NY/NJ as the location.

Scroll down and choose the "Upload ISO" tab, select the iPXE radio button, and put in the ipxe.txt script location for the version of Tribblix you want to install.

I've chosen the 1GB instance size, which is ample. Deploy that, and then connect to the console.

View the console and watch the install. This is really quick if you deploy in London, and isn't too bad elsewhere in Europe, as the Tribblix server I'm loading from is in London. Transferring stuff across the Atlantic takes a bit longer.

Then run the install. This is just ./ -G c2t0d0

It will download a number of packages to finish the install (these are normally loaded off the ISO if you boot that, but for an iPXE install it pulls them over the network).

Reboot the server and it will boot normally off disk.

Go to the Block Storage tab, and Add some block storage. I'm going to add 50GB just to play with.

Now we need to connect it to our server.

This isn't quite as obvious as I would have liked. Click on the little pencil icon and you get to the "Manage Block Storage" page. Select the instance you want to attach it to, and hit Attach. This will restart the server, so make sure you're not doing anything.

The documentation talks about /dev/vdb and the like, which isn't the way we name our devices. As we're using vioblk, this comes up as c3t0d0 (the initial boot drive is c2t0d0).

We can create a pool

zpool create -O compression=lz4 store c3t0d0

This took a little longer than I expected to complete.

And I can do all the normal things you would expect with a zfs pool.

If you go to Block Storage and click the pencil to Manage it, the size is clickable. I clicked that, and changed the size to 64GB.

Like resizing an EBS volume on AWS, there doesn't seem to be a way to persuade illumos to rescan the devices to spot that the size has changed. You have to reboot.

Except that a reboot doesn't appear to be enough. The Vultr console says "You will need to restart your server via the control panel before changes will be visible." - and it appears to be correct on that.

(This is effectively power-cycling it, which is presumably necessary to propagate the change through the whole stack properly.)

After that, the additional space is visible, as you can see from the extra 14G in the EXPANDSZ column:


And you can expand the pool using 'online -e'

zpool online -e store c3t0d0

This caused me a little bit of trouble. This appeared to generate an I/O error, lots of messages, and a hung console. I had to ssh in, clear the pool, and run a scrub, before things looked sane. Expanding the pool then worked and things look OK.

Generally, block device resize appears to be possible, but is still a bit rough round the edges.

Sunday, October 18, 2020

The state of Tribblix, 2020

It's been a funny year, has 2020.

But amongst all this, work on Tribblix continues.

I released milestone 22 back in March. That was a fairly long time in the making, as the previous release was 9 months earlier. Part of the reason for the lengthy delay was that there wasn't all that much reason for a new release - there are a lot of updated packages, but no big items. I guess the biggest thing is that the default gcc compiler and runtime went from gcc4 to gcc7. (In places, the gcc4 name continues.)

Milestone 23 was the next full release, in July. Things start to move again here - Tribblix fully transitioned from gcc4 to gcc7, as illumos is now a gcc7 build. I updated the MATE desktop, which was the start of moving from gtk2 to gtk3. There's a prettier boot banner, which allows a bit of custom branding.

There's a long-running effort to migrate from Python 2.x to 3.x. This is slow going - there are actually quite a lot of python modules and tools (and things that use python) that still show no sign of engaging with the Python 3 shift. But I'm gradually making sure that everything that can be version 3 is, and removing the python 2 pieces where possible. This is getting a bit more complicated - as of Python 3.8 I've switched from 32-bit to 64-bit. And now that they're doing time-based releases, there will be a version bump to navigate every year, just to add to the work.

Most of the Tribblix releases have been full upgrades from one version to the next. With the milestone 20 series, I had update releases, which allowed a shared stream of userland packages, while allowing illumos updates to take place. The same model is true of milestone 23 - update 1 came along in September.

With Milestone 23 update 1 we fixed the bhyve CVE. Other than normal updates, I added XView, which suits the retro theme and I've had quite a few people ask for.

Immediately after that (it was supposed to be in 23.1 but wasn't quite ready) came another major update: refreshing the X server stack.

When Tribblix was created, I didn't have the resources to build everything from scratch straight away, so "borrowed" a few components from OpenIndiana (initially 151a8, then 151a9) just to make sure I had enough bits to provide a complete OS. Many of the isolated components were replaced fairly quickly over time, but the X11 stack was the big holdout. It was finally time to build Xorg and the drivers myself. It wasn't too difficult, but to be honest I have no real way to test most of it. So that will all be present in 23.2.

One reason for doing this - and my hand was forced a little here - is that I've also updated Xfce from 4.12 to 4.14. That's also a gtk2 to gtk3 switch, but Xfce 4.14 simply didn't work on the old Xorg I had before.

Something else I've put together - and these are all gtk3 based - is a lightweight desktop, using abiword, geany, gnumeric, grisbi, imagination, and netsurf. You still need a file manager to round out the set, and I really haven't found anything that's lightweight and builds successfully, so at the moment this is really an adjunct to MATE or Xfce.

Alongside all this I've been working on keeping Java OpenJDK working on illumos. They ripped out Solaris support early in the year, but I've been able to put that back. The real killer here was Studio support, and we don't want that anyway (it's not open source, and the binaries no longer run). There are other unix-like variants supported by Java, running on the x86 architecture with a gcc toolchain, just like us, so it shouldn't be that much of a mountain to climb.

Support for SPARC is currently slightly on the back burner, partly because the big changes mentioned above aren't really relevant for SPARC, partly due to less time, partly due to the weather - running SPARC boxes in the home office tends to be more of a winter than a summer pursuit, due to the heat.

Monday, August 17, 2020

Solaris 10 zones on Tribblix

One of the interesting capabilities of Solaris zones was the ability to run older versions of Solaris than that in the global zone. Marketing managed to mangle this into Containers, and it was supported for Solaris 8 and Solaris 9.

I used this extensively on one project, to lift a whole datacenter of ancient (yes, really ancient) Sun servers into zones on a couple of T5240s. Worked great. (We had to get an E450 out of the dumpster and build it specially to get a Solaris 2.6 system, however.)

Solaris 11 and illumos have dropped the Solaris 8 and 9 legacy containers, but have a Solaris 10 zone brand. On Tribblix, this can be installed with

zap install TRIBsys-zones-brand-s10

(If you're on an IPS based distro, the package name is system/zones/brand/s10.)

Installing an s10 branded zone is just like a regular zone, but you need a Solaris 10 image to install from. You could tar up a legacy system, or create a new image from the install media.

There are certain requirements for the software in the image and on the host. If, on a Solaris 10 system, you look in the directory /usr/lib/brand/solaris10, you might see a couple of files called 0 and 1. They have a little bit of text in them for explanation, but these are emulation compatibility feature flags. If you look at the illumos source, you can see them listed too. This is a basic versioning system - the host running the global zone needs to support all the features of the software in the zone. Fortunately that feature list hasn't changed, so we're good.

For the image, the s10 brand checks the SUNWcakr package and needs it to be patched to a minimum level. In practice, this means that anything S10U8 or newer will work.

During zone installation, there's some sanity checking. It turns out the installer is looking for /var/sadm/system/admin/INST_RELEASE and gives up if it can't find it. I had to manually create that file:

cat > ......./var/sadm/system/admin/INST_RELEASE
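The elided path above is under the root of your image; as a sketch (with S10 standing in for the image root, e.g. /export/S10), creating the file with what I believe are the standard Solaris 10 contents looks like this:

```shell
# Create INST_RELEASE in the image root so the zone installer's
# sanity check passes. S10 here is a stand-in for your image root.
mkdir -p S10/var/sadm/system/admin
cat > S10/var/sadm/system/admin/INST_RELEASE <<'EOF'
OS=Solaris
VERSION=10
REV=0
EOF
```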

There's also some processing of the zone image that gets done as part of the zone installation. And if that processing fails, then the zone install will fail too.

It has a hardcoded list of safe_dir entries for things it needs to replace. If you don't have those, it simply fails, so you need to add a bunch of packages just to make it happy. Specifically: autofs, zfs, and ipf.

The postprocessing also runs sys-unconfig, so you need to make sure that's present, from SUNWadmap.

Enough chatter. If I have my Solaris 10 media under /mnt1, and want to create an image under /export/S10, then it's going to look like this:

cd /mnt1/Solaris_10/Product
# SUNWCcs SUNWcar SUNWcakr SUNWkvm
pkgadd -d . -R /export/S10 SUNWcsr
pkgadd -d . -R /export/S10 SUNWckr
pkgadd -d . -R /export/S10 SUNWcnetr
pkgadd -d . -R /export/S10 SUNWcsd
pkgadd -d . -R /export/S10 SUNWcsl
pkgadd -d . -R /export/S10 SUNWcsu
pkgadd -d . -R /export/S10 SUNWcar.i
pkgadd -d . -R /export/S10 SUNWcakr.i
pkgadd -d . -R /export/S10 SUNWkvm.i
pkgadd -d . -R /export/S10 SUNWcslr
# SUNWCfmd
pkgadd -d . -R /export/S10 SUNWfmdr
pkgadd -d . -R /export/S10 SUNWfmd
# SUNWClexpt
pkgadd -d . -R /export/S10 SUNWlexpt
# SUNWCpicl
pkgadd -d . -R /export/S10 SUNWpiclr
pkgadd -d . -R /export/S10 SUNWpiclu
# SUNWCopenssl SUNWhea (but not man)
pkgadd -d . -R /export/S10 SUNWopensslr
pkgadd -d . -R /export/S10 SUNWopenssl-libraries
pkgadd -d . -R /export/S10 SUNWhea
pkgadd -d . -R /export/S10 SUNWopenssl-include
pkgadd -d . -R /export/S10 SUNWopenssl-commands
# SUNWCpkgcmds SUNWwbsup
pkgadd -d . -R /export/S10 SUNWproduct-registry-root
pkgadd -d . -R /export/S10 SUNWwsr2
pkgadd -d . -R /export/S10 SUNWpkgcmdsr
pkgadd -d . -R /export/S10 SUNWwbsup
pkgadd -d . -R /export/S10 SUNWpkgcmdsu
pkgadd -d . -R /export/S10 SUNWpr
pkgadd -d . -R /export/S10 SUNWtls
pkgadd -d . -R /export/S10 SUNWjss
# SUNWCfwshl
pkgadd -d . -R /export/S10 SUNWbash
pkgadd -d . -R /export/S10 SUNWtcsh
pkgadd -d . -R /export/S10 SUNWzsh
# perl
pkgadd -d . -R /export/S10 SUNWperl584core
pkgadd -d . -R /export/S10 SUNWperl584usr
# SUNWCptoo SUNWtecla SUNWesu SUNWtoo
pkgadd -d . -R /export/S10 SUNWtecla
pkgadd -d . -R /export/S10 SUNWbtool
pkgadd -d . -R /export/S10 SUNWesu
pkgadd -d . -R /export/S10 SUNWcpp
pkgadd -d . -R /export/S10 SUNWtoo
pkgadd -d . -R /export/S10 SUNWlibmr
pkgadd -d . -R /export/S10 SUNWlibm
pkgadd -d . -R /export/S10 SUNWlibmsr
pkgadd -d . -R /export/S10 SUNWlibms
pkgadd -d . -R /export/S10 SUNWsprot
# SUNWCfwcmp SUNWlibC
pkgadd -d . -R /export/S10 SUNWlibC
pkgadd -d . -R /export/S10 SUNWbzip
pkgadd -d . -R /export/S10 SUNWgzip
pkgadd -d . -R /export/S10 SUNWzip
pkgadd -d . -R /export/S10 SUNWzlib
# release and sys-unconfig
pkgadd -d . -R /export/S10 SUNWsolnm
pkgadd -d . -R /export/S10 SUNWadmr
pkgadd -d . -R /export/S10 SUNWadmlib-sysid
pkgadd -d . -R /export/S10 SUNWadmap
# autofs is needed for validation
pkgadd -d . -R /export/S10 SUNWatfsr
pkgadd -d . -R /export/S10 SUNWatfsu
# ditto zfs
pkgadd -d . -R /export/S10 SUNWlxml
pkgadd -d . -R /export/S10 SUNWsmapi
pkgadd -d . -R /export/S10 SUNWzfskr
pkgadd -d . -R /export/S10 SUNWzfsr
pkgadd -d . -R /export/S10 SUNWzfsu
# ditto ipf
pkgadd -d . -R /export/S10 SUNWipfr
pkgadd -d . -R /export/S10 SUNWipfu
# It's about 235M at this point

If you cd to /export/S10, make sure the INST_RELEASE file is there with the correct contents (see above) and then tar up what you have, you can feed that tarball to the zone installation and it should work.

If you look at documentation for s10 zones on Solaris 11, you'll see a -c option. We don't have that, but you could drop a sysidcfg file into /etc/sysidcfg in the zone so it will configure itself at boot. The skeleton looks something like this (the values inside the braces are elided here; the network_interface section differs between shared and exclusive IP zones):

# shared
network_interface=primary {
}
# exclusive
network_interface=primary {
}
name_service=DNS {
}

If you're using Tribblix, most of the zone creation is simplified, and it will be:

zap create-zone -t s10 -z s10-test4 -I /tmp/S10.tar -i

I haven't tried this on SPARC (my use case is building Java and Node.js), but it ought to be exactly the same modulo trivial changes to package names.

Wednesday, July 08, 2020

Customizing EC2 instance storage and networking with the AWS CLI

I use AWS to run illumos quite a bit, either with Tribblix or OmniOS.

Creating EC2 instances with the console is fine for one-offs, but gets a bit tedious. So using the AWS CLI offers a better route, with the ec2 run-instances command.

Yes, there are things like templates and terraform and all sorts of other options. For whatever reason, they don't work in all cases.

In particular, the reasons you might want to customize an instance running illumos may be slightly different from those of a more traditional usage model.

For storage, there are a couple of customizations we might want. The first is that the AMI has a fairly small root disk, which we might want to make larger: we may be adding zones, with their root filesystems installed on the system pool, or adding swap (while anonymous reservation means applications like java don't need to write to swap, there still has to be space available to back it). The second is that we might actually want to use EBS to provide local storage, so we can use ZFS, for example, with its data integrity and manageability benefits.

To automate the enlargement of the root pool, I create a mapping file that looks like this:

[
  {
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 12,
      "Encrypted": true
    }
  }
]

The size is in gigabytes. /dev/xvda is the normal device name as EC2 sees it (illumos, of course, names the device differently). If that's in a file called storage.json, then the argument to the ec2 run-instances command is:

--block-device-mappings file://storage.json

Once the instance is running, that will normally (on my instances) show up as c2t0d0, and the rpool can be expanded to use all the available space with the following command:

zpool online -e rpool c2t0d0

To add an additional device, to keep application storage separate, in addition to that enlargement, would involve a json file like:

[
  {
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 12,
      "Encrypted": true
    }
  },
  {
    "DeviceName": "/dev/sdf",
    "Ebs": {
      "VolumeSize": 256,
      "DeleteOnTermination": false,
      "Encrypted": true
    }
  }
]

On my instances, I always use /dev/sdf, which comes out as c2t5d0.

For networking, I often end up with multiple IP addresses. This is because we have zones - rather than create multiple EC2 instances, it's far more efficient to run applications in zones on a single system, but then you want to assign each zone its own IP address.

You would think - supported by the documentation - that the --secondary-private-ip-addresses flag to ec2 run-instances would do the job. You would be wrong. That flag, actually, is supposed to just be a convenient shortcut for what I'm about to describe, but it doesn't actually work. (And terraform doesn't support this customization either - it can handle additional IP addresses, but not on the same interface as the primary.)

To configure multiple IP addresses we again turn to a json file. This looks like:

[
  {
    "DeviceIndex": 0,
    "DeleteOnTermination": true,
    "SubnetId": "subnet-0abcdef1234567890",
    "Groups": ["sg-01234567890abcdef"],
    "PrivateIpAddresses": [
      {
        "Primary": true,
        "PrivateIpAddress": ""
      },
      {
        "Primary": false,
        "PrivateIpAddress": ""
      }
    ]
  }
]
You have to define the subnet you're going to use (SubnetId) and the security group that will be applied (Groups) - these belong to the network interface, not to the instance (in the trivial case there's no difference), so you don't specify the security group(s) or the subnet as regular arguments. Then I define two IP addresses (you can have as many as you like): one is set as the primary ("Primary": true), and all the others will be secondary ("Primary": false). Again, if this is in a file network.json, you feed it to the command with

--network-interfaces file://network.json
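A typo in these json files tends to produce an unhelpful error from the CLI, so it's worth checking that the file at least parses before use. A sketch, with made-up subnet, security group, and address values:

```shell
# Write a network.json with placeholder values, then confirm it is
# valid JSON before handing it to aws ec2 run-instances.
cat > network.json <<'EOF'
[
  {
    "DeviceIndex": 0,
    "DeleteOnTermination": true,
    "SubnetId": "subnet-0abcdef1234567890",
    "Groups": ["sg-01234567890abcdef"],
    "PrivateIpAddresses": [
      { "Primary": true,  "PrivateIpAddress": "10.0.0.10" },
      { "Primary": false, "PrivateIpAddress": "10.0.0.11" }
    ]
  }
]
EOF
python3 -m json.tool network.json > /dev/null && echo valid
```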

One other thing I found is that you can add tags to the instance (and to EBS volumes) at creation, saving you the effort of having to go through and tag things later. It's slightly annoying that it doesn't seem to allow you to apply different tags to different volumes, you can just say "apply these tags to the instance" and "apply these tags to the volumes". The trick is that the example in the documentation is wrong (it has single quotes, which you don't need and don't work).

So the tag specification looks like:

--tag-specifications \
ResourceType=instance,Tags=[{Key=Name,Value=aws123a}] \
ResourceType=volume,Tags=[{Key=Name,Value=aws123a}]

In the square brackets, you can have multiple comma-separated key-value pairs. We have tags marking projects and roles so you have a vague idea of what's what.

Putting this all together you end up with a command like:

aws ec2 run-instances \
--region eu-west-2 \
--image-id ami-01a1a1a1a1a1a1a1a \
--instance-type t2.micro \
--key-name peter-key \
--network-interfaces file://network.json \
--count 1 \
--block-device-mappings file://storage.json \
--disable-api-termination \
--tag-specifications \
ResourceType=instance,Tags=[{Key=Name,Value=aws123a}] \
ResourceType=volume,Tags=[{Key=Name,Value=aws123a}]

Of course, I don't write either the json files or the command invocation by hand. I have a script that knows what all my AMIs and availability zones and subnets and security groups are and does the right thing for each instance I want to build.
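As a sketch of what such a helper might look like, something along these lines is enough to emit a usable network.json (the subnet, security group, and addresses below are invented placeholders, not real values):

```shell
#!/bin/sh
# Hypothetical generator for network.json; in practice these values
# would be looked up per-instance from your own inventory
SUBNET="subnet-0abcdef1234567890"
SG="sg-01234567890abcdef"
PRIMARY="10.15.32.10"
SECONDARY="10.15.32.11"

cat > network.json <<EOF
[
    {
        "DeviceIndex": 0,
        "DeleteOnTermination": true,
        "SubnetId": "${SUBNET}",
        "Groups": ["${SG}"],
        "PrivateIpAddresses": [
            { "Primary": true, "PrivateIpAddress": "${PRIMARY}" },
            { "Primary": false, "PrivateIpAddress": "${SECONDARY}" }
        ]
    }
]
EOF
```

The output file is then passed straight to run-instances with --network-interfaces file://network.json.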

Sunday, June 21, 2020

Java: trying out String deduplication and the G1 garbage collector

As of 8u20, java supports automatic String deduplication.

-XX:+UseG1GC -XX:+UseStringDeduplication

You need to use the G1 garbage collector, and it will do the dedup as you scan the heap. Essentially, it checks each String and if the backing char[] array is the same as one it's already got, it merges the references.

Obviously, this could save memory if you have a lot of repeated strings.
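To see the mechanism in action, here's a minimal toy (the class name and counts are mine, not from the post) that fills the heap with identical strings, each with its own backing array:

```java
import java.util.ArrayList;
import java.util.List;

public class DedupDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        char[] chars = {'r', 'o', 'o', 't'};
        for (int i = 0; i < 1_000_000; i++) {
            // new String(char[]) copies the array, so each String
            // starts out with its own distinct backing char[]
            strings.add(new String(chars));
        }
        // Run with -XX:+UseG1GC -XX:+UseStringDeduplication; a real
        // experiment would pause here so you could watch the char[]
        // count collapse in jcmd <pid> GC.class_histogram
        System.out.println(strings.size());
    }
}
```

The String objects themselves stay distinct; only the million char[] copies are candidates for merging.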

Consider my illuminate utility. One of the things it does is parse the old SVR4 packaging contents file. That's a big file, and there's a huge amount of duplication - while the file names are obviously unique, things like the type of file, permissions, owner, group, and names of packages are repeated many times. So, does turning this thing on make a difference?

Here's the head of the class histogram (produced by jcmd <pid> GC.class_histogram).

First without:

 num     #instances         #bytes  class name
   1:       2950682      133505088  [C
   2:       2950130       70803120  java.lang.String
   3:        862390       27596480  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

and now with deduplication:

 num     #instances         #bytes  class name
   1:       2950165       70803960  java.lang.String
   2:        557004       60568944  [C
   3:        862431       27597792  java.util.HashMap$Node
   4:        388539       21758184  org.tribblix.illuminate.pkgview.ContentsFileDetail

Note that there's the same number of entries in the contents file (there's one ContentsFileDetail for each line), and essentially the same number of String objects. But the [C, which is the char[] backing those Strings, has fallen dramatically. You're saving about a third of the memory used to store all that String data.

This also clearly demonstrates that the deduplication isn't on the String objects, those are unchanged, but on the char[] arrays backing those Strings.

Even more interesting is the performance. This is timing of a parser before:

real        1.730556446
user        7.977604040
sys         0.251854581

and afterwards:

real        1.469453551
user        6.054787878
sys         0.407259095

That's actually a bit of a surprise: G1GC is going to have to do work to do the comparisons to see if the strings are the same, and do some housekeeping if they are. However, with just the G1GC on its own, without deduplication, we get a big performance win:

real        1.217800287
user        3.944160155
sys         0.362586413

Therefore, for this case, G1GC is a huge performance benefit, and the deduplication takes some of that performance gain and trades it for memory efficiency.

For the illuminate GUI, without G1GC:

user       10.363291056
sys         0.393676741

and with G1GC:

user        8.151806315
sys         0.401426176

(elapsed time isn't meaningful here as you're waiting for interaction to shut it down)

The other thing you'll sometimes see in this context is interning Strings. I tried that, and it didn't help at all.

Next, with a little more understanding of what was going on, I tried some modifications to the code to reduce the cost of storing all those Strings.

I did tweak my contents file reader slightly, to break lines up using a simple String.split() rather than using a StringTokenizer. (The java docs recommend you don't use StringTokenizer any more, so this is also a bit of modernization.) I don't think the change of itself makes any difference, but it's slightly less work to simply ignore fields in an array from String.split() than call nextToken() to skip over the ones you don't want.

Saving the size and mtime as long - primitive types - saves a fair amount of memory too. Each String object is 24 bytes plus the content, so the saving is significant. And given that any uses will be of the numerical value, we may as well convert up front.

The ftype is only a single character. So storing that as a char avoids an object, saving space, and they're automatically interned for us.
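Putting those three changes together, a hypothetical slimmed-down line holder might look like this (the field layout assumes a plain-file entry in the SVR4 contents format; the class and field names are my own sketch, not the actual illuminate code):

```java
// Simplified parser for one contents-file line; the real format
// varies by file type, this assumes a plain file entry such as:
//   /etc/motd f none 0644 root sys 61 5404 1123456789 SUNWcsr
public class ContentsLine {
    final String path;
    final char ftype;    // a char, not a String: no extra object per line
    final String owner;
    final String group;
    final long size;     // primitive long: no 24-byte String header
    final long mtime;    // stored ready for numerical use

    ContentsLine(String line) {
        String[] st = line.split(" ");   // rather than StringTokenizer
        path = st[0];
        ftype = st[1].charAt(0);
        owner = st[4];
        group = st[5];
        size = Long.parseLong(st[6]);
        mtime = Long.parseLong(st[8]);
    }

    public static void main(String[] args) {
        ContentsLine c = new ContentsLine(
            "/etc/motd f none 0644 root sys 61 5404 1123456789 SUNWcsr");
        System.out.println(c.ftype + " " + c.size + " " + c.mtime);
    }
}
```

Unwanted fields (class, mode, checksum) are simply ignored by indexing past them, which is less work than calling nextToken() to skip them.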

That manual work gave me about another 10% speedup. What about memory usage?

Using primitive types rather than String gives us the following class histogram:

 num     #instances         #bytes  class name
   1:       1917289      102919512  [C
   2:       1916938       46006512  java.lang.String
   3:        862981       27615392  java.util.HashMap$Node
   4:        388532       24866048  org.tribblix.illuminate.pkgview.ContentsFileDetail

So, changing the code gives almost the same memory saving as turning on String deduplication, without any performance hit.

There are 3 lessons here:

  1. Don't use Strings to store what could be primitive types if you can help it
  2. Under some (not all) circumstances, the G1 garbage collector can be a huge win
  3. When you're doing optimization occasionally the win you get isn't the one you were looking for

Tuesday, January 28, 2020

Some hardware just won't die

One of the machines I use to build SPARC packages for Tribblix is an old SunBlade 2000.

It's 18 years old, and is still going strong. Sun built machines like tanks, and this has outlasted the W2100z (Metropolis) that replaced it, and the Ultra 20 M2 that replaced that, and the Dell that replaced that.

It's had an interesting history. I used to work for the MRC, and our department used Sun desktops because they were the best value. Next to no maintenance costs, just worked, never failed, and were compatible with the server fleet. And having good machines more than paid back the extra upfront investment. (People are really expensive - giving them better equipment is very quickly rewarded through extra productivity.)

That compatibility gave us a couple of advantages. One was that we could simply chuck the desktops into the compute farm when they weren't being used, to give us extra power. The other was that when we turned up one morning to find 80% of our servers missing, we simply rounded up a bunch of desktops and promoted them, restoring full service within a couple of hours.

When the department was shut down all the computers were dispersed to other places. Anything over 3 years old had depreciated as an asset, and those were just given away. The SB2000 wasn't quite that old, but a research group went off to another university taking a full rack of gear and some of the desktops, found they weren't given anything like as much space as they expected, and asked me to keep the SB2000 in exchange for helping out with advice if they had a problem.

The snag with a SunBlade 2000 is that it's both huge and heavy. The domestic authorities weren't terribly enthusiastic when I came home with it and a pair of massive monitors.

The SB2000 ended up following me to my next job, where it was used for patch testing and then as a graphical station in the 2nd datacenter.

And it followed me to the job after, too. They gave me an entry-level SunBlade 1500; I brought in the SB2000 and its two 22-inch Sony CRTs.

After a while, we upgraded to Ultra 20 M2 workstations. Which released the monster, initially again as a patch test box.

At around this time we were replacing production storage, which was a load of Sun fiber arrays hooked up to V880s, with either SAS raid arrays connected to X4200M2s, or thumpers for bulk image data. Which meant we had a number of random fiber arrays kicking around doing nothing.

And then someone showed up with an urgent project. They needed to store and serve a pile of extra data, which had been forgotten when the budget was put together. Could we help them out?

Half an hour later I had found some QLogic cards in the storeroom, borrowed some fibre cables from networking, shoved the cards in the free slots in the SB2000, hooked up the arrays, told ZFS to sort everything out, and we had a couple of terabytes of storage ready for the project.

It was actually a huge success and worked really well. Some time later a VP from the States was visiting and saw this Heath Robinson contraption we had put together. They were a little bit shocked (to put it mildly) to discover that a mission-critical customer-facing project worth millions of dollars was held together by a pile of refurbished rubbish and a member of staff's cast-offs; shortly thereafter the proper funding for the project magically appeared.

After changing jobs again it came home. By this time the kids were off to University and I had a decent amount of space for a home office, so it could stay. It immediately had Tribblix loaded on it and has been used once or twice a week ever since. And it might be a little slower than a modern machine, but it's definitely viable, and it's still showing no signs of shuffling off this mortal coil.