Monday, February 19, 2024

The SunOS JDK builder

I've been building OpenJDK on Solaris and illumos for a while.

This has been moderately successful; illumos distributions now have access to up-to-date LTS releases, most of which work well. (At least 11 and 17 are fine; 21 isn't quite right.)

There are even some third-party collections of my patches, primarily for Solaris (as opposed to illumos) builds.

I've added another tool. The SunOS jdk builder.

The aim here is to be able to build every single jdk tag, rather than going to one of the existing repos which only have the current builds. And, yes, you could grope through the git history to get to older builds, but one problem with that is that you can't actually fix problems with past builds.

Most of the content is in the jdk-sunos-patches repository. Here there are patches for both illumos and Solaris (they're ever so slightly different) for every tag I've built.

(That's almost every jdk tag since the Solaris/SPARC/Studio removal, and a few before that. Every so often I find I missed one. And there's been the odd bad patch along the way.)

The idea here is to make it easy to build every tag, and to do so on a current system. I've had to add new patches to get some of the older builds to work. The world has changed, we have newer compilers and other tools, and the OS we're building on has evolved. So if someone wanted to start building the jdk from scratch (and remember that you have to build all the versions in sequence) then this would be useful.

I'm using it for a couple of other things.

One is to put back SPARC support on illumos and Solaris. The initial port I did was on x86 only, so I'm walking through older builds and getting them to work on SPARC. We'll almost certainly not get to jdk21, but 17 seems a reasonable target.

The other thing is to enable the test suites, and then run them, and hopefully get them clean. At the moment they aren't, but a lot of that is because many tests are OS-specific and, not knowing what Solaris is, get confused. With all the tags available, I can bisect on failures and (hopefully) fix them.

Wednesday, November 22, 2023

Building up networks of zones on Tribblix

With OpenSolaris and derivatives such as illumos, we gained the ability to build a whole IT infrastructure in a single box, using virtualized networking (crossbow) to build the underlying network and then attaching virtualized systems (zones) atop virtualized storage (zfs).

Some of this was present in Solaris 10, but it didn't have crossbow so the networking piece was a bit tricky (although I did manage to get surprisingly far by abusing the loopback interface).

In Tribblix, I've long had the notion of a router or proxy zone, which acts as a bridge between the outside world and a local virtual subnet. For the next release I've been expanding that into something much more flexible and capable.

What did I need to put this together?

The first thing is a virtual network. You use dladm to create an etherstub. Think of that as a virtual switch you can connect network links to.
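Creating the virtual switch is a one-liner; the etherstub name here is just an example:

# a virtual switch for the internal subnet
dladm create-etherstub zstub0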

To connect that to the world, a zone is created with 2 network interfaces (vnics). One over the system interface so it can connect to the outside world, and one over the etherstub.
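Something along these lines, assuming the physical interface is e1000g0 and the etherstub is the zstub0 created above (names illustrative):

# vnic facing the outside world, over the physical interface
dladm create-vnic -l e1000g0 rnic0
# vnic facing the internal subnet, over the etherstub
dladm create-vnic -l zstub0 rnic1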

That special router zone is a little bit more than that. It runs NAT so that traffic from the internal subnet can reach the outside world - simple NAT, nothing complicated here. In order to do that the zone has to have IPFilter installed, and the zone creation script creates the right ipnat configuration file and ensures that IPFilter is started.
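A minimal sketch of what such an ipnat configuration looks like, assuming the outward-facing vnic is rnic0 and the internal subnet is 10.2.0.0/24 (both are illustrative, not necessarily what the script generates):

# rewrite outbound traffic from the internal subnet
map rnic0 10.2.0.0/24 -> 0/32 portmap tcp/udp auto
map rnic0 10.2.0.0/24 -> 0/32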

You also need to have IPFilter installed in the global zone. It doesn't have to be running there, but the installation is required to create the IPFilter devices. Those IPFilter devices are then exposed to the zone, and for that to work the zone needs to use exclusive-ip networking rather than shared-ip (and would need to do so anyway for packet forwarding to work).
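In zonecfg terms that amounts to something like the following fragment (a sketch, not the exact output of the zone creation script):

set ip-type=exclusive
# expose the IPFilter devices so ipnat can run inside the zone
add device
set match=/dev/ipf
end
add device
set match=/dev/ipnat
end
add device
set match=/dev/ipstate
end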

One thing I learnt was that you can't lock the router zone's networking down with allowed-address. The anti-spoofing protection that allowed-address gives you prevents forwarding and breaks NAT.

The router zone also has a couple of extra pieces of software installed. The first is haproxy, which is intended as an ingress controller. That's not currently used, and could be replaced by something else. The second is dnsmasq, which is used as a dhcp server to configure any zones that get connected to the subnet.

With a network segment in place, and a router zone for management, you can then create extra zones.

The way this works in Tribblix is that if you tell zap to create a zone with an IP address that is part of a private subnet, it will attach its network to the corresponding etherstub. That works fine for an exclusive-ip zone, where the vnic can be created directly over the etherstub.
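From memory, the invocation is along these lines - treat the exact flags as an assumption and check the zap documentation:

# ask for a zone on the 10.2.0.0/24 private subnet; zap notices the
# private address and wires the zone's networking to the etherstub
zap create-zone -t whole -z web1 -i 10.2.0.11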

For shared-ip zones it's a bit trickier. The etherstub isn't a real network device, although for some purposes (like creating a vnic) it looks like one. To allow shared-ip, I create a dedicated shared vnic over the etherstub, and the virtual addresses for shared-ip zones are associated with that vnic. For this to work, it has to be plumbed in the global zone, but doesn't need an address there. The downside to the shared-ip setup (or it might be an upside, depending on what the zone's going to be used for) is that in this configuration it doesn't get a network route; normally this would be inherited off the parent interface, but there isn't an IP configuration associated with the vnic in the global zone.
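A sketch of that plumbing, again with illustrative names:

# a dedicated vnic over the etherstub for shared-ip zones to hang off
dladm create-vnic -l zstub0 svnic0
# plumb it in the global zone, but leave it unaddressed
ipadm create-if svnic0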

The shared-ip zone is handed its IP address. For exclusive-ip zones, the right configuration fragment is poked into dnsmasq on the router zone, so that if the zone asks via dhcp it will get the answer you configured. Generally, though, if I can directly configure the zone I will. And that's either by putting the right configuration into the files in a zone so it implements the right networking at boot, or via cloud-init. (Or, in the case of a solaris10 zone, I populate sysidcfg.)
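The dnsmasq fragment is just a static host entry, presumably keyed on the zone vnic's MAC address; something like this (values invented for illustration):

# hand 10.2.0.11 to this zone's vnic if it asks via dhcp
dhcp-host=02:08:20:4e:12:34,10.2.0.11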

There are actually a lot of steps here, and doing it by hand would be rather (ahem, very) tedious. So it's all automated by zap, the package and system administration tool in Tribblix. To get a router zone, the user only needs to supply the zone's name, the public IP address, and the subnet address; all the rest of the work is done automatically, and the required details are saved so they can be picked up later. Likewise for a regular zone, it will do all the configuration based on the IP address you specify, with no extra input required from the user.

The whole aim here is to make building zones, and whole systems of zones, much easier and more reliable. And there's still a lot more capability to add.

Saturday, November 04, 2023

Keeping python modules in check

Any operating system distribution - and Tribblix is no different - will have a bunch of packages for python modules.

And one thing about python modules is that they tend to depend on other python modules. Sometimes a lot of python modules. Not only that, the dependency will be on a specific version - or range of versions - of particular modules.

Which opens up the possibility that two different modules might require incompatible versions of a module they both depend on.

For a long time, I was a bit lax about this. Most of the time you can get away with it (often because module writers are excessively cautious about newer versions of their dependencies). But occasionally I got bitten by upgrading a module and breaking something that used it, or breaking it because a dependency hadn't been updated to match.

So now I always check that I've got all the dependencies listed in packaging with

pip3 show modulename

and every time I update a module I check the dependencies aren't broken with

pip3 check
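If something is out of step, pip3 check will complain; the report looks something like this (module names and versions invented for illustration):

somemodule 1.2.3 has requirement urllib3<2, but you have urllib3 2.0.7.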

Of course, this relies on the machine having all the (interesting) modules installed, but on my main build machine that is generally true.

If an incompatibility is picked up by pip3 check then I'll either not do the update, or update any other modules to keep in sync. If an update is impossible, I'll take a note of which modules are blockers, and wait until they get an update to unjam the process.

A case in point was that urllib3 went to version 2.x recently. At first, nothing would allow that, so I couldn't update urllib3 at all. Now we're in a situation where I have one module I use that won't allow me to update urllib3, and am starting to see a few modules requiring urllib3 to be updated, so those are held downrev for the time being.

The package dependencies I declare tend to be the explicit module dependencies (as shown by pip3 show). Occasionally I'll declare some or all of the optional dependencies in packaging, if the standard use case suggests it. And there's no obvious easy way to emulate the notion of extras in package dependencies. But that can be handled in package overlays, which is the safest way in any case.

Something else the checking can pick up is when a dependency is removed, which is something that can be easily missed.

Doing all the checking adds a little extra work up front, but should help remove one class of package breakage.

Friday, October 27, 2023

It seemed like a simple problem to fix

While a bit under the weather last week, I decided to try and fix what at first glance appears to be a simple problem:

need to ship the manpage with exa

Now, exa is a modern file lister, and the package on Tribblix doesn't ship a man page. The reason for that, it turns out, is that there isn't a man page in the source, but you can generate one.
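The exa source carries the man page as markdown, so generating it is a single pandoc run; roughly like this (the input path is from memory, so check the repository):

# render the markdown man page source to roff
pandoc --standalone --to man man/exa.1.md --output exa.1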

To build the man page requires pandoc. OK, so how to get pandoc, which wasn't available on Tribblix? It's written in Haskell, and I did have a Haskell package.

Only my version of Haskell was a bit old, and wouldn't build pandoc. The build complains that it's too old and unsupported. You can't even build an old version of pandoc, which is a little peculiar.

Off to upgrade Haskell then. You need Haskell to build Haskell, and it has some specific requirements about precisely which versions of Haskell work. I wanted to get to 9.4, which is the last version of Haskell that builds using make (and I'll leave Hadrian for another day). You can't build Haskell 9.4 with 9.2, which it claims is too new; you have to go back to 9.0.

Fortunately we do have some bootstrap kits for illumos available, so I pulled 9.0 from there, successfully built Haskell, then cabal, and finally pandoc.

Back to exa. At which point you notice that it's been deprecated and replaced by eza. (This is a snag with modern point tools. They can disappear on a whim.)

So let's build eza. At which point I find that the MSRV (Minimum Supported Rust Version) has been bumped to 1.70, and I only had 1.69. Another update required. Rust is actually quite simple to package: you can just download the stable version and package it.

After all this, exa still doesn't have a man page, because it's deprecated (if you run man exa you get something completely different from X.Org). But I did manage to upgrade Haskell and Cabal, I managed to package pandoc, I updated rust, and I added a replacement utility - eza - which does now come with a man page.

Monday, October 09, 2023

When zfs was young

On the Solaris 10 Platinum Beta program, one of the most exciting promised features was ZFS, the new file system.

I was especially interested, given that I was in a data-heavy position at the time. The limits of UFS were painful; we already had datasets running into several terabytes, and even the multiterabyte file system support that got added was actually pretty useless because the inode density was so low. We tried QFS and SAM-QFS, and they were pretty appalling too.

ZFS was promised, and didn't arrive. In fact, there were about 4 of us on the beta program who saw the original zfs implementation, and it was quite different from what we have now. What eventually landed as zfs in Solaris was a complete rewrite. The beta itself was interesting - we were sent the driver, 3 binaries, and a 3-line cheatsheet, and that was it. There was a fundamental philosophy here that the whole thing was supposed to be so easy to use and sufficiently obvious that it didn't need a manual, and that was actually true. (It's gotten rather more complex since, to be fair.)

The original version was a bit different in implementation from what you're used to, but not that much. The most obvious change was that originally there wasn't a top-level file system for a pool. You created a pool, and then created your file systems. I'm still not sure which is the correct choice. And there was a separate zacl program to handle the ACLs, which were rather different.

In fact, ACLs have been a nightmare of bad implementations throughout their history on Solaris. I already had previous here, having got the POSIX draft ACL implementation reworked for UFS. The original zfs implementation had default aka inheritable ACLs applied to existing objects in a directory. (If you don't immediately realise how bad that is, think of what this allows you to do with hard links to files.) The ACL implementations have continued to be problematic - consider that zfs allows 5 settings for the aclinherit property as evidence that we're glittering a turd at this point.

Eventually we did get zfs shipped in a Solaris 10 update, and it's been continually developed since then. The openzfs project has given the file system an independent existence, it's now in FreeBSD, you can run it (and it runs well) on Linux, and in other OS variations too.

One of the original claims was that zfs was infinitely scalable. I remember it being suggested that you could create a separate zfs file system for each user. I had to try this, so got together a test system (an Ultra 2 with an A1000 disk array) and started creating file systems. Sure, it got into several thousand without any difficulty, but that's not infinite - think universities or research labs and you can easily have 10,000 or 100,000 users; we had well over 20,000. And it fell apart at that scale. That's before each is an NFS share, too. So that idea didn't fly.
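The sort of loop involved is trivial, which is rather the point; the pool and path names here are made up:

# parent for the per-user file systems
zfs create tank/home
# one file system per user
for u in user0001 user0002 user0003
do
  zfs create tank/home/$u
done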

Overall, though, zfs was a step change. The fact that you had a file system that was flexible and easily managed was totally new. The fact that a file system actually returned correct data rather than randomly hoping for the best was years ahead of anything else. Having snapshots that allowed users to recover from accidentally deleted files without waiting days for a backup to be restored dramatically improved productivity. It's win after win, and I can't imagine using anything else for storing data.

Is zfs perfect? Of course not, and to my mind one of the most shocking things is that nothing else has even bothered to try and come close.

There are a couple of weaknesses with zfs (or related to zfs, if I put it more accurately). One is that it's still a single-node file system. While we have distributed storage, we still haven't really matured that into a distributed file system. The second is that while zfs has dragged storage into the 21st century, allowing much more sophisticated and scalable management of data, there hasn't been a corresponding improvement in backup, which is still stuck firmly in the 1980s.

Wednesday, October 04, 2023

SMF - part of the Solaris 10 legacy

The Service Management Facility, or SMF, integrated extremely late in the Solaris 10 release cycle. We only got one or two beta builds to test, which seemed highly risky for such a key feature.

So there was very little time to gather feedback from users. And something that central really can't be modified once it's released. It had to work first time.

That said, we did manage some improvements. The current implementation of `svcs -x` is largely due to me struggling to work out why a service was broken.

One of the obvious things about SMF is that it relies on manifests written in XML. Yes, that's of its time - there's a lot of software you can date by the file format it uses.

I don't have a particular problem with the use of XML here, to be honest. What's more of a real problem is that the manifest files were presented as a user interface rather than an internal implementation detail, so that users were forced to write XML from scratch with little to no guidance.

There are a lot of good features around SMF.

Just the very basic restart of an application that dies is something that's so blindingly obvious as a requirement in an operating system. So much so that once it existed I refused to support anything that didn't have SMF when I was on call - after all, most of the 3am phone calls were to simply restart a crashed application. And yes, when we upgraded our systems to Solaris 10 with SMF our availability went way up and the on-call load plummeted.

Being able to grant privileges to a service, and just within the context of that service, without having to give privileges to an application (eg set*id) or a user, makes things so much safer. Although in practice it's mostly used to let applications bind to privileged ports while running as a regular user, as that's far and away the most common use case.
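As a sketch of that common case - the FMRI and user here are made up for illustration, and the property names are from memory - it looks something like:

# run the start method as an unprivileged user, adding just net_privaddr
svccfg -s svc:/network/myapp:default setprop start/user = astring: webservd
svccfg -s svc:/network/myapp:default setprop start/privileges = astring: "basic,net_privaddr"
svcadm refresh svc:/network/myapp:default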

Dependencies have been a bit of a mixed bag. Partly because working out what the dependencies should be in the first place is just hard to get right, but also because dependency declaration is bidirectional - you can inject a dependency on yourself into another service, and that other service may not respond well, or you can create a circular dependency if the two services are developed independently.

One part of dependency management in services is deciding whether a given service should start or not given the state of other services (such as its dependencies). Ideally, you want strict dependency management. In the real world, systems are messy and complicated, the dependency tree isn't terribly well understood, and some failure modes don't matter. And in many cases you want the system to try and boot as far as possible so you can get in and fix it.

A related problem is that we've ended up with a complex mesh of services because someone had to take the old mess of rc scripts and translate them into something that would work on day 1. And nobody - either at the time or since - has gone through the services and studied whether the granularity is correct. Another thing that has never happened, once we got a good handle on what services there are, is to look at whether the services we have are sensible, or whether there's an opportunity to rearchitect the system to do things better. And because all these services are now baked into SMF, it's actually quite difficult to do any major reworking of the system.

Not only that, but because people write SMF manifests, they simply copy something that looks similar to the problem at hand, so bad practices and inappropriate dependency declarations multiply.

This is one example of what I see as the big problem with SMF - we haven't got supporting tools that present the administrator with useful abstractions, so that everything is raw.

In terms of configuration management, SMF is very much a mixed bag. Yes, it guarantees a consistent and reproducible state of the system. The snag is that there isn't really an automated way to capture the essential state of a system and generate something that will reproduce it (either later or elsewhere) - it can be done, but it's essentially manual. (Backing up the state is a subset of this problem.)
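The nearest thing to a capture tool is svccfg itself, which can dump a service's current configuration as a manifest - but you have to drive it yourself, service by service:

# dump the live configuration of one service
svccfg extract svc:/network/ntp:default > ntp.xml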

It's clear that there were plans to extend the scope of SMF. Essentially, to be the Solaris version of the Windows registry. Thankfully (see also systemd for where this goes wrong) that hasn't happened much.

In fact, SMF hasn't really evolved in any material sense since the day it was introduced. It's very much stuck in time.

There were other features that were left open. For example, there's the notion of the scope of SMF, and the only one available right now is the "localhost" scope - see the smf(7) manual in illumos - so in theory there could be other, non-localhost, scopes. And there was the notion of monitor methods, which never appeared, but which I can imagine solving a range of niggling application issues I've seen over the years.


Monday, September 11, 2023

Retiring isaexec in Tribblix

One of the slightly unusual features in illumos, and Solaris because that's where it came from, is isaexec.

This facility allows you to have multiple implementations of a binary, and then isaexec will select the best one (for some definition of best).

The full implementation allows you to select from a wide range of architectures. On my machine it'll allow the following list:

amd64 pentium_pro+mmx pentium_pro
pentium+mmx pentium i486 i386 i86
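That list is just what isalist(1) reports for the machine, in order of preference:

# print the supported instruction set architectures, best first
isalist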

If you wanted, you could ship a highly tuned pentium_pro binary, and eke out a bit more performance.

The common case, though, and it's actually the only way isaexec is used in illumos, is to simply choose between a 32-bit and 64-bit binary. This goes back to when Solaris and illumos supported 32-bit and 64-bit hardware in the same system (and you could actually choose whether to boot 32-bit or 64-bit under certain circumstances). In this case, if you're running a 32-bit kernel you get a 32-bit application; if you're running 64-bit then you can get the 64-bit version of that application.
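The mechanism is just a directory layout: the name in your PATH is a hard link to isaexec, which then looks for the real binary in an architecture-specific subdirectory. For ps, say, it's along these lines:

/usr/bin/ps          # hard link to isaexec
/usr/bin/i86/ps      # the 32-bit binary
/usr/bin/amd64/ps    # the 64-bit binary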

Not all applications got this treatment. Anything that needed to interface directly with the kernel did (eg the ps utility). And for others it was largely about performance or scalability. But most userland applications were 32-bit, and still are in illumos. (Solaris has migrated most to 64-bit now, we ought to do the same.)

It's been 5 years or more since illumos removed the 32-bit kernel, so the only option is to run in 64-bit mode. So now, isaexec will only ever select the 64-bit binary.

A while ago, Tribblix simply removed the remaining 32-bit binaries that isaexec would have executed on a 32-bit system. This saved a bit of space.

The upcoming m32 release goes further. In almost all cases isaexec is no longer involved, and the 64-bit binary sits directly in the PATH (eg, in /usr/bin). There's none of the wasted redirection. I have put symbolic links in, just in case somebody explicitly referenced the 64-bit path.

This is all done by manipulating packaging - Tribblix runs the IPS package repo through a transformation step to produce the SVR4 packages that the distro uses, and this is just another filter in that process.

(There are a handful of exceptions where I still have 32-bit and 64-bit. Debuggers, for example, might need to match the bitness of the application being debugged. And the way that sh/ksh/ksh93 is installed needs a slightly less trivial transformation to get it right.)