Wednesday, November 22, 2023

Building up networks of zones on Tribblix

With OpenSolaris and derivatives such as illumos, we gained the ability to build a whole IT infrastructure in a single box, using virtualized networking (crossbow) to build the underlying network and then attaching virtualized systems (zones) atop virtualized storage (zfs).

Some of this was present in Solaris 10, but it didn't have crossbow so the networking piece was a bit tricky (although I did manage to get surprisingly far by abusing the loopback interface).

In Tribblix, I've long had the notion of a router or proxy zone, which acts as a bridge between the outside world and a local virtual subnet. For the next release I've been expanding that into something much more flexible and capable.

What did I need to put this together?

The first thing is a virtual network. You use dladm to create an etherstub. Think of that as a virtual switch you can connect network links to.

To connect that to the world, a zone is created with 2 network interfaces (vnics). One over the system interface so it can connect to the outside world, and one over the etherstub.
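
As a rough sketch of that layout, the function below only prints the dladm commands rather than running them. The link names (e1000g0, zstub0, and the _ext0/_int0 suffixes) are hypothetical, chosen for illustration, and aren't necessarily what zap uses.

```shell
#!/bin/sh
# Print (not run) the dladm commands for a virtual switch plus the two
# vnics a router zone needs. All link names are hypothetical examples.
plan_router_links() {
    phys="$1"    # physical link in the global zone
    stub="$2"    # etherstub acting as the virtual switch
    zone="$3"    # router zone name, used to derive the vnic names

    echo "dladm create-etherstub ${stub}"
    # one vnic over the system interface, facing the outside world
    echo "dladm create-vnic -l ${phys} ${zone}_ext0"
    # one vnic over the etherstub, facing the private subnet
    echo "dladm create-vnic -l ${stub} ${zone}_int0"
}

plan_router_links e1000g0 zstub0 router
```

Printing the commands first makes it easy to review the plan before running anything as root.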

That special router zone is a little bit more than that. It runs NAT to allow any traffic on the internal subnet - simple NAT, nothing complicated here. In order to do that the zone has to have IPFilter installed, and the zone creation script creates the right ipnat configuration file and ensures that IPFilter is started.
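
For illustration, a simple NAT configuration of this sort could be generated along the following lines. The interface name and subnet are assumptions, and the real zone creation script may well write something different. The output would normally end up in the zone's ipnat.conf (typically /etc/ipf/ipnat.conf on illumos).

```shell
#!/bin/sh
# Emit simple NAT rules for an internal subnet going out via one interface.
# "0/32" tells ipnat to use whatever address the interface currently has.
gen_ipnat() {
    ext_if="$1"    # external vnic inside the router zone (hypothetical name)
    subnet="$2"    # internal subnet in CIDR form
    cat <<EOF
map ${ext_if} ${subnet} -> 0/32 portmap tcp/udp auto
map ${ext_if} ${subnet} -> 0/32
EOF
}

gen_ipnat router_ext0 10.1.2.0/24
```

The first rule remaps TCP and UDP ports; the second catches everything else, such as ICMP.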

You also need to have IPFilter installed in the global zone. It doesn't have to be running there, but the installation is required to create the IPFilter devices. Those IPFilter devices are then exposed to the zone, and for that to work the zone needs to use exclusive-ip networking rather than shared-ip (and would need to do so anyway for packet forwarding to work).

One thing I learnt was that you can't lock the router zone's networking down with allowed-address. The anti-spoofing protection that allowed-address gives you prevents forwarding and breaks NAT.

The router zone also has a couple of extra pieces of software installed. The first is haproxy, which is intended as an ingress controller. That's not currently used, and could be replaced by something else. The second is dnsmasq, which is used as a dhcp server to configure any zones that get connected to the subnet.

With a network segment in place, and a router zone for management, you can then create extra zones.

The way this works in Tribblix is that if you tell zap to create a zone with an IP address that is part of a private subnet, it will attach its network to the corresponding etherstub. That works fine for an exclusive-ip zone, where the vnic can be created directly over the etherstub.

For shared-ip zones it's a bit trickier. The etherstub isn't a real network device, although for some purposes (like creating a vnic) it looks like one. To allow shared-ip, I create a dedicated shared vnic over the etherstub, and the virtual addresses for shared-ip zones are associated with that vnic. For this to work, it has to be plumbed in the global zone, but doesn't need an address there. The downside to the shared-ip setup (or it might be an upside, depending on what the zone's going to be used for) is that in this configuration it doesn't get a network route; normally this would be inherited off the parent interface, but there isn't an IP configuration associated with the vnic in the global zone.

The shared-ip zone is handed its IP address. For exclusive-ip zones, the right configuration fragment is poked into dnsmasq on the router zone, so that if the zone asks via dhcp it will get the answer you configured. Generally, though, if I can directly configure the zone I will. And that's either by putting the right configuration into the files in a zone so it implements the right networking at boot, or via cloud-init. (Or, in the case of a solaris10 zone, I populate sysidcfg.)
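
A dnsmasq fragment of that kind might look like the output of the sketch below. It assumes the zone's vnic has a known MAC address; the MAC, hostname, and address shown are made up, and the fragment zap actually writes may be keyed differently.

```shell
#!/bin/sh
# Emit a dnsmasq static lease line tying a MAC address to a name and IP.
# dhcp-host is a standard dnsmasq option; the values passed in are examples.
gen_dhcp_host() {
    mac="$1"; name="$2"; ip="$3"
    echo "dhcp-host=${mac},${name},${ip}"
}

gen_dhcp_host 02:08:20:aa:bb:cc web1 10.1.2.10
```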

There's actually a lot of steps here, and doing it by hand would be rather (ahem, very) tedious. So it's all automated by zap, the package and system administration tool in Tribblix. The user asks for a router zone, and all it needs to be given is the zone's name, the public IP address, and the subnet address, and all the work will be done automatically. It saves all the required details so that they can be picked up later. Likewise for a regular zone, it will do all the configuration based on the IP address you specify, with no extra input required from the user.

The whole aim here is to make building zones, and whole systems of zones, much easier and more reliable. And there's still a lot more capability to add.

Saturday, November 04, 2023

Keeping python modules in check

Any operating system distribution - and Tribblix is no different - will have a bunch of packages for python modules.

And one thing about python modules is that they tend to depend on other python modules. Sometimes a lot of python modules. Not only that, the dependency will be on a specific version - or range of versions - of particular modules.

Which opens up the possibility that two different modules might require incompatible versions of a module they both depend on.

For a long time, I was a bit lax about this. Most of the time you can get away with it (often because module writers are excessively cautious about newer versions of their dependencies). But occasionally I got bitten by upgrading a module and breaking something that used it, or breaking it because a dependency hadn't been updated to match.

So now I always check that I've got all the dependencies listed in packaging with

pip3 show modulename

and every time I update a module I check the dependencies aren't broken with

pip3 check

Of course, this relies on the machine having all the (interesting) modules installed, but on my main build machine that is generally true.

If an incompatibility is picked up by pip3 check then I'll either not do the update, or update any other modules to keep in sync. If an update is impossible, I'll take a note of which modules are blockers, and wait until they get an update to unjam the process.
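
The two checks can be wrapped up in a small helper. This is only a sketch: it relies on the "Requires:" line that pip3 show prints, and the module name in the usage comment is just an example.

```shell
#!/bin/sh
# Read `pip3 show` output on stdin and print one required module per line,
# ready to compare against the dependencies declared in packaging.
requires_of() {
    sed -n 's/^Requires: *//p' | tr ',' '\n' |
        sed -e 's/^ *//' -e 's/ *$//' -e '/^$/d'
}

# Typical use:
#   pip3 show requests | requires_of
#   pip3 check || echo "dependency conflict - hold the update back"
```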

A case in point was that urllib3 went to version 2.x recently. At first, nothing would allow that, so I couldn't update urllib3 at all. Now we're in a situation where I have one module I use that won't allow me to update urllib3, and am starting to see a few modules requiring urllib3 to be updated, so those are held downrev for the time being.

The package dependencies I declare tend to be the explicit module dependencies (as shown by pip3 show). Occasionally I'll declare some or all of the optional dependencies in packaging, if the standard use case suggests it. And there's no obvious easy way to emulate the notion of extras in package dependencies. But that can be handled in package overlays, which is the safest way in any case.

Something else the checking can pick up is when a dependency is removed, which is something that can be easily missed.

Doing all the checking adds a little extra work up front, but should help remove one class of package breakage.

Friday, October 27, 2023

It seemed like a simple problem to fix

While a bit under the weather last week, I decided to try and fix what at first glance appears to be a simple problem:

need to ship the manpage with exa

Now, exa is a modern file lister, and the package on Tribblix doesn't ship a man page. The reason for that, it turns out, is that there isn't a man page in the source, but you can generate one.

To build the man page requires pandoc. OK, so how to get pandoc, which wasn't available on Tribblix? It's written in Haskell, and I did have a Haskell package.

Only my version of Haskell was a bit old, and wouldn't build pandoc. The build complains that it's too old and unsupported. You can't even build an old version of pandoc, which is a little peculiar.

Off to upgrade Haskell then. You need Haskell to build Haskell, and it has some specific requirements about precisely which versions of Haskell work. I wanted to get to 9.4, which is the last version of Haskell that builds using make (and I'll leave Hadrian for another day). You can't build Haskell 9.4 with 9.2, which it claims is too new; you have to go back to 9.0.

Fortunately we do have some bootstrap kits for illumos available, so I pulled 9.0 from there, successfully built Haskell, then cabal, and finally pandoc.

Back to exa. At which point you notice that it's been deprecated and replaced by eza. (This is a snag with modern point tools. They can disappear on a whim.)

So let's build eza. At which point I find that the MSRV (Minimum Supported Rust Version) has been bumped to 1.70, and I only had 1.69. Another update required. Rust is actually quite simple to package, you can just download the stable version and package it.

After all this, exa still doesn't have a man page, because it's deprecated (if you run man exa you get something completely different from X.Org). But I did manage to upgrade Haskell and Cabal, I managed to package pandoc, I updated rust, and I added a replacement utility - eza - which does now come with a man page.

Monday, October 09, 2023

When zfs was young

On the Solaris 10 Platinum Beta program, one of the most exciting promised features was ZFS, the new file system.

I was especially interested, given that I was in a data-heavy position at the time. The limits of UFS were painful, we had datasets into several terabytes already - and even the multiterabyte file system support that got added was actually pretty useless because the inode density was so low. We tried QFS and SAM-QFS, and they were pretty appalling too.

ZFS was promised, and didn't arrive. In fact, there were about 4 of us on the beta program who saw the original zfs implementation, and it was quite different from what we have now. What eventually landed as zfs in Solaris was a complete rewrite. The beta itself was interesting - we were sent the driver, 3 binaries, and a 3-line cheatsheet, and that was it. There was a fundamental philosophy here that the whole thing was supposed to be so easy to use and sufficiently obvious that it didn't need a manual, and that was actually true. (It's gotten rather more complex since, to be fair.)

The original version was a bit different in terms of implementation than what you're used to, but not that much. The most obvious change was that originally there wasn't a top-level file system for a pool. You created a pool, and then created your file systems. I'm still not sure which is the correct choice. And there was a separate zacl program to handle the ACLs, which were rather different.

In fact, ACLs have been a nightmare of bad implementations throughout their history on Solaris. I already had previous here, having got the POSIX draft ACL implementation reworked for UFS. The original zfs implementation had default aka inheritable ACLs applied to existing objects in a directory. (If you don't immediately realise how bad that is, think of what this allows you to do with hard links to files.) The ACL implementations have continued to be problematic - consider that zfs allows 5 settings for the aclinherit property as evidence that we're glittering a turd at this point.

Eventually we did get zfs shipped in a Solaris 10 update, and it's been continually developed since then. The openzfs project has given the file system an independent existence: it's now in FreeBSD, you can run it (and it runs well) on Linux, and it's in other OS variations too.

One of the original claims was that zfs was infinitely scalable. I remember it being suggested that you could create a separate zfs file system for each user. I had to try this, so got together a test system (an Ultra 2 with an A1000 disk array) and started creating file systems. Sure, it got into several thousand without any difficulty, but that's not infinite - think universities or research labs and you can easily have 10,000 or 100,000 users, we had well over 20,000. And it fell apart at that scale. That's before each is an NFS share, too. So that idea didn't fly.

Overall, though, zfs was a step change. The fact that you had a file system that was flexible and easily managed was totally new. The fact that a file system actually returned correct data rather than randomly hoping for the best was years ahead of anything else. Having snapshots that allowed users to recover from accidentally deleted files without waiting days for a backup to be restored dramatically improved productivity. It's win after win, and I can't imagine using anything else for storing data.

Is zfs perfect? Of course not, and to my mind one of the most shocking things is that nothing else has even bothered to try and come close.

There are a couple of weaknesses with zfs (or related to zfs, if I put it more accurately). One is that it's still a single-node file system. While we have distributed storage, we still haven't really matured that into a distributed file system. The second is that while zfs has dragged storage into the 21st century, allowing much more sophisticated and scalable management of data, there hasn't been a corresponding improvement in backup, which is still stuck firmly in the 1980s.

Wednesday, October 04, 2023

SMF - part of the Solaris 10 legacy

The Service Management Facility, or SMF, integrated extremely late in the Solaris 10 release cycle. We only got one or two beta builds to test, which seemed highly risky for such a key feature.

So there was very little time to gather feedback from users. And something that central really can't be modified once it's released. It had to work first time.

That said, we did manage some improvements. The current implementation of `svcs -x` is largely due to me struggling to work out why a service was broken.

One of the obvious things about SMF is that it relies on manifests written in XML. Yes, that's of its time - there's a lot of software you can date by the file format it uses.

I don't have a particular problem with the use of XML here, to be honest. What's more of a real problem is that the manifest files were presented as a user interface rather than an internal implementation detail, so that users were forced to write XML from scratch with little to no guidance.

There are a lot of good features around SMF.

Just the very basic restart of an application that dies is something that's so blindingly obvious as a requirement in an operating system. So much so that once it existed I refused to support anything that didn't have SMF when I was on call - after all, most of the 3am phone calls were to simply restart a crashed application. And yes, when we upgraded our systems to Solaris 10 with SMF our availability went way up and the on-call load plummeted.

Being able to grant privileges to a service, and just within the context of that service, without having to give privileges to an application (eg set*id) or a user, makes things so much safer. Although in practice it's letting applications bind to privileged ports while running as a regular user, as that's far and away the most common use case.
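
For the privileged-port case, the usual approach is to set the user and privileges on the service's start method. The sketch below prints the commands instead of running them. The service name and user are hypothetical, and the exact property layout is worth checking against smf_method(7) on your system before relying on it.

```shell
#!/bin/sh
# Print the svccfg/svcadm commands that would let a service's start method
# run as an ordinary user while still binding to ports below 1024.
# The FMRI and user name passed in below are made-up examples.
plan_privaddr() {
    fmri="$1"; user="$2"
    echo "svccfg -s ${fmri} setprop start/user = astring: ${user}"
    echo "svccfg -s ${fmri} setprop start/privileges = astring: basic,net_privaddr"
    echo "svcadm refresh ${fmri}"
    echo "svcadm restart ${fmri}"
}

plan_privaddr svc:/network/myapp:default webservd
```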

Dependencies have been a bit of a mixed bag. Partly because working out what the dependencies should be in the first place is just hard to get right, but also because dependency declaration is bidirectional - you can inject a dependency on yourself into another service, and that other service may not respond well, or you can create a circular dependency if the two services are developed independently.

One part of dependency management in services is deciding whether a given service should start or not given the state of other services (such as its dependencies). Ideally, you want strict dependency management. In the real world, systems are messy and complicated, the dependency tree isn't terribly well understood, and some failure modes don't matter. And in many cases you want the system to try and boot as far as possible so you can get in and fix it.

A related problem is that we've ended up with a complex mesh of services because someone had to take the old mess of rc scripts and translate them into something that would work on day 1. And nobody - either at the time or since - has gone through the services and studied whether the granularity is correct. One other thing that again has never happened, once we got a good handle on what services there are, is to look at whether the services we have are sensible, or whether there's an opportunity to rearchitect the system to do things better. And because all these services are now baked into SMF, it's actually quite difficult to do any major reworking of the system.

Not only that, but because people write SMF manifests, they simply copy something that looks similar to the problem at hand, so bad practices and inappropriate dependency declarations multiply.

This is one example of what I see as the big problem with SMF - we haven't got supporting tools that present the administrator with useful abstractions, so that everything is raw.

In terms of configuration management, SMF is very much a mixed bag. Yes, it guarantees a consistent and reproducible state of the system. The snag is that there isn't really an automated way to capture the essential state of a system and generate something that will reproduce it (either later or elsewhere) - it can be done, but it's essentially manual. (Backing up the state is a subset of this problem.)

It's clear that there were plans to extend the scope of SMF. Essentially, to be the Solaris version of the Windows registry. Thankfully (see also systemd for where this goes wrong) that hasn't happened much.

In fact, SMF hasn't really evolved in any material sense since the day it was introduced. It's very much stuck in time.

There were other features that were left open. For example, there's the notion of the scope of SMF, and the only one available right now is the "localhost" scope - see the smf(7) manual in illumos - so in theory there could be other, non-localhost, scopes. And there was the notion of monitor methods, which never appeared but which I can imagine solving a range of niggling application issues I've seen over the years.


Monday, September 11, 2023

Retiring isaexec in Tribblix

One of the slightly unusual features in illumos, and Solaris because that's where it came from, is isaexec.

This facility allows you to have multiple implementations of a binary, and then isaexec will select the best one (for some definition of best).

The full implementation allows you to select from a wide range of architectures. On my machine it'll allow the following list:

amd64 pentium_pro+mmx pentium_pro
pentium+mmx pentium i486 i386 i86

If you wanted, you could ship a highly tuned pentium_pro binary, and eke out a bit more performance.

The common case, though, and it's actually the only way isaexec is used in illumos, is to simply choose between a 32-bit and 64-bit binary. This goes back to when Solaris and illumos supported 32-bit and 64-bit hardware in the same system (and you could actually choose whether to boot 32-bit or 64-bit under certain circumstances). In this case, if you're running a 32-bit kernel you get a 32-bit application; if you're running 64-bit then you can get the 64-bit version of that application.
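
The selection logic amounts to walking the list of instruction sets, best first, and running the first matching binary found in a subdirectory of that name. Here is a shell sketch of that idea. It is not the real isaexec code, and the function name is made up.

```shell
#!/bin/sh
# Sketch of isaexec-style selection: given a directory, a program name and
# an ordered list of ISAs, print the first executable found at
# <dir>/<isa>/<name>. Returns non-zero if nothing matches.
isa_pick() {
    dir="$1"; name="$2"; shift 2
    for isa in "$@"; do
        if [ -x "$dir/$isa/$name" ]; then
            echo "$dir/$isa/$name"
            return 0
        fi
    done
    return 1
}

# On illumos the list could come from the isalist command:
#   isa_pick /usr/bin ps $(isalist)
```

With only a 64-bit kernel available, the first entry always wins, which is why the indirection has become redundant.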

Not all applications got this treatment. Anything that needed to interface directly with the kernel did (eg the ps utility). And for others it was largely about performance or scalability. But most userland applications were 32-bit, and still are in illumos. (Solaris has migrated most to 64-bit now, we ought to do the same.)

It's been 5 years or more since illumos removed the 32-bit kernel, so the only option is to run in 64-bit mode. So now, isaexec will only ever select the 64-bit binary.

A while ago, Tribblix simply removed the remaining 32-bit binaries that isaexec would have executed on a 32-bit system. This saved a bit of space.

The upcoming m32 release goes further. In almost all cases isaexec is no longer involved, and the 64-bit binary sits directly in the PATH (eg, in /usr/bin). There's none of the wasted redirection. I have put symbolic links in, just in case somebody explicitly referenced the 64-bit path.

This is all done by manipulating packaging - Tribblix runs the IPS package repo through a transformation step to produce the SVR4 packages that the distro uses, and this is just another filter in that process.

(There are a handful of exceptions where I still have 32-bit and 64-bit. Debuggers, for example, might need to match the bitness of the application being debugged. And the way that sh/ksh/ksh93 is installed needs a slightly less trivial transformation to get it right.)

Monday, September 04, 2023

Modernizing scripts in Tribblix

It's something I've been putting off for far too long, but it's about time to modernize all the shell scripts that Tribblix is built on.

Part of the reason it's taken this long is the simple notion of, if it ain't broke, don't fix it.

But some of the scripting was starting to look a bit ... old. Antiquated. Prehistoric, even.

And there's a reason for that. Much of the scripting involved in Tribblix is directly derived from the system administration scripts I've been using since the mid-1990s. That involved managing Solaris systems with SVR4 packages, and when I built a distribution derived from OpenSolaris, using SVR4 packages, I just lifted many of my old scripts verbatim. And even new functionality was copied or slightly modified.

Coming from Solaris 2.3 through 10, this meant that they were very strictly Bourne Shell. A lot of the capabilities you might expect in a modern shell simply didn't exist. And much of the work was to be done in the context of installation (i.e. Jumpstart) where the environment was a little sparse.

The most obvious code smell is extensive use of backticks rather than $(). Some of this I've refactored over time, but looking at the code now, not all that much.
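
The difference is most visible when substitutions nest. A small example, with a path chosen purely for illustration:

```shell
#!/bin/sh
path=/var/sadm/pkg/TRIBexample

# Old Bourne style: the inner backticks have to be escaped.
parent_old=`basename \`dirname "$path"\``

# Modern style: $() nests cleanly and quotes behave predictably.
parent_new=$(basename "$(dirname "$path")")

echo "$parent_old $parent_new"
```

Both assignments give the same answer; the second is just far easier to read and to get right.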

One push for this was adding ShellCheck to Tribblix (it was a little bit of a game getting Haskell and Cabal to play nice, but I digress).

Running ShellCheck across all my scripts gave it a lot to complain about. Some of the complaints are justified, although many aren't (it's very enthusiastic about quoting everything in sight, even when that would be completely wrong).

But generally it's encouraged me to clean the scripts up. It's even managed to find a bug, although reading through the code it flagged as rubbish has turned up a few more by inspection.

The other push here is to speed things up. Tribblix is often fairly quick in comparison to other systems, but it's not quick enough for me. But more of that story later.

Thursday, August 24, 2023

Speed up zone installation with this one weird trick

Sadly, the trick described below won't work in current releases of Solaris, or any of the illumos distributions. But back in the day, it was pretty helpful.

In Solaris 10, we had sparse root zones - which shared /usr with the global zone, which not only saved space because you didn't need a copy of all the files, but creating them was much quicker because you didn't need to take the time to copy all the files.

Zone installation for sparse root zones was typically about 3 minutes for us - this was 15 years ago, so mostly spinning rust and machines a bit slower than we're used to today.

That 3 minutes sounds quick, but I'm an impatient soul, and so were my users. Could I do better?

Actually, yes, quite a bit. What's contributing to that 3 minutes? There's a bit of adding files (the /etc and /var filesystems are not shared, for reasons that should be fairly obvious). And you need to copy the packaging metadata. But that's just a few files.

Most of the time was taken up by building the contents file, which simply lists all the installed files and what package they're in. It loops over all the packages, merging all the files in that package into the contents file, which thus grows every time you process a package.

The trick was to persuade it to process the packages in an optimal order. You want to do all the little packages first, so that the contents file stays small as long as possible.
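
To see why the order matters, here's a toy cost model. It assumes (my simplification, not a measurement) that merging a package costs roughly the size of the contents file after the merge, so the total cost is the sum of the running totals.

```shell
#!/bin/sh
# Toy model: arguments are package sizes (in files) in processing order.
# Each merge rewrites the whole contents file, so add up the running totals.
merge_cost() {
    total=0
    cost=0
    for n in "$@"; do
        total=$((total + n))
        cost=$((cost + total))
    done
    echo "$cost"
}

merge_cost 1000 100 10    # largest first:  prints 3210
merge_cost 10 100 1000    # smallest first: prints 1230
```

Same three packages, same final contents file, but far less rewriting when the big package goes last.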

And the way to do that was to recreate the /var/sadm/pkg directory. It was obvious that it was simply reading the directory and processing packages in the order that it found them. And, on ufs, this is the order that the packages were added to the directory. So what I did was move the packages to one side, create an empty /var/sadm/pkg, and move the package directories back in size order (which you can get fairly easily by looking at the size of the spooled pkgmap files).

This doesn't quite mean that the packages get processed in size order, as it does the install in dependency order, but as long as dependencies are specified it otherwise does them in size order.
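
A reconstruction of the reordering might look like the sketch below. It's written from the description above rather than from the original script, and it assumes each spooled pkgmap lives at save/pspool/<pkg>/pkgmap under the package directory. Check that against a real system before trying anything like it.

```shell
#!/bin/sh
# List package names in ascending order of spooled pkgmap size.
# The pkgmap location is an assumption about the directory layout.
pkgs_by_size() {
    pkgdir="$1"
    for d in "$pkgdir"/*/; do
        name=$(basename "$d")
        map="$d/save/pspool/$name/pkgmap"
        [ -f "$map" ] || continue
        echo "$(wc -c < "$map" | tr -d ' ') $name"
    done | sort -n | awk '{print $2}'
}

# Move the package directories aside and back again, smallest first, so a
# filesystem that lists entries in creation order returns them by size.
reorder_pkgdir() {
    pkgdir="$1"
    mv "$pkgdir" "$pkgdir.old" && mkdir "$pkgdir" || return 1
    pkgs_by_size "$pkgdir.old" | while read -r name; do
        mv "$pkgdir.old/$name" "$pkgdir/$name"
    done
    # anything without a pkgmap goes back last
    for rest in "$pkgdir.old"/*; do
        [ -e "$rest" ] && mv "$rest" "$pkgdir/"
    done
    rmdir "$pkgdir.old"
}

# Usage, on a ufs system of that era: reorder_pkgdir /var/sadm/pkg
```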

The results were quite dramatic - with no other changes, this took zone install times from the original 3 minutes to 1 minute. Much happier administrators and users.

This trick doesn't work at all on zfs, sadly, because zfs doesn't simply create a linear list of directory entries and put new ones on the end.

And all this is irrelevant for anything using IPS packaging, which doesn't do sparse-root zones anyway, and is a completely different implementation.

And even in Tribblix, which does have sparse-root zones like Solaris 10 did, and uses SVR4 packaging, the implementation is orders of magnitude quicker because I just create the contents file in a single pass, so a sparse zone in Tribblix can install in a second or so.

Wednesday, August 23, 2023

Remnants of closed code in illumos

One of the annoying issues with illumos has been the presence of a body of closed binaries - things that, for some reason or other, were never able to be open sourced as part of OpenSolaris.

Generally the illumos project has had some success in replacing the closed pieces, but what's left isn't entirely zero. It took me a little while to work out what's still left, but as of today the list is:

etc/security/tsol/label_encodings.gfi.single
etc/security/tsol/label_encodings.example
etc/security/tsol/label_encodings.gfi.multi
etc/security/tsol/label_encodings
etc/security/tsol/label_encodings.multi
etc/security/tsol/label_encodings.single
usr/sbin/chk_encodings
usr/xpg4/bin/more
usr/lib/raidcfg/mpt.so.1
usr/lib/raidcfg/amd64/mpt.so.1
usr/lib/iconv/646da.8859.t
usr/lib/iconv/8859.646it.t
usr/lib/iconv/8859.646es.t
usr/lib/iconv/8859.646fr.t
usr/lib/iconv/646en.8859.t
usr/lib/iconv/646de.8859.t
usr/lib/iconv/646it.8859.t
usr/lib/iconv/8859.646en.t
usr/lib/iconv/8859.646de.t
usr/lib/iconv/iconv_data
usr/lib/iconv/646fr.8859.t
usr/lib/iconv/8859.646da.t
usr/lib/iconv/646sv.8859.t
usr/lib/iconv/8859.646.t
usr/lib/iconv/646es.8859.t
usr/lib/iconv/8859.646sv.t
usr/lib/fwflash/verify/ses-SUN.so
usr/lib/fwflash/verify/sgen-SUN.so
usr/lib/fwflash/verify/sgen-LSILOGIC.so
usr/lib/fwflash/verify/ses-LSILOGIC.so
usr/lib/labeld
usr/lib/locale/POSIX
usr/lib/inet/certlocal
usr/lib/inet/certrldb
usr/lib/inet/amd64/in.iked
usr/lib/inet/certdb
usr/lib/mdb/kvm/amd64/mpt.so
usr/lib/libike.so.1
usr/lib/amd64/libike.so.1
usr/bin/pax
platform/i86pc/kernel/cpu/amd64/cpu_ms.GenuineIntel.6.46
platform/i86pc/kernel/cpu/amd64/cpu_ms.GenuineIntel.6.47
lib/svc/manifest/network/ipsec/ike.xml
kernel/kmdb/amd64/mpt
kernel/misc/scsi_vhci/amd64/scsi_vhci_f_asym_lsi
kernel/misc/scsi_vhci/amd64/scsi_vhci_f_asym_emc
kernel/misc/scsi_vhci/amd64/scsi_vhci_f_sym_emc
kernel/strmod/amd64/sdpib
kernel/drv/amd64/adpu320
kernel/drv/amd64/atiatom
kernel/drv/amd64/usbser_edge
kernel/drv/amd64/sdpib
kernel/drv/amd64/bcm_sata
kernel/drv/amd64/glm
kernel/drv/amd64/intel_nhmex
kernel/drv/amd64/lsimega
kernel/drv/amd64/marvell88sx
kernel/drv/amd64/ixgb
kernel/drv/amd64/acpi_toshiba
kernel/drv/amd64/mpt
kernel/drv/adpu320.conf
kernel/drv/usbser_edge.conf
kernel/drv/mpt.conf
kernel/drv/intel_nhmex.conf
kernel/drv/sdpib.conf
kernel/drv/lsimega.conf
kernel/drv/glm.conf

Actually, this isn't much. In terms of categories:

Trusted, which includes those label_encodings, and labeld. Seriously, nobody can realistically run trusted on illumos (I have, it's ... interesting). So these don't really matter.

The iconv files actually go with the closed iconv binary, which we replaced ages ago, and our copy doesn't and can't use those files. We should simply drop those (they will be removed in Tribblix next time around).

There's a set of files connected to IKE and IPSec. We should replace those, although I suspect that modern alternatives for remote access will start to obsolete all this over time.

The scsi_vhci files are to get multipathing correctly set up on some legacy SAN systems. If you have to use such a SAN, then you need them. If not, then you're in the clear.

There are a number of drivers. These are mostly somewhat aged. The sdp stuff is being removed anyway as part of IPD29, so that'll soon be gone. Chances are that very few people will need most of these drivers, although mpt was fairly widely used (there was an open mpt replacement in the works). Eventually the need for the drivers will dwindle to zero as systems with them in no longer exist (and, by the same token, we wouldn't need them for something like an aarch64 port).

Which just leaves 2 commands.

Realistically, the XPG4 more could be replaced by less. The standard was based on the behaviour of less, after all. I'm tempted to simply delete /usr/xpg4/bin/more and make it a link to less and have done with it.

As for pax, it's required by POSIX, but to be honest I've never used it, haven't seen anywhere that uses it, and read support is already present in things like libarchive and gtar. The heirloom pax is probably more than good enough.

In summary, illumos isn't quite fully open source, but it's pretty close and for almost all cases we could put together a fully functional open subset that'll work just fine.

Wednesday, August 09, 2023

Static Site Generators

The current Tribblix website is a bit of a hack. Technically it's using a static site generator - a simple home-grown script that constructs pages from a bit of content and boilerplate - but I wanted to be able to go a bit further.

I looked at a few options - and there are really a huge number of them - such as Hugo and Zola. (Both are packaged for Tribblix now, by the way.)

In the end I settled on nanoc. That's packaged too (and I finally got around to having a very simple - rather naive - way of packaging gems).

Why nanoc, though? In this case it was really because it could take the html page fragments I already had and create the site from those, and after tweaking it slightly I end up with exactly the same html output as before.

Other options might be better if I was starting from scratch, but it would have been much harder to retain the fidelity of the existing site.

One advantage of the new system is that I can put the site under proper source control, so the repo is here.

There's still a lot of work to be done on filling out the content, but it should be easier to evolve the Tribblix website in future.

Thursday, July 13, 2023

Zones, way back when

The original big ticket feature in Solaris 10 was Zones, a simple virtualization technology that allowed a set of processes to be put aside in a separate namespace and be under the illusion that this was a separate computer system, all under a single shared kernel.

As a result of this sleight of hand, you could connect to a zone using ssh (or, remember this was way back, telnet or rsh), and from the application level you really were in a separate system - with your own file system and network namespaces. It was like magic.

Of the features in Solaris 10, Zones and DTrace were present early in the beta cycle, while SMF just made it into the last couple of beta builds, and ZFS wasn't actually available to customers until well after the first Solaris 10 release.

I ended up using zones in production quite accidentally. In the Solaris 10 Platinum Beta, we were testing the new features, just giving them a good beating, when one of our webservers (it was something like a Netra X1) died. Sure, we could have got it repaired, or reconfigured another server. But as an experiment, I simply fired up a zone on one of my beta systems, gave it the IP address of the failed server, installed apache, copied over the website, and we were back in service in about 5 minutes.

The Zones framework turns out to be incredibly flexible and powerful. I suspect most don't realize just what it's actually capable of, as Sun only gave you a canned product in two variations - whole-root and sparse-root zones. Later you saw glimpses of the power available with the first incarnation of LX zones (or SCLA - Solaris Containers for Linux Applications) and then the Solaris 8 and Solaris 9 containers, which allowed a different set of applications to run inside a zone.

Things actually became more limited in OpenSolaris and its derivatives such as Solaris 11; not only was LX removed, but so were sparse-root zones, and the diversity of potential zone types dwindled.

In illumos, some of the distributions have pushed Zones a bit further. Tribblix brought back sparse root zones, and introduced the alien brand - essentially a way to run any illumos OS or application in a zone. OmniOS has brought back LX, and it's reasonably current (in terms of keeping up with changes in the Linux world). SmartOS ran KVM in Zones, allowing double-hulled virtualization. And we now have bhyve as a fully supported offering for any illumos distribution, usually embedded in a Zone.

Using a sparse-root zone is incredibly efficient. By sharing the main operating system files (mostly /lib and /usr, but can be others) you can save huge amounts of disk space - you only have to have one copy so that's a saving of anything from a couple of hundred megabytes to a couple of gigabytes of storage per zone. It gets better, because the read-only segments of any binaries and shared libraries are shared between zones, which dramatically reduces the additional memory footprint of each zone. Further on from that, because Solaris has this trick whereby any shared object used more than 8 times (or something like that) is kept resident in memory, all the common applications are always in memory and start incredibly quickly.

One of the things I did was use sparse-root zones and shared filesystems for a development -> test -> production setup. Basically, you create 3 zones, sparse-root ensures they're identical, and 3 filesystems - one each for development, test, and production. You share the development filesystem read-only into the test zone, so deployment from development to test is a straight copy. Likewise test to production.
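As a rough sketch of the read-only sharing in that setup, the function below prints the zonecfg commands that loop a directory from the global zone back into a zone read-only. The function name and the paths and zone name in the usage comment are hypothetical, not the ones I actually used; the fs resource itself is the ordinary zonecfg lofs form.

```shell
#!/bin/sh
# Print zonecfg commands that mount a global-zone directory read-only
# inside a zone via lofs. Paths are supplied by the caller; the ones in
# the example below are hypothetical placeholders.
# usage: readonly_share_cfg <dir-in-global-zone> <dir-inside-zone>
readonly_share_cfg() {
    cat <<EOF
add fs
set dir=$2
set special=$1
set type=lofs
add options [ro]
end
EOF
}

# Example: expose the development area read-only to the test zone
#   readonly_share_cfg /export/dev /deploy/dev > /tmp/test-share.cfg
#   zonecfg -z test -f /tmp/test-share.cfg
```

With the development filesystem visible read-only inside the test zone, the deployment step is just a copy run from inside that zone.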

One of the weaknesses of the way that zones were managed (distinct from the underlying technology framework) is that it was based around packaging. In Solaris 10, packaging and packages knew about zones, and the details about what files and packages ended up in a zone were embedded in the package metadata. Not only is this complex, it's also very rigid - you can't evolve the system without changing the packaging system and modifying all the packages. Sadly, IPS carried forward the same mistake. (In Tribblix, packaging knows nothing about zones whatsoever, but my zones understand packaging and can do the right thing with it - not only with much more flexibility but many times quicker.)

Later on in the Solaris 10 timeframe we got ZFS, which allowed you to do interesting things around sharing data and quickly creating copies of data for zones, allowing you to extend the virtual capabilities of zones from cpu and memory to storage. And the key missing piece, virtualized networking, never made it to Solaris 10 at all, but had to wait for crossbow to arrive in OpenSolaris.

Monday, May 08, 2023

Maintaining old software with no sign of retirement

There's a lot of really old software out there. Some of it has simply been abandoned; others have been replaced by new versions. But old software never really goes away, and we end up maintaining it.

This is especially tricky when old software depends on other old software, and we have to support the entire dependency tree.

There's always python 2 and python 3. Some old software may never be fixed; some current software has consciously decided to stick to python 2. Distributions will be shipping python 2 for a long time yet.

Then there's PCRE and PCRE2. Some things have been updated; others haven't. Generally for this I'll keep updating, and eventually upstream might get around to migrating. But again I'll have to ship both for a while.

And then there's gtk2 and gtk3. (I find it ironic that the gimp itself is still using gtk2.) There's no end in sight of the need to ship both.

Some libraries have been deprecated entirely. The old libXp (the X printing library) is long gone. There were a couple of things built against it in Tribblix. I've just rebuilt chimera (a really old Xaw web browser if your memory doesn't go that far back) which was one consumer and now isn't; the other one was Motif (there's a convenient build flag --disable-printing to disable libXp support, which entertainingly breaks the build someplace else which I ended up having to fix).

Another example: libpng has gone through several different revisions, each slightly incompatible, and you have to be sure to run with the same version you built against. At least you can ship all the different versions, as they have the version in the names. Mind you, linking against 2 different versions of libpng at the same time (for example, if a dependency pulls in a different version of libpng) is a bad thing, so I did have to rebuild a number of applications to avoid that. I ship the old libpng versions in a separate compat package; I think chimera was the only consumer, but I updated that to use a more current libpng.

A slightly different problem is the use of newer toolchains. Compilers are getting stricter over time, so old unmaintained software needs patches to even compile.

Don't even get me started on openssl.

Sunday, May 07, 2023

Upgrading MATE on Tribblix

I spent a little time yesterday updating MATE on Tribblix, to version 1.26.

This was supposed to be part of the "next" release, but we had to make an out of sequence release for an illumos security issue, so everything gets pushed back a bit.

Updating MATE is actually fairly easy, though. The components in MATE are largely decoupled, so can be updated independently of each other. (And there isn't really a MATE framework everything has to subscribe to, so the applications can be used outside MATE without any issues.)

There's a bit of tidying up and polish that helps. For example, I delete static archives and the harmful libtool archive files. Not only does this save space, it helps maintainability down the line.
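A minimal sketch of that cleanup, assuming the build has been installed into a staging directory before packaging (the function name and the staging path are hypothetical, not part of my actual build scripts):

```shell
#!/bin/sh
# Remove static archives (*.a) and libtool archives (*.la) from a staged
# install tree. Shared objects and everything else are left alone.
# usage: prune_archives <staging-dir>
prune_archives() {
    find "$1" -type f \( -name '*.a' -o -name '*.la' \) -exec rm -f {} +
}

# Example, with a hypothetical staging area:
#   prune_archives /tmp/stage/mate-panel
```

Matching on -type f means symlinks and directories are never touched, so only the regular archive files themselves are removed.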

Builds have a habit of picking up dependencies from the build system. Sometimes you can control this with judicious --enable-foo or --disable-foo flags, sometimes you just have to make sure that the package you don't want pulled in isn't installed. The reverse is also true - if you want a feature to be enabled, you have to make sure the dependencies are installed first and the feature will usually get enabled automatically.

That's not always true. For example, you have to explicitly tell it you have OSS for audio; it doesn't work this out on its own.

I took the opportunity to make everything 64-bit. Ultimately I want to get to 64-bit only. This involves a bit of working backwards - you have to make all consumers of a library 64-bit only first.

A couple of components are held downrev. The calculator now wants to pull in mpc and mpfr, which I don't package. (They're used by gcc, but I drop a copy of mpc and mpfr into the build for gcc to find rather than packaging them separately the way that most of the other illumos distributions do.) And pluma wants gtksourceview-4 which I don't have yet. (This is related to the lack of tight coupling I mentioned earlier - there really isn't any problem having the different pieces that make up MATE at different revisions.)

You stumble across bugs along the way. For example, mate-control-center actually needs GLib 2.66 or later, which I don't have yet (there's another whole set of issues behind that), but it doesn't actually check for the right version. Fortunately the requirement is fairly localized and easy to patch out.

That done, on to another set of updates...

Wednesday, March 22, 2023

SPARC Tribblix m26 - what's in a number?

I've just released Tribblix m26 for SPARC.

The release history on SPARC looks a little odd - m20, m20.6, m22, m25.1, and now m26. Do these release versions mean anything?

Up to and including m25.1, the illumos commit that the SPARC version was built from matched the corresponding x86 release. This is one reason there might be a gap in the release train - that commit might not build or work on SPARC.

As of m26, the version numbers start to diverge between SPARC and x86. In terms of illumos-gate, this release is closer to m25.2, but the added packages are generally fairly current, closer to m29. So it's a bit of a hybrid.

But the real reason this is a full release rather than an m25 update is to establish a new baseline, which allows me to establish compatibility guarantees and roll over versions of key components. In this case it allows me to upgrade perl.

In the future, the x86 and SPARC releases are likely to diverge further. Clearly SPARC can't track the x86 releases perfectly, as SPARC support is being removed from the mainline source following IPD 19, and many of the recent changes in illumos simply aren't relevant to SPARC anyway. So future SPARC releases are likely to simply increment independently.

Sunday, March 12, 2023

How I build the Tribblix AMIs

I run Tribblix on AWS, and make some AMIs available. They're only available in London (eu-west-2) by default, because that's the only place where I use them, and it costs money to have them available in other regions. If you want to run them elsewhere, you can copy the AMI.

It's not actually that difficult to create the AMIs, once you've got the hang of it. Certainly some of the instructions you might find can seem a little daunting. So here's how I do it. Some of the details here are very specific to my own workflow, but the overall principles are fairly generic. The same method would work for any of the illumos distributions, and you could customize the install however you wish.

The procedure below assumes you're running Tribblix m29 and have bhyve installed.

The general process is to boot and install an instance into bhyve, then boot that and clean it up, save that disk as an image, upload to S3, and register an AMI from that image.

You need to use the minimal ISO (I actually use a custom, even more minimal ISO, but that's just a convenience for myself). Just launch that as root:

zap create-zone -t bhyve -z bhyve1 \
-x 192.168.0.236  \
-I /var/tmp/tribblix-0m29-minimal.iso \
-V 8G

Note that this creates an 8G zvol, which is the starting size of the AMI.

Then run socat as root to give you a VNC socket to talk to

socat TCP-LISTEN:5905,reuseaddr,fork UNIX-CONNECT:/export/zones/bhyve1/root/tmp/vm.vnc

and as yourself, run the vnc viewer

vncviewer :5

Once it's finished booting, log in as root and install with the ec2-baseline overlay which is what makes sure it's got the pieces necessary to work on EC2.

./live_install.sh -G c1t0d0 ec2-baseline

Back as root on the host, ^C to get out of socat, remove the ISO image and reboot, so it will boot from the newly installed image.

zap remove-cd -z bhyve1 -r

Restart socat and vncviewer, and log in to the guest again.

What I then do is to remove any configuration or other data from the guest that we don't want in the final system. (This is similar to the old sys-unconfig that those of us used to Solaris will be familiar with.)

zap unconfigure -a

I usually also ensure that a functional resolv.conf exists, just in case dhcp doesn't create it correctly.

echo "nameserver    8.8.8.8" > /etc/resolv.conf

Back on the host, shut the instance down by shutting down the bhyve zone it's running in:

zoneadm -z bhyve1 halt

Now the zfs volume you created contains a suitable image. All you have to do is get it to AWS. First copy the image into a plain file:

dd if=/dev/zvol/rdsk/rpool/bhyve1_bhvol0 of=/var/tmp/tribblix-m29.img bs=1048576
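Before going any further it may be worth checking that the copy is complete. This is only a sketch, and it assumes the image should be exactly the size of the zvol (8G as created above); the function name is hypothetical.

```shell
#!/bin/sh
# Succeed only if a file is exactly the given number of GiB in size.
# usage: image_is_gib <file> <size-in-GiB>
image_is_gib() {
    expected=$(( $2 * 1024 * 1024 * 1024 ))
    # The arithmetic expansion strips any padding wc puts around the count
    actual=$(( $(wc -c < "$1") ))
    [ "$actual" -eq "$expected" ]
}

# Example:
#   image_is_gib /var/tmp/tribblix-m29.img 8 || echo "image is the wrong size"
```

A short copy is easy to miss, and the zvol is the only other copy of the installed system.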

At this point you don't need the zone any more so you can get rid of it:

zap destroy-zone -z bhyve1

The raw image isn't in a form you can use, and needs converting. There's a useful tool - the VMDK stream converter (there's also a download here) - just untar it and run it on the image:

python2 ./VMDK-stream-converter-0.2/VMDKstream.py /var/tmp/tribblix-m29.img /var/tmp/tribblix-m29.vmdk

Now copy that vmdk file (which is also a lot smaller than the raw img file) up to S3. In the following you need to change the bucket name from mybucket to one of your own:

aws s3 cp --cli-connect-timeout 0 --cli-read-timeout 0 \
/var/tmp/tribblix-m29.vmdk s3://mybucket/tribblix-m29.vmdk

Now you can import that image into a snapshot:

aws ec2 import-snapshot --description "Tribblix m29" \
--disk-container file://m29-import.json

where the file m29-import.json looks like this:

{
    "Description": "Tribblix m29 VMDK",
    "Format": "vmdk",
    "UserBucket": {
        "S3Bucket": "mybucket",
        "S3Key": "tribblix-m29.vmdk"
    }
}
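If you do this for more than one release, the file can be generated rather than edited by hand. A small sketch, where the function name and its arguments are my own invention and the layout simply mirrors the file above:

```shell
#!/bin/sh
# Print a disk-container description for "aws ec2 import-snapshot".
# usage: import_container_json <release> <bucket> <key>
import_container_json() {
    cat <<EOF
{
    "Description": "Tribblix $1 VMDK",
    "Format": "vmdk",
    "UserBucket": {
        "S3Bucket": "$2",
        "S3Key": "$3"
    }
}
EOF
}

# Example:
#   import_container_json m29 mybucket tribblix-m29.vmdk > m29-import.json
```

This does no quoting of its arguments, so it's only suitable for simple bucket and key names.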

The command will give you an import task id that looks like import-snap-081c7e42756d7456b, and you can follow the progress of the import with

aws ec2 describe-import-snapshot-tasks --import-task-ids import-snap-081c7e42756d7456b
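Rather than re-running that by hand, the polling could be scripted. The following is a sketch rather than something I run myself: the function name and the 30 second default interval are arbitrary, and it assumes the task status reads "active" while the import is in progress and "completed" at the end.

```shell
#!/bin/sh
# Poll an import-snapshot task until it completes, then print the id of
# the resulting snapshot. Any status other than "active" or "completed"
# is treated as a failure.
# usage: wait_for_snapshot <import-task-id> [poll-interval-seconds]
wait_for_snapshot() {
    while :; do
        status=$(aws ec2 describe-import-snapshot-tasks \
            --import-task-ids "$1" \
            --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.Status' \
            --output text) || return 1
        case "$status" in
            completed) break ;;
            active)    sleep "${2:-30}" ;;
            *)         echo "import $1 ended in state: $status" >&2
                       return 1 ;;
        esac
    done
    aws ec2 describe-import-snapshot-tasks \
        --import-task-ids "$1" \
        --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.SnapshotId' \
        --output text
}

# Example:
#   wait_for_snapshot import-snap-081c7e42756d7456b
```

The --query and --output options are standard aws CLI features that pick a single field out of the JSON response, which saves parsing it in the shell.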

When that's finished it will give you the snapshot id itself, such as snap-0e0a87acc60de5394. From that you can register an AMI, with

aws ec2 register-image --cli-input-json file://m29-ami.json

where the m29-ami.json file looks like:

{
    "Architecture": "x86_64",
    "Description": "Tribblix, the retro illumos distribution, version m29",
    "EnaSupport": false,
    "Name": "Tribblix-m29",
    "RootDeviceName": "/dev/xvda",
    "BlockDeviceMappings": [
        {
            "DeviceName": "/dev/xvda",
            "Ebs": {
                "SnapshotId": "snap-0e0a87acc60de5394"
            }
        }
    ],
    "VirtualizationType": "hvm",
    "BootMode": "legacy-bios"
}

If you want to create a Nitro-enabled AMI, change "EnaSupport" from "false" to "true", and "BootMode" from "legacy-bios" to "uefi".
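Those two changes are simple enough to script. A minimal sketch, assuming the file is laid out exactly as shown above (the function name and the output file name are hypothetical):

```shell
#!/bin/sh
# Rewrite an AMI registration file (read on stdin) for Nitro instances:
# enable ENA and switch the boot mode from legacy BIOS to UEFI.
# This is plain text substitution, so it relies on the spacing used in
# the file above.
nitro_variant() {
    sed -e 's/"EnaSupport": false/"EnaSupport": true/' \
        -e 's/"BootMode": "legacy-bios"/"BootMode": "uefi"/'
}

# Example:
#   nitro_variant < m29-ami.json > m29-ami-nitro.json
```

If you register both variants you will also need to give the second one a different "Name".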


Saturday, March 11, 2023

What, no fsck?

There was a huge amount of resistance early on to the fact that zfs didn't have an fsck. Or, rather, a separate fsck.

I recall being in Sun presentations introducing zfs and question after question was about how to repair zfs when it got corrupted.

People were so used to shoddy file systems, ones so badly implemented that a separate utility was needed to repair errors caused by fundamental design and implementation flaws in the file system itself, that the idea that the file system driver ought to take responsibility for managing the state of the file system was totally alien.

If you think about ufs, for example, there were a number of known failure modes, and what you did was take the file system offline, run the checker against it, and it would detect the known errors and modify the bits on disk in a way that would hopefully correct the problem. (In reality, if you needed it, there was a decent chance it wouldn't work.) Doing it this way was simple laziness - it would be far better to just fix ufs so it wouldn't corrupt the data in the first place (ufs logging went a long way towards this, eventually). And you were only really protecting against known errors, where you understood exactly the sequence of events that would cause the file system to end up in a corrupted state, so that random corruption was either undetectable or unfixable, or both.

The way zfs thought about this was very different. To start with, eliminate all known behaviour that can cause corruption. The underlying copy on write design goes a long way, and updates are transactional so either complete or not. If you find a new failure mode, fix that in the file system proper. And then, correction is built in rather than separate, which means that it doesn't need manual intervention by an administrator, and all repairs can be done without taking the system offline.

Thankfully we've moved on, and I haven't heard this particular criticism of zfs for a while.