Sunday, March 15, 2026

Observations on Tribblix m39

I've just announced the latest release of Tribblix - m39, now available for download and upgrades.

This follows the "when I feel like it" release model. Which means that (a) I have some potentially breaking changes that need a release boundary to navigate, and (b) enough time has passed that I feel the need to update illumos to pick up any recent changes. There's no hard timescale, but for regular releases I would have thought 3-6 months would be about the right ballpark.

What's new this time around? There's the usual dry list of updates, which could be much longer (anything updated multiple times only gets listed the once, and all the python and perl module updates are missing entirely).

Let's pick some of those updates apart.

The libtiff update was a large one. Not because the update itself was difficult, but because the shared library SONAME got updated. For the time being (and for an indeterminate length of time) I'll ship both the old and new shared libraries, but I've rebuilt (almost) everything against the new version. That's one reason for this being done on an upgrade boundary - to force the breaking change and all the updates associated with it to take place at once.

There's a similar, but much smaller, story associated with OpenEXR.

For OpenSSL, there are a couple of changes. The first is that it's bumped from the 3.0.x series to the 3.5.x series. It's all binary compatible, so doesn't need the world to be rebuilt, but again the reason for pushing it on a release boundary is to ensure that nothing subsequently built against the new version is installed on a system with the old version. The second change is that the surfaced API is now 3.0, rather than the prior 1.1.1.

It wasn't strictly necessary to have an OpenSSH update tied to a release, but that got rolled in. My very first test triggered the post-quantum warning, because one of my build servers is deliberately running a much older version of Tribblix with a much older SSH server.

The underlying illumos gate build also has additional patches for NFSv4.1 - specifically the backchannel fixes in 16390. Hopefully this will make it into regular illumos soon, but it seemed like an excellent feature to get baked in.

I also patch illumos, as I did last time, to allow a larger range of pids, remove the y2038 clamp in ZFS, and tune the network stack for the 21st century.

There's one Tribblix feature that would be worth talking about in this release - appstack zones, which allow you to build a zone running an application, and do basic configuration of it, all with a simple one-line command. I'll talk about that separately, as it's almost but not quite ready for wider adoption.

Sunday, February 15, 2026

Enabling IPv6 on EC2

I've actually been using IPv6 on and off since the late 1990s - there was an addon for Solaris 2.6 that we installed on a bunch of test machines. It worked great, but wasn't something we ever did properly in production because at the time none of our customers had IPv6.

I've wanted to use IPv6 more, but my home ISP has never had it. Until recently, when I noticed my main machine (running Tribblix, naturally) has an IPv6 address and was using it. At that point, investigating and testing IPv6 becomes a lot more interesting.

Some of the Tribblix web servers (specifically the pkgs and iso download servers) are hosted in the cloud. I do this for testing and dogfooding, so I know that Tribblix works on those cloud environments.

The iso machine is on Digital Ocean; sadly Digital Ocean don't support IPv6 for custom images, for no good reason that I can see. But the pkgs server is on AWS, and they do support IPv6, but it wasn't enabled.

My first test here was actually to see if I could launch an EC2 instance with an IPv6 address, using the aws cli:

aws ec2 run-instances --ipv6-address-count 1 ...

and this failed, because IPv6 wasn't enabled where I was trying to launch it. So I needed to do a few steps to make it work, and this is largely to remind me for when I need to do it again.

First, you need to associate an IPv6 block of addresses with the VPC you're using. Go to the VPC in the console, find something that says Edit CIDRs and, in that, Add a new IPv6 CIDR. Just choose an Amazon provided one and you're good. That should give you a /56 block.

You then need to go into each of the subnets in that VPC and associate an IPv6 CIDR block with it. Find the Edit IPv6 CIDRs button, go into that, and Add IPv6 CIDR. You can add the entire block to one subnet if you want, but normally you would break it down. Under the allocation are some arrows - the up and down arrows change the size of the block, I just went down one to a /60. By default, the starting address of the subnet block is the same as the VPC block - for additional subnets you'll need to use the little right arrow to change the start address, as you can't associate overlapping blocks.
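To see what breaking the block down looks like, here's a minimal sketch using Python's standard ipaddress module. The /56 is a made-up example from the documentation prefix (AWS assigns you a real one), and carve_subnets is just a name I've picked for illustration.

```python
import ipaddress

def carve_subnets(vpc_cidr, new_prefix):
    """Split a VPC-sized block into equal, non-overlapping subnet blocks."""
    return list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix))

# Hypothetical /56, using the 2001:db8::/32 documentation prefix.
blocks = carve_subnets("2001:db8:1234:5600::/56", 60)

print(len(blocks), "blocks of size /60")
# The first block starts at the same address as the VPC block; each later
# block has a different start address, which is what the right arrow changes.
for block in blocks[:3]:
    print(block)
```

Each of those blocks can go to a different subnet without overlapping any of the others.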

I also went into the subnet settings to Enable auto-assign IPv6 address.

What you then need to do is go to the route table (either from the subnet or the VPC) and add a route for ::/0 with the target being the internet gateway - there should only be the one gateway, the same as used for IPv4. If you don't add the route to the route table you'll get IPv6 addresses but you won't be able to talk to anything outside the VPC over IPv6.

With that, launch a new instance and you'll get IPv6 working nicely.
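I did all of this in the console, but the same steps could probably be scripted. This is an untested sketch, assuming a boto3 EC2 client is passed in as ec2; the function names and all the IDs are hypothetical placeholders, and only the client methods themselves come from the EC2 API.

```python
def request_vpc_ipv6_block(ec2, vpc_id):
    """Step 1: ask for an Amazon-provided IPv6 block for the VPC."""
    return ec2.associate_vpc_cidr_block(
        VpcId=vpc_id,
        AmazonProvidedIpv6CidrBlock=True,
    )

def enable_subnet_ipv6(ec2, subnet_id, subnet_cidr, route_table_id, gateway_id):
    """Steps 2-4: give the subnet a block, auto-assign addresses, add the route."""
    # subnet_cidr must be carved from the block the VPC was given in step 1.
    ec2.associate_subnet_cidr_block(SubnetId=subnet_id, Ipv6CidrBlock=subnet_cidr)
    ec2.modify_subnet_attribute(
        SubnetId=subnet_id,
        AssignIpv6AddressOnCreation={"Value": True},
    )
    # Without this route you get addresses but can't reach outside the VPC.
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationIpv6CidrBlock="::/0",
        GatewayId=gateway_id,
    )

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2")
#   request_vpc_ipv6_block(ec2, "vpc-EXAMPLE")
#   enable_subnet_ipv6(ec2, "subnet-EXAMPLE", "2001:db8:1234:5600::/60",
#                      "rtb-EXAMPLE", "igw-EXAMPLE")
```

The two functions are separate because you need to look up the block the VPC was actually given before you can choose the subnet's CIDR.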

Nothing to do with EC2, but the one other thing I needed to do was add an additional IPv6 listen directive to each server in my nginx config, as nginx will only listen on IPv4 by default.

Thursday, June 05, 2025

Is the Information Security industry succeeding?

Yesterday I had a trip up to London and had a wander round Infosecurity Europe. It was an interesting day, lots of things to see, many interesting conversations.

The show itself is huge. We've clearly come out of the doldrums of the last few years where shows had become tiny. And this was a dedicated infosec event, not just one part of a larger IT event.

Going by the size of the event, the number of exhibitors, the number of attendees, the size and extravagance of the displays, I think it's fair to say that Information Security as a business sector is doing very well. There's clearly a huge amount of vendor cash to splash around, and a confidence that customers have plenty of cash to buy the products on offer.

But is making money the correct definition of success here?

Most of the industry has a focus on detection and remediation. The pitch is that your systems are horrendously insecure and you need to give vendor X lots of money so they can detect a failure and help get your business back on its feet.

There was very little, in fact almost nothing, aimed at actually building more secure systems. (Even training and awareness is really nothing more than glossing over the cracks.) Maybe the closest is things aimed at the supply chain, but even that's basically detection of someone else's vulnerabilities.

So, in terms of actually building better systems, the Infosecurity industry is failing. It's not even addressing the problem.

(I would say that one definition of success for an information security company would be for it to do such a good job it's no longer needed. Clearly that's not going to be in many business plans.)

Furthermore, a string of high-profile hacks and breaches clearly indicates that the industry is failing to keep businesses secure.

Tuesday, May 13, 2025

Random thoughts on a Next Generation Tribblix

I have a little private project called xTribblix.

What's the x stand for? eXtreme? eXtraordinary? eXperimental? neXt generation?

Honestly, I don't know. It doesn't matter, it's just a little bucket I can drop things in to. But essentially, a set of experiments around changing Tribblix that allows me to do interesting things. The aim would be that, if successful, they get folded back into regular Tribblix; if unsuccessful then it's a learning experience.

It's just the logical continuation of the drive I've always had to make Tribblix faster, leaner, cleaner, fitter, easier, more secure. While retaining compatibility and functionality.

There are a few bits of illumos that really ought to be removed. Printing is a prime example - CUPS is a better, more modern implementation, maintained, familiar to everyone, what most Solaris people wanted anyway, and to be honest printing *isn't* an illumos core competency, so it's an ideal target to be outsourced. That's a clear example with a superior replacement already available; most subsystems might have someone crawl out of the woodwork who's inconvenienced by their removal.

So far, I've simply looked at things and decided to implement many of the simple ones for the next release(s) without the need for a separate experimental release. This isn't new, it's been going on for many releases already, and so far I've managed not to break anything that matters.

Some of the things done already (some will be in the next release):

  • grub deprecated
  • update DEFAULT_MAXPID to allow pid > 30000 (eg 99999 like smartos)
  • delete ftpusers, as there's no illumos ftpd
  • long usernames now silent rather than warning
  • removed uucp, and removed the nuucp user
  • zones based on core-tribblix need to worry less about what to remove
  • overlays based on core-tribblix with the actual images having a driver layer on top, so cloud/virtual images can slim down
  • replace /usr/xpg4/bin/more with a link to less
  • replace pax with the heirloom version
  • create /var/adm/loginlog by default
  • increase PASSLENGTH in /etc/default/passwd to 8
  • remove /etc/log and /var/adm/log, latter only used by volcopy
  • transformed away and eliminated most uses of isaexec
  • remove /usr/games
  • remove all legacy printing
  • remove libadt_jni
  • remove ilb
  • remove the old as on x86, everything should use gas
  • remove oawk and man page (and ref in awk.1)
  • remove newform, listusers, asa
  • no longer install doctools by default
  • drop the closed iconv bits, as they're useless
  • remove libfru* on x86
  • replace sendmail with the upstream
  • deprecate mailwrapper

A lot of this is simple package manipulation as I convert the IPS repo produced by an illumos build into SVR4 packages, mostly avoiding the need to patch the source or the build.

There's a lot more that could be done, some examples of what I'm thinking of include:

  • xpgN by default (replace regular binaries in /usr/bin)
  • sort out cpp (last remaining closed bin)
  • everything 64-bit
  • remove /etc links more aggressively
  • no ucb at all [except mebbe install...]
  • see if there are any expensive and unused kstats we could remove
  • firewall on by default
  • passwd blocklists by default
  • extendedFILE(7) enabled by default (although not necessary if everything is 64-bit!)
  • refactor packages so they are along sensible boundaries (with reducing the number of distinct packages being the goal)

Now all I need is some time to implement all this...

Thursday, April 24, 2025

On efficiency and resilience in IT

Years ago, I was in a meeting when a C-level executive proclaimed:

IT systems run at less than 10% utilization on average, so we're moving to the cloud to save money.

The logic behind this was that you could run systems in the cloud that were the size you needed, rather than the size you had on the floor.

Of course, this particular claim was specious. Did he know the average utilization of our systems, I asked. He did not. (It was at least 30%.)

Furthermore, measuring CPU utilization is just one aspect of a complex multidimensional space. Systems may have spare CPU cycles, but are hitting capacity limits on memory, memory bandwidth, network bandwidth, storage and storage bandwidth. It's rare to have a system so well balanced that it saturates all parameters equally.

Not only that, but the load on all systems fluctuates, even on very short timescales. There will always be troughs between the peaks. And, as we all know, busy systems tend to generate queues and congestion - or, as a technical term, higher utilization leads to increased latency.

Attempting to build systems that maximise efficiency implies minimizing waste. But if you always consider spare capacity as wasted capacity, then you will always get congested systems and slow response. (Just think about queueing at the tills in a supermarket where they've staffed them for average footfall.)

So guaranteeing performance and response time implies a certain level of overprovisioning.
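To put rough numbers on that, here's a small sketch using the textbook M/M/1 queueing model, where mean response time is the service time divided by (1 - utilization). The choice of model is my assumption for illustration, real systems are messier, but the shape of the curve is the point.

```python
def mean_response_time(service_time, utilization):
    """Mean time in an M/M/1 system: service_time / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time / (1.0 - utilization)

# Response time as a multiple of the bare service time, at various loads.
for u in (0.1, 0.3, 0.5, 0.8, 0.9, 0.95):
    print(f"{u:.0%} busy -> {mean_response_time(1.0, u):.1f}x the bare service time")
```

In this model a system running at 50% takes twice as long to respond as an idle one, and at 90% it takes ten times as long, which is why spare capacity isn't simply waste.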

Beyond that, resilient systems need to have sufficient capacity to not only handle normal fluctuations in usage, but abnormal usage due to failures and external events. And resilient design needs to have unused capacity to take up the slack when necessary.

In this case, a blinkered focus on efficiency not only leads to poor response, it also makes systems brittle and incapable of responding if a problem occurs.

A simple way to build resiliency is to have redundant systems - provision spare capacity that springs into action when needed. In such an active-passive configuration, the standby system might be idle. It doesn't have to be - you might use redundant systems for development/test/batch workloads (this presupposes you have a mechanism like Solaris zones to provide strong workload isolation).

Going to the cloud might solve the problem for a customer, but the cloud provider has exactly the same problem to solve, on a larger scale. They need to provision excess capacity to handle the variability in customer workloads. Which leads to the creation of interesting pricing models - such as reserved instances and the spot markets on AWS.

Tuesday, April 08, 2025

Understanding emission scopes, or failing to

I've been trying to get my head around all this Scope 1, Scope 2, Scope 3 emissions malarkey. Although it appears that lots of people smarter than me are struggling with it.

Having spent a while looking at how the Scopes are defined, I can understand how this can be difficult.

OK, Scope 1 is an organisation's direct emissions. Presumably an organisation knows what it's doing and how it's doing it, so getting the Scope 1 emissions from that ought to be fairly straightforward.

And Scope 2 is electricity, steam, heating and cooling purchased from someone else. I'm immediately suspicious here because this is a weirdly specific categorisation. But at least it should be easy to calculate - there's a conversion factor but at least you know the usage because it's on a bill you have to pay.

Then Scope 3 is - everything else. The fact that there are 15 official categories included ought to be a big red flag. That it's problematic is shown by the fact so many organisations have problems with it. (And by the growth of an industry to solve the problem for you.)

Personally, I wouldn't have defined it this way. If the idea is to evaluate emissions across the supply chain, then dumping almost all the emissions into the vaguest bucket is always going to be problematic.

So, why wasn't Scope 2 simply defined as the combined Scope 1 emissions of everyone providing services to the organisation? (That includes upstream and downstream, suppliers and employees, by the way.) That has 3 advantages I can see:

  • It's easy to calculate, because Scope 1 is pretty easy to calculate for all the providers of services (and they may well be doing it anyway), and an organisation ought to know who's providing services to it
  • It makes Scope 2 bigger (obviously) because there's more included, and therefore makes Scope 3 smaller, so uncertainties in Scope 3 matter less
  • Because you can better identify the contributors to your Scope 2 emissions, it's easier to know where to start making improvement efforts

I presume there's some reason it wasn't done this way, but I can't immediately see it.

Friday, April 04, 2025

What is this AI anyway?

AI is all the rage right now. It's everywhere, you can't avoid it.

But what is AI?

I'm not going to try and answer that here. What I will do, though, is state the question somewhat differently:

What is meant by "AI" in a given context?

And this matters, because the words we use are important.

The reality is that when you see AI mentioned it really could be almost anything. Some things AI might mean are:

  • Copilot
  • ChatGPT
  • Gemini
  • Some other specific off the shelf public LLM
  • Anything involving any off the shelf LLM
  • A custom domain-specific LLM
  • Machine learning
  • Pattern matching
  • Image recognition
  • Any old computer program
  • One of the AI companies

And there's always the possibility that someone has simply slapped AI on a product as a marketing term with no AI involved.

This persistent abuse of terminology is really unhelpful. Yesterday I went to a very interesting event for conversations about Hopes and Fears around AI.

Am I hopeful or fearful about AI? It depends which of the above definitions you mean.

There are certain uses of what might now be lumped in with AI that have proven to be very successful, but in many cases they're really machine learning, and have actually been around for a long time. I'm very positive about those (for example, helping in medical diagnoses).

On the other hand, if the AI is a stochastic parrot trained via large scale abuse of copyright while wreaking massive environmental damage, then I'm very negative about that.

So I think it's important to get away from sticking the AI label onto everything that might have some remote association with a computer program, and be far more careful in our terminology.

Tuesday, March 25, 2025

Tribblix on SPARC: sparse devices in an LDOM

I recently added a ddu-like capability to Tribblix.

In that article I showed the devices in a bhyve instance. As might be expected there really aren't a lot of devices you need to handle.

What about SPARC, you might ask? Even if you don't, I'll ask for you.

Running Tribblix in an LDOM, this is what you see:

root@sparc-m32:/root# zap ddu
Device SUNW,kt-rng handled by n2rng in TRIBsys-kernel-platform [installed]
Device SUNW,ramdisk handled by ramdisk in TRIBsys-kernel [installed]
Device SUNW,sun4v-channel-devices handled by cnex in TRIBsys-ldoms [installed]
Device SUNW,sun4v-console handled by qcn in TRIBsys-kernel-platform [installed]
Device SUNW,sun4v-disk handled by vdc in TRIBsys-ldoms [installed]
Device SUNW,sun4v-domain-service handled by vlds in TRIBsys-ldoms [installed]
Device SUNW,sun4v-network handled by vnet in TRIBsys-ldoms [installed]
Device SUNW,sun4v-virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]
Device SUNW,virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]
 

It's hardly surprising, but that's a fairly minimal list.

It does make me wonder whether to produce a special SPARC Tribblix image precisely to run in an LDOM. After all, I already have slightly different variants on x86 designed for cloud in general, and one for EC2 specifically, that don't need the whole variety of device drivers that the generic image has to include.

Sunday, March 23, 2025

Expecting an AI boom?

I recently went down to the smoke, to Tech Show London.

There were 5 constituent shows, and I found what each sub-show was offering - and the size of each component - quite interesting.

There wasn't much going on in Devops Live, to be honest. Relatively few players had shown up, nothing terribly interesting.

There wasn't that much in Big Data & AI World either. I was expecting much more here, and what there was seemed to be on the periphery. More support services than actual product.

The Cloud & Cyber Security Expo was middling, not great, and there was an AI slant in evidence. Not proper AI, but a sprinkling of AI dust on things just to keep up with the Joneses.

Cloud and AI Infrastructure had a few bright spots. I saw actual hardware on the floor - I had seen disk shelves over in the Big Data section, but here I spotted a Tape Library (I used to use those a lot, haven't seen much in that area for a while) and a VDI blade. Talked to a few people, including the Zabbix and Tailscale stands.

But when it came to Data Centre World, that was buzzing. It was about half the overall floor area, so it was far and away the dominant section. Tremendous diversity too - concrete, generators, power cables, electrical switching, fiber cables, cable management, thermal management, lots of power and cooling. Lots and lots of serious physical infrastructure.

There was an obvious expectation on display that there's a massive market around high-density compute. I saw multiple vendors with custom rack designs - rear-door and liquid cooling in evidence. Some companies addressing the massive demand for water.

If these people are at a trade show, then the target market isn't the 3 or 4 hyperscalers. What's being anticipated in this frenzy is very much companies building out their own datacentre facilities, and that's very much an interesting trend.

There's a saying "During a gold rush, sell shovels". What I saw here was a whole army of shovel-sellers getting ready for the diggers to show up.

Thursday, March 06, 2025

Tribblix, UEFI, and UFS

Somewhat uniquely among illumos distributions, Tribblix doesn't require installation to ZFS - it allows the possibility of installing to a UFS root file system.

I'm not sure how widely used this is, but it will get removed as an option at some point, as the illumos UFS won't work past Y2038.

I recently went through the process of testing an install of the very latest Tribblix to UFS, in a bhyve guest running UEFI. The UEFI part was a bit more work, and doing it clarified how some of the internals fit together.

(One reason for doing these unusual experiments is to better understand how things work, especially those that are handled automatically by more mainstream components.)

OK, on to installation.

While install to zfs will automatically lay out zfs pools and file systems, the ufs variant needs manual partitioning. There are two separate concerns - the Tribblix install, and UEFI boot.

The Tribblix installer for UFS assumes 2 things about the layout of the disk it will install to:

  1. The slice s0 will be used to install the operating system to, and mounted at /.
  2. The slice s1 will be used for swap. (On zfs, you create a zfs volume for swap; on ufs you use a separate raw partition.)

It's slightly unfortunate that these slices are hard-coded into the installer.

For UEFI boot we need 2 other slices:

  1. A system partition (this is what's called EFI System partition, aka ESP)
  2. A separate partition to put the stage2 bootloader in. (On zfs there's a little bit of free space you can use; there isn't enough on ufs so it needs to be handled separately.)

The question then arises as to how big these need to be. Now, if you create a root pool with ZFS (using zpool create -B) it will create a 256MB partition for ESP. This turns out to be the minimum size for FAT32 on 4k disks, so that's a size that should always work. On disks with a 512-byte block size, it needs to be 32MB or larger (there's a comment in the code about 33MB). The amount of data you're going to store there is very much less.
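Those minimum sizes fall out of the FAT format itself: a volume only counts as FAT32 if it has at least 65525 clusters, and a cluster can't be smaller than one sector. A quick sketch of the arithmetic (the function name is mine, and it only counts the data area, ignoring the reserved sectors and the FAT copies):

```python
# Below this cluster count a FAT volume is treated as FAT12 or FAT16.
MIN_FAT32_CLUSTERS = 65525

def min_fat32_data_bytes(sector_size, sectors_per_cluster=1):
    """Smallest possible FAT32 data area for a given sector size."""
    return MIN_FAT32_CLUSTERS * sectors_per_cluster * sector_size

for sector_size in (512, 4096):
    size = min_fat32_data_bytes(sector_size)
    print(f"{sector_size}-byte sectors: {size} bytes ({size / 2**20:.2f} MiB)")
```

The data area alone comes out just under 32MB for 512-byte sectors and just under 256MB for 4k sectors; the filesystem overhead on top of that is presumably where the extra margin comes from.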

The stage2 partition doesn't have to be terribly big.

So as a result of this I'm going to create a GPT label with 4 slices - 0 and 1 for Tribblix, 3 and 4 for EFI system and boot.

There are 2 things to note here: First, the partitions you create don't have to be laid out on disk in numerical order, you can put the slices in any order you want. This was true for SMI disks too, where it was common practice in Solaris to put swap on slice 1 at the start of the disk with slice 0 after it. Second, EFI/GPT doesn't assign any special significance to slice 2, unlike the old SMI label where slice 2 was conventionally the whole disk. I'm avoiding slice 2 here not because it's necessary, but so as to not confuse anyone used to the old SMI scheme.

The first thing to do with a fresh disk is to go into format, invoked as format -e (expert mode in order to access the EFI options). Select the disk, run fdisk from inside format, and then install an EFI label.

format -e
#
# choose the disk
#
fdisk
y - to accept defaults
l - to label
1 - choose efi

Then we can lay out the partitions. Still in format, type p to enter the partition menu and p to display the partitions.

p - enter partition menu
p - show current partition table

At this point on a new disk it should have 8 as "reserved" and 0 as "usr", with everything else "unassigned". We're going to leave slice 8 untouched.

First note where slice 0 currently starts. I'll resize it at the end, but we're going to put slices 3, 4, and 1 at the start of the disk and then resize 0 to fill in what's left.

To configure the settings for a given slice, just type its number.

Start with slice 3, type 3 and configure the system partition. This has to use the "system" tag.

tag: system
flags: wm (just hit return to accept)
start: 34
size: 64mb

Type p again to view the partition table and note the last sector of slice 3 we just created, and add 1 to it to give the start sector of the next slice. Type 4 to configure the boot partition, and it must have the tag "boot".

tag: boot
flags: wm (just hit return to accept)
start: 131106
size: 16mb

Type p again to view the partition table, take note of the last sector for the new slice 4, and add 1 to get the start sector for the next one, which is slice 1, the swap partition. Type 1 and enter:

tag: swap
flags: wm (just hit return to accept)
start: 163874
size: 512mb

We're almost done. The final step is to resize partition 0. Again you get the start sector by adding 1 to the last sector of the swap partition you just created. And rather than giving a size you can give the end sector using an 'e' suffix, which should be one less than the start of the reserved partition 8, and also the last sector of the original partition 0. Type 0 and enter something like:

tag: usr
flags: wm (just hit return to accept)
start: 1212450
size: 16760798e

Type 'p' one last time to view the partition table, check that the Tag entries are correct, and that the First and Last Sectors don't overlap.
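If you'd rather work out the start sectors in advance than read them off the partition table each time, the arithmetic is simple enough to script. This sketch assumes 512-byte sectors and a first usable sector of 34, which is what the numbers in this walkthrough correspond to; the function is just an illustration, not part of the installer.

```python
SECTOR = 512        # bytes per sector (assumed)
FIRST_USABLE = 34   # first usable sector on the EFI label, as used for slice 3
MB = 1024 * 1024

def layout(sizes_mb, first=FIRST_USABLE, sector=SECTOR):
    """Place slices back to back; return {slice: (first_sector, last_sector)}."""
    table = {}
    start = first
    for name, size_mb in sizes_mb:
        sectors = size_mb * MB // sector
        table[name] = (start, start + sectors - 1)
        start += sectors  # next slice starts one past this slice's last sector
    return table

# Slices 3 (system), 4 (boot), and 1 (swap), in on-disk order.
slices = layout([(3, 64), (4, 16), (1, 512)])
for name, (first_sector, last_sector) in slices.items():
    print(f"slice {name}: first {first_sector}, last {last_sector}")
print("slice 0 starts at", slices[1][1] + 1)
```

Because each slice starts one sector past the end of the previous one, there are no overlaps and no gaps, and slice 0 picks up whatever is left before the reserved slice 8.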

Then type 'l' to write the label to the disk. It will ask you for the label type - make sure it's EFI again - and for confirmation.

Then we can do the install:

./ufs_install.sh c1t0d0s0

It will ask for confirmation that you want to create the file system.

At the end it ought to say "Creating pcfs on ESP /dev/rdsk/c1t0d0s3"

If it says "Requested size is too small for FAT32." then that's a hint that you need the system partition to be bigger. (An alternative trick is to mkfs the pcfs file system yourself, if you create it using FAT16 it will still work but you can get away with it being a lot smaller.)

It should also tell you that it's writing the pmbr to slice 4 and to p0.

With that, rebooting into the newly installed system ought to work.

Now, the above is a fairly complicated set of instructions. I could automate this, but do we really want to make it that easy to install to UFS?

Wednesday, February 19, 2025

Introducing a ddu-alike for Tribblix

Introducing a new feature in Tribblix m36. There's a new ddu subcommand for zap.

In OpenSolaris, the Device Driver Utility would map the devices it found and work out what software was needed to drive them. This isn't that utility, but is inspired by that functionality, rewritten for Tribblix as a tiny little shell script.

As an example, this is the output of zap ddu for Tribblix in a bhyve instance:

jack@tribblix:~$ zap ddu
Device acpivirtnex handled by acpinex in TRIBsys-kernel-platform [installed]
Device pci1af4,1000,p handled by vioif in TRIBdrv-net-vioif [installed]
Device pci1af4,1001 handled by vioblk in TRIBdrv-storage-vioblk [installed]
Device pci1af4,1 handled by vioif in TRIBdrv-net-vioif [installed]
Device pciclass,030000 handled by vgatext in TRIBsys-kernel [installed]
Device pciclass,060100 handled by isa in TRIBsys-kernel-platform [installed]
Device pciex_root_complex handled by npe in TRIBsys-kernel-platform [installed]
Device pnpPNP,303 handled by kb8042 in TRIBsys-kernel [installed]
Device pnpPNP,f03 handled by mouse8042 in TRIBsys-kernel [installed]

Simply put, it will list the devices it finds, which driver is responsible for them, and which package that driver is contained in (and whether that package is installed).

This, while a tiny little feature, is one of those small things that is actually stunningly useful.

If there's a device that we have a driver for that isn't installed, this helps identify it so you know what to install.

What this doesn't do (yet, and unlike the original ddu) is show devices we don't have a driver for at all.
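The output is regular enough to post-process. As a sketch (this is not part of zap), here's a little parser for the lines shown above that picks out packages needing installation. I'm assuming that any status other than "installed" means the package is missing, and the "not installed" line in the sample is hypothetical, as is its device and package name.

```python
import re

# Matches lines like:
#   Device pci1af4,1001 handled by vioblk in TRIBdrv-storage-vioblk [installed]
LINE = re.compile(
    r"^Device (?P<device>\S+) handled by (?P<driver>\S+) "
    r"in (?P<package>\S+) \[(?P<status>[^\]]+)\]$"
)

def parse_ddu(output):
    """Turn zap ddu output into a list of dicts, skipping unrecognised lines."""
    rows = []
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows

def packages_to_install(rows):
    """Assumption: any status other than 'installed' means it's missing."""
    return sorted({r["package"] for r in rows if r["status"] != "installed"})

sample = """\
Device pci1af4,1001 handled by vioblk in TRIBdrv-storage-vioblk [installed]
Device pciEXAMPLE,1 handled by exampledrv in TRIBdrv-example [not installed]
"""
print(packages_to_install(parse_ddu(sample)))
```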

Monday, February 10, 2025

Is all this thing called AI worthwhile?

Before I even start, let's be clear: there are an awful lot of things currently being bundled under the "AI" banner, most of which are neither artificial nor intelligent.

So when I'm talking about AI here, I'm talking about what's being marketed to the masses as AI. This generally doesn't include the more traditional subjects of machine learning or image recognition, which I've often seen relabelled as AI.

But back to the title: is the modern thing called AI worthwhile?

Whatever it is, AI can do some truly remarkable things. That isn't something you can argue against. It can do some truly stupid and hopelessly wrong things as well.

But where does this good stuff fit in? Are businesses really going to benefit by embracing AI?

Well, yes, up to a point. There's a lot of menial work that can be handed off to an AI. It might be able to do it cheaper than a human.

The first snag is the Jevons paradox; by making menial tasks cheaper, a business simply opens the door to larger quantities of menial tasks, so it saves no money and its costs might even go up.

To be honest, though, I would have to say that if you can hand a task off to an AI, is it worth doing in the first place?

That's the rub, yes you might be able to optimise a process by using AI, but you can optimise it much more by eliminating it entirely.

(And you then don't have to pay extra for someone to come along and clean up after the AI has made a mess of it.)

It's not just the first level of process you need to look at. Take the example of summarising meetings. It's not so much that you don't need the summary, but to start with you need to run meetings better so they don't need to be summarised, and even better, the meeting probably wasn't needed at all.

Put it another way: the AI will get you to a local minimum of cost, but not to a global minimum. Worse, as AI gets cheaper and more widely used, that local optimisation makes it even harder to optimise the system globally.

So yes, I'm not convinced that much of the AI currently being rammed down our throats has any utility. It will actively block businesses in the pursuit of improvements, and the infatuation with current trendy AI will harm the development of useful AI.

Monday, December 16, 2024

Thoughts on Static Code Analysis

I use a number of tools in static code analysis for my projects - primarily Java based. Mostly

  1. codespell
  2. checkstyle
  3. shellcheck
  4. PMD
  5. SpotBugs

Wait, I hear you say. Spell checking? Absolutely, it's a key part of code and documentation quality. There's absolutely no excuse for shoddy spelling. And I sometimes find that if the spelling's off, it's a sign that concentration levels weren't what they should have been, and other errors might also have crept in.

checkstyle is far more than style, although it has very fixed ideas about that. I have a list of checks that must always pass (now I've cleaned them up at any rate), so that's now at the state where it's just looking for regressions - the remaining things it's complaining about I'm happy to ignore (or the cost of fixing them massively outweighs any benefit to fixing them).

One thing that checkstyle is keen on is thorough javadoc. Initially I might have been annoyed by some of its complaints, but then realised 2 things. First, it makes you consider whether a given API really should be public. And more generally as part of that, having to write javadoc can make you reevaluate the API you've designed, which pushes you towards improving it.

When it comes to shellcheck, I can summarise its approach as "quote all the things". Which is fine, until it isn't and you actually want to expand a variable into its constituent words.

But even there, a big benefit again is that shellcheck makes you look at the code and think about what it's doing. Which leads to an important point - automatic fixing of reported problems will (apart from making mistakes) miss the benefit of code inspection.

Actual coding errors (or just imperfections) tend to be the domain of PMD and SpotBugs. I have a long list of exceptions for PMD, depending on each project. I'm writing applications for unix-like systems, and I really do want to write directly to stdout and stderr. If I want to shut the application down, then calling System.exit() really is the way to do it.

I've been using PMD for years, and it took a while to get the recent version 7 configured to my liking. But having run PMD against my code for so long means that a lot of the low hanging fruit had already been fixed (and early on my code was much much worse than it is now). I occasionally turn the exclusions off and see if I can improve my code, and occasionally win at this game, but it's a relatively hard slog.

So far, SpotBugs hasn't really added much. I find its output somewhat unhelpful (I do read the reports), but initial impressions are that it's finding things the other tools don't, so I need to work harder to make sense of it.

Sunday, November 10, 2024

Debugging an OpenJDK crash on SPARC

I had to spend a little time recently fixing a crash in OpenJDK on Solaris SPARC.

What we're seeing is, from the hs_err file:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xffffffff57c745a8, pid=18442, tid=37
...
# Problematic frame:
# V  [libjvm.so+0x7745a8]  G1CollectedHeap::allocate_new_tlab(unsigned long, unsigned long, unsigned long*)+0xb8

Well that's odd. I only see this on SPARC, and I've seen it sporadically on Tribblix during the process of continually building OpenJDK on SPARC, but haven't seen it on Solaris. Until a customer hit it in production, which is rather a painful place to find a reproducer.

In terms of source, this is located in the file src/hotspot/share/gc/g1/g1CollectedHeap.cpp (all future source references will be relative to that directory), and looks like:

HeapWord* G1CollectedHeap::allocate_new_tlab(size_t min_size,
                                             size_t requested_size,
                                             size_t* actual_size) {
  assert_heap_not_locked_and_not_at_safepoint();
  assert(!is_humongous(requested_size), "we do not allow humongous TLABs");

  return attempt_allocation(min_size, requested_size, actual_size);
}

That's incredibly simple. There's not much that can go wrong there, is there?

The complexity here is that a whole load of functions get inlined. So what does it call? You find yourself in a twisty maze of passages, all alike. But anyway, the next one down is

inline HeapWord* G1CollectedHeap::attempt_allocation(size_t min_word_size,
                                                     size_t desired_word_size,
                                                     size_t* actual_word_size) {
  assert_heap_not_locked_and_not_at_safepoint();
  assert(!is_humongous(desired_word_size), "attempt_allocation() should not "
         "be called for humongous allocation requests");

  HeapWord* result = _allocator->attempt_allocation(min_word_size, desired_word_size, actual_word_size);

  if (result == NULL) {
    *actual_word_size = desired_word_size;
    result = attempt_allocation_slow(desired_word_size);
  }

  assert_heap_not_locked();
  if (result != NULL) {
    assert(*actual_word_size != 0, "Actual size must have been set here");
    dirty_young_block(result, *actual_word_size);
  } else {
    *actual_word_size = 0;
  }

  return result;
}

That then calls an inlined G1Allocator::attempt_allocation() in g1Allocator.hpp. That calls current_node_index(), which looks safe and then there are a couple of calls to mutator_alloc_region()->attempt_retained_allocation() and mutator_alloc_region()->attempt_allocation(), which come from g1AllocRegion.inline.hpp and both ultimately call a local par_allocate(), which then calls par_allocate_impl() or par_allocate() in heapRegion.inline.hpp.

Now, mostly all these are doing is calling something else. The one really complex piece of code is in par_allocate_impl() which contains

...
  do {
    HeapWord* obj = top();
    size_t available = pointer_delta(end(), obj);
    size_t want_to_allocate = MIN2(available, desired_word_size);
    if (want_to_allocate >= min_word_size) {
      HeapWord* new_top = obj + want_to_allocate;
      HeapWord* result = Atomic::cmpxchg(&_top, obj, new_top);
      // result can be one of two:
      //  the old top value: the exchange succeeded
      //  otherwise: the new value of the top is returned.
      if (result == obj) {
        assert(is_object_aligned(obj) && is_object_aligned(new_top), "checking alignment");
        *actual_size = want_to_allocate;
        return obj;
      }
    } else {
      return NULL;
    }
  } while (true);
}
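As a rough illustration of that retry loop (a toy Python model, not HotSpot code): words are plain integers, and the atomic compare-and-swap is emulated with a lock. The Region class and its method names are made up for this sketch.

```python
import threading


class Region:
    """Toy stand-in for a heap region; addresses are plain integers."""

    def __init__(self, bottom, end):
        self._top = bottom
        self._end = end
        # The lock exists only to emulate an atomic compare-and-swap.
        self._lock = threading.Lock()

    def top(self):
        return self._top

    def cmpxchg_top(self, expected, new):
        """Emulated CAS: store `new` if top == expected; return the value seen."""
        with self._lock:
            old = self._top
            if old == expected:
                self._top = new
            return old

    def par_allocate(self, min_words, desired_words):
        """Bump-pointer allocation; returns (address, size) or None."""
        while True:
            obj = self.top()
            available = self._end - obj
            want = min(available, desired_words)
            if want < min_words:
                return None  # not enough room left in this region
            if self.cmpxchg_top(obj, obj + want) == obj:
                return obj, want  # the exchange succeeded
            # Another thread moved top in the meantime; go round again.
```

The shape matches the listing: read top, work out how much fits, try to swing top forward, and retry if someone else got there first. Calling par_allocate() on None rather than on a Region fails immediately, which is the Python analogue of the crash being chased here.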

Right, let's go back to the crash. We can open up the core file in
mdb, and look at the stack with $C

ffffffff7f39d751 libjvm.so`_ZN7VMError14report_and_dieEP6ThreadjPhPvS3_+0x3c(
    101cbb1d0?, b?, fffffffcb45dea7c?, ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?)
ffffffff7f39d811 libjvm.so`JVM_handle_solaris_signal+0x1d4(b?,
    ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?, ffffffff7f39e178?, 101cbb1d0?)
ffffffff7f39dde1 libjvm.so`_ZL17javaSignalHandleriP7siginfoPv+0x20(b?,
    ffffffff7f39ecb0?, ffffffff7f39e9a0?, 0?, 0?, ffffffff7e7dd370?)
ffffffff7f39de91 libc.so.1`__sighndlr+0xc(b?, ffffffff7f39ecb0?,
    ffffffff7f39e9a0?, fffffffcb4b38afc?, 0?, ffffffff7f20c7e8?)
ffffffff7f39df41 libc.so.1`call_user_handler+0x400((int) -1?,
    (siginfo_t *) 0xffffffff7f39ecb0?, (ucontext_t *) 0xc?)
ffffffff7f39e031 libc.so.1`sigacthandler+0xa0((int) 11?,
    (siginfo_t *) 0xffffffff7f39ecb0?, (void *) 0xffffffff7f39e9a0?)
ffffffff7f39e5b1 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8(
    10013d030?, 100?, 520?, ffffffff7f39f000?, 0?, 0?)

What you see here is the allocate_new_tlab() at the bottom, it throws a signal, the signal handler catches it, passes it ultimately to JVM_handle_solaris_signal() which bails, and the JVM exits.

We can look at the signal. It's at address 0xffffffff7f39ecb0 and is of type siginfo_t, so we can just print it

java:core> ffffffff7f39ecb0::print -t siginfo_t

and we first see

siginfo_t {
    int si_signo = 0t11 (0xb)
    int si_code = 1
    int si_errno = 0
...

OK, the signal was indeed 11 = SIGSEGV. The interesting thing is the si_code of 1, which is defined as

#define SEGV_MAPERR     1       /* address not mapped to object */

Ah. Now, in the jvm you actually see a lot of SIGSEGV, but a lot of them are handled by that mysterious JVM_handle_solaris_signal(). In particular, it'll handle anything with SEGV_ACCERR which is basically something running off the end of an array.

Further down, you can see the fault address

struct  __fault = {
            void *__addr = 0x10
            int __trapno = 0
            caddr_t __pc = 0
            int __adivers = 0
        }

So, we're faulting on address 0x10. Yes, you try messing around down there and you will fault.


That confirms the crash is a SEGV. What are we actually trying to do? We can disassemble the allocate_new_tlab() function and see what's happening - remember the crash was at offset 0xb8

java:core> libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm::dis
...
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8:

       ldx       [%i4 + 0x10], %i5

That's interesting, 0x10 was the fault address. What's %i4 then?

java:core> ::regs
%i4 = 0x0000000000000000

Yep. Given that, we'll try and read 0x10, giving the SEGV we see.

There's a little more context around that call site. A slightly
expanded view is

 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa0:        nop
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa4:        add       %i5, %g1, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xa8:        casx      [%g3], %i5, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xac:        cmp       %i5, %g1
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb0:        be,pn     %xcc, +0x160  <libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0x210>
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb4:        nop
 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0xb8:        ldx       [%i4 + 0x10], %i5

Now, the interesting thing here is the casx (compare and swap) instruction. That lines up with the Atomic::cmpxchg() in par_allocate_impl() that we were suspecting above. So the crash is somewhere around there.

It turns out there's another way to approach this. If we compile without optimization then effectively we turn off the inlining. The way to do this is to add an entry to the jvm Makefile via make/hotspot/lib/JvmOverrideFiles.gmk

...
else ifeq ($(call isTargetOs, solaris), true)
    ifeq ($(call isTargetCpuArch, sparc), true)
      # ptribble port tweaks
      BUILD_LIBJVM_g1CollectedHeap.cpp_CXXFLAGS += -O0
    endif
endif

If we rebuild (having touched all the files in the directory to force
make to rebuild everything correctly), and run again, we get the full
call stack.

Now the crash is

# V  [libjvm.so+0x80cc48]  HeapRegion::top() const+0xc

which we can expand to the following stack leading up to where it goes
into the signal handler:

ffffffff7f39dff1 libjvm.so`_ZNK10HeapRegion3topEv+0xc(0?, ffffffff7f39ef40?,
    101583e38?, ffffffff7f39f020?, fffffffa46de8038?, 10000?)
ffffffff7f39e0a1 libjvm.so`_ZN10HeapRegion17par_allocate_implEmmPm+0x18(0?,
    100?, 10000?, ffffffff7f39ef60?, ffffffff7f39ef40?, 8f00?)
ffffffff7f39e181 libjvm.so`_ZN10HeapRegion27par_allocate_no_bot_updatesEmmPm+0x24(0?, 100?,
    10000?, ffffffff7f39ef60?, 566c?, 200031?)
ffffffff7f39e231 libjvm.so`_ZN13G1AllocRegion12par_allocateEP10HeapRegionmmPm+0x44(100145440?,
    0?, 100?, 10000?, ffffffff7f39ef60?, 0?)
ffffffff7f39e2e1 libjvm.so`_ZN13G1AllocRegion18attempt_allocationEmmPm+0x48(
    100145440?, 100?, 10000?, ffffffff7f39ef60?, 3?, fffffffa46ceff48?)
ffffffff7f39e3a1 libjvm.so`_ZN11G1Allocator18attempt_allocationEmmPm+0xa4(
    1001453b0?, 100?, 10000?, ffffffff7f39ef60?, 7c0007410?, ffffffff7f39ea41?)
ffffffff7f39e461 libjvm.so`_ZN15G1CollectedHeap18attempt_allocationEmmPm+0x2c(
    10013d030?, 100?, 10000?, ffffffff7f39ef60?, 7c01b15e8?, 0?)
ffffffff7f39e521 libjvm.so`_ZN15G1CollectedHeap17allocate_new_tlabEmmPm+0x24(
    10013d030?, 100?, 10000?, ffffffff7f39ef60?, 0?, 0?)

So yes, this confirms that we are indeed in par_allocate_impl() and
it's crashing on the very first line of the code segment I showed
above, where it calls top(). All top() does is return the _top member
of a HeapRegion.

So the only thing that can happen here is that the HeapRegion itself
is NULL. Then the _top member is presumably at offset 0x10, and trying
to access it gives the SIGSEGV.
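To make the arithmetic concrete, here's a small sketch using Python's ctypes. The layout is entirely hypothetical (two 8-byte fields ahead of the one we care about); the real HeapRegion layout isn't shown here, so this only illustrates how a NULL object pointer plus a member offset lands on a small address like 0x10.

```python
import ctypes


class FakeHeapRegion(ctypes.Structure):
    # Hypothetical layout: two 8-byte fields precede "top".
    # This is NOT the real HeapRegion, just enough to show the offset.
    _fields_ = [
        ("field_a", ctypes.c_uint64),
        ("field_b", ctypes.c_uint64),
        ("top", ctypes.c_uint64),
    ]


def member_address(base, field):
    """Address a load of `field` would touch for an object at `base`."""
    return base + getattr(FakeHeapRegion, field).offset
```

With that layout, member_address(0, "top") is 0x10: reading the member through a NULL object pointer means reading from address 0x10, which is the fault address in the siginfo.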

Now, in G1AllocRegion::attempt_allocation() there's an assert:

  HeapRegion* alloc_region = _alloc_region;
  assert_alloc_region(alloc_region != NULL, "not initialized properly");

However, asserts aren't compiled into production builds.

But the fix here is to fail if we've got NULL and let the caller
retry. There are a lot of calls here, and the general approach is to
return NULL if anything goes wrong, so I do the same for this extra
failure case, adding the following:

  if (alloc_region == NULL) {
    return NULL;
  }

With that, no more of those pesky crashes. (There might be others
lurking elsewhere, of course.)

Of course, what this doesn't explain is why the HeapRegion wasn't
correctly initialized in the first place. But that's another problem
entirely.

Tuesday, July 09, 2024

What's a decent password length?

What's a decent length for a password?

I think it's pretty much agreed by now that longer passwords are, in general, better. And fortunately stupid complexity requirements are on the way out.

Reading the NIST password rules gives the following:

  • User chosen passwords must be at least 8 characters
  • Machine chosen passwords must be at least 6 characters
  • You must allow passwords to be at least 64 characters

Say what? A 6 character password is secure?

Initially, that seems way off, but it depends on your threat model. If you have a mechanism to block the really bad commonly used passwords, then 6 characters gives you a billion choices. Not many, but you should also be implementing technical measures such as rate limiting.

With that, if the only attack vector is brute force over the network, trying a billion passwords is simply impractical. Even with just passive rate limiting (limited by cpu power and network latency) an attacker will struggle; with active limiting they'll be trying for decades.

That's with just 6 random characters. Go to 8 and you're out of sight. And for this attack vector, no quantum computing developments will make any difference whatsoever.
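A back-of-the-envelope sketch of that arithmetic, in Python. The alphabet size and guess rates are my assumptions for illustration: a 32-symbol alphabet is one that gives roughly a billion 6-character passwords, and 1 guess per second stands in for an actively rate-limited login.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def keyspace(alphabet_size, length):
    """Number of distinct random passwords of the given length."""
    return alphabet_size ** length


def years_to_exhaust(choices, guesses_per_second):
    """Years needed to try every candidate at a fixed guess rate."""
    return choices / guesses_per_second / SECONDS_PER_YEAR
```

For example, keyspace(32, 6) is 1,073,741,824, and years_to_exhaust(keyspace(32, 6), 1) comes out at roughly 34 years. Going to 8 characters multiplies the keyspace by another 1024.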

But what if the user database itself is compromised?

Of course, if the passwords are in cleartext then no amount of fancy rules or length requirements is going to help you at all.

But if an attacker gets encrypted passwords then they can simply brute force them many orders of magnitude faster. Or use rainbow tables. And that's a whole different threat model.

Realistically, protecting against brute force or rainbow table attacks probably needs a 16 character password (or passphrase), and that requirement could get longer over time.

A corollary to this is that there isn't actually much to be gained by requiring password lengths between 8 and 16 characters.

In illumos, the default minimum password length is 6 characters. I recently increased the default in Tribblix to 8, which aligns with the user chosen limit that NIST give.

Wednesday, April 03, 2024

Tribblix image structural changes

The Tribblix live ISO and related images are put together ever so slightly differently in the latest m34 release.

All along, there's been an overlay (think a group package) called base-iso that lists the packages that are present in the live image. On installation, this is augmented with a few extra packages that you would expect to be present in a running system but which don't make much sense in a live image, to construct the base system.

You can add additional software, but the base is assumed to be present.

The snag with this is that base-iso is very much a single-purpose generic concept. By its very nature it has to be minimal enough to not be overly bloated, yet contain as many drivers as necessary to handle the majority of systems.

As such, the regular ISO image has fallen between 2 stools - it doesn't have every single driver, so some systems won't work, while it has a lot of unnecessary drivers for a lot of common use cases.

So what I've done is split base-iso into 2 layers. There's a new core-tribblix overlay, which is the common packages, and then base-iso adds all the extra drivers. By and large, the regular live image for m34 isn't really any different to what was present before.

But the concepts of "what packages do I need for applications to work" and "what packages do I want to load on a given downloadable ISO" have now been split.

What this allows is to easily create other images with different rules. As of m34, for example, the "minimal" image is actually created from a new base-server overlay, which again sits atop core-tribblix and differs from base-iso in that it has all the FC drivers. If you're installing on a fibre-channel connected system then using the minimal image will work better (and if you're SAN-booted, it will work where the regular ISO won't).

The next use case is that images for cloud or virtual systems simply don't need most of the drivers. This cuts out a lot of packages (although it doesn't actually save that much space).

The standard Tribblix base system now depends on core-tribblix, not base-iso or any of the specific image layers. This is as it should be - userland and applications really shouldn't care what drivers are present.

One side-effect of this change is that it makes minimising zones easier, because what gets installed in a zone can be based on that stripped-down core-tribblix overlay.

Monday, February 19, 2024

The SunOS JDK builder

I've been building OpenJDK on Solaris and illumos for a while.

This has been moderately successful; illumos distributions now have access to up to date LTS releases, most of which work well. (At least 11 and 17 are fine; 21 isn't quite right.)

There are even some third-party collections of my patches, primarily for Solaris (as opposed to illumos) builds.

I've added another tool. The SunOS jdk builder.

The aim here is to be able to build every single jdk tag, rather than going to one of the existing repos which only have the current builds. And, yes, you could grope through the git history to get to older builds, but one problem with that is that you can't actually fix problems with past builds.

Most of the content is in the jdk-sunos-patches repository. Here there are patches for both illumos and Solaris (they're ever so slightly different) for every tag I've built.

(That's almost every jdk tag since the Solaris/SPARC/Studio removal, and a few before that. Every so often I find I missed one. And there's been the odd bad patch along the way.)

The idea here is to make it easy to build every tag, and to do so on a current system. I've had to add new patches to get some of the older builds to work. The world has changed, we have newer compilers and other tools, and the OS we're building on has evolved. So if someone wanted to start building the jdk from scratch (and remember that you have to build all the versions in sequence) then this would be useful.

I'm using it for a couple of other things.

One is to put back SPARC support on illumos and Solaris. The initial port I did was on x86 only, so I'm walking through older builds and getting them to work on SPARC. We'll almost certainly not get to jdk21, but 17 seems a reasonable target.

The other thing is to enable the test suites, and then run them, and hopefully get them clean. At the moment they aren't, but a lot of that is because many tests are OS-specific and they don't know what Solaris is so get confused. With all the tags, I can bisect on failures and (hopefully) fix them.

Wednesday, November 22, 2023

Building up networks of zones on Tribblix

With OpenSolaris and derivatives such as illumos, we gained the ability to build a whole IT infrastructure in a single box, using virtualized networking (crossbow) to build the underlying network and then attaching virtualized systems (zones) atop virtualized storage (zfs).

Some of this was present in Solaris 10, but it didn't have crossbow so the networking piece was a bit tricky (although I did manage to get surprisingly far by abusing the loopback interface).

In Tribblix, I've long had the notion of a router or proxy zone, which acts as a bridge between the outside world and a local virtual subnet. For the next release I've been expanding that into something much more flexible and capable.

What did I need to put this together?

The first thing is a virtual network. You use dladm to create an etherstub. Think of that as a virtual switch you can connect network links to.

To connect that to the world, a zone is created with 2 network interfaces (vnics). One over the system interface so it can connect to the outside world, and one over the etherstub.

That special router zone is a little bit more than that. It runs NAT to allow any traffic on the internal subnet - simple NAT, nothing complicated here. In order to do that the zone has to have IPFilter installed, and the zone creation script creates the right ipnat configuration file and ensures that IPFilter is started.

You also need to have IPFilter installed in the global zone. It doesn't have to be running there, but the installation is required to create the IPFilter devices. Those IPFilter devices are then exposed to the zone, and for that to work the zone needs to use exclusive-ip networking rather than shared-ip (and would need to do so anyway for packet forwarding to work).

One thing I learnt was that you can't lock the router zone's networking down with allowed-address. The anti-spoofing protection that allowed-address gives you prevents forwarding and breaks NAT.

The router zone also has a couple of extra pieces of software installed. The first is haproxy, which is intended as an ingress controller. That's not currently used, and could be replaced by something else. The second is dnsmasq, which is used as a dhcp server to configure any zones that get connected to the subnet.

With a network segment in place, and a router zone for management, you can then create extra zones.

The way this works in Tribblix is that if you tell zap to create a zone with an IP address that is part of a private subnet, it will attach its network to the corresponding etherstub. That works fine for an exclusive-ip zone, where the vnic can be created directly over the etherstub.

For shared-ip zones it's a bit trickier. The etherstub isn't a real network device, although for some purposes (like creating a vnic) it looks like one. To allow shared-ip, I create a dedicated shared vnic over the etherstub, and the virtual addresses for shared-ip zones are associated with that vnic. For this to work, it has to be plumbed in the global zone, but doesn't need an address there. The downside to the shared-ip setup (or it might be an upside, depending on what the zone's going to be used for) is that in this configuration it doesn't get a network route; normally this would be inherited off the parent interface, but there isn't an IP configuration associated with the vnic in the global zone.

The shared-ip zone is handed its IP address. For exclusive-ip zones, the right configuration fragment is poked into dnsmasq on the router zone, so that if the zone asks via dhcp it will get the answer you configured. Generally, though, if I can directly configure the zone I will. And that's either by putting the right configuration into the files in a zone so it implements the right networking at boot, or via cloud-init. (Or, in the case of a solaris10 zone, I populate sysidcfg.)

There's actually a lot of steps here, and doing it by hand would be rather (ahem, very) tedious. So it's all automated by zap, the package and system administration tool in Tribblix. The user asks for a router zone, and all it needs to be given is the zone's name, the public IP address, and the subnet address, and all the work will be done automatically. It saves all the required details so that they can be picked up later. Likewise for a regular zone, it will do all the configuration based on the IP address you specify, with no extra input required from the user.

The whole aim here is to make building zones, and whole systems of zones, much easier and more reliable. And there's still a lot more capability to add.

Saturday, November 04, 2023

Keeping python modules in check

Any operating system distribution - and Tribblix is no different - will have a bunch of packages for python modules.

And one thing about python modules is that they tend to depend on other python modules. Sometimes a lot of python modules. Not only that, the dependency will be on a specific version - or range of versions - of particular modules.

Which opens up the possibility that two different modules might require incompatible versions of a module they both depend on.

For a long time, I was a bit lax about this. Most of the time you can get away with it (often because module writers are excessively cautious about newer versions of their dependencies). But occasionally I got bitten by upgrading a module and breaking something that used it, or breaking it because a dependency hadn't been updated to match.

So now I always check that I've got all the dependencies listed in packaging with

pip3 show modulename

and every time I update a module I check the dependencies aren't broken with

pip3 check

Of course, this relies on the machine having all the (interesting) modules installed, but on my main build machine that is generally true.

If an incompatibility is picked up by pip3 check then I'll either not do the update, or update any other modules to keep in sync. If an update is impossible, I'll take a note of which modules are blockers, and wait until they get an update to unjam the process.
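If you want the declared dependencies in a script rather than by eye, something like the following works as a rough sketch. It uses the standard library's importlib.metadata rather than pip itself, so it only approximates the Requires line of pip3 show; the function name and the crude filtering of extras are mine.

```python
from importlib import metadata


def declared_requirements(dist_name, include_extras=False):
    """Requirement strings declared by an installed distribution.

    Returns None if the distribution isn't installed.
    """
    try:
        requirements = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None
    if include_extras:
        return list(requirements)
    result = []
    for req in requirements:
        # Anything after ';' is an environment marker; optional extras
        # show up there as: extra == "name"
        _, _, marker = req.partition(";")
        if "extra" not in marker:
            result.append(req)
    return result
```

This only reports what a module declares. It doesn't check that the installed versions actually satisfy those requirements, which is the job pip3 check is doing.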

A case in point was that urllib3 went to version 2.x recently. At first, nothing would allow that, so I couldn't update urllib3 at all. Now we're in a situation where I have one module I use that won't allow me to update urllib3, and am starting to see a few modules requiring urllib3 to be updated, so those are held downrev for the time being.

The package dependencies I declare tend to be the explicit module dependencies (as shown by pip3 show). Occasionally I'll declare some or all of the optional dependencies in packaging, if the standard use case suggests it. And there's no obvious easy way to emulate the notion of extras in package dependencies. But that can be handled in package overlays, which is the safest way in any case.

Something else the checking can pick up is when a dependency is removed, which is something that can be easily missed.

Doing all the checking adds a little extra work up front, but should help remove one class of package breakage.

Friday, October 27, 2023

It seemed like a simple problem to fix

While a bit under the weather last week, I decided to try and fix what at first glance appears to be a simple problem:

need to ship the manpage with exa

Now, exa is a modern file lister, and the package on Tribblix doesn't ship a man page. The reason for that, it turns out, is that there isn't a man page in the source, but you can generate one.

To build the man page requires pandoc. OK, so how to get pandoc, which wasn't available on Tribblix? It's written in Haskell, and I did have a Haskell package.

Only my version of Haskell was a bit old, and wouldn't build pandoc. The build complains that it's too old and unsupported. You can't even build an old version of pandoc, which is a little peculiar.

Off to upgrade Haskell then. You need Haskell to build Haskell, and it has some specific requirements about precisely which versions of Haskell work. I wanted to get to 9.4, which is the last version of Haskell that builds using make (and I'll leave Hadrian for another day). You can't build Haskell 9.4 with 9.2, which it claims is too new; you have to go back to 9.0.

Fortunately we do have some bootstrap kits for illumos available, so I pulled 9.0 from there, successfully built Haskell, then cabal, and finally pandoc.

Back to exa. At which point you notice that it's been deprecated and replaced by eza. (This is a snag with modern point tools. They can disappear on a whim.)

So let's build eza. At which point I find that the MSRV (Minimum Supported Rust Version) has been bumped to 1.70, and I only had 1.69. Another update required. Rust is actually quite simple to package; you can just download the stable version and package it.

After all this, exa still doesn't have a man page, because it's deprecated (if you run man exa you get something completely different from X.Org). But I did manage to upgrade Haskell and Cabal, I managed to package pandoc, I updated rust, and I added a replacement utility - eza - which does now come with a man page.