tag:blogger.com,1999:blog-97268332024-03-11T03:23:24.287+00:00The Trouble with Tribbles...Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.comBlogger571125tag:blogger.com,1999:blog-9726833.post-83313023721787281702024-02-19T20:19:00.000+00:002024-02-19T20:19:24.915+00:00The SunOS JDK builder<p>I've been building <a href="https://ptribble.blogspot.com/2021/12/keeping-java-alive-on-illumos.html">OpenJDK on Solaris and illumos</a> for a while.</p><p>This has been moderately successful; illumos distributions now have access to up to date LTS releases, most of which work well. (At least 11 and 17 are fine; 21 isn't quite right.)</p><p>There are even some third-party collections of my patches, primarily for Solaris (as opposed to illumos) builds.</p><p>I've added another tool. The <a href="https://github.com/ptribble/jdk-sunos-builder">SunOS jdk builder</a>.</p><p>The aim here is to be able to build every single jdk tag, rather than going to one of the existing repos which only have the current builds. And, yes, you could grope through the git history to get to older builds, but one problem with that is that you can't actually fix problems with past builds.</p><p>Most of the content is in the <a href="https://github.com/ptribble/jdk-sunos-patches">jdk-sunos-patches</a> repository. Here there are patches for both illumos and Solaris (they're ever so slightly different) for every tag I've built.</p><p>(That's almost every jdk tag since the Solaris/SPARC/Studio removal, and a few before that. Every so often I find I missed one. And there's been the odd bad patch along the way.)</p><p>The idea here is to make it easy to build every tag, and to do so on a current system. I've had to add new patches to get some of the older builds to work. The world has changed, we have newer compilers and other tools, and the OS we're building on has evolved. 
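The tag-by-tag approach can be sketched as a simple loop. This is an illustrative sketch only — the tag names, the patch directory layout, and the build step shown are assumptions for the example, not the builder's actual interface:

```shell
# Illustrative only: walk a sequence of jdk tags, pairing each tag with
# its patch directory. Names and layout are invented for the sketch;
# the echo stands in for the actual patch-and-build step.
built=""
for tag in jdk-17.0.1-ga jdk-17.0.2-ga jdk-17.0.3-ga; do
  patchdir="jdk-sunos-patches/illumos/${tag}"
  echo "would apply patches from ${patchdir} and build ${tag}"
  built="$built $tag"
done
```

The point of keeping a patch set per tag is exactly that a loop like this can rebuild (or fix) any historical tag, not just the current one.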
So if someone wanted to start building the jdk from scratch (and remember that you have to build all the versions in sequence) then this would be useful.</p><p>I'm using it for a couple of other things.</p><p>One is to put back SPARC support on illumos and Solaris. The initial port I did was on x86 only, so I'm walking through older builds and getting them to work on SPARC. We'll almost certainly not get to jdk21, but 17 seems a reasonable target.</p><p>The other thing is to enable the test suites, and then run them, and hopefully get them clean. At the moment they aren't, but a lot of that is because many tests are OS-specific and they don't know what Solaris is so get confused. With all the tags, I can bisect on failures and (hopefully) fix them.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-11610366693160404282023-11-22T14:31:00.000+00:002023-11-22T14:31:19.049+00:00Building up networks of zones on Tribblix<p>With OpenSolaris and derivatives such as illumos, we gained the ability to build a whole IT infrastructure in a single box, using virtualized networking (crossbow) to build the underlying network and then attaching virtualized systems (zones) atop virtualized storage (zfs).<br /><br />Some of this was present in Solaris 10, but it didn't have crossbow so the networking piece was a bit tricky (although I did manage to get surprisingly far by abusing the loopback interface).<br /><br />In <a href="http://www.tribblix.org/">Tribblix</a>, I've long had the notion of a router or proxy zone, which acts as a bridge between the outside world and a local virtual subnet. For the next release I've been expanding that into something much more flexible and capable.<br /><br />What did I need to put this together?<br /><br />The first thing is a virtual network. You use <span style="font-family: courier;">dladm</span> to create an <span style="font-family: courier;">etherstub</span>. 
Think of that as a virtual switch you can connect network links to.<br /><br />To connect that to the world, a zone is created with 2 network interfaces (<span style="font-family: courier;">vnics</span>). One over the system interface so it can connect to the outside world, and one over the <span style="font-family: courier;">etherstub</span>.<br /><br />That special router zone is a little bit more than that. It runs NAT to allow any traffic on the internal subnet - simple NAT, nothing complicated here. In order to do that the zone has to have IPFilter installed, and the zone creation script creates the right ipnat configuration file and ensures that IPFilter is started.<br /><br />You also need to have IPFilter installed in the global zone. It doesn't have to be running there, but the installation is required to create the IPFilter devices. Those IPFilter devices are then exposed to the zone, and for that to work the zone needs to use exclusive-ip networking rather than shared-ip (and would need to do so anyway for packet forwarding to work).<br /><br />One thing I learnt was that you can't lock the router zone's networking down with allowed-address. The anti-spoofing protection that allowed-address gives you prevents forwarding and breaks NAT.<br /><br />The router zone also has a couple of extra pieces of software installed. The first is <a href="http://www.haproxy.org/">haproxy</a>, which is intended as an ingress controller. That's not currently used, and could be replaced by something else. 
The second is <a href="https://dnsmasq.org/">dnsmasq</a>, which is used as a dhcp server to configure any zones that get connected to the subnet.<br /><br />With a network segment in place, and a router zone for management, you can then create extra zones.<br /><br />The way this works in Tribblix is that if you tell zap to create a zone with an IP address that is part of a private subnet, it will attach its network to the corresponding <span style="font-family: courier;">etherstub</span>. That works fine for an exclusive-ip zone, where the <span style="font-family: courier;">vnic</span> can be created directly over the <span style="font-family: courier;">etherstub</span>.<br /><br />For shared-ip zones it's a bit trickier. The <span style="font-family: courier;">etherstub</span> isn't a real network device, although for some purposes (like creating a <span style="font-family: courier;">vnic</span>) it looks like one. To allow shared-ip, I create a dedicated shared <span style="font-family: courier;">vnic</span> over the <span style="font-family: courier;">etherstub</span>, and the virtual addresses for shared-ip zones are associated with that <span style="font-family: courier;">vnic</span>. For this to work, it has to be plumbed in the global zone, but doesn't need an address there. The downside to the shared-ip setup (or it might be an upside, depending on what the zone's going to be used for) is that in this configuration it doesn't get a network route; normally this would be inherited off the parent interface, but there isn't an IP configuration associated with the <span style="font-family: courier;">vnic</span> in the global zone.<br /><br />The shared-ip zone is handed its IP address. For exclusive-ip zones, the right configuration fragment is poked into dnsmasq on the router zone, so that if the zone asks via dhcp it will get the answer you configured. Generally, though, if I can directly configure the zone I will. 
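Pulling the pieces above together, here is a hedged sketch of the underlying link plumbing. The link names (net0, stub0, rvnic0, rvnic1, shared0) are invented, and the commands are printed rather than executed so the sketch is safe to run anywhere; on an illumos box you would drop the wrapper and run dladm directly:

```shell
# Print, rather than execute, the dladm plumbing described above.
# All link names here are invented for illustration.
dladm_cmd() { echo "dladm $*"; }

# the etherstub acting as a virtual switch
dladm_cmd create-etherstub stub0
# router zone: one vnic over the real interface, one over the etherstub
dladm_cmd create-vnic -l net0 rvnic0
dladm_cmd create-vnic -l stub0 rvnic1
# a dedicated shared vnic over the etherstub for shared-ip zones
dladm_cmd create-vnic -l stub0 shared0
```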
And that's either by putting the right configuration into the files in a zone so it implements the right networking at boot, or via <span style="font-family: courier;">cloud-init</span>. (Or, in the case of a solaris10 zone, I populate <span style="font-family: courier;">sysidcfg</span>.)<br /><br />There's actually a lot of steps here, and doing it by hand would be rather (ahem, very) tedious. So it's all automated by zap, the package and system administration tool in Tribblix. The user asks for a router zone, and all it needs to be given is the zone's name, the public IP address, and the subnet address, and all the work will be done automatically. It saves all the required details so that they can be picked up later. Likewise for a regular zone, it will do all the configuration based on the IP address you specify, with no extra input required from the user.<br /><br />The whole aim here is to make building zones, and whole systems of zones, much easier and more reliable. And there's still a lot more capability to add.<br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-8008898916062290422023-11-04T22:34:00.000+00:002023-11-04T22:34:31.844+00:00Keeping python modules in check<p>Any operating system distribution - and <a href="http://www.tribblix.org/">Tribblix</a> is no different - will have a bunch of packages for <a href="https://www.python.org/">python</a> modules.</p><p>And one thing about python modules is that they tend to depend on other python modules. Sometimes a lot of python modules. Not only that, the dependency will be on a specific version - or range of versions - of particular modules.</p><p>Which opens up the possibility that two different modules might require incompatible versions of a module they both depend on.</p><p>For a long time, I was a bit lax about this. 
Most of the time you can get away with it (often because module writers are excessively cautious about newer versions of their dependencies). But occasionally I got bitten by upgrading a module and breaking something that used it, or breaking it because a dependency hadn't been updated to match.</p><p>So now I always check that I've got all the dependencies listed in packaging with</p><p></p><blockquote><span style="font-family: courier;">pip3 show modulename</span></blockquote><p></p><p>and every time I update a module I check the dependencies aren't broken with</p><p></p><blockquote><span style="font-family: courier;">pip3 check</span></blockquote><p></p><p>Of course, this relies on the machine having all the (interesting) modules installed, but on my main build machine that is generally true.</p><p>If an incompatibility is picked up by <span style="font-family: courier;">pip3 check</span> then I'll either not do the update, or update any other modules to keep in sync. If an update is impossible, I'll take a note of which modules are blockers, and wait until they get an update to unjam the process.</p><p>A case in point was that <span style="font-family: courier;">urllib3</span> went to version 2.x recently. At first, nothing would allow that, so I couldn't update <span style="font-family: courier;">urllib3</span> at all. Now we're in a situation where I have one module I use that won't allow me to update <span style="font-family: courier;">urllib3</span>, and am starting to see a few modules requiring <span style="font-family: courier;">urllib3</span> to be updated, so those are held downrev for the time being.</p><p>The package dependencies I declare tend to be the explicit module dependencies (as shown by pip3 show). Occasionally I'll declare some or all of the optional dependencies in packaging, if the standard use case suggests it. And there's no obvious easy way to emulate the notion of extras in package dependencies. 
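The dependency check lends itself to a little scripting. Here is a self-contained sketch that pulls the Requires: line out of `pip3 show`-style output; canned text is used instead of a live pip3 so the module name and versions are purely illustrative:

```shell
# Extract the dependency list from `pip3 show` style output.
# The sample text below stands in for a real `pip3 show modulename`.
extract_requires() {
  awk -F': ' '/^Requires:/ { print $2 }'
}

sample='Name: requests
Version: 2.31.0
Requires: certifi, charset-normalizer, idna, urllib3'

deps=$(printf '%s\n' "$sample" | extract_requires)
echo "$deps"
```

On a real system you would pipe `pip3 show modulename` straight into the function instead of the heredoc-style sample.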
But that can be handled in package overlays, which is the safest way in any case.</p><p>Something else the checking can pick up is when a dependency is removed, which is something that can be easily missed.</p><p>Doing all the checking adds a little extra work up front, but should help remove one class of package breakage.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-13284798112466360822023-10-27T11:49:00.000+01:002023-10-27T11:49:14.089+01:00It seemed like a simple problem to fix<p>While a bit under the weather last week, I decided to try and fix what at first glance appears to be a simple problem:</p><p></p><blockquote>need to ship the manpage with exa</blockquote><p></p><p>Now, <a href="https://the.exa.website/">exa</a> is a modern file lister, and the package on Tribblix doesn't ship a man page. The reason for that, it turns out, is that there isn't a man page in the source, but you can generate one.</p><p>To build the man page requires pandoc. OK, so how to get pandoc, which wasn't available on Tribblix? It's written in <a href="https://www.haskell.org/">Haskell</a>, and I did have a Haskell package.</p><p>Only my version of Haskell was a bit old, and wouldn't build pandoc. The build complains that it's too old and unsupported. You can't even build an old version of pandoc, which is a little peculiar.</p><p>Off to upgrade Haskell then. You need Haskell to build Haskell, and it has some specific requirements about precisely which versions of Haskell work. I wanted to get to 9.4, which is the last version of Haskell that builds using make (and I'll leave Hadrian for another day). 
You can't build Haskell 9.4 with 9.2, which it claims is too new; you have to go back to 9.0.</p><p>Fortunately we do have some <a href="https://us-central.manta.mnx.io/pkgsrc/public/pkg-bootstraps/index.html">bootstrap kit</a>s for illumos available, so I pulled 9.0 from there, successfully built Haskell, then cabal, and finally pandoc.</p><p>Back to exa. At which point you notice that it's been deprecated and replaced by <a href="https://eza.rocks/">eza</a>. (This is a snag with modern point tools. They can disappear on a whim.)</p><p>So let's build eza. At which point I find that the MSRV (Minimum Supported Rust Version) has been bumped to 1.70, and I only had 1.69. Another update required. Rust is actually quite simple to package: you can just <a href="https://forge.rust-lang.org/infra/other-installation-methods.html">download the stable version</a> and package it.</p><p>After all this, exa still doesn't have a man page, because it's deprecated (if you run <span style="font-family: courier;">man exa</span> you get something completely different from X.Org). But I did manage to upgrade Haskell and Cabal, I managed to package pandoc, I updated rust, and I added a replacement utility - eza - which does now come with a man page.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-85033391201473676932023-10-09T20:34:00.001+01:002023-10-09T20:34:37.409+01:00When zfs was young<p>On the Solaris 10 Platinum Beta program, one of the most exciting promised features was ZFS, the new file system.<br /><br />I was especially interested, given that I was in a data-heavy position at the time. The limits of UFS were painful, we had datasets into several terabytes already - and even the multiterabyte file system support that got added was actually pretty useless because the inode density was so low. 
We tried QFS and SAM-QFS, and they were pretty appalling too.<br /><br />ZFS was promised, and didn't arrive. In fact, there were about 4 of us on the beta program who saw the original zfs implementation, and it was quite different from what we have now. What eventually landed as zfs in Solaris was a complete rewrite. The beta itself was interesting - we were sent the driver, 3 binaries, and a 3-line cheatsheet, and that was it. There was a fundamental philosophy here that the whole thing was supposed to be so easy to use and sufficiently obvious that it didn't need a manual, and that was actually true. (It's gotten rather more complex since, to be fair.)<br /><br />The original version was a bit different in terms of implementation than what you're used to, but not that much. The most obvious change was that originally there wasn't a top-level file system for a pool. You created a pool, and then created your file systems. I'm still not sure which is the correct choice. And there was a separate zacl program to handle the ACLs, which were rather different.<br /><br />In fact, ACLs have been a nightmare of bad implementations throughout their history on Solaris. I already had previous here, having got the POSIX draft ACL implementation reworked for UFS. The original zfs implementation had default aka inheritable ACLs applied to existing objects in a directory. (If you don't immediately realise how bad that is, think of what this allows you to do with hard links to files.) The ACL implementations have continued to be problematic - consider that zfs allows 5 settings for the aclinherit property as evidence that we're glittering a turd at this point.<br /><br />Eventually we did get zfs shipped in a Solaris 10 update, and it's been continually developed since then. 
The openzfs project has given the file system an independent existence, it's now in FreeBSD, you can run it (and it runs well) on Linux, and in other OS variations too.<br /><br />One of the original claims was that zfs was infinitely scalable. I remember it being suggested that you could create a separate zfs file system for each user. I had to try this, so got together a test system (an Ultra 2 with an A1000 disk array) and started creating file systems. Sure, it got into several thousand without any difficulty, but that's not infinite - think universities or research labs and you can easily have 10,000 or 100,000 users, we had well over 20,000. And it fell apart at that scale. That's before each is an NFS share, too. So that idea didn't fly.<br /><br />Overall, though, zfs was a step change. The fact that you had a file system that was flexible and easily managed was totally new. The fact that a file system actually returned correct data rather than randomly hoping for the best was years ahead of anything else. Having snapshots that allowed users to recover from accidentally deleted files without waiting days for a backup to be restored dramatically improved productivity. It's win after win, and I can't imagine using anything else for storing data.<br /><br />Is zfs perfect? Of course not, and to my mind one of the most shocking things is that nothing else has even bothered to try and come close.<br /><br />There are a couple of weaknesses with zfs (or related to zfs, if I put it more accurately). One is that it's still a single-node file system. While we have distributed storage, we still haven't really matured that into a distributed file system. 
The second is that while zfs has dragged storage into the 21st century, allowing much more sophisticated and scalable management of data, there hasn't been a corresponding improvement in backup, which is still stuck firmly in the 1980s.<br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com2tag:blogger.com,1999:blog-9726833.post-79207881282926224382023-10-04T19:52:00.000+01:002023-10-04T19:52:38.094+01:00SMF - part of the Solaris 10 legacy<p>The Service Management Facility, or SMF, integrated extremely late in the Solaris 10 release cycle. We only got one or two beta builds to test, which seemed highly risky for such a key feature.<br /><br />So there was very little time to gather feedback from users. And something that central really can't be modified once it's released. It had to work first time.<br /><br />That said, we did manage some improvements. The current implementation of `svcs -x` is largely due to me struggling to work out why a service was broken.<br /><br />One of the obvious things about SMF is that it relies on manifests written in XML. Yes, that's of its time - there's a lot of software you can date by the file format it uses.<br /><br />I don't have a particular problem with the use of XML here, to be honest. What's more of a real problem is that the manifest files were presented as a user interface rather than an internal implementation detail, so that users were forced to write XML from scratch with little to no guidance.<br /><br />There are a lot of good features around SMF.<br /><br />Just the very basic restart of an application that dies is something that's so blindingly obvious as a requirement in an operating system. So much so that once it existed I refused to support anything that didn't have SMF when I was on call - after all, most of the 3am phone calls were to simply restart a crashed application. 
And yes, when we upgraded our systems to Solaris 10 with SMF our availability went way up and the on-call load plummeted. <br /><br />Being able to grant privileges to a service, and just within the context of that service, without having to give privileges to an application (eg set*id) or a user, makes things so much safer. Although in practice it's letting applications bind to privileged ports while running as a regular user, as that's far and away the most common use case.<br /><br />Dependencies have been a bit of a mixed bag. Partly because working out what the dependencies should be in the first place is just hard to get right, but also because dependency declaration is bidirectional - you can inject a dependency on yourself into another service, and that other service may not respond well, or you can create a circular dependency if the two services are developed independently.<br /><br />One part of dependency management in services is deciding whether a given service should start or not given the state of other services (such as its dependencies). Ideally, you want strict dependency management. In the real world, systems are messy and complicated, the dependency tree isn't terribly well understood, and some failure modes don't matter. And in many cases you want the system to try and boot as far as possible so you can get in and fix it.<br /></p><p>A related problem is that we've ended up with a complex mesh of services because someone had to take the old mess of rc scripts and translate them into something that would work on day 1. And nobody - either at the time or since - has gone through the services and studied whether the granularity is correct. 
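The privilege grant described above can be sketched as follows, with the commands printed rather than executed. The service name network/myapp is invented for the example; on a live system you would drop the wrappers and use svccfg/svcadm directly to set the start method's user and privileges:

```shell
# Printed, not executed: let a service bind a privileged port as a
# non-root user by granting it net_privaddr in its start method
# context. The service name is invented for illustration.
svccfg_cmd() { echo "svccfg $*"; }
svcadm_cmd() { echo "svcadm $*"; }

svccfg_cmd -s network/myapp setprop start/user = astring: webservd
svccfg_cmd -s network/myapp setprop start/privileges = astring: basic,net_privaddr
svcadm_cmd refresh network/myapp
svcadm_cmd restart network/myapp
```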
One other thing - that again has never happened - would be, once we got a good handle on what services there are, to look at whether the services we have are sensible, or whether there's an opportunity to rearchitect the system to do things better. And because all these services are now baked into SMF, it's actually quite difficult to do any major reworking of the system.<br /></p><p>Not only that, but because people write SMF manifests, they simply copy something that looks similar to the problem at hand, so bad practices and inappropriate dependency declarations multiply.<br /><br />This is one example of what I see as the big problem with SMF - we haven't got supporting tools that present the administrator with useful abstractions, so that everything is raw.<br /><br />In terms of configuration management, SMF is very much a mixed bag. Yes, it guarantees a consistent and reproducible state of the system. The snag is that there isn't really an automated way to capture the essential state of a system and generate something that will reproduce it (either later or elsewhere) - it can be done, but it's essentially manual. (Backing up the state is a subset of this problem.)<br /><br />It's clear that there were plans to extend the scope of SMF. Essentially, to be the Solaris version of the Windows registry. Thankfully (see also systemd for where this goes wrong) that hasn't happened much.<br /><br />In fact, SMF hasn't really evolved in any material sense since the day it was introduced. It's very much stuck in time.<br /><br />There were other features that were left open. For example, there's the notion of the scope of SMF, and the only one available right now is the "localhost" scope - see the smf(7) manual in illumos - so in theory there could be other, non-localhost, scopes. 
And there was the notion of monitor methods, which never appeared but I can imagine solving a range of niggling application issues I've seen over the years.<br /><br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-6437020662409778412023-09-11T15:56:00.005+01:002023-09-11T15:56:43.517+01:00Retiring isaexec in Tribblix<p>One of the slightly unusual features in illumos, and Solaris because that's where it came from, is <span style="font-family: courier;">isaexec</span>.</p><p>This facility allows you to have multiple implementations of a binary, and then <span style="font-family: courier;">isaexec</span> will select the best one (for some definition of best).</p><p>The full implementation allows you to select from a wide range of architectures. On my machine it'll allow the following list:</p><p><span style="font-family: courier;"></span></p><blockquote><p><span style="font-family: courier;"></span></p><blockquote><p><span style="font-family: courier;">amd64 pentium_pro+mmx pentium_pro<br />pentium+mmx pentium i486 i386 i86</span></p></blockquote></blockquote><p>If you wanted, you could ship a highly tuned pentium_pro binary, and eke out a bit more performance.<br /></p><p>The common case, though, and it's actually the only way <span style="font-family: courier;">isaexec</span> is used in illumos, is to simply choose between a 32-bit and 64-bit binary. This goes back to when Solaris and illumos supported 32-bit and 64-bit hardware in the same system (and you could actually choose whether to boot 32-bit or 64-bit under certain circumstances). In this case, if you're running a 32-bit kernel you get a 32-bit application; if you're running 64-bit then you can get the 64-bit version of that application.</p><p>Not all applications got this treatment. Anything that needed to interface directly with the kernel did (eg the <span style="font-family: courier;">ps</span> utility). 
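The selection isaexec performs can be sketched in a few lines of shell. The ISA list is hardcoded here so the sketch runs anywhere; on illumos the list would come from isalist(1), and the real isaexec is a small C wrapper rather than a script:

```shell
# Sketch of isaexec's search: try each ISA subdirectory in order and
# fall back to the plain path. ISALIST is hardcoded for the sketch;
# a real implementation would read the system's isalist.
ISALIST="amd64 i386"

pick_binary() {
  base="$1"
  for isa in $ISALIST; do
    if [ -x "/usr/bin/${isa}/${base}" ]; then
      echo "/usr/bin/${isa}/${base}"
      return 0
    fi
  done
  # nothing ISA-specific found: fall back to the plain path
  echo "/usr/bin/${base}"
}

pick_binary ls
```

With only 64-bit kernels left, the loop always resolves to the amd64 entry, which is why the redirection can be retired and the 64-bit binary placed directly in the PATH.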
And for others it was largely about performance or scalability. But most userland applications were 32-bit, and still are in illumos. (Solaris has migrated most to 64-bit now, we ought to do the same.)</p><p>It's been 5 years or more since illumos removed the 32-bit kernel, so the only option is to run in 64-bit mode. So now, <span style="font-family: courier;">isaexec</span> will only ever select the 64-bit binary.<br /></p><p>A while ago, Tribblix simply removed the remaining 32-bit binaries that <span style="font-family: courier;">isaexec</span> would have executed on a 32-bit system. This saved a bit of space.</p><p>The upcoming m32 release goes further. In almost all cases <span style="font-family: courier;">isaexec</span> is no longer involved, and the 64-bit binary sits directly in the PATH (eg, in <span style="font-family: courier;">/usr/bin</span>). There's none of the wasted redirection. I have put symbolic links in, just in case somebody explicitly referenced the 64-bit path.</p><p>This is all done by manipulating packaging - Tribblix runs the IPS package repo through a transformation step to produce the SVR4 packages that the distro uses, and this is just another filter in that process.</p><p>(There are a handful of exceptions where I still have 32-bit and 64-bit. Debuggers, for example, might need to match the bitness of the application being debugged. 
And the way that <span style="font-family: courier;">sh</span>/<span style="font-family: courier;">ksh</span>/<span style="font-family: courier;">ksh93</span> is installed needs a slightly less trivial transformation to get it right.)<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-84591639360672549612023-09-04T18:56:00.001+01:002023-09-04T18:56:43.802+01:00Modernizing scripts in Tribblix<p>It's something I've been putting off for far too long, but it's about time to modernize all the shell scripts that <a href="http://www.tribblix.org/">Tribblix</a> is built on.</p><p>Part of the reason it's taken this long is the simple notion of, if it ain't broke, don't fix it.</p><p>But some of the scripting was starting to look a bit ... old. Antiquated. Prehistoric, even.</p><p>And there's a reason for that. Much of the scripting involved in Tribblix is directly derived from the system administration scripts I've been using since the mid-1990s. That involved managing Solaris systems with SVR4 packages, and when I built a distribution derived from OpenSolaris, using SVR4 packages, I just lifted many of my old scripts verbatim. And even new functionality was copied or slightly modified.<br /></p><p>Coming from Solaris 2.3 through 10, this meant that they were very strictly Bourne Shell. A lot of the capabilities you might expect in a modern shell simply didn't exist. And much of the work was to be done in the context of installation (i.e. Jumpstart) where the environment was a little sparse.</p><p>The most obvious code smell is extensive use of backticks rather than $(). 
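The modernization is largely mechanical; for instance:

```shell
# The same command substitution in both styles; $() nests cleanly
# where backticks need awkward escaping.
old=`echo hello | tr a-z A-Z`
new=$(echo hello | tr a-z A-Z)
# nesting, which is painful with backticks:
nested=$(basename "$(echo /tmp/somefile)")
echo "$old $new $nested"
```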
Some of this I've refactored over time, but looking at the code now, not all that much.</p><p>One push for this was adding <a href="https://www.shellcheck.net/">ShellCheck</a> to Tribblix (it was a little bit of a game getting Haskell and Cabal to play nice, but I digress).</p><p>Running ShellCheck across all my scripts gave it a lot to complain about. Some of the complaints are justified, although many aren't (it's very enthusiastic about quoting everything in sight, even when that would be completely wrong).</p><p>But generally it's encouraged me to clean the scripts up. It's even managed to find a bug, and inspecting the code it flagged as simply rubbish has turned up a few more.</p><p>The other push here is to speed things up. Tribblix is often fairly quick in comparison to other systems, but it's not quick enough for me. But more of that story later.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-43469480112818973032023-08-24T11:07:00.000+01:002023-08-24T11:07:21.583+01:00Speed up zone installation with this one weird trick<p>Sadly, the trick described below won't work in current releases of Solaris, or any of the <a href="https://illumos.org/">illumos</a> distributions. But back in the day, it was pretty helpful.</p><p>In Solaris 10, we had sparse root zones - which shared <span style="font-family: courier;">/usr</span> with the global zone, which not only saved space because you didn't need a copy of all the files, but creating them was much quicker because you didn't need to take the time to copy all the files.</p><p>Zone installation for sparse root zones was typically about 3 minutes for us - this was 15 years ago, so mostly spinning rust and machines a bit slower than we're used to today.</p><p>That 3 minutes sounds quick, but I'm an impatient soul, and so were my users. Could I do better?</p><p>Actually, yes, quite a bit. What's contributing to that 3 minutes? 
There's a bit of adding files (the <span style="font-family: courier;">/etc</span> and <span style="font-family: courier;">/var</span> filesystems are not shared, for reasons that should be fairly obvious). And you need to copy the packaging metadata. But that's just a few files.</p><p>Most of the time was taken up by building the contents file, which simply lists all the installed files and what package they're in. It loops over all the packages, merging all the files in that package into the contents file, which thus grows every time you process a package.</p><p>The trick was to persuade it to process the packages in an optimal order. You want to do all the little packages first, so that the contents file stays small as long as possible.</p><p>And the way to do that was to recreate the <span style="font-family: courier;">/var/sadm/pkg</span> directory. It was obvious that it was simply reading the directory and processing packages in the order that it found them. And, on ufs, this is the order that the packages were added to the directory. So what I did was move the packages to one side, create an empty <span style="font-family: courier;">/var/sadm/pkg</span>, and move the package directories back in size order (which you can get fairly easily by looking at the size of the spooled pkgmap files).</p><p>This doesn't quite mean that the packages get processed in size order, as it does the install in dependency order, but as long as dependencies are specified it otherwise does them in size order.</p><p>The results were quite dramatic - with no other changes, this took zone install times from the original 3 minutes to 1 minute. 
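The size-ordering idea can be sketched self-containedly with mock packages (the names and sizes are invented); the listing produced is the ascending-size order in which the package directories would be moved back into a recreated /var/sadm/pkg:

```shell
# Build three mock package directories with pkgmap files of different
# sizes, then list them smallest-first - the order the trick relies on.
tmp=$(mktemp -d)
mkdir -p "$tmp/TRIVbig" "$tmp/TRIVmid" "$tmp/TRIVsmall"
head -c 300 /dev/zero > "$tmp/TRIVbig/pkgmap"
head -c 200 /dev/zero > "$tmp/TRIVmid/pkgmap"
head -c 100 /dev/zero > "$tmp/TRIVsmall/pkgmap"

# pair each package with its pkgmap size, sort numerically, keep names
order=$(for f in "$tmp"/*/pkgmap; do
  printf '%s %s\n' "$(wc -c < "$f")" "$(basename "$(dirname "$f")")"
done | sort -n | awk '{print $2}')

echo $order
rm -rf "$tmp"
```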
Much happier administrators and users.</p><p>This trick doesn't work at all on zfs, sadly, because zfs doesn't simply create a linear list of directory entries and put new ones on the end.</p><p>And all this is irrelevant for anything using IPS packaging, which doesn't do sparse-root zones anyway, and is a completely different implementation.</p><p>And even in <a href="http://www.tribblix.org/">Tribblix</a>, which does have sparse-root zones like Solaris 10 did, and uses SVR4 packaging, the implementation is orders of magnitude quicker because I just create the contents file in a single pass, so a sparse zone in Tribblix can install in a second or so.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-16752445269910215512023-08-23T16:36:00.001+01:002023-08-23T16:36:02.881+01:00Remnants of closed code in illumos<p>One of the annoying issues with <a href="https://illumos.org/">illumos</a> has been the presence of a body of closed binaries - things that, for some reason or other, were never able to be open sourced as part of OpenSolaris.</p><p>Generally the illumos project has had some success in replacing the closed pieces, but what's left isn't entirely zero. It took me a little while to work out what's still left, but as of today the list is:</p><blockquote><blockquote><p>etc/security/tsol/label_encodings.gfi.single<br />etc/security/tsol/label_encodings.example<br />etc/security/tsol/label_encodings.gfi.multi<br />etc/security/tsol/label_encodings<br />etc/security/tsol/label_encodings.multi<br />etc/security/tsol/label_encodings.single<br />usr/sbin/chk_encodings<br />usr/xpg4/bin/more<br />usr/lib/raidcfg/mpt.so.1<br />usr/lib/raidcfg/amd64/mpt.so.1<br />usr/lib/iconv/646da.8859.t<br />usr/lib/iconv/8859.646it.t<br />usr/lib/iconv/8859.646es.t<br />usr/lib/iconv/8859.646fr.t<br />usr/lib/iconv/646en.8859.t<br />usr/lib/iconv/646de.8859.t<br />usr/lib/iconv/646it.8859.t<br
/>usr/lib/iconv/8859.646en.t<br />usr/lib/iconv/8859.646de.t<br />usr/lib/iconv/iconv_data<br />usr/lib/iconv/646fr.8859.t<br />usr/lib/iconv/8859.646da.t<br />usr/lib/iconv/646sv.8859.t<br />usr/lib/iconv/8859.646.t<br />usr/lib/iconv/646es.8859.t<br />usr/lib/iconv/8859.646sv.t<br />usr/lib/fwflash/verify/ses-SUN.so<br />usr/lib/fwflash/verify/sgen-SUN.so<br />usr/lib/fwflash/verify/sgen-LSILOGIC.so<br />usr/lib/fwflash/verify/ses-LSILOGIC.so<br />usr/lib/labeld<br />usr/lib/locale/POSIX<br />usr/lib/inet/certlocal<br />usr/lib/inet/certrldb<br />usr/lib/inet/amd64/in.iked<br />usr/lib/inet/certdb<br />usr/lib/mdb/kvm/amd64/mpt.so<br />usr/lib/libike.so.1<br />usr/lib/amd64/libike.so.1<br />usr/bin/pax<br />platform/i86pc/kernel/cpu/amd64/cpu_ms.GenuineIntel.6.46<br />platform/i86pc/kernel/cpu/amd64/cpu_ms.GenuineIntel.6.47<br />lib/svc/manifest/network/ipsec/ike.xml<br />kernel/kmdb/amd64/mpt<br />kernel/misc/scsi_vhci/amd64/scsi_vhci_f_asym_lsi<br />kernel/misc/scsi_vhci/amd64/scsi_vhci_f_asym_emc<br />kernel/misc/scsi_vhci/amd64/scsi_vhci_f_sym_emc<br />kernel/strmod/amd64/sdpib<br />kernel/drv/amd64/adpu320<br />kernel/drv/amd64/atiatom<br />kernel/drv/amd64/usbser_edge<br />kernel/drv/amd64/sdpib<br />kernel/drv/amd64/bcm_sata<br />kernel/drv/amd64/glm<br />kernel/drv/amd64/intel_nhmex<br />kernel/drv/amd64/lsimega<br />kernel/drv/amd64/marvell88sx<br />kernel/drv/amd64/ixgb<br />kernel/drv/amd64/acpi_toshiba<br />kernel/drv/amd64/mpt<br />kernel/drv/adpu320.conf<br />kernel/drv/usbser_edge.conf<br />kernel/drv/mpt.conf<br />kernel/drv/intel_nhmex.conf<br />kernel/drv/sdpib.conf<br />kernel/drv/lsimega.conf<br />kernel/drv/glm.conf<br /><br /></p></blockquote></blockquote><p>Actually, this isn't much. In terms of categories:</p><p>Trusted, which includes those label_encodings, and labeld. Seriously, nobody can realistically run trusted on illumos (I have, it's ... interesting). 
So these don't really matter.</p><p>The iconv files actually go with the closed iconv binary, which we replaced ages ago, and our copy doesn't and can't use those files. We should simply drop those (they will be removed in Tribblix next time around).</p><p>There's a set of files connected to IKE and IPSec. We should replace those, although I suspect that modern alternatives for remote access will start to obsolete all this over time.</p><p>The scsi_vhci files are to get multipathing correctly set up on some legacy SAN systems. If you have to use such a SAN, then you need them. If not, then you're in the clear.</p><p>There are a number of drivers. These are mostly somewhat aged. The sdp stuff is being removed anyway as part of <a href="https://github.com/illumos/ipd/blob/master/ipd/0029/README.md">IPD29</a>, so that'll soon be gone. Chances are that very few people will need most of these drivers, although mpt was fairly widely used (there was an open mpt replacement <a href="https://www.illumos.org/issues/3">in the works</a>). Eventually the need for the drivers will dwindle to zero as systems with them in no longer exist (and, by the same token, we wouldn't need them for something like an aarch64 port).</p><p>Which just leaves 2 commands.</p><p>Realistically, the XPG4 more could be replaced by less. The standard was based on the behaviour of less, after all. I'm tempted to simply delete /usr/xpg4/bin/more and make it a link to less and have done with it.</p><p>As for pax, it's required by POSIX, but to be honest I've never used it, haven't seen anywhere that uses it, and read support is already present in things like libarchive and gtar. 
The <a href="https://sourceforge.net/projects/heirloom/">heirloom</a> pax is probably more than good enough.</p><p>In summary, illumos isn't quite fully open source, but it's pretty close and for almost all cases we could put together a fully functional open subset that'll work just fine.</p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com1tag:blogger.com,1999:blog-9726833.post-12425352189555495312023-08-09T12:46:00.001+01:002023-08-09T12:46:42.437+01:00Static Site Generators<p>The current <a href="http://www.tribblix.org/">Tribblix</a> website is a bit of a hack. Technically it's using a static site generator - a simple home-grown script that constructs pages from a bit of content and boilerplate - but I wanted to be able to go a bit further.</p><p>I looked at a few options - and there are really a huge number of them - such as <a href="https://gohugo.io/">Hugo</a> and <a href="https://www.getzola.org/">Zola</a>. (Both are packaged for Tribblix now, by the way.)</p><p>In the end I settled on <a href="https://nanoc.app/">nanoc</a>. That's packaged too (and I finally got around to having a very simple - rather naive - way of packaging gems).</p><p>Why nanoc, though? 
In this case it was really because it could take the html page fragments I already had and create the site from those, and after tweaking it slightly I end up with exactly the same html output as before.</p><p>Other options might be better if I was starting from scratch, but it would have been much harder to retain the fidelity of the existing site.</p><p>One advantage of the new system is that I can put the site under proper source control, so the repo is <a href="https://github.com/tribblix/tribblix-website">here</a>.<br /></p><p>There's still a lot of work to be done on filling out the content, but it should be easier to evolve the Tribblix website in future.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-8728710205870635752023-07-13T14:34:00.000+01:002023-07-13T14:34:30.754+01:00Zones, way back when<p>The original big ticket feature in Solaris 10 was Zones, a simple virtualization technology that allowed a set of processes to be put aside in a separate namespace and be under the illusion that this was a separate computer system, all under a single shared kernel.<br /><br />As a result of this sleight of hand, you could connect to a zone using ssh (or, remember this was way back, telnet or rsh), and from the application level you really were in a separate system - with your own file system and network namespaces. It was like magic.<br /><br />Of the features in Solaris 10, Zones and DTrace were present early in the beta cycle, while SMF just made it into the last couple of beta builds, and ZFS wasn't actually available to customers until well after the first Solaris 10 release.<br /><br />I ended up using zones in production quite accidentally. In the Solaris 10 Platinum Beta, we were testing the new features, just giving them a good beating, when one of our webservers (it was something like a Netra X1) died. Sure, we could have got it repaired, or reconfigured another server. 
But as an experiment, I simply fired up a zone on one of my beta systems, gave it the IP address of the failed server, installed apache, copied over the website, and we were back in service in about 5 minutes.<br /><br />The Zones framework turns out to be incredibly flexible and powerful. I suspect most don't realize just what it's actually capable of, as Sun only gave you a canned product in two variations - whole-root and sparse-root zones. Later you saw glimpses of the power available with the first incarnation of LX zones (or SCLA - Solaris Containers for Linux Applications) and then the Solaris 8 and Solaris 9 containers, which allowed a different set of applications to run inside a zone.<br /><br />Things actually became more limited in OpenSolaris and its derivatives such as Solaris 11; not only was LX removed, but so were sparse-root zones, and the diversity of potential zone types dwindled.<br /><br />In illumos, some of the distributions have pushed Zones a bit further. Tribblix brought back sparse root zones, and introduced the alien brand - essentially a way to run any illumos OS or application in a zone. OmniOS has brought back LX, and it's reasonably current (in terms of keeping up with changes in the Linux world). SmartOS ran KVM in Zones, allowing double-hulled virtualization. And we now have bhyve as a fully supported offering for any illumos distribution, usually embedded in a Zone.<br /><br />Using a sparse-root zone is incredibly efficient. By sharing the main operating system files (mostly /lib and /usr, but can be others) you can save huge amounts of disk space - you only have to have one copy so that's a saving of anything from a couple of hundred megabytes to a couple of gigabytes of storage per zone. It gets better, because the read-only segments of any binaries and shared libraries are shared between zones, which dramatically reduces the additional memory footprint of each zone. 
Further on from that, because Solaris has this trick whereby any shared object used more than 8 times (or something like that) is kept resident in memory, all the common applications are always in memory and start incredibly quickly.<br /><br />One of the things I did was use sparse-root zones and shared filesystems for a development -> test -> production setup. Basically, you create 3 zones, sparse-root ensures they're identical, and 3 filesystems - one each for development, test, and production. You share the development filesystem read-only into the test zone, so deployment from development to test is a straight copy. Likewise test to production.<br /><br />One of the weaknesses of the way that zones were managed (distinct from the underlying technology framework) is that it was based around packaging. In Solaris 10, packaging and packages knew about zones, and the details about what files and packages ended up in a zone were embedded in the package metadata. Not only is this complex, it's also very rigid - you can't evolve the system without changing the packaging system and modifying all the packages. Sadly, IPS carried forward the same mistake. (In Tribblix, packaging knows nothing about zones whatsoever, but my zones understand packaging and can do the right thing with it - not only with much more flexibility but many times quicker.)<br /><br />Later on in the Solaris 10 timeframe we got ZFS, which allowed you to do interesting things around sharing data and quickly creating copies of data for zones, allowing you to extend the virtual capabilities of zones from cpu and memory to storage. 
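As a rough illustration of that development -> test -> production setup (the zone and dataset names here are invented, and this is a sketch of generic zonecfg/zfs usage rather than the exact original configuration):

```shell
# Illustrative only: create the three filesystems, then loopback-mount
# the development area read-only inside the (sparse-root) test zone,
# so deployment to test is a straight copy on the global zone side.
zfs create rpool/export/dev
zfs create rpool/export/test
zfs create rpool/export/prod

zonecfg -z testzone <<EOF
add fs
set dir=/export/dev
set special=/export/dev
set type=lofs
add options ro
end
commit
EOF
```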
And the key missing piece, virtualized networking, never made it to Solaris 10 at all, but had to wait for crossbow to arrive in OpenSolaris.<br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com1tag:blogger.com,1999:blog-9726833.post-77568163772731948362023-05-08T15:15:00.002+01:002023-05-08T15:15:54.685+01:00Maintaining old software with no sign of retirement<p>There's a lot of really old software out there. Some of it has simply been abandoned; others have been replaced by new versions. But old software never really goes away, and we end up maintaining it.<br /><br />This is especially tricky when old software depends on other old software, and we have to support the entire dependency tree.<br /><br />There's always python 2 and python 3. Some old software may never be fixed; some current software has consciously decided to stick to python 2. Distributions will be shipping python 2 for a long time yet.<br /><br />Then there's PCRE and PCRE2. Some things have been updated; others haven't. Generally for this I'll keep updating, and eventually upstream might get around to migrating. But again I'll have to ship both for a while.<br /><br />And then there's gtk2 and gtk3. (I find it ironic that the gimp itself is still using gtk2.) There's no end in sight of the need to ship both.<br /><br />Some libraries have been deprecated entirely. The old libXp (the X printing library) is long gone. There were a couple of things built against it in Tribblix. I've just rebuilt chimera (a really old Xaw web browser, if your memory doesn't go that far back) which was one consumer and now isn't; the other one was Motif (there's a convenient build flag --disable-printing to disable libXp support, which entertainingly breaks the build someplace else, which I ended up having to fix).<br /><br />Another example: libpng has gone through several different revisions. 
Each slightly incompatible, and you have to be sure to run with the same version you built against. At least you can ship all the different versions, as they have the version in the names. Mind you, linking against 2 different versions of libpng at the same time (for example, if a dependency pulls in a different version of libpng) is a bad thing, so I did have to rebuild a number of applications to avoid that. I ship the old libpng versions in a separate compat package; I think chimera was the only consumer, but I updated that to use a more current libpng.<br /><br />A slightly different problem is the use of newer toolchains. Compilers are getting stricter over time, so old unmaintained software needs patches to even compile.<br /><br />Don't even get me started on openssl.<br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-89485290718120350202023-05-07T10:01:00.000+01:002023-05-07T10:01:07.163+01:00Upgrading MATE on Tribblix<p>I spent a little time yesterday updating <a href="https://mate-desktop.org/">MATE</a> on <a href="http://www.tribblix.org/">Tribblix</a>, to version 1.26.<br /><br />This was supposed to be part of the "next" release, but we had to make an out of sequence release for an illumos security issue, so everything gets pushed back a bit.<br /><br />Updating MATE is actually fairly easy, though. The components in MATE are largely decoupled, so can be updated independently of each other. (And there isn't really a MATE framework everything has to subscribe to, so the applications can be used outside MATE without any issues.)<br /><br />There's a bit of tidying up and polish that helps. For example, I delete static archives and the harmful libtool archive files. Not only does this save space, it helps maintainability down the line.<br /><br />Builds have a habit of picking up dependencies from the build system. 
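One way to catch the duplicate-libpng situation described above is to audit what a built binary actually links against. A hypothetical helper (not part of Tribblix), using ldd:

```shell
# Hypothetical helper: report a binary whose dependency tree pulls in
# more than one libpng version at the same time.
multi_libpng() {
    n=$(ldd "$1" 2>/dev/null | awk '/libpng/ {print $1}' | sort -u | wc -l)
    if [ "$((n + 0))" -gt 1 ]
    then
        echo "$1: linked against $n libpng versions"
    fi
    return 0
}
```

Running something like that over a freshly built tree is also a quick way to see exactly which dependencies a build has picked up.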
Sometimes you can control this with judicious --enable-foo or --disable-foo flags; sometimes you just have to make sure that the package you don't want pulled in isn't installed. The reverse is true - if you want a feature to be enabled, you have to make sure the dependencies are installed first and the feature will usually get enabled automatically.<br /><br />That's not always true. For example, you have to explicitly tell it you have OSS for audio; it doesn't work this out on its own.<br /><br />I took the opportunity to make everything 64-bit. Ultimately I want to get to 64-bit only. This involves a bit of working backwards - you have to make all consumers of a library 64-bit only first.<br /><br />A couple of components are held downrev. The calculator now wants to pull in mpc and mpfr, which I don't package. (They're used by gcc, but I drop a copy of mpc and mpfr into the build for gcc to find rather than packaging them separately the way that most of the other illumos distributions do.) And pluma wants gtksourceview-4, which I don't have yet. This is related to the lack of tight coupling I mentioned earlier - there really isn't any problem having the different pieces that make up MATE at different revisions.<br /><br />You stumble across bugs along the way. For example, mate-control-center actually needs GLib 2.66 or later, which I don't have yet (there's another whole set of issues behind that), but it doesn't actually check for the right version. 
Fortunately the requirement is fairly localized and easy to patch out.<br /><br />That done, on to another set of updates...<br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-26979060532461405492023-03-22T19:22:00.000+00:002023-03-22T19:22:14.949+00:00SPARC Tribblix m26 - what's in a number?<p>I've just released <a href="http://www.tribblix.org/download.html#sparc">Tribblix m26</a> for SPARC.</p><p>The release history on SPARC looks a little odd - m20, m20.6, m22, m25.1, and now m26. Do these release versions mean anything?</p><p>Up to and including m25.1, the illumos commit that the SPARC version was built from matched the corresponding x86 release. This is one reason there might be a gap in the release train - that commit might not build or work on SPARC.</p><p>As of m26, the version numbers start to diverge between SPARC and x86. In terms of illumos-gate, this release is closer to m25.2, but the added packages are generally fairly current, closer to m29. So it's a bit of a hybrid.</p><p>But the real reason this is a full release rather than an m25 update is to establish a new baseline, which allows me to establish compatibility guarantees and roll over versions of key components, in this case it allows me to upgrade perl.</p><p>In the future, the x86 and SPARC releases are likely to diverge further. Clearly SPARC can't track the x86 releases perfectly, as SPARC support is being removed from the mainline source following <a href="https://github.com/illumos/ipd/blob/master/ipd/0019/README.md">IPD 19</a>, and many of the recent changes in illumos simply aren't relevant to SPARC anyway. 
So future SPARC releases are likely to simply increment independently.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-57623975951768269962023-03-12T16:27:00.000+00:002023-03-12T16:27:11.582+00:00How I build the Tribblix AMIs<p>I run <a href="http://www.tribblix.org/">Tribblix</a> on AWS, and <a href="http://www.tribblix.org/aws.html">make some AMIs available</a>. They're only available in London (eu-west-2) by default, because that's the only place where I use them, and it costs money to have them available in other regions. If you want to run them elsewhere, you can copy the AMI.</p><p>It's not actually that difficult to create the AMIs, once you've got the hang of it. Certainly some of the instructions you might find can seem a little daunting. So here's how I do it. Some of the details here are very specific to my own workflow, but the overall principles are fairly generic. The same method would work for any of the illumos distributions, and you could customize the install however you wish.<br /></p><p>The procedure below assumes you're running Tribblix m29 and have bhyve installed.</p><p>The general process is to boot and install an instance into bhyve, then boot that and clean it up, save that disk as an image, upload to S3, and register an AMI from that image.<br /></p><p>You need to use the minimal ISO (I actually use a custom, even more minimal ISO, but that's just a convenience for myself). 
Just launch that as root:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">zap create-zone -t bhyve -z bhyve1 \<br />-x 192.168.0.236 \<br />-I /var/tmp/tribblix-0m29-minimal.iso \<br />-V 8G</span><br /></p><p>Note that this creates an 8G zvol, which is the starting size of the AMI.</p><p>Then run socat as root to give you a VNC socket to talk to</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">socat TCP-LISTEN:5905,reuseaddr,fork UNIX-CONNECT:/export/zones/bhyve1/root/tmp/vm.vnc<br /></span></p><p>and as yourself, run the vnc viewer</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">vncviewer :5<br /></span></p><p>Once it's finished booting, log in as root and install with the <span style="font-family: courier;">ec2-baseline</span> overlay which is what makes sure it's got the pieces necessary to work on EC2.</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">./live_install.sh -G c1t0d0 ec2-baseline</span><br /></p><p>Back as root on the host, ^C to get out of socat, remove the ISO image and reboot, so it will boot from the newly installed image.</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">zap remove-cd -z bhyve1 -r</span><br /></p><p>Restart socat and vncviewer, and log in to the guest again.</p><p>What I then do is to remove any configuration or other data from the guest that we don't want in the final system. 
(This is similar to the old <span style="font-family: courier;">sys-unconfig</span> that many of us who used Solaris will be familiar with.)<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">zap unconfigure -a</span><br /></p><p>I usually also ensure that a functional <span style="font-family: courier;">resolv.conf</span> exists, just in case dhcp doesn't create it correctly.</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">echo "nameserver 8.8.8.8" > /etc/resolv.conf</span></p><p>Back on the host, shut the instance down by shutting down the bhyve zone it's running in:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">zoneadm -z bhyve1 halt</span><br /></p><p>Now the zfs volume you created contains a suitable image. All you have to do is get it to AWS. First copy the image into a plain file:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">dd if=/dev/zvol/rdsk/rpool/bhyve1_bhvol0 of=/var/tmp/tribblix-m29.img bs=1048576<br /></span></p><p>At this point you don't need the zone any more so you can get rid of it:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">zap destroy-zone -z bhyve1</span><br /></p><p>The raw image isn't in a form you can use, and needs converting. 
There's a useful tool - the <a href="https://github.com/imcleod/VMDK-stream-converter">VMDK stream converter</a> (there's also a download <a href="https://mirrors.omnios.org/vmdk/VMDK-stream-converter-0.2.tar.gz">here</a>) - just untar it and run it on the image:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">python2 ./VMDK-stream-converter-0.2/VMDKstream.py /var/tmp/tribblix-m29.img /var/tmp/tribblix-m29.vmdk</span><br /></p><p>Now copy that vmdk file (and it's also a lot smaller than the raw img file) up to S3, in the following you need to adjust the bucket name from <span style="font-family: courier;">mybucket</span> to something of yours:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">aws s3 cp --cli-connect-timeout 0 --cli-read-timeout 0 \<br />/var/tmp/tribblix-m29.vmdk s3://mybucket/tribblix-m29.vmdk<br /></span></p><p>Now you can import that image into a snapshot:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">aws ec2 import-snapshot --description "Tribblix m29" \<br />--disk-container file://m29-import.json</span><br /></p><p>where the file <span style="font-family: courier;">m29-import.json</span> looks like this:<br /></p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">{<br /> "Description": "Tribblix m29 VMDK",<br /> "Format": "vmdk",<br /> "UserBucket": {<br /> "S3Bucket": "mybucket",<br /> "S3Key": "tribblix-m29.vmdk"<br /> }<br />}</span><br /></p><p>The command will give you a snapshot id, that looks like <span style="font-family: courier;">import-snap-081c7e42756d7456b</span>, which you can follow the progress of with</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">aws ec2 describe-import-snapshot-tasks --import-task-ids import-snap-081c7e42756d7456b</span></p><p>When that's finished it will give you the snapshot id itself, such as <span style="font-family: 
courier;">snap-0e0a87acc60de5394</span>. From that you can register an AMI, with</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">aws ec2 register-image --cli-input-json file://m29-ami.json</span><br /></p><p>where the <span style="font-family: courier;">m29-ami.json</span> file looks like:</p><p style="margin-left: 40px; text-align: left;"><span style="font-family: courier;">{<br /> "Architecture": "x86_64",<br /> "Description": "Tribblix, the retro illumos distribution, version m29",<br /> "EnaSupport": false,<br /> "Name": "Tribblix-m29",<br /> "RootDeviceName": "/dev/xvda",<br /> "BlockDeviceMappings": [<br /> {<br /> "DeviceName": "/dev/xvda",<br /> "Ebs": {<br /> "SnapshotId": "snap-0e0a87acc60de5394"<br /> }<br /> }<br /> ],<br /> "VirtualizationType": "hvm",<br /> "BootMode": "legacy-bios"<br />}<br /></span></p><p>If you want to create a Nitro-enabled AMI, change "<span style="font-family: courier;">EnaSupport</span>" from "<span style="font-family: courier;">false</span>" to "<span style="font-family: courier;">true</span>", and "<span style="font-family: courier;">BootMode</span>" from "<span style="font-family: courier;">legacy-bios</span>" to "<span style="font-family: courier;">uefi</span>".</p><p><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-42005714572303692362023-03-11T10:27:00.000+00:002023-03-11T10:27:21.938+00:00What, no fsck?<p>There was a huge amount of resistance early on to the fact that zfs didn't have an fsck. 
Or, rather, a <i>separate</i> fsck.<br /><br />I recall being in Sun presentations introducing zfs and question after question was about how to repair zfs when it got corrupted.<br /><br />People were so used to shoddy file systems - ones so badly implemented that a separate utility was needed to repair errors caused by fundamental design and implementation mistakes in the file system itself - that the idea that the file system driver itself ought to take responsibility for managing the state of the file system was totally alien.<br /><br />If you think about ufs, for example, there were a number of known failure modes, and what you did was take the file system offline, run the checker against it, and it would detect the known errors and modify the bits on disk in a way that would hopefully correct the problem. (In reality, if you needed it, there was a decent chance it wouldn't work.) Doing it this way was simple laziness - it would be far better to just fix ufs so it wouldn't corrupt the data in the first place (ufs logging went a long way towards this, eventually). And you were only really protecting against known errors, where you understood exactly the sequence of events that would cause the file system to end up in a corrupted state, so that random corruption was either undetectable or unfixable, or both.<br /><br />The way zfs thought about this was very different. To start with, eliminate all known behaviour that can cause corruption. The underlying copy on write design goes a long way, and updates are transactional so either complete or not. If you find a new failure mode, fix that in the file system proper. 
And then, correction is built in rather than separate, which means that it doesn't need manual intervention by an administrator, and all repairs can be done without taking the system offline.<br /><br />Thankfully we've moved on, and I haven't heard this particular criticism of zfs for a while.<br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-81131861866679960092022-11-21T14:43:00.000+00:002022-11-21T14:43:04.738+00:00A decade of Tribblix<p>I seem to have just missed the anniversary, but it turns out that <a href="http://www.tribblix.org/">Tribblix</a> has existed for slightly over a decade.</p><p>The initial blog post on <a href="https://ptribble.blogspot.com/2012/10/building-tribblix.html">Building Tribblix </a>was published on October 24th, 2012. But the ISO image (milestone 0) was October 21st, and it looks like the packages were built on October 4th. So there's a bit of uncertainty about the actual date, and I had been playing around with some of the bits and pieces for a while before that.</p><p>There have been a lot of releases. We're now on Milestone 28, but there have been several update releases along the way, so I make it 42 distinct releases in total. That doesn't include the LX-enabled OmniTribblix variant (there have been 20 of those by the way).</p><p>The focus (given hardware availability) has been x86, naturally. But the SPARC version has seen occasional bursts of life. Now I have a decent build system, it's catching up. Will there be an ARM version? Who knows...</p><p>Over the years there have been some notable highlights. 
It took a few releases to become fully self-hosting; package management had to be rebuilt; LibreOffice was ported; Xfce and MATE added as fully functional desktop offerings (with a host of others); a whole family of zones, including reimplementing the traditional sparse root; made available on clouds like AWS and Digital Ocean; network install via iPXE; huge numbers of packages (it's never-ending churn); and maintaining Java by default.</p><p>And it's been (mostly) fun. Here's to the next 10 years!</p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com6tag:blogger.com,1999:blog-9726833.post-61450374717511466282022-11-20T21:27:00.002+00:002022-11-20T21:27:46.863+00:00TREASURE - The Remote Execution and Access Service Users Really Enjoy<p>Many, many years ago I worked on a prototype of a software ecosystem I called TREASURE - The Remote Execution and Access Service Users Really Enjoy.</p><p>At the time, I was running the infrastructure and application behind an international genomics service. The idea was that we could centrally manage all the software and data for genomic analysis, provide high-end compute and storage capability, and amortize the cost across 20,000 academics so that individual researchers didn't have to maintain it all individually.</p><p>Originally, access was via telnet (I did say it was a long time ago). After a while we enabled X11, so that graphical applications would work (running X11 directly across the internet was fun).</p><p>Then along came the web. One of my interesting projects was to write a web server that would run with the privileges of the authenticated user. (This was before apache came along, by the way!) And clearly a web browser might be able to provide a more user-friendly and universal interface than a telnet prompt.</p><p>We added VNC as well (it came out of Cambridge and we were aware of it well before it became public), so that users could view graphical applications more easily. 
This had a couple of advantages - all the hard work and complexity was at our end, where we had control, and X11 is quite latency-sensitive so performance improved.</p><p>But ultimately what I wanted to do was to run the GUI on the user's machine, with access to the user's files. Remember that the GUI is then not running where the software, genome databases, and all the compute power are located.</p><p>Hence the Remote Execution part of TREASURE - what we wanted was a system that would call across to a remote service to do the work, and return the result to the user. And the Access part was about making it accessible and transparent, which would lead to a working environment that people would enjoy using.</p><p>At its core, TREASURE was originally a local GUI that knew how to run applications. Written in Java, it would therefore run on pretty much any client (and we had users with all sorts of Unix workstations in addition to Windows making inroads). The clever bit was to replace the Java Runtime.getRuntime().exec() calls that ran applications locally with some form of remote procedure call. Being of its time, this might involve CORBA, RMI, SOAP, or JAX-WS with data marshalled as XML. In fact, I implemented pretty much every remote call mechanism available (and this did in fact come in useful as other places did make available some services using pretty random protocols). And then of course there's the server side which was effectively a CGI script.</p><p>The other key part was to work out which files needed to be sent across. Sometimes it was obvious (it's a GUI, the user has selected a file to analyse), but sometimes we needed to send across auxiliary files as well. And on the server side it ran in a little sandbox so you knew which output files had been generated and could return those.</p><p>Effectively, this was a production form of serverless computing running over 20 years ago. 
Only we called it GRID computing back then.</p><p>Another interesting feature of the architecture was the TREASURE CHEST, which was a source of applications. There were literally hundreds of possible applications you could run, and many more if you included interfaces to other providers. So rather than write all those into the app, there was a plugin system where you could download a jar file and run it as a plugin, and the TREASURE CHEST was where you could find these applications. Effectively an app store, in modern terminology.</p><p>Sadly the department got closed down due to political incompetence, so the project never really got beyond the prototype stage. And while I still have bits and pieces of code, I don't appear to have a copy of the whole thing. A lot of the components would need to be replaced, but the overall concept is still sound.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-67173387532991255152022-11-15T12:07:00.000+00:002022-11-15T12:07:07.393+00:00Tribblix for SPARC m25.1<p>Following hot on the heels of the <a href="https://ptribble.blogspot.com/2022/11/tribblix-for-sparc-m22-iso-now-available.html">Tribblix Milestone 22 ISO for SPARC</a>, it's possible to upgrade that to a newer version. The new version that's available is m25.1.</p><p>(If the available versions look a bit random, that's because they are. Not every release on x86 was built for SPARC, and not all of the ones that were actually worked properly. So we have what we have.)</p><p>The major jump, aside from the underlying OS, in m25.1 for SPARC is that it brings in gcc7 (for applications, illumos itself is still built with gcc4), and generally there's a bunch of more modern applications available.</p><p>To upgrade m22 to m25.1 is a manual process. 
This is because certain steps are necessary, and if you don't follow them exactly the system won't boot.</p><p>The underlying cause of the various problems in this process is that it's a big jump from m22 to m25.1 and you will hit bugs in the upgrade process that have been fixed in intermediate releases.</p><p>First, take a note of the current BE, e.g. tribblix. You might need it later if things go bad and you need to reboot into the current (hopefully working) release.</p><p>You can manually add available versions for upgrade with the following trick (this is just one line, despite how it might be formatted):</p><p></p><blockquote><span style="font-family: courier;">echo "m25.1|http://pkgs.tribblix.org/release-m25.1.sparc/TRIBzap.0.0.25.1.zap|Tribblix m25.1" >> /etc/zap/version.list</span></blockquote><p></p><p>and check that's visible with</p><p></p><blockquote><span style="font-family: courier;">zap upgrade list</span></blockquote><p></p><p>and then start the upgrade with</p><p></p><blockquote><span style="font-family: courier;">zap upgrade m25.1</span></blockquote><p></p><p><b>Do not activate or reboot yet!</b></p><p>You <b>MUST</b> do the following:</p><p></p><blockquote><span style="font-family: courier;">beadm mount m25.1 /a<br />zap install -C /a TRIBshell-ksh93<br />pkgadm sync -q -R /a<br />beadm umount m25.1</span></blockquote><p></p><p>and then you should be safe to reboot:</p><p></p><blockquote><span style="font-family: courier;">beadm activate m25.1<br />init 6</span></blockquote><p></p><p>If it doesn't come back, you can boot into the previous release (that you took the name of earlier, remember) from the ok prompt</p><p></p><blockquote><span style="font-family: courier;">boot -Z rpool/ROOT/tribblix</span></blockquote><p></p><p>Once you're up and running on m25.1, it's time to clean up.</p><p></p><blockquote><span style="font-family: courier;">zap refresh</span></blockquote><p></p><p>and then remove some of the old opensxce 
packages</p><p></p><blockquote><span style="font-family: courier;">zap uninstall \<br />SUNWfont-xorg-core \<br />SUNWfont-xorg-iso8859-1 \<br />SUNWttf-dejavu \<br />SUNWxorg-clientlibs \<br />SUNWxorg-xkb \<br />SUNWxvnc \<br />SUNWxwcft \<br />SUNWxwfsw \<br />SUNWxwice \<br />SUNWxwinc \<br />SUNWxwopt \<br />SUNWxwxft \<br />SUNWxwrtl \<br />SUNWxwplr \<br />SUNWxwplt</span></blockquote><p></p><p>and then bring packages up to current</p><p></p><blockquote><span style="font-family: courier;">zap update-overlay -a</span></blockquote><p></p><p>and this should give you a system that's in a workable state, roughly matching my active SPARC environment.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com2tag:blogger.com,1999:blog-9726833.post-14543546537118939322022-11-14T10:08:00.000+00:002022-11-14T10:08:45.364+00:00Tribblix for SPARC m22 ISO now available<p>I've made available a newer ISO image for Tribblix on SPARC.</p><p>This is an m22 ISO. So it's actually relatively old compared to the mainstream x86 release.</p><p>I actually had a number of random SPARC ISO images, but for a while I've had no way of testing any of them. (And many of the problems with the SPARC ISOs in general are because I had no real way of testing them properly.)</p><p>Enter a newish T4-1 (thanks Andy!), and I can now easily create an LDOM, assign it a zvol for a root disk and an ISO image to boot from, and testing is trivial again. And while some of the ISO images I have are clearly so broken as to not be worth considering, the m22 version looks pretty reasonable.</p><p>In terms of available application packages, it exactly matches the old m20 release. I do have newer packages on some of my test systems, but they are built with a newer gcc and so need a proper upgrade path. But that's going to be easier now too.</p><p>There is a minor error on the m22 ISO, in that the xz package shipped appears to be wrong. 
To fix, simply</p><p><span style="font-family: courier;">zap install TRIBcompress-xz</span></p><p>and to update to the latest available applications (the ISO is from early 2020, the repo from mid-2021)</p><p><span style="font-family: courier;">zap refresh<br />zap update TRIBlib-security-openssl<br />zap update-overlay -a</span><br /></p><p>The reason for updating openssl on its own is that a number of applications are compiled against openssl 1.1.1, so you need to be sure that gets updated first.</p><p>The next step is to push on to something newer.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com1tag:blogger.com,1999:blog-9726833.post-84103405547347185782022-10-11T17:20:00.000+01:002022-10-11T17:20:17.967+01:00DevOps as a HR problem<p>I wrote about one way in which <a href="https://ptribble.blogspot.com/2022/10/on-intersection-between-it-and-hr.html">HR and IT can operate more closely</a>, but there's another interaction between IT and HR that might not be so benign.<br /></p><p>DevOps is ultimately about breaking down silos in IT (indeed, my definition of DevOps is a cultural structure where teams work together to meet the needs of the business rather than competing against each other to meet the needs of the team).<br /></p><p>However, in a business, individuals and teams are actually playing a game in which the rules and criteria for success are set by HR in the shape of the (often annual) review cycle. And all too often promotions, pay rises, even restructuring, are based around individual and team performance in isolation. And who can blame individuals and teams for optimising their behaviour around the performance targets they've been set?<br /></p><p>It's similar to Conway's Law, in which the outputs of an organisation mirror its organisational structure - here, the outputs of an organisation will mirror the performance targets that have been set. 
If you want to improve collaboration and remove silos, then make sure that HR are on board and get them to explicitly put those goals into the annual performance targets.<br /><br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-45125297306788401752022-10-04T10:49:00.001+01:002022-10-04T10:49:31.429+01:00On the intersection between IT and HR<p>A while ago I mentioned <a href="https://ptribble.blogspot.com/2021/12/the-three-strands-of-information.html">The three strands of Information Technology</a>, and how this was split into an internal-facing component (IT for the business, IT for the employee) and an external-facing one (IT for the customer).</p><p>In a pure technology company, there's quite a mismatch, with the customer-facing component being dominant and the internal-facing parts being minimised. In this case, do you actually need an IT department, in the traditional sense?</p><p>You need a (small) team to do the work, of course. But one possibility is to assign them not to a separate IT organisation but to the HR department.</p><p>Why would you do this? Well, the primary role of internal IT in a technology company is simply to make sure that new starters get the equipment and capabilities they need on day one, and that they hand stuff back and get their access removed when they leave. And if there's one part of an organisation that knows when staff are arriving and leaving, it's the HR department. Integrating internal IT directly into the rest of the onboarding and offboarding process dramatically simplifies communication.</p><p>It helps security and compliance too. 
One of the problems you often see with the traditional setup where IT is completely separate from HR is that it can take forever to revoke a staff member's access when they leave; integrating the two functions massively shortens that cycle.<br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-41926909920851846772022-08-19T21:16:00.000+01:002022-08-19T21:16:30.537+01:00Tribblix m28, Digital Ocean, and vioscsi<p>A while ago, I wrote about <a href="https://ptribble.blogspot.com/2021/04/running-tribblix-on-digital-ocean.html">deploying Tribblix on Digital Ocean</a>.</p><p>The good news is that the same process works flawlessly with the recently released m28 version of Tribblix.</p><p>If you recall from the original article, adding an additional storage volume didn't work. But we now have a vioscsi driver, so has the situation improved?</p><p>Yes!</p><p>All I did was select an additional volume when creating the droplet. Then if I run format, I see:</p><p><span style="font-family: courier;"></span></p><blockquote><p><span style="font-family: courier;">AVAILABLE DISK SELECTIONS:<br /> 0. c3t0d1 <DO-Volume-2.5+ cyl 13052 alt 2 hd 255 sec 63><br /> /pci@0,0/pci1af4,8@5/iport@iport0/disk@0,1<br /> 1. c4t0d0 <Virtio-Block Device-0000-25.00GB><br /> /pci@0,0/pci1af4,2@6/blkdev@0,0</span></p></blockquote><p>That 25GB Virtio-Block device is the root device used for rpool; the other one is the 100GB additional volume. 
It's also visible in diskinfo:</p><p></p><p><span style="font-family: courier;"></span></p><blockquote><p><span style="font-family: courier;">TYPE DISK VID PID SIZE RMV SSD<br />SCSI c3t0d1 DO Volume 100.00 GiB no no <br />- c4t0d0 Virtio Block Device 25.00 GiB no no <br />- c5t0d0 Virtio Block Device 0.00 GiB no no <br /></span></p><p></p></blockquote><p>(That empty c5t0d0 is the metadata service, by the way.)</p><p></p><p>Let's create a pool:</p><p><span style="font-family: courier;"></span></p><blockquote><span style="font-family: courier;">zpool create store c3t0d1</span></blockquote><p></p><p>It just works. And performance isn't too shabby - I can read and write at 300MB/s.</p><p>There you go. More options for running illumos.</p><p> <br /></p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0tag:blogger.com,1999:blog-9726833.post-72790749764733986912022-07-09T14:54:00.000+01:002022-07-09T14:54:05.946+01:00Tribblix and static networking on AWS<p>I've just made available the m27 <a href="http://www.tribblix.org/aws.html">AMIs for Tribblix</a>. As usual, these are just available in London (eu-west-2).</p><p>One thing I've noticed repeatedly while running illumos on AWS is that network stability isn't great. The instance will occasionally drop off the network and stubbornly refuse to reclaim its IP address even if you reboot it. It's not just Tribblix; I run a whole lot of OmniOS on AWS and that does the same thing.</p><p>The problem appears to be related to DHCP not being able to pick up the address (even though I can see it sending out the correct requests and getting what look like legitimate responses).<br /></p><p>So what I do is convert the running instance from using NWAM and being a DHCP client to having statically configured networking. On first boot it needs to use DHCP, because it cannot know what its IP address and network configuration should be until it's booted up once and used DHCP to get the details. 
But it's extremely rare to take an AWS instance and change its networking - you would simply build new instances rather than modifying existing ones - so changing it to static is fine, and eliminates any possibility of DHCP failures messing you up in future.<br /></p><p>In the past I've always done this manually, but now there's a much easier way if you're using m27 or later:</p><p><span style="font-family: courier;">zap staticnet</span><br /></p><p>will show you what the system will do, just as a sanity check, and then</p><p><span style="font-family: courier;">zap staticnet -y</span></p><p>will implement the change.</p>Peter Tribblehttp://www.blogger.com/profile/09363446984245451854noreply@blogger.com0
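For readers who are on an older release, or who just want to see what's involved, the manual NWAM-to-static conversion that the post describes can be sketched with standard illumos commands. This is a sketch only: the interface name (xnf0), the addresses, and the AWS resolver address are illustrative assumptions, not values from the post - substitute what your own instance reports.

```shell
# Sketch of a manual NWAM-to-static conversion on an illumos AWS instance.
# Interface name (xnf0) and all addresses below are examples - note down
# what your instance actually has before switching.
ipadm show-addr          # record the DHCP-assigned address
netstat -rn              # record the default gateway

# Switch from the NWAM profile to the default (static) networking profile.
svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default

# Recreate the interface and pin the address recorded above.
ipadm create-if xnf0
ipadm create-addr -T static -a 172.31.10.5/20 xnf0/v4

# Persist the default route and point at the VPC resolver.
route -p add default 172.31.0.1
echo "nameserver 169.254.169.253" > /etc/resolv.conf
```

This is the kind of sequence `zap staticnet` wraps up for you, with the added benefit that it reads the live configuration itself rather than relying on you to transcribe it correctly.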