The Trouble with Tribbles...

Thursday, April 24, 2025

On efficiency and resilience in IT

Years ago, I was in a meeting when a C-level executive proclaimed:

IT systems run at less than 10% utilization on average, so we're moving to the cloud to save money.

The logic behind this was that you could run systems in the cloud that were the size you needed, rather than the size you had on the floor.

Of course, this particular claim was specious. Did he know the average utilization of our systems, I asked. He did not. (It was at least 30%.)

Furthermore, measuring CPU utilization is just one aspect of a complex multidimensional space. Systems may have spare CPU cycles, but are hitting capacity limits on memory, memory bandwidth, network bandwidth, storage and storage bandwidth. It's rare to have a system so well balanced that it saturates all parameters equally.

Not only that, but the load on all systems fluctuate, even on very short timescales. There will always be troughs between the peaks. And, as we all know, busy systems tend to generate queues and congestion - or, as a technical term, higher utilization leads to increased latency.

Attempting to build systems that maximise efficiency implies minimizing waste. But if you always consider spare capacity as wasted capacity, then you will always get congested systems and slow response. (Just think about queueing at the tills in a supermarket where they've staffed them for average footfall.)

So guaranteeing performance and response time implies a certain level of overprovisioning.

Beyond that, resilient systems need to have sufficient capacity to not only handle normal fluctuations in usage, but abnormal usage due to failures and external events. And resilient design needs to have unused capacity to take up the slack when necessary.

In this case, a blinkered focus on efficiency not only leads to poor response, it also makes systems brittle and incapable of responding if a problem occurs.

A simple way to build resiliency is to have redundant systems - provision spare capacity that springs into action when needed. In such an active-passive configuration, the standby system might be idle. It doesn't have to be - you might use redundant systems for development/test/batch workloads (this presupposes you have a mechanism like Solaris zones to provide strong workload isolation).

Going to the cloud might solve the problem for a customer, but the cloud provider has exactly the same problem to solve, on a larger scale. They need to provision excess capacity to handle the variability in customer workloads. Which leads to the creation of interesting pricing models - such as reserved instances and the spot markets on AWS.

Tuesday, April 08, 2025

Understanding emission scopes, or failing to

I've been trying to get my head around all this Scope 1, Scope 2, Scope 3 emissions malarkey. Although it appears that lots of people smarter than me are struggling with it.

Having spent a while looking at how the Scopes are defined, I can understand how this can be difficult.

OK, Scope 1 is an organisation's direct emissions. Presumably an organisation knows what it's doing and how it's doing it, so getting the Scope 1 emissions from that ought to be fairly straightforward.

And Scope 2 is electricity, steam, heating and cooling purchased from someone else. I'm immediately suspicious here because this is a weirdly specific categorisation. But at least it should be easy to calculate - there's a conversion factor but at least you know the usage because it's on a bill you have to pay.

Then Scope 3 is - everything else. The fact that there are 15 official categories included ought to be a big red flag. That it's problematic is shown by the fact so many organisations have problems with it. (And by the growth of an industry to solve the problem for you.)

Personally, I wouldn't have defined it this way. If the idea is to evaluate emissions across the supply chain, then dumping almost all the emissions into the vaguest bucket is always going to be problematic.

So, why wasn't Scope 2 simply defined as the combined Scope 1 emissions of everyone providing services to the organisation. (That includes upstream and downstream, suppliers and employees, by the way.) That has 2 advantages I can see:

It's easy to calculate, because Scope 1 is pretty easy to calculate for all the providers of services (and they may well be doing it anyway), and an organisation ought to know who's providing services to it
It makes Scope 2 bigger (obviously) because there's more included, and therefore makes Scope 3 smaller, so uncertainties in Scope 3 matter less
Because you can better identify the contributors to your Scope 2 emissions, it's easier to know where to start making improvement efforts

I presume there's some reason it wasn't done this way, but I can't immediately see it.

Friday, April 04, 2025

What is this AI anyway?

AI is all the rage right now. It's everywhere, you can't avoid it.

But what is AI?

I'm not going to try and answer that here. What I will do, though, is state the question somewhat differently:

What is meant by "AI" in a given context?

And this matters, because the words we use are important.

The reality is that when you see AI mentioned it really could be almost anything. Some things AI might mean are:

Copilot
ChatGPT
Gemini
Some other specific off the shelf public LLM
Anything involving any off the shelf LLM
A custom domain-specific LLM
Machine learning
Pattern matching
Image recognition
Any old computer program
One of the AI companies

And there's always the possibility that someone has simply slapped AI on a product as a marketing term with no AI involved.

This persistent abuse of terminology is really unhelpful. Yesterday I went to a very interesting event for conversations about Hopes and Fears around AI.

Am I hopeful or fearful about AI? It depends which of the above definitions you mean.

There are certain uses of what might now be lumped in with AI that have proven to be very successful, but in many cases they're really machine learning, and have actually been around for a long time. I'm very positive about those (for example, helping in medical diagnoses).

On the other hand, if the AI is a stochastic parrot trained via large scale abuse of copyright while wreaking massive environmental damage, then I'm very negative about that.

So I think it's important to get away from sticking the AI label onto everything that might have some remote association with a computer program, and be far more careful in our terminology.

Tuesday, March 25, 2025

Tribblix on SPARC: sparse devices in an LDOM

I recently added a ddu like capability to Tribblix.

In that article I showed the devices in a bhyve instance. As might be expected there really aren't a lot of devices you need to handle.

What about SPARC, you might ask? Even if you don't, I'll ask for you.

Running Tribblix in a LDOM, this is what you see:

root@sparc-m32:/root# zap ddu
Device SUNW,kt-rng handled by n2rng in TRIBsys-kernel-platform [installed]
Device SUNW,ramdisk handled by ramdisk in TRIBsys-kernel [installed]
Device SUNW,sun4v-channel-devices handled by cnex in TRIBsys-ldoms [installed]
Device SUNW,sun4v-console handled by qcn in TRIBsys-kernel-platform [installed]
Device SUNW,sun4v-disk handled by vdc in TRIBsys-ldoms [installed]
Device SUNW,sun4v-domain-service handled by vlds in TRIBsys-ldoms [installed]
Device SUNW,sun4v-network handled by vnet in TRIBsys-ldoms [installed]
Device SUNW,sun4v-virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]
Device SUNW,virtual-devices handled by vnex in TRIBsys-kernel-platform [installed]

It's hardly surprising, but that's a fairly minimal list.

It does make me wonder whether to produce a special SPARC Tribblix image precisely to run in an LDOM. After all, I already have slightly different variants on x86 designed for cloud in general, and one for EC2 specifically, that don't need the whole variety of device drivers that the generic image has to include.

Sunday, March 23, 2025

Expecting an AI boom?

I recently went down to the smoke, to Tech Show London.

There were 5 constituent shows, and I found what each sub-show was offering - and the size of each component - quite interesting.

There wasn't much going on in Devops Live, to be honest. Relatively few players had shown up, nothing terribly interesting.

There wasn't that much in Big Data & AI World either. I was expecting much more here, and what there was seemed to be on the periphery. More support services than actual product.

The Cloud & Cyber Security Expo was middling, not great, and there was an AI slant in evidence. Not proper AI, but a sprinkling of AI dust on things just to keep up with the Joneses.

Cloud and AI Infrastructure had a few bright spots. I saw actual hardware on the floor - I had seen disk shelves over in the Big Data section, but here I spotted a Tape Library (I used to use those a lot, haven't seen much in that area for a while) and a VDI blade. Talked to a few people, including the Zabbix and Tailscale stands.

But when it came to Data Centre World, that was buzzing. It was about half the overall floor area, so it was far and away the dominant section. Tremendous diversity too - concrete, generators, power cables, electrical switching, fiber cables, cable management, thermal management, lots of power and cooling. Lots and lots of serious physical infrastructure.

There was an obvious expectation on display that there's a massive market around high-density compute. I saw multiple vendors with custom rack designs - rear-door and liquid cooling in evidence. Some companies addressing the massive demand for water.

If these people are at a trade show, then the target market isn't the 3 or 4 hyperscalers. What's being anticipated in this frenzy is very much companies building out their own datacentre facilities, and that's very much an interesting trend.

There's a saying "During a gold rush, sell shovels". What I saw here was a whole army of shovel-sellers getting ready for the diggers to show up.

Thursday, March 06, 2025

Tribblix, UEFI, and UFS

Somewhat uniquely among illumos distributions, Tribblix doesn't require installation to ZFS - it allows the possibility of installing to a UFS root file system.

I'm not sure how widely used this is, but it will get removed as an option at some point, as the illumos UFS won't work past Y2038.

I recently went through the process of testing an install of the very latest Tribblix to UFS, in a bhyve guest running UEFI. The UEFI part was a bit more work, and doing it clarified how some of the internals fit together.

(One reason for doing these unusual experiments is to better understand how things work, especially those that are handed automatically by more mainstream components.)

OK, on to installation.

While install to zfs will automatically lay out zfs pools and file systems, the ufs variant needs manual partitioning. There are two separate concerns - the Tribblix install, and UEFI boot.

The Tribblix installer for UFS assumes 2 things about the layout of the disk it will install to:

The slice s0 will be used to install the operating system to, and mounted at /.
The slice s1 will be used for swap. (On zfs, you create a zfs volume for swap; on ufs you use a separate raw partition.)

It's slightly unfortunate that these slices are hard-coded into the installer.

For UEFI boot we need 2 other slices:

A system partition (this is what's called EFI System partition, aka ESP)
A separate partition to put the stage2 bootloader in. (On zfs there's a little bit of free space you can use; there isn't enough on ufs so it needs to be handled separately.)

The question then arises as to how big these need to be. Now, if you create a root pool with ZFS (using zpool create -B) it will create a 256MB partition for ESP. This turns out to be the minimum size for FAT32 on 4k disks, so that's a size that should always work. On disks with a 512 block size, it needs to be 32MB or larger (there's a comment in the code about 33MB). The amount of data you're going to store there is very much less.

The stage2 partition doesn't have to be terribly big.

So as a result of this I'm going to create a GPT label with 4 slices - 0 and 1 for Tribblix, 3 and 4 for EFI system and boot.

There are 2 things to note here: First,the partitions you create don't have to be laid out on disk in numerical order, you can put the slices in any order you want. This was true for SMI disks too, where it was common practice in Solaris to put swap on slice 1 at the start of the disk with slice 0 after it. Second, EFI/GPT doesn't assign any special significance to slice 2, unlike the old SMI label where slice 2 was conventionally the whole disk. I'm avoiding slice 2 here not because it's necessary, but so as to not confuse anyone used to the old SMI scheme.

The first thing to do with a fresh disk is to go into format, invoked as format -e (expert mode in order to access the EFI options). Select the disk, run fdisk from inside format, and then install an EFI label.

format -e
#
# choose the disk
#
fdisk
y - to accept defaults
l - to label
1 - choose efi

Then we can lay out the partitions. Still in format, type p to enter the partition menu and p to display the partitions.

p - enter partition menu
p - show current partition table

At this point on a new disk it should have 8 as "reserved" and 0 as "usr", with everything else "unassigned". We're going to leave slice 8 untouched.

First note where slice 0 currently starts. I'll resize it at the end, but we're going to put slices 3, 4, and 1 at the start of the disk and then resize 0 to fill in what's left.

To configure the settings for a given slice, just type its number.

Start with slice 3, type 3 and configure the system partition. This has to use the "system" tag.

tag: system
flags: wm (just hit return to accept)
start: 34
size: 64mb

Type p again to view the partition table and note the last sector of slice 3 we just created, and add 1 to it to give the start sector of the next slice. Type 4 to configure the boot partition, and it must have the tag "boot".

tag: boot
flags: wm (just hit return to accept)
start: 131106
size: 16mb

Type p again to view the partition table, take note of the last sector for the new slice 4, and add 1 to get the start sector for the next one. Which is 1 for the swap partition.

tag: swap
flags: wm (just hit return to accept)
start: 65570
size: 512mb

We're almost done. The final step is to resize partition 0. Again you get the start sector by adding 1 to the last sector of the swap partition you just created. And rather than giving a size you can give the end sector using an 'e' suffix, which should be one less than the start of the reserved partition 8, and also the last sector of the original partition 0. Type 0 and enter something like:

tag: usr
flags: wm (just hit return to accept)
start: 1212450
size: 16760798e

Type 'p' one last time to view the partition table, check that the Tag entries are correct, and that the First and Last Sectors don't overlap.

Then type 'l' to write the label to the disk. It will ask you for the label type - make sure it's EFI again - and for confirmation.

Then we can do the install

./ufs_install.sh c1t0d0s0

It will ask for confirmation that you want to create the file system

At the end it ought to say "Creating pcfs on ESP /dev/rdsk/c1t0d0s3"

If it says "Requested size is too small for FAT32." then that's a hint that you need the system partition to be bigger. (An alternative trick is to mkfs the pcfs file system yourself, if you create it using FAT16 it will still work but you can get away with it being a lot smaller.)

It should also tell you that it's writing the pmbr to slice 4 and to p0.

With that, rebooting into the newly installed system ought to work.

Now, the above is a fairly complicated set of instructions. I could automate this, but do we really want to make it that easy to install to UFS?

Wednesday, February 19, 2025

Introducing a ddu-alike for Tribblix

Introducing a new feature in Tribblix m36. There's a new ddu subcommand for zap.

In OpenSolaris, the Device Driver Utility would map the devices it found and work out what software was needed to drive them. This isn't that utility, but is inspired by that functionality, rewritten for Tribblix as a tiny little shell script.

As an example, this is the output of zap ddu for Tribblix in a bhyve instance:

jack@tribblix:~$ zap ddu
Device acpivirtnex handled by acpinex in TRIBsys-kernel-platform [installed]
Device pci1af4,1000,p handled by vioif in TRIBdrv-net-vioif [installed]
Device pci1af4,1001 handled by vioblk in TRIBdrv-storage-vioblk [installed]
Device pci1af4,1 handled by vioif in TRIBdrv-net-vioif [installed]
Device pciclass,030000 handled by vgatext in TRIBsys-kernel [installed]
Device pciclass,060100 handled by isa in TRIBsys-kernel-platform [installed]
Device pciex_root_complex handled by npe in TRIBsys-kernel-platform [installed]
Device pnpPNP,303 handled by kb8042 in TRIBsys-kernel [installed]
Device pnpPNP,f03 handled by mouse8042 in TRIBsys-kernel [installed]

Simply put, it will list the devices it finds, which driver is responsible for them, and which package that driver is contained in (and whether that package is installed).

This, while a tiny little feature, is one of those small things that is actually stunningly useful.

If there's a device that we have a driver for that isn't installed, this helps identify it so you know what to install.

What this doesn't do (yet, and unlike the original ddu) is show devices we don't have a driver for at all.