Wednesday, February 27, 2019

Building illumos-gate on AWS (2019 version)

I've covered building illumos on AWS before, but the instructions there are a bit out of date. Of course, these will become outdated too, in time, but should still be useful.

First, spin up an EC2 instance. The AMI you want to use for this is "Tribblix-m20.5", ami-7cf2181b in London. (If you want to use a different region, then you'll have to copy the AMI.)

For instance size, try an m4.2xlarge. A t2.micro really won't cut it; you won't even be able to install the packages. With an m4.2xlarge you're looking at 30-45 minutes to build illumos, depending on whether debug is enabled.

Add an external EBS device of at least 8G, because there isn't enough space on the AMI's root partition to handle the build. When you add it, attach it at /dev/sdf (there's no real reason for this, but that will match the zpool create below).

Once it's booted up (I assume you know about security groups and keypairs, so you can ssh in as root), the first thing you need to do is update the image:

zap refresh
zap update-overlay -a

Install the illumos-build overlay and tweak the install

zap install-overlay illumos-build
rm -f /usr/bin/cpp
cd /usr/bin ; ln -s ../gnu/bin/xgettext gxgettext

Create a user and a storage pool for them to use

zpool create -O compression=lz4 illumos c2t5d0
useradd -g staff -d /illumos -s /bin/bash illumos
chown illumos /illumos
passwd illumos

Log in as illumos, and clone the gate. What I do here is have a reference copy of the gate, and then I can clone that very quickly locally every time I want to do a build.

mkdir ${HOME}/Illumos-reference
cd ${HOME}/Illumos-reference
git clone https://github.com/illumos/illumos-gate

If you want to create omnitribblix too

git clone https://github.com/omniosorg/illumos-omnios

Clone the relevant Tribblix repo

mkdir ${HOME}/Tribblix
cd ${HOME}/Tribblix
git clone https://github.com/tribblix/tribblix-build

The scripts need to know where the Tribblix repo(s) are checked out

export THOME=${HOME}/Tribblix

Create a build area with the gate checked out

mkdir ${HOME}/Illumos
cd ${HOME}/Illumos
git clone ${HOME}/Illumos-reference/illumos-gate

and if you want to build omnitribblix too

git clone ${HOME}/Illumos-reference/illumos-omnios omnitribblix

You need the closed tarballs

wget -c \
  https://download.joyent.com/pub/build/illumos/on-closed-bins.i386.tar.bz2 \
  https://download.joyent.com/pub/build/illumos/on-closed-bins-nd.i386.tar.bz2

Right, you're now ready to do a build.

cd illumos-gate

For a release build:

${THOME}/tribblix-build/illumos/releasebuild m20.6

For a debug build with a gcc7 shadow

${THOME}/tribblix-build/illumos/debugbuild

For a debug build without the gcc7 shadow

${THOME}/tribblix-build/illumos/debugbuild -q

For omnitribblix, which assumes (and requires) that you've done a vanilla gate build first

cd ${HOME}/Illumos/omnitribblix
${THOME}/tribblix-build/illumos/omnibuild m20lx.6

And there you have it: you should have a beautiful, cleanly built gate.

This differs from my previous recipe in a couple of key ways:
  1. I no longer require you to build in a zone, although I would still recommend doing so if you want to use the system for other things. But as this is an AWS instance, we can dedicate it to gate building.
  2. It's far more scripted. If you want to see what it's really doing, look inside the releasebuild, debugbuild, and omnibuild scripts.

Next time, I'll cover how the files from the build are converted into packages.

Sunday, February 10, 2019

Thoughts on SPARC support in illumos

One interesting property of illumos is that its legacy stretches back decades - there is truly ancient code rubbing shoulders with the very modern.

An area where we have really old code is on SPARC, where illumos has support in the codebase for a large variety of Sun desktops and servers.

There's a reasonable chance that quite a bit of this code is currently broken. Not because it's fundamentally poor code (although it's probably fair to say that the code quality is of its time, and a lot of it is really old), but because it lives within an evolving codebase and hasn't been touched in the lifetime of illumos, and likely much longer. Not only that, but it's probably making more assumptions about being built with the old Studio toolchain rather than with gcc.

What of this code is useful and worth keeping and fixing, and what should be dropped?

A first step in this was that I have recently removed support for starfire - the venerable Sun E10K. It seems extremely unlikely that anyone is running illumos on such a machine. Or indeed that anyone has them running at all - they're museum pieces at this point.

A similar, if rather newer, class of system is the starcat, the Sun F15K and variants. Again, it's big, expensive, requires dedicated controller hardware, and is unlikely to be the kind of thing anyone's going to have lying about. (And, if you're a business, there's no real point in trying to make such a system work - you would be much better off, both operationally and financially, in getting a current SPARC system.)

And if nobody has such a system, then not only is the code useless, it's also untestable.

The domained systems, like starfire and starcat, are also good candidates for removal because of the relative complexity and uniqueness of their code. And it's not as if the design specs for this hardware are out there to study.

What else might we consider removing (with starfire done and starcat a given)?

  1. The serengeti, Sun-Fire E2900-E6800. Another big blob of complex code.
  2. The lw8 (lightweight 8), aka the V-1280. This is basically some serengeti boards in a volume server chassis.
  3. Anything using Sbus. That would be the Ultra-2, and the E3000-E6000 (sunfire). There's also the socal, sf, and bpp drivers. One snag with removing the Ultra-2 is that it's used as the base platform for the newer US-II desktops, which link back to it.
  4. The olympus platform. That's anything from Fujitsu. One slight snag here is that the M3000 was quite a useful box and is readily available on eBay, and quite affordable too.
  5. Netra systems. Specifically NetraCT: there's a US-IIi NetraCT, and two US-IIe systems, the NetraCT-40 and the NetraCT-60. Code names montecarlo and makaha (something about Tonga too). Also the CP2300, aka snowbird.
  6. Server blade. I'm talking the old B100s blade here.
  7. Binary compatibility with SunOS 4 - this is kernel support for a.out, and libbc.

I'm not saying at this point that all of this code and platform support will go, just that these are the potential candidates. For example, I regard support for M3000 as useful, and definitely worth thinking about keeping.

What does that leave as supported? Most of the US-II and US-III desktops, most of the V-series servers, and pretty much all the early sun4v (T1 through T3 chips) systems. In other words, the sort of thing that you can pick up second hand fairly easily at this point.

Getting rid of code that we can never use has a number of benefits:

  • We end up with a smaller body of code that is thus easier to manage.
  • We end up with less code that needs to be updated, for example to make it gcc7 clean, or to fix problems found by smatch, or to enable illumos to adopt newer toolchains.
  • We can concentrate on the code that we have left, and improve its quality.

If we put those together into a single strategy, the overall aim is to take illumos for SPARC from a large body of unknown, untested, and unsupportable code to a smaller body of properly maintained, testable, and supportable code. Reduce quantity to improve quality, if you like.

As part of this project, I've looked through much of the SPARC codebase. And it's not particularly pretty. One reason for attacking starfire was that I was able to convince myself relatively quickly that I could come up with a removal plan that was well-bounded - it was possible to take all of it out without accidentally affecting anything else. Some of the other platforms need a bit more analysis to tease out all the dependencies and complexity - bits of code are shared between platforms in a variety of non-obvious ways.

The above represents my thoughts on what would be a reasonable strategy for supporting SPARC in illumos. I would naturally be interested in the views of others, and specifically if anyone is actually using illumos on any of the platforms potentially on the chopping block.

Friday, February 08, 2019

SPARC and tod modules on illumos

Following up from removing starfire support from illumos, I've been browsing through the codebase to identify more legacy code that shouldn't be there any more.

Along the way, I discovered a little tidbit about how the tod (time of day) modules - the interface to the hardware clock - work on SPARC.

If you look, there are a whole bunch of tod modules, and it's not at all obvious how they fit together - they all appear to be candidates for loading, with nothing to show how the correct one for a platform is chosen.

The mechanism is actually pretty simple, if a little odd.

There's a global variable in the kernel named:

tod_module_name

This can be set in several ways - for some platforms, it's hard-coded in that platform's platmod. Or it could be extracted from the firmware (OBP). That tells the system which tod module should be used.

And the way this works is that each tod module has _init code that looks like

if (tod_module_name is myself) {
   initialize the driver
} else {
   do nothing
}

so at boot all the tod modules get loaded, but only the one that matches the name set by the platform actually initializes itself.

Later in boot, there's an attempt to unload all modules. Similarly, the _fini for each driver essentially does

if (tod_module_name is myself) {
   I'm busy and can't be unloaded
} else {
   yeah, unload me
}

So, when the system finishes booting, you end up with only one tod module loaded and functional, and it's the right one.
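
To make the load-everything, keep-one behaviour concrete, here's a minimal user-space sketch of it in C. This is not the kernel code: the struct, the tod_init and tod_fini helpers, and tod_simulate_boot are all made up for illustration, the module names are just placeholders, and I'm assuming the busy case is reported as EBUSY.

```c
#include <errno.h>
#include <string.h>

/*
 * User-space simulation only. In the kernel, tod_module_name is a global
 * set by the platmod or taken from the firmware; here it is a plain
 * variable. The module names in the table are illustrative placeholders.
 */
static const char *tod_module_name;

struct tod_module {
	const char *name;
	int loaded;		/* module is resident */
	int initialized;	/* module set itself up as the tod driver */
};

static struct tod_module modules[] = {
	{ "todmostek", 0, 0 },
	{ "todds1287", 0, 0 },
	{ "todstarcat", 0, 0 },
};
#define	NMODULES	(sizeof (modules) / sizeof (modules[0]))

/* Stand-in for _init: every module loads, only the named one sets up. */
static int
tod_init(struct tod_module *m)
{
	m->loaded = 1;
	if (strcmp(tod_module_name, m->name) == 0)
		m->initialized = 1;
	return (0);
}

/* Stand-in for _fini: the named module claims to be busy and stays. */
static int
tod_fini(struct tod_module *m)
{
	if (strcmp(tod_module_name, m->name) == 0)
		return (EBUSY);
	m->loaded = 0;
	return (0);
}

/*
 * Simulate boot: load every module, then try to unload every module.
 * Returns the number of modules still loaded, and stores the name of the
 * survivor (or NULL if there is none) in *survivor.
 */
int
tod_simulate_boot(const char *platform_choice, const char **survivor)
{
	size_t i;
	int remaining = 0;

	tod_module_name = platform_choice;
	*survivor = NULL;

	for (i = 0; i < NMODULES; i++) {
		modules[i].loaded = 0;
		modules[i].initialized = 0;
		(void) tod_init(&modules[i]);
	}
	for (i = 0; i < NMODULES; i++) {
		(void) tod_fini(&modules[i]);
		if (modules[i].loaded) {
			remaining++;
			*survivor = modules[i].name;
		}
	}
	return (remaining);
}
```

In this sketch, calling tod_simulate_boot with a name from the table leaves exactly that one module loaded, while a name that no module recognizes leaves nothing loaded at all.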

Returning to the original question: can any of the tod modules be safely removed because no platform uses them? To be honest, I don't know. Some have names that match the platform they're for. It's pretty obvious, for example, that todstarfire goes with the starfire platform, so it was safe to take that out. But I don't know the module names returned by every possible piece of SPARC hardware, so it isn't really safe to remove some of the others. (And, as a further problem, I know that at least one is only referenced in closed-source, binary-only platform modules.)