Sunday, February 10, 2019

Thoughts on SPARC support in illumos

One interesting property of illumos is that its legacy stretches back decades - there is truly ancient code rubbing shoulders with the very modern.

An area where we have really old code is on SPARC, where illumos has support in the codebase for a large variety of Sun desktops and servers.

There's a reasonable chance that quite a bit of this code is currently broken. That's not because it's fundamentally poor code (although it's probably fair to say that the code quality is of its time, and a lot of it is really old), but because it lives within an evolving codebase and hasn't been touched in the lifetime of illumos, and likely for much longer. Not only that, but it probably assumes that it's being built with the old Studio toolchain rather than with gcc.

What of this code is useful and worth keeping and fixing, and what should be dropped?

As a first step, I recently removed support for starfire - the venerable Sun E10K. It seems extremely unlikely that anyone is running illumos on such a machine. Or indeed that anyone has them running at all - they're museum pieces at this point.

A similar, if rather newer, class of system is the starcat, the Sun F15K and variants. Again, it's big, expensive, requires dedicated controller hardware, and is unlikely to be the kind of thing anyone's going to have lying about. (And, if you're a business, there's no real point in trying to make such a system work - you would be much better off, both operationally and financially, in getting a current SPARC system.)

And if nobody has such a system, then not only is the code useless, it's also untestable.

The domained systems, like starfire and starcat, are also good candidates for removal because of the relative complexity and uniqueness of their code. And it's not as if the design specs for this hardware are out there to study.

What else might we consider removing (with starfire done and starcat a given)?

  1. The serengeti, Sun-Fire E2900-E6800. Another big blob of complex code.
  2. The lw8 (lightweight 8), aka the V-1280. This is basically some serengeti boards in a volume server chassis.
  3. Anything using Sbus. That would be the Ultra-2, and the E3000-E6000 (sunfire). There's also the socal, sf, and bpp drivers. One snag with removing the Ultra-2 is that it's used as the base platform for the newer US-II desktops, which link back to it.
  4. The olympus platform. That's anything from Fujitsu. One slight snag here is that the M3000 was quite a useful box and is readily available on eBay, and quite affordable too.
  5. Netra systems. (Specifically NetraCT - there's a US-IIi NetraCT, and two US-IIe systems, the NetraCT-40 and the NetraCT-60. Code names montecarlo and makaha (something about Tonga too). Also CP2300 aka snowbird.)
  6. Server blade. I'm talking the old B100s blade here.
  7. Binary compatibility with SunOS 4 - this is kernel support for a.out, and libbc.

I'm not saying at this point that all of this code and platform support will go; this is just a list of the potential candidates. For example, I regard support for M3000 as useful, and definitely worth thinking about keeping.

What does that leave as supported? Most of the US-II and US-III desktops, most of the V-series servers, and pretty much all the early sun4v (T1 through T3 chips) systems. In other words, the sort of thing that you can pick up second hand fairly easily at this point.

Getting rid of code that we can never use has a number of benefits:

  • We end up with a smaller body of code, that is thus easier to manage.
  • We end up with less code that needs to be updated, for example to make it gcc7 clean, or to fix problems found by smatch, or to enable illumos to adopt newer toolchains.
  • We can concentrate on the code that we have left, and improve its quality.

If we put those together into a single strategy, the overall aim is to take illumos for SPARC from a large body of unknown, untested, and unsupportable code to a smaller body of properly maintained, testable, and supportable code. Reduce quantity to improve quality, if you like.

As part of this project, I've looked through much of the SPARC codebase. And it's not particularly pretty. One reason for attacking starfire was that I was able to convince myself relatively quickly that I could come up with a removal plan that was well-bounded - it was possible to take all of it out without accidentally affecting anything else. Some of the other platforms need a bit more analysis to tease out all the dependencies and complexity - bits of code are shared between platforms in a variety of non-obvious ways.

The above represents my thoughts on what would be a reasonable strategy for supporting SPARC in illumos. I would naturally be interested in the views of others, and specifically if anyone is actually using illumos on any of the platforms potentially on the chopping block.

Friday, February 08, 2019

SPARC and tod modules on illumos

Following up from removing starfire support from illumos, I've been browsing through the codebase to identify more legacy code that shouldn't be there any more.

Along the way, I discovered a little tidbit about how the tod (time of day) modules - the interface to the hardware clock - work on SPARC.

If you look, there are a whole bunch of tod modules, and it's not at all obvious how they fit together - they all appear to be candidates for loading, with nothing to show how the correct one for a platform is chosen.

The mechanism is actually pretty simple, if a little odd.

There's a global variable in the kernel named:

tod_module_name

This can be set in several ways - for some platforms, it's hard-coded in that platform's platmod. Or it could be extracted from the firmware (OBP). That tells the system which tod module should be used.

And the way this works is that each tod module has _init code that looks like

if (tod_module_name is myself) {
   initialize the driver
} else {
   do nothing
}

so at boot all the tod modules get loaded, but only the one that matches the name set by the platform actually initializes itself.

Later in boot, there's an attempt to unload all modules. Similarly, the _fini for each driver essentially does

if (tod_module_name is myself) {
   I'm busy and can't be unloaded
} else {
   yeah, unload me
}

So, when the system finishes booting, you end up with only one tod module loaded and functional, and it's the right one.
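
To make the load-then-unload dance concrete, here is a small userland sketch in C of the mechanism described above. It is a simulation under stated assumptions, not the kernel code: only tod_module_name comes from the real kernel, while struct tod_module and the tod_sim_* functions are hypothetical names invented for illustration, and the string comparison is my assumption about what "is myself" amounts to.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel global, set by the platmod or taken from OBP. */
static const char *tod_module_name = "";

/* Hypothetical bookkeeping for one tod module in this simulation. */
struct tod_module {
	const char *name;	/* the module's own name */
	int initialized;	/* 1 if _init actually set up the driver */
	int loaded;		/* 1 while the module is still resident */
};

/* Models _init: every module loads, only the named one initializes. */
static int
tod_sim_init(struct tod_module *m)
{
	assert(m != NULL && m->name != NULL);
	m->loaded = 1;
	if (strcmp(tod_module_name, m->name) == 0)
		m->initialized = 1;
	return (0);
}

/* Models _fini: the named module claims to be busy, the rest unload. */
static int
tod_sim_fini(struct tod_module *m)
{
	assert(m != NULL && m->name != NULL);
	if (strcmp(tod_module_name, m->name) == 0)
		return (EBUSY);
	m->loaded = 0;
	return (0);
}

/*
 * Models boot: load all candidates, then try to unload all of them.
 * Returns the number of modules still loaded afterwards.
 */
static size_t
tod_sim_boot(const char *platform_tod, struct tod_module *mods, size_t n)
{
	size_t i, resident = 0;

	tod_module_name = platform_tod;
	for (i = 0; i < n; i++)
		(void) tod_sim_init(&mods[i]);
	for (i = 0; i < n; i++)
		(void) tod_sim_fini(&mods[i]);
	for (i = 0; i < n; i++)
		resident += (size_t)mods[i].loaded;
	return (resident);
}
```

Calling tod_sim_boot() with an array of candidate modules and the name the platform asks for leaves exactly one entry both loaded and initialized, provided the name matches one of the candidates. If the name matches none of them, every module unloads and nothing is left to drive the clock, which is one reason removing a module without knowing who asks for it is risky.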

Returning to the original question, can any of the tod modules be safely removed because no platform uses them? To be honest, I don't know. Some have names that match the platform they're for. It's pretty obvious, for example, that todstarfire goes with the starfire platform, so it was safe to take that out. But I don't know the module names returned by every possible piece of SPARC hardware, so it isn't really safe to remove some of the others. (And, as a further problem, I know that at least one is only referenced in closed source, binary only, platform modules.)