Saturday, April 02, 2016

Minimal illumos networking

Playing around with Minimal Viable Illumos, it becomes clear how complex a system illumos is. Normally, if you're running a full system, everything is taken care of for you. In particular, SMF starts all the right services for you. Looking through the method scripts, there's a lot of magic happening under the hood.

If you start with a bare install, booted to a shell, with no startup scripts having run, you can then bring up networking manually.

The simplest approach would be to use ifconfig to set up a network interface. For example

ifconfig e1000g0 plumb 10.0.0.1 up

This won't work. It will plumb the interface, but not assign the address. To get the address assigned, you need to have ipmgmtd running.

env SMF_FMRI=svc/net/ip:d /lib/inet/ipmgmtd

Note that SMF_FMRI is set. This is because ipmgmtd expects to be run under SMF, and it checks for the presence of SMF_FMRI to determine that. It doesn't matter what the value is; it just needs to exist.

OK, so if you know your network device and your address, that's pretty much it.
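
So, as a hedged sketch (the interface name, address, netmask, and gateway here are just examples - substitute your own), the whole static setup comes down to:

env SMF_FMRI=svc/net/ip:d /lib/inet/ipmgmtd
ifconfig e1000g0 plumb 10.0.0.1 netmask 255.255.255.0 up
route add default 10.0.0.254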

If you use dladm to look for interfaces, then you'll see nothing. In order for dladm show-phys to enumerate the available interfaces, you need to start up dlmgmtd and give it a poke

env SMF_FMRI=svc/net/dl:d /sbin/dlmgmtd
dladm init-phys

Note that SMF_FMRI is set here, just like for ipmgmtd.

At this point, dladm show-phys will give you back the list of interfaces as normal.

You don't actually need to run dladm init-phys. If you know you have a given interface, plumbing it will poke enough of the machinery to make it show up in the list.

If you don't know your address, then you might think of using DHCP to assign it. This turns out to require a bit more work.

The first thing you need to do is bring up the loopback.

ifconfig lo0 plumb 127.0.0.1 up
ifconfig lo0 inet6 plumb ::1 up

This is required because dhcpagent wants to bind to localhost:4999, so you need the loopback set up for it to be able to do that. (You don't necessarily need the IPv6 version; that's just for completeness.)

Then

ifconfig e1000g0 plumb
ifconfig e1000g0 auto-dhcp primary

This ought to work. Sometimes it does, sometimes it doesn't.

I sometimes found that I needed to

dladm init-secobj

in order to quieten errors that occasionally appear, and sometimes needed to run

ifconfig -adD4 auto-revarp netmask + broadcast + up

although I can't really explain what that does to help; I presume it kicks enough of the machinery to make sure everything is properly initialized.
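
Putting that all together, the whole DHCP dance - as a sketch only, assuming e1000g0, and throwing in dladm init-secobj for good measure - looks something like:

env SMF_FMRI=svc/net/ip:d /lib/inet/ipmgmtd
env SMF_FMRI=svc/net/dl:d /sbin/dlmgmtd
ifconfig lo0 plumb 127.0.0.1 up
ifconfig e1000g0 plumb
dladm init-secobj
ifconfig e1000g0 auto-dhcp primary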

Within the context of mvi, setting a static IP address means you need to create a new image for each IP address, which isn't actually too bad (creating mvi images is really quick), but isn't sustainable at scale. DHCP would give it an address, but then you need to track which address a given instance has been allocated. Is there a way of doing static IP by poking the values in from outside?

A tantalizing hint is present at the end of /lib/svc/method/net-physical, where it talks about boot properties being passed in. The implementation there appears to be aimed at Xen; I'll have to investigate whether it's possible to script variables into the boot menu.
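
If that does pan out, the sort of thing I have in mind is passing properties with -B on the kernel line and reading them back with devprop inside the image. Purely as a hedged sketch - the property name is invented, and I haven't verified that the method script would consume it:

# in the grub menu, append a property to the kernel$ line:
#   kernel$ /platform/i86pc/kernel/$ISADIR/unix -B ip-address=10.0.0.1
# then, inside the booted image:
IPADDR=$(devprop ip-address)
ifconfig e1000g0 plumb $IPADDR up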

Monday, March 28, 2016

Running illumos in 48M of RAM

Whilst tweaking mvi recently, I went back and had another look at just how minimal an illumos install I could make.

And, given sufficiently aggressive use of the rm command, the answer appears to be that it's possible to boot illumos in 48 meg of RAM.

No, it's not April 1st. 48 meg of RAM is pretty low. It's been a long time since I've seen a system that small.

I've added some option scripts to the mvi repo. (The ones with -fix.sh as part of their names.) You don't have to run mvi to see what these do.

First, I start with mvix.sh, which is fairly minimal up front.

Then I go 32-bit, which halves the size.

Then I apply the extreme option, which removes zfs (it's the single biggest driver left), along with a bunch of crypto and other unnecessary files. And I clean up lots of bits of grub that aren't needed.

I then abstracted out a nonet and nodisk option. The nonet script removes anything that looks like networking from the kernel, and the bits of userland that I add in order to be able to configure an interface. The nodisk script removes all the remaining storage device drivers (I only included the ones that you normally see when running under a hypervisor in the first place), along with the underlying scsi and related driver frameworks.

What you end up with is a 17M root file system, which compresses down to a 6.8M root archive, which gets packaged up in an 8.7M iso.

For those interested, the iso is here. It should run in VirtualBox (in 32-bit mode), and you should be able to push the memory allocated by VirtualBox down to 48M. Scary.
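
If you want to try it from the command line, something like the following VBoxManage sequence should do - a sketch assuming the iso has been saved as mvi.iso and using the 32-bit Solaris OS type:

VBoxManage createvm --name mvi --ostype Solaris --register
VBoxManage modifyvm mvi --memory 48
VBoxManage storagectl mvi --name IDE --add ide
VBoxManage storageattach mvi --storagectl IDE --port 0 --device 0 \
  --type dvddrive --medium mvi.iso
VBoxManage startvm mvi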

Of course, it doesn't do much. It boots to a prompt, the only tools you have are ksh, ls, and du.

(Oh, and if you run du against the devices tree, it will panic.)

While doing this, I found that there are a lot of dependencies between modules in the illumos kernel. Not all of them are obvious, and trying to work out what's needed and what can be removed has involved large amounts of trial and error. That said, it takes less than 5 seconds to create an iso, and it takes longer for VirtualBox to start than for this iso to boot, so the cycle is pretty quick.

Sunday, March 27, 2016

Almost there with Solarus

Every so often I'll have a go at getting some games running on Tribblix. It's not a gaming platform, but having the odd distraction can't hurt.

I accidentally stumbled across Solarus, and was immediately intrigued. Back in the day, I remember playing Zelda on the SNES. A few years ago it was released for the Game Boy Advance, and I actually went out and bought the console and the game. I haven't added any more games since - nothing else seemed compelling - although we had a stock of old Game Boy games (all mono) and those still worked, which is great.

So I thought I would try and build Solarus for Tribblix. Having had a quick look at the prerequisites, I reckoned most of those would be useful anyway, so the effort wouldn't be wasted in any event.

First up was SDL 2.0. I've had the older 1.2.15 version for a while, but hadn't had anything actually demand version 2 yet. That was easy enough, and because all the filenames are versioned, it can happily be installed alongside the older version.

While I was at it, I installed the extras SDL_image, SDL_net, SDL_ttf, smpeg, and SDL_mixer. The only build tweak needed was to supply LIBS="-lsocket -lnsl" to the configure script for smpeg and SDL_net. I needed to version the plaympeg binary installed by smpeg to avoid a conflict with the older version of smpeg, but that was about it.
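
For the record, that just means invoking configure for smpeg and SDL_net along these lines (the prefix is whatever you'd normally use; /usr/local here is just an example):

env LIBS="-lsocket -lnsl" ./configure --prefix=/usr/local
gmake
gmake install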

Then came OpenAL, which turned out to be a bit tricky to find. Some of the obvious search results didn't seem appropriate. I have no idea whether the SourceForge site is the current one, but it appears to have the source code I needed.

Another thing that looks abandoned is modplug, where the Solarus folks have a github mirror of the copy that's right for them.

Next up, PhysicsFS. This isn't quite what you might expect from the name; it's an abstraction layer that replaces traditional file system access.

On to Solarus itself. This uses cmake to build, and I ended up with the following incantation in a new empty build directory:

cmake ../solarus-1.4.5 \
 -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=NO \
 -DCMAKE_INSTALL_RPATH=/opt/solarus/lib \
 -DSOLARUS_USE_LUAJIT=OFF \
 -DCMAKE_INSTALL_PREFIX=/opt/solarus

Let's go through that. The CMAKE_INSTALL_PREFIX is fairly obvious - that's where I'm going to install it. And SOLARUS_USE_LUAJIT=OFF is necessary because I've got Lua, but have never had LuaJIT working successfully.

The two RPATH lines are necessary because of the strange way that cmake handles RPATH. Usually, it builds with RPATH set to the build location, then installs with RPATH stripped. This is plain stupid, but works when you simply dump everything into a single swamp. So you need to manually force it to put the correct RPATH into the binary (which is the sort of thing you would actually want a build system to get right on its own).
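
You can check whether the RPATH actually made it into the installed binary with elfdump - the path here assumes the solarus binary ends up under the install prefix above:

elfdump -d /opt/solarus/bin/solarus | egrep 'RPATH|RUNPATH'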

Unfortunately, it doesn't actually work properly. There are two problems which I haven't really had a chance to look at - the first is that it fails fatally with an Xlib error if I move the mouse (which is a little odd as it doesn't actually use the mouse); the second is that it runs an order of magnitude or two slower than useful, so I suspect a timing error.

Still, the build is pretty simple and it's so close to working that it would be nice to finish the job.

Saturday, March 26, 2016

Tweaking MVI

A few months ago I first talked about minimal viable illumos, an attempt to construct a rather more minimalist bootable copy of illumos than the gigabyte-sized images that are becoming the norm.

I've made a couple of changes recently, which are present in the mvi repository.

The first is to make it easier for people who aren't me (and myself when I'm not on my primary build machine) to actually use mvi. The original version had the locations of the packages hardcoded to the values on my build machine. Now I've abstracted out package installation, which eliminates quite a lot of code duplication. And then I added an alternative package installation script which uses zap to retrieve and install packages from the Tribblix repo, just like the regular system does. So you can much more easily run mvi from a vanilla Tribblix system.

What I would like to add is a script that can run on pretty much any (illumos) system. This isn't too hard, but would involve copying most of the functionality of zap into the install script. I'm holding off for a short while, hoping that a better mechanism presents itself. (By better, what I mean is that I've actually got a number of image creation utilities, and it would be nice to rationalise them rather than keep creating new ones.)

The second tweak was to improve the way that the size of the root archive is calculated, to give better defaults and adapt to variations more intelligently.

There are two slightly different mechanisms used to create the image. In mvi.sh, I install packages, and then delete what I'm sure I don't need; with mvix.sh I install packages and then only take the files I do need. The difference in size is considerable - for the basic installation mvi.sh is 127M and mvix.sh is 57M.

Rather than a common base image size of 192M, I've set mvi.sh to 160M and mvix.sh to 96M. These sizes give a reasonable amount of free space - enough that adding the odd package doesn't require the sizes to be adjusted.

I then have standard scripts to construct 32-bit and 64-bit images. A little bit of experimentation indicates that the 32-bit image ends up being half the size of the base, whereas the 64-bit image comes in at two thirds. (The difference is that in the 32-bit image, you can simply remove all 64-bit files. For a 64-bit kernel, you still need both 32-bit and 64-bit userland.) So I've got those scripts to simply scale the image size, rather than try and pick a new number out of the air.
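
The arithmetic is trivial - something along these lines, where the variable names are invented for the sake of illustration rather than lifted from the actual scripts:

# scale the base ramdisk size rather than hardcoding a new figure
BASE_SIZE=160    # 160 for mvi.sh, 96 for mvix.sh
case $ARCH in
32) MRSIZE=$((BASE_SIZE / 2)) ;;      # 32-bit: all 64-bit files removed
64) MRSIZE=$((BASE_SIZE * 2 / 3)) ;;  # 64-bit: still needs 32-bit userland
esac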

I also have a sample script to install Node.js. This again modifies the image size, just adding the extra space that Node needs. I've had to calculate this more accurately, as reducing the size of the base archive gave me less margin for error.

(As an aside, adding applications doesn't really work well in general with mvix.sh, as it doesn't know what dependencies applications might need - it only installs the bare minimum the OS needs to boot. Fortunately Node is fairly self-contained, but other applications are much less so.)

Sunday, March 13, 2016

Software selection - choice or constraint?

In Tribblix, software is preferentially managed using overlays rather than packages.

An overlay comprises a group of packages bundled together to supply a given software need - the question should be "what do you want to do?", and packages (and packaging) are merely an implementation artifact in providing the answer to that question.

Part of the idea was that not only would the overlays match a user's mental model of what they want to install, but there would be many fewer overlays than packages, making it much easier for the human brain to track that smaller number of items.

Now, it's true that there are fewer overlays than available packages. As of right now, there are 91 overlays and 1237 packages available for Tribblix. So that's better than an order of magnitude reduction in the number of things, and an enormous reduction in the possible combinations of items. However, it's no longer small in the absolute sense.

(Ideally, small for me means that it can all be seen on screen at once. If you have to page the screen to see all the information, your brain is also paging stuff in and out.)

So I've been toying with the idea of defining a more constrained set of overlays. Maybe along the lines of a desktop and server split, with small, large, and developer instances of each.

This would certainly help dramatically in that it would lead to significant simplification. However, after trying this out I'm unconvinced. The key point is that genuine needs are rather more complicated than can be addressed by half a dozen neat pigeonholes. (Past experience with the install metaclusters in Solaris 10 was also that they were essentially of no use to any particular user; they always needed to be customised.)

By removing choice, you're seriously constraining what users can do with the system. Worse, by crippling overlays you force users back into managing packages which is one of the things I was trying to avoid in the first place.

So, I'm abandoning the idea of removing choice, and the number of overlays is going to increase as more applications are added. Which means that I'm going to have to think a lot harder about the UX aspect of overlay management.

Sunday, March 06, 2016

Load balancers - improving site reliability

You want your online service to be reliable - a service that isn't working is of no use to your users or customers. Yet the components you're using - hardware and software - are themselves unreliable. How do you arrange things so that the overall service is more reliable than the individual components it's made up of?

This is logically two distinct problems. First, given a set of N systems able to provide a service, how do you maintain service if one or more of them fail? Second, given a single service, how do you make sure it's always available?

The usual solution here involves some form of load balancer: a software or hardware device that takes incoming requests and chooses a system to handle each one, considering as candidates only those systems that are actually working.

(The distinction between hardware and software here is more between prepackaged appliances and DIY. The point about the "hardware" solutions is that you buy a thing, and treat it as a black box with little access to its internals.)

For hardware appliances, most people have heard of F5. Other competitors in this space are Kemp and A10. All are relatively (sometimes eye-wateringly) expensive. Not necessarily bad value, mind, depending on your needs. At the more affordable end of the spectrum sits loadbalancer.org.

Evaluating these recently, one thing I've noticed is that there's a general tendency to move upmarket. They're no longer load balancers; there's a new term here - ADC, or Application Delivery Controller. They may do SSL termination and add features such as simple firewalling, Web Application Firewalls, or Intrusion Detection and Threat Management. While this is clearly to add differentiation and keep ahead of the cannibalization of the market, much of the additional functionality simply isn't relevant for me.

Then there is a whole range of open source software solutions that do load balancing. Often these are also reverse proxies.

HAProxy is well known, and very powerful and flexible. It's not just for web traffic; it's very good at handling generic TCP. It's packed with features; my only criticism is that its configuration is rather monolithic.

You might think of Nginx as a web server, but it's also an excellent reverse proxy and load balancer. It doesn't quite have the range of functionality of HAProxy, but most people don't need anything that powerful anyway. One thing I like about Nginx is directory-based configuration - drop a configuration fragment into a directory, signal nginx, and you're off. If you're managing a lot of sites behind it, such a configuration mode is a godsend.
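
To illustrate that style - the upstream name, addresses, and hostname below are invented - a per-site fragment dropped into something like /etc/nginx/conf.d might look like this, followed by an nginx -s reload to pick it up:

upstream app_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}
server {
    listen 80;
    server_name app.example.com;
    location / {
        proxy_pass http://app_backend;
    }
}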

There's an interesting approach used in SNI Proxy. It assumes an incoming HTTPS connection has an SNI header on it, picks that out, and uses that to decide where to forward the TCP session. By using SNI, you don't have to put certificates on the proxy host, or get it to decrypt (and possibly re-encrypt) anything.

Offering simpler configuration are Pound and Pen. Neither are very keen on virtual hosting configurations. If all your backend servers are the same and you do all the virtual hosting there, then that's fine, but if you need to route to different sets of back end servers depending on the incoming request, they aren't a good choice.

For more dynamic configurations, there's vulcand, where you put all your configuration into Etcd. If you're into microservices and containers then it's definitely worth a look.

All the above load balancers assume that they're relatively reliable (or are relatively stable) compared to the back end services they're proxying. So they give you protection against application or hardware failure, and allow you to manage replacement, upgrades, and general deployment tasks without affecting users. The operational convenience of being able to manage an application independently of its user-facing endpoint can be a huge win.

Making the service endpoint that customers connect to highly available needs a little extra work. What's to stop it failing?

In terms of the application failing, that should be less of a concern. Compared to a fully-fledged business application, the proxy is fairly simple, and usually stateless, so it has less to fail and can be restarted automatically pretty quickly if and when it does.

But what if the underlying system goes away? That's what you need to protect against. And what you're really doing here is trying to ensure that the IP address associated with that service is always live. If it goes away, move it someplace else and carry on.

Ignoring routing tricks and things like VRRP and anycast, some solutions here are:

UCARP is a userland implementation of the Common Address Redundancy Protocol (CARP). Basically, hosts in a group monitor each other. If the host holding the address disappears, another host in the group will bring up a virtual interface with the required IP address. The bringup/teardown is delegated to scripts, allowing you to perform any other steps you might need to as part of the failover.
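
As a rough sketch of what that looks like in practice - the addresses, password, and script paths here are placeholders - each host in the group runs something like:

ucarp --interface=e1000g0 --srcip=10.0.0.11 --vhid=1 --pass=secret \
      --addr=10.0.0.100 \
      --upscript=/etc/vip-up.sh --downscript=/etc/vip-down.sh

where vip-up.sh need do little more than an ifconfig addif to bring up the virtual address, and vip-down.sh the corresponding removeif.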

Wackamole, which uses the Spread toolkit, is another implementation of the same idea. It's getting a bit old now and hasn't seen any work for a while.

A newer variation that might be seen as the logical successor to wackamole is vippy, which is built on Node. The downside here is that Node is a moving target, so vippy won't build as is on current versions of Node, and I had trouble building it at all.

As you can see, this is a pretty large subject, and I've probably only scratched the surface. If there are things I've missed, especially if they're relevant to illumos, let me know.

Friday, March 04, 2016

Supermicro - illumos compatible server configuration

We were recently in the market for some replacement servers. We run OmniOS in production, so something compatible with illumos is essential.

This is a little more tricky than it appears. Some of the things to be aware of are specific to illumos, while some are more generic. But, after spending some time reading an awful lot of spec sheets and with the help of the OmniOS mailing list, we got to a final spec that ought to be pretty good.

I'm going to talk about a subset of Supermicro systems here. While other vendors exist, it can be harder to put together a working spec.

To start with, Supermicro have a bewildering list of parts. We narrowed things down by looking at 2U storage servers, with plenty of 2.5" disk slots in the front for future expansion.

Why 2.5"? Well, it allows you twice as many slots (24 as opposed to 12) so you have more flexibility. Also, the industry is starting to move away from 3.5" drives to 2.5", but that's a slow process. More to the point, most SSDs come in the 2.5" form factor, and I was planning to go to SSDs by default. (It's a good thing now, I'm thinking that choosing spinning rust now will look a bit daft in a few years time.) If you want bulk on the cheap, then something with 3.5" slots that you can put 4TB or larger SAS HDDs in might be better, or something like the DataON 1660 JBOD.

We're also looking at current motherboards. That means the X10 range at this point. The older X9 boards are often seen in Nexenta configurations, and those will work too. But we're planning for the long haul, so we want things to not turn into antiques for as long as possible.

So there is a choice of chassis. These are:
  • 216 - 24 2.5" drive slots
  • 213 - 16 2.5" drive slots
  • 826 - 12 3.5" drive slots
  • 825 - 8 3.5" drive slots
The ones with smaller numbers of drives have space for something like a CD.

The next thing that's important, especially for illumos and ZFS, is whether it's got a direct attach backplane or whether it puts the disks behind an expander. Looking at the 216 chassis variants:
  • A or AC - direct attach
  • E1C - single expander
  • E2C - dual expander
(With a dual expander setup you can use multipathing.)

Generally, for ZFS, you don't want expanders. Especially so if you have SATA drives - and many affordable SSDs are SATA. Vendors and salespeople will tell you that expanders never cause any problems, but most illumos folk appear allergic to the mere mention of them.

(I think this is one reason Nexenta-compatible configurations, and some of the preconfigured SuperServer setups, look expensive at first sight. They often have expanders, use exclusively SAS drives as a consequence, and SAS SSDs are expensive.)

So, we want the SC216BAC-R920LPB chassis. To connect up 24 drives, you'll need HBAs. We're using ZFS, so don't need (or want) any sort of hardware RAID, just direct connectivity. That means the LSI 9300-8i HBA, which has 8 internal ports - you're going to need 3 of them to connect all 24 drives.

For the motherboard, the X10 range has a number of models; the main choice at this point is how many network interfaces you want, and at what speed. The dual network boards have 16 DIMM slots, the quad network boards have 24 DIMM slots. The network interfaces are Intel i350 (1GbE) or X540 (10GbE), both of which are supported by illumos.

The 2U chassis can support a little 2-drive disk bay at the back of the machine; you can put a pair of boot drives in there and wire them up directly to the SATA ports on the motherboard, giving you an extra 2 free drive slots in the front. Note, though, that this is only possible with the dual network boards - the quad network boards take up too much room in the chassis. (It's not so much the extra network ports as such, but the extra DIMM slots.)

Another little quirk is that as far as I can tell the quad 1G board has fewer USB ports, and they're all USB3. You need USB2 for illumos, and I'm not sure if you can demote those ports down to USB2 or not.

So, if you want 4 network ports (to provide, for example, a public LACP pair and a private LACP pair), you want the X10DRi-T4+.

Any E5-2600 v3 CPUs will be fine. We're not CPU bound so just went for cheaper lower-power chips, but that's a workload thing. One thing to be aware of is that you do need to populate both sockets - if you just have one then you lose half of the DIMM slots (which is fairly obvious) and most of the PCI slots (which isn't - look at the documentation carefully if you're thinking of doing this, but better to get a single-socket motherboard in the first place).

As for drives, we went for a pair of small Intel S3510 for the boot drives, those will be mirrored using ZFS. For data, larger Intel S3610, as they support more drive writes - analysis of our I/O usage indicated that we are worryingly close to the DWPD (Drive Writes Per Day) of the S3510, so the S3610 was a safer choice, and isn't that much more expensive.

Hopefully I'll be able to tell you how well we get on once they're delivered and installed.