The Trouble with Tribbles...

Tuesday, May 17, 2016

Updating Tribblix for SPARC

Having just released an updated version of Tribblix for x86, a little commentary on the status of the SPARC version is probably in order.

After a rather extended delay, there is an updated version of Tribblix for SPARC available for download.

This is Milestone 16, so it's at the same level as the prior x86 release. More precisely, it's built from exactly the same illumos source as the x86 Milestone 16 release was. Yes, this means that it's a bit dated, but it's consistent, and there have been a number of breaking changes for SPARC builds introduced (and fixed) in the meantime.

I did want to get a release out of the door, before bumping the version to 0m17 ready for the next x86 build. Part of this was simply to ensure that I could actually still build a release - it gets tested on x86 all the time, but as it happens I had never built a SPARC release on my current infrastructure.

In terms of additional packages, the selection is still rather sparse. I haven't had the time to build up the full list. This isn't helped by the fact that my SPARC kit is rather slow, so building anything for SPARC simply takes longer. Much longer. (Not to mention the fact that they're noisy and power-hungry.)

It's not just time that's the problem. I've had some difficulty building certain packages. I'm not talking the likes of Go and Node, which I can simply ignore as they're not ported to SPARC at all, but some reasonably common packages would fail with obscure (and unexpected) build errors. If there's a problem with one component, that blocks anything dependent on it too.

Other than expanding the breadth of available packages, the key next steps are (i) to make Tribblix on SPARC self-hosting, like the x86 version has been for a very long time, and (ii) to try and keep the SPARC release more closely aligned so it doesn't drift away from the x86 release and require additional effort to bring back in sync.

Signing Packages in Tribblix

On any computer system you want to know exactly what software is installed and running.

Tribblix uses SVR4 packaging, so you can easily see what's installed. In addition, there are mechanisms - pkgchk - to compare what's on the disk with what the packaging system thinks should be there. But that's just a consistency check, it doesn't verify that the package installed is actually the one you wanted.

Tribblix has had simple integrity checking for a while. The catalog for a package repository includes both the expected size and the md5 checksum of a package. This is largely aimed at dealing with download errors - network drops, application errors, or errant intrusion detection systems mangling the data. In practice, the chances of a faulty download getting installed are remote: the downloaded packages are actually zip files, which have inbuilt consistency checking and keep their catalog at the end of the file, and SVR4 packaging has its own consistency checks on package contents. The checking is there so that the layer above can make smart decisions in the case of failure.
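As a rough illustration of that size-and-checksum test, here is a minimal sketch. It is not the code Tribblix actually uses: the function names are made up, the expected values are simply passed as arguments rather than read from a real catalog, and it falls back to md5sum on systems without the illumos digest command.

```sh
# Print the md5 checksum of a file: digest(1) on illumos, md5sum elsewhere.
md5_of() {
    if command -v digest >/dev/null 2>&1; then
        digest -a md5 "$1"
    else
        md5sum "$1" | awk '{print $1}'
    fi
}

# verify_download FILE EXPECTED_SIZE EXPECTED_MD5   (hypothetical helper)
# Prints "ok" and returns 0 on a match; otherwise reports what differed
# on stderr and returns 1, so the caller can decide whether to retry.
verify_download() {
    vd_file=$1
    vd_size=$2
    vd_md5=$3
    if [ ! -f "$vd_file" ]; then
        echo "missing file" >&2
        return 1
    fi
    # Cheap check first: the size in bytes.
    vd_got=$(wc -c < "$vd_file" | tr -d ' ')
    if [ "$vd_got" != "$vd_size" ]; then
        echo "size mismatch" >&2
        return 1
    fi
    # Then the checksum.
    if [ "$(md5_of "$vd_file")" != "$vd_md5" ]; then
        echo "md5 mismatch" >&2
        return 1
    fi
    echo "ok"
}
```

Checking the size first means an obviously truncated download is rejected without reading the whole file.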

But you want to be sure that, not only has the package you downloaded made it across the network intact, but that the source package is legitimate. So the packages are signed using gnupg, and will be verified upon download in upcoming releases. Initially this is just a warning check while the mechanisms get sorted out.

The actual signing and verification part is the easy bit; it's all the framework around it that takes the time to write and test.

One possibility would have been to sign the package catalogs, and use that to prove that the checksum is correct. That's not enough, for a couple of reasons. First, the catalog only includes current package versions, so there would be no way to verify prior versions. Second, there's no reason somebody (or me) couldn't take a subset of packages and create a new repo using them; the modified catalog couldn't be verified. In either case, you need to be able to verify individual packages. (But the package catalog should also be signed, of course.)

It turns out there's not much of a performance hit. Downloads are a little slower, because there's an extra request to get the detached signature, but it's a tiny change overall.

With this in place, you can be sure that whatever you install on Tribblix is legitimate. But all you're doing is verifying the packages at download time. This leaves open the problem of being able to go to a system and ask whether the installed files are legitimate. Yes, there's pkgchk, but there's no validated source of information for it to use as a reference - the contents file is updated with every packaging operation, so it clearly can't be signed by me each time.

This is likely to require the additional creation of a signed manifest for each package. This partially exists already, as the pkgmap fragments for each package are saved (in the global zone, anyway), and those could be signed (as they don't change) and used as the input to pkgchk. However, the checksums in the pkgmap and contents files aren't particularly strong (to put it mildly), so that file will need to be replaced by something with much stronger checksums.

Initial support for signed packages is available starting with the Tribblix Milestone 17 release. At this point, it will check the package signatures, but not act on them. Enforcement will probably come in the next release, when I can be reasonably sure that everything is actually working correctly.
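The check-but-don't-enforce behaviour might look something like the sketch below. This is only an illustration under assumptions: the .sig suffix for the detached signature, the ENFORCE variable, and the function name are all hypothetical, and the signing key is assumed to be in the gpg keyring already.

```sh
# ENFORCE=no (the default here) only warns about a missing or bad
# signature; ENFORCE=yes turns that into a failure.
ENFORCE=${ENFORCE:-no}

# check_signature PKGFILE   (hypothetical helper)
check_signature() {
    cs_pkg=$1
    cs_sig="$cs_pkg.sig"    # assumed naming for the detached signature
    if [ -f "$cs_sig" ] && gpg --verify "$cs_sig" "$cs_pkg" >/dev/null 2>&1; then
        echo "signature ok: $cs_pkg"
        return 0
    fi
    echo "WARNING: no valid signature for $cs_pkg" >&2
    if [ "$ENFORCE" = "yes" ]; then
        return 1
    fi
    return 0
}
```

Keeping the policy in one variable means that moving from warning to enforcement is a one-line change once the mechanism is trusted.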

Saturday, April 02, 2016

Minimal illumos networking

Playing around with Minimal Viable Illumos, it becomes clear how complex a system illumos is. Normally, if you're running a full system, everything is taken care of for you. In particular, SMF starts all the right services for you. Looking through the method scripts, there's a lot of magic happening under the hood.

If you start with a bare install, booted to a shell, with no startup scripts having run, you can then bring up networking manually.

The simplest approach would be to use ifconfig to set up a network interface. For example

ifconfig e1000g0 plumb 10.0.0.1 up

This won't work. It will plumb the interface, but not assign the address. To get the address assigned, you need to have ipmgmtd running.

env SMF_FMRI=svc/net/ip:d /lib/inet/ipmgmtd

Note that SMF_FMRI is set. This is because ipmgmtd expects to be run under SMF, and checks for SMF_FMRI being set as the determinant. It doesn't matter what the value is, it just needs to exist.

OK, so if you know your network device and your address, that's pretty much it.
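Putting those two steps in the order that works gives something like the following sketch. The function name and the RUN variable are illustrative additions; setting RUN=echo prints the commands rather than running them, which is handy for checking what would happen.

```sh
# Set RUN=echo to print the commands instead of executing them.
RUN=${RUN:-}

# static_net_up INTERFACE ADDRESS   (hypothetical helper)
static_net_up() {
    # ipmgmtd only checks that SMF_FMRI is set; the value is arbitrary.
    $RUN env SMF_FMRI=svc/net/ip:d /lib/inet/ipmgmtd
    # With ipmgmtd running, plumbing the interface also assigns the address.
    $RUN ifconfig "$1" plumb "$2" up
}
```

For example, static_net_up e1000g0 10.0.0.1 corresponds to the commands shown earlier, with ipmgmtd started first.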

If you use dladm to look for interfaces, then you'll see nothing. In order for dladm show-phys to enumerate the available interfaces, you need to start up dlmgmtd and give it a poke

env SMF_FMRI=svc/net/dl:d /sbin/dlmgmtd
dladm init-phys

Note that SMF_FMRI is set here, just like for ipmgmtd.

At this point, dladm show-phys will give you back the list of interfaces as normal.

You don't actually need to run dladm init-phys. If you know you have a given interface, plumbing it will poke enough of the machinery to make it show up in the list.

If you don't know your address, then you might think of using dhcp to assign it. This turns out to require a bit more work.

The first thing you need to do is bring up the loopback.

ifconfig lo0 plumb 127.0.0.1 up
ifconfig lo0 inet6 plumb ::1 up

This is required because dhcpagent wants to bind to localhost:4999, so you need the loopback set up for it to be able to do that. (You don't necessarily need the IPv6 version, that's for completeness.)

Then

ifconfig e1000g0 plumb
ifconfig e1000g0 auto-dhcp primary

Ought to work. Sometimes it does, sometimes it doesn't.
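Collected in one place, the DHCP sequence is sketched below. It assumes ipmgmtd is already running as described earlier; the function name and the RUN dry-run variable are illustrative, not part of mvi.

```sh
# Set RUN=echo to print the commands instead of executing them.
RUN=${RUN:-}

# dhcp_net_up INTERFACE   (hypothetical helper)
dhcp_net_up() {
    # dhcpagent binds to localhost:4999, so loopback has to exist first.
    $RUN ifconfig lo0 plumb 127.0.0.1 up
    $RUN ifconfig lo0 inet6 plumb ::1 up     # optional, for completeness
    $RUN ifconfig "$1" plumb
    $RUN ifconfig "$1" auto-dhcp primary
}
```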

I sometimes found that I needed to

dladm init-secobj

in order to quieten errors that occasionally appear, and sometimes need to run

ifconfig -adD4 auto-revarp netmask + broadcast + up

although I can't even imagine what that would do to help, I presume that it kicks enough of the machinery to make sure everything is properly initialized.

Within the context of mvi, setting a static IP address means you need to create a new image for each IP address, which isn't actually too bad (creating mvi images is really quick), but isn't sustainable in the large. DHCP would give it an address, but then you need to track what address a given instance has been allocated. Is there a way of doing static IP by poking the values from outside?

A tantalizing hint is present at the end of /lib/svc/method/net-physical, where it talks about boot properties being passed in. The implementation there appears to be aimed at Xen, I'll have to investigate if it's possible to script variables into the boot menu.

Monday, March 28, 2016

Running illumos in 48M of RAM

Whilst tweaking mvi recently, I went back and had another look at just how minimal an illumos install I could make.

And, given sufficiently aggressive use of the rm command, the answer appears to be that it's possible to boot illumos in 48 meg of RAM.

No, it's not April 1st. 48 meg of RAM is pretty low. It's been a long time since I've seen a system that small.

I've added some option scripts to the mvi repo. (The ones with -fix.sh as part of their names.) You don't have to run mvi to see what these do.

First, I start with mvix.sh, which is fairly minimal up front.

Then I go 32bit, which halves the size.

Then I apply the extreme option, which removes zfs (it's the single biggest driver left), along with a bunch of crypto and other unnecessary files. And I clean up lots of bits of grub that aren't needed.

I then abstracted out a nonet and nodisk option. The nonet script removes anything that looks like networking from the kernel, and the bits of userland that I add in order to be able to configure an interface. The nodisk script removes all the remaining storage device drivers (I only included the ones that you normally see when running under a hypervisor in the first place), along with the underlying scsi and related driver frameworks.

What you end up with is a 17M root file system, which compresses down to a 6.8M root archive, which gets packaged up in an 8.7M iso.

For those interested the iso is here. It should run in VirtualBox - in 32-bit mode, and you should be able to push the memory allocated by VirtualBox down to 48M. Scary.

Of course, it doesn't do much. It boots to a prompt, the only tools you have are ksh, ls, and du.

(Oh, and if you run du against the devices tree, it will panic.)

While doing this, I found that there are a lot of dependencies between modules in the illumos kernel. Not all of them are obvious, and trying to work out what's needed and what can be removed has involved large amounts of trial and error. That said, it takes less than 5 seconds to create an iso, and it takes longer for VirtualBox to start than for this iso to boot, so the cycle is pretty quick.

Sunday, March 27, 2016

Almost there with Solarus

Every so often I'll have a go at getting some games running on Tribblix. It's not a gaming platform, but having the odd distraction can't hurt.

I accidentally stumbled across Solarus, and was immediately intrigued. Back in the day, I remember playing Zelda on the SNES. A few years ago it was released for the Game Boy Advance, and I actually went out and bought the console and the game. I haven't added any more games, nothing seemed compelling, although we had a stock of old Game Boy games (all mono) and those still worked, which is great.

So I thought I would try and build Solarus for Tribblix. Having had a quick look at the prerequisites, most of those would be useful anyway so the effort wouldn't be wasted in any event.

First up was SDL 2.0. I've had the older 1.2.15 version for a while, but hadn't had anything actually demand version 2 yet. That was easy enough, and because all the filenames are versioned, it can happily be installed alongside the older version.

While I was at it, I installed the extras SDL_image, SDL_net, SDL_ttf, smpeg, and SDL_mixer. The only build tweak needed was to supply LIBS="-lsocket -lnsl" to the configure script for smpeg and SDL_net. I needed to version the plaympeg binary installed by smpeg to avoid a conflict with the older version of smpeg, but that was about it.

Then came OpenAL, which turned out to be a bit tricky to find. Some of the obvious search results didn't seem appropriate. I have no idea whether the SourceForge site is the current one, but it appears to have the source code I needed.

Another thing that looks abandoned is modplug, where the Solarus folks have a github mirror of the copy that's right for them.

Next up, PhysicsFS. This isn't quite what you expect from the name, it's an abstraction layer that replaces traditional file system access.

On to Solarus itself. This uses cmake to build, and I ended up with the following incantation in a new empty build directory:

cmake ../solarus-1.4.5 \
 -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=NO \
 -DCMAKE_INSTALL_RPATH=/opt/solarus/lib \
 -DSOLARUS_USE_LUAJIT=OFF \
 -DCMAKE_INSTALL_PREFIX=/opt/solarus

Let's go through that. The CMAKE_INSTALL_PREFIX is fairly obvious - that's where I'm going to install it. And SOLARUS_USE_LUAJIT is necessary because I've got Lua, but have never had LuaJIT working successfully.

The two RPATH lines are necessary because of the strange way that cmake handles RPATH. Usually, it builds with RPATH set to the build location, then installs with RPATH stripped. This is plain stupid, but works when you simply dump everything into a single swamp. So you need to manually force it to put the correct RPATH into the binary (which is the sort of thing you would actually want a build system to get right on its own).

Unfortunately, it doesn't actually work properly. There are two problems which I haven't really had a chance to look at - the first is that it fails fatally with an Xlib error if I move the mouse (which is a little odd as it doesn't actually use the mouse); the second is that it runs an order of magnitude or two slower than useful, so I suspect a timing error.

Still, the build is pretty simple and it's so close to working that it would be nice to finish the job.

Saturday, March 26, 2016

Tweaking MVI

A few months ago I first talked about minimal viable illumos, an attempt to construct a rather more minimalist bootable copy of illumos than the gigabyte-size images that are becoming the norm.

I've made a couple of changes recently, which are present in the mvi repository.

The first is to make it easier for people who aren't me (and myself when I'm not on my primary build machine) to actually use mvi. The original version had the locations of the packages hardcoded to the values on my build machine. Now I've abstracted out package installation, which eliminates quite a lot of code duplication. And then I added an alternative package installation script which uses zap to retrieve and install packages from the Tribblix repo, just like the regular system does. So you can much more easily run mvi from a vanilla Tribblix system.

What I would like to add is a script that can run on pretty much any (illumos) system. This isn't too hard, but would involve copying most of the functionality of zap into the install script. I'm holding off for a short while, hoping that a better mechanism presents itself. (By better, what I mean is that I've actually got a number of image creation utilities, and it would be nice to rationalise them rather than keep creating new ones.)

The second tweak was to improve the way that the size of the root archive is calculated, to give better defaults and adapt to variations more intelligently.

There are two slightly different mechanisms used to create the image. In mvi.sh, I install packages, and then delete what I'm sure I don't need; with mvix.sh I install packages and then only take the files I do need. The difference in size is considerable - for the basic installation mvi.sh is 127M and mvix.sh is 57M.

Rather than a common base image size of 192M, I've set mvi.sh to 160M and mvix.sh to 96M. These sizes give a reasonable amount of free space - enough that adding the odd package doesn't require the sizes to be adjusted.

I then have standard scripts to construct 32-bit and 64-bit images. A little bit of experimentation indicates that the 32-bit image ends up being half the size of the base, whereas the 64-bit image comes in at two thirds. (The difference is that in the 32-bit image, you can simply remove all 64-bit files. For a 64-bit kernel, you still need both 32-bit and 64-bit userland.) So I've got those scripts to simply scale the image size, rather than try and pick a new number out of the air.
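The scaling itself is just arithmetic. A minimal sketch, using the ratios quoted above (the function name is invented for illustration, and the real scripts may round differently):

```sh
# scale_image_size BASE_MB BITS   (hypothetical helper)
# Prints the root archive size in MB: half of the base for a 32-bit
# image, two thirds for a 64-bit one. Integer arithmetic, rounded down.
scale_image_size() {
    case $2 in
        32) echo $(( $1 / 2 )) ;;
        64) echo $(( $1 * 2 / 3 )) ;;
        *)  echo "unknown word size: $2" >&2; return 1 ;;
    esac
}
```

With the 160M base used for mvi.sh this gives 80M for 32-bit and 106M for 64-bit; with the 96M base for mvix.sh it gives 48M and 64M.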

I also have a sample script to install Node.js. This again modifies the image size, just adding the extra space that Node needs. I've had to calculate this more accurately, as reducing the size of the base archive gave me less margin for error.

(As an aside, adding applications doesn't really work well in general with mvix.sh, as it doesn't know what dependencies applications might need - it only installs the bare minimum the OS needs to boot. Fortunately Node is fairly self-contained, but other applications are much less so.)

Sunday, March 13, 2016

Software selection - choice or constraint?

In Tribblix, software is preferentially managed using overlays rather than packages.

Overlays comprise a group of packages bundled together to supply a given software need - the question should be "what do you want to do?", and packages (and packaging) are merely an implementation artifact in providing the answer to that question.

Part of the idea was that, not only would the overlays match a user's mental model of what they want to install, but that there would be many fewer overlays than packages, and so it's much easier for the human brain to track that smaller number of items.

Now, it's true that there are fewer overlays than available packages. As of right now, there are 91 overlays and 1237 packages available for Tribblix. So that's better than an order of magnitude reduction in the number of things, and an enormous reduction in the possible combinations of items. However, it's no longer small in the absolute sense.

(Ideally, small for me means that it can be all seen on screen at once. If you have to page the screen to see all the information, your brain is also paging stuff in and out.)

So I've been toying with the idea of defining a more constrained set of overlays. Maybe along the lines of a desktop and server split, with small, large, and developer instances of each.

This would certainly help dramatically in that it would lead to significant simplification. However, after trying this out I'm unconvinced. The key point is that genuine needs are rather more complicated than can be addressed by half a dozen neat pigeonholes. (Past experience with the install metaclusters in Solaris 10 was also that they were essentially of no use to any particular user, they always needed to be customised.)

By removing choice, you're seriously constraining what users can do with the system. Worse, by crippling overlays you force users back into managing packages which is one of the things I was trying to avoid in the first place.

So, I'm abandoning the idea of removing choice, and the number of overlays is going to increase as more applications are added. Which means that I'm going to have to think a lot harder about the UX aspect of overlay management.

Sunday, March 06, 2016

Load balancers - improving site reliability

You want your online service to be reliable - a service that isn't working is of no use to your users or customers. Yet, the components that you're using - hardware and software - are themselves unreliable. How do you arrange things so that the overall service is more reliable than the individual components it's made up of?

This is logically 2 distinct problems. First, given a set of N systems able to provide a service, how do you maintain service if one or more of those fail? Second, given a single service, how do you make sure it's always available?

The usual solution here involves some form of load balancer. A software or hardware device that takes incoming requests and chooses a system to handle the request, only considering as candidates those systems that are actually working.

(The distinction between hardware and software here is more between prepackaged appliances and DIY. The point about the "hardware" solutions is that you buy a thing, and treat it as a black box with little access to its internals.)

For hardware appliances, most people have heard of F5. Other competitors in this space are Kemp and A10. All are relatively (sometimes eye-wateringly) expensive. Not necessarily bad value, mind, depending on your needs. At the more affordable end of the spectrum sits loadbalancer.org.

Evaluating these recently, one thing I've noticed is that there's a general tendency to move upmarket. They're no longer load balancers, there's a new term here - ADC, or Application Delivery Controller. They may do SSL termination and add features such as simple firewalling, Web Application Firewalls, or Intrusion Detection and Threat Management. While this is clearly to add differentiation and keep ahead of the cannibalization of the market, much of the additional functionality simply isn't relevant for me.

Then there is a whole range of open source software solutions that do load balancing. Often these are also reverse proxies.

HAProxy is well known, and very powerful and flexible. It's not just web, it's very good at handling generic TCP. Packed with features, my only criticism is that configuration is rather monolithic.

You might think of Nginx as a web server, but it's also an excellent reverse proxy and load balancer. It doesn't quite have the range of functionality of HAProxy, but most people don't need anything that powerful anyway. One thing I like about Nginx is directory-based configuration - drop a configuration fragment into a directory, signal nginx, and you're off. If you're managing a lot of sites behind it, such a configuration mode is a godsend.
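To make the drop-in style concrete, here is a sketch of a shell function that generates such a fragment. The function name, the example host and backend addresses, and the conf.d location are all assumptions; it also assumes the main nginx.conf has an include line that picks up files from that directory.

```sh
# site_fragment HOSTNAME BACKEND...   (hypothetical helper)
# Writes an nginx configuration fragment to stdout that balances
# HOSTNAME across the given host:port backends.
site_fragment() {
    sf_host=$1
    shift
    # Keep the upstream name to letters, digits and underscores.
    sf_up=$(printf '%s' "$sf_host" | tr -c 'A-Za-z0-9' '_')_backend
    echo "upstream $sf_up {"
    for sf_b in "$@"; do
        echo "    server $sf_b;"
    done
    echo "}"
    echo "server {"
    echo "    listen 80;"
    echo "    server_name $sf_host;"
    echo "    location / {"
    echo "        proxy_pass http://$sf_up;"
    echo "    }"
    echo "}"
}
```

Usage would be along the lines of site_fragment www.example.com 10.0.0.11:8080 10.0.0.12:8080 > /etc/nginx/conf.d/www.example.com.conf followed by nginx -s reload. Removing a site is then just deleting its file and reloading.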

There's an interesting approach used in SNI Proxy. It assumes an incoming HTTPS connection has an SNI header on it, picks that out, and uses that to decide where to forward the TCP session. By using SNI, you don't have to put certificates on the proxy host, or get it to decrypt (and possibly re-encrypt) anything.

Offering simpler configuration are Pound and Pen. Neither are very keen on virtual hosting configurations. If all your backend servers are the same and you do all the virtual hosting there, then that's fine, but if you need to route to different sets of back end servers depending on the incoming request, they aren't a good choice.

For more dynamic configurations, there's vulcand, where you put all your configuration into Etcd. If you're into microservices and containers then it's definitely worth a look.

All the above load balancers assume that they're relatively reliable (or are relatively stable) compared to the back end services they're proxying. So they give you protection against application or hardware failure, and allow you to manage replacement, upgrades, and general deployment tasks without affecting users. The operational convenience of being able to manage an application independent of its user-facing endpoint can be a huge win.

Achieving availability of the service that customers connect to needs a little extra work. What's to stop it failing?

In terms of the application failing, that should be less of a concern. Compared to a fully-fledged business application, the proxy is fairly simple and usually stateless, so it has less to fail and can be automatically restarted pretty quickly if and when it fails.

But what if the underlying system goes away? That's what you need to protect against. And what you're really doing here is trying to ensure that the IP address associated with that service is always live. If it goes away, move it someplace else and carry on.

Ignoring routing tricks and things like VRRP and anycast, some solutions here are:

UCARP is a userland implementation of the Common Address Redundancy Protocol (CARP). Basically, hosts in a group monitor each other. If the host holding the address disappears, another host in the group will bring up a virtual interface with the required IP address. The bringup/teardown is delegated to scripts, allowing you to perform any other steps you might need to as part of the failover.

Wackamole, which uses the Spread toolkit, is another implementation of the same idea. It's getting a bit old now and hasn't seen any work for a while.

A newer variation that might be seen as the logical successor to wackamole is vippy, which is built on Node. The downside here is that Node is a moving target, so vippy won't build as is on current versions of Node, and I had trouble building it at all.

As you can see, this is a pretty large subject, and I've probably only scratched the surface. If there are things I've missed, especially if they're relevant to illumos, let me know.

Friday, March 04, 2016

Supermicro - illumos compatible server configuration

We were recently in the market for some replacement servers. We run OmniOS in production, so something compatible with illumos is essential.

This is a little more tricky than it appears. Some of the things to be aware of are specific to illumos, while some are more generic. But, after spending some time reading an awful lot of spec sheets and with the help of the OmniOS mailing list, we got to a final spec that ought to be pretty good.

I'm going to talk about a subset of Supermicro systems here. While other vendors exist, it can be harder to put together a working spec.

To start with, Supermicro have a bewildering list of parts. But we started out looking at 2U storage servers, with plenty of 2.5" disk slots in the front for future expansion.

Why 2.5"? Well, it allows you twice as many slots (24 as opposed to 12) so you have more flexibility. Also, the industry is starting to move away from 3.5" drives to 2.5", but that's a slow process. More to the point, most SSDs come in the 2.5" form factor, and I was planning to go to SSDs by default. (It's a good thing now, I'm thinking that choosing spinning rust now will look a bit daft in a few years time.) If you want bulk on the cheap, then something with 3.5" slots that you can put 4TB or larger SAS HDDs in might be better, or something like the DataON 1660 JBOD.

We're also looking at current motherboards. That means the X10 at this point. The older X9 are often seen in Nexenta configurations, and those will work too. But we're planning for the long haul, so want things to not turn into antiques for as long as possible.

So there is a choice of chassis. These are:

216 - 24 2.5" drive slots
213 - 16 2.5" drive slots
826 - 12 3.5" drive slots
825 - 8 3.5" drive slots

The ones with smaller numbers of drives have space for something like a CD.

The next thing that's important, especially for illumos and ZFS, is whether it's got a direct attach backplane or whether it puts the disks behind an expander. Looking at the 216 chassis, you can have:

SC216BAC-R920LPB - direct attach
SC216BE1C-R920LPB - single expander
SC216BE2C-R920LPB - dual expander

So you can see that something with A or AC is direct attach, E1C is a single expander, E2C is a dual expander. (With a dual expander setup you can use multipathing.)
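That naming rule is easy to encode if you're trawling through a long parts list. The sketch below is derived only from the three model names above, so treat it as an illustration rather than a general decoder of Supermicro part numbers; the function name is made up.

```sh
# backplane_type MODEL   (hypothetical helper)
# Classifies a chassis by the suffix of the part before the first dash.
backplane_type() {
    bt_part=${1%%-*}          # e.g. SC216BE1C from SC216BE1C-R920LPB
    case $bt_part in
        *E1C)   echo "single expander" ;;
        *E2C)   echo "dual expander" ;;
        *A|*AC) echo "direct attach" ;;
        *)      echo "unknown" ;;
    esac
}
```

So backplane_type SC216BAC-R920LPB reports direct attach, which is the one wanted for ZFS.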

Generally, for ZFS, you don't want expanders. Especially so if you have SATA drives - and many affordable SSDs are SATA. Vendors and salespeople will tell you that expanders never cause any problems, but most illumos folk appear allergic to the mere mention of them.

(I think this is one reason Nexenta-compatible configurations, and some of the preconfigured SuperServer setups, look expensive at first sight. They often have expanders, use exclusively SAS drives as a consequence, and SAS SSDs are expensive.)

So, we want the SC216BAC-R920LPB chassis. To connect up 24 drives, you'll need HBAs. We're using ZFS, so don't need (or want) any sort of hardware raid, just direct connectivity. So you're looking at the LSI 9300-8i HBA, which has 8 internal ports, and you're going to need 3 of them to connect all 24 drives.

For the motherboard, the X10 has a range of models. At this point, select how many and what speed network interfaces you want.

X10DRi - dual 1G
X10DRi-T - dual 10G
X10DRi-LN4+ - quad 1G
X10DRi-T4+ - quad 10G

The dual network boards have 16 DIMM slots, the quad network boards have 24 DIMM slots. The network cards are Intel i350 (1Gbe) or X540 (10Gbe) which are both supported by illumos.

The 2U chassis can support a little 2-drive disk bay at the back of the machine, you can put a pair of boot drives in here and wire them up directly to the SATA ports on the motherboard, giving you an extra 2 free drive slots in the front. Note, though, that this is only possible with the dual network boards, the quad network boards take up too much room in the chassis. (It's not so much the extra network ports as such, but the extra DIMM slots.)

Another little quirk is that as far as I can tell the quad 1G board has fewer USB ports, and they're all USB3. You need USB2 for illumos, and I'm not sure if you can demote those ports down to USB2 or not.

So, if you want 4 network ports (to provide, for example, a public LACP pair and a private LACP pair), you want the X10DRi-T4+.

Any E5-2600 v3 CPUs will be fine. We're not CPU bound so just went for cheaper lower-power chips, but that's a workload thing. One thing to be aware of is that you do need to populate both sockets - if you just have one then you lose half of the DIMM slots (which is fairly obvious) and most of the PCI slots (which isn't - look at the documentation carefully if you're thinking of doing this, but better to get a single-socket motherboard in the first place).

As for drives, we went for a pair of small Intel S3510 for the boot drives, those will be mirrored using ZFS. For data, larger Intel S3610, as they support more drive writes - analysis of our I/O usage indicated that we are worryingly close to the DWPD (Drive Writes Per Day) of the S3510, so the S3610 was a safer choice, and isn't that much more expensive.
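For anyone wanting to do the same sum, DWPD is simply the amount written per day divided by the capacity of the drive. A minimal sketch follows; the function name is invented, and the figures in the usage note are illustrative rather than our measurements.

```sh
# dwpd DAILY_WRITES_GB CAPACITY_GB   (hypothetical helper)
# Drive writes per day: data written per day divided by drive capacity.
dwpd() {
    awk -v w="$1" -v c="$2" 'BEGIN { printf "%.2f\n", w / c }'
}
```

For example, dwpd 240 480 gives 0.50, meaning half the drive is rewritten every day. Compare the result with the endurance rating on the vendor's spec sheet for the drive you have in mind.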

Hopefully I'll be able to tell you how well we get on once they're delivered and installed.

Wednesday, March 02, 2016

Moving goalposts with openssl

The most recent openssl release fixed a number of security issues.

In order to mitigate against DROWN, SSLv2 was disabled by default. OK, that's a reasonable thing to do. The mere presence of the code is harmful, and improving security is a good thing. Right?

Well, maybe. Unfortunately, this breaks binary compatibility by default. Suddenly, a number of functions that used to be present in libssl.so have disappeared. In particular:

SSLv2_client_method
SSLv2_method
SSLv2_server_method

and the problem is that if you have an application that references those symbols, the linker can't find them, and your application won't even run.
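A quick way to find affected binaries is to look for references to those symbols in the dynamic symbol table. The sketch below assumes an nm that accepts -D to list dynamic symbols; the function names are made up for illustration.

```sh
# sslv2_refs
# Reads symbol listings (such as nm output) on stdin and prints the
# lines that mention one of the removed SSLv2 entry points.
# Returns non-zero if there are none.
sslv2_refs() {
    grep -E 'SSLv2_(client_|server_)?method'
}

# needs_sslv2 FILE...   (hypothetical helper)
# Prints the name of each file that references an SSLv2 entry point.
needs_sslv2() {
    for ns_f in "$@"; do
        if nm -D "$ns_f" 2>/dev/null | sslv2_refs >/dev/null; then
            echo "$ns_f"
        fi
    done
}
```

Note that the pattern deliberately doesn't match SSLv23_method, which is still present.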

This hit pkg(5) on OmniOS, which is pretty nasty.

So I had a look around on Tribblix to see what would break. It's largely what you would expect:

curl, specifically libcurl
wget
neon
the python ssl module
the ruby ssl module
apache
mysql

And, of course, by the magic of dependency hell that spreads to affect a large number of applications.

In many cases the SSLv2 code is only present if it detects the corresponding calls being present in libssl. Which leads you into a chicken and egg situation - you have to install the newer openssl, thus breaking your system, in order to rebuild the applications that are broken.

And even if a distributor rebuilds what the distro ships, there's still any 3rd-party or locally built binaries which could be broken.

For Tribblix, I'm rebuilding what I can in any case, explicitly disabling SSLv2 (because the automatic detection is wrong). I'll temporarily ship openssl with SSLv2 enabled, until I've finally nailed everything.

But this is a game changer in terms of expectations. The argument before was that you should link dynamically in order to take advantage of updates to critical libraries rather than have to rebuild everything individually. Now, you have to assume that any security fix to a library you use could break compatibility.

For critical applications, another reason to build your own stack.

Monday, February 29, 2016

Updating gcc for Tribblix

One of the items on the Tribblix Roadmap was improving the system compiler.

I'm currently shipping a copy of gcc 4.8.3. When Tribblix was originally created, I inherited the old gcc 3.4.3 build from OpenIndiana. It took a modest amount of effort to get a copy of gcc v4 that worked (at one point I bailed out completely and rolled back to gcc3). Originally I had 4.7.2, but I settled on 4.8.3. And for many purposes that's worked out quite well.

It's starting to show problems, though. In no particular order, some are:

It's not fully compatible with modern C++ code. In particular, there are instances of modern code that simply keel over and complain that 'to_string isn't a member of std', and various other functions that ought to be supported.
There are places that have the path to the gcc install encoded. Given that each compiler version is installed to a different path, publishing packages that have a hardcoded dependency on the install location of this particular compiler build doesn't work so well. In many cases I can fix up any .la files that have errant paths in them, and do so as part of my build process.
For (bad) reasons associated with the way I transitioned from gcc3 to gcc4, the current gcc build has libssp, libgomp, and libquadmath disabled. I've recently come across builds that need the first two at least.
I've decided that it was a mistake to put the full version in the install prefix. It would have been better to use 4.8 rather than 4.8.3. (Yes, I know that gcc has an internal directory hierarchy that uses the full version number.)
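To illustrate the kind of .la fix-up mentioned above, here is a minimal sketch that strips -L and -R options pointing into the compiler's install tree from libtool archive text. The prefix /opt/gcc-4.8.3 is a hypothetical path, not necessarily where Tribblix installs gcc, and the function name is invented for the example. Real .la files can also name the compiler's own .la files directly, which this does not handle.

```shell
# Read libtool .la content on stdin and remove any -L or -R option
# that points below the given compiler prefix.
# $1 = compiler install prefix (hypothetical, e.g. /opt/gcc-4.8.3)
strip_gcc_paths() {
    sed -e "s:[ ]*-[LR]$1/[^ ']*::g"
}

# Example use (file name is illustrative only):
#   strip_gcc_paths /opt/gcc-4.8.3 < libfoo.la > libfoo.la.new
```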

Now, gcc has moved on in the meantime. I can see some basic options:

Update to gcc 4.8.5, fixing the build process to add the missing features and fix up any errant paths
Update to gcc 4.9.3, fixing the build process to add the missing features and fix up any errant paths
Update to whatever the latest gcc5 is.
Wait for gcc6 to come out and switch to that

I've been doing a little experimentation.

While I would like to keep as current as possible, waiting for gcc6 doesn't solve the problems I have now, so maybe not.

I've tried building gcc5 and had some components fail on me. With 5.1.0, the fortran build didn't work; with 5.3.0 it was gccgo that failed. Not having fortran is probably a showstopper. Losing gccgo less so, as we have golang proper. On x64, at any rate.

Another issue arguing against gcc5 is the C++ ABI changes. Not that they're a problem in themselves, but they make rolling back a real pain. I've already had to do a compiler rollback once, and it was painful. And given that I've seen problems with gcc5 builds (that I haven't been able to satisfactorily resolve), I'm wary of other problems cropping up.

Building 4.9.3 has been uneventful so far. I've enabled (or, rather, not disabled) libssp, gomp, and quadmath. I added obj-c++ as it didn't seem to hurt. I've disabled libitm. Generally, this brings me much closer to the OpenIndiana and OmniOS build settings.

Testing this compiler, the 'to_string isn't a member of std' family of errors are gone, which opens up more code (LibreOffice 5 in particular). And a handful of test builds seem to work just fine.

I have one issue to resolve before I can get this rolled out. It looks like the way that libgcc is linked has changed (looking at the dumpspecs output I can see it's changed) so that every binary ends up depending on libgcc_s. That's something I really don't want. And manually forcing -static-libgcc on every build seems wrong (and besides, there are builds where it's the wrong thing to do).
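A quick way to see whether a freshly built binary has picked up that dependency is to look at the NEEDED entries in its dynamic section. The sketch below assumes the listing comes from elfdump -d on illumos (readelf -d output would also match); the function name is made up.

```shell
# Succeed if the dynamic-section listing on stdin contains a NEEDED
# entry for libgcc_s, fail otherwise.
needs_libgcc_s() {
    grep 'NEEDED' | grep 'libgcc_s\.so' > /dev/null
}

# Example use (the binary name is illustrative only):
#   elfdump -d ./hello | needs_libgcc_s && echo "depends on libgcc_s"
```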

Tribblix Roadmap

It seems like only yesterday, but I put together a Tribblix scorecard about a year ago.

Tribblix keeps improving, of course. There have been a number of enhancements to the available software, and this continues with a steady stream of package updates.

Actual releases have been relatively thin on the ground. I managed to get Milestone 15 out in April, and Milestone 16 in September. It's about time for another one, looking at that cadence, and I don't really want to go much slower.

What constitutes a release anyway? For Tribblix, there are 2 things that can only happen at a release.

The first is updating the underlying illumos build - this is slightly artificial, but you can't really do on the fly updates of illumos, whereas most of the applications can be updated just fine. So that leads to a distinction between updating packages and upgrading the release.

The second is any structural change to the distro itself, including the way it's packaged and that packages are managed (so if zap changes that's a new release).

Note that adding packages, or replacing them with new versions, doesn't generally involve a release. Even if it's a fairly major package, it will simply be available when it's built.

Looking forward, then, what do I have in the pipeline in terms of releases and changes aligned with them?

Overall, the general stability of the Tribblix pipeline, and the fact that I've managed to achieve many of the objectives I set myself initially, means that I'm reasonably close to calling it ready, and putting an actual 1.0 release together. What needs doing to make that a reality?

Fully fledged and aligned SPARC support would be good. Not that it's necessary to align the release dates or support every piece of software, but what I need to be sure of is that a SPARC release isn't going to need breaking changes to the rest of Tribblix.
Fixing the compiler baseline. I currently have a slightly tweaked gcc 4.8.3. I need to settle on a version (possibly a newer version) and untweak the configuration.
Making sure that upgrades are solid. At the present time, they mostly work, but are a little clumsy.

There's a lot of devil in those details, but there's nothing massive there. And the result will be an illumos distro with regular updates and lots of the functionality I expect.

Beyond that, I'm already making plans for Tribblix2. This is the opportunity for me to play with different ideas for how illumos could be used. And this ability to experiment with new ideas is one of the reasons I built my own distribution in the first place.

Think of Tribblix2 as more of a series of concepts rather than a single release vehicle. I can try things like removing 32-bit (and do it wholesale) to see what the benefits would be (and identify any drawbacks along the way). I can see how much of the legacy we currently ship in /usr can be ditched entirely. I can look further at projects like minimal viable illumos and minimal memory systems. It would be fun to see what happens if you apply some of the ideas behind Docker and Unikernels to illumos. Of course, successful experiments could lead to improvements trickling down to regular Tribblix.

OK, so it's not really a roadmap, but it should give an idea of where I'm heading.

Wednesday, February 17, 2016

Do suppliers want to go out of business?

The onrushing cloud behemoth seems destined to sweep many legacy IT suppliers aside, at least if you believe the pundits.

However, it's not just down to the (often imagined) inherent superiority of cloud computing. In many cases, IT suppliers have only themselves to blame.

Now, I have no business training, but even I can understand that making it easy for customers to buy your stuff is probably a good idea.

It's clear, though, that many companies obviously don't want to sell me stuff.

Starting with a website. That's what you do now. In the old days you might have gone to a trade show or looked in a magazine, and ended up with a brochure. No more - if I want product details, I'll look at your website. If your product details are sketchy, non-existent, out of date, inconsistent, vaporware, and generally devoid of technical content, then I'll look elsewhere. If you want me to register to view your technical documentation - even something as simple as the prerequisite system requirements - then I'll likely look elsewhere. If it's impossible to even guess what ballpark your prices are in, then I'll assume I can't afford it.

Then, please tell me how to buy the stuff. If you have resellers, have a list. Make sure that your resellers have actually heard of you. And keep that list up to date and accurate. If you sell direct, say so.

And last, but certainly not least, if a potential customer emails you - either direct or via that stupid form on your website that has a little postage stamp sized box to put the query in - showing an interest in buying your stuff, actually show some interest in selling your product. Answer the email, at the very least. Do it promptly. Do it accurately.

Looking for products recently, many supplier websites simply fail, completely. In some cases, there's a confusingly jumbled list of products that you have to visit individually to work out if they meet your requirements. Other times, they've decided to randomly segment their offerings into neat customer buckets, so I have to trawl through all their sections to find the product I want. One site had a useful handy product chart - with no hyperlinks, and listing a number of models that didn't exist as far as I can tell. Being forced to use a search engine to navigate a supplier's site is not without its problems - often you land on old pages with no indication whether the product is even current.

I've been trying to get a number of quotes recently. Maybe a third of suppliers and/or manufacturers are pretty good (thank you to all of you, by the way). Another third have such a dismal web presence that they're clearly not fit for the 21st century, so why would I even think of using them? The remaining third have simply ignored all attempts to contact them. Given how difficult it is for IT companies to survive in the current rapidly changing and highly competitive landscape, it staggers me what a poor job so many companies are doing.

Monday, February 15, 2016

On FOSDEM

I recently travelled to FOSDEM. Elsewhere I've talked about getting to Brussels and Back; here are some of my notes on the event itself.

I stayed in a hotel that was out of the centre, about a third of the way to the campus. This meant I could walk there and back, which made up for missing my regular morning swim. Shame it was so damp and drizzly.

On Friday evening, a bunch of illumos aficionados met up and went to Manhattns for dinner, before heading off to the beer event. The Delirium Cafe wasn't quite as packed as I expected, although the queues at the bars were pretty long. There was a qualifying question to gain entry as a FOSDEM attendee - what's your favourite distribution? What, you've never heard of Tribblix?

I went to quite a few talks. One thing that was managed extremely well was that the talks ran to time. There are a lot of rooms and tracks, but they do a very good job of sticking to the timetable. What this means is that if you leg it across the campus for a talk you want to see, you can be pretty confident that it'll be on when it says it will.

I went to Mark Reinhold's talk on The State of OpenJDK. Interesting to see what the current focus is and where they're heading. Of particular interest to me was project Panama, aiming to supplant the user-hostile JNI as a bridge to native code.

Then Dalibor and Rory on Preparing for JDK9. Apart from all the changes coming up, the one thing that I noted was the version string changes. I also learnt about the jdeps tool, which could be very useful.

Changing tack webwards, I learnt about telemetry in Firefox, and telemetry.mozilla.org. Following that, more on HTTP/2 - 30% adoption is pretty good after less than a year, but I guess that's a reflection of how web traffic is dominated by a relatively small number of sites. There's a huge long tail of small sites that are going to take much longer to migrate, if ever. And one thing I didn't know is that client certificates aren't yet supported in HTTP/2, which is a bit of a pain.

In between, I spent time going round the various project stands. We had an illumos booth, it would have been nice to spend more time at that.

I spent a lot of Sunday in the main Janson lecture theatre.

First up, Re-thinking Linux Distributions. Or, as I interpreted it, moving on from package management as the defining characteristic of a distribution. This is a subject I'm deeply interested in, as it forms part of my thoughts about severing the link between applications and the OS, and thinking about software stacks as a useful unit.

Then, Reproducible Builds. Although it's not so much about reproducible builds as about reproducible package archives. The two aren't necessarily the same. For example, IPS doesn't even have package archives, and is only interested in changes to the binary content rather than to irrelevant metadata. And we have tooling like wsdiff to identify changes in illumos builds. Still, knowing that your build is completely reproducible is a goal we ought to work towards, although we may end up with a slightly different slant on the subject.

Next, Dan talked about illumos at 5, even mentioning yours truly. And it was great to talk to Thomas from opencsw after the talk.

The State of Go was looking forward to the forthcoming release of Go 1.6. I was motivated to test the release candidate on illumos, and was pleased to see that rc2 builds and runs just fine.

One of the most interesting talks was on LibreOffice Online. How it works (tiling like online map viewers) and some of the waste they have managed to eliminate. Something else I picked up on was the potential for LibreOfficeKit to expose a simple API for other tools to talk to.

It was a busy weekend, and quite focussed. There wasn't as much casual chat as I might have liked. To take advantage of FOSDEM, you have to be organised - plan your schedule out in advance. If I go next year I'll try and give a Tribblix lightning talk, just to get a bit of exposure.

Tuesday, February 09, 2016

Building influxdb and grafana on Tribblix

For a while now I've been looking at alternative ways to visualize kstat data. Something beyond JKstat and KAR, at least.

An obvious thought is: there are many time-series databases being used for monitoring now, with a variety of user-configurable dashboards that can be used to query and display the data. Why not use one of those?

First the database. For this test I'm using InfluxDB, which is written, like many applications are these days, in Go. Fortunately, Go works fine on illumos and I package it for Tribblix, so it's fairly easy to follow the instructions. Make a working directory, cd there, and:


export GOPATH=`pwd`
go get github.com/influxdb/influxdb

cd $GOPATH/src/github.com/influxdb/influxdb
go get -u -f -t ./...

go clean ./...
go install ./...

(Note that it's influxdb/influxdb, not influxdata/influxdb. The name was changed, but the source and the build still use the old name.)

That should just work, leaving you with binaries in $GOPATH/bin.

So then you'll want a visualization front end. Now, there is Chronograf. Unfortunately it's closed source (that's fine, companies can make whatever choices they like) which means I can't build it for Tribblix. The other obvious path is Grafana.

Building Grafana requires Go, which we've already got, and Node.js. Again, Tribblix has Node.js, so we're (almost) good to go.

Again, it's mostly a case of following the build instructions. For Grafana, this comes in 2 parts. The back-end is Go, so make a working directory, cd there, and:

export GOPATH=`pwd`
go get github.com/grafana/grafana

cd $GOPATH/src/github.com/grafana/grafana
go run build.go setup
$GOPATH/bin/godep restore
go run build.go build

You'll find the Grafana server in $GOPATH/src/github.com/grafana/grafana/bin/grafana-server

The front-end involves a little variation to get it to work properly. The problem here is that a basic 'npm install' will install both production and development dependencies. We don't actually want to do development of Grafana, which ultimately requires webkit and won't work anyway. So we really just want the production pieces, and we don't want to install anything globally. But we still need to run 'npm install' to start with, as otherwise the dependencies get messed up. Just ignore the errors and warnings around PhantomJS.

npm install

npm install --production
npm install grunt-cli
./node_modules/.bin/grunt --force

With that, you can fire up influxd and grafana-server, and get them to talk to each other.

For the general aspects of getting Grafana and Influxdb to talk to each other, here's a tutorial I found useful.

Now, with all this in place, I can go back to playing with kstats.
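As a rough idea of what feeding kstats into InfluxDB might look like, the sketch below turns the parseable output of kstat -p (module:instance:name:statistic, a tab, then the value) into InfluxDB line protocol. The measurement name kstat, the tag names, and the database name in the usage comment are all assumptions made for the example, and only purely numeric values are passed through.

```shell
# Convert `kstat -p` output on stdin into InfluxDB line protocol.
# Non-numeric statistics (such as class or strings) are skipped.
kstat_to_line() {
    awk -F'\t' '$2 ~ /^[0-9]+(\.[0-9]+)?$/ {
        split($1, k, ":")
        printf "kstat,module=%s,instance=%s,name=%s,statistic=%s value=%s\n",
            k[1], k[2], k[3], k[4], $2
    }'
}

# Example use (assumes a database called kstat already exists):
#   kstat -p unix:0:system_misc:nproc | kstat_to_line |
#       curl -XPOST 'http://localhost:8086/write?db=kstat' --data-binary @-
```

Names containing spaces or commas would need escaping before being used as tag values, which this sketch doesn't attempt.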

Thursday, December 24, 2015

The palatability of complexity

There seems to be a general trend to always add complexity to any system. Perhaps it's just the way most of our brains are wired, but we just can't help it.

Whether this be administrative tasks (filing your expenses), computer software (who hasn't suffered the dead hand of creeping featurism), systems administration, or even building a tax system, the trend seems always to be to keep adding additional layers of complexity.

Eventually, this stops when the complexity becomes unsustainable. People can rebel - they will go round the back of the system, taking short cuts to achieve their objectives without having to deal with the complexity imposed on them. Or they leave - for another company without the overblown processes, or another piece of software that is easier to use.

But there's another common way of dealing with the problem that is superficially attractive but with far worse consequences, which involves the addition of what I'll call a palatability layer. Rather than address the underlying problem, an additional layer is added on top to make it easier to deal with.

Which fails in two ways: you have failed to actually eliminate the underlying complexity, and the layer you've added will itself grow in complexity until it reaches the palatability threshold. (At which point, someone will add another layer, and the cycle repeats.)

Sometimes, existing bugs and accidental implementation artefacts become embedded as dogma in the new palatability layer. Worse, over time all expertise gravitates to the outermost layer, leaving you with nobody capable of understanding the innermost internals.

On occasion, the palatability layer becomes inflated to the position of a standard. (Which perhaps explains why standards are often so poor, and there are so many to choose from.)

For example, computer languages have grown bloated and complex. Features have been added, dependencies have grown. Every so often a new language emerges as an escape hatch.

Historically, I've often been opposed to the use of Configuration Management, because it would end up being used to support complexity rather than enforcing simplicity. This is not a fault of the tool, but of the humans who would abuse it.

As another example, I personally use an editor to write code rather than an IDE. That way, I can't write overly complex code, and it forces me to understand every line of code I write.

Every time you add a palatability layer, while you might think you're making things better, in reality you're helping build a house of cards on quicksand.

Monday, December 14, 2015

The cost of user-applied updates

Having updated a whole bunch of my (proprietary) devices with OS updates today, I was moved to tweet:

Imagining a world in which you could charge a supplier for the time it takes you to regularly update their software

On most of my Apple devices I'm applying updates to either iOS or MacOS on a regular basis. Very roughly, it's probably taking away an hour a month - I'm not including the elapsed time for the update (you just schedule this so you get yourself a cup of coffee or something), but there's a bit of planning involved, some level of interaction during the process, and then the need to fix up anything afterwards that got mangled by the update.

I don't currently run Windows, but I used to have to do that as well. And web browsers and applications. And that used to take forever, although current hardware helps (particularly the move away from spinning rust).

And then there's the constant stream of updates to the installed apps. Not all of which you can ignore - some games have regular mandatory updates and if you don't apply them the game won't even start.

If you charge for the time involved at commercial rates, you could easily justify $100 per month or $1000 per year. It's a significant drain on time and productivity, a burden being pushed from suppliers onto end users. Multiply that by the entire user base and you're looking at it having a significant impact on the economy of the planet.

And that's when things go smoothly. Sometimes systems go and apply updates at inconvenient times - I once had Windows update suddenly decide to update my work laptop just as I was shutting it down to go to the airport. Or just before an important meeting. If the update interferes with a critical business function, then costs can skyrocket very easily.

So you could avoid the manual interaction and associated costs, but then you end up giving users no way to prevent bad updates or to schedule them appropriately. Of course, if the things were adequately tested beforehand, or minimised, then there would be much less of a problem, but the update model seems to be to replace the whole shebang and not bother with testing. (Or worry about compatibility.)

It's not just time (or sanity), there's a very real cost in bandwidth. With phone or tablet images being measured in gigabytes, you can very easily blow your usage cap. (Even on broadband - if you're on metered domestic broadband then the usage cap might be 25GB/month, which is fine for email and general browsing, but OS and app updates for a family could easily hit that limit.)

The problem extends beyond computers (or things like phones that people do now think of as computers). My TV and BluRay player have a habit of updating themselves. (And one's significant other gets really annoyed if the thing decides to spend 10 minutes updating itself just as her favourite soap opera is about to start.)

As more and more devices are connected to the network, and update over the network, the problem's only going to get worse. While some updates are going to be necessary due to newly found bugs and security issues, there does seem to be a philosophy of not getting things right in the first place but shipping half-baked and buggy software, relying on being able to update it later.

Any realistic estimate of the actual cost involved in expecting all your end users to maintain the shoddy software that you ship is so high that the industry could never be expected to foot even a small fraction of the bill. Which is unfortunate, because a financial penalty would focus the mind and maybe lead to a much better update process.

Sunday, December 13, 2015

Zones beside Zones

Previously, I've described how to use the Crossbow networking stack in illumos to create a virtualized network topology with Zones behind Zones.

The result there was to create the ability to have zones on a private network segment, behind a proxy/router zone.

What, however, if you want the zones on one of those private segments to communicate with zones on a different private segment?

Consider the following two proxy zones:

A: address 192.168.10.1, subnet 10.1.0.0/16
B: address 192.168.10.2, subnet 10.2.0.0/16

And we want the zones in the 10.1.0.0 and 10.2.0.0 subnets to talk to each other. The first step is to add routes, so that packets from system A destined for the 10.2.0.0 subnet are sent to host B. (And vice versa.)

A: route add net 10.2.0.0/16 192.168.10.2
B: route add net 10.1.0.0/16 192.168.10.1
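With more than two proxy zones, each one needs a route to every other zone's subnet. As a sketch, given a small table of name, address and subnet (a format invented for this example), the routes for any one zone can be generated like this:

```shell
# Print the route commands a given proxy zone needs.
# $1 = name of the local proxy zone
# stdin = one line per proxy zone: name address subnet
peer_routes() {
    awk -v me="$1" '$1 != me { printf "route add net %s %s\n", $3, $2 }'
}

# Example use, with the two zones from above in a file called zones.txt:
#   A 192.168.10.1 10.1.0.0/16
#   B 192.168.10.2 10.2.0.0/16
#   peer_routes A < zones.txt
```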

This doesn't quite work. The packets are sent, but recall that the proxy zone is doing NAT on behalf of the zones behind it. So packets leaving 10.1.0.0 get NATted on the way out, get delivered successfully to the 10.2.0.0 destination but then the reply packet gets NATted on its way back, so it doesn't really work.

So, all that's needed is to not NAT the packets that are going to the other private subnet. Remember the original NAT rule in ipnat.conf on host A would have been:

map pnic0 10.1.0.0/16 -> 0/32 portmap tcp/udp auto
map pnic0 10.1.0.0/16 -> 0/32

and we don't want to NAT anything that is going to 10.2.0.0, which would be:

map pnic0 from 10.1.0.0/16 ! to 10.2.0.0/16 -> 0/32 portmap tcp/udp auto
map pnic0 from 10.1.0.0/16 ! to 10.2.0.0/16 -> 0/32

And that's all there is to it. You now have a very simple private software-defined network with the 10.1 and 10.2 subnets joined together.
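Host B needs the mirror image of those rules, with the two subnets swapped. A small sketch that prints the pair of rules for either side (the function name is made up, and it assumes the external vnic is called pnic0 on both hosts):

```shell
# Print ipnat.conf rules that NAT everything leaving the local subnet
# except traffic destined for the remote private subnet.
# $1 = external interface, $2 = local subnet, $3 = remote subnet
nonat_rules() {
    printf 'map %s from %s ! to %s -> 0/32 portmap tcp/udp auto\n' "$1" "$2" "$3"
    printf 'map %s from %s ! to %s -> 0/32\n' "$1" "$2" "$3"
}

# Rules for host B:
#   nonat_rules pnic0 10.2.0.0/16 10.1.0.0/16
```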

If you think this looks like the approach underlying Project Calico, you would be right. In Calico, you build up the network by managing routes (many more as it's per-host rather than the per-subnet I have here), although Calico has a lot more manageability and smarts built in to it rather than manually adding routes to each host.

While simple, there are obvious problems associated with scaling such a solution.

While adding and deleting routes isn't so bad, listing all the subnets in ipnat.conf would be tedious to say the least. The solution here would be to use the ippool facility to group the subnets.

How do we deal with a dynamic environment? While the back-end zones would come and go all the time, I expect the proxy/router zone topology to be fairly stable, so configuration churn would be fairly low.

The mechanism described here isn't limited to a single host, it easily spans multiple hosts. (With the simplistic routing as I've described it here, those hosts would have to be on the same network, but that's not a fundamental limitation.) My scripts in Tribblix just save details of how the proxy/router zones on a host are configured locally, so I need to extend the logic to a network-wide configuration store. That, at least, is well-known territory.

Thursday, December 10, 2015

Building an application in Docker

We have an application that we want to make easy for people to run. As in, really easy. And for people who aren't necessarily systems administrators or software developers.

The application ought to work on pretty much anything, but each OS platform has its quirks. So we support Ubuntu - it's pretty common, and it's available on most cloud providers too. And there's a simple install script that will set everything up for the user.

In the modern world, Docker is all the rage. And one advantage of Docker from the point of view of a systems administrator is that it decouples the application environment from the systems environment - if you're a Red Hat shop, you just run Red Hat on the systems, then add Docker and your developers can get an Ubuntu (or whatever) environment to run the application in. (The downside of this decoupling is that it gives people an excuse to write even less portable code than they do even now.)

So, one way for us to support more platforms is to support Docker. I already have the script that does everything, so isn't it going to be just a case of creating a Dockerfile like this and building it:

FROM ubuntu:14.04
RUN my_installer.sh

Actually, that turns out to be (surprisingly) close. It turns out to fail on just one line. The Docker build process runs as root, and when we try and initialise postgres with initdb, it errors out as it won't let you run postgres as root.

(As an aside, this notion of "root is unsafe" needs a bit of a rethink. In a containerized or unikernel world, there's nothing beside the one app, so there's no fundamental difference between root and the application user in many cases, and root in a containerized world is a bogus root anyway.)

OK, so we can run the installation as another user. We have to create the user first, of course, so something like:

FROM ubuntu:14.04
RUN useradd -m hbuild
USER hbuild
RUN my_installer.sh

Unfortunately, this turns out to fail all over the place. One thing my install script does is run apt-get via sudo to get all the packages that are necessary. We're user hbuild in the container and can't run sudo, and if we could we would get prompted, which is a bit tricky for the non-interactive build process. So we need to configure sudo so that this user won't get prompted for a password. Which is basically:

FROM ubuntu:14.04
RUN useradd -m -U -G sudo hbuild && \
    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
RUN my_installer.sh

Which solves all the sudo problems, but the script also references $USER (it creates some directories as root, then chowns them to the running user so the build can populate them), and the Docker build environment doesn't set USER (or LOGNAME, as far as I can tell). So we need to populate the environment the way the script expects:

FROM ubuntu:14.04
RUN useradd -m -U -G sudo hbuild && \
    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
ENV USER hbuild
RUN my_installer.sh

And off it goes, cheerfully downloading and building everything.
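An alternative to faking the environment from the Dockerfile would be to make the install script itself tolerate a missing USER. This is a minimal sketch of that idea, not what the real installer does; the function name is invented.

```shell
# Make sure USER is set, falling back to the name (or, failing that,
# the numeric id) of whoever is running the script.
ensure_user() {
    if [ -z "${USER:-}" ]; then
        USER=$(id -un 2>/dev/null || id -u)
    fi
    export USER
}

# Call this near the top of the script, before anything uses $USER:
#   ensure_user
```

Setting it in the Dockerfile has the advantage of leaving the script untouched, which matters if the same script is also used outside Docker.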

I've skipped over how the install script itself ends up on the image. I could use COPY, or even something very crude like:

FROM ubuntu:14.04
RUN apt-get update && apt-get install -y wget
RUN useradd -m -U -G sudo hbuild && \
    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
ENV USER hbuild
RUN cd /home/hbuild && \
    wget http://my.server/my_installer.sh && \
    chmod a+x my_installer.sh && \
    ./my_installer.sh

This all works, but is decidedly sub-optimal. Leaving aside the fact that we're running both the application and the database inside a single container (changing that is a rather bigger architectural change than we're interested in right now), the Docker images end up being huge, and you're downloading half the universe each time. So to do this properly you would add an extra RUN step that did all the packaging and cleaned up after itself, so you have a base layer to build the application on.
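A base layer along those lines might look something like the following. The package list is purely hypothetical (the real one would be whatever the install script pulls in via apt-get), so treat this as a sketch of the shape rather than a working recipe.

```dockerfile
FROM ubuntu:14.04
# Hypothetical package list: substitute whatever the installer needs.
# Installing and cleaning up in a single RUN keeps the package cache
# out of the resulting layer.
RUN apt-get update && \
    apt-get install -y wget build-essential postgresql && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```

The application image would then start FROM that base image, so the slow package step is only repeated when the package list changes.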

What this does show, though, is that it's not that hard to take an existing deployment script and wrap it inside Docker - all it took here was a little fakery of the environment to more closely align with how the script was expecting to be run.

Monday, November 30, 2015

Zones behind zones

With Solaris 10 came a host of innovations - ZFS, DTrace, and zones were the big three; there was also SMF, FMA, NFS4, and a supporting cast of improvements across the board.

The next big followup was Crossbow, giving full network virtualization. It never made it back into Solaris 10, and it's largely gone under the radar.

Which is a shame, because it's present in illumos (as well as Solaris 11), and allows you to construct essentially arbitrary network configuration in software. Coupled with zones you can build a whole datacentre in a box.

Putting everything together is a bit tedious, however. One of the things I wanted with Tribblix is to enable people (myself being the primary customer) to easily take advantage of the technologies, to automate away all the tedious grunt work.

This is already true up to a point - zap can create and destroy zones with a single command. No more mucking around with the error-prone process of writing config files and long streams of commands. Computers exist to do all this for us, leaving humans to worry about what they want to do, not the minutiae of how.

So the next thing I wanted to do was to have a zone that can act as a router or proxy (I've not yet really settled on a name), so you have a hidden network with zones that can only be reached from your proxy zone. There are a couple of obvious use cases:

You have a number of web applications in isolated zones, and run a reverse proxy like nginx or haproxy in your proxy zone to act as a customer-facing endpoint.
You have a multitier application, with just the customer-facing tier in the proxy zone, and the other tiers (such as your database) safely hidden away.

Of course, you could combine the two.

So the overall aim is that:

Given an appropriate flag and an argument that's a network description (ideally a.b.c.d/prefix in CIDR notation) the tool will automatically use Crossbow to create the appropriate plumbing, hook the proxy zone up to that network, and configure it appropriately
In the simplest case, the proxy zone will use NAT to forward packets from the zones behind it, and be the default gateway for those zones (but I don't want it to do any real network routing)
If you create a zone with an address on the hidden subnet, then again all the plumbing will get set up so that the zone is connected up to the appropriate device and has its network settings correctly configured

This will be done automatically, but it's worth walking through the steps manually.

As an example, I want to set up the network 10.2.0.0/16. By convention, the proxy zone will be connected to it with the bottom address - 10.2.0.1 in this case.
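The "bottom address" convention can be sketched in a few lines. This is only an illustration using Python's standard ipaddress module; the function names are made up for this example and are not part of zap.

```python
import ipaddress


def proxy_address(cidr: str) -> str:
    """Return the bottom (lowest usable) address of a hidden network.

    Hypothetical helper: IPv4 only, and the network must be given with
    its host bits clear (for example 10.2.0.0/16).
    """
    network = ipaddress.IPv4Network(cidr, strict=True)
    return str(network.network_address + 1)


def proxy_interface_address(cidr: str) -> str:
    """Return the proxy address with the prefix length attached."""
    network = ipaddress.IPv4Network(cidr, strict=True)
    return f"{proxy_address(cidr)}/{network.prefixlen}"


if __name__ == "__main__":
    print(proxy_address("10.2.0.0/16"))
    print(proxy_interface_address("10.2.0.0/16"))
```

The second helper gives the form you would assign to the proxy zone's interface on the hidden network.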

The first step is to create an etherstub:

dladm create-etherstub zstub0

And then a vnic over it that will be the interface to this new private network:

dladm create-vnic -l zstub0 znic0

Now, for the zone to be able to manage all the networking stuff it needs to have an exclusive-ip network stack. So you need to create another vnic for the public-facing side of the network. Let's suppose you're going to use the e1000g0 interface:

dladm create-vnic -l e1000g0 pnic0

You create the zone with exclusive-ip and add the pnic0 and znic0 interfaces.

Within the zone, configure the address of znic0 to be 10.2.0.1/16.

You need to set up IP forwarding on all the interfaces in the zone:

ipadm set-ifprop -p forwarding=on -m ipv4 znic0
ipadm set-ifprop -p forwarding=on -m ipv4 pnic0

The zone also needs to NAT the traffic coming in from the 10.2 network. The file /etc/ipf/ipnat.conf needs to contain:

map pnic0 10.2.0.0/16 -> 0/32 portmap tcp/udp auto
map pnic0 10.2.0.0/16 -> 0/32

and you need to enable ipfilter in the zone with the command svcadm enable network/ipfilter.
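Since the two NAT rules differ only in the interface name and the network, they are easy to generate. A minimal sketch, assuming a hypothetical helper called ipnat_rules; the rule text itself is copied from the example above and nothing else about ipnat.conf is implied.

```python
import ipaddress


def ipnat_rules(public_nic: str, cidr: str) -> str:
    """Render the two NAT map rules for a hidden network.

    Hypothetical helper: public_nic is the public-facing vnic (pnic0 in
    the walkthrough) and cidr is the hidden network behind the proxy zone.
    """
    network = ipaddress.IPv4Network(cidr, strict=True)
    return (
        f"map {public_nic} {network} -> 0/32 portmap tcp/udp auto\n"
        f"map {public_nic} {network} -> 0/32\n"
    )


if __name__ == "__main__":
    # The result would be written to /etc/ipf/ipnat.conf inside the zone.
    print(ipnat_rules("pnic0", "10.2.0.0/16"), end="")
```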

Then, if you create a zone with address 10.2.0.2, for example, you need to create a new vnic over the zstub0 etherstub:

dladm create-vnic -l zstub0 znic1

and allocate the znic1 interface to the zone. Then, in that new zone, set the address of znic1 to be 10.2.0.2 and its default gateway to be 10.2.0.1.

That's just about manageable. But in reality it gets far more complicated:

With multiple networks and zones, you have to dynamically allocate the etherstub and vnic names, because they aren't fixed
You have to make sure to delete all the items you have created when you destroy a zone
You need to be able to find which etherstub is associated with a given network, so you attach a new zone to the correct etherstub
Ideally, you want all the hidden networks to be unique (you don't have to, but as the person writing this I can make it so to keep things simple for me)
You want to make sure you can't delete a proxy zone if there are zones on the network behind it
You want the zones to boot up with their networks fully and correctly configured (there's a lot going on here that I haven't even mentioned)
You may need to configure rather more of a firewall than the simple NAT configuration
In the case of a reverse proxy, you need a way to update the reverse proxy configuration automatically as zones come and go
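To make the first few items concrete, here is a rough sketch of the bookkeeping involved: picking the next free device name, keeping hidden networks unique, and finding the etherstub for a given address. All the names here (next_free_name, HiddenNetworks) are invented for illustration, the state lives only in memory, and this is not how zap actually does it.

```python
import ipaddress
import re
from typing import Dict, Iterable


def next_free_name(prefix: str, existing: Iterable[str]) -> str:
    """Return prefix plus the lowest number not already in use."""
    used = set()
    for name in existing:
        match = re.fullmatch(re.escape(prefix) + r"(\d+)", name)
        if match:
            used.add(int(match.group(1)))
    number = 0
    while number in used:
        number += 1
    return f"{prefix}{number}"


class HiddenNetworks:
    """Track which etherstub serves which hidden network (IPv4 only)."""

    def __init__(self) -> None:
        self._stubs: Dict[ipaddress.IPv4Network, str] = {}

    def add(self, cidr: str) -> str:
        """Register a new hidden network and return its etherstub name."""
        network = ipaddress.IPv4Network(cidr, strict=True)
        for known in self._stubs:
            if network.overlaps(known):
                raise ValueError(f"{network} overlaps existing {known}")
        stub = next_free_name("zstub", self._stubs.values())
        self._stubs[network] = stub
        return stub

    def stub_for(self, address: str) -> str:
        """Find the etherstub a zone with this address should attach to."""
        addr = ipaddress.IPv4Address(address)
        for network, stub in self._stubs.items():
            if addr in network:
                return stub
        raise LookupError(f"no hidden network contains {address}")


if __name__ == "__main__":
    networks = HiddenNetworks()
    print(networks.add("10.2.0.0/16"))
    print(networks.stub_for("10.2.0.2"))
    print(next_free_name("znic", ["znic0", "znic1"]))
```

A real implementation would also have to persist this state and reconcile it with what dladm reports, which is where much of the remaining complexity comes from.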

Overall, there are a whole lot of hoops to jump through, and a lot of information to track and verify.

I'm about halfway through writing this at the moment, with most of the basic functionality present. I can, as the author, make a number of simplifying assumptions - I get to choose the naming convention, I can declare that the hidden networks must be unique, I can declare that I will only support simple prefixes (/8, /16, and /24) rather than arbitrary prefixes, and so on.
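The "simple prefixes only" assumption is cheap to enforce up front. A minimal sketch, with a made-up function name, of what that validation could look like:

```python
import ipaddress

# Assumption taken from the text: only /8, /16 and /24 are supported.
SIMPLE_PREFIXES = (8, 16, 24)


def parse_hidden_network(cidr: str) -> ipaddress.IPv4Network:
    """Parse a.b.c.d/prefix, rejecting anything but a simple prefix.

    Hypothetical helper. IPv4Network itself raises ValueError if the
    string is malformed or has host bits set (such as 10.2.0.1/16).
    """
    network = ipaddress.IPv4Network(cidr, strict=True)
    if network.prefixlen not in SIMPLE_PREFIXES:
        raise ValueError(f"unsupported prefix /{network.prefixlen} in {cidr}")
    return network


if __name__ == "__main__":
    print(parse_hidden_network("10.2.0.0/16"))
```

Restricting the prefixes this way means the network boundary always falls on a dot in the dotted-quad form, which keeps any address arithmetic trivial.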

Thursday, November 26, 2015

Buggy basename

Every so often you marvel at the lengths people go to to break things.

Take the basename command in illumos, for example. This comes in two incarnations - /usr/bin/basename, and /usr/xpg4/bin/basename.

Try this:

# /usr/xpg4/bin/basename -- /a/b/c.txt
c.txt

Which is correct, and:

# /usr/bin/basename -- /a/b/c.txt
--

Which isn't.

Wait, it gets worse:

# /usr/xpg4/bin/basename /a/b/c.txt .t.t
c.txt

Correct. But:

# /usr/bin/basename /a/b/c.txt .t.t
c

Err, what?

Perusal of the source code reveals the answer to the "--" handling - it's only caught in XPG4 mode. Which is plain stupid, there's no good reason to deliberately restrict correct behaviour to XPG4.

Then the somewhat bizarre handling with the ".t.t" suffix. So it turns out that the default basename command is doing pattern matching rather than the expected string matching. So the "." will match any character, rather than being interpreted literally. Given how commonly "." is used to separate the filename from its suffix, and the common usage of basename to strip off the suffix, this is a guarantee for failure and confusion. For example:

# /usr/bin/basename /a/b/cdtxt .txt
c
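The difference between the two behaviours can be modelled in a few lines. This is only an illustration: Python's re module stands in for the pattern matching, the function names are invented, and none of this is the actual illumos code.

```python
import re


def strip_suffix_literal(name: str, suffix: str) -> str:
    """Remove suffix only if name really ends with those exact characters."""
    if suffix and name != suffix and name.endswith(suffix):
        return name[: -len(suffix)]
    return name


def strip_suffix_pattern(name: str, suffix: str) -> str:
    """Treat suffix as a regular expression, so '.' matches any character."""
    match = re.fullmatch("(.*)" + suffix, name)
    return match.group(1) if match else name


if __name__ == "__main__":
    for name, suffix in (("c.txt", ".t.t"), ("cdtxt", ".txt"), ("c.txt", ".txt")):
        print(
            name,
            suffix,
            strip_suffix_literal(name, suffix),
            strip_suffix_pattern(name, suffix),
        )
```

With literal matching, only the last case loses its suffix; with pattern matching, all three collapse to "c".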

The fact that there's a difference here is actually documented in the man page, although not very well - it points you to expr(1) which doesn't tell you anything relevant.

So, does anybody rely on the buggy broken behaviour here?

It's worth noting that the ksh basename builtin and everybody else's basename implementation seem to do the right thing.

Fixing this would also get rid of a third of the lines of code and we could just ship 1 binary instead of 2.

Tuesday, November 24, 2015

Replacing SunSSH with OpenSSH in Tribblix

I recently did some work to replace the old SSH implementation used by Tribblix, which was the old SunSSH from illumos, with OpenSSH.

This was always on the list - our SunSSH implementation was decrepit and unmaintained, and there seemed little point in general in maintaining our own version.

The need to replace it has become more urgent recently, as the mainstream SSH implementations have drifted away from ours - to the point that our implementation will not interoperate at all with that on modern Linux distributions with the default settings.

As I've been doing a bit of work with some of those modern Linux distributions, being unable to connect to them was a bit of a pain in the neck.

Other illumos distributions such as OmniOS and SmartOS have also recently been making the switch.

Then there was a proposal to work on the SunSSH implementation so that it was mediated - allowing you to install both SunSSH and OpenSSH and dynamically switch between them to ease the transition. Personally, I couldn't see the point - it seemed to me much easier to simply nuke SunSSH, especially as some distros had already made or were in the process of making the transition. But I digress.

If you look at OmniOS, SmartOS, or OpenIndiana, they have a number of patches - in some cases, a lot of patches - to bring OpenSSH more in line with old SunSSH.

I studied these at some length, and largely rejected them. There are a couple of reasons for this:

In Tribblix, I have a philosophy of making minimal modifications to upstream projects. I might apply patches to make software build, or when replacing older components so that I don't break binary compatibility, but in general what I ship is as close to what you would get if you did './configure --prefix=/usr; make ; make install' as I can make it.
Some of the fixes were for functionality that I don't use, probably won't use, and have no way of testing. So blindly applying patches and hoping that what I produce still works, and doesn't arbitrarily break something else, isn't appealing. Unfortunately all the gssapi stuff falls into this bracket.

One thing that might change this in the future, and something we've discussed a little, is to have something like Joyent's illumos-extra brought up to a state where it can be used as a common baseline across all illumos distributions. It's a bit too specific to SmartOS right now, so won't work for me out of the box, and it's a little unfortunate that I've just about reimplemented all the same things for Tribblix myself.

So what I ship is almost vanilla OpenSSH. The modifications I have made are fairly few:

It's split into the same packages (3 of them) along just about the same boundaries as before. This is so that you don't accidentally mix bits of SunSSH with the new OpenSSH build.

The server has

KexAlgorithms +diffie-hellman-group1-sha1

added to /etc/ssh/sshd_config to allow connections from older SunSSH clients.

The client has

PubkeyAcceptedKeyTypes +ssh-dss

added to /etc/ssh/ssh_config so that it will allow you to send DSA keys, for users who still have just DSA keys.

Now, I'm not 100% happy about the fact that I might have broken something that SunSSH might have done, but having a working SSH that will interoperate with all the machines I need to talk to outweighs any minor disadvantages.

Sunday, November 22, 2015

On Keeping Your Stuff to Yourself

One of the fundamental principles of OmniOS - and indeed probably its defining characteristic - is KYSTY, or Keep Your Stuff* To Yourself.

(*um, whatever.)

This isn't anything new. I've expressed similar opinions in the past. To reiterate - any software that is critical for the success of your business/project/infrastructure/whatever should be directly under your control, rather than being completely at the whim of some external entity (in this case, your OS supplier).

We can flesh this out a bit. The software on a system will fall, generally, into 3 categories:

The operating system, the stuff required for the system to boot and run reliably
Your application, and its dependencies
General utilities

As an aside, there are more modern takes on the above problem: with Docker, you bundle the operating system with your application; with unikernels you just link whatever you need from classes 1 and 2 into your application. Problem solved - or swept under the carpet, rather.

Looking at the above, OmniOS will only ship software in class 1, leaving the rest to the end user. SmartOS is a bit of a hybrid - it likes to hide everything in class 1 from you and relies on pkgsrc to supply classes 2 and 3, and the bits of class 1 that you might need.

Most (of the major) Linux distributions ship classes 1, 2, and 3, often in some crazily interdependent mess that you have to spend ages unpicking. The problem being that you need to work extra hard to ensure your own build doesn't accidentally acquire a dependency on some system component (or that your build somehow reads a system configuration file).

Generally missing from discussions is class 3 - the general utilities. Stuff that you could really do with an instance of to make your life easier, but where you don't really care about the specifics.

For example, it helps to have a copy of the GNU userland around. Way too much source out there needs GNU tar to unpack, or GNU make to build, or assumes various things about the userland that are only true of the GNU tools. (Sometimes, the GNU tools aren't just a randomly incompatible implementation; they occasionally have capabilities that are missing from standard tools - like in-place editing in gsed.)

Or a reasonably complete suite of compression utilities. More accurately, uncompression, so that you have a pretty good chance of being able to unpack some arbitrary format that people have decided to use.

Then there are generic runtimes. There's an awful lot of python or perl out there, and sometimes the most convenient way to get a job done is to put together a small script or even a one-liner. So while you don't really care about the precise details, having copies of the appropriate runtimes (and you might add java, erlang, ruby, node, or whatever to that list) really helps for the occasions when you just want to put together a quick throwaway component. Again, if your business-critical application stack requires that runtime, you maintain your own, with whatever modules you need.

There might also be a need for basic graphics. You might not want or need a desktop, but something is linked against X11 anyway. (For example, java was mistakenly linked against X11 for font handling, even in headless mode - a bug recently fixed.) Even if it's not X11, applications might use common code such as cairo or pango for drawing. Or they might need to read or write image formats for web display.

So the chances are that you might pull in a very large code surface, just for convenience. Certainly I've spent a lot of time building 3rd-party libraries and applications on OmniOS that were included as standard pretty much everywhere else.

In Tribblix, I've attempted to build and package software cognizant of the above limitations. So I supply as wide a range of software in class 3 as I can - this is driven by my own needs and interests, as a rule, but over time it's increasingly complete. I do supply application stacks, but these are built to be in a separate location, and are kept at arm's length from the rest of the system. This is then integrated with zones in a standardized zone architecture in a way that can be managed by zap. My intention here is not necessarily to supply the building blocks that can be used by users, but to provide the whole application, fully configured and ready to go.