Thursday, December 24, 2015

The palatability of complexity

There seems to be a general trend to always add complexity to any system. Perhaps it's just the way most of our brains are wired, but we just can't help it.

Whether this be administrative tasks (filing your expenses), computer software (who hasn't suffered the dead hand of creeping featurism), systems administration, or even building a tax system, the trend seems always to be to keep adding additional layers of complexity.

Eventually, this stops when the complexity becomes unsustainable. People can rebel - they will go round the back of the system, taking short cuts to achieve their objectives without having to deal with the complexity imposed on them. Or they leave - for another company without the overblown processes, or another piece of software that is easier to use.

But there's another common way of dealing with the problem that is superficially attractive but with far worse consequences, which involves the addition of what I'll call a palatability layer. Rather than address the underlying problem, an additional layer is added on top to make it easier to deal with.

Which fails in two ways: you have failed to actually eliminate the underlying complexity, and the layer you've added will itself grow in complexity until it reaches the palatability threshold. (At which point, someone will add another layer, and the cycle repeats.)

Sometimes, existing bugs and accidental implementation artefacts become embedded as dogma in the new palatability layer. Worse, over time all expertise gravitates to the outermost layer, leaving you with nobody capable of understanding the innermost internals.

On occasion, the palatability layer becomes inflated to the position of a standard. (Which perhaps explains why standards are often so poor, and there are so many to choose from.)

For example, computer languages have grown bloated and complex. Features have been added, dependencies have grown. Every so often a new language emerges as an escape hatch.

Historically, I've often been opposed to the use of Configuration Management, because it would end up being used to support complexity rather than enforcing simplicity. This is not a fault of the tool, but of the humans who would abuse it.

As another example, I personally use an editor to write code rather than an IDE. That way, I can't write overly complex code, and it forces me to understand every line of code I write.

Every time you add a palatability layer, while you might think you're making things better, in reality you're helping build a house of cards on quicksand.

Monday, December 14, 2015

The cost of user-applied updates

Having updated a whole bunch of my (proprietary) devices with OS updates today, I was moved to tweet:

Imagining a world in which you could charge a supplier for the time it takes you to regularly update their software

On most of my Apple devices I'm applying updates to either iOS or MacOS on a regular basis. Very roughly, it's probably taking away an hour a month - I'm not including the elapsed time for the update (you just schedule this so you get yourself a cup of coffee or something), but there's a bit of planning involved, some level of interaction during the process, and then the need to fix up anything afterwards that got mangled by the update.

I don't currently run Windows, but I used to have to do that as well. And web browsers and applications. And that used to take forever, although current hardware helps (particularly the move away from spinning rust).

And then there's the constant stream of updates at the installed apps. Not all of which you can ignore - some games have regular mandatory updates and if you don't apply them the game won't even start.

If you charge for the time involved at commercial rates, you could easily justify $100 per month or $1000 per year. It's a significant drain on time and productivity, a burden being pushed from suppliers onto end users. Multiply that by the entire user base and you're looking at it having a significant impact on the economy of the planet.

And that's when things go smoothly. Sometimes systems go and apply updates at inconvenient times - I once had Windows update suddenly decide to update my work laptop just as I was shutting it down to go to the airport. Or just before an important meeting. If the update interferes with a critical business function, then costs can skyrocket very easily.

So you could avoid the manual interaction and associated costs, but then you end up giving users no way to prevent bad updates or to schedule them appropriately. Of course, if the things were adequately tested beforehand, or minimised, then there would be much less of a problem, but the update model seems to be to replace the whole shebang and not bother with testing. (Or worry about compatibility.)

It's not just time (or sanity), there's a very real cost in bandwidth. With phone or tablet images being measured in gigabytes, you can very easily blow your usage cap. (Even on broadband - if you're on metered domestic broadband then the usage cap might be 25GB/month, which is fine for email and general browsing, but OS and app updates for a family could easily hit that limit.)

The problem extends beyond computers (or things like phones that people do now think of as computers). My TV and BluRay player have a habit of updating themselves. (And one's significant other gets really annoyed if the thing decides to spend 10 minutes updating itself just as her favourite soap opera is about to start.)

As more and more devices are connected to the network, and update over the network, the problem's only going to get worse. While some updates are going to be necessary due to newly found bugs and security issues, there does seem to be a philosophy of not getting things right in the first place but shipping half-baked and buggy software, relying on being able to update it later.

Any realistic estimate of the actual cost involved in expecting all your end users to maintain the shoddy software that you ship is so high that the industry could never be expected to foot even a small fraction of the bill. Which is unfortunate, because a financial penalty would focus the mind and maybe lead to a much better update process.

Sunday, December 13, 2015

Zones beside Zones

Previously, I've described how to use the Crossbow networking stack in illumos to create a virtualized network topology with Zones behind Zones.

The result there was to create the ability to have zones on a private network segment, behind a proxy/router zone.

What, however, if you want the zones on one of those private segments to communicate with zones on a different private segment?

Consider the following two proxy zones:

A: address, subnet
B: address, subnet

And we want the zones in the and subnets to talk to each other. The first step is to add routes, so that packets from system A destined for the subnet are sent to host B. (And vice versa.)

A: route add net
B: route add net

This doesn't quite work. The packets are sent, but recall that the proxy zone is doing NAT on behalf of the zones behind it. So packets leaving get NATted  on the way out, get delivered successfully to the destination but then the reply packet gets NATted on its way back, so it doesn't really work.

So, all that's needed is to not NAT the packets that are going to the other private subnet. Remember the original NAT rule in ipnat.conf on host A would have been:

map pnic0 -> 0/32 portmap tcp/udp auto
map pnic0 -> 0/32

and we don't want to NAT anything that is going to, which would be:

map pnic0 from ! to -> 0/32 portmap tcp/udp auto
map pnic0 from ! to -> 0/32

And that's all there is to it. You now have a very simple private software-defined network with the 10.1 and 10.2 subnets joined together.

If you think this looks like the approach underlying Project Calico, you would be right. In Calico, you build up the network by managing routes (many more as it's per-host rather than the per-subnet I have here), although Calico has a lot more manageability and smarts built in to it rather than manually adding routes to each host.

While simple, there are obvious problems associated with scaling such a solution.

While adding and deleting routes isn't so bad, listing all the subnets in ipnat.conf would be tedious to say the least. The solution here would be to use the ippool facility to group the subnets.

How do we deal with a dynamic environment? While the back-end zones would come and go all the time, I expect the proxy/router zone topology to be fairly stable, so configuration churn would be fairly low.

The mechanism described here isn't limited to a single host, it easily spans multiple hosts. (With the simplistic routing as I've described it here, those hosts would have to be on the same network, but that's not a fundamental limitation.) My scripts in Tribblix just save details of how the proxy/router zones on a host are configured locally, so I need to extend the logic to a network-wide configuration store. That, at least, is well-known territory.

Thursday, December 10, 2015

Building an application in Docker

We have an application that we want to make easy for people to run. As in, really easy. And for people who aren't necessarily systems administrators or software developers.

The application ought to work on pretty much anything, but each OS platform has its quirks. So we support Ubuntu - it's pretty common, and it's available on most cloud providers too. And there's a simple install script that will set everything up for the user.

In the modern world, Docker is all the rage. And one advantage of Docker from the point of a systems administrator is that it decouples the application environment from the systems environment - if you're a Red Hat shop, you just run Red Hat on the systems, then add Docker and your developers can get a Ubuntu (or whatever) environment to run the application in. (The downside of this decoupling is that it gives people an excuse to write even less portable code than they do even now.)

So, one way for us to support more platforms is to support Docker. I already have the script that does everything, so isn't it going to be just a case of creating a Dockerfile like this and building it:

FROM ubuntu:14.04

Actually, that turns out to be (surprisingly) close. It turns out to fail on just one line. The Docker build process runs as root, and when we try and initialise postgres with initdb, it errors out as it won't let you run postgres as root.

(As an aside, this notion of "root is unsafe" needs a bit of a rethink. In a containerized or unikernel world, there's nothing beside the one app, so there's no fundamental difference between root and the application user in many cases, and root in a containerized world is a bogus root anyway.)

OK, so we can run the installation as another user. We have to create the user first, of course, so something like:

FROM ubuntu:14.04
RUN useradd -m hbuild

USER hbuild

Unfortunately, this turns out to fail all over the place. One thing my install script does is run apt-get via sudo to get all the packages that are necessary. We're user hbuild in the container and can't run sudo, and if we could we would get prompted, which is a bit tricky for the non-interactive build process. So we need to configure sudo so that this user won't get prompted for a password. Which is basically:

FROM ubuntu:14.04
RUN useradd -m -U -G sudo hbuild && \

    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild

Which solves all the sudo problems, but the script also references $USER (it creates some directories as root, then chowns them to the running user so the build can populate them), and the Docker build environment doesn't set USER (or LOGNAME, as far as I can tell). So we need to populate the environment the way the script expects:

FROM ubuntu:14.04
RUN useradd -m -U -G sudo hbuild && \

    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
ENV USER hbuild

And off it goes, cheerfully downloading and building everything.

I've skipped over how the install script itself ends up on the image. I could use COPY, or even something very crude like:

FROM ubuntu:14.04
RUN apt-get install -y wget
RUN useradd -m -U -G sudo hbuild && \

    echo "hbuild ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
USER hbuild
ENV USER hbuild
RUN cd /home/hbuild && \

    wget http://my.server/ && \
    chmod a+x && \

This all works, but is decidedly sub-optimal. Leaving aside the fact that we're running both the application and the database inside a single container (changing that is a rather bigger architectural change than we're interested in right now), the Docker images end up being huge, and you're downloading half the universe each time. So to do this properly you would add an extra RUN step that did all the packaging and cleaned up after itself, so you have a base layer to build the application on.

What this does show, though, is that it's not that hard to take an existing deployment script and wrap it inside Docker - all it took here was a little fakery of the environment to more closely align with how the script was expecting to be run.

Monday, November 30, 2015

Zones behind zones

With Solaris 10 came a host of innovations - ZFS, DTrace, and zones were the big three, there was also SMF, FMA, NFS4, and a supporting cast of improvements across the board.

The next big followup was Crossbow, giving full network virtualization. It never made it back into Solaris 10, and it's largely gone under the radar.

Which is a shame, because it's present in illumos (as well as Solaris 11), and allows you to construct essentially arbitrary network configuration in software. Coupled with zones you can build a whole datacentre in a box.

Putting everything together is a bit tedious, however. One of the things I wanted with Tribblix is to enable people (myself being the primary customer) to easily take advantages of the technologies, to automate away all the tedious grunt work.

This is already true up to a point - zap can create and destroy zones with a single command. No more mucking around with the error-prone process of writing config files and long streams of commands, computers exist to do all this for us - leaving humans to worry about what you want to do, not how to remember the minutiae of the how.

So the next thing I wanted to do was to have a zone that can act as a router or proxy (I've not yet really settled on a name), so you have a hidden network with zones that can only be reached from your proxy zone. There are a couple of obvious uses cases:

  • You have a number of web applications in isolated zones, and run a reverse proxy like nginx or haproxy in your proxy zone to act as a customer-facing endpoint.
  • You have a multitier application, with just the customer-facing tier in the proxy zone, and the other tiers (such as your database) safely hidden away.

Of course, you could combine the two.

So the overall aim is that:

  • Given an appropriate flag and an argument that's a network description (ideally a.b.c.d/prefix in CIDR notation) the tool will automatically use Crossbow to create the appropriate plumbing, hook the proxy zone up to that network, and configure it appropriately
  • In the simplest case, the proxy zone will use NAT to forward packets from the zones behind it, and be the default gateway for those zones (but I don't want it to do any real network routing)
  • If you create a zone with an address on the hidden subnet, then again all the plumbing will get set up so that the zone is connected up to the appropriate device and has its network settings correctly configured

This will be done automatically, but it's worth walking through the steps manually.

As an example, I want to set up the network By convention, the proxy zone will be connected to it with the bottom address - in this case.

The first step is to create an etherstub:

dladm create-etherstub zstub0
And then a vnic over it that will be the interface to this new private network:

dladm create-vnic -l zstub0 znic0

Now, for the zone to be able to manage all the networking stuff it needs to have an exclusive-ip network stack. So you need to create another vnic for the public-facing side of the network, let's suppose you're going to use the e1000g0 interface:

dladm create-vnic -l e1000g0 pnic0

You create the zone with exclusive-ip and add the pnic0 and znic0 interfaces.

Within the zone, configure the address of znic0 to be

You need to set up IP forwarding on all the interfaces in the zone:

ipadm set-ifprop -p forwarding=on -m ipv4 znic0
ipadm set-ifprop -p forwarding=on -m ipv4 pnic0

The zone also needs to NAT the traffic coming in from the 10.2 network. The file /etc/ipf/ipnat.conf needs to contain:

map pnic0 -> 0/32 portmap tcp/udp auto
map pnic0 -> 0/32

and you need to enable ipfilter in the zone with the command svcadm enable network/ipfilter.

Then, if you create a zone with address, for example, you need to create a new vnic over the zstub0 etherstub:

dladm create-vnic -l zstub0 znic1

and allocate the znic1 interface to the zone. Then, in that new zone, set the address of znic1 to be and its default gateway to be

That's just about manageable. But in reality it gets far more complicated:

  • With multiple networks and zones, you have to dynamically allocate the etherstub and vnic names, they aren't fixed
  • You have to make sure to delete all the items you have created when you destroy a zone
  • You need to be able to find which etherstub is associated with a given network, so you attach a new zone to the correct etherstub
  • Ideally, you want all the hidden networks to be unique (you don't have to, but as the person writing this I can make it so to keep things simple for me)
  • You want to make sure you can't delete a proxy zone if there are zones on the network behind it
  • You want the zones to boot up with their networks fully and correctly configured (there's a lot going on here that I haven't even mentioned)
  • You may need to configure rather more of a firewall than the simple NAT configuration
  • In the case of a reverse proxy, you need a way to update the reverse proxy configuration automatically as zones come and go

Overall, there are a whole lot of hoops to jump through, and a lot of information to track and verify.

I'm about halfway through writing this at the moment, with most of the basic functionality present. I can, as the author, make a number of simplifying assumptions - I get to choose the naming convention, I can declare than the hidden networks must be unique, I can declare that I will only support simple prefixes (/8, /16, and /24) rather than arbitrary prefixes, and so on.

Thursday, November 26, 2015

Buggy basename

Every so often you marvel at the lengths people go to to break things.

Take the basename command in illumos, for example. This comes in two incarnations - /usr/bin/basename, and /usr/xpg4/bin/basename.

Try this:

# /usr/xpg4/bin/basename -- /a/b/c.txt

Which is correct, and:

# /usr/bin/basename -- /a/b/c.txt

Which isn't.

Wait, it gets worse:

# /usr/xpg4/bin/basename /a/b/c.txt .t.t

Correct. But:

# /usr/bin/basename /a/b/c.txt .t.t

Err, what?

Perusal of the source code reveals the answer to the "--" handling - it's only caught in XPG4 mode. Which is plain stupid, there's no good reason to deliberately restrict correct behaviour to XPG4.

Then the somewhat bizarre handling with the ".t.t" suffix. So it turns out that the default basename command is doing pattern matching rather then the expected string matching. So the "." will match any character, rather than being interpreted literally. Given how commonly "." is used to separate the filename from its suffix, and the common usage of basename to strip off the suffix, this is a guarantee for failure and confusion. For example:

# /usr/bin/basename /a/b/cdtxt .txt

The fact that there's a difference here  is actually documented in the man page, although not very well - it points you to expr(1) which doesn't tell you anything relevant.

So, does anybody rely on the buggy broken behaviour here?

It's worth noting that the ksh basename builtin and everybody else's basename implementation seems to do the right thing.

Fixing this would also get rid of a third of the lines of code and we could just ship 1 binary instead of 2.

Tuesday, November 24, 2015

Replacing SunSSH with OpenSSH in Tribblix

I recently did some work to replace the old SSH implementation used by Tribblix, which was the old SunSSH from illumos, with OpenSSH.

This was always on the list - our SunSSH implementation was decrepit and unmaintained, and there seemed little point in general in maintaining our own version.

The need to replace has become more urgent recently, as the mainstream SSH implementations have drifted to the point that we're no longer compatible - to the point that our implementation will not interoperate at all with that on modern Linux distributions with the default settings.

As I've been doing a bit of work with some of those modern Linux distributions, being unable to connect to them was a bit of a pain in the neck.

Other illumos distributions such as OmniOS and SmartOS have also recently been making the switch.

Then there was a proposal to work on the SunSSH implementation so that it was mediated - allowing you to install both SunSSH and OpenSSH and dynamically switch between them to ease the transition. Personally, I couldn't see the point - it seemed to me much easier to simply nuke SunSSH, especially as some distros had already made or were in the process of making the transition. But I digress.

If you look at OmniOS, SmartOS, or OpenIndiana, they have a number of patches. In some cases, a lot of patches to bring OpenSSH more in line with old SunSSH.

I studied these at some length, looked at them, and largely rejected them. There are a couple of reasons for this:

  • In Tribblix, I have a philosophy of making minimal modifications to upstream projects. I might apply patches to make software build, or when replacing older components so that I don't break binary compatibility, but in general what I ship is as close to what you would get if you did './configure --prefix=/usr; make ; make install' as I can make it.
  • Some of the fixes were for functionality that I don't use, probably won't use, and have no way of testing. So blindly applying patches and hoping that what I produce still works, and doesn't arbitrarily break something else, isn't appealing. Unfortunately all the gssapi stuff falls into this bracket.
One thing that might change this in the future, and something we've discussed a little, is to have something like Joyent's illumos-extra brought up to a state where it can be used as a common baseline across all illumos distributions. It's a bit too specific to SmartOS right now, so won't work for me out of the box, and it's a little unfortunate that I've just about reimplemented all the same things for Tribblix myself.

So what I ship is almost vanilla OpenSSH. The modifications I have made are fairly few:

It's split into the same packages (3 of them) along just about the same boundaries as before. This is so that you don't accidentally mix bits of SunSSH with the new OpenSSH build.

The server has
KexAlgorithms +diffie-hellman-group1-sha1
added to /etc/ssh/sshd_config to allow connections from older SunSSH clients.

The client has
PubkeyAcceptedKeyTypes +ssh-dss
added to /etc/ssh/ssh_config so that it will allow you to send DSA keys, for users who still have just DSA keys.

Now, I'm not 100% happy about the fact that I might have broken something that SunSSH might have done, but having a working SSH that will interoperate with all the machines I need to talk to outweighs any minor disadvantages.