Monday, November 30, 2015

Zones behind zones

With Solaris 10 came a host of innovations - ZFS, DTrace, and zones were the big three, but there were also SMF, FMA, NFSv4, and a supporting cast of improvements across the board.

The next big follow-up was Crossbow, which brought full network virtualization. It never made it back into Solaris 10, and it has largely flown under the radar.

Which is a shame, because it's present in illumos (as well as Solaris 11), and allows you to construct essentially arbitrary network configurations in software. Coupled with zones, you can build a whole datacentre in a box.

Putting everything together is a bit tedious, however. One of the things I wanted with Tribblix is to enable people (myself being the primary customer) to easily take advantage of these technologies, automating away all the tedious grunt work.

This is already true up to a point - zap can create and destroy zones with a single command. No more mucking around with the error-prone process of writing config files and long streams of commands; computers exist to do all this for us, leaving humans to worry about what they want to do rather than the minutiae of how to do it.
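For example, creating and then destroying a zone in Tribblix is something like the following - I'm quoting the flags from memory, so treat the exact invocation as illustrative:

zap create-zone -z zone1 -t whole -i 192.168.0.101
zap destroy-zone -z zone1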

So the next thing I wanted to do was to have a zone that can act as a router or proxy (I've not yet really settled on a name), so you have a hidden network with zones that can only be reached from your proxy zone. There are a couple of obvious use cases:

  • You have a number of web applications in isolated zones, and run a reverse proxy like nginx or haproxy in your proxy zone to act as a customer-facing endpoint.
  • You have a multitier application, with just the customer-facing tier in the proxy zone, and the other tiers (such as your database) safely hidden away.

Of course, you could combine the two.

So the overall aim is that:

  • Given an appropriate flag and an argument that's a network description (ideally a.b.c.d/prefix in CIDR notation), the tool will automatically use Crossbow to create the appropriate plumbing, hook the proxy zone up to that network, and configure it appropriately - a possible invocation is sketched after this list
  • In the simplest case, the proxy zone will use NAT to forward packets from the zones behind it, and be the default gateway for those zones (but I don't want it to do any real network routing)
  • If you create a zone with an address on the hidden subnet, then again all the plumbing will get set up so that the zone is connected up to the appropriate device and has its network settings correctly configured
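To give a flavour of where I want to end up, the invocation might build on the zap syntax above with one extra argument - the -R flag here is entirely hypothetical, as I haven't settled on the interface yet:

zap create-zone -z proxy0 -t whole -i 192.168.0.100 -R 10.2.0.0/16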

This will be done automatically, but it's worth walking through the steps manually.

As an example, I want to set up the network 10.2.0.0/16. By convention, the proxy zone will be connected to it with the bottom address - 10.2.0.1 in this case.

The first step is to create an etherstub:

dladm create-etherstub zstub0

And then a vnic over it that will be the interface to this new private network:

dladm create-vnic -l zstub0 znic0

Now, for the zone to be able to manage all the networking stuff it needs to have an exclusive-ip network stack. So you need to create another vnic for the public-facing side of the network, let's suppose you're going to use the e1000g0 interface:

dladm create-vnic -l e1000g0 pnic0

You create the zone with exclusive-ip and add the pnic0 and znic0 interfaces.

Within the zone, configure the address of znic0 to be 10.2.0.1/16.
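By hand, the zone side of that looks something like the following - the zone name proxy0 is just for illustration, and I've omitted the rest of the zone definition:

zonecfg -z proxy0
set ip-type=exclusive
add net
set physical=pnic0
end
add net
set physical=znic0
end

and then, inside the running zone, the address is configured with ipadm:

ipadm create-if znic0
ipadm create-addr -T static -a 10.2.0.1/16 znic0/v4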

You need to set up IP forwarding on all the interfaces in the zone:

ipadm set-ifprop -p forwarding=on -m ipv4 znic0
ipadm set-ifprop -p forwarding=on -m ipv4 pnic0

The zone also needs to NAT the traffic coming in from the 10.2 network. The file /etc/ipf/ipnat.conf needs to contain:

map pnic0 10.2.0.0/16 -> 0/32 portmap tcp/udp auto
map pnic0 10.2.0.0/16 -> 0/32

and you need to enable ipfilter in the zone with the command svcadm enable network/ipfilter.

Then, if you create a zone with address 10.2.0.2, for example, you need to create a new vnic over the zstub0 etherstub:

dladm create-vnic -l zstub0 znic1

and allocate the znic1 interface to the zone. Then, in that new zone, set the address of znic1 to be 10.2.0.2 and its default gateway to be 10.2.0.1.
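Inside the new zone, that boils down to a couple of commands (using a persistent route here, though you could equally set the gateway some other way):

ipadm create-if znic1
ipadm create-addr -T static -a 10.2.0.2/16 znic1/v4
route -p add default 10.2.0.1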

That's just about manageable. But in reality it gets far more complicated:

  • With multiple networks and zones, you have to dynamically allocate the etherstub and vnic names; they aren't fixed
  • You have to make sure to delete all the items you have created when you destroy a zone
  • You need to be able to find which etherstub is associated with a given network, so you can attach a new zone to the correct etherstub
  • Ideally, you want all the hidden networks to be unique (you don't have to, but as the person writing this I can make it so to keep things simple for me)
  • You want to make sure you can't delete a proxy zone if there are zones on the network behind it
  • You want the zones to boot up with their networks fully and correctly configured (there's a lot going on here that I haven't even mentioned)
  • You may need to configure rather more of a firewall than the simple NAT configuration
  • In the case of a reverse proxy, you need a way to update the reverse proxy configuration automatically as zones come and go

Overall, there are a whole lot of hoops to jump through, and a lot of information to track and verify.

I'm about halfway through writing this at the moment, with most of the basic functionality present. I can, as the author, make a number of simplifying assumptions - I get to choose the naming convention, I can declare that the hidden networks must be unique, I can declare that I will only support simple prefixes (/8, /16, and /24) rather than arbitrary prefixes, and so on.

Thursday, November 26, 2015

Buggy basename

Every so often you marvel at the lengths people go to to break things.

Take the basename command in illumos, for example. This comes in two incarnations - /usr/bin/basename, and /usr/xpg4/bin/basename.

Try this:

# /usr/xpg4/bin/basename -- /a/b/c.txt
c.txt

Which is correct, and:

# /usr/bin/basename -- /a/b/c.txt
--

Which isn't.

Wait, it gets worse:

# /usr/xpg4/bin/basename /a/b/c.txt .t.t
c.txt

Correct. But:

# /usr/bin/basename /a/b/c.txt .t.t
c

Err, what?

Perusal of the source code reveals the answer to the "--" handling - it's only caught in XPG4 mode. Which is plain stupid; there's no good reason to deliberately restrict correct behaviour to XPG4.

Then there's the somewhat bizarre handling of the ".t.t" suffix. It turns out that the default basename command is doing pattern matching rather than the expected string matching, so the "." will match any character rather than being interpreted literally. Given how commonly "." is used to separate a filename from its suffix, and how commonly basename is used to strip off that suffix, this is a recipe for failure and confusion. For example:

# /usr/bin/basename /a/b/cdtxt .txt
c
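For comparison, the XPG4 version treats the suffix as a literal string, and since cdtxt doesn't actually end in .txt it should leave the name alone:

# /usr/xpg4/bin/basename /a/b/cdtxt .txt
cdtxt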

The fact that there's a difference here is actually documented in the man page, although not very well - it points you to expr(1), which doesn't tell you anything relevant.

So, does anybody rely on the buggy broken behaviour here?

It's worth noting that the ksh basename builtin and everybody else's basename implementation seems to do the right thing.

Fixing this would also get rid of a third of the lines of code and we could just ship 1 binary instead of 2.

Tuesday, November 24, 2015

Replacing SunSSH with OpenSSH in Tribblix

I recently did some work to replace the old SSH implementation used by Tribblix, which was the old SunSSH from illumos, with OpenSSH.

This was always on the list - our SunSSH implementation was decrepit and unmaintained, and there seemed little point in general in maintaining our own version.

The need to replace it has become more urgent recently, as the mainstream SSH implementations have drifted to the point that we're no longer compatible - our implementation will not interoperate at all with those on modern Linux distributions using their default settings.

As I've been doing a bit of work with some of those modern Linux distributions, being unable to connect to them was a bit of a pain in the neck.

Other illumos distributions such as OmniOS and SmartOS have also recently been making the switch.

Then there was a proposal to work on the SunSSH implementation so that it was mediated - allowing you to install both SunSSH and OpenSSH and dynamically switch between them to ease the transition. Personally, I couldn't see the point - it seemed to me much easier to simply nuke SunSSH, especially as some distros had already made or were in the process of making the transition. But I digress.

If you look at OmniOS, SmartOS, or OpenIndiana, they carry a number of patches - in some cases a lot of patches - to bring OpenSSH more into line with the old SunSSH.

I studied these at some length, and largely rejected them. There are a couple of reasons for this:

  • In Tribblix, I have a philosophy of making minimal modifications to upstream projects. I might apply patches to make software build, or when replacing older components so that I don't break binary compatibility, but in general what I ship is as close to what you would get if you did './configure --prefix=/usr; make ; make install' as I can make it.
  • Some of the fixes were for functionality that I don't use, probably won't use, and have no way of testing. So blindly applying patches and hoping that what I produce still works, and doesn't arbitrarily break something else, isn't appealing. Unfortunately all the gssapi stuff falls into this bracket.

One thing that might change this in the future, and something we've discussed a little, is to have something like Joyent's illumos-extra brought up to a state where it can be used as a common baseline across all illumos distributions. It's a bit too specific to SmartOS right now, so won't work for me out of the box, and it's a little unfortunate that I've just about reimplemented all the same things for Tribblix myself.

So what I ship is almost vanilla OpenSSH. The modifications I have made are fairly few:

It's split into the same packages (3 of them) along just about the same boundaries as before. This is so that you don't accidentally mix bits of SunSSH with the new OpenSSH build.

The server has
KexAlgorithms +diffie-hellman-group1-sha1
added to /etc/ssh/sshd_config to allow connections from older SunSSH clients.

The client has
PubkeyAcceptedKeyTypes +ssh-dss
added to /etc/ssh/ssh_config so that it will allow you to send DSA keys, for users who still have just DSA keys.
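As an aside, if you only need this for the odd connection rather than globally, OpenSSH lets you pass the same option on the command line, along the lines of (the host here is just a placeholder):

ssh -o PubkeyAcceptedKeyTypes=+ssh-dss user@oldhost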

Now, I'm not 100% happy about the fact that I might have broken something that SunSSH might have done, but having a working SSH that will interoperate with all the machines I need to talk to outweighs any minor disadvantages.

Sunday, November 22, 2015

On Keeping Your Stuff to Yourself

One of the fundamental principles of OmniOS - and indeed probably its defining characteristic - is KYSTY, or Keep Your Stuff* To Yourself.

(*um, whatever.)

This isn't anything new. I've expressed similar opinions in the past. To reiterate - any software that is critical for the success of your business/project/infrastructure/whatever should be directly under your control, rather than being completely at the whim of some external entity (in this case, your OS supplier).

We can flesh this out a bit. The software on a system will fall, generally, into 3 categories:

  1. The operating system, the stuff required for the system to boot and run reliably
  2. Your application, and its dependencies
  3. General utilities

As an aside, there are more modern takes on the above problem: with Docker, you bundle the operating system with your application; with unikernels you just link whatever you need from classes 1 and 2 into your application. Problem solved - or swept under the carpet, rather.

Looking at the above, OmniOS will only ship software in class 1, leaving the rest to the end user. SmartOS is a bit of a hybrid - it likes to hide everything in class 1 from you and relies on pkgsrc to supply classes 2 and 3, and the bits of class 1 that you might need.

Most (of the major) Linux distributions ship classes 1, 2, and 3, often in some crazily interdependent mess that you have to spend ages unpicking. The problem is that you need to work extra hard to ensure your own build doesn't accidentally acquire a dependency on some system component (or that your build somehow reads a system configuration file).

Generally missing from discussions is that class 3 - the general utilities. Stuff that you could really do with an instance of to make your life easier, but where you don't really care about the specifics of.

For example, it helps to have a copy of the GNU userland around. Way too much source out there needs GNU tar to unpack, or GNU make to build, or assumes various things about the userland that are only true of the GNU tools. (Sometimes the GNU tools aren't just a randomly incompatible implementation; they occasionally have capabilities that are missing from the standard tools - like in-place editing in gsed.)
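That last one is the sort of thing you reach for in quick fixes all the time, something like (the file name is just an example):

gsed -i 's/old/new/g' somefile.conf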

Or a reasonably complete suite of compression utilities. More accurately, uncompression, so that you have a pretty good chance of being able to unpack some arbitrary format that people have decided to use.

Then there are generic runtimes. There's an awful lot of python or perl out there, and sometimes the most convenient way to get a job done is to put together a small script or even a one-liner. So while you don't really care about the precise details, having copies of the appropriate runtimes (and you might add java, erlang, ruby, node, or whatever to that list) really helps for the occasions when you just want to put together a quick throwaway component. Again, if your business-critical application stack requires that runtime, you maintain your own, with whatever modules you need.

There might also be a need for basic graphics. You might not want or need a desktop, but something is linked against X11 anyway. (For example, java was mistakenly linked against X11 for font handling, even in headless mode - a bug recently fixed.) Even if it's not X11, applications might use common code such as cairo or pango for drawing. Or they might need to read or write image formats for web display.

So the chances are that you might pull in a very large code surface, just for convenience. Certainly I've spent a lot of time building 3rd-party libraries and applications on OmniOS that were included as standard pretty much everywhere else.

In Tribblix, I've attempted to build and package software cognizant of the above limitations. So I supply as wide a range of software in class 3 as I can - this is driven by my own needs and interests, as a rule, but over time it's increasingly complete. I do supply application stacks, but these are built to live in a separate location, and are kept at arm's length from the rest of the system. This is then integrated with zones in a standardized zone architecture, in a way that can be managed by zap. My intention here is not necessarily to supply the building blocks that can be used by users, but to provide the whole application, fully configured and ready to go.

Sunday, November 15, 2015

The Phallacy of Web Scale

A couple of times recently I've had interviewers ask me to quickly sketch out the design for a web scale architecture. Of course, being a scientist by training, the first thing I did was to work out what sort of system requirements we're looking at, to see what sort of scale we might really need.

In both cases even my initial extreme estimate, just rounding everything up, didn't indicate a scaling problem. Most sites aren't Facebook or Google; they see limited use by a fraction of the population. The point here is that while web scale sites exist, they are the exception rather than the rule - so why does everyone think they have to go to the complexity and expense of architecting a "web scale" solution?

To put this into perspective, assume you want to track the viewing habits of everyone in the UK. If everyone watches 10 programmes per day and channel-hops 3 times for each programme, and there are 30 million viewers, then that's about 1 billion data points per day, or roughly 10,000 per second. Each is small, so at 100 bytes each that's 100GB/day, or ~10 megabits/s. So you're still talking a single server. You can hold a week's data in RAM, a year on disk.
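Spelling the arithmetic out, with generous rounding:

30 million viewers x 10 programmes x 3 hops = ~1 billion events/day
1 billion / 86,400 seconds = ~10,000 events/second
1 billion x 100 bytes = 100GB/day
10,000/second x 100 bytes x 8 bits = ~10 megabits/s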

And most businesses don't need anything like that level of traffic.

Part of the problem is that most implementations are horrifically inefficient. The site itself may be inefficient - you know the ones that have hundreds of assets on each page, multi-megabyte page weight, widgets all over, that take forever to load and look awful - if their customers bother waiting long enough. The software implementation behind the site is almost certainly inefficient (and is probably trying to do lots of stupid things it shouldn't as well).

Another trend fueling this is the "army of squirrels" approach to architecture. Rather than design an architecture that is correctly sized to do what you need, it seems all too common to simply splatter everything across a lot of little boxes. (Perhaps because the application is so badly designed it has cripplingly limited scalability so you need to run many instances.) Of course, all you've done here is simply multiplied your scaling problem, not solved it.

As an example, see the article "Scalability! But at what COST?". I especially like the quote that "Big data systems may scale well, but this can often be just because they introduce a lot of overhead".

Don't underestimate the psychological driver, either. A lot of people want to be seen as operating at "web scale" or with "Big Data", either to make themselves or their company look good, to pad their own CV, or to appeal to unsophisticated potential employees.

There are problems that truly require web scale, but for the rest it's an ego trip combined with inefficient applications on badly designed architectures.

Thursday, November 12, 2015

On the early web

I was browsing around, as one does, when I came across a list of early websites. Or should I say, a seriously incomplete list of web servers of the time.

This was November 1992, and I had been running a web server at the Institute of Astronomy in Cambridge for some time. That wasn't the only technology we were using at the time, of course - there was Gopher, the emergent Hyper-G, WAIS, ftp, fsp, USENET, and a few others that never made the cut.

Going back a bit further in time, about a year earlier, is an email regarding internet connectivity in Cambridge. I vaguely remember this - I had just arrived at the IoA at the time and was probably making rather a nuisance of myself, having come back from Canada where the internet was already a thing.

I can't remember exactly when we started playing with the web proper, but it would have been some time about Easter 1992. As the above email indicates, 1991 saw the department having the staggering bandwidth of 64k/s, and I think it would have taken the promised network upgrade for us to start advertising our site.

Graphical browsers came quite late - people might think of Mosaic (which you can still run if you like), but to start with we just had the CERN line mode browser, and things like Viola. Around this time there were other graphical browsers - there was one in the Andrew system, as I recall, and chimera was aimed at the lightweight end of the scale.

Initially we ran the CERN web server, but it was awful - it burnt seconds of cpu time to deliver every page, and as soon as the NCSA server came out we switched to that, and the old Sun 630MP that hosted all this was much the happier for it. (That was the machine called cast0 in the above email - the name there got burnt into the site URL, it took a while for people to get used to the idea of adding functional aliases to DNS.)

With the range of new tools becoming available, it wasn't entirely obvious which technologies would survive and prosper.

With my academic background I was initially very much against the completely unstructured web, preferring properly structured and indexed technologies. In fact, one of the things I remember saying at the time, as the number of sites started to grow, was "How on earth are people going to be able to find stuff?". Hm. Missed business opportunity there!

Although I have to say that even with search engines, actually finding stuff on the web now is a total lottery - Google have made a lot of money along the way, though. One thing I miss, again from the early days of the web (although we're talking later in the 90s now) is the presence of properly curated and well maintained indices of web content.

Another concern I had about the web was that, basically, any idiot could create a web page, leading to most of the content being complete and utter garbage (both in terms of what it contained and how it was coded). I think I got the results of that one dead right, but that prediction failed to account for the huge growth that the democratization of the web allowed.

After a couple of years the web was starting to emerge as a clear front runner. OK, there were only a couple of thousand sites in total at this point (I think that up to the first thousand or so I had visited every single one), and the concept was only just starting to become known to the wider public.

One of the last things I did at the IoA, when I left in May 1994, was to set up all the computers in the building running Mosaic, with it looping through all the pages on a local website showcasing some glorious astronomical images, all for the departmental open day. This was probably the first time many of the visitors had come across the web, and the Wow factor was incredible.