Tuesday, April 15, 2014

Partial root zones

In Tribblix, I support sparse-root and whole-root zones, which work largely the same way as in Solaris 10.

The implementation of zone creation is rather different. The original Solaris implementation extended packaging - so the packaging system, and every package, had to be zone-aware. This is clearly unsustainable. (Unfortunately, the same mistake was made when IPS was introduced.)

Apart from creating work, this approach limits flexibility - in order to innovate with zones, for example by adding new types, you have to extend the packaging system, and then modify every package in existence.

The approach taken by Tribblix is rather different. Instead of baking zone architecture into packaging, packaging is kept dumb and the zone creation scripts understand how packages are put together.

In particular, the decision as to whether a given file is present in a zone (and how it ends up there) is not based on package attributes, but is a simple pathname filter. For example, files under /kernel never end up in a zone. Files under /usr might be copied (for a whole-root zone) or loopback mounted (for a sparse-root zone). If it's under /var or /etc, you get a fresh copy. And so on. But the decision is based on pathname.

It's not just the files within packages that get copied. The package metadata is also copied; the contents file is simply filtered by pathname - and that's how the list of files to copy is generated. This filtering takes place during zone creation, and is all done by the zone scripts - the packaging tools aren't invoked (one reason why it's so quick). The scripts, if you want to look, are at /usr/lib/brand/*/pkgcreatezone.

In the traditional model, the list of installed packages in the zone is (initially) identical to that in the global zone. For a sparse-root zone, you're pretty much stuck with that. For a whole-root zone, you can add and remove packages later.

I've been working on some alternative models for zones in Tribblix that add more flexibility to zone creation. These will appear in upcoming releases, but I wanted to talk about the technology.

The first of these is what you might call a partial-root zone. This is similar to a whole-root zone in the sense that you get an independent copy, rather than being loopback mounted. And, it's using the same TRIBwhole brand. The difference is that you can specify a subset of the overlays present in the global zone to be installed in the zone. For example, you would use the following install invocation:

zoneadm -z myzone install -o developer

and only the developer overlay (and the overlays it depends on) will be installed in the zone.

This is still a copy - the installed files in the global zone are the source of the files that end up in the zone, so there's still no package installation, no need for repository access, and it's pretty quick.

This is still a filter, but you're now filtering both on pathname and package name.

As for package metadata, for partial-root zones, references to the packages that don't end up being used are removed.

That's the subset variant. The next obvious extension is to be able to specify additional packages (or, preferably, overlays) to be installed at zone creation time. That does require an additional source of packages - either a repository or a local cache - which is why I treat it as a logically distinct operation.

Time to get coding.

Sunday, April 13, 2014

Cloud analogies: Food As A Service

There's a recurring analogy of Cloud as utility, such as electrical power. I'm not convinced by this, and regard a comparison of the Cloud with the restaurant trade as more interesting. Read on...

Few IT departments build their own hardware, in the same way that few people grow their own food or keep their own livestock. Most buy from a supplier, in the same way that most buy food from a supermarket.

You could avoid cooking by eating out for every meal. Food as a Service, in current IT parlance.

The Cloud shares other properties with a restaurant. It operates on demand. It's self service, in the sense that anyone can walk in and order - you don't have to be a chef. There's a fixed menu of dishes, and portion sizes are fixed. It deals with wide fluctuations of usage throughout the day. For basic dishes, it can be more expensive than cooking at home. It's elastic, and scales, whereas most people would struggle if 100 visitors suddenly dropped by for dinner.

There's a wide choice of restaurants. And a wide variety of pricing models to match - Prix Fixe, a la carte, all you can eat.

Based on this analogy, the current infatuation with moving everything to the cloud would be the same as telling everybody that they shouldn't cook at home, but should always order in or eat out. You no longer need a kitchen, white goods, or utensils, nor do you need to retain any culinary skills.

Sure, some people do eat primarily at a basic burger bar. Some eat out all the time. Some have abandoned the kitchen. Is it appropriate for everyone?

Many people go out to eat not necessarily to avoid preparing their own food, but to eat dishes they cannot prepare at home, to try something new, or for special occasions.

In other words, while you can eat out for every meal, Food as a Service really comes into its own when it delivers capabilities beyond that of your own kitchen. Whether that be in the expertise of its staff, the tools in its kitchens, or the special ingredients that it can source, a restaurant can take your tastebuds places that your own kitchen can't.

As for the lunacy that is Private Cloud, that's really like setting up your own industrial kitchen and hiring your own chefs to run it.

Wednesday, April 02, 2014

Slimming down logstash

Following on from my previous post on logstash, it rapidly becomes clear that the elasticsearch indices grow rather large.

After a very quick look, it was obvious that some of the fields I was keeping were redundant or unnecessary.

For example, why keep the pathname of the log file itself? It doesn't change over time, and you can work out the name of the file easily (if you ever wanted it, and I can't see why you ever would - if you wanted to identify a source, that ought to be some other piece of data you create).

Also, why keep the full log message? You've parsed it, broken it up, and stored the individual fields you're interested in. So why keep the whole thing, a duplicate of the information you're already storing?

With that in mind, I used a mutate clause to remove the file name and the original log entry, like so:

  mutate {
     remove_field => "path"
     remove_field => "message"

After this simple change, the daily elasticsearch indices on the first system I tried this on shrank from 4.5GB to 1.6GB - almost a factor of 3. Definitely worthwhile, and there are benefits in terms of network traffic, search performance, elasticsearch memory utilization, and capacity for future growth as well.

Saturday, February 08, 2014

Zone logs and logstash

Today I was playing with logstash, with the plan to produce a real-time scrolling view of our web traffic.

It's easy enough. Run a logstash shipper on each node, feed everything into redis, get logstash to pull from redis into elasticsearch, then run the logstash front-end and use Kibana to create a dashboard.

Then the desire for efficiency strikes. We're running Solaris zones, and there are a lot of them. Each logstash instance takes a fair chunk of memory, so it seems like a waste to run one in each zone.

So what I wanted to do was run a single copy of logstash in the global zone, and get it to read all the zone logs, yet present the data just as though it had been run in the zone.

The first step was to define which logs to read. The file input can take wildcards, leading to a simple pattern:

input {
  file {
    type => "apache"
    path => "/storage/*/opt/proquest/*/apache/logs/access_log"

There's a ZFS pool storage, each zone has a zfs file system named after the zone. So the name of the zone is the directory under /storage. So I can pick out the name of the zone and put it into a variable called zonename like so:

  grok {
    type => "apache"
    match => ["path","/storage/%{USERNAME:zonename}/%{GREEDYDATA}"]

(If it looks odd to use the USERNAME pattern, the naming rules for our zones happen to be the same as for user names, so I use an existing pattern rather than define a new one.)

I then want the host entry associated with this log to be that of the zone, rather than the default of the global zone. So I mutate the host entry:

  mutate {
    type => "apache"
    replace => [ "host","%{zonename}.our.company.name" ]

And that's pretty much it. It's very simple, but most of the documentation I could find was incorrect in the sense that it applied to old versions of logstash.

There were a couple of extra pieces of information that I then found it useful to add. The simplest was to duplicate the original host entry into a servername, so I can aggregate all the traffic associated with a physical host. The second was to pick out the website name from the zone name (in this case, the zone name is the short name of the website, with a suffix appended to distinguish the individual zones).

  grok {
    type => "apache"
    match => ["zonename","%{WORD:sitename}-%{GREEDYDATA}"]

Then sitename contains the short name of the site, again allowing me to aggregate the statistics from all the zones that serve that site.

Friday, November 29, 2013

Tribblix - making PXE boot work

One of the key changes in the latest milestone of Tribblix is the ability to bot and install a system over the network, using PXE. I've covered how to set this up elsewhere, but here I'll talk a little about how this is implemented under the covers.

Essentially, the ISO image has 3 pieces.
  1. The platform directory contains the kernel, and the boot archive. This is what's loaded at boot.
  2. The file solaris.zlib is a lofi compressed file containing an image of the /usr filesystem.
  3. The pkgs directory contains additional SVR4 packages that can be installed.
When booting from CD, grub loads the boot archive into memory and hands over control. There's then a little bit of magic where it tries to mount every device it can find looking for the CD image - it actually checks that the .volsetid file found on a device matches the one in the boot archive to ensure it gets the right device, but once that's done it mounts the CD image in a standard place and from then on knows precisely where to find everything.

When you boot via PXE, you can't blindly search everywhere in the network for the location of solaris.zlib, so the required location is set as a boot argument in menu.lst, and the system extracts the required value from the boot arguments.

What it will get back is a URL of a server, so it appends solaris.zlib to that and retrieves it using wget. The file is saved to a known location and then lofi mounted. Then boot proceeds as normal.

Note that you can use any dhcp/tftp server  for the PXE part, and any http server. There's no requirement on the server side for a given platform, configuration, or software. (And it doesn't even have to be http, as long as it's a protocol built into wget.)

It's actually very simple. There are, of course, a few wrinkles along the way.
  • There are some files in /usr that are needed to mount /usr, so the boot archive contains a minimally populated copy of /usr that allows you to bootstrap the system until you mount the real /usr over the top of it
  • For PXE boot, you need more such files in the boot archive than you do for booting from CD. In particular, I had to add prtconf (used in parsing boot arguments) and wget (to do the retrieve over http)
  • I add wget rather than curl, as the wget package is much smaller than the curl package, even though I had previously standardised on curl for package installation
  • Memory requirements are a little higher than for a CD boot, as the whole of solaris.zlib is copied into memory. However, because it's in memory, the system is really fast
It's possible to avoid this by simply putting all of /usr into the boot archive. The downside to that is that it's quite slow to load - you haven't got a fully-fledged OS running at that point, tftp isn't as reliable as it should be and can fail when retrieving larger files, and it pushes up the hard memory requirement for a CD boot. So I've stuck with what I have, and it works reasonably well.

The final piece of the ISO image is the additional packages. If you tell the system nothing, it will go off to the main repositories to download any packages. (Please, don't do this. I'm not really set up to deliver that much traffic.) But you can copy the pkgs directory from the iso image and specify that location as a boot argument so the installer knows where the packages are. What it actually does underneath is set that location up as the primary repository temporarily during the install.

The present release doesn't have automation - booting via PXE is just like booting from CD, and you have to run the install interactively. But all the machinery is now in place to build a fully automated install mechanism (think like jumpstart, although it'll achieve the same goals via completely different means).

One final note. Unlike the OpenSolaris/Solaris 11/OpenIndiana releases which have separate images for server, desktop, and network install, Tribblix has a single image that does all 3 in one. The ability to define installed packages eliminates the need for separate desktop (live) and server (text) images, and the PXE implementation described here means you can take the regular iso and use that for network booting.

Tribblix - getting boot arguments

This explains how I handled boot arguments for Tribblix, but it's generally true for all illumos and similar distributions. This is necessary for things like PXE boot and network installation, where you need to be able to tell the system critical information without baking it into the source.

And this particular mechanism described here is for x86 only. It's unfortunate that the boot mechanism is architecture specific.

Anyway, back to boot arguments. Using grub, you use the menu.lst file to determine how the system boots. In particular, the kernel$ line specifies which kernel to boot, and you can pass boot arguments. For example, it might say

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B console=ttya

and, in this case, what comes after -B is the boot arguments. This is a list of key=value pairs, comma separated.

Another example, from my implementation of PXE boot,might be:

-B install_pkgs=

So that's how they're defined, and you can really define anything you like. It's up to the system to interpret them as it sees fit.

When the system boots, how do you access these parameters? They're present in the system configuration as displayed by prtconf. In particular

prtconf -v /devices

gets you the information you want - containing a bunch of standard information and the boot arguments. Try this on a running system, and you'll see things like what program actually got booted:

        name='bootprog' type=string items=1

So, all you have to do to find the value of a boot argument is look through the prtconf output for the name of the boot argument you're after, and then pick the value off the next line. Going back to my example earlier, we just look for install_pkgs and get the value. This little snippet does the job:

PKGMEDIA=`/usr/sbin/prtconf -v /devices | \
    /usr/bin/sed -n '/install_pkgs/{;n;p;}' | \
    /usr/bin/cut -f 2 -d \'`

(Breaking this down, sed -n outputs nothing by default, looks for the pattern in /install_pkgs/, then the {;n;p;} skips to the next line and prints it, then cut grabs the second word, split by the quote. Ugly as heck.)

At this point, you can test whether the argument was defined, and use it in your scripts.

Friday, June 14, 2013

Do we hate our users?

As part of my job, I get to deal with all sorts of oddball systems and setups. Whether this is something we've inherited through acquisition, trying to resurrect or repair some antique legacy system, or needing to make some strange application nobody has ever heard of, it tends to veer in my direction.

As a result, I've had the misfortune to use and fix a wide variety of systems and applications, obviously all built by someone else.

Based on this, I can only come to one conclusion: most Unix Adminstrators hate their users, and do everything they can to make their lives miserable.

That's a pretty grim statement, and I'm hoping that most of the people reading here won't fall into that category. But here's just one example today:

I have to migrate an application, so was given a login to the system so I could check it out. What interactive shell do I get? They've given me, and most users by the looks of it, /bin/sh, on a Solaris 8 box.

Sheesh. I've been using an interactive shell that supports command line recall and editing, not to mention completion and spell-checking, since the 1980s. There is absolutely no excuse in the 21st century not to give users a decent shell. If it's not deliberate hatred of your users, then it's either laziness or incompetence.

It goes beyond that, of course. There's no excuse not to provide users with a properly configured environment, install the tools they need to do their job, and provide enough disk space to store their data. (OK. Here's another example: how many storage shops still allocate itty-bitty storage measured in gigabytes?) Yet I see too many systems set up in such a way that it's completely painful to use.

Worse, users (and developers) assume that the systems are intrinsically rubbish and the IT department incompetent. OK, the second part might be true. But that's one reason they go off and try to provide resources for themselves.

As I said earlier, I'm preaching to the converted, right?

Tuesday, May 28, 2013

The disappearance of packaging

One key differentiator between different Linux distributions has been the packaging system used. The same is happening in the world of Illumos distributions, some use IPS, some debian packaging, SmartOS uses pkgsrc, Tribblix sticks true to the retro feel of Solaris by using SVR4.

Overall, there's been a huge amount of effort expended on packaging. Consider the replacement of SVR4 packaging with IPS - a huge multi-year multi-person effort, that required almost the whole of Solaris to be retooled to fit. And yet, this is all wasted effort.

When choosing a packaging system for Tribblix I deliberately chose SVR4 for 3 reasons: it was compatible with what had gone before, it was something I was familiar with, and it was reasonably lightweight and simple. If I had come from a Linux background, I may well have just gone with rpm or dpkg. The key is simplicity and minimal footprint.

What of packaging in the future? I see it largely disappearing. You can see this in the consumerization of applications: it's the App Store, not a package repository. Package management is conspicuous by its absence in the modern world of IT. Looking at where Ubuntu are heading, you can see the same thing. That's not the only initiative - look at AppStream for another example.

And that brings me back to using SVR4 in Tribblix - it's about the lightest weight packaging option I have available. And frankly, it's still far too bloated and complex. But it's merely an implementation detail that is largely invisible, and I want it to become less visible than it is at present.

The point here is that packages aren't relevant to users. Applications are. Which is why the notion of overlays is central to Tribblix - at their simplest, overlays are simply collections of packages (I could have used the term cluster, but that already has meaning to the Solaris installer, although it was never exposed to administrators later which was a terrible design), but the idea is that you manage software at the level of abstraction of an overlay, rather than at a package level.

Even as a unit of delivery, packages aren't that useful - they normally arise as build artifacts, which don't necessarily map well to user needs. And that's another thing - what constitutes a useful component of a package isn't fixed, but is very much context dependent. Worse, the possible contexts in which a package can be used isn't known ahead of time, so the packager cannot enumerate all the possible uses of the software they're packaging. And an individual package is almost never useful in isolation - most working applications are the leaf nodes of a large complex tree. Dependency management is another game where, if you play, you lose. Rather than tightly-coupled systems with strong dependency management, I'm looking for loosely coupled largely self contained units of delivery. If necessary, application bundles manage their own dependencies rather than relying on the system to do so.

Despite the title, it's not that packaging will disappear, but it will (I hope) become largely invisible.