Wednesday, May 21, 2014

Building illumos-gate on Tribblix

Until recently, I've used an OpenIndiana system to build the illumos packages that go into Tribblix. Clearly this is less than ideal - it would be nice to be able to build all of Tribblix on Tribblix.

This has always been a temporary expedient. So, here's how to build illumos-gate on Tribblix.

(Being able to do so is also good in that it increases the number of platforms on which a vanilla illumos-gate can be built.)

First, download and install Tribblix (version 0m10 or later). I recommend installing the kitchen-sink.

Then, if you're running 0m10, apply some necessary updates. As root:

    zap refresh-overlays
    zap refresh-catalog
    zap update-overlay develop
    zap uninstall TRIBdev-object-file
    zap install TRIBdev-object-file


This won't be necessary in future releases, but I found some packaging issues which interfered with the illumos build (even though they don't bother other software); the updated packages also add some symlinks so that various utilities are where illumos-gate expects to find them.

I run the build in a zone. It requires a non-standard environment, and using a zone means that I don't have to corrupt the global zone, and I can repeatably guarantee that I get a correct build environment.

Then, install a build zone. This will be a whole-root zone into which we copy the develop overlay from the global zone, adding the illumos-build overlay as well. (It will download the packages for the illumos-build overlay the first time you do this, but it caches them, so if you repeat this later - and I tend to create build zones more or less at will - it won't have to.) You need to specify a zone name and give it an IP address.

    zap create-zone -t whole \
        -z il-build -i 172.18.1.206 \
        -o develop -O illumos-build

This will automatically boot the zone; you just have to wait until SMF has finished initialising.
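
If you want to check that it has settled, you can ask SMF from the global zone (purely a convenience, not a required step):

    zlogin il-build svcs -xv
    zlogin il-build svcs milestone/multi-user

When svcs -xv reports nothing amiss and the multi-user milestone is online, the zone is ready.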

Configure the zone so it can resolve names from DNS:

    cp /etc/resolv.conf /export/zones/il-build/root/etc/
    cp /etc/nsswitch.dns /export/zones/il-build/root/etc/nsswitch.conf


Go into the zone:

    zlogin il-build

In the zone, create a user to do the build, and apply a couple of hacky fixes:

    rm /usr/bin/cpp
    cd /usr/bin ; ln -s ../gnu/bin/xgettext gxgettext


(The first works around a bug in my gcc package; the second works around a Makefile bug.)
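
Creating the build user is just ordinary user administration; as a sketch, where the username, home directory and shell are arbitrary choices:

    # the username, home directory and shell are arbitrary choices
    useradd -m -d /export/home/build -s /bin/bash build
    passwd build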

If you want to build with SMB printing:

    zap install TRIBcups

Now, working as that user, the build largely follows the normal instructions: use git to clone illumos-gate, unpack the closed bins, copy illumos.sh and nightly.sh into the workspace, and edit illumos.sh to customize the build.
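
As a rough sketch of those steps (the closed-bins URLs here are the ones given in the standard How To Build illumos instructions of the time):

    git clone git://github.com/illumos/illumos-gate.git
    cd illumos-gate
    wget -c \
      https://download.joyent.com/pub/build/illumos/on-closed-bins.i386.tar.bz2 \
      https://download.joyent.com/pub/build/illumos/on-closed-bins-nd.i386.tar.bz2
    tar xjpf on-closed-bins.i386.tar.bz2
    tar xjpf on-closed-bins-nd.i386.tar.bz2
    cp usr/src/tools/scripts/nightly.sh .
    chmod +x nightly.sh
    cp usr/src/tools/env/illumos.sh .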

There are a few things you need to do to get a successful build. The first is to add the following to illumos.sh:

export SUPPRESSPKGDEP=true

This is necessary because the IPS dependency step uses the installed image, and as Tribblix uses SVR4 packaging, there isn't one. You can still create the IPS repo (and I do, as that's what I then turn into SVR4 packages), but the dependency step needs to be suppressed.

If you want to build with CUPS, then you'll need to have installed cups, and you'll need to patch smb. Alternatively, avoid pulling in CUPS by adding this to illumos.sh:

export ENABLE_SMB_PRINTING='#'

Tribblix ships a newer glib, whose API has changed slightly, and hal still uses the old API, so the hal build fails. There is a proper fix, but you can simply:

gsed -i '/g_type_init/d' usr/src/cmd/hal/hald/hald.c

Note that this means that you won't be able to run the hal components on a system with a downrev glib.

Then you should be able to run a build:

time ./nightly.sh illumos.sh

The build should be clean (I see ELF runtime attribute warnings, all coming from glib and ffi, but those don't actually matter, and I'm not sure illumos should be complaining about errors in its external dependencies anyway).
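
To see how a run went, nightly leaves a summary and a full log in a timestamped directory under the workspace - from the top of the workspace, something like:

    # each nightly run leaves a timestamped directory under log/
    ls log/
    less log/log.*/mail_msg      # the summary that would be mailed out
    less log/log.*/nightly.log   # the full build log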

Friday, May 16, 2014

Software verification of SVR4 packages with pkgchk

On Solaris (and Tribblix) you can use the pkgchk command to verify that the contents of a software package are correctly installed.

The simplest invocation is to give pkgchk the name of a package:

pkgchk SUNWcsl

I would expect SUNWcsl to validate cleanly, whereas something like SUNWcsr will tend to produce lots of output, as it contains many configuration files that get modified. (Use the -n flag to suppress most of that noise.)

If you want to check individual files, then you can use

pkgchk -p /usr/bin/ls

or (and I implemented this as part of the OpenSolaris project) you can feed a list of files on stdin:

find /usr/bin -mtime -150 | pkgchk -i -

However, it turns out that there's a snag with the basic usage of pkgchk to analyse a package, in that it will trust the contents file - both for the list of files in the package, and for their attributes.

Modifying the list of files can be a result of using installf and removef. For example, I delete some of the junk out of /usr/ucb (such as /usr/ucb/cc so as to be sure no poor unfortunate user can ever run it), and use removef to clean up the contents file. A side-effect of this is that pkgchk won't normally be able to detect that those files are missing.

Modifying file attributes can be the result of a second package installing the same pathname with different attributes. Having multiple packages deliver a directory is common, but you can also have multiple packages own a file. Whichever package was installed last gets to choose which attributes are correct, and a normal pkgchk run is blind to any changes as a result.

There's a trick to get round this. From Solaris 10, the original package metadata (and unmodified copies of editable files) are kept. Each package has a directory in /var/sadm/pkg, and in each of those you'll find a save directory. This is used when installing zones, so you get a pristine copy. However, you can also use the pkgmap file to verify a package:

pkgchk -m /var/sadm/pkg/SUNWscpu/save/pspool/SUNWscpu/pkgmap

and this form of usage will detect files that have been removed or modified by tools that are smart enough to update the contents file.

(Because those save files are used by zones, you'll find they don't exist in a zone because they wouldn't be needed there. So this trick only works in a global zone, or you need to manually copy the pkgmap file.)
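
If you want to sweep the whole system this way, a short loop over /var/sadm/pkg does the job - a sketch, to be run as root in the global zone:

    # verify every installed package against its pristine pkgmap
    for pkg in /var/sadm/pkg/*
    do
        p=`basename $pkg`
        map="$pkg/save/pspool/$p/pkgmap"
        if [ -f "$map" ]; then
            echo "=== $p"
            pkgchk -m "$map"
        fi
    done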

Tuesday, April 15, 2014

Partial root zones

In Tribblix, I support sparse-root and whole-root zones, which work largely the same way as in Solaris 10.

The implementation of zone creation is rather different. The original Solaris implementation extended packaging - so the packaging system, and every package, had to be zone-aware. This is clearly unsustainable. (Unfortunately, the same mistake was made when IPS was introduced.)

Apart from creating work, this approach limits flexibility - in order to innovate with zones, for example by adding new types, you have to extend the packaging system, and then modify every package in existence.

The approach taken by Tribblix is rather different. Instead of baking zone architecture into packaging, packaging is kept dumb and the zone creation scripts understand how packages are put together.

In particular, the decision as to whether a given file is present in a zone (and how it ends up there) is not based on package attributes, but is a simple pathname filter. For example, files under /kernel never end up in a zone. Files under /usr might be copied (for a whole-root zone) or loopback mounted (for a sparse-root zone). If it's under /var or /etc, you get a fresh copy. And so on. But the decision is based on pathname.
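
Purely as an illustration of the idea (this is not the actual code in the zone scripts), the decision boils down to a case statement on the pathname:

    # illustrative sketch only - not the real pkgcreatezone logic
    classify() {
        case $1 in
            /kernel/*)     echo "skip" ;;   # never ends up in a zone
            /usr/*)        echo "usr" ;;    # copied (whole-root) or loopback mounted (sparse-root)
            /etc/*|/var/*) echo "fresh" ;;  # the zone gets its own fresh copy
            *)             echo "copy" ;;
        esac
    }
    classify /usr/bin/ls    # prints "usr"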

It's not just the files within packages that get copied. The package metadata is also copied; the contents file is simply filtered by pathname - and that's how the list of files to copy is generated. This filtering takes place during zone creation, and is all done by the zone scripts - the packaging tools aren't invoked (one reason why it's so quick). The scripts, if you want to look, are at /usr/lib/brand/*/pkgcreatezone.

In the traditional model, the list of installed packages in the zone is (initially) identical to that in the global zone. For a sparse-root zone, you're pretty much stuck with that. For a whole-root zone, you can add and remove packages later.

I've been working on some alternative models for zones in Tribblix that add more flexibility to zone creation. These will appear in upcoming releases, but I wanted to talk about the technology.

The first of these is what you might call a partial-root zone. This is similar to a whole-root zone in the sense that you get an independent copy, rather than being loopback mounted. And, it's using the same TRIBwhole brand. The difference is that you can specify a subset of the overlays present in the global zone to be installed in the zone. For example, you would use the following install invocation:

zoneadm -z myzone install -o developer

and only the developer overlay (and the overlays it depends on) will be installed in the zone.

This is still a copy - the installed files in the global zone are the source of the files that end up in the zone, so there's still no package installation, no need for repository access, and it's pretty quick.

This is still a filter, but you're now filtering both on pathname and package name.

As for package metadata, for partial-root zones, references to the packages that don't end up being used are removed.

That's the subset variant. The next obvious extension is to be able to specify additional packages (or, preferably, overlays) to be installed at zone creation time. That does require an additional source of packages - either a repository or a local cache - which is why I treat it as a logically distinct operation.

Time to get coding.

Sunday, April 13, 2014

Cloud analogies: Food As A Service

There's a recurring analogy of Cloud as utility, such as electrical power. I'm not convinced by this, and regard a comparison of the Cloud with the restaurant trade as more interesting. Read on...

Few IT departments build their own hardware, in the same way that few people grow their own food or keep their own livestock. Most buy from a supplier, in the same way that most buy food from a supermarket.

You could avoid cooking by eating out for every meal. Food as a Service, in current IT parlance.

The Cloud shares other properties with a restaurant. It operates on demand. It's self service, in the sense that anyone can walk in and order - you don't have to be a chef. There's a fixed menu of dishes, and portion sizes are fixed. It deals with wide fluctuations of usage throughout the day. For basic dishes, it can be more expensive than cooking at home. It's elastic, and scales, whereas most people would struggle if 100 visitors suddenly dropped by for dinner.

There's a wide choice of restaurants. And a wide variety of pricing models to match - Prix Fixe, a la carte, all you can eat.

Based on this analogy, the current infatuation with moving everything to the cloud would be the same as telling everybody that they shouldn't cook at home, but should always order in or eat out. You no longer need a kitchen, white goods, or utensils, nor do you need to retain any culinary skills.

Sure, some people do eat primarily at a basic burger bar. Some eat out all the time. Some have abandoned the kitchen. Is it appropriate for everyone?

Many people go out to eat not necessarily to avoid preparing their own food, but to eat dishes they cannot prepare at home, to try something new, or for special occasions.

In other words, while you can eat out for every meal, Food as a Service really comes into its own when it delivers capabilities beyond that of your own kitchen. Whether that be in the expertise of its staff, the tools in its kitchens, or the special ingredients that it can source, a restaurant can take your tastebuds places that your own kitchen can't.

As for the lunacy that is Private Cloud, that's really like setting up your own industrial kitchen and hiring your own chefs to run it.

Wednesday, April 02, 2014

Slimming down logstash

Following on from my previous post on logstash, it rapidly becomes clear that the elasticsearch indices grow rather large.

After a very quick look, it was obvious that some of the fields I was keeping were redundant or unnecessary.

For example, why keep the pathname of the log file itself? It doesn't change over time, and you can work out the name of the file easily (if you ever wanted it, and I can't see why you ever would - if you wanted to identify a source, that ought to be some other piece of data you create).

Also, why keep the full log message? You've parsed it, broken it up, and stored the individual fields you're interested in. So why keep the whole thing, a duplicate of the information you're already storing?

With that in mind, I used a mutate clause to remove the file name and the original log entry, like so:

  mutate {
     remove_field => "path"
     remove_field => "message"
  }


After this simple change, the daily elasticsearch indices on the first system I tried this on shrank from 4.5GB to 1.6GB - almost a factor of 3. Definitely worthwhile, and there are benefits in terms of network traffic, search performance, elasticsearch memory utilization, and capacity for future growth as well.

Saturday, February 08, 2014

Zone logs and logstash

Today I was playing with logstash, with the plan to produce a real-time scrolling view of our web traffic.

It's easy enough. Run a logstash shipper on each node, feed everything into redis, get logstash to pull from redis into elasticsearch, then run the logstash front-end and use Kibana to create a dashboard.

Then the desire for efficiency strikes. We're running Solaris zones, and there are a lot of them. Each logstash instance takes a fair chunk of memory, so it seems like a waste to run one in each zone.

So what I wanted to do was run a single copy of logstash in the global zone, and get it to read all the zone logs, yet present the data just as though it had been run in the zone.

The first step was to define which logs to read. The file input can take wildcards, leading to a simple pattern:

input {
  file {
    type => "apache"
    path => "/storage/*/opt/proquest/*/apache/logs/access_log"
  }
}


There's a ZFS pool called storage, and each zone has a ZFS file system named after the zone, so the name of the zone is the directory under /storage. That means I can pick out the name of the zone and put it into a field called zonename like so:

  grok {
    type => "apache"
    match => ["path","/storage/%{USERNAME:zonename}/%{GREEDYDATA}"]
  }


(If it looks odd to use the USERNAME pattern, the naming rules for our zones happen to be the same as for user names, so I use an existing pattern rather than define a new one.)

I then want the host entry associated with this log to be that of the zone, rather than the default of the global zone. So I mutate the host entry:

  mutate {
    type => "apache"
    replace => [ "host","%{zonename}.our.company.name" ]
  }


And that's pretty much it. It's very simple, but most of the documentation I could find was out of date, describing older versions of logstash.

There were a couple of extra pieces of information that I then found it useful to add. The simplest was to duplicate the original host entry into a servername, so I can aggregate all the traffic associated with a physical host. The second was to pick out the website name from the zone name (in this case, the zone name is the short name of the website, with a suffix appended to distinguish the individual zones).
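
For the first of those, a mutate with add_field does the copy - a sketch (untested), which needs to come before the mutate that replaces the host entry:

  mutate {
    type => "apache"
    add_field => [ "servername", "%{host}" ]
  }

For the second, the following grok picks the site name out of the zone name: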

  grok {
    type => "apache"
    match => ["zonename","%{WORD:sitename}-%{GREEDYDATA}"]
  }


Then sitename contains the short name of the site, again allowing me to aggregate the statistics from all the zones that serve that site.

Friday, November 29, 2013

Tribblix - making PXE boot work

One of the key changes in the latest milestone of Tribblix is the ability to boot and install a system over the network, using PXE. I've covered how to set this up elsewhere, but here I'll talk a little about how this is implemented under the covers.

Essentially, the ISO image has 3 pieces.
  1. The platform directory contains the kernel, and the boot archive. This is what's loaded at boot.
  2. The file solaris.zlib is a lofi compressed file containing an image of the /usr filesystem.
  3. The pkgs directory contains additional SVR4 packages that can be installed.
When booting from CD, grub loads the boot archive into memory and hands over control. There's then a little bit of magic where the system tries to mount every device it can find, looking for the CD image - it checks that the .volsetid file found on a device matches the one in the boot archive, to ensure it has the right device. Once that's done it mounts the CD image in a standard place, and from then on knows precisely where to find everything.

When you boot via PXE, you can't blindly search everywhere in the network for the location of solaris.zlib, so the required location is set as a boot argument in menu.lst, and the system extracts the required value from the boot arguments.
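
As a sketch, the relevant menu.lst stanza looks something like this (the install_media argument name is illustrative, not necessarily what Tribblix actually uses; the kernel$ syntax is covered in the boot arguments post below):

    # hypothetical pxegrub menu.lst entry - the install_media argument name
    # is illustrative, not necessarily what Tribblix actually uses
    title Tribblix network install
      kernel$ /platform/i86pc/kernel/$ISADIR/unix -B install_media=http://172.18.1.7:8080/media/
      module$ /platform/i86pc/$ISADIR/boot_archive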

What it will get back is a URL of a server, so it appends solaris.zlib to that and retrieves it using wget. The file is saved to a known location and then lofi mounted. Then boot proceeds as normal.
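
Conceptually, what the startup script does at that point amounts to something like the following - a sketch only, where the install_media argument name, the paths, and the hsfs filesystem type are assumptions rather than the real code:

    # conceptual sketch only - not the actual Tribblix boot code
    SRV=`/usr/sbin/prtconf -v /devices | \
        /usr/bin/sed -n '/install_media/{;n;p;}' | \
        /usr/bin/cut -f 2 -d \'`
    /usr/bin/wget -q ${SRV}/solaris.zlib -O /solaris.zlib
    /usr/sbin/mount -F hsfs -o ro `/usr/sbin/lofiadm -a /solaris.zlib` /usr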

Note that you can use any dhcp/tftp server for the PXE part, and any http server. There's no requirement on the server side for a given platform, configuration, or software. (And it doesn't even have to be http, as long as it's a protocol built into wget.)

It's actually very simple. There are, of course, a few wrinkles along the way.
  • There are some files in /usr that are needed to mount /usr, so the boot archive contains a minimally populated copy of /usr that allows you to bootstrap the system until you mount the real /usr over the top of it
  • For PXE boot, you need more such files in the boot archive than you do for booting from CD. In particular, I had to add prtconf (used in parsing boot arguments) and wget (to do the retrieve over http)
  • I add wget rather than curl, as the wget package is much smaller than the curl package, even though I had previously standardised on curl for package installation
  • Memory requirements are a little higher than for a CD boot, as the whole of solaris.zlib is copied into memory. However, because it's in memory, the system is really fast
It's possible to avoid all of this extra machinery by simply putting the whole of /usr into the boot archive. The downside to that is that it's quite slow to load - you haven't got a fully-fledged OS running at that point, tftp isn't as reliable as it should be and can fail when retrieving larger files, and it pushes up the hard memory requirement for a CD boot. So I've stuck with what I have, and it works reasonably well.

The final piece of the ISO image is the additional packages. If you tell the system nothing, it will go off to the main repositories to download any packages. (Please, don't do this. I'm not really set up to deliver that much traffic.) But you can copy the pkgs directory from the iso image and specify that location as a boot argument so the installer knows where the packages are. What it actually does underneath is set that location up as the primary repository temporarily during the install.

The present release doesn't have automation - booting via PXE is just like booting from CD, and you have to run the install interactively. But all the machinery is now in place to build a fully automated install mechanism (think like jumpstart, although it'll achieve the same goals via completely different means).

One final note. Unlike the OpenSolaris/Solaris 11/OpenIndiana releases which have separate images for server, desktop, and network install, Tribblix has a single image that does all 3 in one. The ability to define installed packages eliminates the need for separate desktop (live) and server (text) images, and the PXE implementation described here means you can take the regular iso and use that for network booting.

Tribblix - getting boot arguments

This explains how I handled boot arguments for Tribblix, but it's generally true for all illumos and similar distributions. This is necessary for things like PXE boot and network installation, where you need to be able to tell the system critical information without baking it into the source.

The particular mechanism described here is for x86 only; it's unfortunate that the boot mechanism is architecture specific.

Anyway, back to boot arguments. With grub, the menu.lst file determines how the system boots. In particular, the kernel$ line specifies which kernel to boot, and you can pass boot arguments. For example, it might say:

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B console=ttya

and, in this case, what comes after -B are the boot arguments: a comma-separated list of key=value pairs.

Another example, from my implementation of PXE boot, might be:

-B install_pkgs=http://172.18.1.7:8080/pkgs/0m8/pkgs/

So that's how they're defined, and you can really define anything you like. It's up to the system to interpret them as it sees fit.

When the system boots, how do you access these parameters? They're present in the system configuration as displayed by prtconf. In particular

prtconf -v /devices

gets you the information you want - a bunch of standard properties plus the boot arguments. Try this on a running system, and you'll see things like what program actually got booted:

        name='bootprog' type=string items=1
            value='/platform/i86pc/multiboot'


So, all you have to do to find the value of a boot argument is look through the prtconf output for the name of the boot argument you're after, and then pick the value off the next line. Going back to my example earlier, we just look for install_pkgs and get the value. This little snippet does the job:

PKGMEDIA=`/usr/sbin/prtconf -v /devices | \
    /usr/bin/sed -n '/install_pkgs/{;n;p;}' | \
    /usr/bin/cut -f 2 -d \'`


(Breaking this down: sed -n outputs nothing by default; it looks for the pattern /install_pkgs/, then the {;n;p;} skips to the next line and prints it; cut then grabs the second field, split on the quote character. Ugly as heck.)

At this point, you can test whether the argument was defined, and use it in your scripts.