The Trouble with Tribbles...: 2014

Tuesday, December 23, 2014

Setting up Logical Domains, part 2

In part 1, I talked about the server side of Logical Domains. This time, I'll cover how I set up a guest domain.

First, I have created a zfs pool called storage on the host, and I'm going to present a zvol (or maybe several) to the guests.

I'm going to create a little domain called ldom1.

ldm add-domain ldom1
ldm set-core 1 ldom1
ldm add-memory 4G ldom1
ldm add-vnet vnet1 primary-vsw0 ldom1

Now create and add a virtual disk. Create a dataset for the ldom and a 24GB volume inside it, add it to the storage service and set it as the boot device.

zfs create storage/ldom1
zfs create -V 24gb storage/ldom1/disk0
ldm add-vdsdev /dev/zvol/dsk/storage/ldom1/disk0 ldom1_disk0@primary-vds0
ldm add-vdisk disk0 ldom1_disk0@primary-vds0 ldom1
ldm set-var auto-boot\?=true ldom1
ldm set-var boot-device=disk0 ldom1

Then bind resources, list, and start:

ldm bind-domain ldom1
ldm list-domain ldom1
ldm start-domain ldom1

You can connect to the console by looking at the number under CONS in the list-domain output:

# ldm list-domain ldom1
NAME STATE FLAGS CONS VCPU MEMORY UTIL UPTIME
ldom1 bound ------ 5000 8 4G

# telnet localhost 5000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

Connecting to console "ldom1" in group "ldom1" ....
Press ~? for control options ..

T5140, No Keyboard
Copyright (c) 1998, 2014, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.33.6.e, 4096 MB memory available, Serial #83586105.
Ethernet address 0:14:4f:fb:6c:39, Host ID: 84fb6c39.

Boot device: disk0 File and args:
Bad magic number in disk label
ERROR: /virtual-devices@100/channel-devices@200/disk@0: Can't open disk label package

ERROR: boot-read fail

Evaluating:

Can't open boot device

{0} ok

Attempt to boot it off the net and it says:

Requesting Internet Address for 0:14:4f:f9:87:f9

which doesn't match the MAC address in the console output. So if you want to jumpstart the box, and that's easy enough, you need to do a 'boot net' first to get the actual MAC address of the LDOM to add it to your jumpstart server.

To add an iso image to boot from:

ldm add-vdsdev options=ro /var/tmp/img.iso iso@primary-vds0
ldm add-vdisk iso iso@primary-vds0 ldom1

At the ok prompt, you can issue the 'show-disks' command to see what disk devices are present. To boot off the iso:

boot /virtual-devices@100/channel-devices@200/disk@1:f

And it should work. This is how I've been testing the Tribblix images for SPARC, by the way.

Setting up Logical Domains, part 1

I've recently been playing with Logical Domains (aka LDOMs, aka Oracle VM Server for SPARC). For those unfamiliar with the technology, it's a virtualization framework built into the hardware of pretty well all current SPARC systems, more akin to VMware than Solaris zones.

For more information, see here, here, or here.

First, why use it? Especially when Solaris has zones. The answer is that it addresses a different set of problems. Individual LDOMs are more independent and much more isolated than zones. You can partition resources more cleanly, and different LDOMs don't have to be at the same patch level (to my mind, it's not that you can have a different level of patches in each LDOM that matters, but that you can do maintenance of each LDOM to different schedules that matters). One key advantage I find is that the virtual switch you set up with LDOMs is much better at dealing with complex network configuration (I have hosts scattered across maybe dozens of VLANs, trying to fake that up on Solaris 10 is a bit of a bind). And some applications don't really get on with zones - I would build new systems around zones, but ill-understood and poorly documented legacy systems might be easier to drop inside an LDOM.

That dealt with, here's how I setup up one of my machines (a T5140, as practice for live deployment on a bunch of replacement T4-1 systems) as an LDOM host. I'll cover setting up the guest side in a second post.

Make sure the firmware is current - these are the minimum revs:

T2 - 7.4.5
T3 - 8.3
T4 - 8.4.2c

Then install the LDOM software.

cd /var/tmp
unzip p17291713_31_SOLARIS64.zip
cd OVM_Server_SPARC-3_1/Install
./install-ldm

You'll get asked if you want to launch the configuration assistant after installation. I chose n, and you can run ldmconfig at any later time. (If you want - it's best not to.)

Now need to apply the LDOM patch

svcadm disable -s ldmd
patchadd 150817-02
svcadm enable ldmd

Verify things are working as expected:

# ldm list
NAME STATE FLAGS CONS VCPU MEMORY UTIL UPTIME
primary active -n-c-- SP 128 32544M 0.1% 16m

You should see the primary domain.

The next step is to establish default services and configure the control domain.

The necessary services are:

Virtual console
Virtual disk server
Virtual switch service

ldm add-vds primary-vds0 primary
ldm add-vcc port-range=5000-5100 primary-vcc0 primary
ldm add-vsw net-dev=nxge0 primary-vsw0 primary

verify with

ldm list-services primary

And we need to limit the control domain to a limited set of resources. The way I do this (this is just a personal view) is, on a system with N cores, to define N units each with 1 core with 1/N of the total memory. Assign one of those units to the primary domain, and then build the guest domains with 1 or more of those units (sizing as necessary - note that you can resize them on the fly so you don't have to get it perfect first time around). You can get very detailed and start allocating down to individual threads and specific amounts of memory, but it's so much better to just keep it simple.

For a T2/T2+/T3 you need to futz with crypto MAUs. This is unnecessary on later systems.

To show:

ldm list -o crypto primary

To add

ldm set-mau 1 primary

To set the CPU resources of the control domain

ldm set-core 1 primary

(I want to just set 1 core. Allocation best done by cores, not threads.)

Start the reconfig

ldm start-reconf primary

Fix the memory

ldm set-memory 4G primary

Save the config to the SP

ldm add-config initial

verify the config has been saved

ldm list-config

Reboot to activate

shutdown -y -g0 -i6

OK, so that creates a 1-core (8-thread) 4G control domain. And that all seems to work.

Next steps are to configure networking and enable terminal services. From the console (as you're reconfiguring the primary network):

ifconfig vsw0 plumb
ifconfig nxge0 down unplumb
ifconfig nxge0 inet6 down unplumb
ifconfig vsw0 inet 172.18.1.128 netmask 255.255.255.0 broadcast 172.18.1.255 up
ifconfig vsw0 inet6 plumb up
mv /etc/hostname.nxge0 /etc/hostname.vsw0
mv /etc/hostname6.nxge0 /etc/hostname6.vsw0

For a T4, replace nxge with igb.

At this point, you have a machine with minimal resources assigned to the primary domain, which looks for all the world like a regular Solaris box, ready to create guest domains using the remaining resources.

Saturday, October 25, 2014

Tribblix progress

I recently put out a Milestone 12 image for Tribblix.

It updates illumos, built natively on Tribblix. There's been a bit of discussion recently about whether illumos needs actual releases, as opposed to being continuously updated. It doesn't have releases, so when I come to make a Tribblix release I simply check out the current gate, build, and package it. After all, it's supposed to be ready to ship at any time.

Note that I don't maintain a fork of illumos-gate, I build it essentially as-is. This is the same for all the components I build for Tribblix - I keep true to unmodified upstream as much as possible.

The one change I have made is to SVR4 packaging. I've removed the dependency on openssl and wanboot (bug #5188), which is a good thing. It means that you can't use signed SVR4 packages, but I've never encountered one. Nor can pkgadd now directly retrieve a package via http, but the implementation via wanboot was spectacularly dire, and you're much better off using curl or wget, which allows proper repository management (as zap does). Packaging is a little quicker now, but this change also makes it much easier to update openssl in future (it's difficult to update something your packaging system is linked against).

Tribblix is now firmly committed to gcc4 (as opposed to the old gcc3 in OpenSolaris). I've rebuilt gcc to fix visibility support. If you've ever seen 'warning: visibility attribute not supported in this configuration' then you'll have stumbled across this. Basically, you need to ensure objdump is found during the gcc build - either by making sure it's in the path or by setting OBJDUMP to point to it.

I've added a new style of zones - alternate root zones. These are sparse root zones, but instead of inheriting from the global zone you can use an alternate installed image. More on that later.

There's the usual slew of updates to various packages, including the obviously sensitive bash and openssl.

There's an interesting fix to python. I put software that might come in multiple versions underneath /usr/versions and use symlinks so that applications can be found in the normal locations. Originally, /usr/bin/python was a symlink that went to ../versions/python-x.y.x/bin/python. This works fine most of the time. However, if you called it as /bin/python it couldn't find its modules, so the symlink has to be ../../usr/versions/python-x.y.x/bin/python which makes things work as desired.

The package catalogs now contain package sizes and checksums, allowing verification of downloaded packages. I need to update zap to actually use this data, and to retry or resume failed or incomplete downloads. (It's a shame that curl doesn't automatically resume incomplete downloads the way that wget does.)

At a future milestone, upgrades will be supported (regular package updates have worked for a while, I'm talking about a whole distro upgrade here). It's possible to upgrade by hand already, but it requires a few extra workarounds (such as forcing postremove scripts to always exit 0) to make it work properly. I've got most of the preparatory work in place now. Upgrading zones looks a whole lot more complicated, though (and I haven't really seen it done well elsewhere).

Now, off to work on the next update.

Wednesday, May 21, 2014

Building illumos-gate on Tribblix

Until recently, I've used an OpenIndiana system to build the illumos packages that go into Tribblix. Clearly this is less than ideal - it would be nice to be able to build all of Tribblix on Tribblix.

This has always been a temporary expedient. So, here's how to build illumos-gate on Tribblix.

(Being able to do so is also good in that it increases the number of platforms on which a vanilla illumos-gate can be built.)

First, download and install Tribblix (version 0m10 or later). I recommend installing the kitchen-sink.

Then, if you're running 0m10, apply some necessary updates. As root:

    zap refresh-overlays
    zap refresh-catalog
    zap update-overlay develop
    zap uninstall TRIBdev-object-file
    zap install TRIBdev-object-file

This won't be necessary in future releases, but I found some packaging issues which interfered with the illumos build (although other software doesn't bother), including some symlinks so that various utilities are where illumos-gate expects.

I run the build in a zone. It requires a non-standard environment, and using a zone means that I don't have to corrupt the global zone, and I can repeatably guarantee that I get a correct build environment.

Then, install a build zone. This will be a whole-root zone in which we copy the develop overlay from the global zone, and add the illumos-build overlay into the zone. (It will download the packages for the illumos-build overlay the first time you do this, but will cache them so if you repeat this later - and I tend to create build zones more or less at will - it won't have to). You need to specify a zone name and give it an IP address.

    zap create-zone -t whole \
        -z il-build -i 172.18.1.206 \
        -o develop -O illumos-build

This will automatically boot the zone, you just have to wait until SMF has finished initialising.

Configure the zone so it can resolve names from DNS:

    cp /etc/resolv.conf /export/zones/il-build/root/etc/
    cp /etc/nsswitch.dns /export/zones/il-build/root/etc/nsswitch.conf

Go into the zone

zlogin il-build

In the zone, create a user to do the build, and a couple of hacky fixes

    rm /usr/bin/cpp
    cd /usr/bin ; ln -s ../gnu/bin/xgettext gxgettext

(The first is a bug in my gcc, the latter is a Makefile bug.)

If you want to build with SMB printing

    zap install TRIBcups

Now, as the user, the build largely follows the normal instructions: you can use git to clone illumos-gate, unpack the closed bins, copy illumos.sh and nightly.sh, and edit illumos.sh to customize the build.

There are a few things you need to do to get a successful build. The first is to add the following to illumos.sh

export SUPPRESSPKGDEP=true

this is necessary as the IPS dependency step uses the installed image; as Tribblix uses SVR4 packaging, there isn't one. You can still create the IPS repo (and I do, as that's what I then turn into SVR4 packages), but the dependency step needs to be suppressed.

If you want to build with CUPS, then you'll need to have installed cups, and you'll need to patch smb. Alternatively, avoid pulling in CUPS by adding this to illumos.sh:

export ENABLE_SMB_PRINTING='#'

As Tribblix uses newer glib, the API has changed slightly and hal uses the old API. There is a proper fix, but you can simply:

gsed -i '/g_type_init/d' usr/src/cmd/hal/hald/hald.c

Note that this means that you won't be able to run the hal components on a system with a downrev glib.

Then you should be able to run a build:

time ./nightly.sh illumos.sh

The build should be clean (I see ELF runtime attribute warnings, all coming from glib and ffi, but those don't actually matter, and I'm not sure illumos should be complaining about errors in its external dependencies anyway).

Friday, May 16, 2014

Software verification of SVR4 packages with pkgchk

On Solaris (and Tribblix) you can use the pkgchk command to verify that the contents of a software package are correctly installed.

The simplest invocation is to give pkgchk the name of a package:

pkgchk SUNWcsl

I would expect SUNWcsl to normally validate cleanly. Whereas something like SUNWcsr will tend to produce lots of output as it contains lots of configuration files that get modified. (Use the -n flag to suppress most of the noise.

If you want to check individual files, then you can use

pkgchk -p /usr/bin/ls

or (and I implemented this as part of the OpenSolaris project) you can feed a list of files on stdin:

find /usr/bin -mtime -150 | pkgchk -i -

However, it turns out that there's a a snag with the basic usage of pkgchk to analyze a package, in that it will trust the contents file - both for the list of files in the package, and for their attributes.

Modifying the list of files can be a result of using installf and removef. For example, I delete some of the junk out of /usr/ucb (such as /usr/ucb/cc so as to be sure no poor unfortunate user can ever run it), and use removef to clean up the contents file. A side-effect of this is that pkgchk won't normally be able to detect that those files are missing.

Modifying file attributes can be the result of a second package installing the same pathname with different attributes. Having multiple packages deliver a directory is common, but you can also have multiple packages own a file. Whichever package was installed last gets to choose which attributes are correct, and the normal pkgchck is blind to any changes as a result.

There's a trick to get round this. From Solaris 10, the original package metadata (and unmodified copies of editable files) are kept. Each package has a directory in /var/sadm/pkg, and in each of those you'll find a save directory. This is used when installing zones, so you get a pristine copy. However, you can also use the pkgmap file to verify a package:

pkgchk -m /var/sadm/pkg/SUNWscpu/save/pspool/SUNWscpu/pkgmap

and this form of usage will detect files that have been removed or modified by tools that are smart enough to update the contents file.

(Because those save files are used by zones, you'll find they don't exist in a zone because they wouldn't be needed there. So this trick only works in a global zone, or you need to manually copy the pkgmap file.)

Tuesday, April 15, 2014

Partial root zones

In Tribblix, I support sparse-root and whole-root zones, which work largely the same way as in Solaris 10.

The implementation of zone creation is rather different. The original Solaris implementation extended packaging - so the packaging system, and every package, had to be zone-aware. This is clearly unsustainable. (Unfortunately, the same mistake was made when IPS was introduced.)

Apart from creating work, this approach limits flexibility - in order to innovate with zones, for example by adding new types, you have to extend the packaging system, and then modify every package in existence.

The approach taken by Tribblix is rather different. Instead of baking zone architecture into packaging, packaging is kept dumb and the zone creation scripts understand how packages are put together.

In particular, the decision as to whether a given file is present in a zone (and how it ends up there) is not based on package attributes, but is a simple pathname filter. For example, files under /kernel never end up in a zone. Files under /usr might be copied (for a whole-root zone) or loopback mounted (for a sparse-root zone). If it's under /var or /etc, you get a fresh copy. And so on. But the decision is based on pathname.

It's not just the files within packages that get copied. The package metadata is also copied; the contents file is simply filtered by pathname - and that's how the list of files to copy is generated. This filtering takes place during zone creation, and is all done by the zone scripts - the packaging tools aren't invoked (one reason why it's so quick). The scripts, if you want to look, are at /usr/lib/brand/*/pkgcreatezone.

In the traditional model, the list of installed packages in the zone is (initially) identical to that in the global zone. For a sparse-root zone, you're pretty much stuck with that. For a whole-root zone, you can add and remove packages later.

I've been working on some alternative models for zones in Tribblix that add more flexibility to zone creation. These will appear in upcoming releases, but I wanted to talk about the technology.

The first of these is what you might call a partial-root zone. This is similar to a whole-root zone in the sense that you get an independent copy, rather than being loopback mounted. And, it's using the same TRIBwhole brand. The difference is that you can specify a subset of the overlays present in the global zone to be installed in the zone. For example, you would use the following install invocation:

zoneadm -z myzone install -o developer

and only the developer overlay (and the overlays it depends on) will be installed in the zone.

This is still a copy - the installed files in the global zone are the source of the files that end up in the zone, so there's still no package installation, no need for repository access, and it's pretty quick.

This is still a filter, but you're now filtering both on pathname and package name.

As for package metadata, for partial-root zones, references to the packages that don't end up being used are removed.

That's the subset variant. The next obvious extension is to be able to specify additional packages (or, preferably, overlays) to be installed at zone creation time. That does require an additional source of packages - either a repository or a local cache - which is why I treat it as a logically distinct operation.

Time to get coding.

Sunday, April 13, 2014

Cloud analogies: Food As A Service

There's a recurring analogy of Cloud as utility, such as electrical power. I'm not convinced by this, and regard a comparison of the Cloud with the restaurant trade as more interesting. Read on...

Few IT departments build their own hardware, in the same way that few people grow their own food or keep their own livestock. Most buy from a supplier, in the same way that most buy food from a supermarket.

You could avoid cooking by eating out for every meal. Food as a Service, in current IT parlance.

The Cloud shares other properties with a restaurant. It operates on demand. It's self service, in the sense that anyone can walk in and order - you don't have to be a chef. There's a fixed menu of dishes, and portion sizes are fixed. It deals with wide fluctuations of usage throughout the day. For basic dishes, it can be more expensive than cooking at home. It's elastic, and scales, whereas most people would struggle if 100 visitors suddenly dropped by for dinner.

There's a wide choice of restaurants. And a wide variety of pricing models to match - Prix Fixe, a la carte, all you can eat.

Based on this analogy, the current infatuation with moving everything to the cloud would be the same as telling everybody that they shouldn't cook at home, but should always order in or eat out. You no longer need a kitchen, white goods, or utensils, nor do you need to retain any culinary skills.

Sure, some people do eat primarily at a basic burger bar. Some eat out all the time. Some have abandoned the kitchen. Is it appropriate for everyone?

Many people go out to eat not necessarily to avoid preparing their own food, but to eat dishes they cannot prepare at home, to try something new, or for special occasions.

In other words, while you can eat out for every meal, Food as a Service really comes into its own when it delivers capabilities beyond that of your own kitchen. Whether that be in the expertise of its staff, the tools in its kitchens, or the special ingredients that it can source, a restaurant can take your tastebuds places that your own kitchen can't.

As for the lunacy that is Private Cloud, that's really like setting up your own industrial kitchen and hiring your own chefs to run it.

Wednesday, April 02, 2014

Slimming down logstash

Following on from my previous post on logstash, it rapidly becomes clear that the elasticsearch indices grow rather large.

After a very quick look, it was obvious that some of the fields I was keeping were redundant or unnecessary.

For example, why keep the pathname of the log file itself? It doesn't change over time, and you can work out the name of the file easily (if you ever wanted it, and I can't see why you ever would - if you wanted to identify a source, that ought to be some other piece of data you create).

Also, why keep the full log message? You've parsed it, broken it up, and stored the individual fields you're interested in. So why keep the whole thing, a duplicate of the information you're already storing?

With that in mind, I used a mutate clause to remove the file name and the original log entry, like so:

mutate {
remove_field => "path"
remove_field => "message"
}

After this simple change, the daily elasticsearch indices on the first system I tried this on shrank from 4.5GB to 1.6GB - almost a factor of 3. Definitely worthwhile, and there are benefits in terms of network traffic, search performance, elasticsearch memory utilization, and capacity for future growth as well.

Saturday, February 08, 2014

Zone logs and logstash

Today I was playing with logstash, with the plan to produce a real-time scrolling view of our web traffic.

It's easy enough. Run a logstash shipper on each node, feed everything into redis, get logstash to pull from redis into elasticsearch, then run the logstash front-end and use Kibana to create a dashboard.

Then the desire for efficiency strikes. We're running Solaris zones, and there are a lot of them. Each logstash instance takes a fair chunk of memory, so it seems like a waste to run one in each zone.

So what I wanted to do was run a single copy of logstash in the global zone, and get it to read all the zone logs, yet present the data just as though it had been run in the zone.

The first step was to define which logs to read. The file input can take wildcards, leading to a simple pattern:

input {
file {
    type => "apache"
    path => "/storage/*/opt/proquest/*/apache/logs/access_log"
}
}

There's a ZFS pool storage, each zone has a zfs file system named after the zone. So the name of the zone is the directory under /storage. So I can pick out the name of the zone and put it into a variable called zonename like so:

grok {
    type => "apache"
    match => ["path","/storage/%{USERNAME:zonename}/%{GREEDYDATA}"]
}

(If it looks odd to use the USERNAME pattern, the naming rules for our zones happen to be the same as for user names, so I use an existing pattern rather than define a new one.)

I then want the host entry associated with this log to be that of the zone, rather than the default of the global zone. So I mutate the host entry:

mutate {
    type => "apache"
    replace => [ "host","%{zonename}.our.company.name" ]
}

And that's pretty much it. It's very simple, but most of the documentation I could find was incorrect in the sense that it applied to old versions of logstash.

There were a couple of extra pieces of information that I then found it useful to add. The simplest was to duplicate the original host entry into a servername, so I can aggregate all the traffic associated with a physical host. The second was to pick out the website name from the zone name (in this case, the zone name is the short name of the website, with a suffix appended to distinguish the individual zones).

grok {
    type => "apache"
    match => ["zonename","%{WORD:sitename}-%{GREEDYDATA}"]
}

Then sitename contains the short name of the site, again allowing me to aggregate the statistics from all the zones that serve that site.