The Trouble with Tribbles...: September 2008

Wednesday, September 24, 2008

Would you pass?

Sun have made some free pre-assessment tests available.

Just for fun, I went through the UNIX Essentials one. Would I pass? (Given that I have never had any formal training and have a rather eclectic skills mix it's not a foregone conclusion.)

According to the test, yes. I got a whopping 37/42 which is clearly enough to pass.

I suspect, though, that this says more about the accuracy (and grammatical validity, in one case) of the questions. One of the questions was incapable of being parsed into English, and I had to guess at random. Another one had two possible correct answers depending on factors you weren't told about. Another one had no correct answer on a vanilla Solaris system. There were a couple of questions that I looked at and thought to myself 'you wouldn't ever do it like that'.

(Plus a couple of questions on stuff that I would never use under any circumstances. I had a similar test when I applied for my current job, and my answer to every question mentioning vi was ':q' and use a proper editor.)

I tried the SCSA sample tests - scoring a little better on the Part I test, and slightly lower on Part II. But again, there were questions that were simply wrong; some where the correct answer would always be 'look it up in the man page'; some artificially contrived questions; and I'm more than a little concerned about the coverage and subject matter. And on the SCSA tests there are a couple of areas where I haven't done much for a few years now (my OBP and LDAP skills are obviously getting a little rusty).

Tuesday, September 23, 2008

On OpenSolaris Change

When I noted that Sun's plans for OpenSolaris threatened the Solaris ecosystem, I got a mixed bag of comments.

Some of the comments missed the point, which is that compatibility (across the board) is a key strength, and that producing something that forces what is essentially a new platform on the world will drive away old customers without necessarily attracting new ones.

The key point is compatibility. And while modernization is essential (I'll come back to that later), it is possible to do it compatibly, in an evolutionary manner, rather than doing a rip and replace job.

Evolutionary change allows you to keep existing customers, who thus have an easier migration path; makes it easier for new adopters, who can tap into the existing skills pool; and allows the improvements to be fed back to older (i.e. current, such as Solaris 10) releases which still have a long service life ahead of them.

Replacing the packaging system and installer from scratch is just something you should never do. It's probably cost the Solaris/OpenSolaris ecosystem about 2 years, and we can only hope that we can eventually recover in the way that Firefox did after Netscape's mistake.

Saturday, September 20, 2008

Too radical a change?

Attempting to predict the future is difficult, but what I do know about Sun's plans for Solaris and OpenSolaris fills me with concern.

What we seem to be looking at is an OpenSolaris-derived replacement for Solaris. Which means a completely replaced packaging system and installer. Being essentially incompatible with what we currently have, this means a fork-lift upgrade: you can't simply carry on as you were before.

Forcing change upon customers is bad. It makes the upgrade a decision point, and customers are then forced to make a choice. So what might customers do? Let's consider some classes of customer:

Solaris-only shops: they have to go from what they have to something different. So given that they have to change, some might take the replacement; I suspect many will choose something different.

Heterogeneous shops: many large shops are heterogeneous, and already support multiple platforms. I see significant resistance to adopting any new platforms, and many shops will simply migrate to one of their existing platforms rather than adopt a new one.

Alien shops: there are going to be problems getting a new platform into a shop that doesn't already use Solaris. Solaris is mature, well tested, and has a reasonable number of practitioners available in the job market. An OpenSolaris-based platform may be unattractive to such shops: not only would they be unable to bring in expertise for something new, but Sun are advertising it as just the same as Linux, so why would they change to something that isn't different?

So, as I see it, scrapping Solaris and replacing it with a fundamentally different OpenSolaris distribution is going to drive a fraction (possibly quite a large fraction) of the existing Solaris base to other platforms, and I simply can't see any corresponding takeup of new deployments.

Contrast this with the story if you take the existing Solaris and produce a new version (Solaris 11 would be the obvious numbering) that uses the same packaging, installation, deployment, and administration tools as the existing Solaris. In other words, that could be deployed painlessly and seamlessly without any need for additional training or rebuilding new administrative infrastructure, but contains all the advancements that have been made to OpenSolaris in the last 4 years - things such as CIFS client and server, Crossbow, NFS enhancements, and an updated desktop just to name a few. Existing users would simply adopt it as a logical progression; new users would be more attracted because they could concentrate on the technical features and would be able to take advantage of the pool of experience available to deploy it.

The problem is simply one of change. The technical merits of the old and new systems are essentially irrelevant to the discussion. Given how dangerous change is, why is OpenSolaris so insistent on rip and replace rather than improving and enhancing what we already have?

Tuesday, September 16, 2008

Better documentation style

I don't write as much documentation as I should, and frankly what I do write often isn't done that well. But the OpenSolaris Editorial Cheat Sheet contains a lot of useful advice and hints condensed into a small space. Now, if I could just relearn my writing style my documentation wouldn't look like it was written in such an amateurish fashion!

Monday, September 15, 2008

Solaris Advantages

I make extensive use of Solaris, so thought it would be worth summarizing some of the key advantages that it brings for me. Other people might consider other aspects important, and you could construct similar lists for other platforms.

Compatibility - software that works on one release or for a given patch revision of Solaris is pretty well guaranteed to run subsequently. This is huge, and isn't generally true for other platforms. I've got 20-year old applications running happily day in, day out. By and large, everything just works, and continues to work.

Installation Automation - jumpstart is a huge competitive advantage. You can trivially deploy systems, being able to completely reproduce a configuration, and roll out systems and updates effortlessly.

Lightweight virtualization - Zones, especially sparse root zones, allow you to consolidate large numbers of small systems onto a server, with minimal overhead and without adding to the management overhead normally associated with adding another system to your network. (Note that the real advantage here comes from the use of sparse root zones, which not only guarantee that the zone looks like its parent, but mean that you don't manage the software on the zones at all but just manage the parent host. Whole root zones aren't as lightweight and don't have anything like the same advantages, and branded zones - while a neat trick - don't have any compelling advantages over other virtualization solutions.)

ZFS - for storage on a massive scale, combined with ease of management, and the ability to verify that your data is what you thought it was, this is key. To that you can add snapshots (which we use automatically now any time we change something, which makes our backout plans for a change request way simpler than they used to be), and compression (storing text - or xml - files just got a whole lot cheaper), and it's free.

Ease of management - while Sun have generally failed completely to provide advanced management tools, the fact is that you don't need them - the underlying facilities in Solaris are pretty solid and it's trivial to write your own layer of tools on top, and integrate Solaris into a range of other tools. Not only that, but the tools are consistent - while things do evolve, you don't have to completely relearn how to manage the system on a regular and frequent basis.

Cheap - it's free, and not only that but you don't have to pay for a 3rd-party virtualization solution, I/O multipathing, volume manager, or filesystem, as they're all included.

Sunday, September 14, 2008

Refactoring solview and jkstat

Originally, JKstat and SolView were completely separate.

The latest released version of SolView comes with JKstat, so you can launch some of the JKstat demos.

I'm looking at much closer ties between the two. I've had a much more in-depth use for JKstat in mind all along, and the way I'm doing it is by adding what I'm calling a "System Explorer" to SolView.

So SolView will have a view of all the interesting components of a system: processors (chips, cores, and threads), memory, disks, filesystems, networks. Anything else if I can think of how to do it. And then it will display pretty much everything it can about the selected object. A lot of that information is gleaned from kstats using JKstat.

From that point of view, something like the ZFS ARC demo makes more sense as a sophisticated component inside SolView rather than a standalone JKstat demo application.

So what I'm planning on doing (and this may take a while) is to have a spring clean of the demos in JKstat, removing the bad ones entirely and moving the more complex and involved ones to SolView. And then splitting JKstat into two logically separate parts: the core API (which has no graphical components), and the graphical browser with some basic demos. The two parts of JKstat will still be developed together, although I wouldn't expect that much development once the process is complete, as JKstat will be stable and the higher-level fancy tools will be developed independently under the SolView banner.

Friday, September 05, 2008

How to confuse ImageMagick

I mentioned some huge files generated by ImageMagick.

I worked out what was going wrong. What we do is take a 600dpi original and generate a bunch of images at different resolutions and formats. Looking at the headers:


  Software: Adobe Photoshop CS2 Windows

That's odd. Someone has fiddled with the image.


  Image Width: 2943 Image Length: 4126

Hm. Not so bad.


  Resolution: 0.393, 0.393 pixels/cm

Yikes! If my calculations are correct that's 1 dpi.

So when I resize it to 300 dpi I end up trying to create a 882900x1237800 image. 10^12 pixels. No wonder it can't cope.
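A quick sketch of that arithmetic, for anyone who wants to check it. The function name is mine, not anything from ImageMagick, and it assumes the resize simply scales each pixel dimension by the ratio of the target resolution to the resolution recorded in the header.

```python
CM_PER_INCH = 2.54

def resized_dimensions(width_px, height_px, source_px_per_cm, target_dpi):
    """Return (source_dpi, new_width, new_height) for a resolution-based resize.

    Assumes each dimension is scaled by target_dpi / source_dpi.
    """
    source_dpi = source_px_per_cm * CM_PER_INCH
    scale = target_dpi / source_dpi
    return source_dpi, round(width_px * scale), round(height_px * scale)

# Values taken from the image header shown above
source_dpi, new_w, new_h = resized_dimensions(2943, 4126, 0.393, 300)
print(f"source resolution: {source_dpi:.3f} dpi")
print(f"resized image: {new_w} x {new_h} = {new_w * new_h:.2e} pixels")
```

The header value of 0.393 pixels/cm is a shade under 1 dpi, so the unrounded result comes out slightly larger than the figures above, which treat it as exactly 1 dpi. Either way it is of the order of 10^12 pixels.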

Moral of the story: never trust your input data.

Thursday, September 04, 2008

When to bury the pager

If anyone's been following me on twitter recently you may have noticed a few fraught messages about SANs and pagers.

We have an on-call rota. Being a relatively small department, this actually means that we cover the entire department - so it's possible that I might get a call to sort out a Windows problem, or that one of the Windows guys might get to sort out one of my Sun servers. But it's not usually too stressful.

This last week has been a bit of a nightmare and the problem has been so bad and so apparently intractable that I've simply buried the pager, turned off notification of email and texts on the phone, and relied on someone phoning me if anything new came up. Otherwise I would get woken up several hundred times a night for no good purpose.

Of course, today being the final day of my stint (yay!) I finally work out what's causing it.

What we've been having is the SAN storage on one of our boxes going offline. Erratically, unpredictably, and pretty often. Started last Friday, continuing on and off since.

This isn't the first time. We've seen some isolated panics, and updated drivers. They fix the panic, for sure, but now it just stays broken when it sees a problem. The system vendor, the storage vendor, and the HBA vendor got involved.

We've tried a number of fixes. Replaced the HBA. Made no difference. Put another HBA in a different slot. Made no difference. Tried running one port on each HBA rather than 2 on one. Made no difference. We're seeing problems down all paths to the storage (pretty much equally).

Last night (OK, early this morning) I noticed that the block addresses that were reporting errors weren't entirely random. There was a set of blocks that were being reported again and again. And the errors came in groups, but each group contained one of the common blocks (presumably the others were just random addresses that happened to be being accessed during the error state).
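Spotting that kind of pattern by eye is tedious, so here is a rough sketch of how the check could be done. It assumes the errors have already been pulled out of the logs and split into groups (say, by gaps in the timestamps); the function names and the block addresses are made up for illustration and are not from the real logs.

```python
from collections import Counter

def recurring_blocks(error_groups, min_groups=2):
    """Map each block address to the number of groups it appears in,
    keeping only blocks seen in at least min_groups groups."""
    seen_in = Counter()
    for group in error_groups:
        seen_in.update(set(group))  # count a block once per group
    return {block: n for block, n in seen_in.items() if n >= min_groups}

def groups_without_recurring(error_groups, recurring):
    """Return the groups that contain none of the recurring blocks."""
    return [g for g in error_groups if not recurring.keys() & set(g)]

# Hypothetical block addresses, one list per burst of errors
groups = [
    [1000, 52311, 9],
    [1000, 777],
    [4242, 88, 1000],
    [4242, 31337],
]

common = recurring_blocks(groups)
print("recurring blocks:", common)
print("groups with no recurring block:", groups_without_recurring(groups, common))
```

If the second list comes back empty, every burst of errors includes at least one of the repeat offenders, which is the pattern described above.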

I've had conversations with some users who've been having trouble getting one of their applications to run to completion with all the problems we've had. And they're getting fraught because they have deadlines to meet.

And then I start putting two and two together. Can I find out exactly when they were running their application? OK, so they started last Friday (just about when the problem started). And we know that the system was fine for a while after a reboot, and going back it turns out that either a plain reboot, or a reboot for hardware replacement, kills whatever they're doing, and it may be later in the evening or the next morning before they start work again.

So, it's an absolutely massive coincidence - an almost perfect correlation - that we have problems that kill the entire system for hours, starting about an hour after they start their applications up, and that the problems finish within seconds of their application completing a task.

So, it looks very much like there's something in their data that's killing either the SAN, the HBA, or the driver. Some random pattern of bits that causes something involved to just freak out. (I don't really think it's a storage hardware error. It could be, but there are so many layers of abstraction and virtualisation in the way that a regular bad block would get mangled long before it gets to my server.) And it's only the one dataset that's causing grief - we have lots of other applications, and lots of servers, and none of them are seeing significant problems.

So, we can fix the problem - just don't run that thing!

And then I realize that I've seen this before. Now that was on a completely different model of server running a different version of Solaris with a different filesystem on different storage. But it was files (different files) from the same project. Creepy.

Thank heaven for sparse files!

We use ImageMagick to do a lot of image processing. I'm not sure what it's up to, but some processing needs to create temporary working files that can be quite large (in /var/tmp by default; I've moved them with TMPDIR because that filled up).

However, I now see this:


/bin/ls -l /storage/tmp
total 203037932
-rw-------   1 user grp       169845176062560 Sep  4 11:18 magick-XXT0aaXE
-rw-------   1 user grp       222497224416 Sep  4 13:24 magick-XXU0aaXE
-rw-------   1 user grp       11499827272024 Sep  4 13:11 magick-XXbFaiKF
-rw-------   1 user grp       15064771904 Sep  4 13:24 magick-XXcFaiKF
-rw-------   1 user grp       18904557170048 Sep  4 10:51 magick-XXtlaGCE
-rw-------   1 user grp       24764978480 Sep  4 13:24 magick-XXulaGCE

or, in more readable units a few seconds later:


/bin/ls -lhs /storage/tmp
total 203038194
33272257 -rw-------   1 user grp          154T Sep  4 11:18 magick-XXT0aaXE
34295031 -rw-------   1 user grp          207G Sep  4 13:24 magick-XXU0aaXE
29432967 -rw-------   1 user grp           10T Sep  4 13:11 magick-XXbFaiKF
9271301 -rw-------   1 user grp           14G Sep  4 13:24 magick-XXcFaiKF
48382483 -rw-------   1 user grp           17T Sep  4 10:51 magick-XXtlaGCE
48384155 -rw-------   1 user grp           23G Sep  4 13:24 magick-XXulaGCE

Ouch. That's on an internal 146G drive.

What on earth is it doing with a 154 terabyte file?
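The reason the drive survives is that these are sparse files: the size that ls -l reports is the apparent size, while the first column of ls -s counts the blocks actually allocated, which is far smaller. A minimal sketch of the same comparison follows. The helper names and the file name are mine, and it assumes a POSIX system where st_blocks is available and counted in 512-byte units.

```python
import os
import tempfile

def make_sparse_file(path, size):
    """Create a file of `size` bytes by seeking past the end and writing one byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")

def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for a file."""
    st = os.stat(path)
    # st_blocks is in 512-byte units (POSIX platforms; not available on Windows)
    return st.st_size, st.st_blocks * 512

# 16 MiB: small enough to be harmless if the filesystem does not support holes
SIZE = 16 * 1024 * 1024

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hypothetical-magick-temp")
    make_sparse_file(path, SIZE)
    apparent, allocated = apparent_and_allocated(path)

print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On a filesystem that supports holes the allocated figure should be a tiny fraction of the apparent one; how small depends on the filesystem, so I haven't quoted a number.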