Thursday, August 28, 2008

JKstat meets the ZFS ARC

Recently, Ben Rockwood posted a useful script to display ZFS cache statistics.

Now, all it's doing is grabbing kstats, so it wasn't much of a stretch to put together a new version of JKstat with a new demo that displays the ZFS cache statistics. Try:

jkstat arcstat
(Requires OpenSolaris, Solaris Nevada, or Solaris 10 8/07 or later to actually have the kstats to display.)
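
To give a flavour of what "grabbing kstats" amounts to, here's a rough sketch in Python. It is not how JKstat or Ben's script is implemented; it just assumes the text output of `kstat -p zfs:0:arcstats` (one `module:instance:name:statistic` key, then a tab, then the value, per line) and works out a cache hit ratio from the `hits` and `misses` counters. The function names are made up for illustration.

```python
import subprocess


def parse_kstat_p(text):
    """Parse `kstat -p` style output into {statistic: integer value}.

    Each line is expected to look like
    module:instance:name:statistic<TAB>value
    Non-integer values (e.g. class, crtime) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        statistic = key.rsplit(":", 1)[-1]
        try:
            stats[statistic] = int(value)
        except ValueError:
            continue
    return stats


def hit_ratio(stats):
    """Fraction of ARC lookups that were hits; 0.0 if there were none."""
    hits = stats.get("hits", 0)
    misses = stats.get("misses", 0)
    total = hits + misses
    return hits / total if total else 0.0


def read_arcstats():
    """Run kstat and parse the result (needs a system that has the arcstats kstat)."""
    out = subprocess.run(
        ["kstat", "-p", "zfs:0:arcstats"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_kstat_p(out)
```

On a suitable machine, `hit_ratio(read_arcstats())` gives the hit ratio since boot. Anything like arcstat that shows rates would sample twice and work from the difference between the two sets of counters.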

This release of JKstat is a bit rough, as there are a few other things I was working on that aren't neatly finished off yet, but I thought it worth putting out just for the arcstat demo - any comments and suggestions for improvement would be gratefully received!

So, download JKstat now - and here's a little snapshot of the new demo:

Wednesday, August 27, 2008

Computers - unpredictable creatures

Computers are unpredictable beasts. You would think they would be more deterministic, but reality is otherwise.

I have a server with a tape drive. We've used it for about a year, most days. Then suddenly we start getting errors. At first we thought it was a bad tape, but then multiple tapes started giving us grief. Easy enough, just use a different drive. I finally got around to debugging it last week. Swapped the drives over - still errors. Turned out to be a bad cable. That's a new one - I've not seen a SCSI cable fail like that before. (Usually they fail straight away or when you change something, not after working stably and untouched for the best part of a year.)

Yesterday I set up SNMP on some machines for monitoring purposes. Pointed the monitoring system at them, and a couple of minutes later a couple of them stop responding. That wasn't part of the plan. So I go to the LOM interface, and they're powered off. Call the datacenter, they haven't done anything. I have seen strange things, but SNMP (running unprivileged, I might add) powering a machine off when queried? So I tell them to power themselves back on. One comes up fine, the other boots but with no ZFS filesystems or zones. I try format. No SAN disks. And then:
# fcinfo hba-port
No Adapters Found.
Yikes! It had a couple of Fibre Channel HBAs in it a few minutes ago.

I still don't know exactly what happened, but some electrical gremlins had gotten into the works. The machines had evidently shut themselves off due to lack of power, and I'm guessing that the PSUs were capable of supplying just enough power to boot the machine, but not enough to get the HBAs powered up properly. Another new failure mode to go in the book.