Thursday, September 04, 2008

When to bury the pager

If you've been following me on twitter recently you may have noticed a few fraught messages about SANs and pagers.

We have an on-call rota. Being a relatively small department, whoever's on call covers the whole department - so it's possible that I might get a call to sort out a Windows problem, or that one of the Windows guys might get to sort out one of my Sun servers. But it's not usually too stressful.

This last week has been a bit of a nightmare, and the problem has been so bad and so apparently intractable that I've simply buried the pager, turned off email and text notifications on the phone, and relied on someone phoning me if anything new came up. Otherwise I would get woken up several hundred times a night for no good purpose.

Of course, today being the final day of my stint (yay!), I finally work out what's causing it.

What we've been having is the SAN storage on one of our boxes going offline. Erratically, unpredictably, and pretty often. It started last Friday and has been continuing on and off since.

This isn't the first time. We've seen some isolated panics, and updated drivers. The new drivers fix the panic, for sure, but now the system just stays broken when it sees a problem. The system vendor, the storage vendor, and the HBA vendor have all got involved.

We've tried a number of fixes. Replaced the HBA. Made no difference. Put another HBA in a different slot. Made no difference. Tried running one port on each HBA rather than 2 on one. Made no difference. We're seeing problems down all paths to the storage (pretty much equally).

Last night (OK, early this morning) I noticed that the block addresses reporting errors weren't entirely random. There was a set of blocks being reported again and again. The errors came in groups, but each group contained one of the common blocks (presumably the others were just random addresses that happened to be accessed during the error state).
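Pulling the repeat offenders out doesn't need anything clever. Here's a rough sketch in Python - the /var/adm/messages path and the "Error Block:" pattern are what I'd expect the disk driver to log, so treat both as assumptions and adjust them to whatever your errors actually look like:

    import re
    from collections import Counter

    # Sketch only: count how often each block address appears in the
    # driver error messages.  The log path and the "Error Block:" regex
    # are assumptions -- match them to the messages you actually see.
    LOG = "/var/adm/messages"
    BLOCK_RE = re.compile(r"Error Block:\s*(\d+)")

    counts = Counter()
    with open(LOG) as f:
        for line in f:
            m = BLOCK_RE.search(line)
            if m:
                counts[int(m.group(1))] += 1

    # The blocks that keep turning up are the interesting ones; the rest
    # are presumably just whatever happened to be in flight at the time.
    for block, n in counts.most_common(10):
        print(f"{block:>15}  seen {n} times")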

I've had conversations with some users who've been having trouble getting one of their applications to run to completion with all the problems we've had. And they're getting fraught because they have deadlines to meet.

And then I start putting two and two together. Can I find out exactly when they were running their application? OK, so they started last Friday (just about when the problem started). And we know that the system was fine for a while after a reboot, and going back it turns out that either a plain reboot or a reboot for hardware replacement kills whatever they're doing - it may be later in the evening or the next morning before they start work again.

So, it's an absolutely massive coincidence - an almost perfect correlation - that we have problems that kill the entire system for hours, starting an hour after they fire their application up, and that the problems finish within seconds of their application completing a task.
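If I wanted to be rigorous about it, that cross-check is a few lines of scripting. Here's a sketch with made-up timestamps standing in for the real job and outage records (which I only have from conversations and the logs):

    from datetime import datetime

    # Illustration only: the run windows and outage times below are
    # invented -- the real check would use the users' job records and
    # the timestamps from the error logs.
    app_runs = [
        (datetime(2008, 8, 29, 18, 0), datetime(2008, 8, 30, 6, 30)),
        (datetime(2008, 9, 1, 9, 15), datetime(2008, 9, 1, 23, 45)),
    ]
    outages = [
        datetime(2008, 8, 29, 19, 5),
        datetime(2008, 9, 1, 10, 20),
    ]

    for t in outages:
        inside = any(start <= t <= end for start, end in app_runs)
        print(t, "- during a run" if inside else "- outside any run")

Nothing clever: just checking whether each outage falls inside one of their run windows.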

So, it looks very much like there's something in their data that's killing either the SAN, the HBA, or the driver. Some random pattern of bits that causes something involved to just freak out. (I don't really think it's a storage hardware error. It could be, but there are so many layers of abstraction and virtualisation in the way that a regular bad block would get mangled long before it gets to my server.) And it's only the one dataset that's causing grief - we have lots of other applications, and lots of servers, and none of them are seeing significant problems.

So, we can fix the problem - just don't run that thing!

And then I realize that I've seen this before. That was on a completely different model of server, running a different version of Solaris, with a different filesystem, on different storage. But it was files - different files, but from the same project. Creepy.
