Wednesday, June 07, 2006

That's Reliable?

I was just looking at a Yankee Group report on server reliability, as reported by Yahoo!.

Now it's nice to hear that they think Solaris is winning on reliability. I knew that :-)

However, going beyond the headline they find:
  • 3-5 failures per server per year
  • 10-19.5 hours of downtime per server per year

Of course, that's server uptime, not service uptime. With a decent architecture you would have some backup so that the service would be available even if a server failed. And servers do fail, no matter how good they are, or need maintenance work.
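To illustrate the server-versus-service distinction, here's a rough sketch (my own illustrative figures, not from the report): if a service stays up as long as at least one of two independently failing servers is up, the service's unavailability is the product of the per-server unavailabilities.

```python
# Hypothetical figures: each server is 99.8% available (0.2% down).
# Assuming failures are independent, a two-server service is down
# only when both servers happen to be down at the same time.
per_server_unavailability = 0.002

pair_unavailability = per_server_unavailability ** 2
print(f"single server:      {1 - per_server_unavailability:.4%} available")
print(f"two-server service: {1 - pair_unavailability:.4%} available")
# -> single server:      99.8000% available
# -> two-server service: 99.9996% available
```

The independence assumption is the catch, of course: a shared power feed or a bad patch rolled out to both boxes takes the whole service down regardless.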

But whatever, I don't regard 99.8% availability as anything like good. In fact, it's terrible.
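For reference, a quick back-of-the-envelope check of where that ~99.8% comes from, using the downtime figures quoted above:

```python
# Availability implied by the reported 10-19.5 hours of downtime
# per server per year.
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def availability(downtime_hours: float) -> float:
    """Fraction of the year a server is up."""
    return 1 - downtime_hours / HOURS_PER_YEAR

for h in (10, 19.5):
    print(f"{h:>5} h/year down -> {availability(h):.2%} available")
# ->    10 h/year down -> 99.89% available
# ->  19.5 h/year down -> 99.78% available
```

So the worst case in the report rounds to 99.8%, which sounds impressive until you remember it means a full working day or more of outage every year.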

My own experience is that Solaris is pretty damn reliable, much better than the figures quoted, at any rate. Windows servers themselves don't seem to be too bad (although they do seem vulnerable to major corruption events which, while rare, involve significant outages), though PC networks overall seem very fragile. Linux I've found to be less robust, with older versions simply wedging and hanging regularly (something that I believe has been dramatically improved). But I suspect a lot of Linux problems are due to people believing the myth that it's free and will run on any old piece of junk hardware, so they use junk hardware and don't manage it properly, with predictable consequences.

The other aspect of system reliability is applications and, quite frankly, application reliability is often simply not up to scratch.


Tatjana Heuser said...

Reading the article, it looks like their averages were given across all platforms. It's also not obvious to me whether they counted planned/scheduled downtime in their measure. It would be nice if they had published something more detailed; without that, it's of little use, imho.

I agree that application (and desktop!) reliability is hardly ever taken into consideration. There's no other explanation for the widespread use of personal, stateful desktop systems as the system console of mission-critical servers, tying the reliability of a highly redundant server as closely as possible to a completely non-redundant system.

If reliability were evaluated from the point of view of the "endpoint", people would swarm to SunRay. There's simply no other desktop solution.

Jaime Cardoso said...

First of all, I haven't read the actual paper but, reading the link Peter points to in the Yankee Group piece, the way the text is written leads me to believe this is a commissioned study.
With that said, I'm not sure the distinction between scheduled and unscheduled downtime is that important (I guess it all comes down to what you are measuring), since both imply a "problem" with the service (down, less performant, or whatever) and both have to be addressed.
Tatjana, concerning your point on SunRays, what can I say, it's obviously true.