Tuesday, November 17, 2020

Adventures in Server Rooms

I'm a fan of the cloud. Honest. Provided someone else is paying for it, that is, given how ridiculously expensive it is.

But running stuff in the cloud is a whole lot easier for me. Someone else fixing hardware and infrastructure is a big win. No more driving to the datacenter at 3 in the morning. No more wandering into a room that's baking at 70C because the AC failed. No crawling about under the raised floor trying to trace that cable. No having a disk array in pieces because vendor A's SCSI adapter doesn't quite work right with vendor B's SCSI disk drive.

That said, some of the things that happened while I was fixing hardware do stand out.

Like the time I was putting together one of these stupid cable management arms. Feeling very satisfied with myself, I tried to step back only to discover that I had managed to velcro and cable tie myself into the damn thing. Unable to actually move the one arm, or get round it with the other, I eventually wriggled out of the stuck sweater and spent another 10 minutes rescuing my clothing from the clutches of the evil server.

Or, one morning, when I came in to find an SGI Challenge dead as a dodo. This was the under-the-desk model, back in the day when every machine had its own VT320 or similar perched on the desk above or alongside. I speak to SGI, and they assure me that everything's fine, I just need to replace the fuse. I dutifully do so and crawl under the desk to turn it back on, with my boss (the benefits of rank!) standing as far away as possible. One huge flash and enormous bang were followed by another bang as my head slammed into the underside of the desk. Some sharp words with the SGI support line followed.

Before we had datacenters, we had server rooms. Before that, we had cupboards. As a temporary measure, the campus router (oh, the days when you could find a Cisco AGS+ anywhere) was in a cupboard. The air conditioning unit extracted moisture from the air into what was effectively a bucket on a scale. When the bucket was full, it was heavy enough to tip the scale, which powered off the AC unit - after all, you don't want puddles all over the floor. Depending on humidity, this wouldn't last overnight, let alone a weekend, resulting in extra trips across the county to empty the thing out of hours.

Servers are relatively benign beasts compared to the unreliable monstrosities responsible for datacenter power and cooling. In the early days at Hinxton, we had regular power cuts. The generator would decide to cut in, but it took several seconds to spin up to speed. It turned out that the thresholds on the generator for detecting low voltage had been set back in the days when mains voltage was 240V. Now it's 220V, and the official range is technically the same, but when all the local farms brought the cows in for milking there was enough of a drop in voltage that the generator thought a loss of power was imminent and kicked in unnecessarily.

Once, when it cut over, the bus bars stuck halfway, fused, and the generator shed had a pool of molten copper on the floor. It was some hours before that had cooled far enough for the electricians to safely enter the building.

When we got a new building, it was much bigger, so the generators they put in were sized for critical equipment only. Along comes the first big mains outage, and whoever had assigned the "critical" circuits had done so essentially at random. Specifically, our servers (which had UPS protection to tide them through blips long enough for the generators to come on stream) clearly weren't getting any juice. There was a lot of yelling, and we had to shut half the stuff down to buy time, before they connected up the right circuits.

Someone higher up decided we needed fire protection. So that was put in and wired up; the idea being that it kills all power and cooling. There's also a manual switch, placed by the door so you can hit the big red button as you're legging it out of there. Too close to the door, as it turns out: on the wall exactly where the door handle goes when you open the door.

They installed a status panel on the wall outside too. The electrician wiring this up never satisfactorily explained how he managed to cut all power, although he must have almost had a heart attack when a posse of system administrators came flying out of offices demanding to know what had happened.

Numeracy isn't known to be a builder's strong point. Common sense likewise. So the datacenter has an outside door, then a ramp up to an inner door. We believed in having a decent amount of room under the raised floor - enough to crawl under if necessary. So they put in the doorframe but hadn't allowed for the raised floor. When we went for the initial inspection we had to crawl through the door, it was that low.

Even the final version wasn't really high enough. The old Sun 42U racks are actually quite short - the power supply at the base eats into the dimensions, so the external height is just over 42U and there was really only 36U available for servers. When we ordered our first Dell 42U rack, we had to take all the servers out, half dismantle it, lay it down, and it took six of us to wangle it through the dogleg and the door (avoiding the aforementioned big red emergency cut off button!). After that, we went out and ordered custom 39U racks to be sure they would fit.

When you run a network connection to a campus, you use multiple routes - so we had one fibre running to Cambridge, another to London. Unfortunately, as it turned out, there's about 100 yards from the main building ingress to the main road, and someone had cheated and laid the two fibres next to each other. There was some building work one day, and it had been carefully explained where the hole should and shouldn't be dug. Mid-afternoon, Sun Net Manager (remember that?) went totally mental. We go to the window, and there's a large hole, a JCB, and the obligatory gaggle of men in hard hats staring into it. My boss says "They're not supposed to be digging there!" and, indeed, they had managed to dig straight through the cables. If it had been a clean cut it wouldn't have been so bad (there's always a little slack so they can splice the fibres in the event of a break), but they had managed to pull the fibres out completely, so they had to be blown back from London/Cambridge before we got back on the air.