Lessons learnt from serving Queen Victoria's Journals
Towards the end of last year I was asked about how easy it would be to launch a very public website. Most of what the company does is relatively highly specialised, low traffic, high value, for a very narrow and targeted audience.
We were essentially unfamiliar with sites that were wide open and potentially interesting to the whole world (or a large fraction thereof). And we knew that there would be widespread media coverage. So there were real concerns that whatever we built would buckle, turning into a PR disaster. (Everyone's heard of the census launch, I expect.)
Almost 6 months later, we launched Queen Victoria's Journals. And yes, it ended up both nationally and locally on the BBC, in the UK newspapers including The Guardian, The Independent and the Daily Mail, and overseas in Canada and India.
After a huge amount of work, the launch went without a hitch. Traffic levels were right where we expected, and the system handled the traffic exactly as predicted. What's also clear is that if we hadn't done all the preparation work, it would most likely have been a disaster.
We're using pretty standard components - Java, Apache, Tomcat, Solr - and as I've explained previously, we maintain our own software stack. This is all hosted on Solaris Zones, built our way - so we can trivially build a bunch more, clone and restore them.
There's no real tuning involved in the standard components. They'll cope just fine, provided you don't do anything spectacularly stupid with the applications or data that you're serving. I built an isolated test setup, cloned regularly from a development build, so that I could run capacity tests without my work being affected by or impacting on regular development.
The site doesn't have that many pages, so I started by simply testing each one - using wget or ab (apache bench). I needed a whole bunch of servers to send the requests from - easy, just build a bunch more zones. And this showed that we could serve hundreds of pages a second from each tomcat, apart from one page which was returning a page every few seconds. The problem page - it's the Illustrations page linked to from the main toolbar - was being created dynamically via multiple queries to the search back-end which were being rendered each time. The content never changes (until we update the product, at any rate) so this is really a static page. Replacing it with something static not only fixed that problem, but dramatically reduced the memory footprint of tomcat, as we were holding search references open in the user session and generating huge numbers of temporary objects each time it was rendered.
The server capacity and performance issues solved, we went back to looking at network utilization. That's harder to solve from an infrastructure point of view - while I can trivially deploy a whole bunch more zones in a minute or so, it takes months to get additional fibre put in the ground. And our initial estimates, which were based on the bandwidth characteristics of some our existing sites, indicated we could well get close to saturating our network.
Most of the testing for bandwidth was really simple - construct a sample test, run it, and count the bytes transferred by looking at the apache logs. Simply replaying the session allows you to see what effect a change has, and you can easily see which requests are most important to address.
We also took the precaution of having some of the site hosted elsewhere, thanks to our good friends at EveryCity. They're using a Solaris derivative so everything's incredibly simple and familiar, making setup a breeze.
Test. Identify worst offender. Fix. Repeat. Every time you go round the loop improves your chances of success.