The moment the Entrepreneur opened his mouth on prime-time national TV, spelled out the URL and waxed big on how exciting ‘his’ new website was, I knew I was in for a busy night. I’d designed and built it.
All at once, half a million people tried to log into the website. Although all my stress-testing paid off, I have to admit that the network locked up tight long before there was any danger of a database or website problem.
Soon afterwards, the Entrepreneur and the Big Boss were there in the autopsy meeting. We picked through all our systems in detail to see how they’d borne the unexpected strain. Mercifully, in view of the sour mood of the Big Boss, it turned out that the only thing we could have done better was buy a bigger pipe to and from the internet. We’d specified that ‘big pipe’ when designing the system. The Big Boss had then railed at the cost and so we’d subsequently compromised. I felt that my design decisions were vindicated. The Big Boss brooded for a while. Then he made the significant comment:
“What really ****** me off is the fact that, for ten minutes, we couldn’t take people’s money.”
At that point I stopped feeling smug. Had the internet connection been better, the system would have reached its limit and failed rather precipitously, and that wasn’t what he wanted. Then it occurred to me that what had gummed up the connection was all those images on the site, that had made it so impressive for the visitors. If there had been a way to automatically pare down the site to the bare essentials under stress… Hmm.
I began to consider disaster-recovery in the broadest sense – maintaining a service in spite of unusual or unexpected events. What he said makes a lot of sense: sacrifice whatever isn’t essential to keep the core service running when we approach the capacity limits. Maybe in IT we should borrow (or revive) the business concept of the ‘Skeleton service’, maintaining only the priority parts under stress, using a process that is well-prepared and carefully rehearsed. How might this work? Whatever the event we have to prepare for, it is all about understanding the priorities; knowing what one can dispense with when the going gets tough. In the event of database disaster, it’s much faster to deploy a skeletal system with only the essential data than to restore the entire system, though there would have to be a reconciliation process to update the revived database retrospectively, once the emergency was over.
It isn’t just the database that could be designed for resilience. One could prepare for unusually high traffic in a website by designing a system that degraded gradually to a ‘skeletal’ site, one that maintained the commercial essentials without fat images, JavaScript libraries and razzmatazz. This is all what the Big Boss scathingly called ‘a mere technicality’. It seems to me that what is needed first is a culture of application and database design which acknowledges that we live in a very imperfect world, and react accordingly when things go awry.