How far should we take the N+N redundancy craziness?
- by Brann
The industry standard when it comes to redundancy is quite high, to say the least. To illustrate my point, here is my current setup (I'm running a financial service).
Each server has a RAID array in case something goes wrong on one hard drive
... and in case something goes wrong on the server, it's mirrored by another identical spare server
... and both servers cannot go down at the same time, because I've got redundant power, redundant network connectivity, etc.
... and my hosting center itself has dual electricity connections to two different energy providers, redundant network connectivity, and redundant toilets in case the two security guards (sorry, four) need to use them at the same time
... and in case something goes wrong anyway (a nuclear strike? I can't think of anything else), I've got another identical hosting facility in another country with the exact same setup.
Cost of reputational damage if down = very high
Probability of a hardware failure with my setup: <<1%
Probability of a hardware failure with a less paranoid setup: <<1% as well
Probability of a software failure in our application code: 1% (if your software is never down because of bugs, I suggest you double-check that your reporting/monitoring system is not itself down. Even SQL Server, which is arguably developed and tested by clever people with a strong methodology, is sometimes down)
In other words, I feel like I could host a cheap laptop in my mother's flat, and the human/software problems would still be my biggest risk.
Of course, there are other things to take into consideration, such as:
scalability
data security
the clients' expectation that you meet the industry standard
But still, hosting two servers in two different data centers (with no extra spare servers and no doubled network equipment beyond what my hosting facility already provides) would give me the scalability and the physical security I need.
I feel like we're reaching a point where redundancy is just a communication tool. Honestly, what's the difference between 99.999% uptime and 99.9999% uptime when you know you'll be down 1% of the time because of software bugs?
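To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes that hardware and software outages are independent, that the service is up only when both are up (so the availabilities multiply), and that the 1% software figure means 1% of the time. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def combined_availability(*availabilities):
    """Availability of a chain where every component must be up (assumes independence)."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


software = 0.99  # assumed: down 1% of the time because of bugs

for hardware in (0.99999, 0.999999):  # five nines vs six nines
    total = combined_availability(hardware, software)
    hours = downtime_hours_per_year(total)
    print(f"hardware {hardware:.4%} -> overall {total:.4%}, about {hours:.1f} h down per year")
```

Under those assumptions, both setups end up with roughly 87.6 to 87.7 hours of downtime per year. The extra nine on the hardware side buys less than five minutes a year, which is lost in the noise of the software downtime.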
How far do you push your redundancy craziness?