by YetAnotherNick
2146 days ago
You wrote "one server" but describe the failure modes of having one data center. I think it is very uncommon, and hard, to plan for data-center-level failures. After all, Instagram and a hundred other sites went down when one AWS data center failed. I would be interested to know how, or whether, anyone's backend keeps working if a data center and its databases fail completely due to fire, earthquake, networking, etc. The second thing is having multiple machines for the server. In theory it might increase availability, but in practice I haven't seen random machine-level failures that occur purely by chance. I think almost all failure modes that exist are correlated between machines. E.g., suppose you have data loss on one machine: more likely than not you could blame it on the code, and the same thing would happen on the other machines.
|
Re: multiple servers. Power supplies fail, memory modules fail, CPUs fail, fans fail, storage drives fail. Sometimes those are correlated: the HP SSDs that failed when their power-on hours hit a limit (two separate models) are going to be pretty correlated if they were bought new, put into servers at around the same time, and then left on 24/7. Most of those failures aren't that correlated, though. Software failures are more likely to be correlated, of course.
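To put rough numbers on the correlated vs. uncorrelated point, here is a minimal Python sketch. The model is my own simplification, not something from the thread: each machine is down with probability `p`, and a fraction `beta` of that is a shared cause (bad deploy, same-batch SSDs) that takes every machine down at once. The function name and all the values are hypothetical.

```python
def all_down_probability(p: float, n: int, beta: float) -> float:
    """Chance that all n machines are down at the same time.

    Assumed model: a shared-cause failure (probability beta * p) takes
    out every machine; the remaining (1 - beta) * p is independent per
    machine, so all n must fail on their own for an outage.
    """
    common = beta * p
    independent = (1 - beta) * p
    return 1 - (1 - common) * (1 - independent ** n)


if __name__ == "__main__":
    p = 0.01  # hypothetical: each machine is down 1% of the time
    for beta in (0.0, 0.1, 1.0):
        for n in (1, 2, 3):
            prob = all_down_probability(p, n, beta)
            print(f"beta={beta:.1f} n={n} all-down probability={prob:.6f}")
```

With `beta = 0` a second machine cuts the outage probability from `p` to `p ** 2`; with `beta = 1` extra machines buy nothing. Hardware faults sit closer to the first case, bad deploys closer to the second.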
The key thing is to really think about what being down costs, how long it's acceptable or desirable to be down, and how much you're willing to spend to hit those goals.
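That trade-off is easy to sketch as arithmetic. The Python below is only an illustration: the function names, the dollar figures, and the "hours saved" estimate are all made-up assumptions, and a real decision would need your own numbers.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year implied by an availability target (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR


def redundancy_pays_off(cost_per_hour_down: float,
                        hours_saved_per_year: float,
                        yearly_spend: float) -> bool:
    """True if the downtime cost avoided exceeds what the redundancy costs."""
    return cost_per_hour_down * hours_saved_per_year > yearly_spend


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        hours = downtime_hours_per_year(target)
        print(f"{target:.2%} availability allows about {hours:.2f} h/year down")

    # Hypothetical numbers: $1,000 per hour down, 8 hours saved, $5,000 per year.
    print(redundancy_pays_off(1000, 8, 5000))
```

If an hour of downtime costs you very little, the second machine may simply not be worth it; if it costs a lot, the same math can justify a second data center.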