by josephlord 4863 days ago
For any given user, the probability of the one machine with everyone on it going bang is similar to the probability of failure of the particular server they would be connected to in a horizontally scaled scenario. However, the cost of redundancy may be higher if it means replicating 100% of the main system. On the other hand, a big system may be designed for high uptime.

Would the probability not be less in this case? In general, fewer moving parts = less chance of outage. E.g. if a device is rated for 300,000 hours MTBF and you have 2 of them, their individual MTBF remains the same, but your chance of experiencing an outage in either one has roughly doubled because you have 2 of them.
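
A rough sketch of that arithmetic, assuming failures are independent and exponentially distributed (the usual simplification behind an MTBF rating; the thread itself does not specify a failure model). The function name and the one-year window are illustrative choices, not anything from the device's datasheet.

```python
import math

def p_any_failure(hours, mtbf_hours, devices=1):
    """Probability that at least one of `devices` fails within `hours`.

    Assumes independent devices with exponentially distributed
    time-to-failure, so the combined failure rate is devices / MTBF.
    """
    return 1.0 - math.exp(-devices * hours / mtbf_hours)

MTBF_HOURS = 300_000   # rating quoted in the comment above
WINDOW_HOURS = 8_760   # one year, chosen for illustration

one_device = p_any_failure(WINDOW_HOURS, MTBF_HOURS)
two_devices = p_any_failure(WINDOW_HOURS, MTBF_HOURS, devices=2)

print(f"one device:  {one_device:.4f}")
print(f"two devices: {two_devices:.4f}")
print(f"ratio:       {two_devices / one_device:.3f}")
```

Under these assumptions the ratio comes out slightly under 2, because both devices can fail in the same window and that overlap is only counted once.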

It's more the impact side of the risk equation I'm thinking of than the probability.

EDIT: typo

Depends whether it is looked at from the ops point of view or the end user point of view. You expressed concern about 1 million customers simultaneously having a bad experience. For a given end user, if the hardware is equally reliable, the odds of something happening are the same whether they are sharing with 1 million or 100 thousand others (or even have the server to themselves). On the ops side there is more to go wrong and failures will be more frequent, but each one affects fewer end users.
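
To put numbers on the two points of view, here is a minimal sketch. It assumes every server has the same independent failure probability over some period and that users are spread evenly across servers; `outage_profile` and the 2% figure are made up for illustration.

```python
def outage_profile(users, servers, p_fail):
    """Compare the end-user and ops views of the same deployment.

    Assumes each server fails independently with probability `p_fail`
    over the period, and users are spread evenly across servers.
    """
    users_per_incident = users / servers
    expected_incidents = servers * p_fail
    return {
        "per_user_risk": p_fail,                    # end-user view: unchanged
        "expected_incidents": expected_incidents,   # ops view: grows with servers
        "users_per_incident": users_per_incident,   # impact of each failure
        "expected_users_affected": expected_incidents * users_per_incident,
    }

big_box = outage_profile(users=1_000_000, servers=1, p_fail=0.02)
scaled_out = outage_profile(users=1_000_000, servers=10, p_fail=0.02)

for name, profile in (("one big machine", big_box), ("ten servers", scaled_out)):
    print(name, profile)
```

With equally reliable hardware, the per-user risk and the expected number of affected users are the same in both setups. What changes is the shape: ten times as many incidents, each hitting a tenth of the users.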

The positive in the one big machine scenario is that you have the potential to make strong efforts to keep it reliable. The advantage in the lots of machines scenario is that there is a better chance you have well-tested failover solutions.

It is the combination of impact and risk that I am discussing.