by lukev
5650 days ago
Very interesting. As with all outages of major services, it seems to have started with a confluence of individually minor, unforeseen events. One question they didn't answer, though, is whether they're going to fix the core problem: that a positive feedback loop of overloading supernodes is possible at all. It seems to me that a p2p system should be able to recover from having 20% of its nodes taken offline, rather than spiraling into a full collapse. Avoiding the scenario where 20% of supernodes go offline in the first place is of course desirable, but since any number of things could cause that, a genuinely resilient system should remain functional (even in a degraded capacity) when only a small fraction of its nodes remains available.
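To make the feedback loop concrete, here is a toy sketch in Python. It is not a model of the actual system discussed: the node capacities, the total load, the even redistribution of load, and the `simulate` function are all assumptions made up for illustration. It only shows how losing 20% of nodes can cascade into total collapse when overloaded nodes crash, while the same fleet survives if each node simply refuses load beyond its capacity.

```python
def simulate(capacities, total_load, shed_excess):
    """Return how many nodes are still alive once the system settles.

    Toy assumptions: the load is spread evenly over the live nodes, and a
    node that receives more than its capacity crashes unless it is allowed
    to shed the excess.
    """
    alive = list(capacities)
    while alive:
        per_node = total_load / len(alive)
        if shed_excess:
            # Every node serves at most its capacity and rejects the rest,
            # so no further node is lost (service is degraded instead).
            return len(alive)
        survivors = [c for c in alive if c >= per_node]
        if len(survivors) == len(alive):
            return len(alive)  # stable: nobody is overloaded
        # Crashed nodes push their share onto the survivors; go around again.
        alive = survivors
    return 0


# Hypothetical fleet: 100 nodes with capacities 100..199, total load 9000.
fleet = [100 + i for i in range(100)]
# Take 20% of the nodes offline (every fifth node).
after_outage = [c for i, c in enumerate(fleet) if i % 5 != 0]

print(simulate(fleet, 9000, shed_excess=False))         # healthy fleet
print(simulate(after_outage, 9000, shed_excess=False))  # cascade
print(simulate(after_outage, 9000, shed_excess=True))   # degraded but alive
```

With these made-up numbers the healthy fleet keeps all 100 nodes, the unprotected fleet loses every node after the outage, and the load-shedding fleet keeps the 80 nodes that survived it.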
|
You can also degrade performance gracefully: reject new client connections, progressively disconnect some clients, accept a loss of consistency, and so on. It depends on how far you can go without infuriating your customers.
We have found that large-scale real-time systems (in our case, currently 400,000 concurrent connections) are really hard to stabilize against presence storms, network problems and buggy clients, among other things.
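As a rough illustration of the first two tactics (rejecting connections and progressively disconnecting clients), here is a minimal Python sketch. The `ConnectionGate` class, its soft and hard limits, and the priority scheme are hypothetical names invented for this example, not anything from our production system.

```python
class ConnectionGate:
    """Toy admission control with a soft and a hard connection limit."""

    def __init__(self, soft_limit, hard_limit):
        if not 0 < soft_limit <= hard_limit:
            raise ValueError("need 0 < soft_limit <= hard_limit")
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.clients = {}  # client_id -> priority, kept in connect order

    def admit(self, client_id, priority=0):
        """Return True if the client is accepted.

        Above the soft limit only clients with priority > 0 get in;
        at the hard limit everyone is rejected.
        """
        connected = len(self.clients)
        if connected >= self.hard_limit:
            return False
        if connected >= self.soft_limit and priority <= 0:
            return False
        self.clients[client_id] = priority
        return True

    def shed(self, fraction):
        """Disconnect roughly `fraction` of the clients, lowest priority first.

        Ties are broken by connect order because sorted() is stable.
        Returns the ids that were dropped.
        """
        count = int(len(self.clients) * fraction)
        victims = sorted(self.clients, key=self.clients.get)[:count]
        for client_id in victims:
            del self.clients[client_id]
        return victims


gate = ConnectionGate(soft_limit=2, hard_limit=3)
gate.admit("a")
gate.admit("b")
gate.admit("c")              # rejected: soft limit reached, priority 0
gate.admit("c", priority=1)  # accepted: priority traffic still gets in
dropped = gate.shed(0.5)     # progressively disconnect half (rounded down)
```

Calling `shed` with a small fraction in repeated steps, while watching load, is one way to find out how far you can go before customers notice; the right thresholds depend entirely on the service.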