by otterley
1609 days ago
Since the root cause was a pathological database software issue, Roblox would have suffered the same outage in the public cloud. (I am assuming for this analysis that their software stack would be identical.) Perhaps they would have been better off with a distributed database other than Consul (e.g., DynamoDB), but at their scale, that's not guaranteed, either. Different choices present different potential difficulties. Playing "what-if" thought experiments is fun, but when the rubber hits the road, you often find that things that are stable for 99.99%+ of load patterns run into previously unforeseen problems once you get to the far-right-hand side of the scale. And it's not like we've completely mastered squeezing performance out of huge CPU core counts on NUMA architectures while avoiding bottlenecks on critical sections in software. This shit is hard, man.
|
Everything is duplicated, which is potentially wasteful, but it ensures complete redundancy and works as an insurance policy. If you roll out a change, you roll it out to each datacenter separately. So in this case, rolling out to one complete datacenter and waiting a day on their Consul streaming changes probably would have caught it.
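To make the one-datacenter-at-a-time idea concrete, here's a rough sketch in Python. Everything in it is hypothetical: `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the one-day default soak are illustrative stand-ins, not anything Roblox or Consul actually provides.

```python
import time
from typing import Callable, List, Sequence


def staged_rollout(
    datacenters: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    soak_seconds: float = 24 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Deploy to one datacenter at a time; stop at the first unhealthy one.

    All names here are hypothetical. Returns the datacenters that were
    deployed to and then passed the health check after soaking.
    """
    completed: List[str] = []
    for dc in datacenters:
        deploy(dc)            # the change goes to this datacenter only
        sleep(soak_seconds)   # let real traffic exercise it (a day by default)
        if not is_healthy(dc):
            # Halt here: the remaining datacenters keep the old version.
            break
        completed.append(dc)
    return completed
```

The point of the design is that a bad change only ever reaches one datacenter before someone (or something) notices, and the untouched ones can carry the load in the meantime.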