| HN Mirror



	by otterley 1609 days ago
	Since the issue's root cause was a pathological database software issue, Roblox would have suffered the same issue in the public cloud. (I am assuming for this analysis that their software stack would be identical.) Perhaps they would have been better off with other distributed databases than Consul (e.g., DynamoDB), but at their scale, that's not guaranteed, either. Different choices present different potential difficulties. Playing "what-if" thought experiments is fun, but when the rubber hits the road, you often find that things that are stable for 99.99%+ of load patterns encounter previously unforeseen problems once you get into that far-right-hand side of the scale. And it's not like we've completely mastered squeezing performance out of huge CPU core counts on NUMA architectures while avoiding bottlenecking on critical sections in software. This shit is hard, man.


baskethead 1609 days ago

This is not true if they had handled the rollout properly. Companies like Uber have two entirely different data centers, and during outages they fail over to either datacenter.

Everything is duplicated, which is potentially wasteful, but it ensures complete redundancy; it's an insurance policy. If you roll out a change, you roll it out to each datacenter separately. So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.


Symbiote 1609 days ago

> So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.

But this has nothing to do with cloud vs. colo.


baskethead 1609 days ago

The parent poster said that it would have happened even if they had cloud, i.e., another datacenter. That's the assumption behind my comment.

As far as I can tell from reading, Roblox doesn't have multiple datacenters. I find that really hard to believe, so if that's not true, then my point would be incorrect. If it is true, then had they completely duplicated their datacenters, they could have switched one datacenter to streaming while keeping the other on the old setting until they validated that everything was fine. A slow rollout across datacenters like that would have caught the problem.
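A rough sketch of the datacenter-by-datacenter rollout described in this comment, for illustration only. Every name here (`Datacenter`, `staged_rollout`, `is_healthy`, `bake`) is hypothetical and not Roblox's or Consul's actual tooling; the health check and the bake period (e.g. "wait a day") are stand-ins supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Datacenter:
    """Hypothetical model of one datacenter and its config flag."""
    name: str
    streaming_enabled: bool = False  # False means the old setting


def staged_rollout(
    datacenters: List[Datacenter],
    is_healthy: Callable[[Datacenter], bool],
    bake: Callable[[Datacenter], None],
) -> List[str]:
    """Enable the new setting one datacenter at a time.

    After each switch, let the change bake under real load, then check
    health. On the first unhealthy result, revert that datacenter and
    stop, so later datacenters stay on the old setting.
    Returns the names of the datacenters left on the new setting.
    """
    enabled: List[str] = []
    for dc in datacenters:
        dc.streaming_enabled = True
        bake(dc)  # assumption: e.g. wait a day of production traffic
        if not is_healthy(dc):
            dc.streaming_enabled = False  # roll back this datacenter
            break
        enabled.append(dc.name)
    return enabled


if __name__ == "__main__":
    dcs = [Datacenter("dc-a"), Datacenter("dc-b")]
    # Pretend the new setting misbehaves under load in the first datacenter.
    done = staged_rollout(dcs, is_healthy=lambda dc: False, bake=lambda dc: None)
    print(done, [dc.streaming_enabled for dc in dcs])
```

The point of the design is the early `break`: a problem that only shows up under load is contained to one datacenter while the other keeps serving on the old setting.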


yuliyp 1609 days ago

Uber is also a service that has a much lower tolerance for downtime. If people can't play a game, they're sad. If they're trying to get a ride and it doesn't work, or drivers' apps suddenly stop working, the stranded people get very upset in a hurry, and the company loses a lot of customers.

It can be totally reasonable for Uber to pay for 2x the amount of infra they need for serving their products while not being worth it for a company like Roblox.


otterley 1609 days ago

The Consul streaming changes were rolled out months before the incident occurred.


baskethead 1609 days ago

You didn't read it properly. The changes were rolled out months before, but the switch to streaming based on that rollout was made 1 day before the incident. That was the root cause.
