by baskethead 1608 days ago
This is not true, if they handled the rollout properly. Companies like Uber have two entirely separate datacenters, and during outages they fail over to either datacenter.

Everything is duplicated, which is potentially wasteful, but it ensures complete redundancy and it’s an insurance policy. When you roll out, you roll out to each datacenter separately. So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.

3 comments

> So in this case rolling out in one complete datacenter and waiting a day for their Consul streaming changes probably would have caught it.

But this has nothing to do with cloud vs. colo.

The parent poster said that it would have happened even if they had cloud, i.e. another datacenter. That's the assumption behind my comment.

As far as I can tell from reading, Roblox doesn't have multiple datacenters. I find that really hard to believe, so if that's not true, then my point would be incorrect. If it is true, then if they completely duplicated their datacenters, they would be able to switch one datacenter to streaming while keeping the other datacenter on the old setting until they validated that everything was fine. A slow rollout across datacenters like that would have caught the problem.
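
To make the idea concrete, here is a rough sketch of a per-datacenter rollout with a soak period. Everything in it is hypothetical (the `Datacenter` class, `staged_rollout`, and the `soak` and `is_healthy` callbacks are made-up names for illustration); it is not how Roblox or Uber actually deploy, just the shape of the argument.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Datacenter:
    # Hypothetical model of one datacenter and its streaming setting.
    name: str
    streaming_enabled: bool = False


def staged_rollout(
    datacenters: List[Datacenter],
    soak: Callable[[Datacenter], None],
    is_healthy: Callable[[Datacenter], bool],
) -> bool:
    """Flip the setting in one datacenter at a time.

    After each flip, wait (soak) and check health. On the first
    unhealthy datacenter, revert it and stop, leaving the remaining
    datacenters on the old setting. Returns True if all were switched.
    """
    for dc in datacenters:
        dc.streaming_enabled = True
        soak(dc)  # e.g. wait a day under real traffic
        if not is_healthy(dc):
            dc.streaming_enabled = False  # roll back this datacenter only
            return False
    return True
```

The point of the design is that the second datacenter never sees the change until the first one has run on it for the full soak period, so a failure stays contained to one site while the other keeps serving traffic.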

Uber is also a service that has a much lower tolerance for downtime: If people can't play a game, they're sad. If they're trying to get a ride and it doesn't work, or drivers' apps suddenly stop working, the stranded people get very upset in a hurry, and the company loses a lot of customers.

It can be totally reasonable for Uber to pay for 2x the amount of infra they need for serving their products while not being worth it for a company like Roblox.

The Consul streaming changes were rolled out months before the incident occurred.
You didn't read it properly. The changes were rolled out months before, but the switch to streaming based on that rollout was made 1 day before the incident. That was the root cause.