by exitheone 932 days ago
That's still ridiculously slow. I'd expect them to have hundreds of microservices. Each of those should be able to handle a random restart at any point in time, so they should absolutely be able to restart hundreds of servers concurrently without major disruptions. Hell, at Facebook scale a whole datacenter going down should not cause service disruptions.
1 comments

This does assume that nothing is getting broken along the way.

Taking 45 days is probably more about caution and resolving issues systematically than about pushing a big button and hoping you don’t cause issues.

I’d expect them to have thousands of microservices - and you only have to find a way to break one to cause big issues.

Regular random crashes should be exercised at Facebook scale regardless. Not being resilient to them would be very unprofessional.