
by londons_explore 1710 days ago

To me, this shows Google doesn't have sufficient monitoring in place to know the scale of a problem and the correct scale of response.

For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, perhaps taking 15 minutes (on top of diagnosis and response times).

Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.

Obviously the latter is a really large load on all surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback).

4 comments

nine_k 1710 days ago

The 15 seconds figure may be very wishful thinking. Often a service startup is a short burst of severe resource consumption. Doing it with 100% of the fleet at once may stall everything in an uncontrollable overloaded state.


codeflo 1710 days ago

Is infrastructure at this scale typically unable to do a cold start? I can believe that this is very difficult to design for, but being unable to do it sounds dangerous to me.

(Edit for the downvoters: I was genuinely curious how these kinds of things work at Google’s scale. Asking stupid questions is sometimes necessary for learning.)


cranekam 1710 days ago

I guess it depends what "infrastructure" means.

If you mean "all of Google" then a cold restart would probably be very hard. At Facebook a cold restart/network cutoff of a datacenter region (a test we did periodically) took considerable planning. There is a lot to coordinate — many components and teams involved, lots of capacity planning, and so on. Over time this process got faster but it is still far from just pulling out the power cord and plugging it in again.

If you mean a single backend component then cold starting it may or may not be easy. Easy if it's a stateless service that's not in the critical path. But it seems this GCP outage was in the load balancing layer and likely harder to handle. A parent comment suggested this could be restarted in 15s, which is probably far from the truth. If it takes 5s to get an individual node restarted and serving traffic, finishing the whole fleet in 15s means taking down a third of capacity at a time, almost certainly overloading the rest.
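To make that arithmetic concrete, here is a rough sketch. The function name and the model are my own assumptions, not anything from GCP or FB: it assumes restarts happen in back-to-back batches and that a node serves no traffic for its whole restart time.

```python
import math


def offline_fraction(fleet_window_s: float, node_restart_s: float) -> float:
    """Fraction of the fleet that must be down at once to restart every
    node within fleet_window_s.

    Hypothetical model: batches run strictly one after another, and each
    node is out of service for node_restart_s seconds.
    """
    if fleet_window_s <= 0 or node_restart_s <= 0:
        raise ValueError("durations must be positive")
    # How many sequential batches fit into the allowed window.
    batches = math.floor(fleet_window_s / node_restart_s)
    if batches < 1:
        # Not even one full restart fits: everything goes down together.
        return 1.0
    return 1.0 / batches


# 15s window, 5s per node: three batches, so a third of capacity is down.
fifteen_seconds = offline_fraction(15, 5)
# 15 minute rolling restart, 5s per node: 180 batches.
fifteen_minutes = offline_fraction(15 * 60, 5)
```

Under those assumptions the 15s window takes out 1/3 of capacity at a time, while a 15 minute rolling restart takes out 1/180. Real fleets also pay for connection draining and warm-up, which this ignores.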

In some cases the component may also have state that needs to be kept or refilled. Again, at FB, cold starting the cache systems was a fairly tricky process. Just turning them off and on again would leave cold caches and overload all the systems behind them.

Lastly, needing to be able to quickly cold restart something is probably a design smell. In the case of this GCP outage, rather than building infra that can handle all the load balancers restarting in 15s, it would probably be easier and safer to add the capability of keeping the last known good configuration in memory and exposing a mechanism to roll back to it quickly. This wouldn't avoid needing to restart for code bugs in the service, but it would provide some safety from configuration-related issues.
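A minimal sketch of what "keep the last known good configuration in memory" could look like. This is purely illustrative: the class and method names are hypothetical, and I have no idea how GCP's load balancers actually hold their config.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class ConfigHolder(Generic[T]):
    """Holds the active config plus the last one confirmed healthy.

    Hypothetical design: a new config only becomes the rollback target
    after something external (health checks, canary) confirms it.
    """

    def __init__(self, initial: T) -> None:
        self._active = initial
        self._last_known_good = initial

    @property
    def active(self) -> T:
        return self._active

    def apply(self, candidate: T) -> None:
        # Activate the new config, but keep the old rollback target.
        self._active = candidate

    def confirm_healthy(self) -> None:
        # Promote the active config once it has proven itself.
        self._last_known_good = self._active

    def rollback(self) -> T:
        # Restore the last confirmed config without restarting the process.
        self._active = self._last_known_good
        return self._active


holder = ConfigHolder({"backends": 3})
holder.apply({"backends": 0})   # bad push, never confirmed
restored = holder.rollback()    # back to {"backends": 3}
```

The point of the separate `confirm_healthy` step is that two bad pushes in a row don't overwrite the rollback target. It does nothing for code bugs, only for config-related issues.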


mschuster91 1710 days ago

> Lastly, needing to be able to quickly cold restart something is probably a design smell.

For everyone not at a scale to afford their own transoceanic fiber cables, a major internet service disruption is equivalent to a cold start. And as long as hackers or governments are able to push utter bullshit to the global BGP tables with a single mouse click, this threat remains present.


cranekam 1708 days ago

The comment I was replying to mentions "at [Google] scale", so my answer was with that in mind.


sofixa 1710 days ago

When Amazon S3 in us-east-1 failed a few years ago, the reason for the long outage (6 hours? 8 hours? I don't recall) was that they needed to restart the metadata service, and it took a long time for it to come back with the mind-boggling amount of data on S3. Cold starts are hard to plan for precisely at this type of scale.


allset_ 1710 days ago

It can be done. It takes a heck of a lot longer than 15s though.


piyh 1710 days ago

Everyone flushing the toilet at the same time to clean the pipes


sdenton4 1710 days ago

'Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.'

Take some time to consider what a restart means, across many data centers on machines which have no memory of the world before the start of their present job...


detaro 1710 days ago

> rollback to the last known good configuration

could very much be the "fast" option. A 15s restart, or anything close to it, across the entire fleet sounds quite unlikely.


foota 1710 days ago

15 second rollbacks don't exist at scale.
