HN Mirror


by londons_explore 2033 days ago

One requirement on my "production ready" checklist is that any catastrophic system failure can be resolved by starting a completely new instance of the service, and that the new instance is ready to serve traffic within 10 minutes.

That should be tested at least quarterly (but preferably automatically with every build).

If Amazon had done that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took...

3 comments

WookieRushing 2033 days ago

This only works for stateless services. If you’ve got frontends that take longer than 10 mins to serve traffic then you have a problem.

But if you’re running a DB or a storage system, 10 mins is a blink of an eye. Storage systems in particular can run a few hundred TB per node and moving that data to another node can take over an hour.

In this case, the frontends have a shard map, which is definitely not stateless. This is typically okay if you have a fast load operation which blocks other traffic until the shard map is fully loaded.
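A rough sketch of that "block traffic until the shard map is loaded" pattern, to make it concrete. This is not how Kinesis does it; the `Frontend` class, its methods, and the CRC-based routing are all made up for illustration.

```python
import threading
import zlib


class Frontend:
    """Hypothetical frontend that refuses to route until its shard map is loaded."""

    def __init__(self):
        self._backends = []               # shard index -> backend name
        self._ready = threading.Event()   # set once the map is fully loaded

    def load_shard_map(self, backends):
        # Build the complete map first, then publish it in one step,
        # so requests never observe a half-loaded map.
        self._backends = list(backends)
        self._ready.set()

    def route(self, key, timeout=None):
        # Block the request until the shard map is available.
        if not self._ready.wait(timeout):
            raise TimeoutError("shard map not loaded yet")
        shard = zlib.crc32(key.encode("utf-8")) % len(self._backends)
        return self._backends[shard]
```

Requests that arrive early either wait or time out; nothing gets routed against a partial map.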


londons_explore 2033 days ago

It's possible (albeit much harder) for stateful services too.

It basically boils down to "We must be able to restore the minimum necessary parts of a full backup in under 10 minutes".

Take Wikipedia as an example. I'd expect them to be able to restore a backup of the latest version of all pages in 10 minutes. It's 20GB of data, and I assume it's sharded at least 10 ways. That means each instance will have to grab 2GB from the backups. Very doable.
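Back-of-envelope for those numbers, using the 20GB and 10-shard figures assumed above (decimal units, and ignoring any time spent decompressing or indexing after the download). The function name is just for illustration.

```python
def required_throughput_mb_s(total_gb, shards, deadline_s):
    """Sustained MB/s each instance needs to fetch its shard before the deadline."""
    per_shard_mb = total_gb * 1000 / shards  # decimal units: 1 GB = 1000 MB
    return per_shard_mb / deadline_s


rate = required_throughput_mb_s(total_gb=20, shards=10, deadline_s=10 * 60)
print(f"{rate:.2f} MB/s per instance")  # prints "3.33 MB/s per instance"
```

A few MB/s per instance is well within reach of ordinary network links.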

As a service gets bigger, you typically scale horizontally, so the problem doesn't get harder.

Restoring all the old page versions and re-enabling editing might take longer, but that's less critical functionality.


why-el 2033 days ago

The same OS limits would apply to new instances, unless they knew the root cause and forced new instances to be configured with larger descriptor limits, which is... well, hindsight is 20/20, no?
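For what it's worth, a process can at least inspect the limit it inherited at startup. A minimal sketch using Python's standard `resource` module (Unix only); the `has_fd_headroom` helper and the threshold you pass it are hypothetical, and a check like this only helps if you already know which limit matters.

```python
import resource


def has_fd_headroom(needed):
    """Return True if the soft open-file limit allows at least `needed` descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True  # no soft limit configured
    return soft >= needed


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
```

A fresh instance started from the same image would report the same numbers, which is the point.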


WJW 2033 days ago

Kinesis probably runs well over 100k instances. Restarting it might not be so trivial that you can do it in 10 minutes.
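To put a number on it: a toy calculation of how many instances would have to restart concurrently to hit the deadline. The 100k figure is the guess above, and the 2 minutes per restart wave is purely an assumption for illustration.

```python
import math


def min_batch_size(instances, deadline_s, restart_s):
    """Smallest batch size that restarts every instance before the deadline,
    assuming batches run one after another and each takes restart_s seconds."""
    waves = deadline_s // restart_s  # how many sequential batches fit in the deadline
    if waves < 1:
        raise ValueError("a single restart already exceeds the deadline")
    return math.ceil(instances / waves)


print(min_batch_size(instances=100_000, deadline_s=600, restart_s=120))  # prints 20000
```

Taking a fifth of the fleet down at once is a very different operation from a cautious rolling restart.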
