|
|
|
|
|
by londons_explore
2033 days ago
|
|
One requirement on my "production ready" checklist is that any catastrophic system failure can be resolved by starting a completely new instance of the service, and it be ready to serve traffic inside 10 minutes. That should be tested at least quarterly (but preferably automatically with every build). If Amazon did that, this outage would have been reduced to 10 mins, rather than the 12+ hours that some super slow rolling restarts took... |
|
But if you’re running a DB or a storage system, 10 mins is a blink of an eye. Storage systems in particular can run a few hundred TB per node and moving that data to another node can take over an hour.
In this case, the frontends have a shard map which is definitely not stateless. This is typically okay if you have a fast load operation which blocks other traffic until shard map is fully loaded