by WookieRushing
2032 days ago
This only works for stateless services. If you've got frontends that take longer than 10 mins to start serving traffic, then you have a problem. But if you're running a DB or a storage system, 10 mins is the blink of an eye. Storage systems in particular can hold a few hundred TB per node, and moving that data to another node can take over an hour. In this case, the frontends have a shard map, which is definitely not stateless. This is typically okay if you have a fast load operation that blocks other traffic until the shard map is fully loaded.
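A rough sketch of that "block traffic until the shard map is loaded" idea, in Python. Everything here is hypothetical and for illustration only: the `Frontend` class, the `load_shard_map` callable, and the CRC32-modulo routing are assumptions, not how any particular system does it.

```python
import threading
import zlib


class Frontend:
    """Hypothetical frontend that refuses to route until its shard map is loaded."""

    def __init__(self, load_shard_map):
        # load_shard_map is an assumed callable returning {shard_id: node_name}.
        self._load_shard_map = load_shard_map
        self._ready = threading.Event()
        self._shard_map = {}

    def start(self):
        # The load must be fast: every request waits on it.
        self._shard_map = self._load_shard_map()
        self._ready.set()

    def route(self, key, timeout=None):
        # Block (up to timeout seconds) until the shard map is fully loaded.
        if not self._ready.wait(timeout):
            raise TimeoutError("shard map not loaded yet")
        shard_id = zlib.crc32(key.encode("utf-8")) % len(self._shard_map)
        return self._shard_map[shard_id]


frontend = Frontend(lambda: {0: "node-a", 1: "node-b"})
frontend.start()
node = frontend.route("some-page")
```

The point is just that the state (the shard map) is small and quick to reload, so the frontend can still come up well inside the 10 minute window even though it isn't strictly stateless.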
|
It basically boils down to "We must be able to restore the minimum necessary parts of a full backup in under 10 minutes".
Take Wikipedia as an example. I'd expect them to be able to restore a backup of the latest version of all pages in 10 minutes. It's 20GB of data, and I assume it's sharded at least 10 ways. That means each instance will have to grab 2GB from the backups. Very doable.
As a service gets bigger, you typically scale horizontally, so the amount each instance has to restore stays about the same and the problem doesn't get harder.
Restoring all the old page versions and re-enabling editing might take longer, but that's less critical functionality.
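To put numbers on that, here's a back-of-the-envelope sketch in Python. The 20GB, 10 shards, and 10 minute figures are the ones from above; the function names are made up for illustration, and it assumes decimal units (1GB = 1000MB) and perfectly even sharding.

```python
def per_shard_gb(total_gb: float, shards: int) -> float:
    """Data each instance has to pull from backup, assuming even sharding."""
    return total_gb / shards


def required_mb_per_s(total_gb: float, shards: int, window_s: float = 600.0) -> float:
    """Sustained per-instance throughput needed to finish within the window."""
    return per_shard_gb(total_gb, shards) * 1000 / window_s


# 20GB over 10 shards, restored in 10 minutes (600 seconds).
print(per_shard_gb(20, 10))
print(round(required_mb_per_s(20, 10), 2))

# Scaling horizontally: 10x the data on 10x the shards needs the same rate.
print(round(required_mb_per_s(200, 100), 2))
```

So each instance needs to sustain only a few MB/s, and if the shard count grows with the data, that per-instance number doesn't move.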