HN Mirror


by james_cowling 3632 days ago

Yup, we do this, and yes, the verifier has been broken before (during early development, not at production scale).

We do a number of things here. One is taking down enough storage nodes in a single region to make blocks appear "missing" there, which forces an automatic failover to the other region (this is transparent to users, apart from slightly higher latency). We also run more direct/risky checks in our Staging cluster (we never mess with data in our main production cluster).

In reality, a large system like this regularly encounters timeouts or failures of sub-components, which are masked by our multi-region redundancy but show up as spikes in the verifiers. These remind us that everything is working, in between more explicit DRT (Disaster Recovery Training) tests.
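
For readers who want something concrete, here is a toy sketch of the idea described above: take down the nodes holding a block in one region, watch the read fail over to the other region, and see the miss show up in a verifier-style counter. Everything in it (the `Region` class, `read_block`, the `misses` counter, the region and node names) is hypothetical and made up for illustration. It is not the actual system's code or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Region:
    """Hypothetical region: storage nodes, each holding a set of block ids."""
    name: str
    nodes: Dict[str, Set[str]]                    # node id -> block ids on it
    down: Set[str] = field(default_factory=set)   # nodes currently taken down

    def has_block(self, block_id: str) -> bool:
        # A block is visible only if at least one live node holds it.
        return any(
            block_id in blocks
            for node, blocks in self.nodes.items()
            if node not in self.down
        )


def read_block(block_id: str, primary: Region, secondary: Region,
               misses: Dict[str, int]) -> str:
    """Return the name of the region that served the read.

    Tries the primary first; on a miss, records it (the "spike" a
    verifier would see) and fails over to the secondary.
    """
    if primary.has_block(block_id):
        return primary.name
    misses[primary.name] = misses.get(primary.name, 0) + 1
    if secondary.has_block(block_id):
        return secondary.name
    raise LookupError(f"block {block_id} missing in both regions")


east = Region("east", {"e1": {"b1"}, "e2": {"b1"}, "e3": {"b2"}})
west = Region("west", {"w1": {"b1", "b2"}})
misses: Dict[str, int] = {}

served_before = read_block("b1", east, west, misses)

# Failure injection: take down every east node that holds b1.
east.down.update({"e1", "e2"})
served_after = read_block("b1", east, west, misses)

# b2 lives on a node that is still up, so it is unaffected.
served_b2 = read_block("b2", east, west, misses)
```

In this toy version the caller still gets the block after the injected failure, just from the other region, while the counter records that the primary missed. A counter that never moved during such a test would suggest the check itself is broken.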

1 comment

sitkack 3631 days ago

ABF, Always Be Failing (just a little bit).
