Hacker News
by bob1029 904 days ago
For our product, I'd consider an acceptable RPO (recovery point objective) for our most stringent customer to be no more than 5 seconds of committed data loss under a catastrophic scenario (i.e. primary log writer wiped off the face of the planet instantly). The reason I say 5 seconds and not 0 seconds is that performance is also a feature, and we don't want to shoot our other foot off if blocking/synchronous log replication happens to get stuck for a few minutes. A region-ending event is orders of magnitude rarer than a fiber cut, so an argument can be made for async replication.

There are other bank systems that do utilize hard, synchronous replication semantics (hopefully for obvious reasons), but they can afford to sacrifice performance and availability during business hours (more than we can). We always defer to the records in these systems.

We can tolerate a few bits of work getting out of sync and needing to be recovered in the back office. Mopping up a little bit of extra trash after the apocalypse is acceptable to all involved. We cannot tolerate all users being blocked for more than a few minutes, nor can we tolerate aggregate system state being out of sync for longer than that. Anything beyond this and we are at risk of having to wipe all work and disrupt end customer activity to keep ops and support teams from crashing.
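
To make the 5-second figure concrete, here is a minimal sketch of how an async setup might track its exposure against that RPO. It is an illustration only, not the product's actual implementation: `LagTracker`, its method names, and the sequence-number scheme are all hypothetical. The idea is that the primary commits without waiting for the replica, and the data at risk at any moment is everything committed since the oldest entry the replica has not yet acknowledged.

```python
from collections import deque


class LagTracker:
    """Hypothetical helper: tracks commits not yet acknowledged by a replica."""

    def __init__(self, rpo_seconds=5.0):
        # 5 seconds is the RPO quoted in the comment above.
        self.rpo_seconds = rpo_seconds
        # (sequence number, commit time) pairs, oldest first.
        self._unacked = deque()

    def on_commit(self, seq, now):
        # The primary commits locally and returns without waiting (async).
        self._unacked.append((seq, now))

    def on_replica_ack(self, acked_seq):
        # The replica confirms it has durably stored everything up to acked_seq.
        while self._unacked and self._unacked[0][0] <= acked_seq:
            self._unacked.popleft()

    def exposure(self, now):
        # Seconds of committed work that would be lost if the primary
        # disappeared at time `now`.
        if not self._unacked:
            return 0.0
        return now - self._unacked[0][1]

    def rpo_breached(self, now):
        return self.exposure(now) > self.rpo_seconds


tracker = LagTracker()
tracker.on_commit(1, now=100.0)
tracker.on_commit(2, now=102.0)
print(tracker.exposure(now=103.0))      # commit 1 is still unacknowledged
tracker.on_replica_ack(1)
print(tracker.rpo_breached(now=108.0))  # commit 2 has waited 6 s by now
```

A synchronous design keeps this exposure at zero by blocking each commit until the acknowledgement arrives, which is exactly the stall risk described above. The async version never blocks, so a breach of the threshold is something to alert on rather than something that stops users from working.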

2 comments

Thanks a lot for sharing, bob1029. Nice explanation and justifications, much appreciated!
It's good to hear from someone who understands the careful balance to strike when data availability is actually the most important factor.

I'm in a very similar situation in my industry. Thanks for your words.