by Scorpiion 892 days ago
Hi bob1029, very interesting to hear about your experiences and point of view. It's easy to forget how much your choice of database can affect how easy it is to sell an end product/service (it does not always matter, of course). I have thought about this as well: no one (?) will question it if your app/service uses postgres, but if you say sqlite I assume there will be some more questions.

Would you mind sharing some numbers for what you consider an "aggressive RPO", and what RPO sqlite+litestream can still handle? Of course it will depend on the use case, but even a rough estimate could be helpful for a lot of readers.


For our product, I'd consider an acceptable RPO for our most stringent customer to be no more than 5 seconds of committed data loss under a catastrophic scenario (i.e. primary log writer wiped off the face of the planet instantly). The reason I say 5 seconds and not 0 seconds is that performance is also a feature, and we don't want to shoot our other foot off if blocking/synchronous log replication happens to get stuck for a few minutes. A region-ending event is orders of magnitude rarer than a fiber cut, so an argument can be made for async replication.

There are other bank systems that do utilize hard, synchronous replication semantics (hopefully for obvious reasons), but they can afford to sacrifice performance and availability during business hours (more than we can). We always defer to the records in these systems.

We can tolerate a few bits of work getting out of sync and needing to be recovered in the back office. Mopping up a little bit of extra trash after the apocalypse is acceptable to all involved. We cannot tolerate all users being blocked for more than a few minutes, nor can we tolerate a de-sync of aggregate system state exceeding the same. Anything beyond this and we are at risk of having to wipe all work and disrupt end customer activity to keep ops and support teams from crashing.
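
To make the thresholds above concrete, here is a rough Python sketch of how the two limits could be checked against replication lag. This is only an illustration, not how our system or litestream actually works: the names `ReplicaStatus`, `exposure_seconds` and `assess` are made up, the 5-second figure is the RPO mentioned above, and treating "a few minutes" as 180 seconds is purely an assumption.

```python
from dataclasses import dataclass

# The 5-second RPO is taken from the comment above.
RPO_SECONDS = 5.0
# Assumption: "a few minutes" is modelled as 180 seconds for illustration.
MAX_DESYNC_SECONDS = 180.0


@dataclass
class ReplicaStatus:
    # Epoch seconds of the newest commit on the primary (hypothetical field).
    last_primary_commit: float
    # Epoch seconds of the newest commit confirmed on the replica (hypothetical field).
    last_replicated_commit: float


def exposure_seconds(status: ReplicaStatus) -> float:
    """Committed work that would be lost if the primary vanished right now."""
    return max(0.0, status.last_primary_commit - status.last_replicated_commit)


def assess(status: ReplicaStatus) -> str:
    """Classify the current async replication lag against the two limits."""
    lag = exposure_seconds(status)
    if lag <= RPO_SECONDS:
        return "ok"                # within the agreed RPO
    if lag <= MAX_DESYNC_SECONDS:
        return "rpo_exceeded"      # recoverable in the back office
    return "desync_critical"       # beyond what ops/support can mop up
```

With async replication the primary never blocks on the replica, so the only thing to watch is how far behind the replica has fallen; a check like this would run on a timer and raise an alert as the lag moves from one band to the next.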

Thanks a lot for sharing bob1029, nice explanation and justifications, much appreciated!
Someone who understands the careful balance to strike, when data availability is actually the most important factor.

I'm in a very similar situation in my industry. Thanks for your words.