Our backup strategy is to periodically snapshot the entire host volume via relevant hypervisor tools. We have negotiated RPOs with all of our customers that allow for a small amount of data loss intraday (I.e. w/ 15 minute snapshot intervals, we might lose up to 15 minutes of live business state). There are other mitigating business processes we have put into place which bridge enough of this gap for it to be tolerable for all of our customers.
In the industry we work in, as long as your RTO/RPO is superior to the system of record you interface with, you are never the sore thumb sticking out of the tech pile.
In our 6-7 years of operating in this manner, we still have not had to restore a single environment from snapshot. We have tested it several times though.
You will probably find that VM snapshot+restore is a ridiculously easy and reliable way to provide backups if you put all of your eggs into one basket.
>> You will probably find that VM snapshot+restore is a ridiculously easy and reliable way to provide backups if you put all of your eggs into one basket.
Yep, this is something we rely on whenever we perform risky upgrades or migrations. Just snapshot the entire thing and restore it if something goes wrong, and it's both fast and virtually risk-free.
I’m not the OP but I’m the author of an open source tool called Litestream[1] that does streaming replication of SQLite databases to AWS S3. I’ve found it to be a good, cheap way of keeping your data safe.
I am definitely interested in a streaming backup solution. Right now, our application state is scattered across many independent SQLite databases and files.
We would probably have to look at a rewrite under a unified database schema to leverage something like this (at least for the business state we care about). Streaming replication implies serialization of total business state in my head, and this has some implications for performance.
Also, for us, backup to the cloud is a complete non-starter. We would have to have our customers set up a second machine within the same network (not necessarily same building) to receive these backups due to the sensitive nature of the data.
What I really want to do is keep all the same services & schemas we have today, but build another layer on top so that we can have business services directly aware of replication concerns. For instance, I might want to block on some targeted replication activity rather than let it complete asynchronously. Then, instead of a primary/backup, we can just have 4-5 application nodes operating as a cluster with some sort of scheme copying important entities between nodes as required. We already moved to GUIDs for a lot of identity due to configuration import/export problems, so that problem is solved already. There are very few areas of our application that actually require consensus (if we had multiple participants in the same environment), so this is a compelling path to explore.
You can stream back ups of multiple database files with Litestream. Right now you have to explicitly name them in the Litestream configuration file but in the future it will support using a glob or file pattern to pick up multiple files automatically.
As for cloud backup, that's just one replica type. It's usually the most common so I just state that. Litestream also supports file-based backups so you could do a streaming backup to an NFS mount instead. There's an HTTP replica type coming in v0.4.0 that's mainly for live read replication (e.g. distribute your query load out to multiple servers) but it could also be used as a backup method.
As for synchronous replication, that's something that's on the roadmap but I don't have an exact timeline. It'll probably be v0.5.0. The idea is that you can wait to confirm that data is replicated before returning a confirmation to the client.
We have a Slack[1] as well as a bunch of docs on the site[2] and an active GitHub project page. I do office hours[3] every Friday too if you want to chat over zoom.
I really like what I am seeing so far. What is the rundown on how synchronous replication would be realized? Feels like I would have to add something to my application for this to work, unless we are talking about modified versions of SQLite or some other process hooking approach.
Litestream maintains a WAL position so it would need to expose the current local WAL position & the highest replicated WAL position via some kind of shared memory—probably just a file similar to SQLite's "-shm" file. The application can check the current position when a transaction starts and then it can block until the transaction has been replicated. That's the basic idea from a high level.
In the industry we work in, as long as your RTO/RPO is superior to the system of record you interface with, you are never the sore thumb sticking out of the tech pile.
In our 6-7 years of operating in this manner, we still have not had to restore a single environment from snapshot. We have tested it several times though.
You will probably find that VM snapshot+restore is a ridiculously easy and reliable way to provide backups if you put all of your eggs into one basket.