Hacker News new | ask | show | jobs
by hruk 805 days ago
Yes, we use it to back several services in production at an 8 figure ARR business, with Litestream as our streaming backup option. I'm quite happy with it because it's trivial to debug and test against a local copy of the production data if necessary, but there are a few things to think about.

* You need to have your code repopulate index statistics with ANALYZE or PRAGMA OPTIMIZE now and again, or you may get a confused query planner.

* It's best if your code defines meaningful types to scan SQLite values into, and you ideally avoid ever writing to the DB via the CLI. Alternatively, be aggressive in writing CHECK constraints and triggers to ensure data validity.

* You need to be working with a team of developers that are pretty comfortable with databases to get a lot of value out of SQLite. I've been surprised at how performant features can be if a single mind is designing the schema, constraints, triggers, and queries, examining the query plan to choose indexes, and writing the application code that accesses the database. SQLite minimizes the barrier to doing this with real snapshots of production data - but that still doesn't mean it's going to be easy.

* You need to think about how long your code holds a write transaction open. Individual statements in SQLite run very quickly (I've seen ~250 microseconds per insert on a EBS-backed EC2 instance), but if you have a Django app with @transaction.atomic everywhere, you're going to run into lock contention quite quickly because your Python functions take much longer to run. If possible, "organize" all the data for your writes outside the transaction and then hold the lock for as short as possible.

* Any transaction that may eventually write must start with BEGIN IMMEDIATE, otherwise SQLite may throw an error to keep it's promise of serialized isolation on writes.

All said, I'd do it again for any other services-based backend. It also incidentally enforces the good hygiene of "services can only access their own data." Happy to answer any questions you have.

4 comments

To expand on the BEGIN IMMEDIATE part in the context of Django (since this is/was a pet peeve of mine), they recently added the option to do so after keeping the bug open for years: https://code.djangoproject.com/ticket/29280
Cool, I am using Django on some projects so I'll check this out.
This is good insight.

I've been experimenting with the newer features to get concurrent writes working but I don't like that I have to have more than one database file, means I end up having to move lots of files around - might as well use MySQL and.. move files around.

https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...

Use the best tool for the job, most of the time is the best thing to do :)

I never dabbled in multiple files, but I can see how that will be beneficial. Especially if you have a service that have completely isolated instances - like let's say per organization basis.

How do you manage cron jobs with a setup like that? Does each lambda pull down its own SQLite db, write changes, and then litestream replicates the changes to the web and other lambda instances?

Similar scenario for multiple web nodes and saving data. Also, do you use sticky sessions so that any routes that write to a db also read from the same node/db so you don’t have to wait for litestream?

What do you do for BI? Are you able to ETL the data from sqlite to a warehouse? If so, what does that look like?

Interesting question. Our services currently contain "own" all of cron jobs, so the jobs run as background tasks on the host containing the db. I think we used Huey for the Django app, and just some CLI applications fired via systemd timers for the Go services.

Every service is running on a single (somewhat meaty) host, so we get true snapshot isolation on reads and serialized isolation on writes without doing anything extra. We run weekly tests to determine how long AWS takes to spin everything back up upon catastrophic failure, and it's between 6 and 10 minutes, depending on how much data the service has. It does bother me that this can only get longer, but we've never actually had a host crash in production.

Like the other commentator, we did our BI for a while by restoring the Litestream backup to another server. Then that started to get expensive (because the file was very large), so we just added a scraping endpoint to each table. I think there's probably a more elegant way of doing this.

Sounds complicated. What object storage you use? If on aws, 's3 sync' might help.
S3, yes. Litestream doesn't store the backup as a single DB file - rather, as snapshots and WAL segments that are streaming in, so it's unfortunately not so simple.
re: BI and Metabase usecase

If it fits the usecase, you can use a VFS SQLite so the process will just pull the needed ranges from storage.

Not the author of this reply, but what we do is just to scp the file into another instance and use Metabase on it. It is pretty sweet as we can have metabase not running on production and "pollute" the environment. Extremely easy.
Wow this is an awesome breakdown!

Do you have written anything more extensive about this? Especially the ANALYIZe or PRAGMA OPTIMIZE, and/or Litestream?

Or maybe you can refer some links?

I have not, but I keep meaning to collate everything I've learned into a set of useful defaults just to remind myself what settings I should be enabling and why.

Regarding Litestream, I learned pretty much all I know from their documentation: https://litestream.io/

I'd be the first one to read about that and share it :)

And thank you, I'll try to set it up for one of our projects and see what happens!