Hacker News new | ask | show | jobs
by benbjohnson 1426 days ago
LiteFS author here (also Litestream author). I'm happy to answer any questions folks have about how it works or what's on the roadmap.
9 comments

Very cool. I've been working on CRDTs for SQLite to allow for conflict-free replication. Its still super early but I have a prototype that works and, if you're up to it, I would love to pick your brain on the details of sqlite WALs and journaling.

https://www.loom.com/share/0934f93364d340e0ba658146a974edb4

Right on, I always thought a CRDT SQLite database would be awesome. I thought about looking at the session extension[1] to handle the merges but I didn't get very far into it.

Feel free to ping me at benbjohnson@yahoo.com if you want to chat WAL & journaling. I think it'd be difficult to implement CRDT at the journal level since it's all physical pages. Merging data in pages is going to sometimes cause splits and that'll be hairy to deal with.

[1] https://www.sqlite.org/sessionintro.html

that was my assumption but it's been suggested a few times to implement it there. Nice to hear some validation that that isn't a great spot.
Why a fuse filesystem instead of just using sqlite's vfs features? FUSE adds an extra user-kernel-user context switch which is expensive compared to a quick user-user call for the vfs drivers.
I was going to raise that point exactly.

As someone who spends an awful amount of time using FUSE, my recommendation is to only use it in cases where the software that interacts with the file system isn't easily changeable. For example, for Buildbarn which I maintain (https://github.com/buildbarn/bb-remote-execution), I need to use it. It's infeasible to change arbitrary compilers and tests to all interact with a network distributed build cache. Designing the FUSE file system was a pretty heavy investment though, as you really need to be POSIXly correct to make it all work. The quality of implementations of FUSE also varies between OSes and their versions. macFUSE, for example, is quite different from Linux FUSE.

Given that SQLite already has all of the hooks in place, I would strongly recommend using those. In addition to increasing portability, it also makes it easier to set up/run. As an example, it's pretty hard to mount a FUSE file system inside of a container running on Kubernetes without risking locking up the underlying host. Doing the same thing with the SQLite VFS hooks is likely easy and also doesn't require your container to run with superuser privileges.

The VFS system is great and I plan to support that as well. SQLite is simple to get started with but it can be complicated to integrate a VFS depending on the language using. My goal is to make the deployment as simple as possible and also to support legacy applications.

There's some interesting work I'm looking at with the VFS and WASM where you could deploy something like this across pure serverless functions like Vercel or Deno.

It just so happens that I am working on a versioned file system for sqlite based on its VFS interface. Planning to release it in a week's time. It will be in pre-alpha stage and its more of a proof-of-concept.

Having worked with FUSE in the past, there is always some performance penalty, but the advantage is that there is no change in the application.

The VFS will need change in the application but less overhead. Trade-offs!

Would doing the VFS thing require participating applications to be linked to a LiteFS library?
That's a good question. I thought that was the case but I just double checked and it looks like you can load a VFS as an extension [1].

[1] https://www.sqlite.org/loadext.html

I can confirm that it does. I am working on it as I write this :).

Technically, an extension and VFS are separate things.

An extension when loads, allows you to register the VFS with the Sqlite. It can also register virtual tables as well.

Thanks for sharing this!

As WAL'd sqlite on networked filesystems is a no-go, once LiteFS gets wal support I think this will be a savior software for many operators with existing sqlite deployments.

Could we even host LiteFS with underlying dbs on a filesystem that otherwise wouldn't play well with WAL (say nfs, ceph or gluster)? An effectively single-node deployment that would achieve redundancy through the FS.

That would be an amazing hack. "LiteFS makes sqlite work on NFS".
Kudos on the release! I've done almost exactly the same thing in Rust with WAL only; writes are appended to a distributed wal which achieves consensus by raft and distributed time leases. it all worked so nearly I was impressed. It was for a client though, so it's unfortunately proprietary.
Thanks! That sounds like a fun project. Did you run the WAL writes through Raft or did you just use Raft for leader election?
Awesome! Glad to see it.

I was very excited for the experimental support for live replication in Litestream, but as I understand that didn't pan out. This looks to be the successor, with fanout replication an explicit feature.

Using a FUSE layer that's detecting changes is sure to have some performance tradeoffs. What benchmarks are under way? Do you need any help?

Thanks! In hindsight, I'm glad I pulled the live replication in Litestream and left it focused on disaster recovery only. A lot of what I'm doing in LiteFS would be much more difficult if I didn't have control over the write path. I'm able to perform rolling checksums to verify database integrity across replicas and that would be nearly impossible with Litestream (which runs as a separate process).

The FUSE layer does have performance trade-offs. Read queries aren't affected as much since the OS page cache is in the kernel and avoids the FUSE layer entirely. Most databases are under 10GB (or honestly probably less than 1GB) so you can fit most of that in the page cache for any moderately sized VM. The write side will incur a performance hit but I haven't taken benchmarks yet. There's a lot that can still be optimized around that.

The biggest help that I could use is just to get feedback on how it feels to use and any ways I can improve it. The end goal is to make it ridiculously simple to spin up and "just work" so any ideas towards that goal would be awesome.

I’m curious, was there a specific reason that FUSE was used instead of the FVS layer inside to SQLite itself? I realize that it would mean that a custom module would have to be loaded, and maybe that was undesirable. Full disclosure I have not read the link yet as I am a little pressed for time, but I will later tonight when I have time. If it’s mentioned in the readme, feel free to ignore me :) I ask because I have been building a VFS for SQLite to handle multiple distributed writers and replicating data blocks around to different places.
It's a good question. It's mainly about reducing friction for the end user. I wrote up a related response here: https://news.ycombinator.com/item?id=32243305
Given a situation where replication is desirable, is using SQLite + LiteFS a better choice than just replicated Postgres?
Postgres is great and replicating it can work as well. One benefit to SQLite is that it's in-process so you avoid most per-query latency so you don't have to worry as much about N+1 query performance issues. From my benchmarks, I see latency from an application node to a Postgres node as high as 1 millisecond -- even if both are in the same region. SQLite, on the other hand, has per-query latency overhead of about 10-20µs.

Another goal is to simplify deployment. It's a lot easier and cost-effective to just deploy out 10 application nodes across a bunch of regions rather than having to also deploy Postgres replicas in each of those regions too.

SQLite v Postgres here is apples v oranges. Postgres is multi-reader multi-writer whereas SQLite is single-writer multi-reader among other things. They are both very fine databases but solve different use cases.
From https://www.sqlite.org/isolation.html :

> Isolation And Concurrency: SQLite implements isolation and concurrency control (and atomicity) using transient journal files that appear in the same directory as the database file. There are two major "journal modes". The older "rollback mode" corresponds to using the "DELETE", "PERSIST", or "TRUNCATE" options to the journal_mode pragma. In rollback mode, changes are written directly into the database file, while simultaneously a separate rollback journal file is constructed that is able to restore the database to its original state if the transaction rolls back. Rollback mode (specifically DELETE mode, meaning that the rollback journal is deleted from disk at the conclusion of each transaction) is the current default behavior.

> Since version 3.7.0 (2010-07-21), SQLite also supports "WAL mode". In WAL mode, changes are not written to the original database file. Instead, changes go into a separate "write-ahead log" or "WAL" file. Later, after the transaction commits, those changes will be moved from the WAL file back into the original database in an operation called "checkpoint". WAL mode is enabled by running "PRAGMA journal_mode=WAL".

> In rollback mode, SQLite implements isolation by locking the database file and preventing any reads by other database connections while each write transaction is underway. Readers can be active at the beginning of a write, before any content is flushed to disk and while all changes are still held in the writer's private memory space. But before any changes are made to the database file on disk, all readers must be (temporarily) expelled in order to give the writer exclusive access to the database file. Hence, readers are prohibited from seeing incomplete transactions by virtue of being locked out of the database while the transaction is being written to disk. Only after the transaction is completely written and synced to disk and committed are the readers allowed back into the database. Hence readers never get a chance to see partially written changes.

> WAL mode permits simultaneous readers and writers. It can do this because changes do not overwrite the original database file, but rather go into the separate write-ahead log file. That means that readers can continue to read the old, original, unaltered content from the original database file at the same time that the writer is appending to the write-ahead log. In WAL mode, SQLite exhibits "snapshot isolation". When a read transaction starts, that reader continues to see an unchanging "snapshot" of the database file as it existed at the moment in time when the read transaction started. Any write transactions that commit while the read transaction is active are still invisible to the read transaction, because the reader is seeing a snapshot of database file from a prior moment in time.

> An example: Suppose there are two database connections X and Y. X starts a read transaction using BEGIN followed by one or more SELECT statements. Then Y comes along and runs an UPDATE statement to modify the database. X can subsequently do a SELECT against the records that Y modified but X will see the older unmodified entries because Y's changes are all invisible to X while X is holding a read transaction. If X wants to see the changes that Y made, then X must end its read transaction and start a new one (by running COMMIT followed by another BEGIN.)

Or:

  ROLLBACK; // cancel the tx e.g.  because a different dbconn thread detected updated data before the tx was to be COMMITted.

  // Replay the tx 
  BEGIN;
  // replay the same SQL statements
  COMMIT;
This is saying that SQLite allows reads and writes to happen simultaneously, but it's still single-writer. There's a WIP branch to add concurrent writes.

> Usually, SQLite allows at most one writer to proceed concurrently. The BEGIN CONCURRENT enhancement allows multiple writers to process write transactions simultanously if the database is in "wal" or "wal2" mode, although the system still serializes COMMIT commands.

https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...

Shit. I was wrong. Thank you for sharing.
Does LiteFS needs to do "pattern matching" to know what SQLite writes (a txn or not e.g.)? With Litestream, it seems simply use SQLite API. Just curious how do you think through this.
Transactions start with the creation of the journal file and end when it's unlinked so there's some "pattern matching" but it's not terribly complicated. Once the file is unlinked, LiteFS can read and verify the journal and pull the change set of pages from the database file.

Litestream interacts with the SQLite API for locking but it uses a regular file descriptor to read the WAL. It has to do similar parsing of pages but it's a bit more complicated since it needs to re-verify its position every time it reads from the WAL.

I am under the impression that if you are in the WAL mode, there is no rollback journal created?
Yes, there is no rollback journal in WAL mode, but the WAL is its own file.
A while ago, when I discovered litestream, I played around with it and just setup cloning to a local file://; is this the same idea?
Sorta. LiteFS sets up a FUSE mount so once you have that in place then you can create a database within that mount directory and it is automatically replicated.
Can you describe one or more use cases?
One main advantage of LiteFS is being able to deploy a SQLite-based application and not have downtime during deploys. This is especially true in ephemeral systems like Kubernetes where pods will rollover.

Another use case is moving data close to users. If you're only targeting users in a single country then it's not as big of a deal but RTT from the US to Europe is ~100ms and US to Asia is ~250ms. That's a big latency hit depending on what you're trying to do.

It might be interesting for embedded scenarios where you need an always-available database but can't guarantee there will be a connection available consistently to a central database server.