by renke1 1581 days ago
So I am planning to use CRDT sometime in the future.

Any thoughts on Automerge vs. yjs? – I am not doing a text editor. I just want to build a solid offline-first web application.

Also, is there any way to "squash" the history of changes? Let's say I have a central server through which all changes are synced (no peer-to-peer syncing). Does it make sense to force clients that haven't synced for a long time (let's say a week) to just discard their non-synced changes and use the "current" state as stored on the server?

Okay, one more question: Let's say I want to add an API to my server that uses the data that was synced to the server (assuming the sync state of Automerge/yjs is stored somewhere). Would the server in this case just be another client that gets the data from the synced state and stores it in an appropriate store (say a SQL database, Elasticsearch, etc.)?

2 comments

Here’s what I know from going on a similar journey recently.

1. Choose Yjs for now.

2. Look at Yjs’s binary “update” format. That is what you should store in your database’s “blob” column. This also allows your backend to receive and transmit updates without hydrating the CRDT into JavaScript class instances. https://docs.yjs.dev/api/document-updates

3. Yjs has its own “gc” that discards deleted content. Without GC, deleted content remains in the CRDT but is hidden from the user’s perspective. You will need to hydrate the CRDT into memory to use the GC feature. I’m not sure how to trigger this GC; maybe it runs whenever you apply an update on a Y.Doc with doc.gc=true.

4. As long as GC is disabled, you can use “snapshots” to restore old versions of the doc. https://docs.yjs.dev/ecosystem/editor-bindings/prosemirror#v...
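To make point 2 concrete, here is a minimal sketch of a backend that treats updates as opaque bytes and never hydrates the CRDT. The `UpdateStore` class and the in-memory `Map` standing in for the blob column are hypothetical names for illustration. The merge function is injected; with Yjs you would presumably pass `Y.mergeUpdates`, which combines binary updates without building a `Y.Doc`. The concatenating merge below is only a placeholder so the sketch runs without any dependency.

```typescript
// Hypothetical storage layer: CRDT updates are opaque byte arrays.
type MergeFn = (updates: Uint8Array[]) => Uint8Array;

class UpdateStore {
  // docId -> one merged blob (stand-in for a "blob" column in a database)
  private blobs = new Map<string, Uint8Array>();
  private merge: MergeFn;

  constructor(merge: MergeFn) {
    this.merge = merge;
  }

  // Fold an incoming update into the stored blob without decoding it.
  receive(docId: string, update: Uint8Array): void {
    const current = this.blobs.get(docId);
    this.blobs.set(docId, current ? this.merge([current, update]) : update);
  }

  // Bytes that a newly connected client should apply to its local doc.
  load(docId: string): Uint8Array | undefined {
    return this.blobs.get(docId);
  }
}

// Placeholder merge for demonstration only: plain concatenation.
// Assumption: in a real setup this would be Y.mergeUpdates from yjs.
const concatMerge: MergeFn = (updates) => {
  const total = updates.reduce((n, u) => n + u.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const u of updates) {
    out.set(u, offset);
    offset += u.length;
  }
  return out;
};

const store = new UpdateStore(concatMerge);
store.receive("doc-1", new Uint8Array([1, 2]));
store.receive("doc-1", new Uint8Array([3]));
const stored = store.load("doc-1");
```

Keeping the merge function as a parameter means the storage code does not care which CRDT library produced the bytes.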

So, knowing the above, how would you design a system like the one in your question? I think you could go with a kind of hot/cold storage. Keep the “hot” version of your document in the “current” row of your Postgres table for a document. Send/receive updates to the hot row. Take snapshots on the server whenever you’d like to.

Then, the cold storage. Periodically, you want to GC the hot storage. Before you do that, apply it as an update to some cold storage, maybe a blob in S3 so you don’t permanently lose those deleted values, and your snapshots can work in perpetuity against the cold storage data. Then GC the hot storage.
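The ordering matters here (archive first, GC second), so a rough sketch of that rotation may help. Everything below is hypothetical: `HotColdStore`, `CrdtOps`, and especially `compact`, which stands in for "hydrate the doc with GC enabled and re-encode it". The toy operations treat a `0` byte as deleted content purely so the example runs on its own; real CRDT bytes do not work like that.

```typescript
type Bytes = Uint8Array;

// Operations supplied by the CRDT library (assumed, not a real Yjs interface).
interface CrdtOps {
  merge(updates: Bytes[]): Bytes; // e.g. Y.mergeUpdates with Yjs
  compact(state: Bytes): Bytes;   // hypothetical: re-encode with deleted content dropped
}

class HotColdStore {
  hot: Bytes | undefined = undefined;  // e.g. the "current" row in Postgres
  cold: Bytes | undefined = undefined; // e.g. a blob in S3
  private ops: CrdtOps;

  constructor(ops: CrdtOps) {
    this.ops = ops;
  }

  // Normal traffic only ever touches the hot copy.
  applyUpdate(update: Bytes): void {
    this.hot = this.hot ? this.ops.merge([this.hot, update]) : update;
  }

  // Archive the full hot state into cold storage, then GC the hot copy.
  rotate(): void {
    if (!this.hot) return;
    this.cold = this.cold ? this.ops.merge([this.cold, this.hot]) : this.hot;
    this.hot = this.ops.compact(this.hot);
  }
}

// Toy operations for demonstration only: a 0 byte marks "deleted" content.
const toyOps: CrdtOps = {
  merge: (updates) =>
    Uint8Array.from(
      updates.reduce<number[]>((acc, u) => acc.concat(Array.from(u)), [])
    ),
  compact: (state) => state.filter((b) => b !== 0),
};

const hotCold = new HotColdStore(toyOps);
hotCold.applyUpdate(new Uint8Array([1, 0, 2]));
hotCold.applyUpdate(new Uint8Array([0, 3]));
hotCold.rotate();
```

After `rotate()`, the cold copy still holds the deleted content while the hot copy is smaller. Note that the toy merge is not idempotent like a real CRDT merge, so the demo rotates only once.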

I am more unsure about squashing. The naive way I implemented it is to iterate over and copy all the data from OldHotDoc into a totally new, independent NewHotDoc, and then archive or discard OldHotDoc. This starts a totally new history. What I’ve considered is that if any writes come in from old clients that predate the squash, you can still apply the straggler writes to the old hot doc/old cold storage, manually diff OldHotDoc before and after the change, and then try to patch NewHotDoc the same way. Eventually you arrange for all clients to switch to the new doc history, and you can choose how long you’ll keep trying this janky patch strategy to accept straggler writes, or just discard them.
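The diff-then-patch idea for straggler writes can be sketched on plain JSON-like records. This assumes the document flattens to simple key/value pairs (for example via the doc's JSON view); `diff`, `applyPatch`, and the record shapes are hypothetical helpers, not part of Yjs. Note that this is last-write-wins patching, so it gives up the CRDT's own merge semantics for those writes.

```typescript
type Value = string | number | boolean;
type Doc = Record<string, Value>;

interface Patch {
  set: Doc;          // keys added or changed by the straggler write
  removed: string[]; // keys deleted by the straggler write
}

// Compare the old doc's JSON view before and after applying a straggler write.
function diff(before: Doc, after: Doc): Patch {
  const patch: Patch = { set: {}, removed: [] };
  for (const [key, value] of Object.entries(after)) {
    if (before[key] !== value) patch.set[key] = value;
  }
  for (const key of Object.keys(before)) {
    if (!(key in after)) patch.removed.push(key);
  }
  return patch;
}

// Replay the same change against the new doc's state.
function applyPatch(target: Doc, patch: Patch): Doc {
  const next: Doc = { ...target, ...patch.set };
  for (const key of patch.removed) delete next[key];
  return next;
}

// OldHotDoc before and after a late write from an old client.
const oldBefore: Doc = { title: "a", done: false, note: "x" };
const oldAfter: Doc = { title: "b", done: false };

// NewHotDoc has meanwhile moved on in its own history.
const newDoc: Doc = { title: "a", done: true, note: "x" };

const patch = diff(oldBefore, oldAfter);
const patched = applyPatch(newDoc, patch);
```

Here the straggler's title change and note deletion carry over, while the `done: true` edit made in the new history survives because the straggler never touched it.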

I’m also not sure when you want to squash. I suggest fuzzing your system with the hot/cold storage part first to figure out what the rate of data growth of the “hot” storage is before you consider the squashing part.

> Yjs has its own “gc” that discards deleted content. Without GC, deleted content remains in the CRDT but is hidden from the user’s perspective.

This alone makes Yjs the clear choice for me. If you're building an app where a user prepares a record and then shares it, senders assume recipients can't view the record's previous revisions from before it was shared (unless your app has an obvious 'history' feature). If a CRDT doesn't do garbage collection, recipients receive past revisions, and could extract those states from the CRDT if they wished.

Without GC, you have to address this by creating a new CRDT with no history each time the recipient list changes, and that breaks offline changes made against the old CRDT.

> Also, is there any way to "squash" the history of changes?

My understanding is that this is one of the areas where Yjs does a little better than Automerge: it has a heavily optimised binary representation that combines consecutive changes into a single action.

Most people (who have looked into it) probably associate Yjs with its editor bindings, but it’s brilliant for any type of syncing. I used it for automatic conflict resolution with Pouch/CouchDB, and it works really well.

On your server question, you can go either way: load the Yjs document on the server to read it, or store a JSON representation of the most recent state alongside it. Personally I would go for the latter as it gives you flexibility.

There are two implementations of Yjs: the JavaScript one and a newer Rust one, which will have bindings for other languages. Last I looked, the Rust one was still a work in progress, but that was a few months ago. It will provide great support for Yjs on the server side once it is complete.