Hacker News new | ask | show | jobs
by mike_red5hift 2166 days ago
Does anyone take issue with the fact that CRDTs seem to require keeping a history of every change ever made to the document?

Seems like it could get unwieldy very fast. Especially, in the face of a bad actor spamming a document with updates.

I've considered using CRDTs in a few projects now, but the requirement to keep a running log of updates forever has ruled them out. I've ended up using other less sound (more prone to failure), but more practical methods of doing sync.

Perhaps, I'm missing something. Wouldn't be the first time.

Are there alternatives without this requirement, or that would at least allow a cap on the update log?

7 comments

You don't actually need to keep a log of all events. You only need to keep around just enough information to merge any 2 replicas.

Here's a very simple example: If there are only 5 users concurrently editing a document and each user has seen operations o1...oN, then you can safely compress the data for o1...oN.

Depending on the CRDT type you may not need to store any log at all. For an add-only set, for example, you only need to store the elements.

I think what's harder to solve is the metadata overhead problem. Most CRDT based text editors have a huge per-character overhead. As Martin mentioned, Automerge used to have a overhead of ~200bytes per character, but using a new binary encoding format they were able to reduce the overhead down to 1.5-7 bytes per character. (https://speakerdeck.com/ept/crdts-the-hard-parts?slide=67).

That's not quite right. There are a lot of sound strategies for culling/merging/resolving CRDT state in-part or in-total depending on the use case and/or the topology of the system that interacts with the CRDT.

It's possible to construct a pathological case where it's impossible to soundly GC the CRDT state, and where you have to keep around an arbitrarily long list of per-agent updates or list of agents forever, but that shouldn't be the normative case.

Yeah, CRDTs don't require keeping history of every change forever or at all. In other words, all the changes coming from a bad actor can be merged locally into a single change or a small set of changes or whatever is appropriate that will actually be propagated to other nodes. Plus nodes can easily know how far all of them have progressed and drop history before the most far behind point. Only during outages history should grow a bit more than usual.
> Plus nodes can easily know how far all of them have progressed and drop history before the most far behind point.

That may be easy coordinating servers that are almost always online, but it's definitely not easy for desktop/mobile clients that go offline for long periods (and sometimes don't come back).

A middle ground could be a combination of "historic" nodes that keep all known history and "client" nodes that only care about history from their own moment of sync onward, the optimistically drop history via some heuristic.
Do have any pointers to javascript libraries that work this way? All the libs I've looked at (not recently) require the server to keep a running log of updates for the life of the document.

For example, the automerge library linked elsewhere on this thread requires it.

My understanding (flawed) is that you need to keep all the changes on the server because you never know how long it's been since a client has pulled/pushed changes to a document.

I guess arbitrary limits based on number of updates or time can be imposed, but I haven't seen libraries that do that.

Thanks.

Did you look at hypermerge, also in the automerge org? It is based on DAT protocol and hypercore, and the p2p version of automerge I think (just found the project).

https://github.com/automerge/hypermerge

No I didn't. I'll check it out. Thanks!
We tend to overestimate how often a document will be modified. For example wikipedia seems to keep "a history of every change ever made to the document" and they seem to be managing fine. If there are bad actors, what you need is better moderation. Having easy undo helps a lot in that case!
Depending on the use case, you can at a specific and suitable time create a snapshot of the value and clear the history of changesets.

I do that with my OT collaborative editor.

> I've ended up using other less sound (more prone to failure), but more practical methods of doing sync.

Could you share the alternative methods that you’ve used?

Stuff like diffing multiple json documents and merging adds, updates and deletes where conflicts did not exist and overwriting conflicting attributes based on timestamps (latest wins). There's some reasonably good jsonpatch libraries out there that will do the heavy lifting.

Plenty of room for problems to arise, but it did not require keeping a log of updates. My use cases did not require real-time collaboration and the structure of the document was known beforehand, though.

Got it. I’ve typically seen Operational Transforms brought up as the alternative to CRDT for real-time collaboration. But, if you don’t need to solve for that use case and the format is JSON, it’s a different problem to solve and you don’t need that journal of changes.

If you have code available in the public domain, I’d love to see it.

Operational Transformation, though not the easiest to grasp, offers a solution to the same problem that is perhaps more loyal to the original user intent.
OT is just a fancy name for a CRDT
Is this really true (I just watched the video, so still learning these concepts, so please be kind)?

The linked video makes a clear distinction that OT and CRDT are different, as OT has the idea of a centralised coordinator to ensure consistency by mutating the proposed, conflicting operations whereas CRDT uses commutativity to attempt to make conflicting operations an inaccessible state

There is no formal distinction between an OT and CRDT algorithm.

It's true that the most popular OT systems have a central server -- but there exist OT algorithms without a central server.

It's also true that CRDT algorithms tend to require too much memory usage to be practical -- but there exist CRDT algorithms with bounded memory.

There are even algorithms that can be equally called OT and CRDT.

No, CRDTs and Operational Transforms are not the same at all.
Well, so OT is the same as CRDT except that you also have a centralised server that decides the merge conflicts.
Not true whatsoever.
Actually, the algorithms have a lot of overlap. The original CRDT (called WOOT, for "WithOut Operational Transform") was derived from an OT algorithm.

Historically, what happened was that the OT research community couldn't solve consistency without adding tombstones into their data structures that recorded deletions (and generally kept these deletion marks around forever). They called these "tombstones." People didn't like that you had to keep them around forever, but they seemed to solve the problem.

Over time, some researchers decided to just keep more things around forever, and promoted the idea that you don't need to do as much transforming of operations if you just preserved all operations in a spatial data structure, not just the tombstones, but everything. They came up with a new name this style: CRDT.

But in fact, the technical definition of a CRDT actually fits OT algorithms quite well. A CRDT just means that you can accept any operation at any time. That's what a good OT system also needs to do. So there isn't a technical distinction between an OT or a CRDT. They each define a set of features, and your synchronizer can possess OT features, and it can possess CRDT features.

You could introduce a quiet period in which no updates are let in, coalesce changes into a snapshot, clear history and open the floodgates again.
“Ever” in practice is usually a few months or years.