| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mike_red5hift 2166 days ago

Does anyone take issue with the fact that CRDTs seem to require keeping a history of every change ever made to the document?

Seems like it could get unwieldy very fast. Especially, in the face of a bad actor spamming a document with updates.

I've considered using CRDTs in a few projects now, but the requirement to keep a running log of updates forever has ruled them out. I've ended up using other less sound (more prone to failure), but more practical methods of doing sync.

Perhaps, I'm missing something. Wouldn't be the first time.

Are there alternatives without this requirement, or that would at least allow a cap on the update log?

7 comments

vivekseth 2165 days ago

You don't actually need to keep a log of all events. You only need to keep around just enough information to merge any 2 replicas.

Here's a very simple example: If there are only 5 users concurrently editing a document and each user has seen operations o1...oN, then you can safely compress the data for o1...oN.

Depending on the CRDT type you may not need to store any log at all. For an add-only set, for example, you only need to store the elements.

I think what's harder to solve is the metadata overhead problem. Most CRDT based text editors have a huge per-character overhead. As Martin mentioned, Automerge used to have a overhead of ~200bytes per character, but using a new binary encoding format they were able to reduce the overhead down to 1.5-7 bytes per character. (https://speakerdeck.com/ept/crdts-the-hard-parts?slide=67).

im_down_w_otp 2166 days ago

That's not quite right. There are a lot of sound strategies for culling/merging/resolving CRDT state in-part or in-total depending on the use case and/or the topology of the system that interacts with the CRDT.

It's possible to construct a pathological case where it's impossible to soundly GC the CRDT state, and where you have to keep around an arbitrarily long list of per-agent updates or list of agents forever, but that shouldn't be the normative case.

zzzcpan 2166 days ago

Yeah, CRDTs don't require keeping history of every change forever or at all. In other words, all the changes coming from a bad actor can be merged locally into a single change or a small set of changes or whatever is appropriate that will actually be propagated to other nodes. Plus nodes can easily know how far all of them have progressed and drop history before the most far behind point. Only during outages history should grow a bit more than usual.

zmj 2165 days ago

> Plus nodes can easily know how far all of them have progressed and drop history before the most far behind point.

That may be easy coordinating servers that are almost always online, but it's definitely not easy for desktop/mobile clients that go offline for long periods (and sometimes don't come back).

IggleSniggle 2165 days ago

A middle ground could be a combination of "historic" nodes that keep all known history and "client" nodes that only care about history from their own moment of sync onward, the optimistically drop history via some heuristic.

mike_red5hift 2165 days ago

Do have any pointers to javascript libraries that work this way? All the libs I've looked at (not recently) require the server to keep a running log of updates for the life of the document.

For example, the automerge library linked elsewhere on this thread requires it.

My understanding (flawed) is that you need to keep all the changes on the server because you never know how long it's been since a client has pulled/pushed changes to a document.

I guess arbitrary limits based on number of updates or time can be imposed, but I haven't seen libraries that do that.

Thanks.

rapnie 2165 days ago

Did you look at hypermerge, also in the automerge org? It is based on DAT protocol and hypercore, and the p2p version of automerge I think (just found the project).

https://github.com/automerge/hypermerge

mike_red5hift 2165 days ago

No I didn't. I'll check it out. Thanks!

ghj 2165 days ago

We tend to overestimate how often a document will be modified. For example wikipedia seems to keep "a history of every change ever made to the document" and they seem to be managing fine. If there are bad actors, what you need is better moderation. Having easy undo helps a lot in that case!

topicseed 2165 days ago

Depending on the use case, you can at a specific and suitable time create a snapshot of the value and clear the history of changesets.

I do that with my OT collaborative editor.

appwiz 2166 days ago

> I've ended up using other less sound (more prone to failure), but more practical methods of doing sync.

Could you share the alternative methods that you’ve used?

mike_red5hift 2166 days ago

Stuff like diffing multiple json documents and merging adds, updates and deletes where conflicts did not exist and overwriting conflicting attributes based on timestamps (latest wins). There's some reasonably good jsonpatch libraries out there that will do the heavy lifting.

Plenty of room for problems to arise, but it did not require keeping a log of updates. My use cases did not require real-time collaboration and the structure of the document was known beforehand, though.

appwiz 2165 days ago

Got it. I’ve typically seen Operational Transforms brought up as the alternative to CRDT for real-time collaboration. But, if you don’t need to solve for that use case and the format is JSON, it’s a different problem to solve and you don’t need that journal of changes.

If you have code available in the public domain, I’d love to see it.

topicseed 2165 days ago

Operational Transformation, though not the easiest to grasp, offers a solution to the same problem that is perhaps more loyal to the original user intent.

heavenlyblue 2165 days ago

OT is just a fancy name for a CRDT

alextheparrot 2165 days ago

Is this really true (I just watched the video, so still learning these concepts, so please be kind)?

The linked video makes a clear distinction that OT and CRDT are different, as OT has the idea of a centralised coordinator to ensure consistency by mutating the proposed, conflicting operations whereas CRDT uses commutativity to attempt to make conflicting operations an inaccessible state

toomim 2162 days ago

There is no formal distinction between an OT and CRDT algorithm.

It's true that the most popular OT systems have a central server -- but there exist OT algorithms without a central server.

It's also true that CRDT algorithms tend to require too much memory usage to be practical -- but there exist CRDT algorithms with bounded memory.

There are even algorithms that can be equally called OT and CRDT.

sagichmal 2165 days ago

No, CRDTs and Operational Transforms are not the same at all.

heavenlyblue 2165 days ago

Well, so OT is the same as CRDT except that you also have a centralised server that decides the merge conflicts.

topicseed 2165 days ago

Not true whatsoever.

toomim 2162 days ago

Actually, the algorithms have a lot of overlap. The original CRDT (called WOOT, for "WithOut Operational Transform") was derived from an OT algorithm.

Historically, what happened was that the OT research community couldn't solve consistency without adding tombstones into their data structures that recorded deletions (and generally kept these deletion marks around forever). They called these "tombstones." People didn't like that you had to keep them around forever, but they seemed to solve the problem.

Over time, some researchers decided to just keep more things around forever, and promoted the idea that you don't need to do as much transforming of operations if you just preserved all operations in a spatial data structure, not just the tombstones, but everything. They came up with a new name this style: CRDT.

But in fact, the technical definition of a CRDT actually fits OT algorithms quite well. A CRDT just means that you can accept any operation at any time. That's what a good OT system also needs to do. So there isn't a technical distinction between an OT or a CRDT. They each define a set of features, and your synchronizer can possess OT features, and it can possess CRDT features.

rzzzt 2165 days ago

You could introduce a quiet period in which no updates are let in, coalesce changes into a snapshot, clear history and open the floodgates again.

neeeeees 2166 days ago

“Ever” in practice is usually a few months or years.