| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by haadcode 2292 days ago

almost! :) let me elaborate on few points.

> you have to reference all of the current concurrent root nodes of the data structure, in effect becoming the new single root node

correct, and more precisely the union of heads is the current "single root node". in practise, and this is where the merge strategy comes in, the "latest value" is the value of the event that is "last" (as per LWW sorting).

> and your event data, which must be a CRDT, gets merged with the CRDTs of those root nodes.

the event data itself doesn't have to be a CRDT, can be any data structure. the "root nodes" (meaning the heads of the log) don't get merged with the "event data" (assuming you mean the database/model layer on top of the log), the merge strategy of the log picks the "last/latest event data" to be the latest value of your data structure.

> It's time to write a new value, so I create a new tuple with references to 4 and 6, and merge their CRDT values.

when a new value is written, correct that the references to 4 and 6 are stored, but the new value doesn't merge the values of the previous events and rather, it's a new value of its own. it may replace the value from one or both of the previous events, but that depends on the data model (layer up from the log).

  1 --> 2 --> 3 --> 4 
         `--> 5 --> 6

> Last Writer Wins, right? So either 4 or 6 dominates the other. Whoever was in the other causal history just... lost their writes?

no writes are lost. the result in your example depends what 4 and 6 refer to. in a log database, the ordered log would be eg. 1<-2<-3<-5<-4<-6, so all values are preserved. in the case of a key-value store, it could be that 4 is a set operation to key a and 6 is a set operation to key b, thus the writes don't effect each other. if 4 and 6 are both a set operation on key a, it would mean that key a would have the value from 6 and the next write to key a would overwrite the value in a. makes sense?

1 comments

sagichmal 2292 days ago

> in a log database, the ordered log would be eg. 1<-2<-3<-5<-4<-6

How do you know that? It's not inferrable from the DAG. Is sequencing also provided "a layer up"?

> if 4 and 6 are both a set operation on key a, it would mean that key a would have the value from 6 and the next write to key a would overwrite the value in a.

Yes, I mean for all of my events to be reads and writes of the same key. And you've proven my point, I think: if the resolution of this causal tree is Last-Writer-Wins, 6-domiantes-4, and "key a [gets] the value from 6", then whichever poor user was operating on the 3-4 causal branch has lost their writes.

This is a problem! If you claim to be a CRDT and offline-first or whatever, then as a user, I expect that the operations I make while I'm disconnected aren't just going to be destroyed when I reconnect, because someone else happened to be using a computer with a lexicographically superior hostname (or however you derive your vector clocks).

And if you want to say something like, well, when the unlucky user reconnects and sees that their work has been overwritten, they can just look in the causal history, extract their edits, and re-apply them to the new root -- there's no reason for any of this complex machinery! You don't need CRDTs to just replicate a log of all operations. Of course, you also can't do any meaningful work with such a data structure as a foundation, because it immediately becomes unusably large.

link

pas 2292 days ago

It's inferable from the fact that the writer saw both chains and produced a new node that merges these heads. So that writer "resolves" the conflict - according to whatever strategy it is programmed to use. (It might be as simple as just storing a JSON that says {"conflict": true, "values": [4, 6]} , and the user will have to pick.)

If it's possible to model operations in a commutative way (eg. instead of assigning values to keys one just stores differences), then the conflict resolution is mathematically proven, just apply all operations in whatever order, they're commutative, great. Of course it doesn't help with "real world data", but that's where mathematically we can use and oracle (the user, or whatever linearizer service we choose).

link

haadcode 2292 days ago

> How do you know that? It's not inferrable from the DAG. Is sequencing also provided "a layer up"?

I jumped the gun there and made an assumption that the value of a node is the LWW ordering :) Ok, so without that assumption, the DAG

  1 --> 2 --> 3 --> 4 
         `--> 5 --> 6

...are the values of the operations that the DAG represents, ie. values of a key, so we need to look at the Lamport clocks (or Merkle Clocks when the operations are hashed as a merkle dag) of each operation, represented here as ((ts, id), key, value):

  ((0, x), a, 1) --> ((1, x), a, 2) --> ((2, x), a, 3) --> ((3, x), a, 4)
                                   `--> ((2, y), a, 5) --> ((3, y), a, 6)

which one is the latest value for key a? Which updates, semantically, were lost? In a non-CRDT system, which value (4/x or 6/y) is or should be displayed and considered the latest?

> This is a problem! If you claim to be a CRDT and offline-first or whatever, then as a user, I expect that the operations I make while I'm disconnected aren't just going to be destroyed when I reconnect, because someone else happened to be using a computer with a lexicographically superior hostname (or however you derive your vector clocks).

You're conflating the data(base) model with the log and we can't generalize that all cases of data models or merge conflict are cases of "I expect all my operations to be the latest and visible to me". They are semantically different. If the writes are on the same key, one of them has to come first if the notion of "latest single value" is required. If the writes are not on the same key, or not key-based, multiple values appear where they need to. What we can generalize is that by giving a deterministic sorting function, the "latest value" is the same for all participants (readers) in the system. From data structure perspective this is correct: given same set of operations, you always get the same result. For many use cases, LWW works perfectly fine, and if your data model requires a "different interpretation" of the latest values, you can pass in your custom merge logic (=sorting function) in orbitdb. The cool thing is, that by giving a deterministic sorting function for a log, you can turn almost any data structure to a CRDT. How they translate to end-user data model will depend (eg. I wouldn't model, say, "comments on a blog post" as a key-value store).

If you're curious to understand more, I think the model is best described in the paper "OpSets: Sequential Specifications for Replicated Datatypes" [1]. Another two papers, from the same author, that may also help are "Online Event Processing" [2] and "Moving Elements in List CRDTs" [3] which show how by breaking down the data model to be more granular than "all or nothing", composing different CRDTs give arise to new CRDTS, which I find beautiful. Anything, really, that M. Kleppmann has written about the topic is worth a read :)

[1] https://arxiv.org/pdf/1805.04263.pdf [2] https://martin.kleppmann.com/papers/olep-cacm.pdf [3] https://martin.kleppmann.com/papers/list-move-papoc20.pdf

link

sagichmal 2292 days ago

OK, I understand now. I guess my points then translate to:

1. Modeling an append-only log as a CRDT is trivial

2. Building a database on top of a "CRDT" append-only log doesn't make the database a CRDT

link

haadcode 2292 days ago

No, on both. See above.

link