Yeah, I thought the whole point of OT was that it allowed any client to do resolution (because all operations transform other operations in a way to make them all resolve the same way no matter their order) Which is what made it seem like such overkill for Google Docs where there is a central authoritative server, so a much simpler architecture could have done the job...
Simple OT algorithms do work a lot better with a centralized server.
With a centralized server, you can consider every resolution as happening between 2 nodes (server and client). This means transform / catchup is as simple as a for loop.
In contrast, in decentralized context OT code needs to support Transform Property 2[1] in order to converge correctly in all cases. TP2 dramatically complicates the OT implementation, and you need a much more complicated resolution algorithm to merge arbitrary changes between nodes.
- Either an operation prune function (anti-transform) or full history, forever.
- A resolution algorithm that can flatten DAGs of operations into a transformed list. This stuff gets really hairy, and hard to implement efficiently. This is my attempt from a few years ago. Its correct, but super slow: https://github.com/josephg/tp2stuff/blob/master/node3.coffee
Implementing high performance OT with centralized servers is easy. Implementing OT in a decentralized network is hard to do correctly, and much harder to implement in a highly performant way. For decentralized systems, CRDTs are a much better approach imo.
Very insightful, thank you for the detailed comment. About the DAG flattening algorithm you mention - in what way does this differ from a topological sorting? (https://en.wikipedia.org/wiki/Topological_sorting ) E.g. does it need to know about expensive or cheap combinations of operations, and try to avoid the former?
Oh! I didn’t know that had a name. Yes - it is indeed a topological sort. At least the way I’ve implemented it, the performance problem is that moving an operation from position K to position N in the output requires O(|K-N|) steps. CRDTs can do this work in O(log(|K-N|)) steps.
There are a few things related to TP2, aka Convergence Property 2 (CP2), that needs to be unpacked here.
Do your system always need to preserve TP2/CP2 to converge correctly? Contrary to what many folks and papers will claim, answer is categorically no.
There are two general approaches to arrive this desired outcome: (1) avoiding it or (2) preserving it. Let's look at the basics of what you need for each:
(1) Avoiding it - this can be done with a centralized server, as well as in a fully distributed, decentralized, peer-to-peer context. So I'm in agreement with josephg in that a central server will avoid TP2/CP2, but I'd like to amend that with a decentralized avoidance strategy is available and just as easy to adopt. To back this up, this paper [1] looks at TP2/CP2-avoidance systematically in representative protocols and algorithms: JUPITER, NICE, Google Wave, COT, TIBOT/TIBOT2. TIBOT/TIBOT2 are two fully decentralized algorithms/protocols that avoids TP2/CP2.
In addition, [1] also answers the more interesting question of why these systems were able to avoid TP2/CP2, by giving the general conditions for TP2/CP2-avoidance. You can apply the conditions to design a new protocol or algorithm that avoids TP2/CP2.
(2) Preserving it - By TP2/CP2 preservation, it usually refers to the task of writing transformation functions (e.g. between insert and delete, or delete and delete) that satisfies the aforementioned condition. This is a solved problem for String-wise operations, from 2016 [2]:
- You can take these String-wise transformation functions and use them with many of the existing OT algorithms in fully-decentralized context and satisfy all TP1 and TP2.
- The transformation functions are free of Tombstones (or its variants) or other schemes (DAG etc) to ensure TP2.
Happy to say more about it if someone has an interest.
You can implement correct and performant OT with both a centralized server or with a fully decentralized topology, using TP2/CP2-avoidance and/or with TP2/CP-preservation. I've built multiple production systems with all the varieties. I generally do prefer OT with TP2/CP2-avoidance on a server, but it's not because OT can't avoid TP2/CP2 without a server or preserve TP2/CP2, it is that there is not much real-world evidence showing clear benefits moving a collaborative editing system to decentralized topology, esp when most of the CRDT-based p2p proposals in papers that I've come across involves a server one way or another.
TP2/CP2 in OT is often stood up as a straw-man so folks can throw stones against its correctness, it is really a solved problem. The unfortunately outcome is that you see a ton of papers adopting this strategy to get published and overtime the conversation becomes really convoluted.
We discuss these ideas and try to dispel many of these confusions in detail in the arxiv manuscript. A good starting point would be Section 4.1.3: OT Solutions to the CP2 Issue and the FT Puzzle