Wow. Didnt know there were these CRDT examples for mere mortals. I supppose once you put Rust in the mix the heads start to explode, mine included. Cool!
The thing that can make real world text CRDT implementations complex is that the optimisations kinda bleed into all the rest of your code. The 2 big optimisations you want for most text CRDTs - including egwalker - are:
- Using a b-tree instead of an array to store data
- Use internal run-length encoding. Humans usually type in runs of characters. So store runs of operations instead of individual operations. (Eg {insert "abc", pos 0} instead of [{insert "a", pos 0}, {insert "b" pos 1}, {insert "c" pos 2}]).
But these two ideas also affect one another. Its not enough to just use a b-tree. You need a b-tree which also stores runs. And you also need to be able to insert in the middle of a run. And so on. You need some custom collections.
If you do run-length encoding properly, all iteration throughout your code needs to make use of the compressed runs. If any part of the code works character-by-character, it'll become a bottleneck. Oh and did I mention that it works even better if you use columnar encoding, and break the data up into a bunch of small arrays? Yeahhhh.
So thats why diamond types - my optimized egwalker implementation - is tens of thousands of lines of code instead of a few hundred. (Though in my defence, it also includes custom binary serialization, testing, wasm bindings, and so on.)
Rust makes the implementation way easier to implement thanks to traits. I have simple traits for data that can be losslessly compressed into runs[1]. A whole bunch of code takes advantage of that, by providing tooling that can work with a wide variety of actual data. For example, I have a custom vec wrapper that automatically compresses items when you call push(). I have a "zip" iterator which glues together other iterators over run-length encoded data. And so on. Its great.
Though now that I think about it, maybe all that trait foo is what makes it headache inducing. I swear its worth it.
Jeezs. Thanks for the breakdown. I suppose the layering of different, complicated patterns make it too thick to parse. And some of the CRDT APIs leak quite a bit of complexity once you want to do something a little more complicated eg wrap rich text editor content.
I put at least some of the blame on text editors themselves. Many text editors - particularly on the web - don't expose a clean event based API telling you what changed. Its very annoying.
- Using a b-tree instead of an array to store data
- Use internal run-length encoding. Humans usually type in runs of characters. So store runs of operations instead of individual operations. (Eg {insert "abc", pos 0} instead of [{insert "a", pos 0}, {insert "b" pos 1}, {insert "c" pos 2}]).
But these two ideas also affect one another. Its not enough to just use a b-tree. You need a b-tree which also stores runs. And you also need to be able to insert in the middle of a run. And so on. You need some custom collections.
If you do run-length encoding properly, all iteration throughout your code needs to make use of the compressed runs. If any part of the code works character-by-character, it'll become a bottleneck. Oh and did I mention that it works even better if you use columnar encoding, and break the data up into a bunch of small arrays? Yeahhhh.
So thats why diamond types - my optimized egwalker implementation - is tens of thousands of lines of code instead of a few hundred. (Though in my defence, it also includes custom binary serialization, testing, wasm bindings, and so on.)
Rust makes the implementation way easier to implement thanks to traits. I have simple traits for data that can be losslessly compressed into runs[1]. A whole bunch of code takes advantage of that, by providing tooling that can work with a wide variety of actual data. For example, I have a custom vec wrapper that automatically compresses items when you call push(). I have a "zip" iterator which glues together other iterators over run-length encoded data. And so on. Its great.
Though now that I think about it, maybe all that trait foo is what makes it headache inducing. I swear its worth it.
[1] Eg MergableSpan: https://github.com/josephg/diamond-types/blob/00f722d6ebdc9f...