Hacker News new | ask | show | jobs
by zzz61831 2138 days ago
> Memory issues with tombstones. Marking as deletion has a cost of maintaining them throughout the session.

That's not true, you don't need to keep them throughout the session. You can drop them as soon as you either synchronize with other clients or with the server, which can then synchronize with the clients, that's what conflict-free nature of CRDTs allows you to do. The only fundamental requirement is to keep track of changes when you are offline or out of sync until they are sent to others and even then you are supposed to merge those changes into fewer changes keeping disk and memory use down to a minimum.

> CRDTs were perfect for plain strings and arbitrary key/value pairs. But when it comes to schematic JSON that has semantic value CRDT was an added overhead.

Also not true. They were never perfect for plain strings, even incompatible semantically, but perfect for sets, counters, numbers, where basic commutative math is enough, a few other things and anything that combines those primitives into more complex ones, especially true for schematic structures, tables.

You should really try implementing CRDTs, especially OP-based, they change the way you think about keeping shared state in distributed systems.

> Since the software is primarily a cloud document editor, a central server is necessary even otherwise. So why not use the server for efficient version management and operation sequencing as well? Why chose CRDTs whose bulk of complexity comes from eliminating the need of a central system?

There is zero complexity in CRDTs that comes from eliminating the need of a central system, it's a completely different problem orthogonal to CRDTs. Central systems is where CRDTs are used the most actually.

1 comments

> You can drop them as soon as you either synchronize with other clients or with the server

Say 100 nodes edit a doc. To drop a tombstone, without a central server we need 99 acks. Even with a server, the server keeps an ever-growing list of ops unless it uses some garbage collection techniques. And as stated by OP those techniques have their own problems. Remember, OP relies on V8's hidden classes for deriving an optimal representation without eliminating tombstones. Servers though need not necessarily hold CRDTs in JS. It might be a different VM altogether.

> anything that combines those primitives into more complex ones, especially true for schematic structures, tables.

I'm not implying CRDTs are only for simple structures. Automerge has support for tables too. I'm talking about structures with semantic meaning. In database tables the ops are always inserting/delete columns/rows and updating cells. In HTML tables, it extends into cell-merging, cell-splitting, inner tables and all sorts of valid operations. Merging these operations without inspecting the op and writing a custom modifier - will certainly end up in an invalid table structure.

> Central systems is where CRDTs are used the most actually.

I'm assuming the central system referred here is the database resilience systems using CRDTs to merge. Database syncing without master/slave is exactly the kind of problem best solved with CRDTs. The very point is to eliminate the need for a central point of failure.

OT comes with very different problems when it comes to representing tables. Structured content (tables, tables containing tables, tables containing quotes containing tables, ..) is much better represented using CRDTs.

And another note, Yjs completely supports the OT rich-text type (i.e. delta format [1]). This is how the Quill editor was made collaborative.

So Yjs is at least as powerful as OT. If you want, you can still base your application around OT deltas, but Yjs is IMO much easier to use.

[1]: https://quilljs.com/docs/delta/

> So Yjs is at least as powerful as OT.

Sounds interesting. Will give Yjs a try!

Great! :)
> Say 100 nodes edit a doc. To drop a tombstone, without a central server we need 99 acks. Even with a server, the server keeps an ever-growing list of ops unless it uses some garbage collection techniques. And as stated by OP those techniques have their own problems.

Here's how it could be done (very roughly):

When a client makes a modification, it creates an op representing that modification and includes a list of server identifiers to send the op to. Then queues the op. Once op is sent and ACKed by at least one of the servers the op is dropped from the queue.

Servers receiving the op synchronize the op with each other, ACKing each other and removing ACKed servers from the list. They also synchronize with the clients, sending the op to each client and asking clients to ACK to all servers. Obviously there are multiple ways to do it and it can be almost as efficient as theoretically possibly, but of course you can't avoid some metadata for each client associated with a particular document even though it doesn't have to be kept in server memory, it could be stored on disk.

> I'm talking about structures with semantic meaning.

Semantic meaning generally doesn't map to any data structure directly, CRDTs are not an exception here either.

> Then queues the op. Once op is sent and ACKed by at least one of the servers the op is dropped from the queue.

Unfortunately, it's a lot more complicated than this.

First, you can't wait until just one peer has ACKed -- you need to wait until all of them have. Any peer that hasn't ACKed could still send an op based on the one you want to delete, and then you wouldn't have enough information to merge the new op.

Second, you can't just wait until all peers have ACKed an op, you need to wait until no new ops depend on that op. An op could depend on an op even though it has been ACKed by everyone.

Third, a peer could go offline for a while, and then come back online. But if you are waiting for all peers to ACK an op, you might never get a situation where all peers are online at the same time.