| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by samwillis 1313 days ago

It's the Byzantine Fault Tolerant part of this that is particularly innovative and based on Kleppmanns most recent work. I believe it solves the issue of either malicious actors in the networks modifying others transactions, spoofing them, or the messages being modified by third parties ("outside" the network) who have MITM the connection. These are really great problems to solve.

However, when I was experimenting with CRDTs a while back, it seemed to me the other big issue is where multiple transactions from different users combine to create an invalid state. Most CRDT toolkits are aiming for a combination of a JSON like type structure, with the addition on specific structures for rich text. The JSON structures for general purpose data, and those for rich text as that is the most common current use case. These sort of general purpose CRDTs don't have a way to handle, say, the maximum length of an array/string, or minimum and maximum values for a number. They leave this to the application using the toolkit.

For the Yjs + ProseMirror system, effectively the CRDTs are resolved outside of ProseMirror. Thats useful as they can be combined/resolved on the server without a ProseMirror instance running. However there is a strong possibility that the resulting structure is no longer valid for your ProseMirror Schema, this is currently handled by ProseMirror throwing away the invalid parts of the document when it loads.

What I think is needed is a Schema system that is a layer either on top of these toolkits, or as part of them, that provides rules and conflict resolution. So there is a way to specify the maximum length of an array/string, or what the maximum value of a number is. Effectively generic CRDTs that have an understanding of developer supplied rules and bounds.

The "maximum length" is an interesting one, as depending on the order of transactions you could end up with a different result.

3 comments

tekkk 1313 days ago

Interesting, yeah access-control is kinda open problem with Yjs. Regarding ProseMirror and rich-text documents, you can mess up documents in other ways as well. Eg deploy a faulty command with a transaction that inserts nodes with invalid children (can be prevented though by using createChecked). Or just changing your schema in an incompatible way with the previous version. So you kinda have to deal with possible malformed documents either way.

Havent had documents corrupted by Yjs allowing changes that are not parseable by schema though - has this happened to you?

And about the schema layer on top of Yjs, you possibly could inspect every update and apply some validation rules. Arent all operations just inserts, updates or deletes of nodes? You can at least rollback to previous version as you flush the updates to the doc in the db. Not ideal though.

link

samwillis 1313 days ago

No, I haven't seen a corrupt "unparceable" document. It's more issues around documents that don't comply with the schema.

Say for example you have a <figure> node that can contain a single optional <caption>. If two users concurrently add a caption, then merge their changes, the Yjs document will contain both. It has no concept of what the valid structure is. When this is loaded into the ProseMirror the second caption will be dropped.

link

tekkk 1313 days ago

Hmm hadn't thought about that. Yeah Yjs should use more granular steps instead of just replacing the whole doc each time remote update is received.

link

sambroner 1313 days ago

When I was working on this problem with Fluid Framework, we did a few interesting experiments with "owned objects" and centralized ACL objects. I believe the team primarily implemented centralized ACL because that implementation works for many enterprise use cases.

With a centralized schema provider, you run a connected node on a trusted server and reject changes that are out of schema or should not be accessed by a user.

An owned object is an object where a user (or user group that votes via quorum) that owns the object can veto changes to the object. The changes are temporarily applied until accepted by the owners. I haven't dug deep enough into this BFT implementation to know how our model would map to this model.

link

sroussey 1313 days ago

How does that compare to Google’s Zanzibar?

link

sbazerque 1313 days ago

I think this is the schema system you're asking for: https://www.hyperhyperspace.org/

(specifically, the data representation part)

link

jitl 1313 days ago

This is a cool idea, but I didn't find any examples of max length constraints or other normalization rules in the source code I reviewed. Maybe there's something in there.

Here's some source code for an early, work-in-progress Wiki CRDT: https://github.com/hyperhyperspace/wiki-collab/blob/master/s...

Page in the Wiki. Note that data types have a validate method that returns true or false; maybe if false, they're just dropped from the UI? Not sure how the method is used. https://github.com/hyperhyperspace/wiki-collab/blob/master/s...

I haven't found the underlying text or rich text CRDT implementation yet.

link

jackyzhao 1313 days ago

Hey! Author of the post responding here :) Unfortunately max/min-length is a global invariant and not one that can be reinforced by CRDTs without coordination

bloom-lang.net is a really cool project working on trying to figure out what types of program state actually require coordination at compile time

link

jitl 1313 days ago

Right - I think what Sam is wondering about is if there’s a better way to maintain some of those invariants when decoding the CRDT data or at least preserve user intent with more fidelity by taking the invariants into account. For example, if a ProseMirror schema says “figure can have one caption child”, then a CRDT library can assist by making Child a LWW register instead of an array; or picking the last item in the array instead of the first when limiting length.

link

jackyzhao 1313 days ago

Ah gotcha. This would be a cool area of research! Not sure if I know of many people working on that right now

link

jitl 1313 days ago

To me it seems mostly a composition problem; we need a better way of expressing these kinds of application constraints in a formal schema so we can encode/decode application data structures to CRDTs without requiring invasive intermixing of the two concerns in a way that explodes complexity of both the app and the CRDT details.

link

micahscopes 1313 days ago

Okay wow, it's fun and surreal to see code you're working on pop up in the comment section unexpectedly right in front of your eyes :D

I've been helping Santi on this wiki application and we've actually been dreaming about how nice it'd be to eventually have some kind of schema language for defining Hyper Hyper Space types. In the current version of HHS, data types are fully defined via TypeScript classes, and constraints like that are implemented via the validation method you discovered. If that method returns false, the data won't even be stored or synced, let alone make it to the UI. So for now it's an open ended way of defining constraints, making it easier to explore and experiment. Many of the core HHS data types have constraints built into them too, enforced via the validate method. For example, the CausalSet type can be made to only allow certain elements as members.

Anyway there are a few reasons I can see that it'd be amazing to have something more like JSON Schema. One is that it'd just clarify using these data types outside of TypeScript, and ideally they'd be more succinct to read and write than TypeScript classes.

Another reason I'm especially curious about is the possibility of using it along with some kind of query system, which could potentially make a bunch of stuff easier both both in selecting data for applications and in implementing casual validation logic. Validating multiple interacting causal data types requires looking at chains of events in the DAG. Maybe being able to query for these chains could make this nicer?

All that to say, my guess is that the hardest part of moving to a succinct schema definition format would be how to deal with constraint/validation specification.

> I haven't found the underlying text or rich text CRDT implementation yet.

I think something like this in the works! Right now we're just using last write wins. We did experiment with using Y.js for realtime collaboration, but we ran into some complexity in reliably ensuring that the Y.js state could get persisted to the HHS data types without race conditions, so we took a step back from that idea.

link

jitl 1313 days ago

I encourage you to follow up on the schema language plus a code generator. It’s a powerful technique that can make your implementation more agile as you develop new strategies.

link