
by chrismorgan 31 days ago

A CRDT library working at the code unit level? Ouch. Of course that’s going to go wrong, it was inevitable.

As for using extended grapheme clusters, it sounds a little bit iffy—maybe possible to use correctly, maybe not, because they’re not stable over time. That style of thing has created some fascinating bugs, like (a few years ago) index corruption in PostgreSQL due to collation changes.

Unicode scalar values are technically-safe: you can’t introduce invalid Unicode. But you can definitely still end up with nonsense.
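
A rough JavaScript sketch of that distinction (the sample string and the helper names `sliceByScalar` and `hasLoneSurrogate` are made up for illustration, and the input is assumed to be well-formed to begin with): slicing by scalar value cannot produce an unpaired surrogate, but it can still cut a flag in half.

```javascript
// Hypothetical helper: slice by Unicode scalar values rather than UTF-16 code units.
// Assumes the input contains no lone surrogates to begin with.
function sliceByScalar(text, start, end) {
  // Array.from iterates a string by code point, so surrogate pairs stay together.
  return Array.from(text).slice(start, end).join("");
}

// Hypothetical helper: true if the string contains an unpaired surrogate code unit.
function hasLoneSurrogate(text) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(text);
}

const flag = "\u{1F1EB}\u{1F1F7}"; // two regional indicators, displayed as one flag

const byScalar = sliceByScalar(flag, 0, 1); // one whole regional indicator (valid, but nonsense)
const byCodeUnit = flag.slice(0, 1);        // half of a surrogate pair (invalid)

console.log(hasLoneSurrogate(byScalar));   // false
console.log(hasLoneSurrogate(byCodeUnit)); // true
```

The scalar slice is valid Unicode, yet a single regional indicator is not something the user ever typed.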

> We made emoji an atomic node type.

That avoids problems for emoji, but leaves the underlying hazard untouched. I imagine it could still theoretically occur with other text, probably CJK. But probably only theoretically.

> This splits by grapheme clusters rather than code units. No orphaned surrogates, no split emoji. It's what .slice() should have been doing all along, but of course UTF-16 predates emoji by decades.

I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.
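
For reference, the kind of grapheme-based slicing being discussed might look roughly like this. It is a sketch using the built-in `Intl.Segmenter`; the helper name `sliceByGrapheme` is hypothetical, and where the cluster boundaries fall depends on the Unicode data shipped with the runtime.

```javascript
// Hypothetical helper: slice by extended grapheme clusters.
// Boundaries come from the runtime's Unicode data, so they can differ between versions.
function sliceByGrapheme(text, start, end) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // Each item yielded by segment() has a `segment` property holding one cluster.
  const clusters = Array.from(segmenter.segment(text), (s) => s.segment);
  return clusters.slice(start, end).join("");
}

// "e" + combining acute accent, then a flag (two regional indicators), then "!"
const sample = "e\u0301\u{1F1EB}\u{1F1F7}!";

const firstTwo = sliceByGrapheme(sample, 0, 2); // accented e plus the whole flag
const codeUnits = sample.slice(0, 2);           // just "e" and its combining accent
```

The same indices select very different text depending on the unit, which is part of why a general-purpose slice() has to pick one unit and document it.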

UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough). But the concept of multiple scalars contributing to a single logical unit was inevitable.

5 comments

rectang 31 days ago

> I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough

Surely certain people did know, but those people weren't in a position to do anything about it.

Specifically, there were surely people who knew that the original "Unicode" (it wasn't called UCS-2 yet) was insufficient for complete expression of Asian languages, because historical Chinese place names, Japanese nicknames, and so on were not included in it.

There were also many people who objected to Han unification, which is a different problem.

But all of these objections were discarded because of the overwhelming mandate for a fixed-width encoding. The original "Unicode" was conceived as a "16-bit" initiative. Its 16-bit-ness was an essential aspect of the design and the Unicode Consortium did what they had to do to fit all scripts and characters "in modern use" into 16 bits.

From the Wikipedia article on Han Unification[1]:

> Some of the controversy stems from the fact that the very decision of performing Han unification was made by the initial Unicode Consortium, which at the time was a consortium of North American companies and organizations (most of them in California), but included no East Asian government representatives. The initial design goal was to create a 16-bit standard, and Han unification was therefore a critical step for avoiding tens of thousands of character duplications.

[1] https://en.wikipedia.org/wiki/Han_unification


jcranmer 31 days ago

Han unification predates Unicode by about a decade; most of the early work in Unicode largely consists of copy-pasting the Japanese and Chinese governments' standards for unified CJK ideographs. Indeed, read some of the early histories of Han unification (e.g., https://www.unicode.org/versions/Unicode16.0.0/core-spec/app...), and you'll notice that there's a lot of liaising with East Asian technology groups in East Asian cities going on. I don't think any East Asian government representatives would have actually objected to Han unification!

It's also worth noting that the original goal of Unicode wasn't to be able to faithfully represent all text, but rather to faithfully represent existing character sets. Only later do you get the impetus to actually include everything, as people become a lot less tolerant of "computer can't actually represent <X>" scenarios. Note too that a lot of the Han unification criticisms basically fall into the same bucket as those of, say, medievalists, who want to preserve certain details of their source texts more faithfully than was the norm for computer systems in the 1980s.


chrismorgan 30 days ago

There was never an adequate safety margin for anything but immediate (less than five year horizon) use—even at Unicode 1.1 it was more than half full, and they knew they weren’t done. And yet all kinds of major companies put all their eggs in that basket, and then doubled down with the monstrosity that is UTF-16, rather than backing out and going with UTF-8 instead, even though I strongly suspect it would have been easier for everyone involved in most cases, compared to the whole wchar shemozzle. Instead it took Windows twenty-five years to bridge the gap with a UTF-8 codepage (65001) that actually worked.


georgemandis 31 days ago

>I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

Yeah, I think that's fair. I didn't really think this through as I was writing it.

I'm not even so sure "ending up with nonsense" here is the worst outcome. It might be unavoidable with this approach and if that had been the only problem this bug might have been less memorable.

The real problem, which I admittedly didn't articulate or emphasize particularly well, was that these invalid surrogate pairs were getting passed into `encodeURIComponent` somewhere deep in the stack, which choked catastrophically on them. That was the "real" bug at the end of the day, but the invalid surrogate pairs and the way they were getting created on the way were a fun journey to untangle.
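
A minimal reproduction of that failure mode, as a sketch (the variable names are illustrative, not taken from the actual stack): a code-unit slice through an emoji leaves a lone surrogate, and `encodeURIComponent` throws a `URIError` on it.

```javascript
const emoji = "\u{1F600}";        // one scalar value, two UTF-16 code units
const broken = emoji.slice(0, 1); // code-unit slice keeps only the high surrogate

let errorName = null;
try {
  encodeURIComponent(broken);     // a lone surrogate cannot be encoded as UTF-8
} catch (err) {
  errorName = err.name;
}

const encoded = encodeURIComponent(emoji); // the intact pair encodes without error

console.log(errorName); // "URIError"
console.log(encoded);   // "%F0%9F%98%80"
```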


mweidner 28 days ago

A CRDT that operates on code units should work out okay, because each grapheme cluster will always be inserted and deleted in a single edit - hence it should stick together in the text. (Some CRDTs actually can mess this up by interleaving concurrently inserted code units, but Yjs avoids doing so.)

From the fix PR, I believe the issue in this case was with the insertion operations passed to the CRDT, not the CRDT itself. Specifically, Yjs's ProseMirror integration infers what text was inserted by diffing before and after states, instead of directly capturing user inputs (even though those are provided by ProseMirror transactions). The diff algorithm, lib0/diff, was not grapheme aware and hence could generate an inaccurate diff containing lone surrogates.
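
To illustrate how a code-unit diff can go wrong, here is a sketch of a naive prefix/suffix diff. To be clear, `naiveDiff` is a hypothetical function written for this comment, not the lib0/diff implementation; it only shows the general hazard.

```javascript
// Hypothetical code-unit diff: strip the common prefix and suffix, report what is left.
// Illustration only; this is NOT how lib0/diff is actually written.
function naiveDiff(before, after) {
  const max = Math.min(before.length, after.length);

  // Count matching code units from the start.
  let prefix = 0;
  while (prefix < max && before[prefix] === after[prefix]) prefix++;

  // Count matching code units from the end, without overlapping the prefix.
  let suffix = 0;
  while (
    suffix < max - prefix &&
    before[before.length - 1 - suffix] === after[after.length - 1 - suffix]
  ) {
    suffix++;
  }

  return {
    index: prefix,
    remove: before.slice(prefix, before.length - suffix),
    insert: after.slice(prefix, after.length - suffix),
  };
}

// U+1F600 and U+1F601 share the same high surrogate (0xD83D) in UTF-16.
const change = naiveDiff("a\u{1F600}", "a\u{1F601}");

console.log(change.index);                           // 2
console.log(change.remove === "\uDE00");             // true: a lone low surrogate
console.log(change.insert === "\uDE01");             // true: a lone low surrogate
```

Because the two emoji share a high surrogate, the common prefix ends in the middle of a pair, and both halves of the reported edit are lone surrogates. A surrogate-aware variant would move the boundary back by one code unit whenever it lands inside a pair.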

Operating on code units is convenient in JavaScript because then your CRDT's `length` matches the language's `String.length`, and likewise for indexed access.
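
A small sketch of that convenience (the helper `scalarToCodeUnitIndex` is hypothetical, shown only to make the cost of the alternative visible): with code-unit positions you can index the string directly, whereas scalar-value positions need a linear scan to translate.

```javascript
const text = "a\u{1F600}b";

// Code-unit positions line up with the language's own length and indexing.
const unitLength = text.length;               // 4
const scalarLength = Array.from(text).length; // 3

// Hypothetical helper: translate a scalar-value index into a code-unit index.
function scalarToCodeUnitIndex(str, scalarIndex) {
  let units = 0;
  let seen = 0;
  for (const ch of str) {          // for...of iterates by code point
    if (seen === scalarIndex) break;
    units += ch.length;            // 1 or 2 code units per scalar value
    seen++;
  }
  return units;
}

const unitIndexOfB = scalarToCodeUnitIndex(text, 2); // 3
console.log(text[unitIndexOfB]);                     // "b"
```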


layer8 31 days ago

> UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough).

ISO 10646 (“Universal Coded Character Set”) planned for 31-bit code points from the start (128 groups of 256 planes of 256 rows of 256 cells, with UCS-4 as a four-byte encoding), around 1989. Unicode, on the other hand, was a parallel effort initiated by Xerox and Apple a few years earlier, with more pragmatic aims, defining a 16-bit character set (but no encoding) that would allow round-tripping of existing character sets. For Unicode 1.1, it was decided to align it with ISO 10646 and make it coincide with the latter’s first plane (the BMP) and UCS-2. In Unicode 2.0, surrogate pairs and the UTF-16 encoding were introduced to allow future expansion to additional planes, in a way that would be compatible with existing implementations. Only with Unicode 3.1 in 2001, five years after Unicode 2.0 and ten years after Unicode 1.0, were actual characters assigned beyond the BMP.

History is complicated; aims, incentives, and constraints change over time.


ucarion 30 days ago

> I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

Maybe a simpler argument against this idea is that the definition of an extended grapheme cluster changes between versions of Unicode. The relevant standard is on its 47th revision (not all of which change extended grapheme clusters, but many do): https://www.unicode.org/reports/tr29/
