| A CRDT library working at the code unit level? Ouch. Of course that’s going to go wrong, it was inevitable. As for using extended grapheme clusters, it sounds a little bit iffy: maybe possible to use correctly, maybe not, because they’re not stable over time. That style of thing has created some fascinating bugs, like (a few years ago) index corruption in PostgreSQL due to collation changes. Unicode scalar values are technically safe: you can’t introduce invalid Unicode. But you can definitely still end up with nonsense. > We made emoji an atomic node type. That avoids problems for emoji, but leaves the underlying hazard untouched. I imagine it could still theoretically occur with other text, probably CJK. But probably only theoretically. > This splits by grapheme clusters rather than code units. No orphaned surrogates, no split emoji. It's what .slice() should have been doing all along, but of course UTF-16 predates emoji by decades. I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split. UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough). But the concept of multiple scalars contributing to one logical unit was always inevitable. |
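As an aside, the code unit versus scalar value distinction in the comment above can be made concrete with a rough sketch. This is only an illustration, not anything from the CRDT library being discussed: the sample string and all variable names are made up, and since Python's standard library has no grapheme cluster segmentation, that third level is left out.

```python
import unicodedata

# Hypothetical sample: "e" + COMBINING ACUTE ACCENT (displayed as one "é"),
# followed by U+1F600, which lies outside the Basic Multilingual Plane.
text = "e\u0301\U0001F600"

# Level 1: UTF-16 code units. U+1F600 needs a surrogate pair, so the
# three scalar values become four 16-bit code units.
units = text.encode("utf-16-le")
code_unit_count = len(units) // 2

# Keep only the first 3 code units: the cut lands between the surrogates.
truncated = units[:6]
try:
    truncated.decode("utf-16-le")
    code_unit_slice_valid = True
except UnicodeDecodeError:
    code_unit_slice_valid = False  # orphaned high surrogate

# Level 2: scalar values. A Python str is indexed by code point, so slicing
# this string cannot cut a scalar value in half.
head, tail = text[:1], text[1:]
tail_utf8 = tail.encode("utf-8")  # still encodes cleanly

# The result is well-formed but not meaningful: the accent has been
# separated from its base letter and now leads the second piece.
tail_starts_with_combining_mark = unicodedata.combining(tail[0]) != 0
```

The code unit slice fails to decode at all, while the scalar value slice stays well-formed but leaves a bare "e" on one side and a stray combining accent on the other, which is the "technically safe, still nonsense" case.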
Surely certain people did know, but those people weren't in a position to do anything about it.
Specifically, there were surely people who knew that, because historical Chinese place names, Japanese nicknames, and so on were not included in the original "Unicode" (it wasn't called UCS-2 yet), it was insufficient for complete expression of Asian languages.
There were also many people who objected to Han unification, which is a different problem.
But all of these objections were discarded because of the overwhelming mandate for a fixed-width encoding. The original "Unicode" was conceived as a "16-bit" initiative. Its 16-bit-ness was an essential aspect of the design and the Unicode Consortium did what they had to do to fit all scripts and characters "in modern use" into 16 bits.
From the Wikipedia article on Han Unification[1]:
> Some of the controversy stems from the fact that the very decision of performing Han unification was made by the initial Unicode Consortium, which at the time was a consortium of North American companies and organizations (most of them in California), but included no East Asian government representatives. The initial design goal was to create a 16-bit standard, and Han unification was therefore a critical step for avoiding tens of thousands of character duplications.
[1] https://en.wikipedia.org/wiki/Han_unification