by DinaCoder98 822 days ago
I was pessimistic about a grapheme-based approach to text, deleted it to research more, and I've come to the conclusion that this is simply not a consensus opinion. Can you give me an example where grapheme-based sorting makes a critical difference from code-point-based sorting on normalized text? Full Unicode composition certainly seems to provide a reasonable solution for Western languages, CJK characters, and romanization of CJK characters, but that leaves a hell of a lot of scripts that I don't know about.

I mean, Unicode is incredibly complex, but it doesn't even seem like there's a consensus outside of Swift's string implementation of what a grapheme even is.

(Granted, this might support the above point that people can't even agree on what a string is, but Unicode code points seem like a reasonable baseline to expect from a modern language. That said, Rust doesn't even include Unicode normalization in the standard library, although the common crate for it seems like a reasonable solution.)
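
To make the code point vs. grapheme distinction concrete, here is a minimal Python sketch. The helper `rough_clusters` is hypothetical and only a crude approximation (it glues combining marks onto the preceding code point); it is not the Unicode grapheme cluster algorithm. It shows that NFC composition collapses some sequences to one code point but leaves others, such as a Thai consonant with a vowel sign and a tone mark, as several.

```python
import unicodedata

def rough_clusters(text):
    """Group each code point with the combining marks that follow it.

    Crude approximation for illustration only; real grapheme
    segmentation has many more rules than this.
    """
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        # Mn/Mc/Me are the general categories for combining marks.
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# e + COMBINING ACUTE ACCENT: a precomposed form (U+00E9) exists.
e_acute = unicodedata.normalize("NFC", "e\u0301")
# q + COMBINING ACUTE ACCENT: no precomposed form, so NFC keeps two code points.
q_acute = unicodedata.normalize("NFC", "q\u0301")
# KO KAI + SARA U + MAI EK: one written unit, three code points.
thai = unicodedata.normalize("NFC", "\u0e01\u0e38\u0e48")

for label, s in (("e_acute", e_acute), ("q_acute", q_acute), ("thai", thai)):
    print(label, "code points:", len(s), "rough clusters:", len(rough_clusters(s)))
```

So even on normalized text, anything that works per code point (truncation, reversal, counting) can split what a reader sees as a single unit.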

2 comments

The issue I am aware of is with the Thai language, which has zero-width Unicode code points that get superimposed on the non-zero-width code point preceding them (or, if none is present, on an 'empty' non-zero-width placeholder). A non-zero-width code point can have multiple zero-width code points following it. (In Thai, no more than 2 for morphemically correct words.) For code point sorting to be correct, the order of these zero-width code points has to be normalized first. The standard practice in Thai is to have vowel signs before tone markers.

In recent years, application support for this has greatly improved.

> it doesn't even seem like there's a consensus outside of Swift's string implementation of what a grapheme even is.

Linguistically it's easy: graphemes are the squiggles people actually draw, as distinct from how a machine encodes them. Of course, since people aren't a single individual with just one consistent opinion, there's room for nuance: some people might see a given mark as one squiggle and others as two separate squiggles.