> It's a leaky abstraction, you shouldn't need to handle something that is tied to the internal representation of strings in the jvm.

And I'm saying it doesn't really matter, because Unicode codepoints are already a form of "leaky abstraction" which you'll have to handle (in that a read/written "character" does not correspond 1:1 to a codepoint anyway). Unicode is a tentative standardization of historical human production, and if you expect that to end up clean and simple you're going to have a hard time.

> Can one "character" span multiple codepoints?

Yes.

> Do you have an example of this?

Devanagari (the script used for e.g. Sanskrit) is full of them. For instance, "Sanskrit" is written "संस्कृतम्" [sə̃skɹ̩t̪əm]. If you try to select "characters" in your browser you might get 4 (सं, स्कृ, त and म्) or 5 (सं, स्, कृ, त and म्) or maybe yet another count, but this is a sequence of 9 codepoints (regardless of the normalization, it's the same in all of NFC, NFD, NFKC and NFKD as far as I can tell):

स: DEVANAGARI LETTER SA
ं: DEVANAGARI SIGN ANUSVARA
स: DEVANAGARI LETTER SA
्: DEVANAGARI SIGN VIRAMA
क: DEVANAGARI LETTER KA
ृ: DEVANAGARI VOWEL SIGN VOCALIC R
त: DEVANAGARI LETTER TA
म: DEVANAGARI LETTER MA
्: DEVANAGARI SIGN VIRAMA
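
If you want to check that list yourself, here is a minimal sketch in Python (my choice of language, nothing above depends on it) using only the standard `unicodedata` module. The word is built from escapes so that copy/paste can't mangle the combining marks, and the variable names are purely illustrative.

```python
import unicodedata

# "संस्कृतम्" spelled out as explicit escapes, one per codepoint
word = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924\u092E\u094D"

# Official Unicode name of every codepoint, in order
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")

# In Python 3, len() on a str counts codepoints
print(len(word))  # → 9

# Check that none of the four normalization forms changes the string
unchanged = all(
    unicodedata.normalize(form, word) == word
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(unchanged)  # → True

# Number of UTF-16 code units, which is what a JVM string length reports
utf16_units = len(word.encode("utf-16-le")) // 2
print(utf16_units)  # → 9
```

Every codepoint here is in the Basic Multilingual Plane, so the UTF-16 code unit count matches the codepoint count. None of these numbers says how many "characters" a reader perceives: that takes grapheme cluster segmentation, which the Python standard library does not provide.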

Note: I'm not a Sanskrit speaker and I don't actually know Devanagari (beyond knowing that it's troublesome for computers, as are jamo), so I can't even tell you how many "symbols" a native reader would see there.

I'm curious whether a Sanskrit speaker would see each of the codepoints as a symbol or not.

Edit: thinking about it, I guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer.