by dale-cooper
4948 days ago
Yes. What I'm saying is that it would feel less error-prone if the character object were actually a codepoint. It's a leaky abstraction: you shouldn't need to handle something that is tied to the internal representation of strings in the JVM. Can one "character" span multiple codepoints? Do you have an example of this?
|
And I'm saying it doesn't really matter, because Unicode codepoints are already a form of "leaky abstraction" which you'll have to handle (in that a read/written "character" does not correspond 1:1 to a codepoint anyway). Unicode is an attempt at standardizing historical human production, and if you expect that to end up clean and simple you're going to have a hard time.
> Can one "character" span multiple codepoints?
Yes.
> Do you have an example of this?
Devanagari (the script used for e.g. Sanskrit) is full of them. For instance, "Sanskrit" is written "संस्कृतम्" [sə̃skɹ̩t̪əm]. If you try to select "characters" in your browser you might get 4 (सं, स्कृ, त and म्) or 5 (सं, स्, कृ, त and म्), or maybe yet another count, but this is a sequence of 9 codepoints (regardless of normalization; it's the same in all of NFC, NFD, NFKC and NFKD as far as I can tell): U+0938 U+0902 U+0938 U+094D U+0915 U+0943 U+0924 U+092E U+094D.
Note: I'm not a Sanskrit speaker and I don't actually know Devanagari (beyond knowing that it's troublesome for computers, as are jamo), so I can't even tell you how many "symbols" a native reader would see there.
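If you want to check the codepoint count yourself, here is a minimal sketch in Python (my choice for brevity; the thread is about the JVM, and nothing here is JVM-specific). It builds the word from explicit escapes so the input is unambiguous, lists each codepoint with its Unicode name, and counts codepoints under each normalization form. The variable names are just illustrative. It does not try to count user-perceived "characters" (grapheme clusters), since the standard library has no segmentation support for that.

```python
import unicodedata

# The Devanagari word for "Sanskrit", written as explicit codepoint escapes.
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"

# One entry per codepoint: ("U+XXXX", official Unicode character name).
codepoints = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

# Number of UTF-16 code units, which is what String.length() counts on the JVM.
utf16_units = len(word.encode("utf-16-le")) // 2

# Codepoint count after each of the four normalization forms.
normalized_lengths = {
    form: len(unicodedata.normalize(form, word))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for code, name in codepoints:
    print(code, name)
print("codepoints:", len(codepoints))
print("UTF-16 code units:", utf16_units)
print("codepoints per normalization form:", normalized_lengths)
```

Every codepoint in this word sits in the Basic Multilingual Plane, so the UTF-16 code unit count and the codepoint count agree here. The mismatch with what a reader sees comes entirely from combining signs (anusvara, virama, vowel sign), not from surrogate pairs.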