by masklinn 4947 days ago
> It's a leaky abstraction, you shouldn't need to handle something that is tied to the internal representation of strings in the jvm.

And I'm saying it doesn't really matter, because unicode codepoints are already a form of "leaky abstraction" which you'll have to handle (in that a read/written "character" does not correspond 1:1 to a codepoint anyway). Unicode is a tentative standardization of historical human production, and if you expect that to end up clean and simple you're going to have a hard time.

> Can one "character" span multiple codepoints?

Yes.

> Do you have an example of this?

Devanagari (the script used for e.g. Sanskrit) is full of them. For instance, "sanskrit" is written "संस्कृतम्" [sə̃skɹ̩t̪əm]. If you try to select "characters" in your browser you might get 4 (सं, स्कृ, त and म्) or 5 (सं, स्, कृ, त and म्) or maybe yet another different count, but this is a sequence of 9 codepoints (regardless of the normalization, it's the same in all of NFC, NFD, NFKC and NFKD as far as I can tell):

    स: DEVANAGARI LETTER SA
    ं: DEVANAGARI SIGN ANUSVARA
    स: DEVANAGARI LETTER SA
    ्: DEVANAGARI SIGN VIRAMA
    क: DEVANAGARI LETTER KA
    ृ: DEVANAGARI VOWEL SIGN VOCALIC R
    त: DEVANAGARI LETTER TA
    म: DEVANAGARI LETTER MA
    ्: DEVANAGARI SIGN VIRAMA
Note: I'm not a Sanskrit speaker and I don't actually know devanagari (beyond knowing that it's troublesome for computers, as are jamo) so I can't even tell you how many "symbols" a native reader would see there.
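If you want to check the count yourself, here is a minimal Python sketch using only the standard `unicodedata` module. The word is spelled out as escapes so nothing depends on the editor's encoding, and variable names like `word` and `unchanged` are just picked for the example. It lists each codepoint's name and checks whether any of the four normalization forms changes the string.

```python
import unicodedata

# "संस्कृतम्" written as explicit escapes, one per codepoint
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"

# Look up the Unicode name of every codepoint in the string
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")

# Compare the string against each normalization form
forms = ("NFC", "NFD", "NFKC", "NFKD")
unchanged = all(unicodedata.normalize(form, word) == word for form in forms)

# len() on a Python 3 str counts codepoints, not bytes or glyphs
print(len(word), unchanged)
```

Python 3 strings are indexed by codepoint, so `len(word)` is the codepoint count and says nothing about how many "characters" a reader sees.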

That's quite interesting, i had no idea! What i was hoping for was some kind of term for one character or symbol and use that as a unit, but perhaps it's impossible to create an abstraction like that.

I'm curious if a Sanskrit speaker would see each of the codepoints as a symbol or not.

Edit: thinking about it, i guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer..

> What i was hoping for was some kind of term for one character or symbol and use that as a unit

There is one, kind-of: "grapheme cluster"[0]. This is the "unit" used by UAX29 to define text segmentation, and it serves as an alias for "user-perceived character"[1].

Most languages and APIs don't really consider them (although they crop up often in e.g. browser bug trackers), let alone provide first-class access to them. One of the very few APIs which actually acknowledges them is Cocoa's NSString, and Apple provides a document explaining grapheme clusters and how they relate to NSString[2]. NSString has very good unicode support (probably the best I know of, though Factor may have an even better one[3]). It handles grapheme clusters by providing messages which work on codepoint ranges in an NSString; it doesn't treat clusters as first-class objects.
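To give a rough feel for what a cluster is, here is a deliberately simplified Python sketch. It is not the UAX29 algorithm: it only attaches every combining mark (general category starting with "M") to the character before it, and the helper name `approx_clusters` is made up for this example. Real segmentation has many more rules (jamo, emoji sequences, and so on), so treat this as an approximation under that one assumption.

```python
import unicodedata

def approx_clusters(text):
    """Group each combining mark with the preceding character.

    Simplified approximation only, NOT the full UAX29 rules.
    """
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me are the combining marks
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "संस्कृतम्" written as explicit escapes, one per codepoint
word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924\u092e\u094d"
clusters = approx_clusters(word)
print(len(word), len(clusters))
```

On this word the sketch produces the 5-way split mentioned earlier in the thread (सं, स्, कृ, त, म्). A renderer or a full UAX29 implementation may join the virama with the following consonant and come up with a different count.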

> i guess if you asked a Sanskrit speaker how long a word/sentence was, you'd get the answer..

Indeed.

[0] http://www.unicode.org/glossary/#grapheme_cluster

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...

[2] https://developer.apple.com/library/mac/#documentation/Cocoa...

[3] the original implementor detailed his whole route through creating factor's unicode library, and I learned a lot from it: http://useless-factor.blogspot.be/search/label/unicode

Very interesting, going to read through that guy's blog. Thanks for the links!