| HN Mirror

by btn 4551 days ago

Counting graphemes may be over-used, but needing to know their boundaries is important (and leads naturally to counting). For example, when you hit "delete" in a text editor, you'll probably want it to delete whole graphemes (and similarly for text selection); if you're doing text truncation, you may measure it by pixels, but you'll want to chop off the excess bytes at a grapheme boundary.
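The truncation case can be sketched roughly as below. This is a minimal illustration, not a full UAX #29 segmenter: it assumes a "cluster" is a base code point plus any combining marks that follow it (general categories Mn, Mc, Me), and the helper names `approx_clusters` and `truncate_utf8` are made up for the example.

```python
import unicodedata

def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this only approximates grapheme clusters; real code should
    follow the full UAX #29 rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters

def truncate_utf8(text, max_bytes):
    """Keep as many whole clusters as fit in max_bytes of UTF-8."""
    kept = []
    used = 0
    for cluster in approx_clusters(text):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break  # never cut inside a cluster
        kept.append(cluster)
        used += size
    return "".join(kept)
```

With a 5-byte budget, `truncate_utf8("cafe\u0301s", 5)` stops before the "e" plus combining acute instead of leaving a bare "e" or half an encoded mark behind.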

> in the unlikely case I had to support Tamil or Korean for such a specialistic case.

Why is it "unlikely" that you would want your software to support users of other languages?

2 comments

pavlov 4551 days ago

In the case of a delete action in a text editor, are you sure that deleting the whole grapheme is actually what the Tamil or Korean user wants?

You mentioned the following examples in your grandparent post:

- 'நி' (Tamil letter NA + Tamil Vowel Sign I)

- Hangul made of conjoining Jamo (such as '깍': 'ᄁ' + 'ᅡ' + 'ᆨ')

I don't speak either language, but it doesn't seem unreasonable to me that pressing Delete would delete just the vowel sign in Tamil, or just the last component within the Hangul character. In fact, that might be just what the user wants?

taejo 4550 days ago

> I don't speak either language, but it doesn't seem unreasonable to me that pressing Delete would delete just the vowel sign in Tamil, or just the last component within the Hangul character. In fact, that might be just what the user wants?

My Korean is pretty poor, but I think that's exactly what one wants. If you mistype a letter, you want to retype that letter, not the whole syllable. However, this should work uniformly: it shouldn't matter if the syllable is represented as a single codepoint or made up of conjoining jamo.
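One way to get that uniform behaviour, sketched with Python's standard `unicodedata` module: decompose to NFD, drop the final code point, and recompose to NFC. The function name `delete_last_component` is hypothetical, and as a simplifying assumption it normalizes the whole string, which a real editor would probably restrict to the character next to the cursor.

```python
import unicodedata

def delete_last_component(text):
    """Remove the last decomposed code point of the text.

    Works the same whether the final syllable is stored precomposed
    (one code point) or as a sequence of conjoining jamo.
    Assumption: normalizing the entire string is acceptable here.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed[:-1])

precomposed = "\uae4d"            # the syllable from the example above
conjoining = "\u1101\u1161\u11a8"  # the same syllable as three jamo

# Both representations lose only the final consonant.
assert delete_last_component(precomposed) == "\uae4c"
assert delete_last_component(conjoining) == "\uae4c"
```

The same function also strips a trailing combining mark from other scripts, which matches the "delete just the vowel sign" behaviour discussed for Tamil.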

yew 4550 days ago

If the Hangul and Tamil constructs are anything like ligatures (e.g. fi in the Latin alphabet), I would imagine that's the case most of the time. Plus lots of special rules for which glyphs to treat as single symbols and which to decompose (e.g. & is technically a ligature but almost never decomposed).

btn 4550 days ago

> are you sure that deleting the whole grapheme is actually what the Tamil or Korean user wants?

I'm not, but I think it's the only sane thing for a text editor to do if you don't want it to incorporate a ton of language-specific rules. The UAX (UAX #29) actually does make a distinction between "legacy" and "extended" grapheme clusters: if you're handling "delete", you'll want to use "legacy clusters" to separate the two Tamil marks, but for text selection, using "extended clusters" will combine them (it's a little bit more complicated than that, but there are properties of Unicode that allow you to handle the "preferred" method for editing a script, while remaining mostly language-agnostic).
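A rough sketch of that distinction using only Python's standard library. It assumes that general category is a good enough stand-in for the real Grapheme_Cluster_Break property: nonspacing and enclosing marks (Mn, Me) attach in both modes, while spacing marks (Mc) attach only in "extended" mode. The actual property has exceptions this ignores, and the function name `clusters` is just for illustration.

```python
import unicodedata

def clusters(text, extended=True):
    """Approximate legacy vs. extended grapheme clusters.

    Assumption: Mn/Me stand in for Extend and Mc stands in for SpacingMark;
    the real UAX #29 property tables differ in some cases.
    """
    attach = {"Mn", "Me"}
    if extended:
        attach.add("Mc")
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in attach:
            out[-1] += ch  # mark joins the preceding cluster
        else:
            out.append(ch)
    return out

tamil_ni = "\u0ba8\u0bbf"  # Tamil letter NA + Tamil vowel sign I

# Delete would step over two units, selection over one.
legacy_units = clusters(tamil_ni, extended=False)
extended_units = clusters(tamil_ni, extended=True)
```

Here `legacy_units` holds the letter and the vowel sign separately, while `extended_units` holds them as a single cluster.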

Hangul is trickier, but input happens through an IME that "composes" the characters before they are committed to the editor. The IME will perform component-wise deletion, but once it's committed, the editor will operate on the grapheme. It's not a perfect solution, but keeping the composition/decomposition rules for the language in the IME seems preferable.
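As a toy model of that split of responsibilities (not how any real IME is implemented): the class below keeps uncommitted jamo in a buffer where backspace removes one component, and once text is committed, backspace removes the whole syllable. `ToyHangulIME` and its methods are hypothetical names, and it assumes the input arrives as conjoining jamo that NFC can compose.

```python
import unicodedata

class ToyHangulIME:
    def __init__(self):
        self.pending = []    # jamo typed but not yet committed
        self.committed = ""  # text already handed to the editor

    def type_jamo(self, jamo):
        self.pending.append(jamo)

    def preedit(self):
        # What the user sees while composing: the jamo composed by NFC.
        return unicodedata.normalize("NFC", "".join(self.pending))

    def backspace(self):
        if self.pending:
            self.pending.pop()  # component-wise while composing
        elif self.committed:
            # Assumption: committed syllables are precomposed, so dropping
            # one code point removes one whole syllable.
            self.committed = self.committed[:-1]

    def commit(self):
        self.committed += self.preedit()
        self.pending = []
```

Typing three jamo and pressing backspace before committing leaves a two-jamo syllable in the preedit text; pressing it after committing removes the entire syllable.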

giovannibajo1 4549 days ago

> Why is it "unlikely" that you would want your software to support users of other languages?

I was specifically referring to the use case of translating a command-line usage text (à la --help). I'd assume that translating that into Tamil is not exactly common (statistically speaking), or otherwise all getopt()-like libraries would already support this for me.
