by wisty 4728 days ago
> -there are glyphs composed from several codepoints (base glyph + combining accents), which should be treated as one "character"

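One fix is Normalization Form C (NFC), which composes a base character and its combining accents into a single codepoint wherever a precomposed form exists. Sequences with no precomposed form still take several codepoints.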

> there are ligatures that hold single codepoints, but which semantically are multiple "characters"

OK, so use Normalization Form KC (NFKC), which replaces compatibility ligatures such as ﬁ with their component letters and composes accents the same way NFC does.
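For what it's worth, here is a rough sketch of both forms using Python's standard `unicodedata` module. The sample strings and variable names are my own picks for illustration, not anything from the parent comment.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT (two codepoints, one visible glyph)
decomposed = "e\u0301"
# NFC composes the pair into the single precomposed codepoint U+00E9
composed = unicodedata.normalize("NFC", decomposed)

# LATIN SMALL LIGATURE FI (one codepoint, semantically two letters)
ligature = "\ufb01"
# NFKC applies the compatibility decomposition, giving plain 'f' + 'i'
expanded = unicodedata.normalize("NFKC", ligature)

# 'q' + acute has no precomposed form, so NFC leaves it as two codepoints
uncomposable = unicodedata.normalize("NFC", "q\u0301")

print(len(decomposed), len(composed))   # codepoint counts before and after NFC
print(len(ligature), len(expanded))     # codepoint counts before and after NFKC
print(len(uncomposable))                # still more than one codepoint
```

The last case is why normalization alone doesn't settle the "one character" question: it only helps when Unicode happens to define a precomposed codepoint.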

You're right that the "length" of a Unicode string is very ambiguous. Arguably, you shouldn't be able to call "length" without an argument saying what you actually want counted.
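Something like the following is what I have in mind, as a minimal sketch. The `length` function and its unit names are hypothetical, made up for this example; no standard library exposes this exact API. Grapheme clusters are left out because counting them needs segmentation rules that Python's standard library doesn't provide.

```python
def length(text: str, unit: str) -> int:
    """Hypothetical helper: the caller must say what is being counted."""
    if unit == "codepoints":
        return len(text)  # Python str is a sequence of codepoints
    if unit == "utf8_bytes":
        return len(text.encode("utf-8"))
    if unit == "utf16_units":
        # Each UTF-16 code unit is two bytes; "-le" avoids a byte order mark
        return len(text.encode("utf-16-le")) // 2
    raise ValueError(f"unknown unit: {unit!r}")


# 'e' + combining acute accent, followed by an emoji outside the BMP
sample = "e\u0301\U0001F600"

for unit in ("codepoints", "utf8_bytes", "utf16_units"):
    print(unit, length(sample, unit))
```

The same string gives a different answer for each unit, and none of them matches the two glyphs a reader would see.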