by adgar 5404 days ago
No, you're absolutely right that counting by codepoints penalizes some languages. The surrogates I know best are simple, European ones: accents, for example. The normalization twitter uses according to other posters in this thread (the recommended one from the Unicode standards organization, Form C) always uses the multi-codepoint form when possible, for compatibility reasons. That's why it baffles me if they count by codepoint and not character!
1 comments

> The normalization twitter uses according to other posters in this thread always uses the multi-codepoint form when possible

You're confused: that is NFD (Normalization Form D, canonical decomposition). NFC is the result of canonical decomposition followed by canonical composition, so it uses the single precomposed codepoint whenever one exists.
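
To make the difference concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample string (an "e" followed by a combining acute accent) is just an illustration I picked, not anything Twitter-specific.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (two codepoints).
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed codepoint U+00E9.
nfc = unicodedata.normalize("NFC", decomposed)

# NFD splits it back into base letter + combining mark.
nfd = unicodedata.normalize("NFD", nfc)

print(len(nfc), [hex(ord(c)) for c in nfc])  # 1 ['0xe9']
print(len(nfd), [hex(ord(c)) for c in nfd])  # 2 ['0x65', '0x301']
```

So Form C is the one that gives you the shorter, composed sequence; Form D is the multi-codepoint one.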

> That's why it baffles me if they count by codepoint and not character!

Because "characters" are a fuzzy (if not meaningless) concept in Unicode, especially on the implementation side. "Grapheme cluster" is well defined, but most programming languages have little support for it.

Counting codepoints is easy to implement and well defined, and in many cases an NFC codepoint will roughly map onto what users think of as a character.
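
A rough sketch of what "count codepoints after NFC" looks like, again in Python. The helper name `nfc_codepoint_count` is hypothetical, and I have no idea whether Twitter's counter actually works this way; it only illustrates the "roughly" part.

```python
import unicodedata

def nfc_codepoint_count(text: str) -> int:
    """Count codepoints after normalizing to NFC (hypothetical helper)."""
    # len() on a Python 3 str counts codepoints, not bytes or graphemes.
    return len(unicodedata.normalize("NFC", text))

# 'e' + combining acute has a precomposed form, so it counts as one.
print(nfc_codepoint_count("e\u0301"))  # 1

# 'q' + combining tilde has no precomposed form, so NFC leaves two
# codepoints even though a reader sees a single character.
print(nfc_codepoint_count("q\u0303"))  # 2
```

The second case is where codepoint counting and the user's idea of a character part ways, which is the problem grapheme clusters address.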