I think Twitter should use whichever usually gives the user the most characters, to prevent them from getting burnt.
I think in many cases, the normalized form is more permissive, as it puts "character plus diacritic" together into one character. In a language with lots of diacritics, the number of codepoints might be more than the number of normalized characters (depending on the client). You wouldn't want to allow (say) ~70 Korean characters on one OS, and 140 on another, just because they use different codepoints to represent certain characters - one with character then diacritic (2 codepoints?), another with both crammed together in one codepoint.
But as I said, I'm not a Unicode guru (and I don't know much about Korean, I just saw it used as an example). This might be wrong.
No, you're absolutely right that counting by codepoints penalizes some languages. The surrogates I know best are simple, European ones: accents, for example. The normalization Twitter uses, according to other posters in this thread (Form C, the one recommended by the Unicode standards organization), always uses the multi-codepoint form when possible, for compatibility reasons. That's why it baffles me if they count by codepoint and not character!
> The normalization Twitter uses, according to other posters in this thread, always uses the multi-codepoint form when possible
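You're confused: that is NFD (Normalization Form D, canonical decomposition). NFC is the opposite: canonical decomposition followed by canonical composition, so it uses the precomposed single-codepoint form wherever one exists.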
> That's why it baffles me if they count by codepoint and not character!
Because "characters" are a fuzzy (if not meaningless) concept in Unicode, especially when talking about the implementation side. "Grapheme cluster" is well defined, but most languages have little support for it.
Counting codepoints is easy to implement and well defined, and in many cases an NFC codepoint will roughly map onto what users think of as a character.
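For anyone who wants to see the difference, here is a quick Python sketch using the standard `unicodedata` module. The helper name `codepoint_counts` is made up for illustration, and this says nothing about how Twitter actually counts; it only compares codepoint lengths under NFC and NFD.

```python
import unicodedata

def codepoint_counts(text):
    """Return (NFC length, NFD length) of text, measured in codepoints."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return len(nfc), len(nfd)

samples = {
    "e with acute": "\u00e9",             # has a precomposed form
    "Hangul syllable": "\ud55c",          # decomposes into three jamo under NFD
    "q + combining dot above": "q\u0307"  # no precomposed form exists
}

for label, text in samples.items():
    nfc_len, nfd_len = codepoint_counts(text)
    print(f"{label}: NFC={nfc_len} NFD={nfd_len}")
```

The last sample is why the mapping is only "roughly" right: when no precomposed codepoint exists, NFC still leaves two codepoints for what a user sees as one character.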