

by btn 4772 days ago

this is perfectly adequate to support the characters from all major languages in use worldwide, plus emoji

Most emoji characters are in supplementary-plane blocks such as U+1F300..U+1F5FF, which lie outside the BMP (U+0000..U+FFFF), so they are encoded as surrogate pairs in UTF-16.
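To make the surrogate-pair point concrete, here is a minimal Python sketch. The helper name `utf16_code_units` is my own, purely for illustration; it just splits the UTF-16 encoding of a string into 16-bit code units.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


cyclone = "\U0001F300"  # U+1F300, the first code point of the range above

print(len(cyclone))                                  # code points: 1
print([hex(u) for u in utf16_code_units(cyclone)])   # two code units (a surrogate pair)
print([hex(u) for u in utf16_code_units("A")])       # a BMP character needs only one
```

One code point, two UTF-16 code units: any code that indexes or truncates by code unit can cut the pair in half.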

Also, even within the BMP, assuming that one 16-bit UTF-16 code unit is the same as one "character" is dangerous: combining characters, for example, should arguably not be separated from their base character.
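A rough Python illustration of the combining-character problem, using only BMP code points. The `clusters` helper is a hypothetical simplification: it only glues characters with a non-zero canonical combining class onto the preceding character, and is not a full grapheme cluster segmentation as defined in UAX #29.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplified sketch: only checks the canonical combining class, so it
    will not handle every case that real grapheme segmentation covers.
    """
    out: list[str] = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to its base character
        else:
            out.append(ch)
    return out


word = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'

print(len(word))       # 3 code points, all inside the BMP
print(word[:1])        # naive truncation keeps 'e' and drops the accent
print(clusters(word))  # the accent stays with its base character
```

Here every code point fits in a single 16-bit unit, and slicing by unit still separates the accent from its base.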

If unicode had stuck to representing existing characters and symbols and said no to requests for stuff like emoji and klingon, string representation in modern software could have been kept a whole lot simpler.

Emoji and Klingon are trivial to deal with in comparison to the issues that can appear with "real" languages when you start making assumptions (cf. combining characters, normalisation forms, CJK unification and compatibility blocks, presentation forms, collation, etc.)
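As a small sketch of just two items on that list (normalisation forms and compatibility characters), using Python's standard `unicodedata` module. The sample strings are my own choice for illustration.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, two code points

# Visually identical strings, but a plain comparison says they differ.
print(composed == decomposed)                                # False

# Normalising both sides to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Compatibility characters: canonical normalisation (NFC) leaves the
# ligature alone, while compatibility normalisation (NFKC) expands it.
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == ligature)    # True
print(unicodedata.normalize("NFKC", ligature))               # fi
```

None of this involves anything outside the BMP, which is rather the point.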