by mehrdadn
2869 days ago
Where are you getting this information, though? I haven't worked out the bits myself yet, but Wikipedia's first sentence itself says UTF-16 can encode all 1,112,064 valid code points of Unicode, which is already more than 2^(10+10) = 1,048,576.
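UTF-16: Encodes the whole Unicode range 00000000 to 0010FFFF (which needs 21 bits, although not every 21-bit value is used), encoding most of the first 0000 to FFFF range as-is, and using surrogate pairs taken from the reserved D800 to DFFF part of that range to encode 00010000 to 0010FFFF. The latter range is shifted to 00000000 to 000FFFFF before encoding, which fits exactly in the 20 bits (10 + 10) that a surrogate pair provides. So the 2^20 = 1,048,576 values reachable through surrogate pairs come on top of the 65,536 - 2,048 = 63,488 non-surrogate values that are encoded as-is, and 63,488 + 1,048,576 = 1,112,064. This is a subtlety that one likely does not appreciate if one learns UTF-8 first and expects UTF-16 to be like it.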
UTF-8: Could originally encode 00000000 to 7FFFFFFF, using sequences of up to 6 bytes, but since Unicode was limited to just the first 17 planes (00000000 to 0010FFFF, which needs at most 4 bytes in UTF-8), a lot of UTF-8 codecs in the real world no longer contain the code for handling the longer sequences. Witness things like the UTF-8 codec in MySQL, whose 32-bit support conditional compilation switch is mentioned at https://news.ycombinator.com/item?id=17311048 .
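
To make the bit arithmetic concrete, here is a minimal Python sketch of both schemes as described above. The names `utf16_encode`, `utf8_length` and the `allow_legacy` parameter are made up for illustration and are not taken from any real codec.

```python
def utf16_encode(cp):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0xFFFF:
        return [cp]                  # first range: stored as-is
    v = cp - 0x10000                 # shift 10000..10FFFF down to 00000..FFFFF
    high = 0xD800 | (v >> 10)        # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)       # bottom 10 bits go in the low surrogate
    return [high, low]


def utf8_length(cp, allow_legacy=False):
    """Return how many bytes UTF-8 needs for one code point.

    allow_legacy=True models the original 31-bit scheme (up to 6 bytes).
    The default models the current limit of 10FFFF (up to 4 bytes).
    Surrogates are not rejected here, to keep the sketch short.
    """
    limit = 0x7FFFFFFF if allow_legacy else 0x10FFFF
    if not 0 <= cp <= limit:
        raise ValueError(f"out of range: {cp:#x}")
    # Largest value representable with 1, 2, 3, 4, 5, and 6 bytes.
    tops = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    for length, top in enumerate(tops, start=1):
        if cp <= top:
            return length


# Counting what UTF-16 can reach:
as_is = 0x10000 - 0x800              # 65,536 minus the 2,048 surrogates
via_pairs = 1 << 20                  # 10 + 10 bits from a surrogate pair
total = as_is + via_pairs            # 1,112,064
```

For example, `utf16_encode(0x1F600)` gives `[0xD83D, 0xDE00]`, and `utf8_length(0x110000)` raises unless `allow_legacy=True` is passed, which roughly mirrors a codec that has dropped the longer sequences.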