Hacker News
by JonathonW 2869 days ago
UTF-16 has a bit of a funky design (using four-byte, two-code-unit surrogate pairs to encode code points outside the Basic Multilingual Plane) that ultimately restricts Unicode (if compatibility is to be maintained with UTF-16, at least) to 17 planes, or 2^20 code points (about 1 million).
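
For anyone who wants to see the surrogate-pair arithmetic concretely, here is a rough Python sketch. The function names are made up for illustration; the constants are the standard surrogate ranges (0xD800 for high, 0xDC00 for low), with each half carrying 10 bits.

```python
# Illustrative sketch of UTF-16 surrogate-pair arithmetic.
# to_surrogates / from_surrogates are hypothetical helper names.

def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000               # fits in 20 bits
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Recombine a high/low surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogates(0x1F600)])   # ['0xd83d', '0xde00']
print(hex(from_surrogates(0xDBFF, 0xDFFF)))       # 0x10ffff
```

The largest possible pair decodes to U+10FFFF, which is where the ceiling on the code space comes from.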

UTF-8 uses a variable-length encoding that allows for more characters: if restricted to four bytes, it allows for 2^21 total code points; it's designed to eventually allow for 2^31 code points, which works out to about 2 billion.
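
A quick way to see where the 2^21 and 2^31 figures come from is to count payload bits per sequence length. This sketch assumes the original UTF-8 layout with sequences of up to six bytes, where the lead byte signals the length and each continuation byte (10xxxxxx) carries 6 bits; `payload_bits` is a name invented for this example.

```python
# Payload bits carried by an n-byte sequence in the original
# (up to 6-byte) UTF-8 layout. payload_bits is a hypothetical helper.

def payload_bits(n):
    if n == 1:
        return 7                     # 0xxxxxxx
    lead_bits = 7 - n                # 110xxxxx -> 5, 11110xxx -> 3, ...
    return lead_bits + 6 * (n - 1)   # plus 6 bits per continuation byte

for n in range(1, 7):
    print(n, payload_bits(n))
# 1 7
# 2 11
# 3 16
# 4 21
# 5 26
# 6 31
```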

(Granted, this is all hypothetical: Unicode isn't even close to filling all of the space that UTF-16 allows, and there aren't enough known writing systems left to encode to fill the remaining planes (planes 3-13 of the 17 are still unassigned). But UTF-16's still nonstandard (most of the world's standardized on UTF-8) and kind of ugly, so the sooner it goes away, the better.)

2 comments

Thank you, this was an incredibly understandable explanation.

That is misleading to the point of error, in several respects:

* Your timeline is backwards. UTF-8 was designed for a 31-bit code space. Far from that being its future, that is its past. In the 21st century it was explicitly reduced from 31-bit capable to 21 bits.

* UTF-16 is just as standard as UTF-8 is, it being standardized by the same people in the same places.

* 17 planes is 21 bits; it is 16 planes that is 20 bits.
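
The plane arithmetic is easy to check. A small sketch, assuming only that a plane is 2^16 code points (`bits_needed` is a name invented here):

```python
PLANE_SIZE = 1 << 16   # 65,536 code points per plane

def bits_needed(planes):
    """Bits required to index the last code point in `planes` planes."""
    return (planes * PLANE_SIZE - 1).bit_length()

print(bits_needed(16), 16 * PLANE_SIZE)   # 20 1048576
print(bits_needed(17), 17 * PLANE_SIZE)   # 21 1114112
```

So 16 planes fit exactly in 20 bits, while the 17th plane pushes the highest code point (U+10FFFF) into a 21st bit.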