by lasthemy 3397 days ago
In 2003, RFC 3629 removed all 5- and 6-byte encodings, effectively limiting UTF-8 to 4 bytes. Of course it could be expanded at any time, but that would be a significant change to established practice, and would directly contradict the rationale in RFC 3629 (that because most people use 4 bytes in practice, allowing 5 and 6 constituted a security flaw).

Source: the same Wikipedia article you linked.
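
To make the 4-byte limit concrete, here is a minimal Python sketch. The helper names (utf8_length, is_valid_utf8) and the sample byte string are made up for illustration; the only library behavior relied on is Python's built-in "utf-8" codec, which follows RFC 3629 and rejects the old 5-byte lead byte.

```python
def utf8_length(cp: int) -> int:
    """Bytes needed to encode a code point under RFC 3629 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4  # everything up to U+10FFFF fits in 4 bytes


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's RFC 3629 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The value 0x200000 written in the pre-2003 5-byte form (lead byte 0xF8).
# Illustrative input only; modern decoders treat 0xF8 as an invalid start byte.
old_five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

print(utf8_length(0x10FFFF))         # 4
print(is_valid_utf8(old_five_byte))  # False
```

So the bit layout still has room for longer sequences, but a conforming decoder will not accept them.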

1 comments

Sure, that's why I pointed out that it could be expanded at any time: the encoding already supports expansion, by design :-)
The limiting factor on Unicode is UTF-16. There are only enough surrogates for 16 astral planes, which is why Unicode has 17 16-bit planes (the BMP plus those 16).
UTF-16 has reserved codes as well, so it could be expanded to cover 2^32 code points, too.
The range U+D800-DFFF is reserved for UTF-16 surrogates, split into two halves of 1024 values each: high surrogates (U+D800-DBFF) and low surrogates (U+DC00-DFFF). That means every surrogate pair can encode 10 + 10 bits of information, which is where the 16 astral planes (4 bits of plane number on top of 16 bits per plane) come in. Otherwise, there are just 128 code points in unallocated blocks in the BMP.
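
The 10 + 10 bit arithmetic can be sketched in a few lines of Python. The function names here (to_surrogates, from_surrogates) are made up for illustration; the offsets are the standard UTF-16 ones.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split an astral code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))              # 0xd83d 0xde00
print(hex(from_surrogates(high, low)))  # 0x1f600
```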

There is no space for expansion without reassigning private use areas or changing the encoding mechanism of surrogates--which is currently completely specified (each surrogate pair will produce a valid code point).
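
A quick count shows why nothing is left over. This is only a sketch of the arithmetic, with made-up variable names: the number of possible high/low pairs equals the number of astral code points exactly.

```python
high_surrogates = range(0xD800, 0xDC00)  # 1024 values
low_surrogates = range(0xDC00, 0xE000)   # 1024 values

pair_count = len(high_surrogates) * len(low_surrogates)
astral_count = 0x10FFFF - 0x10000 + 1    # code points in planes 1..16

print(pair_count)                        # 1048576
print(pair_count == astral_count)        # True
print(astral_count // 0x10000)           # 16 planes of 65536 code points
```

Every pair is already spoken for, so reaching further code points would mean changing what existing pairs mean or taking ranges from elsewhere.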