by ygra 4787 days ago
There will never be more than 4 bytes per code point in UTF-8, because Unicode is restricted to 21 bits (the code space ends at U+10FFFF). Remember that all UTFs have to be able to represent all of Unicode, and UTF-16 could not represent the “code points” that would need 5 or more bytes in UTF-8.
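
A quick way to check the 4-byte ceiling is to encode the code points at each length boundary. This is only a sketch in Python; the helper name utf8_len and the list of boundary values are my own choices for illustration.

```python
# Sketch: UTF-8 byte length at each boundary of the Unicode code space.
# utf8_len is a hypothetical helper name, not part of any library.
def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# Last code point of each length class, followed by the first of the next.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = {hex(cp): utf8_len(cp) for cp in boundaries}

# The highest code point, U+10FFFF, needs exactly 21 bits.
max_bits = (0x10FFFF).bit_length()

if __name__ == "__main__":
    for cp, n in lengths.items():
        print(cp, n)
    print("bits needed for U+10FFFF:", max_bits)
```

Python refuses chr(0x110000) with a ValueError, so nothing past U+10FFFF can even reach the encoder.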

Also, I wouldn't say that UTF-8 is a compression scheme. SCSU is one, but it has its own share of problems. UTF-8 just happens to preserve ASCII compatibility, which is an important property for Unix-like systems. Nothing more and nothing less. That it also happens to be more space-efficient for text that consists mostly of ASCII characters is merely a side effect of that.

1 comments

From the standpoint of an English speaker, UTF-8 is effectively a (good) compression scheme for Unicode, as opposed to using 2 or more bytes for every character.

I guess if I were German or Spanish (to say nothing of Asian languages), it would be the opposite of compression :-)
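
To put rough numbers on the size question, here is a small Python sketch that compares encoded sizes. The sample strings and the helper name encoded_sizes are made up for illustration, and "utf-16-le" is used so that no byte-order mark is counted.

```python
# Sketch: compare UTF-8 and UTF-16 sizes for a few made-up sample strings.
# encoded_sizes is a hypothetical helper name, not a library function.
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "english": "hello world",
    "german": "Grüße",
    "japanese": "日本語",
}

sizes = {name: encoded_sizes(text) for name, text in samples.items()}

if __name__ == "__main__":
    for name, (u8, u16) in sizes.items():
        print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For these particular samples the German string is still shorter in UTF-8, since only the accented letters take 2 bytes, while the Japanese string is shorter in UTF-16. Real text will vary with how much ASCII it contains.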