by ygra
4787 days ago
There will never be more than 4 bytes per code point in UTF-8, because Unicode is restricted to 21 bits (code points up to U+10FFFF). Remember that all UTFs have to be able to represent all of Unicode, and UTF-16 could not represent those “code points” where UTF-8 would need 5+ bytes. Also, I wouldn't say that UTF-8 is a compression scheme. SCSU is one, but it has its own share of problems. UTF-8 just happens to preserve ASCII compatibility, which is an important property for Unix-like systems. Nothing more and nothing less. That it also happens to be more space-efficient for text that consists mostly of ASCII characters is merely a side effect of that.
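If you want to check the 4-byte limit and the ASCII compatibility yourself, here is a quick sketch using only Python's built-in codecs. The helper name `utf8_len` and the choice of sample code points are mine, purely for illustration:

```python
# Highest code point Unicode allows; it fits in 21 bits.
MAX_CODE_POINT = 0x10FFFF

def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 uses for a single code point (hypothetical helper)."""
    return len(chr(cp).encode("utf-8"))

# Code points on either side of each UTF-8 length boundary.
samples = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, MAX_CODE_POINT]
lengths = [utf8_len(cp) for cp in samples]

# UTF-16 tops out at the same code point, encoded as one surrogate pair.
utf16_max = chr(MAX_CODE_POINT).encode("utf-16-be")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_same = "hello".encode("utf-8") == "hello".encode("ascii")

print(lengths, utf16_max.hex(), ascii_same)
```

Anything past U+10FFFF is not a code point at all as far as Python is concerned: `chr(0x110000)` raises `ValueError`.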
I guess if I were German or Spanish (to say nothing of Asian languages), it would be the opposite of compression :-)
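For what it's worth, here is a rough way to see how much it actually costs, again with plain Python codecs. The sample words are arbitrary picks of mine, and I am assuming Latin-1 as the legacy single-byte baseline for German:

```python
def sizes(text: str) -> dict:
    """Encoded size in bytes of `text` under a few encodings (illustrative helper)."""
    result = {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # little-endian, no BOM
    }
    try:
        result["latin-1"] = len(text.encode("latin-1"))
    except UnicodeEncodeError:
        result["latin-1"] = None  # not representable in Latin-1
    return result

english = sizes("size")
german = sizes("Größe")     # ö and ß each take 2 bytes in UTF-8
japanese = sizes("日本語")   # each character takes 3 bytes in UTF-8

print(english, german, japanese)
```

So the German word grows from 5 bytes in Latin-1 to 7 in UTF-8, which is still less than the 10 bytes UTF-16 needs, while the Japanese one takes 9 bytes in UTF-8 against 6 in UTF-16.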