HN Mirror



	by ISV_Damocles 321 days ago
	UTF-16 is just as complicated as UTF-8: it also needs multi-unit sequences (surrogate pairs) to cover the entirety of Unicode, so it doesn't avoid the issue you're complaining about for the newest languages added. It has the added complexity of needing a BOM to be sure you have the pairs of bytes in the right order, so truncated data is more likely to be unrecoverable than with UTF-8. UTF-32 would be a fair comparison, but it is 4 bytes per character and I don't know what, if anything, uses it.
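
A small Python sketch of the size and byte-order points above. The sample characters and the helper name `encoded_sizes` are picked purely for illustration; the codec names are the standard ones shipped with Python.

```python
# Sketch: bytes needed per character in each encoding, and why byte order matters.
samples = {"ASCII letter": "A", "CJK character": "日", "emoji": "😀"}

def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in each encoding (no BOM included)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))

# The same two bytes read as different characters depending on byte order.
raw = "A".encode("utf-16-le")      # bytes 0x41 0x00
as_le = raw.decode("utf-16-le")    # decoded with the order it was written in
as_be = raw.decode("utf-16-be")    # decoded with the opposite order
print(repr(as_le), repr(as_be))

# The generic "utf-16" codec prepends a BOM so a decoder can tell the order.
with_bom = "A".encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))
```

The emoji takes 4 bytes in all three encodings, so UTF-16 is variable-width just like UTF-8; only UTF-32 is fixed at 4 bytes per code point.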

2 comments

Mikhail_Edoshin 321 days ago

No, UTF-16 is much simpler in that respect. And its design is no less brilliant. (I've written a state machine encoder and decoder for both of these encodings.) If an application works a lot with text, I'd say UTF-16 looks more attractive for the main internal representation.


rmunn 321 days ago

UTF-16 is simpler most of the time, and that's precisely the problem. Anyone working with UTF-8 knows they will have to deal with multibyte codepoints. People working with UTF-16 often forget about surrogate pairs, because they're a lot rarer in most major languages, and then end up with bugs when their users put emoji into a text field.
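
To make the emoji failure mode concrete, here is a rough Python sketch. The helpers `utf16_code_units` and `naive_truncate_utf16` are hypothetical names, and the truncation stands in for any code that assumes one 16-bit unit per character (a fixed-width database column, say).

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16 (no BOM)."""
    return len(text.encode("utf-16-le")) // 2

def naive_truncate_utf16(text: str, max_units: int) -> bytes:
    """Cut at a fixed number of code units without checking for surrogate pairs."""
    return text.encode("utf-16-le")[: max_units * 2]

# Code points vs. UTF-16 code units: they only differ once an emoji shows up.
for s in ["hello", "日本語", "hi😀"]:
    print(repr(s), "code points:", len(s), "UTF-16 units:", utf16_code_units(s))

# "hi😀" is 4 code units; keeping 3 leaves a lone high surrogate at the end.
broken = naive_truncate_utf16("hi😀", 3)
try:
    broken.decode("utf-16-le")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print("decodes cleanly after truncation:", decoded_ok)
```

Text limited to the Basic Multilingual Plane never triggers this, which is why the bug tends to surface only once real users start typing emoji.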


adgjlsfhk1 321 days ago

Python does (although it will use 8 or 16 bits per character if all characters in the string fit).
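
One way to observe this, assuming CPython 3.3 or later (the per-string width is a CPython implementation detail, not a language guarantee). The helper `bytes_per_char` is a made-up name; it estimates storage per character by comparing `sys.getsizeof` for two lengths of the same repeated character, so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate storage per character by differencing two string sizes.

    Assumes CPython's flexible string representation; other
    implementations may report something different.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

for label, ch in [("ASCII", "a"), ("Latin-1", "é"), ("BMP", "日"), ("astral", "😀")]:
    print(label, bytes_per_char(ch))
```

A single astral character anywhere in a string moves the whole string to the 4-byte form, which is effectively UTF-32 for that string.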
