| HN Mirror



	by paulddraper 3296 days ago
	On the contrary, UTF-8 is the one that is long, up to 50% longer than UTF-32. (Unless you happen to have a disproportionate number of low code points.) No free lunches!


MrManatee 3295 days ago

That's UTF-16, not UTF-32.

UTF-8 is one to four bytes, UTF-16 is two or four bytes, and UTF-32 is always four bytes. For some code points, UTF-8 is 50% longer than UTF-16 (3 vs 2), but UTF-8 is never longer than UTF-32.
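
For anyone who wants to check those byte counts, here is a minimal Python sketch. The sample characters and the helper name `encoded_sizes` are only illustrative choices, not anything from the thread. It uses the `-le` codec variants so that no byte order mark is added to the count.

```python
# One sample code point from each UTF-8 length class (illustrative picks).
samples = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, two bytes in UTF-8",
    "\u20ac": "U+20AC, three bytes in UTF-8",
    "\U0001F600": "U+1F600, outside the BMP",
}

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a single character."""
    # The -le codecs skip the byte order mark that plain "utf-16"
    # and "utf-32" would prepend, so only the code units are counted.
    return tuple(
        len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )

for ch, label in samples.items():
    print(label, encoded_sizes(ch))
```

The euro sign is the 3 vs 2 case: three bytes in UTF-8, two in UTF-16. In no row does the UTF-8 count exceed the UTF-32 count of four.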


panic 3296 days ago

Sure, UTF-8 isn't always the shortest, but for many common strings (like JSON-encoded objects with ASCII keys) it is much shorter than UTF-32. The point is that using a list representation means you can't do any better than UTF-32, even if you wanted to.
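
A rough way to see the size difference for a mostly-ASCII JSON string, sketched in Python. The `record` object is a made-up example, and the exact ratio will depend on the data.

```python
import json

# Hypothetical record: ASCII keys, one non-ASCII character in a value.
record = {"id": 42, "name": "Zo\u00eb", "tags": ["a", "b"]}

# ensure_ascii=False keeps the non-ASCII character as-is instead of
# escaping it to \u00eb in the JSON text.
text = json.dumps(record, ensure_ascii=False)

utf8_len = len(text.encode("utf-8"))
# UTF-32 spends four bytes on every code point; the -le codec omits the BOM.
utf32_len = len(text.encode("utf-32-le"))

print(len(text), "code points")
print(utf8_len, "bytes as UTF-8")
print(utf32_len, "bytes as UTF-32")
```

Every character here except one is ASCII, so the UTF-8 size is the code point count plus one, while UTF-32 is exactly four times the code point count.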


paulddraper 3295 days ago

If you have ASCII, might I recommend the ASCII character set and encoding?
