| HN Mirror



	by monkeyninja 4591 days ago
	language will not store Unicode strings internally as UTF-8. Yes, we use it for input and output, but in memory UTF-8 is terrible for random access to characters. Endianness is (normally) only an issue for input and output, not for internal storage, especially since with UTF-16 and UTF-32 you know exactly the size of each item.

3 comments

ygra 4591 days ago

UTF-16 is just as bad as UTF-8 regarding variable-width code points. The only thing you always have (unless using compression schemes like SCSU) is random access to code units. Only UTF-32 also allows random access to code points. However, that's still of questionable value because when dealing with text you often want to handle graphemes, not code points, code units or bytes.
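To make the code unit versus code point distinction concrete, here is a small Python sketch. The helper name `code_units` and the sample string are made up for illustration; it only relies on `str.encode` from the standard library, and the `-le` codec variants are used so that no byte order mark is counted.

```python
def code_units(text: str) -> dict[str, int]:
    """Count code points and code units of `text` in each encoding form."""
    return {
        "code_points": len(text),  # Python str indexes by code point
        "utf8_units": len(text.encode("utf-8")),           # 1 byte per unit
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }


# 'a' (U+0061), 'é' (U+00E9), '€' (U+20AC), and an emoji outside the BMP (U+1F600)
sample = "a\u00e9\u20ac\U0001F600"
counts = code_units(sample)
print(counts)
```

Only the UTF-32 count matches the number of code points; the UTF-16 count is one higher because U+1F600 needs a surrogate pair, so indexing by 16-bit unit can land in the middle of a character.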


jeltz 4591 days ago

You cannot do random access at all in Unicode, not even UTF-32 (and absolutely not UTF-16), due to combining characters.
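A quick Python sketch of the combining character problem, using only the standard `unicodedata` module. The variable names are illustrative. Python strings index by code point, which is roughly what UTF-32 storage would give you, and that is still not enough to index by what a user perceives as one character.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character, two code points
decomposed = "e\u0301"
# NFC normalization composes this pair into the single code point U+00E9
composed = unicodedata.normalize("NFC", decomposed)

# 'q' + combining acute has no precomposed form, so NFC leaves it as two code points
q_acute = unicodedata.normalize("NFC", "q\u0301")

print(len(decomposed), len(composed), len(q_acute))
# Indexing by code point splits the base letter from its accent
print(repr(decomposed[0]), repr(decomposed[1]))
```

Normalization helps in some cases but does not make the problem go away, since many base and mark combinations have no single-code-point form. Indexing by grapheme cluster needs a segmentation algorithm on top of whichever encoding is used.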


jmdavis 4591 days ago

UTF-16 is variable-length just like UTF-8 is (so don't assume 2 bytes == 1 character).
