

by panic 3297 days ago

Memory use: Unicode scalar values go up to 0x10ffff, which on most machines means a 32-bit value for each character. A UTF-8 representation can be less than 30% of that size (ASCII-only text is exactly 25%). And that's not even counting the fact that many languages (Haskell included) represent lists as linked data structures, with the overhead of a pointer per list entry.
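
To put rough numbers on the size claim, here is a minimal Python sketch. The `encoded_sizes` helper and the sample string are illustrative assumptions, not anything from the original comment, and the sketch only measures encoded bytes, not the per-node pointer overhead of a linked list.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in each encoding form.

    The explicit little-endian codecs are used so that no byte order
    mark is prepended and the counts reflect the text alone.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Hypothetical sample: pure ASCII, the best case for UTF-8.
sample = "Memory use: Unicode scalar values go up to 0x10ffff"
sizes = encoded_sizes(sample)

# For ASCII-only text UTF-8 uses 1 byte per character and UTF-32 uses 4.
ratio = sizes["utf-8"] / sizes["utf-32"]
print(sizes, f"UTF-8 is {ratio:.0%} of UTF-32")
```

For ASCII-only input the ratio comes out at 25%; text with many code points above U+07FF moves it toward 75% or 100%.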

Correctness: you often don't want to operate on individual Unicode scalar values. Extended grapheme clusters can combine multiple scalar values to form a single human-readable character, and that's usually the unit you care about. Representing a string directly as a list of extended grapheme clusters would use even more memory.
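
A small Python sketch of the mismatch between scalar values and what a reader sees as one character. The variable names are made up for illustration. Python's standard library has no grapheme cluster segmentation, so this only shows that counting scalar values gives the "wrong" answer, not how to count clusters properly.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: one visible character, two scalar values.
decomposed = "e\u0301"

# NFC normalization can merge this particular pair into U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# MAN + ZERO WIDTH JOINER + WOMAN + ZERO WIDTH JOINER + GIRL:
# typically rendered as a single family emoji, but it is five scalar values
# and has no precomposed form, so normalization does not shrink it.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
family_nfc = unicodedata.normalize("NFC", family)

print(len(decomposed), len(composed))   # 2 1
print(len(family), len(family_nfc))     # 5 5
```

Normalization helps in the first case only; the second shows why the grapheme cluster, not the scalar value, is usually the unit you care about.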

Fundamentally, a string has more structure than a list representation gives you (encoded bytes vs. scalar values vs. grapheme clusters). I think it's better to expose this structure than it is to pretend a string is just a list of characters.


paulddraper 3296 days ago

On the contrary, UTF-8 is the one that is long, up to 50% longer than UTF-32. (Unless you happen to have a disproportionate number of low code points.)

No free lunches!


MrManatee 3295 days ago

That's UTF-16, not UTF-32.

UTF-8 is one to four bytes, UTF-16 is two or four bytes, and UTF-32 is always four bytes. For some code points, UTF-8 is 50% longer than UTF-16 (3 vs 2), but UTF-8 is never longer than UTF-32.
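
The per-code-point byte counts above are easy to check. A minimal Python sketch follows; the `bytes_per_code_point` helper and the four sample characters are my own choice for illustration.

```python
def bytes_per_code_point(ch):
    """Return (utf8, utf16, utf32) byte counts for a single code point.

    The little-endian codecs are used so no byte order mark is counted.
    """
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


samples = {
    "A": "A",                    # U+0041, ASCII
    "e-acute": "\u00e9",         # U+00E9, Latin-1 range
    "euro": "\u20ac",            # U+20AC, rest of the Basic Multilingual Plane
    "grinning": "\U0001F600",    # U+1F600, outside the BMP
}

table = {name: bytes_per_code_point(ch) for name, ch in samples.items()}
for name, counts in table.items():
    print(f"{name:10} utf-8={counts[0]} utf-16={counts[1]} utf-32={counts[2]}")
```

The euro sign row is the 3 vs. 2 case where UTF-8 is 50% longer than UTF-16; in no row does UTF-8 exceed UTF-32.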


panic 3296 days ago

Sure, UTF-8 isn't always the shortest, but for many common strings (like JSON-encoded objects with ASCII keys) it is much shorter than UTF-32. The point is that using a list representation means you can't do any better than UTF-32, even if you wanted to.


paulddraper 3295 days ago

If you have ASCII, might I recommend the ASCII character set and encoding?
