| HN Mirror

	by snakeanus 3297 days ago
	The benefit is that you reuse the built-in list type instead of making a new one. What benefit would there be in not treating strings as lists of chars?

2 comments

panic 3297 days ago

Memory use: Unicode scalar values go up to 0x10ffff, which on most machines means a 32-bit value for each character. A UTF-8 representation can be less than 30% of the size (a quarter, for pure ASCII text). And that's not even counting the fact that many languages (Haskell included) represent lists as a linked data structure, with the overhead of a pointer per list entry.

Correctness: you often don't want to operate on individual Unicode scalar values. Extended grapheme clusters can combine multiple scalar values to form a single human-readable character, and that's usually the unit you care about. Representing a string directly as a list of extended grapheme clusters would use even more memory.

Fundamentally, a string has more structure than a list representation gives you (encoded bytes vs. scalar values vs. grapheme clusters). I think it's better to expose this structure than it is to pretend a string is just a list of characters.
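To make the three views concrete, here is a rough Python sketch (Python only because it is easy to run; the sample string is my own). The cluster count is an approximation that skips combining marks using the standard `unicodedata` module. It is not real extended grapheme cluster segmentation, which follows the rules in UAX #29.

```python
import unicodedata

# "café" written with a combining acute accent (U+0301) after the "e".
s = "cafe\u0301"

encoded = s.encode("utf-8")      # view 1: encoded bytes
scalars = [ord(ch) for ch in s]  # view 2: Unicode scalar values

# View 3, approximated: count scalar values that are not combining marks.
# This is NOT full extended grapheme cluster segmentation (UAX #29).
approx_clusters = sum(1 for ch in s if unicodedata.combining(ch) == 0)

# Treating the string as a plain list of scalar values and reversing it
# detaches the accent from its base letter.
reversed_scalars = s[::-1]

print(len(encoded), len(scalars), approx_clusters)
```

The three lengths differ for the same four-letter word, and the naive reversal starts with an orphaned accent, which is the kind of mistake a list-of-chars view invites.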


paulddraper 3296 days ago

On the contrary, UTF-8 is the one that is long, up to 50% longer than UTF-32. (Unless you happen to have a disproportionate number of low code points.)

No free lunches!


MrManatee 3295 days ago

That's UTF-16, not UTF-32.

UTF-8 is one to four bytes, UTF-16 is two or four bytes, and UTF-32 is always four bytes. For some code points, UTF-8 is 50% longer than UTF-16 (3 vs 2), but UTF-8 is never longer than UTF-32.
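A quick way to check those numbers yourself, sketched in Python. The helper name `encoded_sizes` and the four sample code points are my own picks. The `-le` codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return the byte length of one code point in (UTF-8, UTF-16, UTF-32)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

# One sample from each UTF-8 length class:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji).
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The euro sign is the 3 vs 2 case: three bytes in UTF-8, two in UTF-16, four in UTF-32.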


panic 3296 days ago

Sure, UTF-8 isn't always the shortest, but for many common strings (like JSON-encoded objects with ASCII keys) it is much shorter than UTF-32. The point is that using a list representation means you can't do any better than UTF-32, even if you wanted to.
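For a rough idea of the difference, here is a small Python sketch with a made-up JSON payload (ASCII keys, one non-ASCII character in a value). The payload is purely illustrative.

```python
import json

# Hypothetical payload: ASCII keys, one non-ASCII character in a value.
text = json.dumps({"name": "Zo\u00eb", "id": 42}, ensure_ascii=False)

utf8_size = len(text.encode("utf-8"))
utf32_size = len(text.encode("utf-32-le"))  # "-le" avoids counting a BOM

print(len(text), utf8_size, utf32_size)
```

UTF-32 always costs four bytes per code point, while here UTF-8 costs one byte per code point plus one extra byte for the single two-byte character.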


paulddraper 3295 days ago

If you have ASCII, might I recommend the ASCII character set and encoding?


PeterisP 3297 days ago

Performance and efficiency. Just as many problems need a way to store a matrix of unboxed integers or floats, many problems need a way to treat strings, possibly very large strings, in a way that doesn't include any language overhead/metadata for each separate character and that allows fast random indexing, which lists don't.

A list of characters works fine for the string 'Hello, world!'. It doesn't work fine for the string representing, for example, a whole webpage that you're returning, or for the string that you need to pass to some external code, e.g. a regex engine implemented in C (which requires transforming it into a memory-contiguous array of chars, and then transforming the results back), or for a 100 megabyte plaintext/xml/json/csv/whatever file you're processing.
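A toy model of that round trip, sketched in Python rather than Haskell. The cons-cell encoding as nested tuples and the helper names (`from_str`, `nth`, `to_bytes`) are made up for illustration. CPython's `re` module stands in for the external C regex engine.

```python
import re

def from_str(s):
    """Build a cons-style list of characters: (head, tail) cells ending in None."""
    cell = None
    for ch in reversed(s):
        cell = (ch, cell)
    return cell

def nth(cell, i):
    """Indexing walks i cells: O(i), unlike O(1) indexing into a byte buffer."""
    for _ in range(i):
        cell = cell[1]
    return cell[0]

def to_bytes(cell):
    """Flatten the list into one contiguous UTF-8 buffer (an O(n) copy)."""
    parts = []
    while cell is not None:
        parts.append(cell[0])
        cell = cell[1]
    return "".join(parts).encode("utf-8")

chars = from_str("id=42;id=7")
buf = to_bytes(chars)                             # copy before the external call
matches = re.findall(rb"id=(\d+)", buf)           # engine works on contiguous bytes
results = [m.decode("utf-8") for m in matches]    # transform the results back

print(nth(chars, 3), chr(buf[3]), results)
```

Every call into the engine pays for the flattening copy and the conversion back, and every `nth` pays for a walk, none of which a contiguous representation needs.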


amelius 3297 days ago

The challenge for language designers is to abstract away the details of representation, but still present a uniform interface. (In this case, the designers have failed.)
