Hacker News new | ask | show | jobs
by pvdebbe 3297 days ago
I find it unfortunate how many languages treat strings as lists of chars for no clear benefit. This issue is apparent in dynamically typed languages where a function might expect either a string or a sequence, and now it all has to come down to type comparisons because both can be iterated. I think it'd be time to start treating them as atomic values instead.
2 comments

The benefit is that you reuse the build-in type for lists instead of making a new one. What benefit would not treating strings as lists of chars bring?
Memory use: Unicode scalar values go up to 0x10ffff, which on most machines means a 32-bit value for each character. A UTF-8 representation can be less than 30% the size. And that's not even counting the fact that many languages (Haskell included) represent lists as a linked data structure, with the overhead of a pointer per list entry.

Correctness: you often don't want to operate on individual Unicode scalar values. Extended grapheme clusters can combine multiple scalar values to form a single human-readable character, and that's usually the unit you care about. Representing a string directly as a list of extended grapheme clusters would use even more memory.

Fundamentally, a string has more structure than a list representation gives you (encoded bytes vs. scalar values vs. grapheme clusters). I think it's better to expose this structure than it is to pretend a string is just a list of characters.

On the contrary, UTF-8 is the one that is long, up to 50% longer than UTF-32. (Unless you happen to have a disproportionate number of low code points.)

No free lunches!

That's UTF-16, not UTF-32.

UTF-8 is one to four bytes, UTF-16 is two or four bytes, and UTF-32 is always four bytes. For some code points, UTF-8 is 50% longer than UTF-16 (3 vs 2), but UTF-8 is never longer than UTF-32.

Sure, UTF-8 isn't always the shortest, but for many common strings (like JSON-encoded objects with ASCII keys) it is much shorter than UTF-32. The point is that using a list representation means you can't do any better than UTF-32, even if you wanted to.
If you have ASCII, might I recommend the ASCII character set and encoding?
Performance and efficiency. Just as many problems need a way to store a matrix of unboxed integers or floats, many problems need a way to treat strings, possibly very large strings in a way that doesn't include any language overhead/metadata for each separate character, and allows fast random indexing, which lists don't.

List of characters works fine for the string 'Hello, world!'. It doesn't work fine for the string representing, for example, a whole webpage that you're returning, of for the string that you need to pass to some external code e.g. a regex engine implemented in C (which requires to transform it to a memory-contiguous array of chars, and then transform the results back), or for a 100 megabyte plaintext/xml/json/csv/whatever file you're processing.

The challenge for language designers is to abstract away the details of representation, but still present a uniform interface. (In this case, the designers have failed.)
But iterating over the characters is a very useful feature. It's used in all string algorithms that I can remember in the time it takes to write this comment.
But iteration is not a property of a list, but of the Foldable type class. A list is just one implementation of something that's Foldable/iterable. You can easily implement Foldable for custom text types.
>But iterating over the characters is a very useful feature. It's used in all string algorithms that I can remember in the time it takes to write this comment.

And you still do that with atomic strings -- you just either add a helper method like charAt(i) that gives you access to each character (or, rather, each rune), or you have some way to turn a string of length N into a list of N strings of length one.

And those string algorithms likely break in subtle ways when they handle characters that span multiple codepoints.
> break in subtle ways when they handle characters that span multiple codepoints

Or equivalently: there is more than one way to turn a string into a list. It can e.g. be a sequence of bytes, unicode chars or grapheme clusters. Being explicit about the conversion is therefore a good idea.

Don't forget splitting on word boundaries and/or whitespace - going from a string of text to an iterable collection of words (strings).
Or for the case of (e.g.) domain names, splitting on dots. Generally, given a collection of split chars, breaking the string into a collection of substrings.
Not if the "iterating over character" function iterates over actual characters and not codepoints.
You mean grapheme clusters? Swift is the only language I know that uses that by default, and you still wouldn't want to store strings as a list of grapheme clusters.
I believe Perl 6 does so as well, see e.g. https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-p...
The apple dev documentation has a nice overview of some of the concerns that need to be taken into account for this to work:

https://developer.apple.com/library/content/documentation/Co...

It's probably one of the better approaches - but it's still not clear if it (alone) allows a developer that speaks only English to develop a text indexing or editing system that works well across English, Japanese, Arabic, Hangul and Dutch for example.

Elixir as well.
The problem is “actual characters” are an ill-defined term; that could mean either code points or graphemes. See, e.g., http://unicode.org/faq/char_combmark.html
Sure, it is useful. Maybe have a "string to [Char]" function in the stdlib?