| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by pvdebbe 3297 days ago
	I find it unfortunate how many languages treat strings as lists of chars for no clear benefit. This issue is apparent in dynamically typed languages where a function might expect either a string or a sequence, and now it all has to come down to type comparisons because both can be iterated. I think it'd be time to start treating them as atomic values instead.

2 comments

snakeanus 3297 days ago

The benefit is that you reuse the build-in type for lists instead of making a new one. What benefit would not treating strings as lists of chars bring?

link

panic 3297 days ago

Memory use: Unicode scalar values go up to 0x10ffff, which on most machines means a 32-bit value for each character. A UTF-8 representation can be less than 30% the size. And that's not even counting the fact that many languages (Haskell included) represent lists as a linked data structure, with the overhead of a pointer per list entry.

Correctness: you often don't want to operate on individual Unicode scalar values. Extended grapheme clusters can combine multiple scalar values to form a single human-readable character, and that's usually the unit you care about. Representing a string directly as a list of extended grapheme clusters would use even more memory.

Fundamentally, a string has more structure than a list representation gives you (encoded bytes vs. scalar values vs. grapheme clusters). I think it's better to expose this structure than it is to pretend a string is just a list of characters.

link

paulddraper 3296 days ago

On the contrary, UTF-8 is the one that is long, up to 50% longer than UTF-32. (Unless you happen to have a disproportionate number of low code points.)

No free lunches!

link

MrManatee 3295 days ago

That's UTF-16, not UTF-32.

UTF-8 is one to four bytes, UTF-16 is two or four bytes, and UTF-32 is always four bytes. For some code points, UTF-8 is 50% longer than UTF-16 (3 vs 2), but UTF-8 is never longer than UTF-32.

link

panic 3296 days ago

Sure, UTF-8 isn't always the shortest, but for many common strings (like JSON-encoded objects with ASCII keys) it is much shorter than UTF-32. The point is that using a list representation means you can't do any better than UTF-32, even if you wanted to.

link

paulddraper 3295 days ago

If you have ASCII, might I recommend the ASCII character set and encoding?

link

PeterisP 3297 days ago

Performance and efficiency. Just as many problems need a way to store a matrix of unboxed integers or floats, many problems need a way to treat strings, possibly very large strings in a way that doesn't include any language overhead/metadata for each separate character, and allows fast random indexing, which lists don't.

List of characters works fine for the string 'Hello, world!'. It doesn't work fine for the string representing, for example, a whole webpage that you're returning, of for the string that you need to pass to some external code e.g. a regex engine implemented in C (which requires to transform it to a memory-contiguous array of chars, and then transform the results back), or for a 100 megabyte plaintext/xml/json/csv/whatever file you're processing.

link

amelius 3297 days ago

The challenge for language designers is to abstract away the details of representation, but still present a uniform interface. (In this case, the designers have failed.)

link

adrianN 3297 days ago

But iterating over the characters is a very useful feature. It's used in all string algorithms that I can remember in the time it takes to write this comment.

link

runeks 3297 days ago

But iteration is not a property of a list, but of the Foldable type class. A list is just one implementation of something that's Foldable/iterable. You can easily implement Foldable for custom text types.

link

coldtea 3297 days ago

>But iterating over the characters is a very useful feature. It's used in all string algorithms that I can remember in the time it takes to write this comment.

And you still do that with atomic strings -- you just either add a helper method like charAt(i) that gives you access to each character (or, rather, each rune), or you have some way to turn a string of length N into a list of N strings of length one.

link

johncolanduoni 3297 days ago

And those string algorithms likely break in subtle ways when they handle characters that span multiple codepoints.

link

SideburnsOfDoom 3297 days ago

> break in subtle ways when they handle characters that span multiple codepoints

Or equivalently: there is more than one way to turn a string into a list. It can e.g. be a sequence of bytes, unicode chars or grapheme clusters. Being explicit about the conversion is therefore a good idea.

link

e12e 3297 days ago

Don't forget splitting on word boundaries and/or whitespace - going from a string of text to an iterable collection of words (strings).

link

SideburnsOfDoom 3296 days ago

Or for the case of (e.g.) domain names, splitting on dots. Generally, given a collection of split chars, breaking the string into a collection of substrings.

link

adrianN 3297 days ago

Not if the "iterating over character" function iterates over actual characters and not codepoints.

link

johncolanduoni 3297 days ago

You mean grapheme clusters? Swift is the only language I know that uses that by default, and you still wouldn't want to store strings as a list of grapheme clusters.

link

cannam 3297 days ago

I believe Perl 6 does so as well, see e.g. https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-p...

link

e12e 3297 days ago

The apple dev documentation has a nice overview of some of the concerns that need to be taken into account for this to work:

https://developer.apple.com/library/content/documentation/Co...

It's probably one of the better approaches - but it's still not clear if it (alone) allows a developer that speaks only English to develop a text indexing or editing system that works well across English, Japanese, Arabic, Hangul and Dutch for example.

link

jswny 3297 days ago

Elixir as well.

link

dragonwriter 3296 days ago

The problem is “actual characters” are an ill-defined term; that could mean either code points or graphemes. See, e.g., http://unicode.org/faq/char_combmark.html

link

pvdebbe 3297 days ago

Sure, it is useful. Maybe have a "string to [Char]" function in the stdlib?

link