| HN Mirror



	by taejo 3388 days ago
	Everything that's wrong with [] is wrong with [Char] and more so. In a Unicode world, it rarely makes sense to iterate over codepoints in a string, and it's rarely useful to prepend codepoints or drop codepoints at the beginning of a string. Usually an array-like string (e.g. Text) is better; occasionally something like Seq Char might be useful.
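To make the point about dropping codepoints concrete, here is a minimal sketch in Python (not Haskell, purely for illustration; Python's `str` is also a sequence of codepoints). The helper `drop_first_codepoint` is a hypothetical name standing in for a list-style `drop 1`.

```python
import unicodedata

def drop_first_codepoint(s: str) -> str:
    """Drop one codepoint from the front, as a list-style `drop 1` would."""
    return s[1:]

# 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as a single
# accented letter, but it is two codepoints.
word = "e\u0301clair"
rest = drop_first_codepoint(word)

# The accent is now orphaned at the start of the string: it has a nonzero
# canonical combining class but no base character to attach to.
orphaned = unicodedata.combining(rest[0]) != 0
```

Here `rest` begins with a bare combining mark, which is rarely what the caller wanted. Splitting on user-perceived characters needs grapheme segmentation, which the Python standard library does not provide.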


milesrout 3387 days ago

> In a Unicode world, it rarely makes sense to iterate over codepoints in a string

I love that people think that being Unicode somehow makes strings into opaque objects that you can never inspect or manipulate. Do you think that strings magically pop into existence fully formed and then magically disappear into a magic box and come out as rendered glyphs on a screen?

I don't disagree that [Char] is a stupid way to represent strings. Strings should very obviously just be byte arrays. Go does it right; it's one of the few things it does right. It turns out the creators of UTF-8 know how to deal with Unicode properly. Who would have thought?
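As a rough illustration of the byte-array view, sketched in Python rather than Go: the string is just its UTF-8 bytes, and code that slices it has to respect character boundaries. The helpers `is_boundary` and `decodes_cleanly` are hypothetical names for this sketch, not part of any library.

```python
def is_boundary(data: bytes, i: int) -> bool:
    """True if offset i does not fall inside a multi-byte UTF-8 sequence.

    Continuation bytes always have the bit pattern 10xxxxxx.
    """
    return i == len(data) or (data[i] & 0xC0) != 0x80

def decodes_cleanly(data: bytes) -> bool:
    """True if data is valid UTF-8 on its own."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "na\u00efve"           # U+00EF takes two bytes in UTF-8
data = text.encode("utf-8")   # the byte-array representation

good = data[:2]               # ends on a boundary: b"na"
bad = data[:3]                # cuts the two-byte sequence in half
```

The byte length and the codepoint count differ (6 versus 5 here), and `bad` is not valid UTF-8 by itself, so a byte-array string type still needs boundary-aware helpers on top.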


taejo 3385 days ago

I just mean that inspecting and manipulating strings is sufficiently complex that most of the time you use something like libicu to do it, so the apparent convenience of [] is not useful to the average programmer.


milesrout 3384 days ago

Eh, yeah and no, kinda?

I have no problem with an opaque string type that supports being serialised into bytes. I also have no problem with a string type that exposes the reality that it's internally represented as UTF-8. But I can understand that maybe that's a little imperfect, because you might not want to maintain a perfectly normalised UTF-8 encoding all the time. E.g. if you concatenate "blahblahblaha", the combining acute accent, and "blahblahblah", you might want to just store them together and normalise them later or something? I don't know.
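A small sketch of that "store now, normalise later" idea, in Python and assuming NFC is the normal form you care about (the comment doesn't say which one). The literals are shortened stand-ins for the strings above.

```python
import unicodedata

left = "blahblaha"
accent = "\u0301"   # U+0301 COMBINING ACUTE ACCENT
right = "blah"

# Each piece is already in NFC on its own...
pieces_normalised = all(
    unicodedata.normalize("NFC", p) == p for p in (left, accent, right)
)

# ...but plain concatenation is not: the trailing 'a' and the accent
# now form a pair that NFC composes into the single codepoint U+00E1.
joined = left + accent + right
normalised = unicodedata.normalize("NFC", joined)
```

So normalisation is not preserved by concatenation: either every append re-normalises around the seam, or the type stores the raw concatenation and normalises lazily when someone asks.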
