Hacker News
by taejo 3388 days ago
Everything that's wrong with [] is wrong with [Char] and more so. In a Unicode world, it rarely makes sense to iterate over codepoints in a string, and it's rarely useful to prepend codepoints or drop codepoints at the beginning of a string. Usually an array-like string (e.g. Text) is better; occasionally something like Seq Char might be useful.

>In a Unicode world, it rarely makes sense to iterate over codepoints in a string

I love that people think that being Unicode somehow makes strings into opaque objects that you can never inspect or manipulate. Do you think that strings magically pop into existence fully formed and then magically disappear into a magic box and come out as rendered glyphs on a screen?

I don't disagree that [Char] is a stupid way to represent strings. Strings should very obviously just be byte arrays. Go does it right, it's one of the few things it does right. It turns out the creators of UTF-8 know how to deal with Unicode properly. Who would have thought?
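
To make the bytes-versus-code-points distinction concrete, here is a minimal Python sketch (Python rather than Go or Haskell purely for illustration; the sample string is made up). It assumes nothing beyond the standard library and only shows that a UTF-8 byte array and a sequence of code points have different lengths, and that slicing the bytes can cut a code point in half.

```python
# Illustrative sample: 'é' is written as the single code point U+00E9.
s = "h\u00e9llo"
raw = s.encode("utf-8")  # the byte-array view of the same string

code_point_count = len(s)   # counts code points
byte_count = len(raw)       # counts UTF-8 bytes; U+00E9 takes two

# Slicing bytes at an arbitrary offset can split a multi-byte code point.
prefix = raw[:2]
try:
    prefix.decode("utf-8")
    prefix_is_valid = True
except UnicodeDecodeError:
    prefix_is_valid = False

print(code_point_count, byte_count, prefix_is_valid)
```

So a byte-array string is simple to store and pass around, but any code that indexes into it still has to know where the code point boundaries are.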

I just mean that inspecting and manipulating strings is sufficiently complex that most of the time you use something like libicu to do it, so the apparent convenience of [] is not useful to the average programmer.

Eh, yeah and no, kinda?

I have no problem with an opaque string type that supports being serialised into bytes. I also have no problem with a string type that exposes the reality that it's internally represented as UTF-8. But I can understand that maybe that's a little imperfect, because you might not want to maintain a perfectly normalised, valid UTF-8 encoding all the time. E.g. if you concatenate "blahblahlblaha" with a combining acute accent (U+0301) followed by "blahlbahlblah", you might want to just store them together and normalise them later or something? I don't know.
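
A rough sketch of that "store now, normalise later" idea, in Python using the standard `unicodedata` module. The choice of NFC as the normalisation form is an assumption for illustration; the comment above doesn't say which form it has in mind.

```python
import unicodedata

left = "blahblahlblaha"
# The right-hand piece starts with U+0301 COMBINING ACUTE ACCENT.
right = "\u0301" + "blahlbahlblah"

# Plain concatenation: the trailing 'a' and the combining accent now sit
# next to each other as two separate code points.
joined = left + right

# Normalise later. NFC (an assumed choice) composes 'a' + U+0301 into the
# single code point U+00E1.
normalised = unicodedata.normalize("NFC", joined)

print(len(joined), len(normalised))
print("\u0301" in joined, "\u0301" in normalised)
```

Both `joined` and `normalised` encode to valid UTF-8; what changes is the code point sequence, so the two compare unequal until they are normalised to the same form.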