HN Mirror


by Avernar 3209 days ago

> I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

The solution to that is simple: don't let the programmer access individual bytes in a Unicode string.

Get rid of indexing into them and replace it with iterators. Make string handling functions work on code points at the very least but better yet on grapheme clusters. There's a little more to it than that but it's a good start.
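To make that concrete, here is a minimal Python sketch of iterating instead of indexing. The helper name `code_points` is made up for illustration, and it only covers the code point level; Python's standard library has no grapheme cluster iterator, so that part would need a third-party library.

```python
def code_points(s: str):
    """Yield the code points of s one at a time, with no indexing involved."""
    for ch in s:  # iterating a Python str yields code points, not bytes
        yield ch


text = "na\u00efve"  # "naïve", with a precomposed U+00EF for the "ï"

byte_count = len(text.encode("utf-8"))                # U+00EF takes 2 bytes in UTF-8
code_point_count = sum(1 for _ in code_points(text))  # one per code point

print(byte_count, code_point_count)  # 6 5
```

The two counts differ as soon as the text leaves ASCII, which is exactly the case where byte indexing lands in the middle of a character.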

Yes, people are still stuck in the ASCII mindset and can't seem to get away from thinking in bytes. But I believe it's the ability to index into strings that's to blame, not the encoding used.

1 comment

ninkendo 3208 days ago

Agreed. Assuming O(1) lookup of anything inside a string only leads to nasty encoding bugs. UTF-8 everywhere, no exceptions.

You can never assume that a user-visible character occupies a fixed number of bytes, even if you're using UTF-32. Composed characters (a base letter plus combining marks) throw that assumption out the window, as do dozens of other Unicode quirks I can't recall now.
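A small Python sketch of the composed-character case. The grouping function `approx_clusters` is a hypothetical helper and only a rough approximation: it attaches combining marks to the preceding code point, while real grapheme segmentation follows the rules in UAX #29 and handles many more cases (emoji sequences, Hangul, and so on).

```python
import unicodedata


def approx_clusters(s: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    Rough approximation only; not a full UAX #29 implementation.
    """
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # combining mark: attach to the previous cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, displayed as one "é"

utf32_bytes = len(decomposed.encode("utf-32-le"))  # 4 bytes per code point, no BOM
code_point_count = len(decomposed)
cluster_count = len(approx_clusters(decomposed))

print(utf32_bytes, code_point_count, cluster_count)  # 8 2 1
```

So even with fixed-width UTF-32, the one character the user sees takes two code points and eight bytes here. Normalizing to NFC happens to collapse this particular pair into the single code point U+00E9, but plenty of combining sequences have no precomposed form, so normalization does not rescue the assumption in general.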
