| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by account42 2297 days ago

Character (code point) iterators are useless.

For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character is not a subsequence of the encoding for any other chracter or sequence of other characters. This means splitting on byte sequences of UTF-8 works just as well as spliting on code points.

And for text editing you need to deal with grapheme clusers anyway, which can be made up of a variable number of code points - so having these be made up of a variable number of bytes doesn't make anything worse.