| HN Mirror


by kllrnohj 2254 days ago

> It is a pain in the ass to have a variable number of bytes per char.

This is from API & language mistakes more than an issue with UTF-8 itself.

If you actually design your API & system around being UTF-8, like Rust did, then there's really no issue for the programmer. The API enforces the rules, and still gives you things like a simple character iterator (with characters being 32-bit, so that any code point actually fits: https://doc.rust-lang.org/std/char/index.html). The String type handles all the multi-byte stuff for you, you never "see" it: https://doc.rust-lang.org/std/string/struct.String.html
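To make that concrete, here is a minimal sketch using only the standard library. The helper name `byte_and_char_counts` and the sample string are made up for illustration; `String::chars`, `str::len` and `char::len_utf8` are the real std APIs.

```rust
// Hypothetical helper (not from std): returns (UTF-8 byte length, code point count).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    // len() counts bytes; chars() decodes the UTF-8 and yields one `char` per code point.
    (s.len(), s.chars().count())
}

fn main() {
    // "héllo €" written with escapes so the encoding of the sample is unambiguous.
    let s = String::from("h\u{e9}llo \u{20ac}");

    // The iterator hands out fixed-size `char` values; the variable-width
    // encoding stays hidden inside the String.
    for c in s.chars() {
        println!("{} U+{:04X} ({} byte(s) in UTF-8)", c, c as u32, c.len_utf8());
    }

    let (bytes, chars) = byte_and_char_counts(&s);
    println!("{} bytes, {} chars, size_of::<char>() = {}", bytes, chars, std::mem::size_of::<char>());
}
```

The byte count and the char count differ for this string, but the loop body never has to care about that.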

Retrofitting this into existing languages isn't going to be easy, but that's not an excuse to not do it at all, either.

1 comment

account42 2253 days ago

Character (code point) iterators are useless.

For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character is not a subsequence of the encoding of any other character or sequence of other characters. This means splitting on byte sequences of UTF-8 works just as well as splitting on code points.
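A rough sketch of that property, assuming a naive hand-written byte search (`split_bytes` is a hypothetical helper, not a std function), compared against the code-point-aware `str::split`:

```rust
// Hypothetical helper: split on a delimiter by comparing raw bytes only,
// without ever decoding code points.
fn split_bytes<'a>(haystack: &'a [u8], delim: &[u8]) -> Vec<&'a [u8]> {
    let mut parts = Vec::new();
    let mut start = 0;
    let mut i = 0;
    // An empty delimiter never matches; the whole input comes back as one piece.
    while !delim.is_empty() && i + delim.len() <= haystack.len() {
        if &haystack[i..i + delim.len()] == delim {
            parts.push(&haystack[start..i]);
            i += delim.len();
            start = i;
        } else {
            i += 1;
        }
    }
    parts.push(&haystack[start..]);
    parts
}

fn main() {
    // Multi-byte delimiter surrounded by multi-byte text.
    let text = "naïve→café→日本語";
    let by_bytes = split_bytes(text.as_bytes(), "→".as_bytes());
    let by_chars: Vec<&str> = text.split('→').collect();

    for (b, c) in by_bytes.iter().zip(&by_chars) {
        // Every byte-level piece is still valid UTF-8 and matches the char-level piece.
        println!("{:?} vs {:?}", std::str::from_utf8(b), c);
    }
}
```

The byte search never lands in the middle of a character, because the delimiter's bytes cannot appear inside the encoding of anything else.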

And for text editing you need to deal with grapheme clusters anyway, which can be made up of a variable number of code points - so having these be made up of a variable number of bytes doesn't make anything worse.
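A small std-only sketch of the mismatch (the helper `code_points` and the two sample strings are just illustrations). Rust's standard library does not segment grapheme clusters itself; that is usually left to a crate such as unicode-segmentation, which is not used here.

```rust
// Hypothetical helper: number of code points in a string.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character to the user.
    let e_acute = "e\u{301}";
    // Three emoji joined by U+200D ZERO WIDTH JOINER: typically rendered as one glyph.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

    for s in [e_acute, family] {
        // Both the code point count and the byte count vary per user-perceived character.
        println!("{}: {} code points, {} bytes", s, code_points(s), s.len());
    }
}
```

An editor moving the cursor by one "character" has to step over all of those code points at once, whatever the byte encoding underneath.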