by kllrnohj
2254 days ago
> It is a pain in the ass to have a variable number of bytes per char.

This comes from API and language mistakes more than from UTF-8 itself. If you actually design your API and system around UTF-8, like Rust did, then there's really no issue for the programmer. The API enforces the rules and still gives you things like a simple character iterator (with characters being 32-bit, so that every code point actually fits: https://doc.rust-lang.org/std/char/index.html). The String type handles all the multi-byte stuff for you, so you never "see" it: https://doc.rust-lang.org/std/string/struct.String.html

Retrofitting this into existing languages isn't going to be easy, but that's not an excuse to not do it at all, either.
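To make the character-iterator point concrete, here is a minimal sketch using only the Rust standard library. The helper names `char_width_bytes` and `chars_with_offsets` are made up for illustration; `str::char_indices` and `std::mem::size_of` are the real std APIs.

```rust
use std::mem::size_of;

/// Size of Rust's `char` type in bytes (a `char` holds one Unicode scalar value).
fn char_width_bytes() -> usize {
    size_of::<char>()
}

/// Collect each character together with the byte offset where its
/// UTF-8 encoding starts. The caller never decodes bytes by hand.
fn chars_with_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For example, `chars_with_offsets("héllo")` yields five entries even though the string is six bytes long, because `é` takes two bytes in UTF-8.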
For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character never occurs inside the encoding of any other character or sequence of other characters. This means splitting on byte sequences of UTF-8 works just as well as splitting on code points.
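A rough sketch of that property in Rust: a deliberately naive byte-level splitter that knows nothing about UTF-8, compared against the standard library's `str::split`. The names `split_bytes` and `byte_split_matches_str_split` are hypothetical helpers, and the comparison assumes both the text and the delimiter are valid UTF-8 and the delimiter is non-empty.

```rust
/// Split `haystack` on every occurrence of `needle`, looking only at raw
/// bytes. No UTF-8 decoding happens here.
fn split_bytes<'a>(haystack: &'a [u8], needle: &[u8]) -> Vec<&'a [u8]> {
    assert!(!needle.is_empty(), "delimiter must not be empty");
    let mut parts = Vec::new();
    let mut start = 0; // start of the current field
    let mut i = 0; // current scan position
    while i + needle.len() <= haystack.len() {
        if &haystack[i..i + needle.len()] == needle {
            parts.push(&haystack[start..i]);
            i += needle.len();
            start = i;
        } else {
            i += 1;
        }
    }
    // Whatever follows the last delimiter is the final field.
    parts.push(&haystack[start..]);
    parts
}

/// True if the byte-level split produces the same fields as `str::split`.
fn byte_split_matches_str_split(text: &str, delim: &str) -> bool {
    let by_bytes = split_bytes(text.as_bytes(), delim.as_bytes());
    let by_str: Vec<&[u8]> = text.split(delim).map(str::as_bytes).collect();
    by_bytes == by_str
}
```

Calling `byte_split_matches_str_split("naïve→café→日本語", "→")` should return `true`: the three-byte encoding of `→` can only match where a real `→` starts, never in the middle of another character.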
And for text editing you need to deal with grapheme clusters anyway, which can be made up of a variable number of code points, so having those code points be made up of a variable number of bytes doesn't make anything worse.
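A small illustration of why code points are not the unit an editor cares about, again as a std-only sketch. The helper `byte_and_char_counts` is a hypothetical name. Rust's standard library does not segment grapheme clusters; a third-party crate such as `unicode-segmentation` would be needed for that, so this only shows the byte and code point counts.

```rust
/// Return (number of UTF-8 bytes, number of code points) for a string.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

Both `"\u{e9}"` (precomposed é) and `"e\u{301}"` (e followed by a combining acute accent) display as one user-perceived character, yet the first is 2 bytes and 1 code point while the second is 3 bytes and 2 code points. A fixed-width encoding such as UTF-32 would not remove that second kind of variability.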