by millstone
3206 days ago
In practice, UTF-8 has done more to enable the wrong thing than to force programmers to do the right thing.

> You can't really index Unicode characters like ASCII strings

But then why do strings-are-UTF8 languages like Go or D make it so easy? Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that? Why make it trivial to split code points or composed character sequences, while something like proper string truncation is brutally hard?

UTF-8's fail-fast property has not enabled more Unicode-savviness. Instead, it just lets programmers pretend that we are still in the land of C strings.

I like Rust's approach. It's a strings-are-UTF8 language but strings (both str and String):
- are not directly indexable
- force you to be explicit when iterating: you iterate over either `s.chars()` or `s.bytes()`
- are called out in the docs as being a sequence of bytes (unsigned 8-bit integers) internally
- support a `len()` method that is called out as returning the length in bytes, not in characters
- can be sliced by byte range if you reaaaally need to get around the inability to index directly, but attempting to slice in the middle of a character causes a panic
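
To make the list above concrete, here is a minimal sketch using only standard-library calls (`len`, `chars`, `bytes`, `get`, `is_char_boundary`). The helper `truncate_at_char_boundary` is a hypothetical name of my own, not something from the standard library, and it only respects code point boundaries; it will still happily split a composed character sequence.

```rust
/// Hypothetical helper (not part of std): return the longest prefix of `s`
/// that is at most `max_bytes` long and does not split a code point.
fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    // Walk backwards until we land on a code point boundary.
    // Index 0 is always a boundary, so this loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // U+00E9 (e with acute accent) takes two bytes in UTF-8.
    let s = "h\u{e9}llo";

    // len() counts bytes; counting characters is a separate, explicit step.
    println!("bytes: {}, chars: {}", s.len(), s.chars().count());

    // let c = s[0]; // does not compile: `str` cannot be indexed by an integer

    // Iteration forces you to name the unit you want.
    println!("first char: {:?}", s.chars().next());
    println!("first byte: {:?}", s.bytes().next());

    // get() is the checked form of slicing: it returns None instead of
    // panicking when a bound falls inside a character.
    println!("{:?}", s.get(0..2)); // byte 2 is in the middle of U+00E9
    println!("{:?}", s.get(0..3));

    // `&s[0..2]` would compile but panic at runtime for the same reason.

    println!("{}", truncate_at_char_boundary(s, 2));
}
```

Truncating at composed character sequences (grapheme clusters) rather than code points needs Unicode segmentation data that the standard library does not ship, which is roughly the "brutally hard" part mentioned upthread.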