Python took the obvious approach - they already had UTF-16 and UTF-32 builds, so this was just making that mechanism dynamic.

Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

Here's an alternative: use UTF-8 as the internal representation, but don't expose it to the user. If you're iterating over a string one rune or one grapheme at a time, the UTF-8 substructure stays hidden. Only when the user supplies an explicit numeric subscript do you need to know a rune's position in the string. When a request by subscript comes in, scan the string and build an index of rune subscript -> byte position. This is expensive, but no worse than UTF-32 in space usage, and no worse than expanding to UTF-32 in time.

Optimizations:

- Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)
- Lookup functions such as "index" should return an opaque type which represents the position in that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.
- Regular expression processing has to be UTF-8 aware. It shouldn't need an index by rune.

This would maintain Python's existing semantics while reducing memory consumption. A performance measurement tool that finds all the places where an index by rune has to be built would be useful. It's rare that you really need the index, but sometimes you do.
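To make the proposal concrete, here is a rough sketch of the lazy index idea in Python itself. Everything in it is hypothetical and only for illustration: the names `Utf8Str` and `Pos`, the `SMALL_N` cutoff, and the `index_builds` counter standing in for the measurement tool. It is not how CPython stores strings, and it only handles non-negative offsets when adding to a position.

```python
class Pos:
    """Opaque position in a Utf8Str; wraps a byte offset (hypothetical type)."""

    def __init__(self, owner, offset):
        self.owner, self.offset = owner, offset

    def __add__(self, k):
        # Step forward k runes by scanning; no rune index is needed.
        off = self.offset
        for _ in range(k):
            if off >= len(self.owner._buf):
                raise IndexError("position past end of string")
            off = self.owner._next(off)
        return Pos(self.owner, off)

    def __int__(self):
        # Coercing to an integer is what forces the index to be built.
        return self.owner._build_index().index(self.offset)


class Utf8Str:
    """String held as UTF-8 bytes; the rune index is built only on demand."""

    SMALL_N = 8  # assumed cutoff for "small N"; would need tuning

    def __init__(self, text):
        self._buf = text.encode("utf-8")
        self._index = None     # rune subscript -> byte offset, built lazily
        self.index_builds = 0  # crude counter for "where did we need the index?"

    @staticmethod
    def _is_continuation(byte):
        # UTF-8 continuation bytes look like 10xxxxxx.
        return (byte & 0xC0) == 0x80

    def _next(self, pos):
        # Byte offset of the rune after the one starting at pos.
        pos += 1
        while pos < len(self._buf) and self._is_continuation(self._buf[pos]):
            pos += 1
        return pos

    def _prev(self, pos):
        # Byte offset of the rune before pos (backing up by rune).
        pos -= 1
        while pos > 0 and self._is_continuation(self._buf[pos]):
            pos -= 1
        return pos

    def _rune_at(self, pos):
        return self._buf[pos:self._next(pos)].decode("utf-8")

    def __iter__(self):
        # Iteration never touches the index.
        pos = 0
        while pos < len(self._buf):
            yield self._rune_at(pos)
            pos = self._next(pos)

    def _build_index(self):
        if self._index is None:
            self.index_builds += 1
            self._index = [i for i, b in enumerate(self._buf)
                           if not self._is_continuation(b)]
        return self._index

    def find(self, needle):
        # A byte-level search is safe because UTF-8 is self-synchronizing.
        off = self._buf.find(needle.encode("utf-8"))
        return None if off < 0 else Pos(self, off)

    def __getitem__(self, i):
        n = len(self._buf)
        if isinstance(i, Pos):
            if i.offset >= n:
                raise IndexError("position past end of string")
            return self._rune_at(i.offset)
        if 0 <= i < self.SMALL_N:
            pos = 0
            for _ in range(i):  # walk forwards from the start
                if pos >= n:
                    break
                pos = self._next(pos)
            if pos >= n:
                raise IndexError(i)
            return self._rune_at(pos)
        if -self.SMALL_N <= i < 0:
            pos = n
            for _ in range(-i):  # walk backwards from the end
                if pos <= 0:
                    raise IndexError(i)
                pos = self._prev(pos)
            return self._rune_at(pos)
        return self._rune_at(self._build_index()[i])  # the slow path
```

With something like this, iteration, small subscripts from either end, and subscripting by a `Pos` returned from `find` all avoid the index; only a large numeric subscript or `int(pos)` pays for it, and `index_builds` tells you how often that happened.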
In Go, yes. In Rust, no. UTF-8 in Go is garbage in, garbage out. Rust, however, won't let you materialize an invalid &str without "unsafe".