|
|
|
|
|
by happytoexplain
1600 days ago
|
|
Swift, for example, does what you're saying. I thought that the reason many languages don't do it that way is that part of the definition of an array (or at least expected-by-convention) is constant-time operations. If you treat a string as an array, then having to deal with variable-length units breaks that rule. That's why, when there is an API for dealing with grapheme clusters, it is usually a special case that duplicates an array-like API, instead of literally using an array. I actually don't know how/why Python is apparently using code points, since they are variable length. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds. Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points? |
|
My impression is most modern languages that bother with unicode (swift, rust, nim) are using utf-8, and doing linear time operations to handle unicode. I think that's the right approach, as I don't recall ever needing random access on a unicode string.