> But Python's strings are not UTF-32

The article says "Python 3 strings have (guaranteed-valid) UTF-32 semantics" and later argues that the distinction between the semantics and the actual storage is a data point against UTF-32.

> they are sequences of Unicode code points, not code units in some encoding

They are sequences of _scalar values_ (all scalar values are code points, but surrogate code points are not scalar values). Exposing the scalar value length and exposing indexability by scalar value index is the same as "(guaranteed-valid) UTF-32 semantics". Note that e.g. Rust strings are conceptually sequences of scalar values, and you can iterate over them as such, but they don't provide indexing by scalar value or expose the scalar value length without iteration. JavaScript strings, on the other hand, are conceptually sequences of code points.

> I don't remember how they're stored internally

The article says how they are stored...
Sorry. I'm shocked that I got this wrong when testing for the article. Python 3 indeed has code point semantics, not scalar value semantics. I've edited corrections into the article and added a note saying so.
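For anyone who wants to check this themselves, here is a minimal sketch (the variable names are just for illustration, and this is not necessarily the test used for the article). It shows that `len` and indexing work per code point, and that a `str` can hold a lone surrogate, which is a code point but not a scalar value and so cannot be encoded as well-formed UTF-8.

```python
# Illustrative names only; nothing here comes from the article itself.
astral = "a\U0001F600"   # 'a' followed by U+1F600, which lies outside the BMP
lone = "\ud800"          # a lone surrogate: a code point, but not a scalar value

# Length and indexing are per code point, not per UTF-8 or UTF-16 code unit.
print(len(astral))
print(astral[1] == "\U0001F600")

# The lone surrogate is accepted as a one-element string.
print(len(lone))

# Strict encoding rejects it, because surrogates are not scalar values.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError as exc:
    encodable = False
    print("cannot encode:", exc.reason)

# The "surrogatepass" error handler emits the raw three-byte form anyway.
raw = lone.encode("utf-8", "surrogatepass")
print(raw)
```

If strings had scalar value semantics, constructing `lone` would already have to fail; here it only fails at strict encoding time.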
Python 3 is even more messed up than I thought!