by roelschroeven
2476 days ago
But Python's strings are not UTF-32, they are sequences of Unicode code points, not code units in some encoding. I don't remember how they're stored internally; that's an implementation detail not relevant to the programmer who uses Python. Whether the use of Unicode code points instead of some Unicode encoding is a good thing or not, that I don't know.
The article says "Python 3 strings have (guaranteed-valid) UTF-32 semantics" and later argues that the distinction between the semantics and the actual storage is itself a data point against UTF-32.
> they are sequences of Unicode code points, not code units in some encoding
They are sequences of _scalar values_ (all scalar values are code points but surrogate code points are not scalar values). Exposing the scalar value length and exposing indexability by scalar value index is the same as "(guaranteed-valid) UTF-32 semantics".
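To illustrate what "UTF-32 semantics" means here, a minimal Python 3 sketch (the sample string and variable names are my own, not from the article): `len()` and indexing count the same units that a UTF-32 encoding of the string would have, regardless of how the interpreter stores the string internally.

```python
# Illustrative string: one ASCII letter followed by one non-BMP character (U+1F600).
s = "a\U0001F600"

# Length and indexing as exposed to the programmer.
exposed_len = len(s)
second_item = s[1]

# Unit counts the same text would have in each encoding form.
utf32_units = len(s.encode("utf-32-le")) // 4  # 4 bytes per UTF-32 code unit
utf16_units = len(s.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit
utf8_units = len(s.encode("utf-8"))            # 1 byte per UTF-8 code unit

print(exposed_len, utf32_units, utf16_units, utf8_units)
```

For this string the exposed length matches the UTF-32 unit count, and differs from the UTF-16 and UTF-8 counts.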
Note that e.g. Rust strings are conceptually sequences of scalar values, and you can iterate over them as such, but they don't provide indexing by scalar value or expose the scalar value length without iteration.
JavaScript strings, on the other hand, are conceptually sequences of code points.
> I don't remember how they're stored internally
The article says how they are stored...