

	by roelschroeven 2476 days ago
	But Python's strings are not UTF-32, they are sequences of Unicode code points, not code units in some encoding. I don't remember how they're stored internally; that's an implementation detail not relevant to the programmer who uses Python. Whether the use of Unicode code points instead of some Unicode encoding is a good thing or not, that I don't know.

3 comments

hsivonen 2476 days ago

> But Python's strings are not UTF-32

The article says "Python 3 strings have (guaranteed-valid) UTF-32 semantics" and later argues that the fact that there's a distinction between the semantics and actual storage is a data point against UTF-32.

> they are sequences of Unicode code points, not code units in some encoding

They are sequences of _scalar values_ (all scalar values are code points but surrogate code points are not scalar values). Exposing the scalar value length and exposing indexability by scalar value index is the same as "(guaranteed-valid) UTF-32 semantics".

Note that e.g. Rust strings are conceptually sequences of scalar values, and you can iterate over them as such, but they don't provide indexing by scalar value or expose the scalar value length without iteration.

JavaScript strings, on the other hand, are conceptually sequences of code points.

> I don't remember how they're stored internally

The article says how they are stored...


hsivonen 2475 days ago

> They are sequences of _scalar values_ (all scalar values are code points but surrogate code points are not scalar values). Exposing the scalar value length and exposing indexability by scalar value index is the same as "(guaranteed-valid) UTF-32 semantics".

Sorry. I'm shocked that I tested this incorrectly when researching the article. Python 3 indeed has code point semantics, not scalar value semantics. I've edited corrections into the article and added a note saying so.

Python 3 is even more messed up than I thought!
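
For anyone who wants to check the code point vs. scalar value distinction themselves, here is a minimal sketch. The helper name `has_lone_surrogate` is made up for illustration; the rest is plain built-in `str` behavior. It shows that a Python 3 `str` accepts a lone surrogate as one element, while strict UTF-8 encoding rejects it.

```python
def has_lone_surrogate(s: str) -> bool:
    """Hypothetical helper: True if s contains any surrogate code point."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


lone = "\ud800"  # a surrogate code point, which is not a scalar value

# The str type stores it as a single element without complaint.
print(len(lone))
print(has_lone_surrogate(lone))

try:
    # Strict UTF-8 can only encode scalar values, so this raises.
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)
```

A type with scalar value semantics would have refused to construct `lone` in the first place.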


TheCoelacanth 2475 days ago

What can you use the number of code points in a string for? I can't think of a single use case where that is actually useful.

Length of a string in bytes is useful, though most code shouldn't need to operate at that level of abstraction.

Length of a string in glyphs is useful if you are formatting something to a fixed-width display, though that is kind of a niche use-case.

Length of a string when rendered is frequently useful, though impossible to calculate from just the string's contents.

Length of a string in code points can't be used correctly for anything.
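
To make the different notions of "length" concrete, here is a small sketch in Python. The function name `lengths` and the sample string are my own choices, not from the article. It only covers the lengths the standard library can compute directly; glyph or grapheme counts and rendered width need extra data or a renderer and are left out.

```python
def lengths(s: str) -> dict:
    """Hypothetical helper: report several 'lengths' of the same string."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "-le" avoids a byte order mark.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
    }


# "e" + COMBINING ACUTE ACCENT, followed by one supplementary-plane emoji.
# A reader would typically see two characters here.
sample = "e\u0301\U0001F600"
print(lengths(sample))
```

For this sample the three numbers all differ from each other, and none of them is the two characters a reader sees.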


naringas 2475 days ago

From the article:

> CPython since 3.3 makes the same idea three-level with UTF-32 semantics: Strings are stored as UTF-32 if at least one character has a non-zero bit in its most-significant half. Else if a string has a non-zero bits in its second-least-significant 8 bits of at least one character, the string is stored as UCS2 (i.e. UTF-16 excluding surrogate pairs). Otherwise, the string is stored as Latin1.
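
The three storage widths can be observed indirectly, at least on CPython 3.3 or later. The sketch below is an inference from `sys.getsizeof`, not an official API for inspecting the representation, and `bytes_per_char` is a name I made up. It measures how much a string grows when one more copy of the same character is appended.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Hypothetical helper: storage growth per added character.

    Assumes CPython 3.3+, where a string's buffer uses one fixed width
    for every character. Other implementations may report something else.
    """
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)


for ch in ("a", "\u00e9", "\u0100", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

On CPython this should report 1 byte for the first two characters (both fit in Latin1), 2 bytes for U+0100, and 4 bytes for the emoji.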
