by aktiur
777 days ago
> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. A `str` is a sequence of Unicode code points (not UTF-8, which is one specific encoding of Unicode code points), with no reference to any encoding. A `bytes` object is an arbitrary sequence of octets.

If you have a `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is in order to interpret it correctly (by decoding it to `str`). And if you have a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings produce different `bytes`.

As an example:

    >>> "évènement".encode("utf-8")
    b'\xc3\xa9v\xc3\xa8nement'
    >>> "évènement".encode("latin-1")
    b'\xe9v\xe8nement'
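The decoding direction can be sketched the same way: the same bytes give different text, or no text at all, depending on which encoding you assume. This is a minimal illustration, assuming CPython 3; the variable names are made up for the example.

```python
# Bytes that we happen to know are UTF-8 encoded text.
data = "évènement".encode("utf-8")

# Decoding with the right encoding recovers the original str.
right = data.decode("utf-8")

# Decoding with the wrong encoding "succeeds" but yields mojibake,
# because latin-1 maps every byte to exactly one code point.
wrong = data.decode("latin-1")

print(right)
print(wrong)

# The reverse mistake fails loudly: latin-1 bytes are not valid UTF-8 here.
strict_failed = False
try:
    "évènement".encode("latin-1").decode("utf-8")
except UnicodeDecodeError as exc:
    strict_failed = True
    print("not valid UTF-8:", exc.reason)
```

Nothing in the bytes themselves records which reading is correct, so the encoding has to travel with the data or be agreed on out of band.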
It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:
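For instance, a lone surrogate such as U+D800 is a perfectly legal element of a `str` but is not a Unicode scalar value, so the strict UTF codecs reject it. A minimal sketch, assuming CPython 3 and its default `strict` error handler; the helper name `can_encode` is hypothetical, not part of any library:

```python
# A str holding a single unpaired surrogate code point.
lone = "\ud800"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` encodes with the default strict error handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# All three UTF codecs refuse the lone surrogate by default.
results = {enc: can_encode(lone, enc) for enc in ("utf-8", "utf-16", "utf-32")}
print(results)

# The opt-in "surrogatepass" error handler forces it through anyway,
# producing bytes that are not well-formed UTF-8.
raw = lone.encode("utf-8", "surrogatepass")
print(raw)
```

The `surrogatepass` handler round-trips the value within Python, but a strict UTF-8 decoder elsewhere is entitled to reject the resulting bytes.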
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.