by chrismorgan
786 days ago
> `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

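If you need to guard against such strings, here is a rough sketch of how one might detect them and work around them. The helper names (has_surrogates, is_encodable, merge_surrogate_pairs) are made up for illustration; the only library features assumed are str.encode/bytes.decode and the standard surrogatepass error handler. Note that the bytes surrogatepass produces are not valid UTF-8, so this is an escape hatch, not a fix.

```python
def has_surrogates(s: str) -> bool:
    """Return True if s contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def is_encodable(s: str) -> bool:
    """Return True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def merge_surrogate_pairs(s: str) -> str:
    """Combine well-formed surrogate pairs into the scalar values they stand for.

    Round-trips through UTF-16: surrogatepass lets the surrogates out as
    16-bit units, and the strict decoder reads each pair as one character.
    Raises UnicodeDecodeError if s contains a lone (unpaired) surrogate.
    """
    return s.encode("utf-16", "surrogatepass").decode("utf-16")


lone = "\udead"          # a lone surrogate: not a Unicode scalar value
pair = "\ud83d\ude41"    # two surrogate code points, not one character
face = "\U0001F641"      # the scalar value that pair denotes in UTF-16

# surrogatepass forces the encoding through, yielding non-standard bytes.
raw = lone.encode("utf-8", "surrogatepass")
```

Here has_surrogates(lone) and has_surrogates(pair) are both true, while face encodes without complaint, and merge_surrogate_pairs(pair) gives back face.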
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.

[0] https://simonsapin.github.io/wtf-8/