by chrismorgan 786 days ago

> `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
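To make the distinction concrete, here is a minimal sketch of the scalar-value check, plus the one escape hatch I know of in CPython: the `surrogatepass` error handler. The helper names `is_scalar_value` and `is_encodable` are made up for illustration; they are not part of the standard library.

```python
def is_scalar_value(cp: int) -> bool:
    """True for code points outside the surrogate range U+D800-U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def is_encodable(s: str) -> bool:
    """True if every code point in s is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in s)

lone = '\udead'
print(is_encodable('abc'), is_encodable(lone))  # True False

# 'surrogatepass' writes the surrogate out as a three-byte sequence.
# The result is not well-formed UTF-8.
raw = lone.encode('utf-8', 'surrogatepass')
print(raw)  # b'\xed\xba\xad'

# The same handler is needed to read those bytes back.
round_tripped = raw.decode('utf-8', 'surrogatepass')

# A strict decoder rejects them.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The bytes only round-trip through a decoder that also opts in to `surrogatepass`, so this is a workaround rather than a standard encoding of the string.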

Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.

account42 784 days ago

Maybe we need another PEP that switches the default to WTF-8 [0], a.k.a. UTF-8 where we ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.

[0] https://simonsapin.github.io/wtf-8/
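For illustration, a rough sketch of what a WTF-8-style encoder for a Python `str` could look like. As far as I understand the spec, WTF-8 does not quite treat every surrogate like any other code point: a lead surrogate followed by a trail surrogate has to be written as the four-byte form of the supplementary code point, and only unpaired surrogates get the three-byte form. `wtf8_encode` is a hypothetical helper, not a standard-library function, and joining adjacent surrogate code points is my assumption about how a `str` should map onto WTF-8.

```python
import re

# A lead surrogate immediately followed by a trail surrogate.
_SURROGATE_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')

def _join_pair(match: re.Match) -> str:
    """Combine a surrogate pair into the supplementary code point it denotes."""
    lead, trail = match.group()
    return chr(0x10000 + ((ord(lead) - 0xD800) << 10) + (ord(trail) - 0xDC00))

def wtf8_encode(s: str) -> bytes:
    """Hypothetical WTF-8-style encoder (a sketch, not a vetted implementation)."""
    joined = _SURROGATE_PAIR.sub(_join_pair, s)
    # Any surrogate still present is unpaired; 'surrogatepass' emits it
    # as a three-byte sequence.
    return joined.encode('utf-8', 'surrogatepass')

print(wtf8_encode('\udead'))        # b'\xed\xba\xad'
print(wtf8_encode('\ud83d\ude41'))  # b'\xf0\x9f\x99\x81'
```

Note that joining the pair is lossy for Python strings: decoding the second result gives back the single code point U+1F641, not the two surrogates that went in.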

chrismorgan 784 days ago

My comment was completely unrelated to PEP 686. WTF-8 is emphatically not intended to be used as a file encoding.
