| HN Mirror



	by mid-kid 307 days ago
	Yeah I have no idea what is wrong with that. Python simply operates on arrays of codepoints, which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding. This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.

2 comments

deathanatos 307 days ago

> which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding.

Which, to humor the parent, is also true of raw bytes strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.
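To make the encodability point concrete, here is a minimal sketch, assuming CPython 3 and using a lone surrogate as the "not valid Unicode" value. The helper name `try_utf8` is made up for illustration and does not come from the gist or the thread.

```python
from typing import Optional

# A lone surrogate: Python's str accepts this code point,
# but it has no valid UTF-8 encoding.
lone_surrogate = "abc\ud800"


def try_utf8(s: str) -> Optional[bytes]:
    """Hypothetical helper: return the UTF-8 bytes, or None if s cannot be encoded."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


print(try_utf8("abc"))           # ordinary text encodes fine
print(try_utf8(lone_surrogate))  # the str is valid, the encoding step fails
```

So a function annotated with `s: str` cannot assume `s.encode("utf-8")` will succeed.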

> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.

If I write,

  def foo(s: str) -> …:

… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take

  def foo(s: UnicodeWithBullshit) -> …:


mid-kid 306 days ago

> If I write [str] I want the input string to be Unicode.

No, nothing about the "string" type in python implies unicode. It's, for all intents and purposes, its own encoding, and should be treated as such. Not all encodings it can convert to are representable as unicode, and vice versa, so it makes no sense to think of it as unicode.


slavik81 306 days ago

The Python language developers themselves thought that their code only needed to operate on str and later realized that it needed to handle arbitrary bytes.

It's a common mistake. A lot of code was written using str despite users needing it to operate on UnicodeWithBullshit. PEP 383 was a necessary escape hatch to fix countless broken programs.
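For anyone who hasn't seen the PEP 383 escape hatch in action, a rough sketch follows, assuming a filename stored as Latin-1 bytes (the filename itself is made up). The `surrogateescape` error handler smuggles each undecodable byte into the `str` as a lone surrogate, so the original bytes can be recovered later.

```python
# Bytes that are not valid UTF-8 (0xE9 is "é" in Latin-1).
raw = b"caf\xe9.txt"

# PEP 383: undecodable bytes become lone surrogates U+DC80..U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# The same handler reverses the mapping, so the round trip is lossless.
roundtrip = name.encode("utf-8", errors="surrogateescape")

# A strict encode rejects the smuggled surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(roundtrip == raw, strict_failed)
```

The resulting `str` is exactly the "UnicodeWithBullshit" case: a perfectly legal `str` that is not encodable as strict UTF-8.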


acuozzo 307 days ago

> Python simply operates on arrays of codepoints

But most programmers think in arrays of grapheme clusters, whether they know it or not.
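A small sketch of the mismatch, assuming the text "é" arrives in decomposed form (base letter plus combining accent). It only uses the standard `unicodedata` module and does not attempt real grapheme segmentation.

```python
import unicodedata

# One user-perceived character, two code points: "e" + COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
print(len(decomposed))

# NFC normalization happens to merge this pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), ascii(composed))

# Code-point-wise operations can split a cluster: reversing puts the accent first.
reversed_decomposed = decomposed[::-1]
print(ascii(reversed_decomposed))
```

Normalization only helps where a precomposed code point exists; many clusters (flag emoji, for example) stay multi-code-point whatever form you pick.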
