| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by StewardMcOy 1606 days ago

It's been quite a few years since we went through the conversion, and at this point, working in Python 3 is natural to me, so I may not be able to recall all the footguns. I can say a lot of the difficulty was due to libraries, both third-party and standard, and that hasn't improved very much. I don't want to single anyone out here, because it's pervasive. In Pytuon 2, str was the bag of bytes type. I think a lot of libraries didn't want to change to accepting bytes types instead, because it broke API compatibility, but it caused a lot of issues.

I should also say that we were working with files in tons of encodings, not just UTF-8. We had UTF-16 and UTF-32, both little and big endian, with and without BOMs, but we also had S-JIS and a bunch of legacy 8-bit encodings. Often we wouldn't know what encoding a file was in, so we'd have to use the chardet library, along with some home-grown heuristics to guess.

Off the top of my head, the two biggest footguns are:

- There should be no way to read or write the contents of a file into a str without specifying an encoding. locale.getpreferredencoding() is a mistake. File operations should be on bytes only, or require an explicit encoding.

- .encode() and .decode() are very poorly named for what they do, and it wasn't that uncommon that someone would get them backwards. Sometimes, exceptions aren't even thrown for getting them wrong, you just get incorrect data.

Both of which were still issues with Python 2. There's a valid architectural argument to be had between the Python 2 way, where str was a bag of bytes, and the unicode type was for decoded bytes, and the Python 3 way, where the bytes type is your bag of bytes and str holds your decoded string. I favor Python 3's way of doing it, but it's almost six of one, half a dozen of the other. The advantages of one over the other are slight, and given how many library functions relied on the old behavior, it was probably a mistake to change it like that, rather than continuing the Python 2 way, and fixing issues like those above that caused problems.