|
It's interesting that they wanted to add b' prefixes to all strings, and I wonder if they would have had a better experience by embracing regular strings instead. At least in Python 3, if your string only contains ASCII, then the underlying representation will use one byte per character, so ASCII-only strings are stored just as efficiently as ASCII-only bytes instances. I think there are two mental models for how to approach the str/bytes split: 1.) A `str` is for unicode use cases, and a `bytes` is better for cases that don't support unicode. 2.) A `bytes` is an array of numbers between 0 and 255. A `str` should almost always be used when your value is conceptually a sequence of characters. `str` doesn't imply that arbitrary unicode is allowed, and it's fine to have a convention that a particular `str` is ASCII-only, just like other conventions you might have on variable values. My impression is that #1 is the Python 2 mental model and is tempting for Python 3, but that #2 often works better when writing Python 3 code. Under mental model #2, asking for "%s" formatting is really asking for a replacement strategy that detects the number 37 followed by the number 155 in an array of numbers and fills in a sub-array, which seems more strange and likely to get false positives if you're really working with binary data like the bytes of a .jpg file. That said, I'm sure the devil is in the details, and maybe a project like mercurial has to stay backcompat with bytes data that is neither ASCII nor valid UTF-8, or some other compelling reason to stick with bytes everywhere. |
I've had the same problem with a few Python 2 -> 3 conversions -- everything is fine until you have to operate on text or filenames which aren't valid utf8/unicode.