Hacker News new | ask | show | jobs
by alangpierce 2348 days ago
It's interesting that they wanted to add b' prefixes to all strings, and I wonder if they would have had a better experience by embracing regular strings instead. At least in Python 3, if your string only contains ASCII, then the underlying representation will use one byte per character, so ASCII-only strings are stored just as efficiently as ASCII-only bytes instances.

I think there are two mental models for how to approach the str/bytes split:

1.) A `str` is for unicode use cases, and a `bytes` is better for cases that don't support unicode.

2.) A `bytes` is an array of numbers between 0 and 255. A `str` should almost always be used when your value is conceptually a sequence of characters. `str` doesn't imply that arbitrary unicode is allowed, and it's fine to have a convention that a particular `str` is ASCII-only, just like other conventions you might have on variable values.

My impression is that #1 is the Python 2 mental model and is tempting for Python 3, but that #2 often works better when writing Python 3 code. Under mental model #2, asking for "%s" formatting is really asking for a replacement strategy that detects the number 37 followed by the number 155 in an array of numbers and fills in a sub-array, which seems more strange and likely to get false positives if you're really working with binary data like the bytes of a .jpg file.

That said, I'm sure the devil is in the details, and maybe a project like mercurial has to stay backcompat with bytes data that is neither ASCII nor valid UTF-8, or some other compelling reason to stick with bytes everywhere.

1 comments

The problem is many strings might contain things like commit messages, or filenames, neither of which has to be valid unicode.

I've had the same problem with a few Python 2 -> 3 conversions -- everything is fine until you have to operate on text or filenames which aren't valid utf8/unicode.

Got it. So I understand, maybe someone saved a filename as the latin-1 encoding of some non-ASCII text, and Mercurial would need to support such files (but also would have no contextual information that it's latin-1)?

I'm tempted to say "nobody should have filenames like that", but I guess a project like Mercurial needs to be as compatible as possible. Are there modern use cases for filenames like that, or is it fair to say it's all legacy data?

It's going to be the case every time you mount a Windows file system, for example.

A big part of the problem is that a project like Mercurial doesn't have control over what files people use it on. They have to design for the pessimal scenario, because when the tool breaks, users complain.

If you want to write a version control system, banning a big chunk of perfectly legal filenames on both linux and windows seems like a bad choice. Users do have such filenames, and saying you can't store their files "because they aren't UTF-8" will annoy them.

I've seen such filenames occur from people using the name as a binary encoding in some way. As long as you miss (from wikipedia) NUL, \, /, :, *, ", <, >, | you will end up with a filename which all OSes support, and some systems do that.