by CJefferson 2348 days ago
The problem is that many strings might hold things like commit messages or filenames, neither of which has to be valid Unicode.

I've had the same problem with a few Python 2 -> 3 conversions -- everything is fine until you have to operate on text or filenames which aren't valid UTF-8/Unicode.
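To illustrate the failure mode, here is a minimal sketch of how a non-UTF-8 filename trips up strict decoding in Python 3, and how the `surrogateescape` error handler (PEP 383) lets the bytes survive a round trip. The filename is a made-up example, not something taken from Mercurial.

```python
# Hypothetical filename: "café.txt" encoded as Latin-1, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# Strict decoding raises, which is what tends to break naive Python 3 ports.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# so the original bytes can be recovered exactly when encoding again.
name = raw.decode("utf-8", errors="surrogateescape")
roundtrip = name.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(name), roundtrip == raw)
```

On POSIX systems `os.fsdecode` and `os.fsencode` apply the same error handler, so filenames obtained from `os.listdir` can be passed back to the OS unchanged, but the resulting `str` still can't be printed or written as strict UTF-8.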


Got it. So, if I understand correctly, someone might have saved a filename as the Latin-1 encoding of some non-ASCII text, and Mercurial would need to support such files (while having no contextual information that it's Latin-1)?
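A rough sketch of why the missing context matters: the same bytes decode to different text under different legacy encodings, and nothing in the bytes says which one was intended. The byte string and the list of candidate encodings are assumptions picked for illustration.

```python
# Hypothetical filename bytes with one non-ASCII byte (0xE9).
raw = b"caf\xe9"

# Try a few encodings; None marks "not decodable under this encoding".
candidates = {}
for enc in ("utf-8", "latin-1", "cp1251"):
    try:
        candidates[enc] = raw.decode(enc)
    except UnicodeDecodeError:
        candidates[enc] = None

for enc, text in candidates.items():
    print(enc, repr(text))
```

UTF-8 rejects the bytes outright, while Latin-1 and Windows-1251 both accept them and produce different names, so a tool can only guess unless it treats the name as opaque bytes.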

I'm tempted to say "nobody should have filenames like that", but I guess a project like Mercurial needs to be as compatible as possible. Are there modern use cases for filenames like that, or is it fair to say it's all legacy data?

It's going to be the case every time you mount a Windows file system, for example.

A big part of the problem is that a project like Mercurial doesn't have control over what files people use it on. They have to design for the pessimal scenario, because when the tool breaks, users complain.

If you want to write a version control system, banning a big chunk of perfectly legal filenames on both Linux and Windows seems like a bad choice. Users do have such filenames, and saying you can't store their files "because they aren't UTF-8" will annoy them.

I've seen such filenames come from people using the name as a binary encoding of some kind. As long as you avoid (per Wikipedia) NUL, \, /, :, *, ", <, >, | you will end up with a filename which all OSes support, and some systems do exactly that.
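As a sketch of that last point, the check below uses only the characters listed above and shows a byte string that passes it while still not being valid UTF-8. The helper names and the sample bytes are hypothetical, and real operating systems impose further rules that this does not cover.

```python
# Bytes to avoid, taken from the list in the comment above (not exhaustive).
RESERVED = set(b'\x00\\/:*"<>|')

def avoids_reserved(name: bytes) -> bool:
    """True if none of the listed reserved bytes appear in the name."""
    return not any(b in RESERVED for b in name)

def is_valid_utf8(name: bytes) -> bool:
    """True if the name decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical "binary data packed into a filename".
packed = bytes([0xFF, 0xFE, 0x80, 0x41])

print(avoids_reserved(packed), is_valid_utf8(packed))
```

A tool that insists on UTF-8 would reject `packed` even though it contains none of the listed characters.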