

	by CJefferson 2348 days ago
	The problem is that many strings hold things like commit messages or filenames, neither of which has to be valid Unicode. I've had the same problem with a few Python 2 -> 3 conversions: everything is fine until you have to operate on text or filenames that aren't valid UTF-8.
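To make the failure mode concrete, here is a minimal Python 3 sketch. The filename bytes and the helper names (`strict_decode_fails`, `roundtrip`) are made up for illustration. It shows a strict UTF-8 decode rejecting the name, and the `surrogateescape` error handler (PEP 383) carrying the raw bytes through a `str` and back unchanged.

```python
# Hypothetical filename: the Latin-1 bytes for "café.txt".
# The lone 0xE9 byte is not a valid UTF-8 sequence.
raw_name = b"caf\xe9.txt"


def strict_decode_fails(name: bytes) -> bool:
    """Return True if the bytes cannot be decoded as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def roundtrip(name: bytes) -> bytes:
    """Decode to str and encode back without losing the original bytes."""
    # surrogateescape maps each undecodable byte to a lone surrogate
    # code point, and maps it back to the same byte when encoding.
    text = name.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


print(strict_decode_fails(raw_name))
print(roundtrip(raw_name) == raw_name)
```

The resulting `str` contains lone surrogates, so it survives a round trip but still fails if it is printed or encoded with the default strict handler.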


alangpierce 2348 days ago

Got it. Just so I understand: maybe someone saved a filename as the Latin-1 encoding of some non-ASCII text, and Mercurial would need to support such files (but would also have no contextual information that it's Latin-1)?
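A small Python sketch of the "no contextual information" part, using a made-up filename and two single-byte encodings chosen only as examples: the same bytes decode without error under both, to different text, so nothing in the bytes themselves says which reading was intended.

```python
# Hypothetical filename bytes with one non-ASCII byte (0xE9).
raw_name = b"caf\xe9.txt"

# Two single-byte encodings picked for illustration; both accept
# the byte 0xE9, so neither decode raises an error.
candidates = ["latin-1", "cp1251"]

decoded = {enc: raw_name.decode(enc) for enc in candidates}

for enc, text in decoded.items():
    print(enc, repr(text))
```

A tool that has to guess would pick one of these and silently get the other case wrong, which is an argument for treating the name as opaque bytes instead.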

I'm tempted to say "nobody should have filenames like that", but I guess a project like Mercurial needs to be as compatible as possible. Are there modern use cases for filenames like that, or is it fair to say it's all legacy data?


xorcist 2348 days ago

It's going to be the case every time you mount a Windows file system, for example.

A big part of the problem is that a project like Mercurial doesn't have control over what files people use it on. They have to design for the pessimal scenario, because when the tool breaks, users complain.


CJefferson 2347 days ago

If you want to write a version control system, banning a big chunk of perfectly legal filenames on both Linux and Windows seems like a bad choice. Users do have such filenames, and saying you can't store their files "because they aren't UTF-8" will annoy them.

I've seen such filenames occur when people use the name as a binary encoding in some way. As long as you avoid (from Wikipedia) NUL, \, /, :, *, ", <, >, | you will end up with a filename which all OSes support, and some systems do that.
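As a rough sketch of that rule in Python: the check below rejects the characters listed above, plus `?`, which Windows also reserves (an addition that is not in the comment). The names `RESERVED` and `avoids_reserved_chars` are hypothetical, and the check ignores other Windows rules such as reserved device names and trailing dots, so it is an illustration and not a full portability test.

```python
# Characters from the list above, plus "?" (an assumption added here
# because Windows reserves it as well).
RESERVED = frozenset('\0\\/:*?"<>|')


def avoids_reserved_chars(name: str) -> bool:
    """Return True if the name is non-empty and has no reserved character."""
    return bool(name) and not any(ch in RESERVED for ch in name)


# A name carrying a raw 0xE9 byte, represented as a lone surrogate
# the way Python's surrogateescape handler would produce it.
binary_like = "caf\udce9.txt"

print(avoids_reserved_chars("report.txt"))
print(avoids_reserved_chars("a:b.txt"))
print(avoids_reserved_chars(binary_like))
```

The last name passes the character check even though it cannot be encoded as strict UTF-8, which is the kind of filename a version control system still has to store.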
