by anon4 4573 days ago
Percent-escaped control characters are fine. They'll just show up in a GUI file manager as a <?> symbol and in the shell as, for instance, %02. The shell will never parse percent-encoded filenames, and GUI file managers don't interpret control characters.

Non-percent-encoded control chars are strictly verboten. The VFS layer should contain a ban list of bytes (or code points) not allowed as part of a filename. It won't be a large list: every nonprintable ASCII character, every blank character (space, tab, newline, vertical tab, carriage return, etc.), and the characters /\;:?* and that's it. This list should cover everything that might be problematic on Windows OR Linux OR macOS. For full compatibility, we must also add the %uxxxx and %Uxxxxxxxx percent escapes for arbitrary Unicode code points (I suspect it would also make sense to escape all the Unicode spaces, combining characters and the like, to make file manipulation from the shell easier).
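To make the ban list concrete, here is a rough sketch of the check a VFS layer might run before accepting a name. The function names `is_banned` and `first_banned` are made up for illustration, and I'm assuming the hyphen itself is allowed and that only ASCII is checked (the Unicode spaces and combining characters are left out).

```python
# Punctuation named in the comment above; the hyphen is assumed to be allowed.
BANNED_PUNCTUATION = set('/\\;:?*')

def is_banned(ch: str) -> bool:
    """Return True if ch may not appear unescaped in a filename."""
    code = ord(ch)
    # ASCII control characters (this covers tab, newline, vertical tab, CR) and DEL.
    if code < 0x20 or code == 0x7F:
        return True
    # The plain space is the one blank that is not a control character.
    if ch == ' ':
        return True
    return ch in BANNED_PUNCTUATION

def first_banned(name: str):
    """Return (index, character) of the first banned character, or None if the name is clean."""
    for index, ch in enumerate(name):
        if is_banned(ch):
            return index, ch
    return None
```

A clean name like "report.txt" yields None, while "a b" is rejected at index 1.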

It sounds sort of sensible, but we're dealing with two layers of encoding here, leading to three byte sequences.

1. You have a string the user entered. That's just a generic name which can be anything.

2. You take that string and substitute "problematic" characters with their percent-encoded form. For example, every space becomes %20, and a non-breaking space might become %u00a0, or it might be left alone.

3. You now have a string of Unicode code points, which are all "clean". This is encoded yet again to a sequence of bytes that are stored by the filesystem.
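The three steps can be sketched roughly as follows. This is only an illustration under my own assumptions: `percent_escape` and `to_stored_bytes` are hypothetical names, the filesystem is assumed to store UTF-8, the non-breaking space is left alone, and '%' itself is escaped as %25 so the mapping stays reversible (nothing above says so explicitly).

```python
# Characters escaped in step 2: the banned punctuation plus '%' itself (an assumption).
ESCAPED_PUNCTUATION = set('/\\;:?*%')

def percent_escape(name: str) -> str:
    """Step 2: replace problematic ASCII characters with %xx escapes."""
    out = []
    for ch in name:
        code = ord(ch)
        # Control characters, space, DEL and the listed punctuation get escaped.
        if code <= 0x20 or code == 0x7F or ch in ESCAPED_PUNCTUATION:
            out.append(f'%{code:02x}')
        else:
            out.append(ch)  # everything else, including non-ASCII, passes through
    return ''.join(out)

def to_stored_bytes(name: str) -> bytes:
    """Step 3: encode the clean string to the bytes the filesystem would store."""
    return percent_escape(name).encode('utf-8')

user_string = "file with space"               # step 1: whatever the user typed
clean_string = percent_escape(user_string)    # step 2: "file%20with%20space"
stored = to_stored_bytes(user_string)         # step 3: b"file%20with%20space"
```

So the user's string, the clean string and the stored bytes are three separate things, and only the last conversion is something the system can do without the caller knowing about it.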

At least the second encoding is done by the system, either by the standard file manipulation routines or by the filesystem itself.

But it is the first one that seems infeasible. It has to be done at a layer above the standard "open" function, and I can see developers being very confused about what to escape and how.

You know, maybe the answer is not to have every other program do all this complicated dancing, but for the shell itself to escape filenames when it reads them. So when you say "cmd file%20with%20space", cmd is called with argument one set to "file with space". And when ls or find lists files, bad characters can be replaced with their percent-encoded forms. And xargs can unescape them.
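The unescaping half of that idea might look something like this. It's a sketch, not how any real shell works: `unescape_arg` is a hypothetical name, and I'm assuming the three escape forms mentioned earlier (%xx, %uxxxx, %Uxxxxxxxx) with hex digits in either case.

```python
import re

# Longest forms first; each alternative captures only its hex digits.
_ESCAPE = re.compile(r'%U([0-9A-Fa-f]{8})|%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})')

def unescape_arg(arg: str) -> str:
    """Turn percent escapes in a command-line word back into the raw characters."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        # chr() raises ValueError for values above U+10FFFF; a real shell would
        # have to decide how to report that.
        return chr(int(hex_digits, 16))
    # A single pass, so "%2520" becomes "%20" and is not decoded a second time.
    return _ESCAPE.sub(replace, arg)
```

With this, the shell would hand `unescape_arg("file%20with%20space")`, i.e. "file with space", to cmd as its first argument, and anything that isn't a well-formed escape (say "%zz") is passed through untouched.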

I'll need to think about it some more.